Getting data into R

When you first open R, you’re greeted with a screen similar to the following:

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

I’ve been trying to encourage my students to use R for some of their work, but in the process, I sort of forgot that for most people, starting up a program and just being greeted with a command prompt might be somewhat intimidating. So after several of my students indicated that they had downloaded and installed R but had no idea what to do next, I thought I would write about some of the very basic ways to get started. I recognize that for some huge datasets, the suggestions here are not the best, but for me, and for most of my students, the datasets that we would be working with are actually quite small.

Part one: Entering your data directly in R

For really small sets of data or for quick calculations, you might just go ahead and enter your data directly in R. The easiest way to do this is to start by entering each set of data as its own object. In the following example, we’re going to create a data frame (a table in R) with the names of some of my students and the scores they’ve received on three assignments in a fictional course titled “Data Analysis with R for NGO Workers”.

> Student = c("Gajanan", "Hari", "Priya", "Raj", "Shreekanth", "Shreerang",
+ "Soni", "Vinay")
> Assignment.1 = c(93, 98, 90, 70, 80, 82, 75, 77)
> Assignment.2 = c(90, 87, 83, 88, 78, 87, 79, 84)
> Assignment.3 = c(97, 92, 85, 90, 77, 70, 90, 93)

Notice that by entering this information, you don’t receive any “acknowledgement” from R that anything has happened. It just shows you the prompt again. To see the values, you need to type the name of an object that you have created, for instance, “Student”, “Assignment.1”, “Assignment.2”, or “Assignment.3”. Notice also that R is case-sensitive, so typing “student” would result in an error since we had entered the name with an upper-case “S”. Finally, if you have not completed your statement (as in the first object we were creating), R adds a little “+” at the start of the next line to indicate that your statement is incomplete.

Here’s what we get when we type “Assignment.2”:

> Assignment.2
[1] 90 87 83 88 78 87 79 84

The “[1]” at the start of the second line above is the index of the first value. Consider the following:

> set.seed(123); sample(300, 30)
 [1]  87 236 122 263 279  14 156 262 162 133 278 132 196 165  30 257  70  12  93
[20] 269 250 194 179 276 181 195 150 163  79  40

In this case, the number “269” is the twentieth number in this list of numbers, which is why the second line of output starts with “[20]”. (See Simple sampling with R and Sampling with replacement in R for a basic introduction to sampling.) Knowing the index can be useful when you need the position of a certain value, since occasionally you want to select just a single value from a vector. Type the following and compare it to what you got when you typed “Assignment.2”:

> Assignment.2[3]
[1] 83

R has returned the third value from the object you created.
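Indexing is not limited to a single position. As a small sketch that reuses the Assignment.2 values from above (the extra calls are my additions, not part of the original walkthrough), you can pull out several positions at once or ask R where a value sits:

```r
# Scores for the second assignment, as entered earlier
Assignment.2 = c(90, 87, 83, 88, 78, 87, 79, 84)

Assignment.2[3]            # the third value
Assignment.2[c(1, 3)]      # the first and third values
Assignment.2[2:4]          # the second through fourth values
which(Assignment.2 == 87)  # the positions that hold the value 87
```

Typed at the prompt, each of these lines prints its result, just as “Assignment.2[3]” did.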

If you want to create a simple table of these four sets of data you’ve created, you use the “data.frame” function. The first command below creates a data frame called “R.For.NGOs” and the second one tells R to display it. The third command, “fix”, opens up R’s built-in spreadsheet-style data editor, which I generally don’t use except to quickly scan data.

> R.For.NGOs = data.frame(Student, Assignment.1, Assignment.2, Assignment.3)
> R.For.NGOs
     Student Assignment.1 Assignment.2 Assignment.3
1    Gajanan           93           90           97
2       Hari           98           87           92
3      Priya           90           83           85
4        Raj           70           88           90
5 Shreekanth           80           78           77
6  Shreerang           82           87           70
7       Soni           75           79           90
8      Vinay           77           84           93
> fix(R.For.NGOs)
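Before going further, it can help to check that the data frame has the shape you expect. Here is a minimal sketch that rebuilds the same table in a single call and inspects it. The functions “dim”, “names”, and “str” are standard in base R, but this check is my addition rather than a required step:

```r
# Rebuild the example table so this snippet runs on its own
R.For.NGOs = data.frame(
  Student      = c("Gajanan", "Hari", "Priya", "Raj", "Shreekanth",
                   "Shreerang", "Soni", "Vinay"),
  Assignment.1 = c(93, 98, 90, 70, 80, 82, 75, 77),
  Assignment.2 = c(90, 87, 83, 88, 78, 87, 79, 84),
  Assignment.3 = c(97, 92, 85, 90, 77, 70, 90, 93)
)

dim(R.For.NGOs)    # number of rows, then number of columns
names(R.For.NGOs)  # the column names
str(R.For.NGOs)    # the type of each column and its first few values
```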

Now, try using the index feature and see what happens.

> R.For.NGOs[3]
  Assignment.2
1           90
2           87
3           83
4           88
5           78
6           87
7           79
8           84

This is probably not what you expected, right? R has returned just the third column. Once data is in a table or a matrix, R needs both a row and a column index, in that order, to return a specific value. Let’s say we wanted Gajanan’s score for the third assignment. Gajanan is the first row and Assignment 3 is the fourth column, so we need to reference the index “[1,4]”. If we wanted only Priya’s scores, she’s the third row, so we would leave the column index blank and reference “[3,]”. Leaving the row index blank instead, as in “[,3]”, returns the whole third column as a plain vector. (Note that you do not write “[3,0]” or “[0,3]”.)

> R.For.NGOs[1,4]
[1] 97
> R.For.NGOs[3,]
  Student Assignment.1 Assignment.2 Assignment.3
3   Priya           90           83           85
> R.For.NGOs[,3]
[1] 90 87 83 88 78 87 79 84
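Counting rows and columns by hand gets error-prone as a table grows. As a sketch of an alternative that the walkthrough above does not cover, you can refer to a column by name with “$” and pick out rows with a logical test. The table is rebuilt here only so the snippet runs on its own:

```r
# Rebuild the example table so this snippet runs on its own
R.For.NGOs = data.frame(
  Student      = c("Gajanan", "Hari", "Priya", "Raj", "Shreekanth",
                   "Shreerang", "Soni", "Vinay"),
  Assignment.1 = c(93, 98, 90, 70, 80, 82, 75, 77),
  Assignment.2 = c(90, 87, 83, 88, 78, 87, 79, 84),
  Assignment.3 = c(97, 92, 85, 90, 77, 70, 90, 93)
)

# Gajanan's score on the third assignment, by column name
R.For.NGOs$Assignment.3[1]

# All of Priya's scores: keep the rows where Student is "Priya"
R.For.NGOs[R.For.NGOs$Student == "Priya", ]

# Just Priya's score on the second assignment
R.For.NGOs[R.For.NGOs$Student == "Priya", "Assignment.2"]
```

The comma inside the square brackets still matters: the test before it selects rows, and whatever follows it selects columns.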

Part two: Using a spreadsheet for data entry

As you can see above, it is not too difficult to create your data right in R. However, if you had a dataset with a lot of records (like the one I used in Quickly reshaping data from “wide” to “long” formats in R), you would be silly to type it all into R. For such data, it makes much more sense to use something like OpenOffice.org Calc or Microsoft Excel or some other spreadsheet interface. For starters, you would be more comfortable with the interface. Additionally, there may be more opportunities to quickly check your data over for errors, and you might even be able to set up data-entry rules to prevent incorrect values from being entered.

How do you get your data into R if it’s in an Excel file or another spreadsheet? There are several ways. The most common one I use is to just save my data as a comma-separated values (CSV) file and open that in R. The second most common approach I use is to copy the data and use R’s “clipboard” feature to get the data into R. Here’s how you’d proceed for each of these approaches.

I’ll assume that you’ve used “File > Change dir…” to have R working out of your “My Documents” folder and that your CSV file is saved in that folder. Here, I’m going to create an object in R called “Book.Sales” using a file called “BookSales.csv” stored in that folder. The first row of this file contains the column names, and the data starts on the second row.

> Book.Sales = read.csv("BookSales.csv", header=TRUE)

If you don’t want to change the working directory, you can also enter the full path to the file. For example “c:\\data\\file.csv” or “c:/data/file.csv” can be used to access a file called “file.csv” in a folder named “data” on drive C. Also, if you know the URL of a CSV file, you can access the file directly by typing the URL in place of the file name. This is the option I usually use for my online examples.
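To try the CSV route without a spreadsheet at hand, the sketch below writes a tiny file to a temporary location and reads it back using its full path. The column names and figures are invented for illustration; they are not the contents of my BookSales.csv.

```r
# Write a small CSV to a temporary file (made-up data, for illustration only)
csv.path = tempfile(fileext = ".csv")
writeLines(c("Title,Copies.Sold",
             "Book A,120",
             "Book B,85"), csv.path)

# The first row holds the column names, so header = TRUE
Book.Sales = read.csv(csv.path, header = TRUE)
Book.Sales

# getwd() shows the folder R searches when you give only a file name
getwd()
```

If “read.csv” complains that it cannot open a file, comparing the output of “getwd()” with the folder where you saved the file is a quick first check.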

Some advice: do not use spreadsheets with merged cells and lots of blank cells at the top. Instead, create a new CSV file where the first row contains the column names and the data starts on the next line. That makes putting your data into R very easy….
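If you are handed a file that you cannot tidy up first, “read.csv” accepts a “skip” argument that ignores a given number of lines at the top of the file. The sketch below uses made-up contents with two lines of clutter above the column names. I would still treat this as a fallback rather than a replacement for a clean file.

```r
# A made-up CSV with two lines of clutter above the real header row
messy.path = tempfile(fileext = ".csv")
writeLines(c("Quarterly report",
             "Prepared by the sales team",
             "Title,Copies.Sold",
             "Book A,120",
             "Book B,85"), messy.path)

# Skip the two lines above the column names, then read as usual
Book.Sales = read.csv(messy.path, header = TRUE, skip = 2)
Book.Sales
```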

If you prefer to go the “copy-and-paste” way, just open up your spreadsheet, copy the cells you’re interested in, and type the following (we’ll assume it is the same dataset):

> Book.Sales = read.delim("clipboard", header=TRUE)

The main difference between the read.delim and read.csv options is that read.csv looks for a comma separating each value, while read.delim looks for a tab character, which is what a spreadsheet puts between cells when you copy them.
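Reading from the clipboard depends on your operating system, so here is a sketch that shows the same distinction with plain text instead. It assumes a version of R recent enough for “read.csv” and “read.delim” to accept a “text” argument, and the data are invented for illustration.

```r
# The same made-up table, once comma-separated and once tab-separated
comma.text = "Title,Copies.Sold\nBook A,120\nBook B,85"
tab.text   = "Title\tCopies.Sold\nBook A\t120\nBook B\t85"

from.csv   = read.csv(text = comma.text, header = TRUE)    # splits on commas
from.delim = read.delim(text = tab.text, header = TRUE)   # splits on tabs

# Both calls should give the same table
all.equal(from.csv, from.delim)
```

Swapping the two functions is a useful experiment: reading the tab-separated text with “read.csv” leaves everything squeezed into a single column.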

Once you’ve overcome the hurdle of getting data into R, playing with your data will be much more fun!

