A lot of the times, students at the Academy enter data in a “wide” format (since it is a very natural way to enter data in a spreadsheet). Let’s say, for example, that they were collecting data for a household, and for each person, they were collecting information on three variables. Assume also that they were only collecting information about five household members. They might end up with a first row of column names something like “HouseholdID” | “member.01″ | “member.02″ | “member.03″ | “member.04″ | “member.05″ | “variable1.01″ | “variable1.02″ | “variable1.03″ | “variable1.04″ | “variable1.05″ | “variable2.01″ | “variable2.02″ … and so on. Sometimes, however, we may find it more useful to have our data in a “long” format. This post tells you how to quickly do that using R.
Here’s an example spreadsheet with some nonsense data entered.
The data is fine in this format, but it makes it difficult to, say, look at everyone who, for ItemC, has “red” as their choice. Below is an example of the same data in the spreadsheet above in long format, from which we can easily determine the answer to such a question.
Now, you can get to those results with some painstaking cut-and-paste work, or you can do the smart thing and load your data into R and be done with your work in just a few lines of code. Here’s how.
Your first step is to get the data into R. How you do this depends on the current format of your data. This example uses a comma separated value (CSV) file stored using Google Spreadsheets. You can use the same URL, or you can use a locally stored file. (I’ve shortened the URL with a cool WordPress plugin called Pretty Link.) I’ll call the data frame “aa”. Since I know that the CSV file has a header row, I’ll add that argument in for R so that it will add column names automatically.
> aa = read.csv("http://news.mrdwab.com/reshape", header=T)
At this point, if you want to see what you’ve imported, you can use the "fix" function which will open up R’s data browser. I’m only interested in seeing the column names and seeing how many columns there are so instead, I’ve just used "names(aa)" which, as you can see below, lets me know that there are 21 columns in this data frame.
1 2 3 4 5 6
> names(aa)  "Unique.var" "UnitID.01" "UnitID.02" "UnitID.03" "UnitID.04"  "UnitID.05" "ItemA.01" "ItemA.02" "ItemA.03" "ItemA.04"  "ItemA.05" "ItemB.01" "ItemB.02" "ItemB.03" "ItemB.04"  "ItemB.05" "ItemC.01" "ItemC.02" "ItemC.03" "ItemC.04"  "ItemC.05"
The following is where the reshaping happens. We need to tell R what data is being reshaped (in this example, "aa"), which direction we want ("wide" or "long"), and which columns "vary" (in this example, columns 2 through 21; this is where the "names" command comes in useful. As we can easily see in the step above, there are 21 columns in this data frame.
> bb = reshape(aa, direction="long", varying=2:21)
Note: The code presented above is the simplest form of using the “reshape” command in R. Depending on how you have named the variables in your actual dataset, you might need to add a few arguments. For example, R will automatically use the period as a separator, but if your variables are named “variableA_01″, “variableA_02″ and so on, you will need to specify this information to R by adding
sep="_"to the reshape command.
At this point, you can use a simple “fix” command to see what your data frame looks like.
> fix(bb) # "bb" is the name of the data frame we want to see, right?
It should look like this:
This is OK, but not great. First, you can see that R has sorted the data frame in a somewhat strange way. For example, I cannot see all of the information about Unique.var “1″ together since the first set of records is on the first row of the data (omitting the header), and the second set of records is on the twentieth row of data. There are also a lot of cells with NA values which, if removed, would make the dataset easier to view.
> cc = bb[with(bb, order(Unique.var)),] # Order the data frame by "Unique.var" > dd = na.omit(cc) # Get rid of the NA cells.
After running those two lines, you’ll have almost what we had in the second spreadsheet in this post. From there, it’s just a matter of saving your final data frame as a CSV file, opening that in a spreadsheet program (I just use Google Docs), and delete the columns you don’t need (in this example, the first [unnamed] column, “time”, and “id”). Of course, when you are doing this, you don’t need to assign a new name to your data frames at each step. I only do it so that I can easily compare what is going on from change to change.