Quickly reshaping data from “wide” to “long” formats in R

A lot of the times, students at the Academy enter data in a “wide” format (since it is a very natural way to enter data in a spreadsheet). Let’s say, for example, that they were collecting data for a household, and for each person, they were collecting information on three variables. Assume also that they were only collecting information about five household members. They might end up with a first row of column names something like “HouseholdID” | “member.01″ | “member.02″ | “member.03″ | “member.04″ | “member.05″ | “variable1.01″ | “variable1.02″ | “variable1.03″ | “variable1.04″ | “variable1.05″ | “variable2.01″ | “variable2.02″ … and so on. Sometimes, however, we may find it more useful to have our data in a “long” format. This post tells you how to quickly do that using R.

Here’s an example spreadsheet with some nonsense data entered.

The data is fine in this format, but it makes it difficult to, say, look at everyone who, for ItemC, has “red” as their choice. Below is an example of the same data in the spreadsheet above in long format, from which we can easily determine the answer to such a question.

Now, you can get to those results with some painstaking cut-and-paste work, or you can do the smart thing and load your data into R and be done with your work in just a few lines of code. Here’s how.

Your first step is to get the data into R. How you do this depends on the current format of your data. This example uses a comma separated value (CSV) file stored using Google Spreadsheets. You can use the same URL, or you can use a locally stored file. (I’ve shortened the URL with a cool WordPress plugin called Pretty Link.) I’ll call the data frame “aa”. Since I know that the CSV file has a header row, I’ll add that argument in for R so that it will add column names automatically.


> aa = read.csv("http://news.mrdwab.com/reshape", header=T)

At this point, if you want to see what you’ve imported, you can use the "fix" function which will open up R’s data browser. I’m only interested in seeing the column names and seeing how many columns there are so instead, I’ve just used "names(aa)" which, as you can see below, lets me know that there are 21 columns in this data frame.


> names(aa)
 [1] "Unique.var" "UnitID.01"  "UnitID.02"  "UnitID.03"  "UnitID.04"
 [6] "UnitID.05"  "ItemA.01"   "ItemA.02"   "ItemA.03"   "ItemA.04"
[11] "ItemA.05"   "ItemB.01"   "ItemB.02"   "ItemB.03"   "ItemB.04"
[16] "ItemB.05"   "ItemC.01"   "ItemC.02"   "ItemC.03"   "ItemC.04"
[21] "ItemC.05"

The following is where the reshaping happens. We need to tell R what data is being reshaped (in this example, "aa"), which direction we want ("wide" or "long"), and which columns "vary" (in this example, columns 2 through 21; this is where the "names" command comes in useful. As we can easily see in the step above, there are 21 columns in this data frame.


> bb = reshape(aa, direction="long", varying=2:21)

Note: The code presented above is the simplest form of using the “reshape” command in R. Depending on how you have named the variables in your actual dataset, you might need to add a few arguments. For example, R will automatically use the period as a separator, but if your variables are named “variableA_01″, “variableA_02″ and so on, you will need to specify this information to R by adding sep="_" to the reshape command.

At this point, you can use a simple “fix” command to see what your data frame looks like.


> fix(bb) # "bb" is the name of the data frame we want to see, right?

It should look like this:

This is OK, but not great. First, you can see that R has sorted the data frame in a somewhat strange way. For example, I cannot see all of the information about Unique.var “1″ together since the first set of records is on the first row of the data (omitting the header), and the second set of records is on the twentieth row of data. There are also a lot of cells with NA values which, if removed, would make the dataset easier to view.


> cc = bb[with(bb, order(Unique.var)),] # Order the data frame by "Unique.var"
> dd = na.omit(cc) # Get rid of the NA cells.

After running those two lines, you’ll have almost what we had in the second spreadsheet in this post. From there, it’s just a matter of saving your final data frame as a CSV file, opening that in a spreadsheet program (I just use Google Docs), and delete the columns you don’t need (in this example, the first [unnamed] column, “time”, and “id”). Of course, when you are doing this, you don’t need to assign a new name to your data frames at each step. I only do it so that I can easily compare what is going on from change to change.


Related posts (possibly):

  1. Getting data into R When you first open R, you’re greeted with a screen...
  2. A little spark for presenting your data For some reason, I’ve been obsessing over the presentation of...
  3. Using the reshape package in R for pivot-table-like functionality A little more than a week ago, I wrote about...
  4. R is like a giant calculator for grownups One of the things that is interesting about R is...
  5. It’s a choropleth party with R, and everyone’s invited Map party time. For some reason this happens every once...
This entry was posted in (all categories), Geekiness, Useless Knowledge and tagged , , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

One Trackback

  1. By R is like a giant calculator for grownups on June 30, 2010 at 12:40 pm

    [...] the fun things about it is how interactive it can be. While my examples so far have been a little bit more involved, it can be useful to spend some time just getting acquainted with how R performs [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

blog comments powered by Disqus