Reshaping data in R revisited

A year ago, I wrote a post about reshaping data from a wide format to a long format. I thought that considering how much time had passed, it would be good to revisit R’s in-built reshape functions. For these examples, I’ve copied the Stata examples from the UCLA Academic Technology Services’s “Reshape data wide to long” page. Since the data is provided in Stata dta files, you need to first load the “foreign” package to be able to read the data in R.

This first example is very basic. There are four variables, the first one being the unique id, and the remaining three being the measures over three years. The basic reshape command in R needs you to specify the data that is being reshaped, the ultimate “direction” (wide or long), and which variables are the ones to be reshaped. The default character R expects for “sep” is a period–in other words, it expects that your variables are named in the form of “faminc.96” and so on. However, your variables may be named in other ways, for example “faminc-96”, “faminc_96”, or (as in this example) “faminc96”. If your variable naming pattern is anything other than what R expects as its default, you also need to specify the separating character. In the case of this dataset, there is no separating character, so you simply use sep=“”.

library(foreign) # Lets us use Stata files directly
# Example 1: Very basic reshape
# Use "read.dta" instead of "read.csv" or "read.table"
faminc = read.dta("http://www.ats.ucla.edu/stat/stata/modules/faminc.dta")
faminc
##   famid faminc96 faminc97 faminc98
## 1     3    75000    76000    77000
## 2     1    40000    40500    41000
## 3     2    45000    45400    45800
l.faminc = reshape(faminc, direction="long",
                   varying=2:4, sep="", idvar="famid")
l.faminc[order(l.faminc$famid),]
##      famid time faminc
## 1.96     1   96  40000
## 1.97     1   97  40500
## 1.98     1   98  41000
## 2.96     2   96  45000
## 2.97     2   97  45400
## 2.98     2   98  45800
## 3.96     3   96  75000
## 3.97     3   97  76000
## 3.98     3   98  77000

In the second example at UCLA ATS, the unique identifier is the combination of the first two variables. Since R assumes that whatever you have not specified as varying is going to be your identifying variable, it is not always required that you specify anything for “idvar”.

# Example 2: Two identifying variables
kidshtwt = read.dta("http://www.ats.ucla.edu/stat/stata/modules/kidshtwt.dta")
kidshtwt
##   famid birth ht1 ht2 wt1 wt2
## 1     1     1 2.8 3.4  19  28
## 2     1     2 2.9 3.8  21  28
## 3     1     3 2.2 2.9  20  23
## 4     2     1 2.0 3.2  25  30
## 5     2     2 1.8 2.8  20  33
## 6     2     3 1.9 2.4  22  33
## 7     3     1 2.2 3.3  22  28
## 8     3     2 2.3 3.4  20  30
## 9     3     3 2.1 2.9  22  31
# Note the use of "timevar" to name the "times" column more appropriately.
# Also, we want to exclude the weight data from our reshape.
l.kidsht = reshape(kidshtwt[-c(5, 6)], direction="long", varying=3:4,
                   sep="", idvar=1:2, timevar="age")
# There are other ways to do that previous step. Here is one:
#      l.kidsht = reshape(kidshtwt, direction="long", idvar=1:2,
#                         varying=3:4, drop=5:6, sep="", timevar="age")
# Let's sort the data first by the family id then by the birth order
l.kidsht[order(l.kidsht$famid, l.kidsht$birth),]
##       famid birth age  ht
## 1.1.1     1     1   1 2.8
## 1.1.2     1     1   2 3.4
## 1.2.1     1     2   1 2.9
## 1.2.2     1     2   2 3.8
## 1.3.1     1     3   1 2.2
## 1.3.2     1     3   2 2.9
## 2.1.1     2     1   1 2.0
## 2.1.2     2     1   2 3.2
## 2.2.1     2     2   1 1.8
## 2.2.2     2     2   2 2.8
## 2.3.1     2     3   1 1.9
## 2.3.2     2     3   2 2.4
## 3.1.1     3     1   1 2.2
## 3.1.2     3     1   2 3.3
## 3.2.1     3     2   1 2.3
## 3.2.2     3     2   2 3.4
## 3.3.1     3     3   1 2.1
## 3.3.2     3     3   2 2.9

The third example at UCLA’s page is pretty straightforward. It uses the same data we just loaded in example 2, but we are reshaping all four measured variables.

l.kidshtwt = reshape(kidshtwt, direction="long", idvar=1:2,
                     varying=3:6, sep="", timevar="age")
l.kidshtwt[order(l.kidshtwt$famid, l.kidshtwt$birth),]
##       famid birth age  ht wt
## 1.1.1     1     1   1 2.8 19
## 1.1.2     1     1   2 3.4 28
## 1.2.1     1     2   1 2.9 21
## 1.2.2     1     2   2 3.8 28
## 1.3.1     1     3   1 2.2 20
## 1.3.2     1     3   2 2.9 23
## 2.1.1     2     1   1 2.0 25
## 2.1.2     2     1   2 3.2 30
## 2.2.1     2     2   1 1.8 20
## 2.2.2     2     2   2 2.8 33
## 2.3.1     2     3   1 1.9 22
## 2.3.2     2     3   2 2.4 33
## 3.1.1     3     1   1 2.2 22
## 3.1.2     3     1   2 3.3 28
## 3.2.1     3     2   1 2.3 20
## 3.2.2     3     2   2 3.4 30
## 3.3.1     3     3   1 2.1 22
## 3.3.2     3     3   2 2.9 31

The fourth example was the most tricky one for me at first. In that example, the variables are not distinuished by “time” (numerically) but rather, by a character. As you can see, the variable names are “famid”, “named”, “incd”, “namem”, and “incm” – in other words, income and name for dad (variable ending in “d”) and mom (variable ending in “m”) for each family.

# Example 3: Non-numeric identifiers for the variables
dadmomw = read.dta("http://www.ats.ucla.edu/stat/stata/modules/dadmomw.dta")
dadmomw
##   famid named  incd namem  incm
## 1     1  Bill 30000  Bess 15000
## 2     2   Art 22000   Amy 18000
## 3     3  Paul 25000   Pat 50000
r.dadmomw = reshape(dadmomw, direction="long", idvar=1,
                    varying=2:5, sep="", v.names=c("name", "inc"),
                    timevar="dadmom", times=c("dad", "mom"))
r.dadmomw[order(r.dadmomw$famid),]
##       famid dadmom  name  inc
## 1.dad     1    dad 30000 Bill
## 1.mom     1    mom 15000 Bess
## 2.dad     2    dad 22000  Art
## 2.mom     2    mom 18000  Amy
## 3.dad     3    dad 25000 Paul
## 3.mom     3    mom 50000  Pat

Stata’s commands are certainly more direct (see below for what you would do for the last example in Stata). R’s commands sometimes tend to be a bit verbose, but in some ways, that might also help you remember what you’re doing. (I still don’t know what “i” and “j” in the Stata reshape commands stand for.) If you can afford the ~ $2,500 price tag, Stata is also faster.

. use http://www.ats.ucla.edu/stat/stata/modules/dadmomw, clear
. list
. use http://www.ats.ucla.edu/stat/stata/modules/dadmomw, clear
. list
. reshape long name  inc, i(famid) j(dadmom) string
. list
comments powered by Disqus