A little more than a week ago, I wrote about creating pivot tables in Microsoft Excel and OpenOffice.org. I also mentioned that I would explain how to do similar calculations by using R. This post will explain how to achieve similar results in R by using the reshape package.
I had initially started experimenting with the reshape package several months ago when I was trying to figure out how to reshape data from wide to long formats. However, once I started experimenting with it, I realized I had misunderstood what the reshape package was designed to do. Now that I finally have a grasp of what can be done using the package, I thought I would share what I’ve found using a few examples.
Part 1: Contingency tables using built-in functions
There are a lot of different ways to tabulate data in R. In this post, I’ll start by demonstrating how to use table, xtabs, and ftable before making things more interesting and using the reshape package to flip data around.
We’ll start with R’s built-in functions. First, we’ll get some data into R. We’ll start with a basic table from Wikipedia that was used to demonstrate cross-tabulation. I’ve already gone and put the data in a CSV file that can be loaded using the code below.
handedness = read.csv("http://news.mrdwab.com/data-handedness")
table(handedness)
## Handedness
## Gender left-handed right-handed
## Female 1 5
## Male 2 4
Did you see how easy it was to tabulate frequencies of the data? Unfortunately, not all data is that easy to work with. Now, we’ll add a couple of columns to the data.
set.seed(123)
fav.col = sample(c("red", "green", "blue"), 12, replace = T)
fav.shape = sample(c("square", "triangle", "circle"), 12, replace = T)
handedness.plus = cbind(handedness, fav.col, fav.shape)
We can use the table function again to see what our data look like. In the following line, we are using favorite color (fav.col) as our rows, Gender as our columns, and separate tables for left-handed and right-handed respondents.
table(handedness.plus$fav.col, handedness.plus$Gender, handedness.plus$Handedness)
## , , = left-handed
##
##
## Female Male
## blue 0 1
## green 1 1
## red 0 0
##
## , , = right-handed
##
##
## Female Male
## blue 2 2
## green 2 1
## red 1 1
##
This is OK, but the code is somewhat cumbersome. The xtabs function will give you the same result, but with some simpler code. Notice that we don’t need to use the “$” notation in the following code since the last element in our command identifies the data.
xtabs(~fav.col + Gender + Handedness, handedness.plus)
## , , Handedness = left-handed
##
## Gender
## fav.col Female Male
## blue 0 1
## green 1 1
## red 0 0
##
## , , Handedness = right-handed
##
## Gender
## fav.col Female Male
## blue 2 2
## green 2 1
## red 1 1
##
The ftable (flat table) function is even more flexible. Instead of having two separate tables, for instance, you can have a single table with the same information. With the ftable function, you mention the data, and then identify what data you want to use for rows, and what data you want to use for your columns. In our data, we have four columns named “Gender“, “Handedness“, “fav.col“, and “fav.shape“. If we wanted a flat table version of the previous output, we want the third column as our rows, and our columns are determined by the first and third columns. Here are three examples and their output.
ftable(handedness.plus, row.vars = 3, col.vars = c(1, 2))
## Gender Female Male
## Handedness left-handed right-handed left-handed right-handed
## fav.col
## blue 0 2 1 2
## green 1 2 1 1
## red 0 1 0 1
ftable(handedness.plus, row.vars = 1, col.vars = c(2, 4))
## Handedness left-handed right-handed
## fav.shape circle square triangle circle square triangle
## Gender
## Female 1 0 0 4 1 0
## Male 0 1 1 1 2 1
ftable(handedness.plus, row.vars = 1:3, col.vars = 4)
## fav.shape circle square triangle
## Gender Handedness fav.col
## Female left-handed blue 0 0 0
## green 1 0 0
## red 0 0 0
## right-handed blue 1 1 0
## green 2 0 0
## red 1 0 0
## Male left-handed blue 0 0 1
## green 0 1 0
## red 0 0 0
## right-handed blue 1 0 1
## green 0 1 0
## red 0 1 0
Part 2: Advanced results using the reshape package
These flat frequency tables are informative, but they may not provide you all the information you actually want. For instance, what if you wanted more pivot-table-like results, where you were interested in not just frequencies, but maybe also sums and averages. If that’s the case, then you really need to figure out how to use the reshape package. Here are some examples.
First, we’ll load the package, load some data, and preview the rows in our data. (We’ll use the same data from the Pivot Tables in Excel and OpenOffice.org Calc post.)
library(reshape)
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
book.sales = read.csv("http://news.mrdwab.com/data-booksales")
names(book.sales)
## [1] "Representative" "Region" "Month" "Publisher"
## [5] "Subject" "Sales" "Margin" "Quantity"
In order to use the reshape package, you need to “melt” the data. The melting function asks for id variables (id.vars) and measured variables (measured). In the case of the book sales dataset, the measured variables are “Sales“, “Margin“, and “Quantity” (columns 6 through 8). Using that information, we melt the data using the following code.
m.book.sales = melt(book.sales, id.vars = 1:5)
Now, it’s time to play around and see what we can do with reshape. The counterpart to “melt” is “cast“. We first mention what dataset we are working with (m.book.sales) and then the reshape function that we want to use. We use the “~” symbol to define relationships. The “~” is usually read as “is described by”, but it’s best understood with some experimentation.
cast(m.book.sales, Region ~ variable)
## Aggregation requires fun.aggregate: length used as default
## Region Sales Margin Quantity
## 1 E 15 15 15
## 2 N 122 122 122
## 3 S 91 91 91
## 4 W 77 77 77
Well, that’s not really useful. But the notice provided, “Aggregation requires fun.aggregate: length used as default” is helpful. What that means is that we need to tell R what functions we want to use, for instance, “sum” or “mean“. Let’s try this again.
cast(m.book.sales, Region ~ variable, sum)
## Region Sales Margin Quantity
## 1 E 2267 1082 135
## 2 N 18328 8552 1130
## 3 S 13276 6231 813
## 4 W 12079 5758 690
Much better. But, can we make more complicated tables? Here’s one sorted first by region, then by representative.
cast(m.book.sales,
Region + Representative ~ variable, sum)
## Region Representative Sales Margin Quantity
## 1 E Kunj 1309 606.4 82
## 2 E Rajesh 958 475.5 53
## 3 N Gajanan 5349 2551.3 307
## 4 N Mallesh 6932 3273.2 445
## 5 N Priya 4791 2168.9 309
## 6 N Ravi 1257 558.9 69
## 7 S Mahanta 3781 1698.0 239
## 8 S Raj 3464 1690.5 183
## 9 S Soni 6032 2842.1 391
## 10 W Shreekanth 4065 1930.8 222
## 11 W Shreerang 3570 1675.5 213
## 12 W Sree Hari 4443 2151.7 255
Notice that the above code is a convenient flat table. Also, the above table works well because each representative only sells in one region. But what if we had used “Region + Representative” instead?
cast(m.book.sales,
Region + Publisher ~ variable, sum)
## Region Publisher Sales Margin Quantity
## 1 E Bloomsbury 261.0 137.0 18
## 2 E McGraw-Hill 1116.0 507.4 58
## 3 E Routledge 120.0 64.8 6
## 4 E SAGE 610.0 289.4 43
## 5 E Viva 160.0 83.2 10
## 6 N Bloomsbury 2812.5 1267.8 237
## 7 N McGraw-Hill 3901.8 1783.7 193
## 8 N Penguin 864.0 403.6 54
## 9 N Routledge 3741.0 1690.6 174
## 10 N SAGE 5484.8 2629.5 380
## 11 N Viva 1524.2 777.1 92
## 12 S Bloomsbury 1335.0 595.0 114
## 13 S McGraw-Hill 2455.8 1123.6 125
## 14 S Penguin 972.0 429.3 67
## 15 S Routledge 2688.0 1292.6 123
## 16 S SAGE 4640.6 2206.8 320
## 17 S Viva 1185.1 583.3 64
## 18 W Bloomsbury 787.5 371.2 63
## 19 W McGraw-Hill 4299.0 1944.2 203
## 20 W Penguin 414.0 188.8 29
## 21 W Routledge 1834.0 904.8 85
## 22 W SAGE 3540.3 1733.5 246
## 23 W Viva 1204.0 615.6 64
Notice that now things are not so tidy. Books from a given publisher are sold in multiple districts. By sticking another “~” into the code after “variable“, we can get some separated tables. Compare the following table with the previous one.
cast(m.book.sales,
Region ~ variable ~ Publisher, sum)
## , , Publisher = Bloomsbury
##
## variable
## Region Sales Margin Quantity
## E 261.0 137.0 18
## N 2812.5 1267.8 237
## S 1335.0 595.0 114
## W 787.5 371.2 63
##
## , , Publisher = McGraw-Hill
##
## variable
## Region Sales Margin Quantity
## E 1116 507.4 58
## N 3902 1783.7 193
## S 2456 1123.6 125
## W 4299 1944.2 203
##
## , , Publisher = Penguin
##
## variable
## Region Sales Margin Quantity
## E 0 0.0 0
## N 864 403.6 54
## S 972 429.3 67
## W 414 188.8 29
##
## , , Publisher = Routledge
##
## variable
## Region Sales Margin Quantity
## E 120 64.8 6
## N 3741 1690.6 174
## S 2688 1292.6 123
## W 1834 904.8 85
##
## , , Publisher = SAGE
##
## variable
## Region Sales Margin Quantity
## E 610 289.4 43
## N 5485 2629.5 380
## S 4641 2206.8 320
## W 3540 1733.5 246
##
## , , Publisher = Viva
##
## variable
## Region Sales Margin Quantity
## E 160 83.2 10
## N 1524 777.1 92
## S 1185 583.3 64
## W 1204 615.6 64
##
What if we wanted totals? To get that, we need to add “margins” to our code.
cast(m.book.sales, Region ~ variable, sum,
margins=c("grand_col", "grand_row"))
## Region Sales Margin Quantity (all)
## 1 E 2267 1082 135 3484
## 2 N 18328 8552 1130 28011
## 3 S 13276 6231 813 20320
## 4 W 12079 5758 690 18527
## 5 (all) 45951 21623 2768 70341
And, what if we wanted both sum and mean? We can set up multiple functions as follows.
cast(m.book.sales,
Region ~ variable, c(sum, mean))
## Region Sales_sum Sales_mean Margin_sum Margin_mean Quantity_sum
## 1 E 2267 151.1 1082 72.12 135
## 2 N 18328 150.2 8552 70.10 1130
## 3 S 13276 145.9 6231 68.47 813
## 4 W 12079 156.9 5758 74.78 690
## Quantity_mean
## 1 9.000
## 2 9.262
## 3 8.934
## 4 8.961
Of course, we can also subset our data so that we just get information on selected variables. You may remember from the original book sales dataset that the “variables” are “Sales“, “Margin“, and “Quantity“. We can use that information to cast different summary tables. Here are two more examples.
cast(m.book.sales,
Region + Representative ~ variable,
c(length, sum, mean),
subset = variable %in% "Sales")
## Region Representative Sales_length Sales_sum Sales_mean
## 1 E Kunj 8 1309 163.6
## 2 E Rajesh 7 958 136.9
## 3 N Gajanan 33 5349 162.1
## 4 N Mallesh 47 6932 147.5
## 5 N Priya 35 4791 136.9
## 6 N Ravi 7 1257 179.6
## 7 S Mahanta 24 3781 157.5
## 8 S Raj 23 3464 150.6
## 9 S Soni 44 6032 137.1
## 10 W Shreekanth 26 4065 156.4
## 11 W Shreerang 22 3570 162.3
## 12 W Sree Hari 29 4443 153.2
cast(m.book.sales,
Region + Representative ~ variable,
c(sum, mean),
subset = variable %in% c("Sales", "Quantity"))
## Region Representative Sales_sum Sales_mean Quantity_sum Quantity_mean
## 1 E Kunj 1309 163.6 82 10.250
## 2 E Rajesh 958 136.9 53 7.571
## 3 N Gajanan 5349 162.1 307 9.303
## 4 N Mallesh 6932 147.5 445 9.468
## 5 N Priya 4791 136.9 309 8.829
## 6 N Ravi 1257 179.6 69 9.857
## 7 S Mahanta 3781 157.5 239 9.958
## 8 S Raj 3464 150.6 183 7.957
## 9 S Soni 6032 137.1 391 8.886
## 10 W Shreekanth 4065 156.4 222 8.538
## 11 W Shreerang 3570 162.3 213 9.682
## 12 W Sree Hari 4443 153.2 255 8.793
Who needs pivot tables in Excel?
And you ENJOY doing this?!?!
Wouldn’t you?
Just wait until my current project is unveiled….
Thanks for your sharing! It at least gives me some ideas when difficulties keep coming to me~~
You’re welcome. I think solving problems in R helps generate new ideas anyway…
Would be great if you could comment on whether or not ‘reshape’ can be used to create output similar to that of ftable … that is, how to create something that looks like a pivot table with categorical variables on both the rows and columns … but with values based on functions (i.e. sum)
Mauricio,
I’m not sure I understand entirely what you’re looking for. Can you provide an example of the raw data and describe what you want to end up with?
I agree,
melt/recast/reshapeis essentially a better (harder to use but better) version of Pivot tables / OLAP cubes. I wouldn’t say your other post was wrong exactly, just not using the full power ofplyr.I think in the other one you could have just transposed the data, no?
I still need to get to grips with
plyr(pun intended?), but it appears it could also help with the wide-to-long reshape example. That example isn’t a simple transpose example though–if it were, the dimensions of the output would be the same as the input, but with the lengths of the rows and columns switched.