I had initially started experimenting with the reshape package several months ago when I was trying to figure out how to reshape data from wide to long formats. However, once I started experimenting with it, I realized I had misunderstood what the reshape package was designed to do. Now that I finally have a grasp of what can be done using the package, I thought I would share what I’ve found using a few examples.
Part 1: Contingency tables using built-in functions
There are a lot of different ways to tabulate data in R. In this post, I’ll start by demonstrating how to use table, xtabs, and ftable before making things more interesting and using the reshape package to flip data around.
We’ll start with R’s built-in functions. First, we’ll get some data into R. We’ll start with a basic table from Wikipedia that was used to demonstrate cross-tabulation. I’ve already gone and put the data in a CSV file that can be loaded using the code below.
Did you see how easy it was to tabulate frequencies of the data? Unfortunately, not all data is that easy to work with. Now, we’ll add a couple of columns to the data.
We can use the table function again to see what our data look like. In the following line, we are using favorite color (fav.col) as our rows, Gender as our columns, and separate tables for left-handed and right-handed respondents.
>table(handedness.plus$fav.col, handedness.plus$Gender,
handedness.plus$Handedness)
, , = left-handed
Female Male
blue 01
green 11
red 00
, , = right-handed
Female Male
blue 22
green 21
red 11
This is OK, but the code is somewhat cumbersome. The xtabs function will give you the same result, but with some simpler code. Notice that we don’t need to use the “$” notation in the following code since the last element in our command identifies the data.
>xtabs(~ fav.col+ Gender + Handedness, handedness.plus)
, , Handedness = left-handed
Gender
fav.col Female Male
blue 01
green 11
red 00
, , Handedness = right-handed
Gender
fav.col Female Male
blue 22
green 21
red 11
The ftable (flat table) function is even more flexible. Instead of having two separate tables, for instance, you can have a single table with the same information. With the ftable function, you mention the data, and then identify what data you want to use for rows, and what data you want to use for your columns. In our data, we have four columns named “Gender“, “Handedness“, “fav.col“, and “fav.shape“. If we wanted a flat table version of the previous output, we want the third column as our rows, and our columns are determined by the first and third columns. Here are three examples and their output.
>ftable(handedness.plus, row.vars=3, col.vars=c(1, 2))
Gender Female Male
Handedness left-handed right-handed left-handed right-handed
fav.col
blue 0212
green 1211
red 0101>ftable(handedness.plus, row.vars=1, col.vars=c(2, 4))
Handedness left-handed right-handed
fav.shape circle square triangle circle square triangle
Gender
Female 001212
Male 101211>ftable(handedness.plus, row.vars=1:3, col.vars=4)
fav.shape circle square triangle
Gender Handedness fav.col
Female left-handed blue 000
green 001
red 000
right-handed blue 200
green 002
red 010
Male left-handed blue 100
green 001
red 000
right-handed blue 200
green 001
red 010
Part 2: Advanced results using the reshape package
These flat frequency tables are informative, but they may not provide you all the information you actually want. For instance, what if you wanted more pivot-table-like results, where you were interested in not just frequencies, but maybe also sums and averages. If that’s the case, then you really need to figure out how to use the reshape package. Here are some examples.
In order to use the reshape package, you need to “melt” the data. The melting function asks for id variables (id.vars) and measured variables (measured). In the case of the book sales dataset, the measured variables are “Sales“, “Margin“, and “Quantity” (columns 6 through 8). Using that information, we melt the data using the following code.
Now, it’s time to play around and see what we can do with reshape. The counterpart to “melt” is “cast“. We first mention what dataset we are working with (m.book.sales) and then the reshape function that we want to use. We use the “~” symbol to define relationships. The “~” is usually read as “is described by”, but it’s best understood with some experimentation.
> cast(m.book.sales, Region ~ variable)
Aggregation requires fun.aggregate:length used as default
Region Sales Margin Quantity
1 E 1515152 N 1221221223 S 9191914 W 777777
Well, that’s not really useful. But the notice provided, “Aggregation requires fun.aggregate: length used as default” is helpful. What that means is that we need to tell R what functions we want to use, for instance, “sum” or “mean“. Let’s try this again.
> cast(m.book.sales, Region ~ variable, sum)
Region Sales Margin Quantity
1 E 2267.01081.841352 N 18328.38552.2211303 S 13276.56230.628134 W 12078.85758.08690
Much better. But, can we make more complicated tables? Here’s one sorted first by region, then by representative.
> cast(m.book.sales, Region + Representative ~ variable, sum)
Region Representative Sales Margin Quantity
1 E Kunj 1309.00606.37822 E Rajesh 958.00475.47533 N Gajanan 5348.952551.273074 N Mallesh 6931.703273.154455 N Priya 4790.652168.953096 N Ravi 1257.00558.85697 S Mahanta 3780.701698.012398 S Raj 3463.801690.531839 S Soni 6032.002842.0839110 W Shreekanth 4065.201930.8422211 W Shreerang 3570.201675.5521312 W Sree Hari 4443.402151.69255
Notice that the above code is a convenient flat table. Also, the above table works well because each representative only sells in one region. But what if we had used “Region + Representative” instead?
> cast(m.book.sales, Region + Publisher ~ variable, sum)
Region Publisher Sales Margin Quantity
1 E Bloomsbury 261.0137.03182 E McGraw-Hill 1116.0507.45583 E Routledge 120.064.8064 E SAGE 610.0289.36435 E Viva 160.083.20106 N Bloomsbury 2812.51267.792377 N McGraw-Hill 3901.81783.681938 N Penguin 864.0403.56549 N Routledge 3741.01690.6017410 N SAGE 5484.82629.4938011 N Viva 1524.2777.109212 S Bloomsbury 1335.0595.0111413 S McGraw-Hill 2455.81123.6212514 S Penguin 972.0429.306715 S Routledge 2688.01292.6012316 S SAGE 4640.62206.7732017 S Viva 1185.1583.326418 W Bloomsbury 787.5371.216319 W McGraw-Hill 4299.01944.2520320 W Penguin 414.0188.822921 W Routledge 1834.0904.788522 W SAGE 3540.31733.4624623 W Viva 1204.0615.5664
Notice that now things are not so tidy. Books from a given publisher are sold in multiple districts. By sticking another “~” into the code after “variable“, we can get some separated tables. Compare the following table with the previous one.
> cast(m.book.sales, Region ~ variable ~ Publisher, sum)
, , Publisher = Bloomsbury
variable
Region Sales Margin Quantity
E 261.0137.0318
N 2812.51267.79237
S 1335.0595.01114
W 787.5371.2163
, , Publisher = McGraw-Hill
variable
Region Sales Margin Quantity
E 1116.0507.4558
N 3901.81783.68193
S 2455.81123.62125
W 4299.01944.25203
, , Publisher = Penguin
variable
Region Sales Margin Quantity
E 00.000
N 864403.5654
S 972429.3067
W 414188.8229
, , Publisher = Routledge
variable
Region Sales Margin Quantity
E 12064.806
N 37411690.60174
S 26881292.60123
W 1834904.7885
, , Publisher = SAGE
variable
Region Sales Margin Quantity
E 610.0289.3643
N 5484.82629.49380
S 4640.62206.77320
W 3540.31733.46246
, , Publisher = Viva
variable
Region Sales Margin Quantity
E 160.083.2010
N 1524.2777.1092
S 1185.1583.3264
W 1204.0615.5664
What if we wanted totals? To get that, we need to add “margins” to our code.
> cast(m.book.sales, Region ~ variable, sum, margins=c("grand_col", "grand_row"))
Region Sales Margin Quantity (all)1 E 2267.01081.841353483.842 N 18328.38552.22113028010.523 S 13276.56230.6281320320.124 W 12078.85758.0869018526.885(all)45950.621622.76276870341.36
And, what if we wanted both sum and mean? We can set up multiple functions as follows.
> cast(m.book.sales, Region ~ variable, c(sum, mean))
Region Sales_sum Sales_mean Margin_sum Margin_mean Quantity_sum Quantity_mean
1 E 2267.0151.13331081.8472.122671359.0000002 N 18328.3150.23208552.2270.1001611309.2622953 S 13276.5145.89566230.6268.468358138.9340664 W 12078.8156.86755758.0874.780266908.961039
Of course, we can also subset our data so that we just get information on selected variables. You may remember from the original book sales dataset that the “variables” are “Sales“, “Margin“, and “Quantity“. We can use that information to cast different summary tables. Here are two more examples.
> cast(m.book.sales, Region + Representative ~ variable,
c(length, sum, mean), subset= variable %in%"Sales")
Region Representative Sales_length Sales_sum Sales_mean
1 E Kunj 81309.00163.62502 E Rajesh 7958.00136.85713 N Gajanan 335348.95162.08944 N Mallesh 476931.70147.48305 N Priya 354790.65136.87576 N Ravi 71257.00179.57147 S Mahanta 243780.70157.52928 S Raj 233463.80150.60009 S Soni 446032.00137.090910 W Shreekanth 264065.20156.353811 W Shreerang 223570.20162.281812 W Sree Hari 294443.40153.2207> cast(m.book.sales, Region + Representative ~ variable,
c(sum, mean), subset= variable %in%c("Sales", "Quantity"))
Region Representative Sales_sum Sales_mean Quantity_sum Quantity_mean
1 E Kunj 1309.00163.62508210.2500002 E Rajesh 958.00136.8571537.5714293 N Gajanan 5348.95162.08943079.3030304 N Mallesh 6931.70147.48304459.4680855 N Priya 4790.65136.87573098.8285716 N Ravi 1257.00179.5714699.8571437 S Mahanta 3780.70157.52922399.9583338 S Raj 3463.80150.60001837.9565229 S Soni 6032.00137.09093918.88636410 W Shreekanth 4065.20156.35382228.53846211 W Shreerang 3570.20162.28182139.68181812 W Sree Hari 4443.40153.22072558.793103
Thanks for your sharing! It at least gives me some ideas when difficulties keep coming to me~~
http://news.mrdwab.com mrdwab
You’re welcome. I think solving problems in R helps generate new ideas anyway…
Mauricio
Would be great if you could comment on whether or not ‘reshape’ can be used to create output similar to that of ftable … that is, how to create something that looks like a pivot table with categorical variables on both the rows and columns … but with values based on functions (i.e. sum)
http://news.mrdwab.com mrdwab
Mauricio,
I’m not sure I understand entirely what you’re looking for. Can you provide an example of the raw data and describe what you want to end up with?
I agree, melt/recast/reshape is essentially a better (harder to use but better) version of Pivot tables / OLAP cubes. I wouldn’t say your other post was wrong exactly, just not using the full power of plyr. I think in the other one you could have just transposed the data, no?
http://news.mrdwab.com mrdwab
I still need to get to grips with plyr (pun intended?), but it appears it could also help with the wide-to-long reshape example. That example isn’t a simple transpose example though–if it were, the dimensions of the output would be the same as the input, but with the lengths of the rows and columns switched.
Using the reshape package in R for pivot-table-like functionality
A little more than a week ago, I wrote about creating pivot tables in Microsoft Excel and OpenOffice.org. I also mentioned that I would explain how to do similar calculations by using R. This post will explain how to achieve similar results in R by using the reshape package.
I had initially started experimenting with the reshape package several months ago when I was trying to figure out how to reshape data from wide to long formats. However, once I started experimenting with it, I realized I had misunderstood what the reshape package was designed to do. Now that I finally have a grasp of what can be done using the package, I thought I would share what I’ve found using a few examples.
Part 1: Contingency tables using built-in functions
There are a lot of different ways to tabulate data in R. In this post, I’ll start by demonstrating how to use
table,xtabs, andftablebefore making things more interesting and using the reshape package to flip data around.We’ll start with R’s built-in functions. First, we’ll get some data into R. We’ll start with a basic table from Wikipedia that was used to demonstrate cross-tabulation. I’ve already gone and put the data in a CSV file that can be loaded using the code below.
Did you see how easy it was to tabulate frequencies of the data? Unfortunately, not all data is that easy to work with. Now, we’ll add a couple of columns to the data.
We can use the table function again to see what our data look like. In the following line, we are using favorite color (
fav.col) as our rows,Genderas our columns, and separate tables for left-handed and right-handed respondents.This is OK, but the code is somewhat cumbersome. The
xtabsfunction will give you the same result, but with some simpler code. Notice that we don’t need to use the “$” notation in the following code since the last element in our command identifies the data.The
ftable(flat table) function is even more flexible. Instead of having two separate tables, for instance, you can have a single table with the same information. With theftablefunction, you mention the data, and then identify what data you want to use for rows, and what data you want to use for your columns. In our data, we have four columns named “Gender“, “Handedness“, “fav.col“, and “fav.shape“. If we wanted a flat table version of the previous output, we want the third column as our rows, and our columns are determined by the first and third columns. Here are three examples and their output.Part 2: Advanced results using the reshape package
These flat frequency tables are informative, but they may not provide you all the information you actually want. For instance, what if you wanted more pivot-table-like results, where you were interested in not just frequencies, but maybe also sums and averages. If that’s the case, then you really need to figure out how to use the reshape package. Here are some examples.
First, we’ll load the package, load some data, and preview the rows in our data. (We’ll use the same data from the Pivot Tables in Excel and OpenOffice.org Calc post.)
In order to use the reshape package, you need to “
melt” the data. The melting function asks for id variables (id.vars) and measured variables (measured). In the case of the book sales dataset, the measured variables are “Sales“, “Margin“, and “Quantity” (columns 6 through 8). Using that information, we melt the data using the following code.Now, it’s time to play around and see what we can do with reshape. The counterpart to “
melt” is “cast“. We first mention what dataset we are working with (m.book.sales) and then the reshape function that we want to use. We use the “~” symbol to define relationships. The “~” is usually read as “is described by”, but it’s best understood with some experimentation.Well, that’s not really useful. But the notice provided, “
Aggregation requires fun.aggregate: length used as default” is helpful. What that means is that we need to tell R what functions we want to use, for instance, “sum” or “mean“. Let’s try this again.Much better. But, can we make more complicated tables? Here’s one sorted first by region, then by representative.
Notice that the above code is a convenient flat table. Also, the above table works well because each representative only sells in one region. But what if we had used “
Region + Representative” instead?Notice that now things are not so tidy. Books from a given publisher are sold in multiple districts. By sticking another “
~” into the code after “variable“, we can get some separated tables. Compare the following table with the previous one.What if we wanted totals? To get that, we need to add “
margins” to our code.And, what if we wanted both sum and mean? We can set up multiple functions as follows.
Of course, we can also subset our data so that we just get information on selected variables. You may remember from the original book sales dataset that the “
variables” are “Sales“, “Margin“, and “Quantity“. We can use that information to cast different summary tables. Here are two more examples.Who needs pivot tables in Excel?
Related posts (possibly):