Using the reshape package in R for pivot-table-like functionality

A little more than a week ago, I wrote about creating pivot tables in Microsoft Excel and OpenOffice.org. I also mentioned that I would explain how to do similar calculations by using R. This post will explain how to achieve similar results in R by using the reshape package.

I had initially started experimenting with the reshape package several months ago when I was trying to figure out how to reshape data from wide to long formats. However, once I started experimenting with it, I realized I had misunderstood what the reshape package was designed to do. Now that I finally have a grasp of what can be done using the package, I thought I would share what I’ve found using a few examples.

Part 1: Contingency tables using built-in functions

There are a lot of different ways to tabulate data in R. In this post, I’ll start by demonstrating how to use table, xtabs, and ftable before making things more interesting and using the reshape package to flip data around.

We’ll start with R’s built-in functions. First, we’ll get some data into R. We’ll start with a basic table from Wikipedia that was used to demonstrate cross-tabulation. I’ve already gone and put the data in a CSV file that can be loaded using the code below.

?View Code RSPLUS
1
2
3
4
5
6
> handedness = read.csv("http://news.mrdwab.com/handedness")
> table(handedness)
        Handedness
Gender   left-handed right-handed
  Female           1            5
  Male             2            4

Did you see how easy it was to tabulate frequencies of the data? Unfortunately, not all data is that easy to work with. Now, we’ll add a couple of columns to the data.

?View Code RSPLUS
1
2
3
4
5
> set.seed(123)
> fav.col = sample(c("red", "green", "blue"), 12, replace = T)
> set.seed(123)
> fav.shape = sample(c("square", "triangle", "circle"), 12, replace = T)
> handedness.plus = cbind(handedness,fav.col,fav.shape)

We can use the table function again to see what our data look like. In the following line, we are using favorite color (fav.col) as our rows, Gender as our columns, and separate tables for left-handed and right-handed respondents.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
> table(handedness.plus$fav.col, handedness.plus$Gender,
       handedness.plus$Handedness)
, ,  = left-handed
 
        Female Male
  blue       0    1
  green      1    1
  red        0    0
 
, ,  = right-handed
 
        Female Male
  blue       2    2
  green      2    1
  red        1    1

This is OK, but the code is somewhat cumbersome. The xtabs function will give you the same result, but with some simpler code. Notice that we don’t need to use the “$” notation in the following code since the last element in our command identifies the data.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
> xtabs(~ fav.col + Gender + Handedness, handedness.plus)
, , Handedness = left-handed
 
       Gender
fav.col Female Male
  blue       0    1
  green      1    1
  red        0    0
 
, , Handedness = right-handed
 
       Gender
fav.col Female Male
  blue       2    2
  green      2    1
  red        1    1

The ftable (flat table) function is even more flexible. Instead of having two separate tables, for instance, you can have a single table with the same information. With the ftable function, you mention the data, and then identify what data you want to use for rows, and what data you want to use for your columns. In our data, we have four columns named “Gender“, “Handedness“, “fav.col“, and “fav.shape“. If we wanted a flat table version of the previous output, we want the third column as our rows, and our columns are determined by the first and third columns. Here are three examples and their output.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
> ftable(handedness.plus, row.vars = 3, col.vars = c(1, 2))
        Gender          Female                     Male
        Handedness left-handed right-handed left-handed right-handed
fav.col
blue                         0            2           1            2
green                        1            2           1            1
red                          0            1           0            1
> ftable(handedness.plus, row.vars = 1, col.vars = c(2, 4))
       Handedness left-handed                 right-handed
       fav.shape       circle square triangle       circle square triangle
Gender
Female                      0      0        1            2      1        2
Male                        1      0        1            2      1        1
> ftable(handedness.plus, row.vars = 1:3, col.vars = 4)
                            fav.shape circle square triangle
Gender Handedness   fav.col
Female left-handed  blue                   0      0        0
                    green                  0      0        1
                    red                    0      0        0
       right-handed blue                   2      0        0
                    green                  0      0        2
                    red                    0      1        0
Male   left-handed  blue                   1      0        0
                    green                  0      0        1
                    red                    0      0        0
       right-handed blue                   2      0        0
                    green                  0      0        1
                    red                    0      1        0

Part 2: Advanced results using the reshape package

These flat frequency tables are informative, but they may not provide you all the information you actually want. For instance, what if you wanted more pivot-table-like results, where you were interested in not just frequencies, but maybe also sums and averages. If that’s the case, then you really need to figure out how to use the reshape package. Here are some examples.

First, we’ll load the package, load some data, and preview the rows in our data. (We’ll use the same data from the Pivot Tables in Excel and OpenOffice.org Calc post.)

?View Code RSPLUS
1
2
3
4
5
6
7
> library(reshape)
Loading required package: plyr
Loading required package: plyr
> book.sales = read.csv("http://news.mrdwab.com/booksales")
> names(book.sales)
[1] "Representative" "Region"         "Month"          "Publisher"
[5] "Subject"        "Sales"          "Margin"         "Quantity"

In order to use the reshape package, you need to “melt” the data. The melting function asks for id variables (id.vars) and measured variables (measured). In the case of the book sales dataset, the measured variables are “Sales“, “Margin“, and “Quantity” (columns 6 through 8). Using that information, we melt the data using the following code.

?View Code RSPLUS
1
> m.book.sales = melt(book.sales, id.vars = 1:5)

Now, it’s time to play around and see what we can do with reshape. The counterpart to “melt” is “cast“. We first mention what dataset we are working with (m.book.sales) and then the reshape function that we want to use. We use the “~” symbol to define relationships. The “~” is usually read as “is described by”, but it’s best understood with some experimentation.

?View Code RSPLUS
1
2
3
4
5
6
7
> cast(m.book.sales, Region ~ variable)
Aggregation requires fun.aggregate: length used as default
  Region Sales Margin Quantity
1      E    15     15       15
2      N   122    122      122
3      S    91     91       91
4      W    77     77       77

Well, that’s not really useful. But the notice provided, “Aggregation requires fun.aggregate: length used as default” is helpful. What that means is that we need to tell R what functions we want to use, for instance, “sum” or “mean“. Let’s try this again.

?View Code RSPLUS
1
2
3
4
5
6
> cast(m.book.sales, Region ~ variable, sum)
  Region   Sales  Margin Quantity
1      E  2267.0 1081.84      135
2      N 18328.3 8552.22     1130
3      S 13276.5 6230.62      813
4      W 12078.8 5758.08      690

Much better. But, can we make more complicated tables? Here’s one sorted first by region, then by representative.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
> cast(m.book.sales, Region + Representative ~ variable, sum)
   Region Representative   Sales  Margin Quantity
1       E           Kunj 1309.00  606.37       82
2       E         Rajesh  958.00  475.47       53
3       N        Gajanan 5348.95 2551.27      307
4       N        Mallesh 6931.70 3273.15      445
5       N          Priya 4790.65 2168.95      309
6       N           Ravi 1257.00  558.85       69
7       S        Mahanta 3780.70 1698.01      239
8       S            Raj 3463.80 1690.53      183
9       S           Soni 6032.00 2842.08      391
10      W     Shreekanth 4065.20 1930.84      222
11      W      Shreerang 3570.20 1675.55      213
12      W      Sree Hari 4443.40 2151.69      255

Notice that the above code is a convenient flat table. Also, the above table works well because each representative only sells in one region. But what if we had used “Region + Representative” instead?

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
> cast(m.book.sales, Region + Publisher ~ variable, sum)
   Region   Publisher  Sales  Margin Quantity
1       E  Bloomsbury  261.0  137.03       18
2       E McGraw-Hill 1116.0  507.45       58
3       E   Routledge  120.0   64.80        6
4       E        SAGE  610.0  289.36       43
5       E        Viva  160.0   83.20       10
6       N  Bloomsbury 2812.5 1267.79      237
7       N McGraw-Hill 3901.8 1783.68      193
8       N     Penguin  864.0  403.56       54
9       N   Routledge 3741.0 1690.60      174
10      N        SAGE 5484.8 2629.49      380
11      N        Viva 1524.2  777.10       92
12      S  Bloomsbury 1335.0  595.01      114
13      S McGraw-Hill 2455.8 1123.62      125
14      S     Penguin  972.0  429.30       67
15      S   Routledge 2688.0 1292.60      123
16      S        SAGE 4640.6 2206.77      320
17      S        Viva 1185.1  583.32       64
18      W  Bloomsbury  787.5  371.21       63
19      W McGraw-Hill 4299.0 1944.25      203
20      W     Penguin  414.0  188.82       29
21      W   Routledge 1834.0  904.78       85
22      W        SAGE 3540.3 1733.46      246
23      W        Viva 1204.0  615.56       64

Notice that now things are not so tidy. Books from a given publisher are sold in multiple districts. By sticking another “~” into the code after “variable“, we can get some separated tables. Compare the following table with the previous one.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
> cast(m.book.sales, Region ~ variable ~ Publisher, sum)
, , Publisher = Bloomsbury
 
      variable
Region  Sales  Margin Quantity
     E  261.0  137.03       18
     N 2812.5 1267.79      237
     S 1335.0  595.01      114
     W  787.5  371.21       63
 
, , Publisher = McGraw-Hill
 
      variable
Region  Sales  Margin Quantity
     E 1116.0  507.45       58
     N 3901.8 1783.68      193
     S 2455.8 1123.62      125
     W 4299.0 1944.25      203
 
, , Publisher = Penguin
 
      variable
Region Sales Margin Quantity
     E     0   0.00        0
     N   864 403.56       54
     S   972 429.30       67
     W   414 188.82       29
 
, , Publisher = Routledge
 
      variable
Region Sales  Margin Quantity
     E   120   64.80        6
     N  3741 1690.60      174
     S  2688 1292.60      123
     W  1834  904.78       85
 
, , Publisher = SAGE
 
      variable
Region  Sales  Margin Quantity
     E  610.0  289.36       43
     N 5484.8 2629.49      380
     S 4640.6 2206.77      320
     W 3540.3 1733.46      246
 
, , Publisher = Viva
 
      variable
Region  Sales Margin Quantity
     E  160.0  83.20       10
     N 1524.2 777.10       92
     S 1185.1 583.32       64
     W 1204.0 615.56       64

What if we wanted totals? To get that, we need to add “margins” to our code.

?View Code RSPLUS
1
2
3
4
5
6
7
> cast(m.book.sales, Region ~ variable, sum, margins=c("grand_col", "grand_row"))
  Region   Sales   Margin Quantity    (all)
1      E  2267.0  1081.84      135  3483.84
2      N 18328.3  8552.22     1130 28010.52
3      S 13276.5  6230.62      813 20320.12
4      W 12078.8  5758.08      690 18526.88
5  (all) 45950.6 21622.76     2768 70341.36

And, what if we wanted both sum and mean? We can set up multiple functions as follows.

?View Code RSPLUS
1
2
3
4
5
6
> cast(m.book.sales, Region ~ variable, c(sum, mean))
  Region Sales_sum Sales_mean Margin_sum Margin_mean Quantity_sum Quantity_mean
1      E    2267.0   151.1333    1081.84    72.12267          135      9.000000
2      N   18328.3   150.2320    8552.22    70.10016         1130      9.262295
3      S   13276.5   145.8956    6230.62    68.46835          813      8.934066
4      W   12078.8   156.8675    5758.08    74.78026          690      8.961039

Of course, we can also subset our data so that we just get information on selected variables. You may remember from the original book sales dataset that the “variables” are “Sales“, “Margin“, and “Quantity“. We can use that information to cast different summary tables. Here are two more examples.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
> cast(m.book.sales, Region + Representative ~ variable,
      c(length, sum, mean), subset = variable %in% "Sales")
   Region Representative Sales_length Sales_sum Sales_mean
1       E           Kunj            8   1309.00   163.6250
2       E         Rajesh            7    958.00   136.8571
3       N        Gajanan           33   5348.95   162.0894
4       N        Mallesh           47   6931.70   147.4830
5       N          Priya           35   4790.65   136.8757
6       N           Ravi            7   1257.00   179.5714
7       S        Mahanta           24   3780.70   157.5292
8       S            Raj           23   3463.80   150.6000
9       S           Soni           44   6032.00   137.0909
10      W     Shreekanth           26   4065.20   156.3538
11      W      Shreerang           22   3570.20   162.2818
12      W      Sree Hari           29   4443.40   153.2207
> cast(m.book.sales, Region + Representative ~ variable,
      c(sum, mean), subset = variable %in% c("Sales", "Quantity"))
   Region Representative Sales_sum Sales_mean Quantity_sum Quantity_mean
1       E           Kunj   1309.00   163.6250           82     10.250000
2       E         Rajesh    958.00   136.8571           53      7.571429
3       N        Gajanan   5348.95   162.0894          307      9.303030
4       N        Mallesh   6931.70   147.4830          445      9.468085
5       N          Priya   4790.65   136.8757          309      8.828571
6       N           Ravi   1257.00   179.5714           69      9.857143
7       S        Mahanta   3780.70   157.5292          239      9.958333
8       S            Raj   3463.80   150.6000          183      7.956522
9       S           Soni   6032.00   137.0909          391      8.886364
10      W     Shreekanth   4065.20   156.3538          222      8.538462
11      W      Shreerang   3570.20   162.2818          213      9.681818
12      W      Sree Hari   4443.40   153.2207          255      8.793103

Who needs pivot tables in Excel?


Related posts (possibly):

  1. Pivot Tables in Excel and OpenOffice.org Calc One of the features I find useful in Excel is...
  2. Stratified random sampling in R from a data frame After a little bit more work, there’s a new stratified...
  3. The new sample size calculator for R (already) aka “Maybe I shouldn’t post so quickly” Just hours ago,...
  4. Getting data into R When you first open R, you’re greeted with a screen...
  5. Reshaping data in R revisited A year ago, I wrote a post about reshaping data...
This entry was posted in (all categories), Geekiness, Useless Knowledge and tagged , , , , , , , , , , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.
  • Axlett

    And you ENJOY doing this?!?!

  • http://news.mrdwab.com mrdwab

    Wouldn’t you?

    Just wait until my current project is unveiled….

  • Edmundchen

    Thanks for your sharing! It at least gives me some ideas when difficulties keep coming to me~~

    • http://news.mrdwab.com mrdwab

      You’re welcome. I think solving problems in R helps generate new ideas anyway…

  • Mauricio

    Would be great if you could comment on whether or not ‘reshape’ can be used to create output similar to that of ftable … that is, how to create something that looks like a pivot table with categorical variables on both the rows and columns … but with values based on functions (i.e. sum)

    • http://news.mrdwab.com mrdwab

      Mauricio,

      I’m not sure I understand entirely what you’re looking for. Can you provide an example of the raw data and describe what you want to end up with?

  • http://blog.hiremebecauseimsmart.com recommendation engineer

    I agree, melt/recast/reshape is essentially a better (harder to use but better) version of Pivot tables / OLAP cubes. I wouldn’t say your other post was wrong exactly, just not using the full power of plyr.
    I think in the other one you could have just transposed the data, no?

    • http://news.mrdwab.com mrdwab

      I still need to get to grips with plyr (pun intended?), but it appears it could also help with the wide-to-long reshape example. That example isn’t a simple transpose example though–if it were, the dimensions of the output would be the same as the input, but with the lengths of the rows and columns switched.