The "splitstackshape" package for R

A while ago, a friend of ours presented me with a data problem. Her questionnaire had some questions where the respondent could provide multiple responses. You know, the "Check as many as apply" type of questions. One way that this data is commonly stored is to put a comma separated value into a single cell in a spreadsheet. In fact, if you use something like Google Forms to collect your data and have questions that use check-boxes, that's how your data will finally be stored in a Google Spreadsheet.

Thus, you might end up with something like this:

mydf <- data.frame(
  ID = c("A01", "A02", "A03", "A04", "B01", "B02"),
  Name = c("Boyd", "Rufus", "Dana",
           "Carole", "Ramona", "Kelley"),
  Likes = c("1,2,4,5,6", "1,2,5,6", "1,3,4",
            "2,3,6", "1,2,3,5", "1,4,6"),
  Siblings = c("Reynolds , Albert , Ortega",
               "Cohen , Bert , Montgomery",
               "Pierce", "Colon , Michelle , Ballard",
               "Snyder , Joann ,", "James , Roxanne ,"))

mydf
#    ID   Name     Likes                   Siblings
# 1 A01   Boyd 1,2,4,5,6 Reynolds , Albert , Ortega
# 2 A02  Rufus   1,2,5,6  Cohen , Bert , Montgomery
# 3 A03   Dana     1,3,4                     Pierce
# 4 A04 Carole     2,3,6 Colon , Michelle , Ballard
# 5 B01 Ramona   1,2,3,5           Snyder , Joann ,
# 6 B02 Kelley     1,4,6          James , Roxanne ,

There are two typical R approaches to split data like this up. First is strsplit, which will create a list of your split data:

X <- lapply(mydf, function(y) {
  strsplit(as.character(y), ",")
})

X
# $Likes
# $Likes[[1]]
# [1] "1" "2" "4" "5" "6"
#
# $Likes[[2]]
# [1] "1" "2" "5" "6"
#
# $Likes[[3]]
# [1] "1" "3" "4"
#
# $Likes[[4]]
# [1] "2" "3" "6"
#
# $Likes[[5]]
# [1] "1" "2" "3" "5"
#
# $Likes[[6]]
# [1] "1" "4" "6"
#
#
# $Siblings
# $Siblings[[1]]
# [1] "Reynolds " " Albert "  " Ortega"  
#
# $Siblings[[2]]
# [1] "Cohen "      " Bert "      " Montgomery"
#
# $Siblings[[3]]
# [1] "Pierce"
#
# $Siblings[[4]]
# [1] "Colon "     " Michelle " " Ballard"  
#
# $Siblings[[5]]
# [1] "Snyder " " Joann "
#
# $Siblings[[6]]
# [1] "James "    " Roxanne "

If required, you can also make those lists "columns" in your original data.frame, but that can also create some problems later on, for instance if you try to use write.csv.

The second common approach is to use read.csv again on the concatenated columns of data.

Y <- lapply(mydf, function(y) {
  read.csv(text = as.character(y), header = FALSE,
           strip.white = TRUE, blank.lines.skip = FALSE,
           fill = TRUE, stringsAsFactors = FALSE)
})

Y
# $Likes
#   V1 V2 V3 V4 V5
# 1  1  2  4  5  6
# 2  1  2  5  6 NA
# 3  1  3  4 NA NA
# 4  2  3  6 NA NA
# 5  1  2  3  5 NA
# 6  1  4  6 NA NA
#
# $Siblings
#         V1       V2         V3
# 1 Reynolds   Albert     Ortega
# 2    Cohen     Bert Montgomery
# 3   Pierce
# 4    Colon Michelle    Ballard
# 5   Snyder    Joann
# 6    James  Roxanne

do.call(cbind, Y)
#   Likes.V1 Likes.V2 Likes.V3 Likes.V4 Likes.V5 Siblings.V1
# 1        1        2        4        5        6    Reynolds
# 2        1        2        5        6       NA       Cohen
# 3        1        3        4       NA       NA      Pierce
# 4        2        3        6       NA       NA       Colon
# 5        1        2        3        5       NA      Snyder
# 6        1        4        6       NA       NA       James
#   Siblings.V2 Siblings.V3
# 1      Albert      Ortega
# 2        Bert  Montgomery
# 3
# 4    Michelle     Ballard
# 5       Joann
# 6     Roxanne

From the above, it's easy to use stack to get the data into a "long" format, which would make it easy to do things like tabulate your data and do further analysis. Here are a few rows of both columns in this stacked form.

lapply(seq_along(Y), function(Z) head(stack(Y[[Z]])))
# [[1]]
#   values ind
# 1      1  V1
# 2      1  V1
# 3      1  V1
# 4      2  V1
# 5      1  V1
# 6      1  V1
#
# [[2]]
#    values ind
# 1 Reynolds  V1
# 2    Cohen  V1
# 3   Pierce  V1
# 4    Colon  V1
# 5   Snyder  V1
# 6    James  V1

For the past year and a half, I've been hanging around Stack Overflow answering questions mostly under the R tag. One of the categories of questions that I typically like to answer are questions involving "reshaping" data into formats that might facilitate further analysis.

I've finally decided to take some of my experience from there to put together a few functions that I think might be generally useful into a package named "splitstackshape". Below, I will demonstrate a few of the key functions.

Installation

The package is on CRAN, so you can install it with:

install.packages("splitstackshape")

Alternatively, you can check the development page on Github as new features may periodically be added there that are not present in the CRAN version.

concat.split

The main unifying function in this group of functions is concat.split.multiple which allows you to split multiple columns of concatenated data and either keep the output in a wide format or convert it into a long format. Here's an example.

CSW <- concat.split.multiple(
  mydf, split.cols = c("Likes", "Siblings"),
  seps = ",", direction = "wide")

CSW
#    ID   Name Likes_1 Likes_2 Likes_3 Likes_4 Likes_5 Siblings_1
# 1 A01   Boyd       1       2       4       5       6   Reynolds
# 2 A02  Rufus       1       2       5       6      NA      Cohen
# 3 A03   Dana       1       3       4      NA      NA     Pierce
# 4 A04 Carole       2       3       6      NA      NA      Colon
# 5 B01 Ramona       1       2       3       5      NA     Snyder
# 6 B02 Kelley       1       4       6      NA      NA      James
#   Siblings_2 Siblings_3
# 1     Albert     Ortega
# 2       Bert Montgomery
# 3       <NA>       <NA>
# 4   Michelle    Ballard
# 5      Joann       <NA>
# 6    Roxanne       <NA>


CSL <- concat.split.multiple(
  mydf, split.cols = c("Likes", "Siblings"),
  seps = ",", direction = "long")

CSL
#               ID   Name time Likes   Siblings
# A01.Boyd.1   A01   Boyd    1     1   Reynolds
# A02.Rufus.1  A02  Rufus    1     1      Cohen
# A03.Dana.1   A03   Dana    1     1     Pierce
# A04.Carole.1 A04 Carole    1     2      Colon
# B01.Ramona.1 B01 Ramona    1     1     Snyder
# B02.Kelley.1 B02 Kelley    1     1      James
# A01.Boyd.2   A01   Boyd    2     2     Albert
# A02.Rufus.2  A02  Rufus    2     2       Bert
# A03.Dana.2   A03   Dana    2     3       <NA>
# A04.Carole.2 A04 Carole    2     3   Michelle
# B01.Ramona.2 B01 Ramona    2     2      Joann
# B02.Kelley.2 B02 Kelley    2     4    Roxanne
# A01.Boyd.3   A01   Boyd    3     4     Ortega
# A02.Rufus.3  A02  Rufus    3     5 Montgomery
# A03.Dana.3   A03   Dana    3     4       <NA>
# A04.Carole.3 A04 Carole    3     6    Ballard
# B01.Ramona.3 B01 Ramona    3     3       <NA>
# B02.Kelley.3 B02 Kelley    3     6       <NA>
# A01.Boyd.4   A01   Boyd    4     5       <NA>
# A02.Rufus.4  A02  Rufus    4     6       <NA>
# A03.Dana.4   A03   Dana    4    NA       <NA>
# A04.Carole.4 A04 Carole    4    NA       <NA>
# B01.Ramona.4 B01 Ramona    4     5       <NA>
# B02.Kelley.4 B02 Kelley    4    NA       <NA>
# A01.Boyd.5   A01   Boyd    5     6       <NA>
# A02.Rufus.5  A02  Rufus    5    NA       <NA>
# A03.Dana.5   A03   Dana    5    NA       <NA>
# A04.Carole.5 A04 Carole    5    NA       <NA>
# B01.Ramona.5 B01 Ramona    5    NA       <NA>
# B02.Kelley.5 B02 Kelley    5    NA       <NA>

As I mentioned, this is a "family" of functions. For instance, when splitting numeric data, you might want to create a "binary" matrix of your concatenated data. Here, a value of "1" is mapped to the first column, a value of "2" to the second, and so on.

concat.split.expanded(mydf, split.col = "Likes", sep = ",",
                      mode = "binary", drop = FALSE, fill = 0)
#    ID   Name     Likes                   Siblings Likes_1 Likes_2
# 1 A01   Boyd 1,2,4,5,6 Reynolds , Albert , Ortega       1       1
# 2 A02  Rufus   1,2,5,6  Cohen , Bert , Montgomery       1       1
# 3 A03   Dana     1,3,4                     Pierce       1       0
# 4 A04 Carole     2,3,6 Colon , Michelle , Ballard       0       1
# 5 B01 Ramona   1,2,3,5           Snyder , Joann ,       1       1
# 6 B02 Kelley     1,4,6          James , Roxanne ,       1       0
#   Likes_3 Likes_4 Likes_5 Likes_6
# 1       0       1       1       1
# 2       0       0       1       1
# 3       1       1       0       0
# 4       1       0       0       1
# 5       1       0       1       0
# 6       0       1       0       1

The mode = "value" setting is similar, but instead of recoding to "1", the original values are retained.

concat.split.expanded(mydf, split.col = "Likes", sep = ",",
                      mode = "value", drop = TRUE, fill = NA)
#    ID   Name                   Siblings Likes_1 Likes_2
# 1 A01   Boyd Reynolds , Albert , Ortega       1       2
# 2 A02  Rufus  Cohen , Bert , Montgomery       1       2
# 3 A03   Dana                     Pierce       1      NA
# 4 A04 Carole Colon , Michelle , Ballard      NA       2
# 5 B01 Ramona           Snyder , Joann ,       1       2
# 6 B02 Kelley          James , Roxanne ,       1      NA
#   Likes_3 Likes_4 Likes_5 Likes_6
# 1      NA       4       5       6
# 2      NA      NA       5       6
# 3       3       4      NA      NA
# 4       3      NA      NA       6
# 5       3      NA       5      NA
# 6      NA       4      NA       6

Stacked and Reshape

The Stacked and Reshape functions are designed for convenient (guess what) stacking and reshaping of wide datasets.

Consider, for example the CSW object I created earlier when demonstrating the default direction = "wide" setting in concat.split.multiple. If we wanted to convert this into a "semi-long" format (that is, keep separate columns for "Likes" and "Siblings"), we would generally try to use reshape() from base R. The problem is that these data are "unbalanced". There are only three "times" for "Siblings", while there are five times for "Likes". (Ignore, for the time being, that such a dataset would not make much sense since I'm stacking unrelated categorical variables. A better example would be where a certain measured variable in a longitudinal study were introduced in a different wave, meaning it was not present in one or more earlier waves.)

Here's what happens with reshape() from base R:

reshape(CSW, direction = "long", idvar = 1:2,
        varying = 3:ncol(CSW), sep = "_")
# Error in reshapeLong(data, idvar = idvar, timevar = timevar, varying = varying,  :
#   'varying' arguments must be the same length

Here is the Reshape approach from "splitstackshape":

Reshape(CSW, id.vars = c("ID", "Name"),
        var.stubs = c("Likes", "Siblings"),
        sep = "_")
#               ID   Name time Likes   Siblings
# A01.Boyd.1   A01   Boyd    1     1   Reynolds
# A02.Rufus.1  A02  Rufus    1     1      Cohen
# A03.Dana.1   A03   Dana    1     1     Pierce
# A04.Carole.1 A04 Carole    1     2      Colon
# B01.Ramona.1 B01 Ramona    1     1     Snyder
# B02.Kelley.1 B02 Kelley    1     1      James
# A01.Boyd.2   A01   Boyd    2     2     Albert
# A02.Rufus.2  A02  Rufus    2     2       Bert
# A03.Dana.2   A03   Dana    2     3       <NA>
# A04.Carole.2 A04 Carole    2     3   Michelle
# B01.Ramona.2 B01 Ramona    2     2      Joann
# B02.Kelley.2 B02 Kelley    2     4    Roxanne
# A01.Boyd.3   A01   Boyd    3     4     Ortega
# A02.Rufus.3  A02  Rufus    3     5 Montgomery
# A03.Dana.3   A03   Dana    3     4       <NA>
# A04.Carole.3 A04 Carole    3     6    Ballard
# B01.Ramona.3 B01 Ramona    3     3       <NA>
# B02.Kelley.3 B02 Kelley    3     6       <NA>
# A01.Boyd.4   A01   Boyd    4     5       <NA>
# A02.Rufus.4  A02  Rufus    4     6       <NA>
# A03.Dana.4   A03   Dana    4    NA       <NA>
# A04.Carole.4 A04 Carole    4    NA       <NA>
# B01.Ramona.4 B01 Ramona    4     5       <NA>
# B02.Kelley.4 B02 Kelley    4    NA       <NA>
# A01.Boyd.5   A01   Boyd    5     6       <NA>
# A02.Rufus.5  A02  Rufus    5    NA       <NA>
# A03.Dana.5   A03   Dana    5    NA       <NA>
# A04.Carole.5 A04 Carole    5    NA       <NA>
# B01.Ramona.5 B01 Ramona    5    NA       <NA>
# B02.Kelley.5 B02 Kelley    5    NA       <NA>

You'll notice that this is the same as the output of CSL, since the direction = "long" argument in concat.split.multiple simply calls the Reshape function after splitting the concatenated data.

Here's Stacked. The syntax is similar to Reshape. First, I'll stack just one column, then I'll stack both, to demonstrate the different features.

Stacked(CSW, id.vars = c("ID", "Name"),
        var.stubs = "Likes", sep = "_")
# $Likes
#     ID   Name .time_1 Likes Siblings_1 Siblings_2 Siblings_3
#  1: A01   Boyd       1     1   Reynolds     Albert     Ortega
#  2: A01   Boyd       2     2   Reynolds     Albert     Ortega
#  3: A01   Boyd       3     4   Reynolds     Albert     Ortega
#  4: A01   Boyd       4     5   Reynolds     Albert     Ortega
#  5: A01   Boyd       5     6   Reynolds     Albert     Ortega
#  6: A02  Rufus       1     1      Cohen       Bert Montgomery
#  7: A02  Rufus       2     2      Cohen       Bert Montgomery
#  8: A02  Rufus       3     5      Cohen       Bert Montgomery
#  9: A02  Rufus       4     6      Cohen       Bert Montgomery
# 10: A02  Rufus       5    NA      Cohen       Bert Montgomery
# 11: A03   Dana       1     1     Pierce         NA         NA
# 12: A03   Dana       2     3     Pierce         NA         NA
# 13: A03   Dana       3     4     Pierce         NA         NA
# 14: A03   Dana       4    NA     Pierce         NA         NA
# 15: A03   Dana       5    NA     Pierce         NA         NA
# 16: A04 Carole       1     2      Colon   Michelle    Ballard
# 17: A04 Carole       2     3      Colon   Michelle    Ballard
# 18: A04 Carole       3     6      Colon   Michelle    Ballard
# 19: A04 Carole       4    NA      Colon   Michelle    Ballard
# 20: A04 Carole       5    NA      Colon   Michelle    Ballard
# 21: B01 Ramona       1     1     Snyder      Joann         NA
# 22: B01 Ramona       2     2     Snyder      Joann         NA
# 23: B01 Ramona       3     3     Snyder      Joann         NA
# 24: B01 Ramona       4     5     Snyder      Joann         NA
# 25: B01 Ramona       5    NA     Snyder      Joann         NA
# 26: B02 Kelley       1     1      James    Roxanne         NA
# 27: B02 Kelley       2     4      James    Roxanne         NA
# 28: B02 Kelley       3     6      James    Roxanne         NA
# 29: B02 Kelley       4    NA      James    Roxanne         NA
# 30: B02 Kelley       5    NA      James    Roxanne         NA
#      ID   Name .time_1 Likes Siblings_1 Siblings_2 Siblings_3

Stacked(CSW, id.vars = c("ID", "Name"),
        var.stubs = c("Likes", "Siblings"), sep = "_")
# $Likes
#      ID   Name .time_1 Likes
#  1: A01   Boyd       1     1
#  2: A01   Boyd       2     2
#  3: A01   Boyd       3     4
#  4: A01   Boyd       4     5
#  5: A01   Boyd       5     6
#  6: A02  Rufus       1     1
#  7: A02  Rufus       2     2
#  8: A02  Rufus       3     5
#  9: A02  Rufus       4     6
# 10: A02  Rufus       5    NA
# 11: A03   Dana       1     1
# 12: A03   Dana       2     3
# 13: A03   Dana       3     4
# 14: A03   Dana       4    NA
# 15: A03   Dana       5    NA
# 16: A04 Carole       1     2
# 17: A04 Carole       2     3
# 18: A04 Carole       3     6
# 19: A04 Carole       4    NA
# 20: A04 Carole       5    NA
# 21: B01 Ramona       1     1
# 22: B01 Ramona       2     2
# 23: B01 Ramona       3     3
# 24: B01 Ramona       4     5
# 25: B01 Ramona       5    NA
# 26: B02 Kelley       1     1
# 27: B02 Kelley       2     4
# 28: B02 Kelley       3     6
# 29: B02 Kelley       4    NA
# 30: B02 Kelley       5    NA
#      ID   Name .time_1 Likes
#
# $Siblings
#      ID   Name .time_1   Siblings
#  1: A01   Boyd       1   Reynolds
#  2: A01   Boyd       2     Albert
#  3: A01   Boyd       3     Ortega
#  4: A02  Rufus       1      Cohen
#  5: A02  Rufus       2       Bert
#  6: A02  Rufus       3 Montgomery
#  7: A03   Dana       1     Pierce
#  8: A03   Dana       2         NA
#  9: A03   Dana       3         NA
# 10: A04 Carole       1      Colon
# 11: A04 Carole       2   Michelle
# 12: A04 Carole       3    Ballard
# 13: B01 Ramona       1     Snyder
# 14: B01 Ramona       2      Joann
# 15: B01 Ramona       3         NA
# 16: B02 Kelley       1      James
# 17: B02 Kelley       2    Roxanne
# 18: B02 Kelley       3         NA

Anyway, test out the package. Hopefully there aren't too many bugs, and if there are, do be sure to report them.


Update (28 August 2013)

Version 1.2.0 of "splitstackshape" has been accepted at CRAN. The main change is in the Stacked function (and, by extension, merged.stack). It now uses unlist(..., use.names = FALSE) which is much faster than the previous implementation, which used stack.

comments powered by Disqus