splitstackshape V1.4.0 for R

More than a year after splitstackshape V1.2.0, I’ve finally gotten around to making some major updates and submitting the package to CRAN.

So, if you have messed up datasets filled with concatenated cells of data, and you need to split that data up and reorganize it for later analysis, install and load the latest version (V1.4.0) of splitstackshape with:

install.packages("splitstackshape")
library(splitstackshape)
packageVersion("splitstackshape")
## [1] '1.4.0'

Read on for details!

What's New?

cSplit

cSplit becomes one of the core functions for data processing. It splits the data into either a “wide” format or a “long” format. In the wide format, cells with an unbalanced set of delimiters get expanded out to fill a common number of columns.

cSplit is vectorized over splitCols and sep, letting you split multiple columns, each with its own delimiter, in one statement.

dat <- data.frame(id = 1:3, V1 = c("a, b, c", "d, e, f, g", "h, i"),
                  V2 = c("1|2", "3|4|5|6", "7|8"))
dat
##   id         V1      V2
## 1  1    a, b, c     1|2
## 2  2 d, e, f, g 3|4|5|6
## 3  3       h, i     7|8

cSplit(dat, splitCols = c("V1", "V2"), sep = c(",", "|"))
##    id V1_1 V1_2 V1_3 V1_4 V2_1 V2_2 V2_3 V2_4
## 1:  1    a    b    c   NA    1    2   NA   NA
## 2:  2    d    e    f    g    3    4    5    6
## 3:  3    h    i   NA   NA    7    8   NA   NA

## Notice that other columns get recycled
cSplit(dat, "V1", sep = ",", direction = "long")
##    id V1      V2
## 1:  1  a     1|2
## 2:  1  b     1|2
## 3:  1  c     1|2
## 4:  2  d 3|4|5|6
## 5:  2  e 3|4|5|6
## 6:  2  f 3|4|5|6
## 7:  2  g 3|4|5|6
## 8:  3  h     7|8
## 9:  3  i     7|8
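
To make the "fill a common number of columns" idea concrete, here is a minimal base R sketch of the padding step for a single column. It is only an illustration of the behavior shown above, not the code cSplit uses internally; the object names (x, parts, padded) are made up for the example.

```r
# Illustration only: base R, not the package's internal code.
x <- c("a, b, c", "d, e, f, g", "h, i")

# Split each cell on the delimiter and trim surrounding whitespace.
parts <- lapply(strsplit(x, ",", fixed = TRUE), trimws)

# Pad every set of pieces with NA up to the length of the longest one.
width <- max(lengths(parts))
padded <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, width - length(p))),
                   character(width)))
colnames(padded) <- paste0("V1_", seq_len(width))
padded
```

The result is a character matrix with one row per input cell and one column per piece, with NA wherever a cell had fewer pieces than the longest one.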

cSplit_f

cSplit_f has been added. The _f stands for both “fixed” and “fread”. Because the function depends on fread, it only works if every cell in a column being split has the same number of delimiters; fread does not work with unbalanced/ragged data. In my tests, it’s much faster than cSplit, so it’s worth using when you know that the data are balanced.

As with cSplit, cSplit_f is vectorized over splitCols and sep, letting you split multiple columns in one statement.

dat <- data.frame(id = 1:3, V1 = c("a, b, c", "d, e, f", "g, h, i"),
                  V2 = c("1|2|3", "4|5|6", "7|8|9"))
dat
##   id      V1    V2
## 1  1 a, b, c 1|2|3
## 2  2 d, e, f 4|5|6
## 3  3 g, h, i 7|8|9

cSplit_f(dat, splitCols = c("V1", "V2"), sep = c(",", "|"))
##    id V1_1 V1_2 V1_3 V2_1 V2_2 V2_3
## 1:  1    a    b    c    1    2    3
## 2:  2    d    e    f    4    5    6
## 3:  3    g    h    i    7    8    9
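
If you are not sure whether a column is balanced, a quick check before reaching for cSplit_f might look like the sketch below. The helper is_balanced is hypothetical (it is not part of splitstackshape) and simply counts the delimiters in each cell with base R.

```r
# Hypothetical helper (not part of splitstackshape):
# TRUE when every cell contains the same number of `sep` delimiters.
is_balanced <- function(x, sep) {
  x <- as.character(x)  # also handles factor columns
  counts <- lengths(regmatches(x, gregexpr(sep, x, fixed = TRUE)))
  length(unique(counts)) <= 1
}

balanced <- c("1|2|3", "4|5|6", "7|8|9")
ragged   <- c("1|2", "3|4|5|6", "7|8")

is_balanced(balanced, "|")  # every cell has two delimiters
is_balanced(ragged, "|")    # delimiter counts differ between cells
```

A check like this could be used to choose between cSplit_f for balanced columns and cSplit for ragged ones.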

stratified

stratified is great for taking quick stratified random samples from a data.frame or a data.table. The sample size can either be fixed or proportional to the group size.

set.seed(1)
dat <- data.frame(ID = 1:20,
              A = sample(c("AA", "BB"), 20, replace = TRUE),
              B = rnorm(20), C = abs(round(rnorm(20), digits=1)),
              D = sample(c("CA", "NY", "TX"), 20, replace = TRUE),
              E = sample(c("M", "F"), 20, replace = TRUE))
dat
##    ID  A           B   C  D E
## 1   1 AA  1.51178117 1.4 NY F
## 2   2 AA  0.38984324 0.1 NY M
## 3   3 BB -0.62124058 0.4 CA M
## 4   4 BB -2.21469989 0.1 TX M
## 5   5 AA  1.12493092 1.4 NY F
## 6   6 BB -0.04493361 0.4 CA M
## 7   7 BB -0.01619026 0.4 CA F
## 8   8 BB  0.94383621 0.1 NY M
## 9   9 BB  0.82122120 1.1 TX M
## 10 10 AA  0.59390132 0.8 NY F
## 11 11 AA  0.91897737 0.2 TX F
## 12 12 AA  0.78213630 0.3 TX M
## 13 13 BB  0.07456498 0.7 NY M
## 14 14 AA -1.98935170 0.6 NY F
## 15 15 BB  0.61982575 0.7 CA F
## 16 16 AA -0.05612874 0.7 CA F
## 17 17 BB -0.15579551 0.4 TX F
## 18 18 BB -1.47075238 0.8 CA F
## 19 19 AA -0.47815006 0.1 NY F
## 20 20 BB  0.41794156 0.9 NY F

stratified(dat, "A", 2)         ## Two from each group of A
##    ID  A           B   C  D E
## 1: 14 AA -1.98935170 0.6 NY F
## 2: 11 AA  0.91897737 0.2 TX F
## 3:  6 BB -0.04493361 0.4 CA M
## 4: 20 BB  0.41794156 0.9 NY F

stratified(dat, "E", .3)        ## 30% sample from each group in column E
##    ID  A          B   C  D E
## 1: 17 BB -0.1557955 0.4 TX F
## 2: 11 AA  0.9189774 0.2 TX F
## 3:  5 AA  1.1249309 1.4 NY F
## 4: 15 BB  0.6198257 0.7 CA F
## 5:  2 AA  0.3898432 0.1 NY M
## 6: 12 AA  0.7821363 0.3 TX M

# Stratified by column D but only use rows where column E == "F"
stratified(dat, "D", .4, select = list(E = "F"))
##    ID  A           B   C  D E
## 1: 16 AA -0.05612874 0.7 CA F
## 2: 15 BB  0.61982575 0.7 CA F
## 3:  5 AA  1.12493092 1.4 NY F
## 4: 10 AA  0.59390132 0.8 NY F
## 5: 17 BB -0.15579551 0.4 TX F

What else is new?

  • cSplit has replaced splitstackshape:::read.concat (but the read.concat function is still included).
  • Reshape has been made faster (more like Stacked and merged.stack), and for the most part, the id.vars should now be optional in all of these functions.
  • getanID and expandRows have been added as utility functions.
  • concat.split.list and concat.split.expanded can now be called with cSplit_l and cSplit_e instead.
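
For a feel of what the two utility functions are for, here is a base R sketch of the underlying ideas: repeating rows according to a count, and numbering rows within a repeated id. This is only an illustration under my own assumptions; the column names count and .id are chosen for the example, and the code does not call expandRows or getanID themselves.

```r
# Illustration in base R; the column names `count` and `.id` are assumptions.
dat <- data.frame(id = c("a", "b", "a"), count = c(2, 1, 3))

# Row expansion: repeat each row as many times as its `count` value.
expanded <- dat[rep(seq_len(nrow(dat)), dat$count), "id", drop = FALSE]
rownames(expanded) <- NULL

# Secondary id: number the rows within each repeated `id` value.
expanded$.id <- ave(seq_len(nrow(expanded)), expanded$id, FUN = seq_along)
expanded
```

The combination of id and .id uniquely identifies each row, which is what reshaping functions generally need.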

I’m expecting that there will be some rough edges, but hopefully nothing has been seriously broken! If you find anything, send your bug reports over to GitHub.
