Stratified random sampling in R from a data frame

After a bit more work, there’s a new stratified random sampling function. This one lets you sample from a data frame and returns all the variables for each sampled row as a data frame that you can continue working on as usual.

Get the function at http://news.mrdwab.com/stratified. Usage notes are in the comments at the head of the function.

Here’s the function:

stratified = function(df, id, group, size, seed="NULL", ...) {
  #  USE: * Specify your data frame, ID variable (as column number), and
  #         grouping variable (as column number) as the first three arguments.
  #       * Decide on your sample size. For a sample proportional to the
  #         population, enter "size" as a decimal. For an equal number of
  #         samples from each group, enter "size" as a whole number.
  #       * Decide whether you want to use a seed. If not, leave blank
  #         or type "NULL" (with quotes).
  #
  #  Example 1: To sample 10% of each group from a data frame named "z", where
  #             the ID variable is the first variable, the grouping variable
  #             is the fourth variable, and the desired seed is "1", use:
  # 
  #                 > stratified(z, 1, 4, .1, 1)
  #
  #  Example 2: To run the same sample as above but without a seed, use:
  # 
  #                 > stratified(z, 1, 4, .1)
  #
  #  Example 3: To sample 5 from each group from a data frame named "z", where
  #             the ID variable is the first variable, the grouping variable
  #             is the third variable, and the desired seed is 2, use:
  #
  #                 > stratified(z, 1, 3, 5, 2)
  #
  #  NOTE: Not tested on datasets with LOTS of groups or with HUGE
  #        differences in group sizes. Probably INCREDIBLY inefficient.
 
  k = unstack(data.frame(as.vector(df[id]), as.vector(df[group])))
  l = length(k)
  results = vector("list", l)
 
  for (i in seq_along(k)) {
    # Re-seed before each group when a seed is given.
    if (seed != "NULL") set.seed(seed)
    N = k[[i]]
    # A "size" below 1 is a proportion of the group; otherwise it is a count.
    n = if (size < 1) round(length(N) * size) else size
    # Sample positions and index into N. Calling sample(N, n) directly would
    # draw from 1:N whenever a group holds a single numeric ID.
    results[[i]] = list(N[sample.int(length(N), n, ...)])
  }
  z = data.frame(c(unlist(results)))
  names(z) = names(df[id])
  w = merge(df, z)
  w[order(w[group]), ]
}
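
For comparison, here is a minimal sketch of the same idea written around split() and row numbers instead of ID values. It is not the function above: the name stratified_sketch is made up for illustration, it takes no ID column, it uses a real NULL for "no seed", and it caps a whole-number "size" at the group size rather than raising an error. Treat those choices as assumptions, not as the recommended design.

```r
# Hypothetical helper for illustration only; not the function from this post.
stratified_sketch <- function(df, group, size, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Split row numbers (not ID values) by the grouping column.
  rows_by_group <- split(seq_len(nrow(df)), df[[group]])
  picked <- lapply(rows_by_group, function(rows) {
    # size < 1 is read as a proportion, otherwise as a count per group.
    # Assumption: a count larger than the group is capped at the group size.
    n <- if (size < 1) round(length(rows) * size) else min(size, length(rows))
    # Index by position so a one-row group is never expanded to 1:rows.
    rows[sample.int(length(rows), n)]
  })
  out <- df[unlist(picked, use.names = FALSE), , drop = FALSE]
  out[order(out[[group]]), , drop = FALSE]
}

# Made-up data: 100 rows, three groups of nearly equal size.
dat_sketch <- data.frame(A = 1:100,
                         E = rep(c("CA", "NY", "TX"), length.out = 100))
res_prop  <- stratified_sketch(dat_sketch, "E", 0.1, seed = 1)  # 10% per group
res_fixed <- stratified_sketch(dat_sketch, "E", 5)              # 5 per group
table(res_prop$E)
table(res_fixed$E)
```

Because the sketch samples row numbers, it needs no merge() step and does not depend on the ID column being unique.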

And here are some examples of the function in action:

> source("http://news.mrdwab.com/stratified")
> # Make up some data
> A = 1:100
> B = sample(c("AA", "BB", "CC", "DD", "EE"), 100, replace=T)
> C = rnorm(100)
> D = abs(round(rnorm(100), digits=1))
> E = sample(c("CA", "NY", "TX"), 100, replace=T)
> dat = data.frame(A, B, C, D, E)
> # view the first few rows
> head(dat)
  A  B           C   D  E
1 1 CC -0.07870439 0.6 NY
2 2 CC -0.65048634 0.3 TX
3 3 EE  1.02703616 1.3 NY
4 4 BB -1.08696775 0.4 TX
5 5 CC  0.56741795 0.2 CA
6 6 AA -0.46448941 0.5 TX
> # Sample 10% from each group from variable B, no seed
> stratified(dat, 1, 2, .1)
    A  B           C   D  E
2   6 AA -0.46448941 0.5 TX
7  71 AA  1.98128479 2.1 CA
5  53 BB  1.00539398 0.7 NY
10 97 BB  0.68252675 1.9 NY
1   1 CC -0.07870439 0.6 NY
4  42 CC -2.00256854 0.3 TX
8  76 DD -0.84151459 0.2 NY
9  95 DD -0.47276142 0.3 CA
11 99 DD  1.05173419 2.1 TX
3  10 EE -0.69079473 1.1 TX
6  57 EE -0.38210921 1.5 CA
> # Sample 10% from each group from variable E, seed of 1
> stratified(dat, 1, 5, .1, 1)
    A  B          C   D  E
4  33 AA  1.6105099 0.5 CA
7  48 AA  0.3128274 0.6 CA
9  62 DD  0.4673061 0.0 CA
10 86 EE  0.4047880 1.6 CA
3  28 AA -1.6815553 0.3 NY
5  36 AA  0.3307508 0.3 NY
8  53 BB  1.0053940 0.7 NY
1  21 DD  0.5229282 1.2 TX
2  27 BB  0.8678977 0.7 TX
6  44 DD -0.5790353 0.9 TX
> # You can also be verbose if it helps you remember what you're doing
> stratified(df=dat, id=1, group=5, size=.1, seed=1)
    A  B          C   D  E
4  33 AA  1.6105099 0.5 CA
7  48 AA  0.3128274 0.6 CA
9  62 DD  0.4673061 0.0 CA
10 86 EE  0.4047880 1.6 CA
3  28 AA -1.6815553 0.3 NY
5  36 AA  0.3307508 0.3 NY
8  53 BB  1.0053940 0.7 NY
1  21 DD  0.5229282 1.2 TX
2  27 BB  0.8678977 0.7 TX
6  44 DD -0.5790353 0.9 TX

  • http://news.mrdwab.com mrdwab

    Of course, standard tricks can also be used. For instance, if you want to sample only from groups “CA” and “NY” and drop “TX”, you can use the following:

    > stratified(dat[dat$E!="TX",], 1, 5, .1, 1)
       A  B           C   D  E
    2 36 DD  0.33295037 0.2 CA
    4 44 CC  0.70021365 0.8 CA
    1 23 DD  0.61072635 0.5 NY
    3 37 DD  1.06309984 1.5 NY
    5 52 EE  0.04211587 1.7 NY
    6 91 BB -1.91435943 0.7 NY

  • B Li

    There seems to be a bug for large data sets with lots of groups. In my case, there are 24000 observations and 2089 groups. I used this function to sample one observation from each group. There are no errors and the total sample size is always 2089, but there are always some groups that end up with 2 sampled observations and some with 0.
    The number of observations is very different from group to group in my case. Maybe that’s what you mean in the NOTE of the function: “NOTE: Not tested on datasets with LOTS of groups or with HUGE differences in group sizes. Probably INCREDIBLY inefficient.”?

    • B Li

      I think I’ve found the reason for that problem. The key thing is the sample() function used in this stratified() function.  When there is only one numeric value (say, 10) to be sampled from, the sample() function samples by default from 1 to 10.
      Sorry for the misleading message I posted earlier.

      • http://news.mrdwab.com mrdwab

        Hi. I hope you have found the function useful.

        It sounds like you’ve solved the problem you were having, but I’m not sure I understand your follow-up response here. Do you have an example you can share?
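
The sample() behaviour described in this thread can be reproduced with a small sketch. The variable names are made up for illustration; the point is only that base R's sample() treats a single number x >= 1 as shorthand for 1:x, which matters when a group contains exactly one numeric ID.

```r
# A group that happens to contain a single ID whose value is 10.
one_id <- 10

# sample() sees one number >= 1 and samples from 1:10 instead,
# so the result may be an ID that belongs to a different group.
draw <- sample(one_id, 1)

# Sampling positions and indexing avoids the shortcut:
# the only possible result is the ID itself.
safe <- one_id[sample.int(length(one_id), 1)]

# With two or more IDs the two approaches sample from the same set.
several <- c(10, 20, 30)
draw_several <- sample(several, 1)
```

This would explain why the total sample size stayed correct while some groups received two observations and others none: a one-ID group can "pick" a row that belongs elsewhere.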