Stratified Random Sampling in R–A Function in Progress

IMPORTANT: This is here mostly to remind me of how I solved my problem. You should read Stratified random sampling in R from a data frame if you really want to use this function.

I know that sampling is quite complex, and I will admit that I know very little about its complexities. Fortunately, software like R lets you draw simple random samples pretty easily, either either with or without replacement. Unfortunately, I could not find any feature to allow me to do simple stratified random sampling, at least not with the features I was looking for. Fortunately again, with a little bit of experimenting, it can be pretty easy to learn how to write functions in R when a direct solution does not present itself.

This post shares my initial “work-in-progress” on writing an R function for stratified sampling.

The problem…

Here’s the minimum that I was hoping for:

  • I wanted to be able to draw both a proportional sample (which is more common, for it allows you to make generalizations about the population as a whole) as well as a fixed-size sample (which less common, but it is useful for making comparisons across groups).
  • I often use a seed when sampling, so I wanted that to be a part of the function.
  • I wanted the output to be the same as if I were to sample from each group individually.
  • I was hoping that my output could be stored as a new object that I could then reuse (either a list or a data frame, preferably the latter).

My initial searches directed me to Yihui Xie’s page on stratified sampling using tapply(). However, this option did not satisfy my needs. As far as I could figure, it only allowed me to take a fixed sample size. Also, I wasn’t totally satisfied with the output.

Consider the following. In Yihui Xie’s example, there is a difference between the results one would get if they sampled from each group separately, but using the same seed.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
> dat = data.frame(x = 1:15, stratum = gl(3, 5))
> dat
    x stratum
1   1       1
2   2       1
3   3       1
4   4       1
5   5       1
6   6       2
7   7       2
8   8       2
9   9       2
10 10       2
11 11       3
12 12       3
13 13       3
14 14       3
15 15       3
> set.seed(1); tapply(dat$x, dat$stratum, sample, size = 3)
$`1`
[1] 2 5 4
 
$`2`
[1] 10  6  8
 
$`3`
[1] 15 13 12
 
> # Compare with what we get when we sample individually:
> set.seed(1); sample(1:5, 3)
[1] 2 5 4
> set.seed(1); sample(6:10, 3)
[1]  7 10  9
> set.seed(1); sample(11:15, 3)
[1] 12 15 14

I’m sure there’s some sampling theory that explains this, or at least something about how R treats its data, but at the moment, that’s beyond my humble level of expertise.

Stratified sampling, Mr. DWAB style…

The solution I arrived at is to use “unstack()” and a few conditional loops to take the samples.

And, without more rambling, here’s what I came up with.

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
stratified = function(df, size, seed, dframe=FALSE, ...) {
         # USE: Start with a data frame with your cases in one column and your
         # groups in another column. Decide on if you want to use a seed or not. 
         # If not, seed should be "NO" (with quotes). Decide on if you want your
         # output as a data frame or not; by default, dframe is set to "FALSE".
         # To take a sample proportional to the population size in each group,
         # enter "size" as a decimal. Otherwise, enter size as a whole number.
         #
         # Example 1a: To sample 10% of each group from a data frame named "z"
         # and using a seed of "1", use: > stratified(z, .1, 1)
         # Example 1b: To run the same sample as above but display the result as
         # a data frame, use: > stratified(z, .1, 1, T)
         #
         # Example 2: To sample 10% of each group from a data frame named "z"
         # and using no seed, use: > stratified(z, .1, "NO")
         #
         # Example 3: To sample 5 from each group from a data frame named "z"
         # and using a seed of 30, use: > stratified(z, 5, 30)
         #
         # NOTE: Not recommended for datasets with LOTS of groups or with HUGE
         # differences in group sizes.
  k = unstack(df)
  if (dframe == FALSE) {
    if (seed == "NO" & size < 1) {
      for (i in 1:length(k)) {
        N = k[[i]]
        n = round(length(N)*size)
        pre = structure(list("Group" = names(k[i]), "Population Size" =
                             length(k[[i]]), "Sample Size" = n, Seed = seed,
                             Sample = sample(N, n, ...)), class = "power.htest")
        print(pre)
      }
    } else if (seed == "NO" & size >= 1) {
      for (i in 1:length(k)) {
        N = k[[i]]
        pre = structure(list("Group" = names(k[i]), "Population Size" =
                             length(k[[i]]), "Sample Size" = size, Seed = seed,
                             Sample = sample(N, size, ...)),
                             class = "power.htest")
        print(pre)
      }
    } else if (size < 1) {
      for (i in 1:length(k)) {
        set.seed(seed)
        N = k[[i]]
        n = round(length(N)*size)
        pre = structure(list("Group" = names(k[i]), "Population Size" =
                             length(k[[i]]), "Sample Size" = n, Seed = seed,
                             Sample = sample(N, n, ...)), class = "power.htest")
        print(pre)
      }
    } else if (size >= 1) {
      for (i in 1:length(k)) {
        set.seed(seed)
        N = k[[i]]
        pre = structure(list("Group" = names(k[i]), "Population Size" =
                             length(k[[i]]), "Sample Size" = size, Seed = seed,
                             Sample = sample(N, size, ...)),
                             class = "power.htest")
        print(pre)
      }
    }
  } else {
    if (seed == "NO" & size < 1) {
      for (i in 1:length(k)) {
        N = k[[i]]
        n = round(length(N)*size)
        res = data.frame(names(k[i]), sample(N, n, ...))
        names(res) = c("Group", "Samples")
        print(res)
      }
    } else if (seed == "NO" & size >= 1) {
      for (i in 1:length(k)) {
        N = k[[i]]
        res = data.frame(names(k[i]), sample(N, size, ...))
        names(res) = c("Group", "Samples")
        print(res)
      }
    } else if (size < 1) {
      for (i in 1:length(k)) {
        set.seed(seed)
        N = k[[i]]
        n = round(length(N)*size)
        res = data.frame(names(k[i]), sample(N, n, ...))
        names(res) = c("Group", "Samples")
        print(res)
      }
    } else if (size >= 1) {
      for (i in 1:length(k)) {
        set.seed(seed)
        N = k[[i]]
        res = data.frame(names(k[i]), sample(N, size, ...))
        names(res) = c("Group", "Samples")
        print(res)
      }
    }
  }
}

You can load the function by typing:

?View Code RSPLUS
1
> source("http://news.mrdwab.com/stratified-beta")

And now, to test it…

Let’s generate some dummy data and see what we can come up with. The function takes the following arguments (in the following order):

  • df: The source data frame, with the first column being the IDs and the second column being the groups.
  • size: The sample size you want, either as a percentage (for proportional sampling–expressed as a decimal) or as a whole number.
  • seed: The seed you want to use. If you don’t want to use a seed, enter “NO”.
  • dframe: What format you want the output in, either a list or a data frame. Defaults to a list (dframe=FALSE), which is better at the moment since the data frame option is not working the way I expect it to yet.
?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
> # Generate some data
> a = 1:100
> set.seed(123)
> b = sample(c("a", "b", "c", "d"), 100, replace = T)
> z = data.frame(a, b)
> # Check how big each group is
> table(z$b)
 
 a  b  c  d
26 27 20 27
> # Make sure the function is loaded before you continue!
> # source("http://news.mrdwab.com/stratified-beta")
> # Take a 15% sample and use a seed of 1
> stratified(z, .15, 1)
 
 
 
          Group = a
Population Size = 26
    Sample Size = 4
           Seed = 1
         Sample = 38, 45, 54, 81
 
 
 
 
          Group = b
Population Size = 27
    Sample Size = 4
           Seed = 1
         Sample = 39, 43, 60, 79
 
 
 
 
          Group = c
Population Size = 20
    Sample Size = 3
           Seed = 1
         Sample = 23, 26, 33
 
 
 
 
          Group = d
Population Size = 27
    Sample Size = 4
           Seed = 1
         Sample = 21, 31, 53, 71
 
> # Take a sample of 5 from each group and use a seed of 1
> stratified(z, 5, 1)
 
 
 
          Group = a
Population Size = 26
    Sample Size = 5
           Seed = 1
         Sample = 38, 45, 54, 81, 30
 
 
 
 
          Group = b
Population Size = 27
    Sample Size = 5
           Seed = 1
         Sample = 39, 43, 60, 79, 19
 
 
 
 
          Group = c
Population Size = 20
    Sample Size = 5
           Seed = 1
         Sample = 23, 26, 33, 78, 14
 
 
 
 
          Group = d
Population Size = 27
    Sample Size = 5
           Seed = 1
         Sample = 21, 31, 53, 71, 11
 
> # Take a sample of 15 from each group, with replacement, and a seed of 1
> stratified(z, 15, 1, replace=T)
 
 
 
          Group = a
Population Size = 26
    Sample Size = 15
           Seed = 1
         Sample = 38, 45, 56, 91, 35, 91, 96, 74, 62, 15, 35, 30, 74, 45, 81
 
 
 
 
          Group = b
Population Size = 27
    Sample Size = 15
           Seed = 1
         Sample = 39, 44, 63, 93, 29, 93, 95, 66, 64, 3, 29, 19, 70, 44, 77
 
 
 
 
          Group = c
Population Size = 20
    Sample Size = 15
           Seed = 1
         Sample = 23, 26, 55, 94, 22, 92, 94, 72, 61, 9, 22, 14, 72, 26, 78
 
 
 
 
          Group = d
Population Size = 27
    Sample Size = 15
           Seed = 1
         Sample = 21, 32, 58, 88, 16, 88, 89, 65, 59, 4, 16, 11, 67, 32, 69
 
> # Take a sample of 10% from each group, using a seed of 1,
> # and display the output as a data frame
> stratified(z, .1, 1, dframe=T)
  Group Samples
1     a      38
2     a      45
3     a      54
  Group Samples
1     b      39
2     b      43
3     b      60
  Group Samples
1     c      23
2     c      26
  Group Samples
1     d      21
2     d      31
3     d      53

Replicating the results from tapply()

I mentioned earlier that the results are different from what you would get if you were to use the tapply() function. However, it is easy to get the same results using this stratified function–simply move your “seed” outside of the function (enter seed as "NO" [with quotes] and instead, use set.seed() as you normally would).

?View Code RSPLUS
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
> # See what tapply() gives us
> set.seed(1); tapply(z$a, z$b, sample, size = 4)
$a
[1] 38 45 54 81
 
$b
[1] 29 86 95 63
 
$c
[1] 61  9 14 92
 
$d
[1] 67 31 68 34
 
> # The normal usage for the stratified function
> stratified(z, 4, 1, dframe = T)
  Group Samples
1     a      38
2     a      45
3     a      54
4     a      81
  Group Samples
1     b      39
2     b      43
3     b      60
4     b      79
  Group Samples
1     c      23
2     c      26
3     c      33
4     c      78
  Group Samples
1     d      21
2     d      31
3     d      53
4     d      71
> # Getting the same results as tapply()
> # Set the seed before using the function,
> # and set the seed for the function as "NO"
> set.seed(1); stratified(z, 4, "NO", dframe = T)
  Group Samples
1     a      38
2     a      45
3     a      54
4     a      81
  Group Samples
1     b      29
2     b      86
3     b      95
4     b      63
  Group Samples
1     c      61
2     c       9
3     c      14
4     c      92
  Group Samples
1     d      67
2     d      31
3     d      68
4     d      34

The unfortunate…

There are some advantages to each of the output formats. I’ve set up the list to be quite verbose, which is useful with the proportionate sampling since it shows us how many samples have been taken from each group. The data frame output format, on the other hand, is quite compact.

What I still need to figure out, though, is why R won’t store my output. I suspect that it has something to do with how my loops are set up. I assume that somewhere, I need to add something like an rbind command.

When the time is right, I will be sure to post what I’ve found.


Related posts (possibly):

  1. Stratified random sampling in R from a data frame After a little bit more work, there’s a new stratified...
  2. Simple sampling with R I mentioned in an earlier post (Am I inconsistent?) that...
  3. Sampling with replacement in R In my last post about sampling (Simple sampling with R)...
  4. A sample size calculator function for R IMPORTANT: This is here mostly to remind me of how...
  5. The new sample size calculator for R (already) aka “Maybe I shouldn’t post so quickly” Just hours ago,...
This entry was posted in (all categories), Geekiness, Useless Knowledge and tagged , , , , , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.