> # Generate some data
> a = 1:100
> set.seed(123)
> b = sample(c("a", "b", "c", "d"), 100, replace = T)
> z = data.frame(a, b)
> # Check how big each group is
> table(z$b)
a b c d
26 27 20 27
> # Make sure the function is loaded before you continue!
> # source("http://news.mrdwab.com/stratified-beta")
> # Take a 15% sample and use a seed of 1
> stratified(z, .15, 1)
Group = a
Population Size = 26
Sample Size = 4
Seed = 1
Sample = 38, 45, 54, 81
Group = b
Population Size = 27
Sample Size = 4
Seed = 1
Sample = 39, 43, 60, 79
Group = c
Population Size = 20
Sample Size = 3
Seed = 1
Sample = 23, 26, 33
Group = d
Population Size = 27
Sample Size = 4
Seed = 1
Sample = 21, 31, 53, 71
> # Take a sample of 5 from each group and use a seed of 1
> stratified(z, 5, 1)
Group = a
Population Size = 26
Sample Size = 5
Seed = 1
Sample = 38, 45, 54, 81, 30
Group = b
Population Size = 27
Sample Size = 5
Seed = 1
Sample = 39, 43, 60, 79, 19
Group = c
Population Size = 20
Sample Size = 5
Seed = 1
Sample = 23, 26, 33, 78, 14
Group = d
Population Size = 27
Sample Size = 5
Seed = 1
Sample = 21, 31, 53, 71, 11
> # Take a sample of 15 from each group, with replacement, and a seed of 1
> stratified(z, 15, 1, replace=T)
Group = a
Population Size = 26
Sample Size = 15
Seed = 1
Sample = 38, 45, 56, 91, 35, 91, 96, 74, 62, 15, 35, 30, 74, 45, 81
Group = b
Population Size = 27
Sample Size = 15
Seed = 1
Sample = 39, 44, 63, 93, 29, 93, 95, 66, 64, 3, 29, 19, 70, 44, 77
Group = c
Population Size = 20
Sample Size = 15
Seed = 1
Sample = 23, 26, 55, 94, 22, 92, 94, 72, 61, 9, 22, 14, 72, 26, 78
Group = d
Population Size = 27
Sample Size = 15
Seed = 1
Sample = 21, 32, 58, 88, 16, 88, 89, 65, 59, 4, 16, 11, 67, 32, 69
> # Take a sample of 10% from each group, using a seed of 1,
> # and display the output as a data frame
> stratified(z, .1, 1, dframe=T)
Group Samples
1 a 38
2 a 45
3 a 54
Group Samples
1 b 39
2 b 43
3 b 60
Group Samples
1 c 23
2 c 26
Group Samples
1 d 21
2 d 31
3 d 53
Stratified Random Sampling in R–A Function in Progress
I know that sampling is quite complex, and I will admit that I know very little about its complexities. Fortunately, software like R lets you draw simple random samples pretty easily, either with or without replacement. Unfortunately, I could not find a ready-made way to do simple stratified random sampling, at least not with the features I was looking for. Fortunately again, with a little bit of experimenting, it is pretty easy to learn how to write functions in R when a direct solution does not present itself.
This post shares my initial “work-in-progress” on writing an R function for stratified sampling.
The problem…
Here’s the minimum that I was hoping for:
My initial searches directed me to Yihui Xie’s page on stratified sampling using tapply(). However, this option did not satisfy my needs. As far as I could figure, it only allowed me to take a fixed sample size. Also, I wasn’t totally satisfied with the output.
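For readers who have not seen the tapply() approach, here is a minimal sketch of the general idea. It is not necessarily the code from Yihui Xie's page, and the dummy data frame z (with deterministic group sizes of 26, 27, 20 and 27) is an assumption made for illustration.

```r
# Dummy data: IDs in column a, group labels in column b.
# Group sizes are fixed by hand here so the example is reproducible.
z <- data.frame(a = 1:100,
                b = rep(c("a", "b", "c", "d"), times = c(26, 27, 20, 27)))

set.seed(1)
# Apply sample() to the IDs of each group, drawing 5 IDs per group.
# The extra argument size = 5 is passed through tapply() to sample().
fixed <- tapply(z$a, z$b, sample, size = 5)
fixed
```

Every group gets the same fixed number of draws, which is the limitation described above: there is no direct way to ask for, say, 15% of each group.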
Consider the following. In Yihui Xie’s example, the results differ from what you would get if you sampled from each group separately while using the same seed for each group.
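A likely explanation is simply how the random number stream works: when the seed is set once and tapply() then samples group after group, each group picks up the stream where the previous one left off, whereas resetting the seed before every group restarts the stream each time. The sketch below illustrates this with the same assumed dummy data; the names one_pass and per_group are made up for the example.

```r
z <- data.frame(a = 1:100,
                b = rep(c("a", "b", "c", "d"), times = c(26, 27, 20, 27)))

# Seed set once: the random stream keeps advancing from group to group
set.seed(1)
one_pass <- tapply(z$a, z$b, sample, size = 5)

# Seed reset before each group: every group starts from the same point
per_group <- lapply(split(z$a, z$b), function(ids) {
  set.seed(1)
  sample(ids, size = 5)
})

# The first group is drawn the same way in both cases;
# the later groups will generally differ.
identical(one_pass[[1]], per_group[[1]])
```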
I’m sure there’s some sampling theory that explains this, or at least something about how R treats its data, but at the moment, that’s beyond my humble level of expertise.
Stratified sampling, Mr. DWAB style…
The solution I arrived at is to use “unstack()” and a few conditional loops to take the samples.
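To make the idea concrete, here is a rough sketch of what an unstack()-plus-loop approach could look like. It is not the actual stratified() function from this post: the name stratified_sketch, the use of NULL for "no seed", the rounding of proportional sizes with round(), and the choice to reset the seed before each group are all assumptions (the last two are at least consistent with the sample output shown in this post).

```r
# Hypothetical sketch of the unstack()-plus-loop idea; not the author's code.
stratified_sketch <- function(df, size, seed = NULL, replace = FALSE) {
  # Assumes column 1 holds the IDs and column 2 the group labels
  names(df)[1:2] <- c("id", "group")
  groups <- unstack(df, id ~ group)  # one vector of IDs per group

  out <- list()
  for (g in names(groups)) {
    ids <- groups[[g]]
    # size < 1 is read as a proportion of the group, otherwise as a count
    n <- if (size < 1) round(length(ids) * size) else size
    if (!replace && n > length(ids)) {
      stop("group '", g, "' has only ", length(ids), " rows")
    }
    # Resetting the seed for every group is an assumption
    if (!is.null(seed)) set.seed(seed)
    # Index via sample.int() to avoid sample()'s special case for length-1 input
    out[[g]] <- ids[sample.int(length(ids), n, replace = replace)]
  }
  out
}

# Assumed dummy data with group sizes 26, 27, 20 and 27
z <- data.frame(a = 1:100,
                b = rep(c("a", "b", "c", "d"), times = c(26, 27, 20, 27)))
s <- stratified_sketch(z, 0.15, seed = 1)
lengths(s)  # number of IDs drawn per group
```

Returning the list, rather than printing inside the loop, means the result can be assigned to a variable and reused.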
And, without more rambling, here’s what I came up with.
You can load the function by typing:
source("http://news.mrdwab.com/stratified-beta")
And now, to test it…
Let’s generate some dummy data and see what we can come up with. The function takes the following arguments (in the following order):
df: The source data frame, with the first column being the IDs and the second column being the groups.
size: The sample size you want, either as a percentage (for proportional sampling–expressed as a decimal) or as a whole number.
seed: The seed you want to use. If you don’t want to use a seed, enter “NO”.
dframe: What format you want the output in, either a list or a data frame. Defaults to a list (dframe=FALSE), which is better at the moment since the data frame option is not working the way I expect it to yet.

Replicating the results from tapply()
I mentioned earlier that the results are different from what you would get if you were to use the tapply() function. However, it is easy to get the same results using this stratified function–simply move your “seed” outside of the function (enter seed as "NO" [with quotes] and instead, use set.seed() as you normally would).

The unfortunate…
There are some advantages to each of the output formats. I’ve set up the list to be quite verbose, which is useful with the proportionate sampling since it shows us how many samples have been taken from each group. The data frame output format, on the other hand, is quite compact.
What I still need to figure out, though, is why R won’t store my output. I suspect that it has something to do with how my loops are set up. I assume that somewhere, I need to add something like an rbind command.
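Without seeing the function's internals it is hard to say for sure, but the repeated "Group Samples" headers in the data frame output suggest that each group's data frame is printed inside the loop rather than collected and returned. A minimal sketch of the rbind idea follows; collect_samples, its arguments, and the example groups list are all hypothetical and not part of the stratified() function.

```r
# Hypothetical helper showing how per-group results can be combined
# into one data frame that is returned (and can therefore be stored).
collect_samples <- function(groups, n, seed = NULL) {
  pieces <- vector("list", length(groups))
  for (i in seq_along(groups)) {
    if (!is.null(seed)) set.seed(seed)
    ids <- groups[[i]]
    picked <- ids[sample.int(length(ids), min(n, length(ids)))]
    # Build a small data frame for this group and keep it in the list
    pieces[[i]] <- data.frame(Group = rep(names(groups)[i], length(picked)),
                              Samples = picked)
  }
  out <- do.call(rbind, pieces)  # stack the per-group pieces
  rownames(out) <- NULL
  out                            # return the value instead of printing it
}

# Assumed example groups: a named list of ID vectors
groups <- list(a = 1:26, b = 27:53, c = 54:73, d = 74:100)
result <- collect_samples(groups, 3, seed = 1)
result
```

Because the function returns a single data frame, an assignment such as result <- collect_samples(...) keeps the samples for later use.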
When the time is right, I will be sure to post what I’ve found.