A sample size calculator function for R

IMPORTANT: This is here mostly to remind me of how I solved my problem. You should read if you really want to use this function.

In the research class at the Tata-Dhan Academy, students are currently getting into sampling, so I thought I would introduce them to R. However, try as I might, I couldn’t find how to do a simple sample size calculation in R if I knew, for instance, the size of the population I wanted to sample from, the confidence level desired, and the confidence interval desired.

Now, I know that there are literally hundreds of such calculators online, but I thought it would be a good excuse for me to learn how to write a function. Here are my first four functions, which demonstrate some of the features available for writing functions in R. These are relatively basic, and there might be better ways to do this (if there are, please share!), but it was still a fun experiment for me.

samp.size()

Here’s my first attempt (based on these formulas).

ss = \frac{Z^2\times p\times(1-p)}{c^2}

pss = \frac{ss}{1+\frac{ss-1}{pop}}

samp.size = function(z.val, margin, c.interval, population) {
    ss = (z.val^2 * margin * (1 - margin))/(c.interval^2)
    return(ss/(1 + ((ss - 1)/population)))
}

Here’s what’s happening. The samp.size = function(z.val, margin, c.interval, population) part tells R that we’re creating a function called samp.size that depends on four inputs (z.val, margin, c.interval, and population, in that order). The curly brackets enclose the formula or set of formulas that use these four variables. In this particular function, there are only two formulas. The first is the equation used to determine the sample size when the population is not known (or is very large), and the second uses the result of the first to determine the sample size for a known (finite) population.

The downside to this function is that you need to specify your z value, which means looking it up in a table like this one.

The upside is that, since this is the raw formula, you can use it for any confidence level you want, while the next two functions are limited to a preset list of confidence levels. I’ve highlighted the intersection points for confidence levels of 80%, 90%, 95%, 98%, and 99%. From each one, read the corresponding values in the first column and the first row to find the z value to use in the samp.size() function. For instance, for 80%, we look for the value closest to .4 (the table lists the area between the mean and z, which, because the normal distribution is symmetric, is half of the central 80%). The corresponding first column value is 1.2 and the corresponding first row value is .08, so we would use a z value of 1.28.
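If you'd like to double-check a table reading without leaving R, here is a small sketch using pnorm(), which returns the area under the standard normal curve to the left of a z value. The variable names are only for illustration.

```r
# Area under the standard normal curve between the mean (0) and z = 1.28
z <- 1.28
area.from.mean <- pnorm(z) - 0.5
area.from.mean        # roughly 0.3997, the table entry closest to .4

# Doubling it gives the central area, i.e. the confidence level
2 * area.from.mean    # roughly 0.7995, or about 80%
```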

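Knowing this information, and assuming a 50% response distribution, a 5% confidence interval (that is, a margin of error of 5%), and a population of 100, we can now use the samp.size() function as follows. Note that, despite its name, the margin argument holds the response distribution, while c.interval holds the margin of error.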

samp.size(1.28, 0.5, 0.05, 100)
## [1] 62.33

Our recommended sample size is 62.33.
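To see where that number comes from, here is the same calculation done one step at a time. This is only a sketch: the variable names (p for the response distribution, moe for the margin of error) are mine for this illustration and don't appear in the function. The final ceiling() step is also my own suggestion, on the grounds that you can't survey a third of a person; the later functions in this post use round() instead.

```r
z   <- 1.28   # z value for an 80% confidence level
p   <- 0.5    # response distribution
moe <- 0.05   # margin of error (called c.interval in samp.size)
pop <- 100    # population size

# Sample size when the population is unknown or very large
ss <- z^2 * p * (1 - p) / moe^2
ss                        # 163.84

# Adjustment for a known (finite) population
pss <- ss / (1 + (ss - 1) / pop)
pss                       # about 62.33

# Round up to a whole number of respondents
ceiling(pss)              # 63
```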

NOTE: Forget all of this nonsense and scroll down to the sample.size() function at the end of this post. It is much better and much easier to use.

sample.size.table()

After reading some more about determining sample size, I thought it might be interesting to see in one place what the recommended sample sizes would be for some common confidence levels (80%, 90%, 95%, 98%, 99%, 99.5%, 99.8%, 99.9%, and 99.99%, with data from Wikipedia’s article about the normal distribution).

Along with building those confidence levels into my function, I thought I would also set the response distribution to default to 50% and the confidence interval to default to 5%. That way, all the user has to do is enter the population size, and a table is generated with the suggested sample sizes. Here’s the function I created for that.

sample.size.table = function(margin=.5, c.interval=.05, population) {
  z.val=c(1.281551565545, 1.644853626951, 1.959963984540, 
          2.326347874041, 2.575829303549, 2.807033768344, 
          3.090232306168, 3.290526731492, 3.890591886413)
  ss = (z.val^2 * margin * (1-margin))/(c.interval^2)
  p.ss = ss/(1 + ((ss-1)/population))
  c.level = c("80%","90%","95%","98%","99%",
              "99.5%","99.8%","99.9%","99.99%")
  results = data.frame(c.level, round(p.ss, digits = 0))
  names(results) = c("Confidence Level", "Sample Size")
  METHOD = c("Suggested sample sizes at different confidence levels")
  moe = paste((c.interval*100), "%", sep="")
  resp.dist = paste((margin*100),"%", sep="")
  pre = structure(list(Population=population, 
                       "Margin of error" = moe,
                       "Response distribution" = resp.dist, 
                       method = METHOD),
                  class = "power.htest")
  print(pre)
  print(results)
}

As you read through this function, you’ll see that most of it is simply about presentation. The formulas are the same as the ones in the samp.size() function, but there is a lot more information to display, and I wanted it to be somewhat nicely formatted too. Notice that, as I did not want the user to change the confidence levels (they’re a vector of preset values), I moved them out of the function() statement and into the body. Because z.val is a vector, R applies the formulas to all nine values at once. Using this function is quite easy. If we want to accept the default values for the response distribution and the confidence interval, all we need to do is declare our population size.

sample.size.table(, , 100)
## 
##      Suggested sample sizes at different confidence levels 
## 
##            Population = 100
##       Margin of error = 5%
## Response distribution = 50%
## 
##   Confidence Level Sample Size
## 1              80%          62
## 2              90%          73
## 3              95%          80
## 4              98%          85
## 5              99%          87
## 6            99.5%          89
## 7            99.8%          91
## 8            99.9%          92
## 9           99.99%          94

Notice that in order for this to work, you need the correct number of commas to show that you’re accepting the default values for the other two variables. Or, you can use something like sample.size.table(population = 100) to be on the safe side. If you don’t include them, R treats 100 as the first argument (margin), and you might end up with something like this:

sample.size.table(100)
## Error: 'population' is missing
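Here is a small sketch of how R matches those arguments. The show.args() function is a hypothetical stand-in I made up for this illustration; it has the same argument list as sample.size.table() but only reports what it received.

```r
# A stand-in with the same argument list as sample.size.table()
show.args <- function(margin = .5, c.interval = .05, population) {
  if (missing(population)) {
    return("population was not supplied")
  }
  c(margin = margin, c.interval = c.interval, population = population)
}

show.args(100)               # 100 is matched to margin, not population
show.args(, , 100)           # empty slots keep their defaults
show.args(population = 100)  # same result, matched by name
```

Naming the argument is the more robust choice, since it keeps working even if the order of the arguments changes.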

sample.size.old()

After writing my second function, I decided to try one more time, this time allowing users to use the more familiar “95” for a confidence level of 95% instead of having to look up the value for 95% in the z table. Doing this would also give me an excuse to try using if and else in my function. Here’s what I came up with.

sample.size.old = function(c.lev, margin=.5, 
                           c.interval=.05, population) {
  if (c.lev==80) {
    z.val=1.281551565545
  } else if (c.lev==90) {
    z.val=1.644853626951
  } else if (c.lev==95) {
    z.val=1.959963984540
  } else if (c.lev==98) {
    z.val=2.326347874041
  } else if (c.lev==99) {
    z.val=2.575829303549
  } else if (c.lev==99.5) {
    z.val=2.807033768344
  } else if (c.lev==99.8) {
    z.val=3.090232306168
  } else if (c.lev==99.9) {
    z.val=3.290526731492
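  } else if (c.lev==99.99) {
    z.val=3.890591886413
  } else {
    # Fail early with a clear message instead of "object 'z.val' not found"
    stop("c.lev must be one of 80, 90, 95, 98, 99, 99.5, 99.8, 99.9, or 99.99")
  }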
  ss = (z.val^2 * margin * (1-margin))/c.interval^2
  p.ss = round((ss/(1 + ((ss-1)/population))), digits=0)
  METHOD = paste("Recommended sample size for a population of ", 
                 population, " at a ", c.lev, 
                 "% confidence level", sep = "")
  structure(list(Population = population, 
                 "Confidence level" = c.lev,
                 "Margin of error" = c.interval, 
                 "Response distribution" = margin,
                 "Recommended sample size" = p.ss, 
                 method = METHOD),
            class = "power.htest")
}

As you can see, this is similar to the sample.size.table() function, but in this case, the user has to explicitly enter the confidence level they want (choosing from 80%, 90%, 95%, 98%, 99%, 99.5%, 99.8%, 99.9%, or 99.99%) and must specify the population. They can also change the default values for the response distribution (second position) or the margin of error (third position). Here’s an example.

sample.size.old(99.99, , , 100)
## 
##      Recommended sample size for a population of 100 at a 99.99% confidence level 
## 
##              Population = 100
##        Confidence level = 99.99
##         Margin of error = 0.05
##   Response distribution = 0.5
## Recommended sample size = 94
## 
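The chain of if and else statements was good practice, but the same lookup can be written more compactly with a named vector. This is just a sketch of an alternative: z.table and lookup.z() are names I made up for this illustration and are not part of the functions above.

```r
# Confidence levels (as names) mapped to their z values
z.table <- c("80"    = 1.281551565545, "90"   = 1.644853626951,
             "95"    = 1.959963984540, "98"   = 2.326347874041,
             "99"    = 2.575829303549, "99.5" = 2.807033768344,
             "99.8"  = 3.090232306168, "99.9" = 3.290526731492,
             "99.99" = 3.890591886413)

# Hypothetical helper: return the z value for a supported confidence level
lookup.z <- function(c.lev) {
  key <- as.character(c.lev)
  if (!key %in% names(z.table)) {
    stop("c.lev must be one of: ", paste(names(z.table), collapse = ", "))
  }
  unname(z.table[key])
}

lookup.z(95)      # 1.959964
lookup.z(99.99)   # 3.890592
```

An unsupported level such as lookup.z(85) stops with a message listing the valid choices.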

The “duh” moment or sample.size()

Of course, after posting this, I had one of those “duh” moments when I remembered the qnorm() function that’s built into R. With it, a single function can determine sample sizes at any confidence level, and we can enter that level in a human-friendly form. No more having to use a z table to find the value for 98%. Just type in 98 as your first value and you’re set to go! Here’s the final function:

sample.size = function(c.lev, margin=.5, 
                       c.interval=.05, population) {
  z.val = qnorm(.5+c.lev/200)
  ss = (z.val^2 * margin * (1-margin))/c.interval^2
  p.ss = round((ss/(1 + ((ss-1)/population))), digits=0)
  METHOD = paste("Recommended sample size for a population of ", 
                 population, " at a ", c.lev, 
                 "% confidence level", sep = "")
  structure(list(Population = population, 
                 "Confidence level" = c.lev,
                 "Margin of error" = c.interval, 
                 "Response distribution" = margin,
                 "Recommended sample size" = p.ss, 
                 method = METHOD),
            class = "power.htest")
}

And, here’s how you use it:

sample.size(98, , , 100)
## 
##      Recommended sample size for a population of 100 at a 98% confidence level 
## 
##              Population = 100
##        Confidence level = 98
##         Margin of error = 0.05
##   Response distribution = 0.5
## Recommended sample size = 85
## 

You can also use sample.size(c.lev = 98, population = 100) if those extra commas bother you.
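In case the .5+c.lev/200 part looks mysterious: a confidence level of c.lev percent covers the central c.lev/100 of the normal curve, which leaves half of the remainder in each tail, so the cutoff we want has .5 + c.lev/200 of the area to its left. Here is a quick sketch checking that against the z values used earlier (z.alt is just an illustrative name for the same quantity written another way).

```r
c.lev <- c(80, 90, 95, 98, 99)

# Upper cutoff of the central c.lev% of the standard normal distribution
z.val <- qnorm(.5 + c.lev / 200)

# Equivalent form: 1 minus the area in the upper tail
z.alt <- qnorm(1 - (1 - c.lev / 100) / 2)

round(z.val, 4)
# 1.2816 1.6449 1.9600 2.3263 2.5758

all.equal(z.val, z.alt)   # TRUE
```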

By the way, you may notice in the function code for sample.size.table(), sample.size.old(), and sample.size() that the last item is class = "power.htest". That is simply for formatting the output and it is taken from the power.t.test() function.
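To show what that class does, here is a minimal sketch with made-up values. The object is still an ordinary list underneath, so you can pull out the number itself if you need it for further calculations.

```r
# A plain list with a method element and the "power.htest" class attached
res <- structure(list(Population = 100,
                      "Recommended sample size" = 85,
                      method = "A made-up example"),
                 class = "power.htest")

res                               # printed by R's print method for power.htest
res[["Recommended sample size"]]  # 85, the bare number for further use
names(unclass(res))               # the underlying list elements
```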

By the way some more, if you want to see the underlying code for other functions, you can usually type just their name (without the parentheses) and the code will print out. For instance, to view the code for power.t.test(), just type power.t.test at the command prompt and hit enter.

Even more by the way, you don’t need to type these functions all the time. If you want to use these functions, you can first load them into R by typing source("http://news.mrdwab.com/sample.size") at the command prompt in R before you try to take the sample.

2 thoughts on “A sample size calculator function for R”

  1. Hello,
    I am one of the authors of “Collaborative Statistics” and I am very impressed with what you are doing with the book and R. Is there information about R and the calculator(s) that run it somewhere on your website? I would like to learn about it.

    Susan Dean

    • Dear Susan,

      Thanks for the encouragement. I enjoy the “Collaborative Statistics” book and I’m planning on recommending it to the faculty that teaches statistics at the school I work at.

      The best place to start for information about R would be to visit the R Project’s home page (http://www.r-project.org) and download a copy of the software. Once you’ve gotten the software installed, there are many books and websites that might help you get started. I found SimpleR and Using R for Data Analysis and Graphics to be very useful. R in a Nutshell is also a great retail book.

      R is very syntax oriented (so, more like Stata than SPSS). As I have some experience programming, it’s more natural for me to use the command-line interface. (I actually use the SciViews-K extension for running R from within Komodo Edit.) But, if you prefer a more standard interface, there are several graphical user interfaces (GUIs) that might make the software easier to use. I like the combination of Deducer and JGR for most basic statistics, many people like the R Commander, and if you are using Linux, RKWard is great. Also, since you’re an educator, you might want to check out Revolution Analytics, which offers a free version of their enterprise software to academics.

      I hope this is enough to get you started!

      ~ Ananda
