A sample size calculator function for R

IMPORTANT: This is here mostly to remind me of how I solved my problem. You should read “The new sample size calculator for R (already)” if you really want to use this function.

In the research class at the Tata-Dhan Academy, students are currently getting into sampling, so I thought I would introduce them to R. However, try as I might, I couldn’t find how to do a simple sample size calculation in R if I knew, for instance, the size of the population I wanted to sample from, the confidence level desired, and the confidence interval desired.

Now, I know that there are literally hundreds of such calculators online, but I thought it would be a good excuse for me to learn how to write a function. Here are my first four functions, which demonstrate some of the features available for writing functions in R. These are relatively basic, and there might be better ways to do this (if there are, please share!), but it was still a fun experiment for me.
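To give a flavor of what such a calculation involves, here is a minimal sketch. It is not the function from the original post. It assumes Cochran's formula with a finite population correction, treats the "confidence interval" as a margin of error (0.05 means plus or minus 5 percentage points), and assumes the most conservative proportion of 0.5. The name `sample_size` and its arguments are made up for illustration.

```r
# Hypothetical helper: Cochran's formula with a finite population correction.
# population: size of the population being sampled
# conf_level: desired confidence level, e.g. 0.95
# margin:     desired margin of error, e.g. 0.05
# p:          assumed proportion (0.5 gives the largest, safest sample)
sample_size <- function(population, conf_level = 0.95, margin = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - conf_level) / 2)   # two-sided critical value
  n0 <- z^2 * p * (1 - p) / margin^2      # sample size for a very large population
  n  <- n0 / (1 + (n0 - 1) / population)  # shrink it for a finite population
  ceiling(n)                              # always round up to whole respondents
}

sample_size(1000)  # 278
```

Raising the confidence level or tightening the margin of error increases the result, while a smaller population reduces it.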

Using the reshape package in R for pivot-table-like functionality

A little more than a week ago, I wrote about creating pivot tables in Microsoft Excel and OpenOffice.org. I also mentioned that I would explain how to do similar calculations by using R. This post will explain how to achieve similar results in R by using the reshape package.

I had initially started experimenting with the reshape package several months ago when I was trying to figure out how to reshape data from wide to long formats. However, once I started experimenting with it, I realized I had misunderstood what the reshape package was designed to do. Now that I finally have a grasp of what can be done using the package, I thought I would share what I’ve found using a few examples.
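As a rough illustration of what "pivot-table-like" means, here is a small sketch. Note that it uses base R's `xtabs()` rather than the reshape package's own functions, so it should run without installing anything. The `harvest` data frame and its column names are invented for the example.

```r
# Made-up data: one row per harvest record.
harvest <- data.frame(
  region = c("North", "North", "South", "South", "North", "South"),
  crop   = c("Rice", "Wheat", "Rice", "Wheat", "Rice", "Wheat"),
  tonnes = c(10, 20, 30, 40, 5, 15)
)

# Rows = region, columns = crop, cells = total tonnes.
pivot <- xtabs(tonnes ~ region + crop, data = harvest)
pivot

# Add row and column totals, as a spreadsheet pivot table would.
addmargins(pivot)
```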

Getting data into R

When you first open R, you’re greeted with a screen similar to the following:

R version 2.10.0 (2009-10-26)
Copyright (C) 2009 The R Foundation for Statistical Computing
ISBN 3-900051-07-0

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>

I’ve been trying to encourage my students to use R for some of their work, but in the process, I sort of forgot that for most people, starting up a program and just being greeted with a command prompt might be somewhat intimidating. So after several of my students indicated that they had downloaded and installed R but had no idea what to do next, I thought I would write about some of the very basic ways to get started. I recognize that for some huge datasets, the suggestions here are not the best, but for me, and for most of my students, the datasets that we would be working with are actually quite small.
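For a small dataset, one simple route is a CSV file read with `read.csv()`. The sketch below assumes made-up data typed inline so that it runs on its own; with a real file you would pass the file name instead, for example `read.csv("mydata.csv")`, where `mydata.csv` is a hypothetical name.

```r
# Made-up data, written inline so the example is self-contained.
csv_text <- "name,age,village
Asha,34,Madurai
Ravi,41,Theni
Meena,29,Dindigul"

# The text argument lets read.csv() parse a string instead of a file.
people <- read.csv(text = csv_text, stringsAsFactors = FALSE)

str(people)       # shows the columns and their types
mean(people$age)  # average of the age column
```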

R is like a giant calculator for grownups

One of the things that is interesting about R is how flexible it is. One of the fun things about it is how interactive it can be. While my examples so far have been a little bit more involved, it can be useful to spend some time just getting acquainted with how R performs basic calculations. In fact, I sometimes like to think of R as a giant calculator for grownups to play with. The following syntax snippets show how you can perform basic calculations with R. This is by no means complete, but it should provide a reasonable introduction to someone just getting started with R. (Experienced R users would find this TOTALLY useless….)
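A few lines of the kind you might type at the prompt, as a small sketch (the numbers and the vector `x` are arbitrary):

```r
2 + 3      # 5
7 / 2      # 3.5
7 %/% 2    # 3  (integer division)
7 %% 2     # 1  (remainder)
2^10       # 1024
sqrt(16)   # 4

# Store several numbers in a vector and summarize them.
x <- c(4, 8, 15, 16, 23, 42)
sum(x)     # 108
mean(x)    # 18
```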

A little spark for presenting your data

For some reason, I’ve been obsessing over the presentation of data. (Either it is that I’ve just read all of Edward Tufte’s books, or I’m just being a nerd. But I guess that those two things aren’t exactly exclusive….) Considering my obsession, you could imagine how I felt when one of my students stood up and made a presentation that included the following slides, along with the typical, “As you can see here, the production of rice has been decreasing. And as you can see in this chart, the production of wheat has been decreasing,” for slide after slide after slide.

If for some reason you’re not able to see the embedded slides, you can also view the slides in a new window.

It's a choropleth party with R, and everyone's invited

Map party time. For some reason this happens every once in a while with me. A few years ago, I got to develop a website filled with choropleth maps galore. It was a pretty tedious process. Excel sheets. Photoshop. No good access to free Indian shapefiles. I was even thinking of making my own SVG files of Indian states at one point and thinking of a complex PHP and MySQL website.

Skip forward a few years now, and I’m back with the maps. Only this time, I have some new tools and resources: the software named after a pirate’s favorite letter, some free maps from the Global Administrative Areas website, some data from the 2001 Indian Census (I selected district data, all districts, and total population), and Google Docs (to clean up my CSV files).

Quickly reshaping data from `wide` to `long` formats in R

A lot of the time, students at the Academy enter data in a “wide” format (since it is a very natural way to enter data in a spreadsheet). Let’s say, for example, that they were collecting data for a household, and for each person, they were collecting information on three variables. Assume also that they were only collecting information about five household members. They might end up with a first row of column names something like “HouseholdID” | “member.01” | “member.02” | “member.03” | “member.04” | “member.05” | “variable1.01” | “variable1.02” | “variable1.03” | “variable1.04” | “variable1.05” | “variable2.01” | “variable2.02” … and so on. Sometimes, however, we may find it more useful to have our data in a “long” format. This post tells you how to quickly do that using R.
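As a minimal sketch of the idea, the example below uses base R's `reshape()` function on a cut-down version of that layout: two households, two members each, and one variable. The values and the `memberNo` column name are invented, and this is not necessarily the approach taken in the full post.

```r
# Made-up wide data: one row per household.
wide <- data.frame(
  HouseholdID  = c(1, 2),
  member.01    = c("Asha", "Ravi"),
  member.02    = c("Kumar", "Meena"),
  variable1.01 = c(34, 41),
  variable1.02 = c(36, 29),
  stringsAsFactors = FALSE
)

# Each element of 'varying' lists the wide columns that collapse
# into the matching name in 'v.names'.
long <- reshape(
  wide,
  direction = "long",
  idvar     = "HouseholdID",
  varying   = list(c("member.01", "member.02"),
                   c("variable1.01", "variable1.02")),
  v.names   = c("member", "variable1"),
  timevar   = "memberNo",
  times     = 1:2
)

# Sort so that each household's members sit together.
long <- long[order(long$HouseholdID, long$memberNo), ]
long
```

The result has one row per household member, with `memberNo` recording which of the original numbered columns each row came from.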

Sampling with replacement in R

In my last post about sampling, Simple sampling with R, we were doing simple sampling without replacement; that is, each item could only be selected once. However, there are times when you want to simulate sampling with replacement. For example, if you wanted to simulate rolling a die 50 times, your outcome each time could be a 1, 2, 3, 4, 5 or 6, but 50 is more than 6, so you need to let the software “replace” each outcome before it draws the next one.

This post explains how to do this with R.
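The die example can be sketched with base R's `sample()` function and its `replace` argument. The seed value here is arbitrary and only makes the run repeatable.

```r
set.seed(1)  # arbitrary seed, just for repeatability

# Fifty rolls of a six-sided die: each draw is "put back" before the next.
rolls <- sample(1:6, size = 50, replace = TRUE)
table(rolls)  # how often each face came up

# Without replace = TRUE, asking for 50 draws from 6 values is an error.
too_many <- tryCatch(
  sample(1:6, size = 50),
  error = function(e) "error"
)
too_many
```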

Simple sampling with R

I mentioned in an earlier post (“Am I inconsistent?”) that I got interested in R because Amy had asked me to help her with some sampling at one point. Since that was my starting point, I thought I would share some of my experiments with you. In this post:

  1. Simple random sampling
  2. Simple random sampling with a seed
  3. Sorting your sample
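The three topics above can be sketched in a few lines of base R. The population of 100 numbered items, the sample size of 10, and the seed value are all made up for the example.

```r
population <- 1:100  # say, 100 numbered households

# 1. Simple random sampling: a different result on each run.
sample(population, size = 10)

# 2. With a seed, the same "random" sample can be reproduced.
set.seed(2010)
first <- sample(population, size = 10)
set.seed(2010)
second <- sample(population, size = 10)
identical(first, second)  # TRUE

# 3. Sorting the sample makes it easier to read off a list.
sort(first)
```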

Am I inconsistent?

I won’t pretend that I don’t have any illegal software installed on my computers, but here are a couple of scenarios that have occurred at work recently.

A student came to me with his new laptop and asked me “Can I have a copy of Office 2007?”

I thought for a minute and said, “Um, why? What do you have right now?”

“They installed Office 2003 when I purchased the laptop.”

“So, what’s wrong with that version?”

Silence.