It’s a choropleth party with R, and everyone’s invited

Tamil Nadu Population Density

Map party time. For some reason this happens every once in a while with me. A few years ago, I got to develop a website filled with choropleth maps galore. It was a pretty tedious process. Excel sheets. Photoshop. No good access to free Indian shapefiles. I was even thinking of making my own SVG files of Indian states at one point and thinking of a complex PHP and MySQL website.

Skip forward a few years now, and I’m back with the maps. Only this time, I have some new tools and resources: the software named after a pirate’s favorite letter, some free maps from the Global Administrative Areas website, some data from the 2001 Indian Census (I selected district data, all districts, and total population), and Google Docs (to clean up my CSV files).

Enough history. Let’s get started mapping, and since I’m in Tamil Nadu, I’m going to restrict myself to that state.

First, download the district level RData file from the GADM website (it’s a little under 7MB). Double-click on the file to open the R workspace. By using > ls() we can see that the only object in this workspace is named “gadm”. At this point, you can’t actually do much. Typing > fix(gadm) just pops up an editor window that says
<S4 object of class structure("SpatialPolygonsDataFrame", package = "sp")>
which isn’t really useful. We need to load the libraries we’ll need:


> library(sp)
> library(RColorBrewer)

Since I want to see what is in the gadm object, I can just write a CSV of the object by typing > write.csv(gadm, "gadm-data.csv"). This will get us the following file:

As you can see, this spreadsheet is for all the districts in India, but I’m only interested in Tamil Nadu. Furthermore, without first sorting, not all of the Tamil Nadu districts are presented together. To fix this, I can simply create a new object for Tamil Nadu. For this, I only want the rows where the value in the “NAME_1″ column is “Tamil Nadu”. It might also be easier for me if I have a CSV of this new object so that I can arrange my data from the census in the right order, and it’s also easier if the district names (which are under the variable “NAME_2″) are in alphabetical order.


> Tamil.Nadu = gadm[gadm$NAME_1=="Tamil Nadu",] # Select only Tamil Nadu
> Tamil.Nadu = Tamil.Nadu[order(Tamil.Nadu$NAME_2),] # Order by district
> write.csv(Tamil.Nadu, "TN_db_raw.csv") # Write output as CSV

Fortunately, the names for the districts in the GADM dataset mostly match with those from the Indian census, so matching the data is quite straightforward. I’ve gone ahead and uploaded the combined information to Google Docs.

Let’s load this new file.


> TN.DB = read.csv("http://news.mrdwab.com/tnpop", header=T)
> names(TN.DB) # check to get variable names
[1] "NAME_2"     "Total_HH"   "Total_Pop"  "Male_Pop"   "Female_Pop"
[6] "Sex_Ratio"  "Area"       "Pop_Dens"

And now, let’s create a map. While it might be tempting to create a map of, say, total population, that’s not really how choropleth maps should be used. From the data here, without doing any further calculations, the only variable that makes sense to map is population density. I’ll start by doing a quick summary and plot to see what the data look like


# Use "digits=" to make sure that R doesn't round our results
> summary(TN.DB$Pop_Dens, digits=6)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
  278.870   343.480   416.850  1291.610   605.765 24963.500
> plot(TN.DB$Pop_Dens)
Plot of population density

Notice the outlier. This will definitely affect our data if we don't deal with it.

From our plot, I can see that there is an outlier in the data. This is useful to know since if I simply went ahead and created our “bins” with that data point included, I wouldn’t end up capturing the variance between the lower values: they would most likely all be put into one bin.

In this map, I’m going to be lazy and simply divide the population density range into bins that are (more-or-less) equal sizes. You can also use quartiles if you prefer.


# Set your lower and upper limits, excluding the outlier, and length 
# equal to one more than the number of bins you actually want.
> TNPopDense = c(round(seq(from=0, to=1100, length=8), digits=0),
+              round(max(TN.DB$Pop_Dens)))
> TNPopDense # Preview our breaks
[1]     0   157   314   471   629   786   943  1100 24964
# Use our breaks to put each of the values in the "Pop_Dens" column
# into its corresponding bin.
> PopDenseRange = as.factor(as.numeric(cut(TN.DB$Pop_Dens, TNPopDense)))
> PopDenseRange # Preview which bins the districts fall into.
 [1] 3 8 4 4 2 3 3 5 7 3 5 4 3 3 2 2 2 4 2 5 3 5 4 3 4 3 3 4 3 3
Levels: 2 3 4 5 7 8

Note that this may not have been the best way to do this–I wanted to put each district in one of eight levels, preferably using all levels, but they only fall into six levels with this particular set of data.

Now, using the levels that we got in the previous step, we can work on our “legend” for the map. We can use “> 157″, “157 – 314″, and so on as our legend entries. After that, we merge a column into our TN.DB object to store these new values.


> levels(PopDenseRange) = list("< 157"="1", "157 - 314"="2", "315 - 471"="3",
+                         "472 - 629"="4", "630 - 786"="5", "787 - 943"="6",
+                         "943 - 1,100"="7", "> 1,100"="8")
> TN.DB$PopDenseRange = PopDenseRange # Merging info
> shadePopDense = brewer.pal(8, "Blues") # Setting up the coloring scheme
> Tamil.Nadu$PopDenseRange = PopDenseRange
# And now we plot it!
> PopDensePlot = spplot(Tamil.Nadu, "PopDenseRange", col="blue",
 +               col.regions=shadePopDense,
 +               main="Tamil Nadu population density by district")
> PopDensePlot
Tamil Nadu Population Density

Here's our final map of population density.

And that’s pretty much it! I’m sure there’s a lot of room for improvement in the code and the process, but overall, not too bad for just a few lines of syntax in R. [R also exports really nice PDF files.]

Aside from referring to the typical R documentation, this example really helped me to figure out what I needed to do.


Related posts (possibly):

  1. Getting data into R When you first open R, you’re greeted with a screen...
  2. A little spark for presenting your data For some reason, I’ve been obsessing over the presentation of...
  3. Quickly reshaping data from “wide” to “long” formats in R A lot of the times, students at the Academy enter...
  4. R is like a giant calculator for grownups One of the things that is interesting about R is...
  5. Using the reshape package in R for pivot-table-like functionality A little more than a week ago, I wrote about...
This entry was posted in (all categories), Geekiness, India, Useless Knowledge and tagged , , . Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

2 Comments

  1. Axle
    Posted May 18, 2010 at 1:54 pm | Permalink

    Crazy man!

  2. Posted May 18, 2010 at 3:03 pm | Permalink

    It must run in the family….

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

blog comments powered by Disqus