Regular expressions in R

In my last post, I showed a few things I had recently figured out about regular expressions. By now, you have probably also figured out that I like working things out in R, and applying regular expressions is one of those things.

Since R is scriptable, it is easy to chain a series of regular expressions to get the results you need. Consider the following, which uses this text file as input and gives the same output as “Example 3” from my earlier post:

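# Read the raw text file, one element per line
a = readLines("http://news.mrdwab.com/wp-content/uploads/2011/03/unprocessed-1.txt")
# Blank out lines made up of 0 or 1, a colon, then only spaces, pipes, or digits
b = gsub("^([01]:[ |0-9]+)$", "", a)
# Strip a leading digit (or run of digits and hyphens), a dot, and 4-5 digits
b = gsub("^([0-9]|[0-9-]+)\\.([0-9]{4,5})", "", b)
# Blank out lines that hold a single capital letter
b = gsub("^([A-Z])$", "", b)
# Skip the first 17 lines, then read the remaining numbers into 12 columns
birthweight.percentiles = matrix(scan(textConnection(b), skip=17),
                                 ncol=12, byrow=TRUE)
# Read the 11 percentile labels that follow the first 5 lines
colnames(birthweight.percentiles) = c("Month",
                                      scan(textConnection(b),
                                           what="character",
                                           skip=5, n=11))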
birthweight.percentiles
##       Month 1st 3rd 5th 15th 25th 50th 75th 85th 95th 97th 99th
##  [1,]     0 2.3 2.4 2.5  2.8  2.9  3.2  3.6  3.7  4.0  4.2  4.4
##  [2,]     1 3.0 3.2 3.3  3.6  3.8  4.2  4.6  4.8  5.2  5.4  5.7
##  [3,]     2 3.8 4.0 4.1  4.5  4.7  5.1  5.6  5.9  6.3  6.5  6.9
##  [4,]     3 4.4 4.6 4.7  5.1  5.4  5.8  6.4  6.7  7.2  7.4  7.8
##  [5,]     4 4.8 5.1 5.2  5.6  5.9  6.4  7.0  7.3  7.9  8.1  8.6
##  [6,]     5 5.2 5.5 5.6  6.1  6.4  6.9  7.5  7.8  8.4  8.7  9.2
##  [7,]     6 5.5 5.8 6.0  6.4  6.7  7.3  7.9  8.3  8.9  9.2  9.7
##  [8,]     7 5.8 6.1 6.3  6.7  7.0  7.6  8.3  8.7  9.4  9.6 10.2
##  [9,]     8 6.0 6.3 6.5  7.0  7.3  7.9  8.6  9.0  9.7 10.0 10.6
## [10,]     9 6.2 6.6 6.8  7.3  7.6  8.2  8.9  9.3 10.1 10.4 11.0
## [11,]    10 6.4 6.8 7.0  7.5  7.8  8.5  9.2  9.6 10.4 10.7 11.3
## [12,]    11 6.6 7.0 7.2  7.7  8.0  8.7  9.5  9.9 10.7 11.0 11.7
## [13,]    12 6.8 7.1 7.3  7.9  8.2  8.9  9.7 10.2 11.0 11.3 12.0
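
To make the blank-then-scan idea easier to try without downloading anything, here is a minimal sketch of the same steps on a made-up five-line vector. The vector `raw` and the names `clean`, `con`, `vals`, `labels`, and `m` are my own stand-ins, not part of the original file or script, and the real file has a different layout.

```r
# Hypothetical stand-in for the downloaded file: two junk lines, a header, two data rows
raw <- c("0: 12 34",
         "A",
         "Month 1st 3rd",
         "0 2.3 2.4",
         "1 3.0 3.2")

# gsub() only changes lines that match; the junk lines become empty strings
clean <- gsub("^([01]:[ |0-9]+)$", "", raw)
clean <- gsub("^([A-Z])$", "", clean)

# Skip the two blanked lines and the header, then read the numbers
con <- textConnection(clean)
vals <- scan(con, skip = 3, quiet = TRUE)
close(con)

# Skip only the two blanked lines and read the three header labels
con <- textConnection(clean)
labels <- scan(con, what = "character", skip = 2, n = 3, quiet = TRUE)
close(con)

# Fill the matrix row by row, as in the birth weight example
m <- matrix(vals, ncol = 3, byrow = TRUE)
colnames(m) <- labels
m
```

Each call to scan consumes its connection, which is why the sketch opens a fresh textConnection for the header and closes each one explicitly.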

Similarly, we can replicate the “bonus session” (which is based on this text file) as follows:

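# Read the raw text file, one element per line
n = readLines("http://news.mrdwab.com/wp-content/uploads/2011/03/unprocessed-5.txt")
# On lines that start with "N. ", keep only the name before " (" and quote it
org.name = gsub("^([0-9]\\. )(.*) \\(.*", "'\\2'", n)
# Blank out every remaining line that starts with a digit
org.name = gsub("^[0-9].*", "", org.name)
# Read the quoted names and repeat each one for its number of rows
orgs = rep(scan(textConnection(org.name),
                what="character"), c(16, 5, 1, 1, 2, 4))
# Remove the "N. Name (count) " prefix from the organization lines
ss = gsub("^([0-9]\\. )(.*)\\(([0-9]+)\\)( )", "", n)
# Drop the leading number and join the last two fields with a comma
ss = gsub("^([0-9]+) (.*) (.*)", "\\2,\\3", ss)
# Parse the comma-separated lines into a two-column data frame
states.sites = read.csv(textConnection(ss), header=FALSE)
operation.areas = cbind(orgs, states.sites)
colnames(operation.areas) = c("Organization", "State", "Sites")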
operation.areas
##            Organization             State Sites
## 1        Organization M    Andhra Pradesh     7
## 2        Organization M Arunachal Pradesh     8
## 3        Organization M             Assam     8
## 4        Organization M             Bihar    24
## 5        Organization M       Chattisgarh     2
## 6        Organization M               Goa    15
## 7        Organization M           Gujarat    19
## 8        Organization M           Haryana     4
## 9        Organization M  Himachal Pradesh    14
## 10       Organization M Jammu and Kashmir     2
## 11       Organization M         Jharkhand     2
## 12       Organization M         Karnataka     4
## 13       Organization M            Kerala     2
## 14       Organization M    Madhya Pradesh     2
## 15       Organization M       Maharashtra     2
## 16       Organization M           Manipur     2
## 17         Foundation X         Meghalaya    29
## 18         Foundation X           Mizoram    10
## 19         Foundation X          Nagaland     4
## 20         Foundation X            Odisha    12
## 21         Foundation X        Puducherry    14
## 22                NGO Z            Punjab     8
## 23           Government         Rajasthan    16
## 24 Research Institute A            Sikkim     4
## 25 Research Institute A        Tamil Nadu     4
## 26       Organization C           Tripura     8
## 27       Organization C     Uttar Pradesh    15
## 28       Organization C       Uttarakhand     1
## 29       Organization C       West Bengal    12

Notice the use of readLines to import the text file, gsub to apply the search-and-replace expressions, and textConnection to treat an R object as a text file. Notice also the escaped backslashes: inside an R string a backslash must itself be escaped, so a literal dot is written "\\." and a reference to the second captured group is written "\\2". The other steps are more or less the same as they would be if we were using a good text editor. By the way, the inspiration for this came from here.
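
The escaping and backreference points can be seen in isolation with a small sketch. The input strings below are invented to resemble the lines the expressions expect, so treat them (and the names `one.line`, `rows`, `csv`, `con`, and `df`) as assumptions rather than the actual file contents.

```r
# Hypothetical line: "N. Name (count) " followed by a numbered state row
one.line <- "1. Organization M (16) 1 Andhra Pradesh 7"

# "\\." is a literal dot, "\\(" a literal parenthesis, "\\2" the second group
gsub("^([0-9]\\. )(.*) \\(.*", "\\2", one.line)
# "Organization M"

# Hypothetical state rows: a running number, a state name, a site count
rows <- c("12 Andhra Pradesh 7", "3 Goa 15")

# The first (.*) is greedy, so a multi-word state name stays in one group
csv <- gsub("^([0-9]+) (.*) (.*)", "\\2,\\3", rows)

# textConnection() lets read.csv() treat the character vector as a file
con <- textConnection(csv)
df <- read.csv(con, header = FALSE, stringsAsFactors = FALSE)
colnames(df) <- c("State", "Sites")
df
```

Passing stringsAsFactors = FALSE is my addition so that the state names come back as plain character strings regardless of the R version.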
