In my last post (Sounds interesting. Is that a regular expression?), I showed a few things I had recently figured out about regular expressions. By now, you have probably noticed that I like working things out in R, and applying regular expressions is one of those things.
Since R is scriptable, it is easy to chain a series of regular expressions to get the results you need. Consider the following, which uses this text file as the input and gives us the same output as “Example 3” from my earlier post:
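# Read the raw text file, one element per line.
a = readLines("http://news.mrdwab.com/wp-content/uploads/2011/03/unprocessed-1.txt")
# Remove the text matched by each unwanted pattern.
b = gsub("^([01]:[ |0-9]+)$", "", a)
b = gsub("^([0-9]|[0-9-]+)\\.([0-9]{4,5})", "", b)
b = gsub("^([A-Z])$", "", b)
# Skip the first 17 lines, then read the numbers into a 12-column matrix, row by row.
birthweight.percentiles = matrix(scan(textConnection(b), skip=17),
                                 ncol=12, byrow=TRUE)
# Skip the first 5 lines, then read 11 labels to use as column names.
colnames(birthweight.percentiles) = c("Month",
                                      scan(textConnection(b),
                                           what="character",
                                           skip=5, n=11))
birthweight.percentiles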
## Month 1st 3rd 5th 15th 25th 50th 75th 85th 95th 97th 99th
## [1,] 0 2.3 2.4 2.5 2.8 2.9 3.2 3.6 3.7 4.0 4.2 4.4
## [2,] 1 3.0 3.2 3.3 3.6 3.8 4.2 4.6 4.8 5.2 5.4 5.7
## [3,] 2 3.8 4.0 4.1 4.5 4.7 5.1 5.6 5.9 6.3 6.5 6.9
## [4,] 3 4.4 4.6 4.7 5.1 5.4 5.8 6.4 6.7 7.2 7.4 7.8
## [5,] 4 4.8 5.1 5.2 5.6 5.9 6.4 7.0 7.3 7.9 8.1 8.6
## [6,] 5 5.2 5.5 5.6 6.1 6.4 6.9 7.5 7.8 8.4 8.7 9.2
## [7,] 6 5.5 5.8 6.0 6.4 6.7 7.3 7.9 8.3 8.9 9.2 9.7
## [8,] 7 5.8 6.1 6.3 6.7 7.0 7.6 8.3 8.7 9.4 9.6 10.2
## [9,] 8 6.0 6.3 6.5 7.0 7.3 7.9 8.6 9.0 9.7 10.0 10.6
## [10,] 9 6.2 6.6 6.8 7.3 7.6 8.2 8.9 9.3 10.1 10.4 11.0
## [11,] 10 6.4 6.8 7.0 7.5 7.8 8.5 9.2 9.6 10.4 10.7 11.3
## [12,] 11 6.6 7.0 7.2 7.7 8.0 8.7 9.5 9.9 10.7 11.0 11.7
## [13,] 12 6.8 7.1 7.3 7.9 8.2 8.9 9.7 10.2 11.0 11.3 12.0
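If the file above is not reachable, the same clean-then-scan idea can be tried on a small inline vector. This is only a sketch: the `raw` lines below are made up for illustration and are not taken from the original file, and the connections are closed explicitly, which the code above does not do.

```r
# Hypothetical stand-in for the downloaded file: a title, a header line,
# two data rows, and a stray single-letter line between them.
raw <- c("Toy table",
         "1st 50th 99th",
         "0 2.3 3.2 4.4",
         "A",
         "1 3.0 4.2 5.7")

# Blank out lines that consist of a single capital letter.
cleaned <- gsub("^([A-Z])$", "", raw)

# Skip the title line and read the three column labels as text.
con <- textConnection(cleaned)
labels <- scan(con, what = "character", skip = 1, n = 3, quiet = TRUE)
close(con)

# Skip the title and header lines and read the remaining numbers.
con <- textConnection(cleaned)
values <- scan(con, skip = 2, quiet = TRUE)
close(con)

# Fill the matrix row by row and label the columns.
toy <- matrix(values, ncol = 4, byrow = TRUE)
colnames(toy) <- c("Month", labels)
toy
```

Without the gsub step, the numeric scan would stop with an error at the "A" line, which is why the stray lines are blanked first.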
Similarly, we can replicate the “bonus session” (which is based on this text file) as follows:
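n = readLines("http://news.mrdwab.com/wp-content/uploads/2011/03/unprocessed-5.txt")
# On lines that start with a digit, a period, and a space, keep only the name, in single quotes.
org.name = gsub("^([0-9]\\. )(.*) \\(.*", "'\\2'", n)
# Blank out the remaining lines that start with a digit.
org.name = gsub("^[0-9].*", "", org.name)
# Repeat each organization name once for every state row that belongs to it.
orgs = rep(scan(textConnection(org.name),
                what="character"), c(16, 5, 1, 1, 2, 4))
# Remove the organization prefix, then drop the leading number and join the rest with a comma.
ss = gsub("^([0-9]\\. )(.*)\\(([0-9]+)\\)( )", "", n)
ss = gsub("^([0-9]+) (.*) (.*)", "\\2,\\3", ss)
states.sites = read.csv(textConnection(ss), header=FALSE)
operation.areas = cbind(orgs, states.sites)
colnames(operation.areas) = c("Organization", "State", "Sites")
operation.areas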
## Organization State Sites
## 1 Organization M Andhra Pradesh 7
## 2 Organization M Arunachal Pradesh 8
## 3 Organization M Assam 8
## 4 Organization M Bihar 24
## 5 Organization M Chattisgarh 2
## 6 Organization M Goa 15
## 7 Organization M Gujarat 19
## 8 Organization M Haryana 4
## 9 Organization M Himachal Pradesh 14
## 10 Organization M Jammu and Kashmir 2
## 11 Organization M Jharkhand 2
## 12 Organization M Karnataka 4
## 13 Organization M Kerala 2
## 14 Organization M Madhya Pradesh 2
## 15 Organization M Maharashtra 2
## 16 Organization M Manipur 2
## 17 Foundation X Meghalaya 29
## 18 Foundation X Mizoram 10
## 19 Foundation X Nagaland 4
## 20 Foundation X Odisha 12
## 21 Foundation X Puducherry 14
## 22 NGO Z Punjab 8
## 23 Government Rajasthan 16
## 24 Research Institute A Sikkim 4
## 25 Research Institute A Tamil Nadu 4
## 26 Organization C Tripura 8
## 27 Organization C Uttar Pradesh 15
## 28 Organization C Uttarakhand 1
## 29 Organization C West Bengal 12
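Two details in the code above are easy to miss: the single quotes added around each organization name let scan read a multi-word name as one value, and rep with a vector of counts repeats each name a different number of times. A minimal sketch with made-up names and counts (not the ones from the file):

```r
# Hypothetical cleaned lines: quoted names, with a blank line in between.
quoted <- c("'Organization M'", "", "'Foundation X'")

# scan() honours the quotes, so each name comes back as a single element.
con <- textConnection(quoted)
org <- scan(con, what = "character", quiet = TRUE)
close(con)

# Repeat the first name 3 times and the second name 2 times.
orgs.toy <- rep(org, c(3, 2))
orgs.toy
```

The counts have to be worked out by hand from the source file, so this approach suits a one-off clean-up better than a reusable script.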
Notice the use of readLines to import the text file, gsub to perform each search and replace, textConnection to treat an R object as a text file, and the escaped backslashes (inside an R string, each regex backslash has to be written as \\). The other steps are more or less the same as they would be if we were using a good text editor. By the way, the inspiration for this came from here.
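To make the escaped backslashes concrete, here is a small sketch that applies two of the patterns from above to sample strings. The strings are invented for illustration and only assume the general shape of the lines in the file.

```r
# Hypothetical sample lines in the assumed shape of the input.
org.line   <- "1. Organization M (16)"
state.line <- "12 Andhra Pradesh 7"

# "\\." is the regex \. (a literal period) and "\\(" is a literal "(".
# "\\2" in the replacement refers to the second parenthesized group.
name <- gsub("^([0-9]\\. )(.*) \\(.*", "'\\2'", org.line)

# Drop the leading number and join the last two groups with a comma.
pair <- gsub("^([0-9]+) (.*) (.*)", "\\2,\\3", state.line)

name
pair
```

Because (.*) is greedy, the second group in the last pattern takes everything up to the final space, so a two-word state name stays in one piece.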