Sounds interesting. Is that a regular expression?

I’ve been meaning to learn how to use regular expressions for quite some time now, but just never seemed to get around to doing so. The other night, I decided to take a stab at them though, and over the past few days, I’ve sort of managed to learn a few tricks. Some of these might seem unnecessary, particularly since the examples comprise relatively small chunks of text. But, hopefully you can also see the application of the same techniques for larger text files. In some of the examples, I’ve also included how it might help with preparing your data for use with a program like R. For all of these examples, I’ve used Geany as my text editor. I suggest you use a good text editor like Geany or Notepad-plus-plus too.

Example 1—Changing numeric date formats

Imagine we’re given a file containing dates in the form of m(m)/d(d)/yyyy and someone gives us totally arbitrary instructions to change it to Y:yyyy (tab) M:mm (tab) D:dd (really, I can’t tell you who or why.)

Below is our starting text. Note that some of the months and days have only one digit, while some single digit dates are entered with a preceding zero.

09/05/1978
12/11/2003
11/9/2010
3/13/2001

My solution:

Search for this Replace with this Why?
^([0-9])/ 0\1/ To find single digit months and fill with a preceding zero.
/([0-9])/ /0\1/ To find single digit days and fill with a preceding zero.
^([0-9]+)/([0-9]+)/([0-9]+) Y:\3\tM:\2\tD:\1 Separates a date into three sections that we are able to rearrange as we see fit.

The sections in the regular expression search pattern enclosed by parentheses () become “references” that we can refer to by their location. In other words, notice that the last regular expression search pattern had three pairs of parentheses. The first one ([0-9]+) searches for a number. The + says to keep going until you find the next search item, which is a forward slash. The ^ before the first item indicates that this should be matched at the start of a line.

If you followed the instructions correctly, you should get the following:

Y:1978	M:05	D:09
Y:2003	M:11	D:12
Y:2010	M:09	D:11
Y:2001	M:13	D:03

Example 2—Changing names around

The same crazy person that wanted that odd date format also wanted us to change this list of names from being “First-name Last-name” to being “Last-name, First-initial”. We could do this manually, but why should we? Here’s what we start with:

Ethan Nakata
Stepanie Foutz
Nikole Pritt
Lesley Ramsay
Lucienne Anderson
Ardith Guo
Kassie Roloff
Kathy Edie
Kellee Rowse
Effie Bensinger
Bethel Gravel
Kathaleen Kovac
Candance Clauss
Sherell Dobrowolski
Kym Thurmon
Xiomara Tocci
Brice Tallon
Natalya Bouldin
Jacki Parise
Evonne Mun
Search for this Replace with this Why?
^([A-Z])([a-z]+) ([A-Za-z]+) \3, \1. The first set of parentheses selects just the first initial, the second set, the rest of the first name, the third set, the entire last name. In the replace box, we remove the second reference, and insert a comma and space in between reference three and one, and a period after reference 1.

Here’s the result:

Nakata, E.
Foutz, S.
Pritt, N.
Ramsay, L.
Anderson, L.
Guo, A.
Roloff, K.
Edie, K.
Rowse, K.
Bensinger, E.
Gravel, B.
Kovac, K.
Clauss, C.
Dobrowolski, S.
Thurmon, K.
Tocci, X.
Tallon, B.
Bouldin, N.
Parise, J.
Mun, E.

Example 3—Import really ugly data cut and pasted from a PDF into R

Someone thought that it would be a good idea to copy a part of this table and send it to you as a text file (again, who knows why…). You want to import the data into R and use it. Can you do it efficiently? You’re not actually interested in everything. You are most interested in the “Month” column and the values in the columns titled “1st” to “99th”.

Year: Month
Month
L
M
S
1st
3rd
5th
15th
25th
50th
75th
85th
95th
97th
99th
0: 0
0
0.3809
3.2322
0.14171
2.3
2.4
2.5
2.8
2.9
3.2
3.6
3.7
4.0
4.2
4.4
0: 1
1
0.1714
4.1873
0.13724
3.0
3.2
3.3
3.6
3.8
4.2
4.6
4.8
5.2
5.4
5.7
0: 2
2
0.0962
5.1282
0.13000
3.8
4.0
4.1
4.5
4.7
5.1
5.6
5.9
6.3
6.5
6.9
0: 3
3
0.0402
5.8458
0.12619
4.4
4.6
4.7
5.1
5.4
5.8
6.4
6.7
7.2
7.4
7.8
0: 4
4
-0.0050
6.4237
0.12402
4.8
5.1
5.2
5.6
5.9
6.4
7.0
7.3
7.9
8.1
8.6
0: 5
5
-0.0430
6.8985
0.12274
5.2
5.5
5.6
6.1
6.4
6.9
7.5
7.8
8.4
8.7
9.2
0: 6
6
-0.0756
7.2970
0.12204
5.5
5.8
6.0
6.4
6.7
7.3
7.9
8.3
8.9
9.2
9.7
0: 7
7
-0.1039
7.6422
0.12178
5.8
6.1
6.3
6.7
7.0
7.6
8.3
8.7
9.4
9.6
10.2
0: 8
8
-0.1288
7.9487
0.12181
6.0
6.3
6.5
7.0
7.3
7.9
8.6
9.0
9.7
10.0
10.6
0: 9
9
-0.1507
8.2254
0.12199
6.2
6.6
6.8
7.3
7.6
8.2
8.9
9.3
10.1
10.4
11.0
0:10
10
-0.1700
8.4800
0.12223
6.4
6.8
7.0
7.5
7.8
8.5
9.2
9.6
10.4
10.7
11.3
0:11
11
-0.1872
8.7192
0.12247
6.6
7.0
7.2
7.7
8.0
8.7
9.5
9.9
10.7
11.0
11.7
1: 0
12
-0.2024
8.9481
0.12268
6.8
7.1
7.3
7.9
8.2
8.9
9.7
10.2
11.0
11.3
12.0

This can actually be done with two regular expression statements and two or three lines of code in R. First, the regular expressions:

Search for this Replace with this Why?
^([01]:[ |0-9]+) Nothing This removes the Year: Month column.
^([0-9]|[0-9-]+)\.([0-9]{4,5}) Nothing This removes the numeric values for the "L", "M", and "S" columns.
^([A-Z]{1})\r\n Nothing This removes the actual text "L", "M", and "S" from lines 3-5 of the unprocessed file.

If you did this correctly, you should end up with a text file like the one linked here.

Year: Month
Month
1st
3rd
5th
15th
25th
50th
75th
85th
95th
97th
99th

0



2.3
2.4
2.5
2.8
2.9
3.2
3.6
3.7
4.0
4.2
4.4

1



3.0
3.2
3.3
3.6
3.8
4.2
4.6
4.8
5.2
5.4
5.7

2



3.8
4.0
4.1
4.5
4.7
5.1
5.6
5.9
6.3
6.5
6.9

3



4.4
4.6
4.7
5.1
5.4
5.8
6.4
6.7
7.2
7.4
7.8

4



4.8
5.1
5.2
5.6
5.9
6.4
7.0
7.3
7.9
8.1
8.6

5



5.2
5.5
5.6
6.1
6.4
6.9
7.5
7.8
8.4
8.7
9.2

6



5.5
5.8
6.0
6.4
6.7
7.3
7.9
8.3
8.9
9.2
9.7

7



5.8
6.1
6.3
6.7
7.0
7.6
8.3
8.7
9.4
9.6
10.2

8



6.0
6.3
6.5
7.0
7.3
7.9
8.6
9.0
9.7
10.0
10.6

9



6.2
6.6
6.8
7.3
7.6
8.2
8.9
9.3
10.1
10.4
11.0

10



6.4
6.8
7.0
7.5
7.8
8.5
9.2
9.6
10.4
10.7
11.3

11



6.6
7.0
7.2
7.7
8.0
8.7
9.5
9.9
10.7
11.0
11.7

12



6.8
7.1
7.3
7.9
8.2
8.9
9.7
10.2
11.0
11.3
12.0

Do a “select all” and switch over to R. We’re going to first create a matrix with the values from line 15 to the end, and then create the column names for the matrix from lines 2 through 13. Do the following magic:

birthweight.percentiles = matrix(scan("clipboard", skip=14), ncol=12, byrow=T)
# Read 156 items
colnames(birthweight.percentiles) = scan("clipboard", what="character", skip=1, n=12)
# Read 12 items
birthweight.percentiles
#       Month 1st 3rd 5th 15th 25th 50th 75th 85th 95th 97th 99th
#  [1,]     0 2.3 2.4 2.5  2.8  2.9  3.2  3.6  3.7  4.0  4.2  4.4
#  [2,]     1 3.0 3.2 3.3  3.6  3.8  4.2  4.6  4.8  5.2  5.4  5.7
#  [3,]     2 3.8 4.0 4.1  4.5  4.7  5.1  5.6  5.9  6.3  6.5  6.9
#  [4,]     3 4.4 4.6 4.7  5.1  5.4  5.8  6.4  6.7  7.2  7.4  7.8
#  [5,]     4 4.8 5.1 5.2  5.6  5.9  6.4  7.0  7.3  7.9  8.1  8.6
#  [6,]     5 5.2 5.5 5.6  6.1  6.4  6.9  7.5  7.8  8.4  8.7  9.2
#  [7,]     6 5.5 5.8 6.0  6.4  6.7  7.3  7.9  8.3  8.9  9.2  9.7
#  [8,]     7 5.8 6.1 6.3  6.7  7.0  7.6  8.3  8.7  9.4  9.6 10.2
#  [9,]     8 6.0 6.3 6.5  7.0  7.3  7.9  8.6  9.0  9.7 10.0 10.6
# [10,]     9 6.2 6.6 6.8  7.3  7.6  8.2  8.9  9.3 10.1 10.4 11.0
# [11,]    10 6.4 6.8 7.0  7.5  7.8  8.5  9.2  9.6 10.4 10.7 11.3
# [12,]    11 6.6 7.0 7.2  7.7  8.0  8.7  9.5  9.9 10.7 11.0 11.7
# [13,]    12 6.8 7.1 7.3  7.9  8.2  8.9  9.7 10.2 11.0 11.3 12.0

# Optional: If you need or prefer a data frame instead of a matrix, run:
# birthweight.percentiles = as.data.frame(birthweight.percentiles)

The first line of R code scans in the data values and fills them into a matrix with 12 columns, filling by row from left to right. The second line adds the column names. Notice how we were able to use “skip” and “n” to select the values we were interested in at each stage.

Example 4—WordPress.com’s monthly stats table is ugly

Really it is! Here’s a screenshot. It looks pretty, right?

But, if you copy it into a an Excel spreadsheet, you get this:

Ugh. How are we supposed to work with this?

Well, it depends on if you copied from Excel to your text editor, or if you copied directly from the WordPress stats screen to a text editor. I’ll cover both scenarios below.

Copying from WordPress to Excel to your text editor

If you copied from WordPress to Excel to your text editor, you’d end up with a text file like the one linked here.

24-Jan	25-Jan	26-Jan	27-Jan	28-Jan	29-Jan	30-Jan	201	29


19	27	14	32	73	25	11

31-Jan	1-Feb	2-Feb	3-Feb	4-Feb	5-Feb	6-Feb	302	43	50.25%


23	15	63	49	52	29	71

7-Feb	8-Feb	9-Feb	10-Feb	11-Feb	12-Feb	13-Feb	369	53	22.19%


59	35	24	88	29	96	38

14-Feb	15-Feb	16-Feb	17-Feb	18-Feb	19-Feb	20-Feb	376	54	1.90%


115	96	60	41	15	24	25

21-Feb	22-Feb	23-Feb	24-Feb	25-Feb	26-Feb	27-Feb	291	42	-22.61%


51	29	57	23	79	21	31

28-Feb	1-Mar	2-Mar	3-Mar	4-Mar			187	40	-3.18%


25	54	36	46	26

This is easy to clean up and import into R. You can simply search for ^([0-9]{1,2}-[A-Z].*), replace it with nothing, copy what’s left and read it into R by using something like data = scan("clipboard").

Notice that we sacrifice the dates here though.

Copying from WordPress to your text editor

If you copied from WordPress to your text editor, you’d end up with a text file like the one linked here.

Jan 24

19
Jan 25

27
Jan 26

14
Jan 27

32
Jan 28

73
Jan 29

25
Jan 30

11
201	29
Jan 31

23
Feb 1

15
Feb 2

63
Feb 3

49
Feb 4

52
Feb 5

29
Feb 6

71
302	43	+50.25%
Feb 7

59
Feb 8

35
Feb 9

24
Feb 10

88
Feb 11

29
Feb 12

96
Feb 13

38
369	53	+22.19%
Feb 14

115
Feb 15

96
Feb 16

60
Feb 17

41
Feb 18

15
Feb 19

24
Feb 20

25
376	54	+1.90%
Feb 21

51
Feb 22

29
Feb 23

57
Feb 24

23
Feb 25

79
Feb 26

21
Feb 27

31
291	42	-22.61%
Feb 28

25
Mar 1

54
Mar 2

36
Mar 3

46
Mar 4

26
187	40	-3.18%

From this file, we can either search and replace in a way that we retain the dates or we drop them and keep just the stats.

Keep the dates

Search for this Replace with this Why?
\r\n\r\n \t This isn't actually with regular expressions, but with "escape sequences". We're replacing two line spaces with a tab. This will result in a format like "Date tab number-of-visits". It will also make it easier for us to do the next step of search and replace.
^([0-9].*) Nothing In our previous step, we ended up with a convenient scenario where the lines we're interested in start with a character, and all the other lines start with a number. We can now easily remove those lines.

You can now easily copy this to your clipboard and read this into R using data = read.table("clipboard", header=F). Don’t forget to add header=F or else R will think the first line is the column names.

Just gimme the data

Do everything you did before, but add one more regular expression search and replace. Search for ^([A-Z].*)\t(.*) and replace it with \2. This will create a list of just the data that can be read into R by using something like data = scan("clipboard").

Example 5—Merged cells are a pain….

Your friend gives you this PDF with a beautiful table in it, and you need to extract the data. When you copy it into Microsoft Word, OpenOffice Writer, or whatever you prefer, it looks terrible, like the unformatted text below:

1. Organization M (117) 1 Andhra Pradesh 7
2 Arunachal Pradesh 8
3 Assam 8
4 Bihar 24
5 Chattisgarh 2
6 Goa 15
7 Gujarat 19
8 Haryana 4
9 Himachal Pradesh 14
10 Jammu and Kashmir 2
11 Jharkhand 2
12 Karnataka 4
13 Kerala 2
14 Madhya Pradesh 2
15 Maharashtra 2
16 Manipur 2
2. Foundation X (69) 17 Meghalaya 29
18 Mizoram 10
19 Nagaland 4
20 Odisha 12
21 Puducherry 14
3. NGO Z (8) 22 Punjab 8
4. Government (16) 23 Rajasthan 16
5. Research Institute A (8) 24 Sikkim 4
25 Tamil Nadu 4
6. Organization C (36) 26 Tripura 8
27 Uttar Pradesh 15
28 Uttarakhand 1
29 West Bengal 12

This is actually also pretty easy to fix with the following set of regular expressions.

Search for this Replace with this Why?
^([0-9]\. )(.*)\(([0-9]+)\)( ) \t\2\t\r\n We're trying to match just the name of the organization and put each organization on a line by itself, preceded and followed by a tab.
^([0-9]+) (.*) (.*) \1\t\2\t\3 We want to break up the state information into three parts and insert a tab to indicate each column.

After doing these two steps, your data should now look like this:

	Organization M
1	Andhra Pradesh	7
2	Arunachal Pradesh	8
3	Assam	8
4	Bihar	24
5	Chattisgarh	2
6	Goa	15
7	Gujarat	19
8	Haryana	4
9	Himachal Pradesh	14
10	Jammu and Kashmir	2
11	Jharkhand	2
12	Karnataka	4
13	Kerala	2
14	Madhya Pradesh	2
15	Maharashtra	2
16	Manipur	2
	Foundation X
17	Meghalaya	29
18	Mizoram	10
19	Nagaland	4
20	Odisha	12
21	Puducherry	14
	NGO Z
22	Punjab	8
	Government
23	Rajasthan	16
	Research Institute A
24	Sikkim	4
25	Tamil Nadu	4
	Organization C
26	Tripura	8
27	Uttar Pradesh	15
28	Uttarakhand	1
29	West Bengal	12

This is now a tab delimited file, so it can easily be imported into any program that uses such files, or can be pasted into a word processor and converted into a table by using the “text to table” feature that any decent word processor should have.

Bonus session

Imagine (for whatever reason) you wanted the table from Example 5 to be used in R. Here’s how you can do it. First, count how many sites each organization is working in and make a note of that. In this case, it’s 16, 5, 1, 1, 2, 4. Then, do the following:

Search for this Replace with this Why?
^([0-9]\. )(.*) \(.* "\2" We want just the organization's name, and this is the first step to help us with that. Be sure to put the replace reference in quotes.
^[0-9].* Nothing We now have just the names of each organization (with a lot of blank lines in between).

Copy whatever is in your text editor, switch over to R and enter the following:

orgs = rep(scan("clipboard", what="character"), c(16, 5, 1, 1, 2, 4))

Switch back to your text editor, and revert the file to its original state (just do a couple of undos). Then, do what you did in Example 5 except:

  • In the first step, instead of replacing with \t\2\t\r\n, replace with nothing.
  • In the next step, instead of replacing with \1\t\2\t\3, replace with \2\t\3.
Again, copy whatever’s in your text file, switch back over to R, and do the following:

organizations = read.delim("clipboard", header=F)
cbind(orgs, organizations)
#                    orgs                V1 V2
# 1        Organization M    Andhra Pradesh  7
# 2        Organization M Arunachal Pradesh  8
# 3        Organization M             Assam  8
# 4        Organization M             Bihar 24
# 5        Organization M       Chattisgarh  2
# 6        Organization M               Goa 15
# 7        Organization M           Gujarat 19
# 8        Organization M           Haryana  4
# 9        Organization M  Himachal Pradesh 14
# 10       Organization M Jammu and Kashmir  2
# 11       Organization M         Jharkhand  2
# 12       Organization M         Karnataka  4
# 13       Organization M            Kerala  2
# 14       Organization M    Madhya Pradesh  2
# 15       Organization M       Maharashtra  2
# 16       Organization M           Manipur  2
# 17         Foundation X         Meghalaya 29
# 18         Foundation X           Mizoram 10
# 19         Foundation X          Nagaland  4
# 20         Foundation X            Odisha 12
# 21         Foundation X        Puducherry 14
# 22                NGO Z            Punjab  8
# 23           Government         Rajasthan 16
# 24 Research Institute A            Sikkim  4
# 25 Research Institute A        Tamil Nadu  4
# 26       Organization C           Tripura  8
# 27       Organization C     Uttar Pradesh 15
# 28       Organization C       Uttarakhand  1
# 29       Organization C       West Bengal 12
More Reading
comments powered by Disqus