Wednesday, October 15, 2008

Grep and Regular Expressions


I love the powerful matching that regular expressions achieve in processing text.

As I was working on a data mining project the other day, I wanted to screen-scrape some data and needed some regular expressions. I was a little rusty, so after a few failed attempts at getting my patterns right I went to the grep man page for help.

The man page for grep is also the man page for egrep and fgrep. Keep in mind that on some systems egrep is simply an alias for "grep -E", so if you run egrep -P (see below), you may get:

grep: conflicting matchers specified


As I was reading the part about regular expressions, I noticed that grep supports three types of regular expressions: basic (-G), extended (-E) and Perl (-P).

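The documentation noted that basic regular expressions are traditionally the weaker type, but that in GNU grep there is no difference in available functionality between the basic and extended syntaxes. The difference is only in notation: in basic syntax the metacharacters ?, +, {, |, ( and ) lose their special meaning unless they are preceded by a backslash.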

The extended regular expressions are very powerful indeed. However, I really wanted to use "lazy" pattern matching, and Perl seemed to have exactly the expression I was looking for:

{n,m}? Match at least n but not more than m times, not greedily.
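To make the difference concrete, here is a minimal sketch in Python rather than grep. Python is my choice for illustration only, on the assumption that its re module follows the Perl-style quantifier syntax; the sample string is made up.

```python
import re

# Made-up sample input: six "a" characters in a row.
text = "aaaaaa"

# Greedy {2,4}: takes as many repetitions as it is allowed (here 4).
greedy = re.match(r"a{2,4}", text).group()

# Lazy {2,4}?: stops as soon as the minimum (here 2) is satisfied.
lazy = re.match(r"a{2,4}?", text).group()

print(greedy)  # aaaa
print(lazy)    # aa
```

Both patterns accept the same range of repetitions; the trailing question mark only changes which end of the range is tried first.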

grep didn't seem to support this lazy form. I could have used a workaround or a Perl one-liner, but I was more interested in doing this with one expression.

I looked into using grep -P to get Perl expressions; however, when I tried it I got:

grep: The -P option is not supported

As I discovered, grep may or may not come with the Perl option compiled in. At this point grep -P doesn't seem to work at all for me; however, pcregrep seems to be able to do this job.




Oh well. One of the most important things to keep in mind is that, generally speaking, regular expression quantifiers are greedy by default; that is, they will match as much of the input as they can.

Also, generally speaking, if the rest of the match fails because an earlier quantifier was too greedy, the expression engine will backtrack, make that quantifier give back some of what it matched, and then retry the rest of the pattern.
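Greediness and backtracking can be seen in a small sketch. I am using Python's re module here as a stand-in for a backtracking engine (an assumption for illustration, not what grep does internally), and the input strings are invented.

```python
import re

# \d+ is greedy and would like all five digits, but then the final \d
# has nothing left to match. The engine backtracks: \d+ gives back one
# digit so the overall match can succeed.
m = re.match(r"(\d+)(\d)", "12345")
digits = m.groups()
print(digits)  # ('1234', '5')

# Same idea with quotes: .* first runs to the end of the line, then
# backs up only as far as needed, so it stops at the LAST quote.
line = 'name="alpha" id="beta"'
greedy_capture = re.search(r'"(.*)"', line).group(1)
print(greedy_capture)  # alpha" id="beta
```

The second result is the classic screen-scraping surprise: the pattern matches, just not the short piece you had in mind.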

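In Perl-style syntax you may also turn off the greediness by putting a question mark after the particular quantifier (for example *?, +? or {n,m}?). POSIX basic and extended expressions do not define this form, which is why it depends on the tool.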
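For the kind of screen scraping I was doing, the effect of that question mark looks roughly like this. Again this is only a sketch in Python's re module, assumed here because it accepts Perl-style lazy quantifiers; the HTML snippet is made up.

```python
import re

# Made-up snippet with two bold elements on one line.
html = "<b>one</b> and <b>two</b>"

# Greedy: .* runs from the first <b> to the last </b> in one match.
greedy = re.findall(r"<b>(.*)</b>", html)

# Lazy: .*? stops at the nearest </b>, giving one match per element.
lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```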


There are a lot of options to regular expressions, and they may change from system to system and utility to utility.

The bottom line is that no matter what tool or utility you are using, whether it is awk, sed, grep or Perl, make sure you read the proper man page and understand exactly which pattern options it supports; otherwise you may get a little frustrated.

More Info:
http://packages.debian.org/search?keywords=pcregrep
