Tuesday, February 3, 2009

Super Grep: pcregrep


Per the website, the PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5.

Once downloaded and built, it gives you the latest set of tools, including pcregrep, which can stand in for your existing grep command.


For example, say I have some data I'd like to match, such as a bug ID in an engineer's case log file.


We know the engineers have been taught well, but they don't always refer to issues consistently. For example:


Thank you for contacting regarding this matter. We have filed "Bug ID - 1991782" for this issue.

Thank you for contacting us. We have filed "Bug - 1991782" for this issue.

Thank you for contacting us today. We have filed "Issue - 1991782" for this issue.

(Each of the three examples above is a single line, but this blog may wrap them. See the first picture below.)


Our regex may be something like:

grep -Ei '\W(ID|bug|issue)[[:space:]]*(-|:#)*[[:space:]]*[1-9][0-9]{4,6}' ticket.123.txt


This matches all three instances, which is great:
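To try this at home, here is a minimal sketch that recreates the three sample lines and runs the same regex over them. The scratch directory and the variable names (workdir, pattern, match_count) are my own placeholders, and both \W and the -o option are assumed to be available, as they are in GNU grep.

```shell
# Recreate the three sample lines in a scratch directory (hypothetical location).
workdir=$(mktemp -d)
cat > "$workdir/ticket.123.txt" <<'EOF'
Thank you for contacting regarding this matter. We have filed "Bug ID - 1991782" for this issue.
Thank you for contacting us. We have filed "Bug - 1991782" for this issue.
Thank you for contacting us today. We have filed "Issue - 1991782" for this issue.
EOF

# Same regex as in the post, kept in a variable so it can be reused.
pattern='\W(ID|bug|issue)[[:space:]]*(-|:#)*[[:space:]]*[1-9][0-9]{4,6}'

# -c counts the lines that match; -i ignores case; -E selects extended regex.
match_count=$(grep -Eic "$pattern" "$workdir/ticket.123.txt")
echo "matching lines: $match_count"

# -o prints only the part of each line that matched (assumes GNU or BSD grep).
grep -Eio "$pattern" "$workdir/ticket.123.txt"
```

With all three lines unwrapped, the count should come back as 3, one per sample line.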





But what if the content system we're accessing causes the line to be wrapped as follows?


Thank you for contacting us. We have filed "Bug -
1991782" for this issue.



Well, grep -E (-E enables extended regular expressions) will not match this: grep tests each line on its own, so a pattern can never match across a newline.


However, pcregrep will match this line using -M, for "multiline" mode:
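A rough sketch of the comparison follows, assuming pcregrep is installed and on the PATH (the script only reports its absence otherwise; newer PCRE2 releases ship the tool as pcre2grep instead). The file name wrapped.txt and the variable names are placeholders of mine.

```shell
# Write the wrapped version of the sample into a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' \
    'Thank you for contacting us. We have filed "Bug -' \
    '1991782" for this issue.' > "$workdir/wrapped.txt"

pattern='\W(ID|bug|issue)[[:space:]]*(-|:#)*[[:space:]]*[1-9][0-9]{4,6}'

# grep checks each line separately, so neither half matches on its own.
# grep exits non-zero when nothing matches, hence the "|| true".
grep_count=$(grep -Eic "$pattern" "$workdir/wrapped.txt" || true)
echo "grep matching lines: $grep_count"

# pcregrep -M lets a single match span more than one line;
# [[:space:]] in the pattern is what absorbs the newline.
if command -v pcregrep >/dev/null 2>&1; then
    pcregrep -i -M "$pattern" "$workdir/wrapped.txt"
else
    echo "pcregrep not found on PATH"
fi
```

Plain grep reports zero matching lines here, while pcregrep -M should print both halves of the wrapped sentence.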





Compiling PCRE was trivial on my Linux machine. I did the standard configure, make, make install (as root).

You can find the latest PCRE at http://www.pcre.org/

Now, we're off to squash more bugs!!