Tuesday, January 13, 2009

Rev - !looT gnitpircS taerG A


I was working on another data mining project and ran into the following problem:

The goal was to monitor a webpage that listed a technical support queue's incoming tickets.

The page included the ticket number, subject of the request and the engineer assigned to the matter.

The web page looked something like this (shortened for clarity):




One of the problems that we were facing was that the HTML was challenging to parse.

Generally, screen scraping is easy if you can hook onto a character or word that is consistently repeated, because you can then pull out the data you want.

In this case we didn't have any line breaks, and we were trying to get the information inside of an HTML table. We'd have needed multiple expressions to work out whether what was between the table's tags was in fact valuable.

We could have used a few regular expressions, but I had a better idea: the utility rev! rev reverses the order of the characters in every line of a file.

We don't know exactly where the column of important data will appear (i.e. the 10th column? the 12th?), but we do know it will be the last column. So why not simply replace all white space with a delimiter (in this case a colon ":") and then reverse the whole line, thereby making the last column first? We then cut out just that first part and reverse the token again to get our engineer.
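To see the trick in isolation, here is a minimal sketch. The sample row and the helper name last_field are made up for illustration, and the real page's layout may differ. One caveat: a line that ends in trailing white space would give an empty result, because the squeezed delimiter would land in front after the reversal.

```shell
#!/bin/bash
# last_field is a hypothetical helper, not part of the original script.
# It prints the last whitespace-separated column of the line passed as $1.
last_field() {
  printf '%s\n' "$1" \
    | tr -s '[:blank:]' ':' \
    | rev \
    | cut -d ':' -f1 \
    | rev
}

# Made-up sample row: ticket number, subject, engineer
last_field '   1234567   Printer will not print   jsmith'
```

Run as is, this should print the engineer's name from the end of the sample row, no matter how many words the subject contains.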


Our script looks like this:


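#!/bin/bash

NAME=name_of_engineer

FILE=supportq.txt
URL="http://server/q.html"

# Dump the rendered page as plain text, one table row per line
lynx "$URL" -dump -nolist -width=5000 > "$FILE"


if grep -q -- "$NAME" "$FILE" ; then

    # Engineer: turn blanks into ':', reverse, take the first field, reverse back
    ENG=$(grep -- "$NAME" "$FILE" | tr -s '[:blank:]' ':' | rev | cut -d ':' -f1 | rev)
    # Ticket number: second field, kept only if it holds 7 to 10 digits in a row
    TKT=$(tr -s '[:blank:]' ':' < "$FILE" | cut -d ':' -f2 | grep -E '[0-9]{7,10}')
    # Issue text: fields 4 through 15 of the line carrying that ticket number
    ISSU=$(grep -- "$TKT" "$FILE" | tr -s '[:blank:]' ':' | cut -d ':' -f4-15)

    echo
    echo "Ticket:$TKT"
    echo "Issue:$ISSU"
    echo "TSE:$ENG"
    echo

    # more processing here

fi

"more processing here" represents where we log in to the SSO system and send a response using this previously seen curl technique.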


The -width=5000 parameter for lynx just keeps the lines from wrapping, which makes the dumped text easier to parse.
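One thing to watch with the -f4-15 cut: if a row has 15 or fewer fields, the engineer's name at the end of the line ends up in the issue text. The same reversal idea can strip it off. The sketch below assumes a row layout of ticket in field 2, subject from field 4 on, and engineer last. issue_text is a hypothetical helper and the sample row is made up; neither comes from the real page.

```shell
#!/bin/bash
# issue_text is a hypothetical helper; the column positions are assumed.
# It prints the subject words of the row passed as $1, without the last column.
issue_text() {
  printf '%s\n' "$1" \
    | tr -s '[:blank:]' ':' \
    | rev | cut -d ':' -f2- | rev \
    | cut -d ':' -f4- \
    | tr ':' ' '
}

# Made-up sample row: ticket number, priority, subject, engineer
issue_text '   1234567   P2   Printer offline after update   jsmith'
```

Reversing, cutting from field 2 onward, and reversing back drops the final column, so the remaining cut only has to know where the subject starts.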

That was it. Very Simple, Short and Sweet.