Wednesday, October 29, 2008

Lynx -dump as a Screen Scraping tool


Recently I wanted to screen scrape some data from MSN's moneycentral.msn.com site.

The data was all kept in a table like so:



Downloading the HTML code (see shot below) and then parsing the data, or somehow stripping the HTML with a regular expression, could be a bit of a pain. Also, if MSN changes its HTML format, we'd have to redo our script.



If you use lynx, it will automatically remove the HTML (as that is its default behavior as a text-mode web browser). The -dump option writes the rendered page to standard output, which you can redirect to a file. Here's the difference from above.




We can get the page as follows:

lynx "http://moneycentral.msn.com/investor/invsub/results/statemnt.aspx?Symbol=US:$STOCK&lstStatement=10YearSummary&stmtView=Ann" -dump > $FILE

where $STOCK is the stock you're interested in.
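If you fetch more than one symbol, the command above can be wrapped in a small function. This is only a sketch: the names statement_url and fetch_statement are made up for illustration, and it assumes lynx is on your PATH and the URL layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: build the 10-year summary URL for one ticker symbol.
statement_url() {
    printf 'http://moneycentral.msn.com/investor/invsub/results/statemnt.aspx?Symbol=US:%s&lstStatement=10YearSummary&stmtView=Ann\n' "$1"
}

# Hypothetical helper: dump the rendered page for symbol $1 into file $2.
# lynx -dump prints the formatted page to standard output, so we redirect it.
fetch_statement() {
    lynx -dump "$(statement_url "$1")" > "$2"
}

# Example call (not run here, since it needs network access):
#   fetch_statement MSFT msft.txt
```

Quoting the URL matters, because an unquoted & would send the command to the background.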


From here it was simple to use cut and tr to get the data, since we don't have to worry about the HTML:


tr -s '[:blank:]' : < "$FILE" | grep -A 10 ':Sales:EBIT:Depreciation:Total:Net:Income:EPS:Tax:Rate:' | cut -d : -f3

6,591.77
5,607.38
4,701.29
3,864.95
3,148.59
2,690.48
2,272.23
1,838.63
1,492.52
1,308.07

Note that tr -s means "--squeeze-repeats", or as the man page says, "replace each input sequence of a repeated character that is listed in SET1 with a single occurrence of that character". So *all* the blank characters in a row become a single colon in the command above.
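To see the squeezing on its own, here is a tiny example on a made-up line (the input text is invented, not taken from the MSN page). Runs of spaces and tabs each collapse to one colon:

```shell
#!/bin/sh
# Invented input: three spaces, then two tabs, between the words.
squeezed=$(printf 'Sales   EBIT\t\tDepreciation\n' | tr -s '[:blank:]' ':')

# Each run of blanks is translated to colons and squeezed into one.
printf '%s\n' "$squeezed"    # prints Sales:EBIT:Depreciation
```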

Also note that we're interested in the 10 lines after the string ":Sales:EBIT:Depreciation:Total:Net:Income:EPS:Tax:Rate:" (that is what grep -A 10 selects).

From there we get the 3rd column (-f3 in the cut command above), using a colon (:) as the delimiter (-d).

This gets us the 10 years of sales for this company.
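The whole tr, grep and cut chain can be tried without touching the network. The sketch below runs it on a made-up stand-in for the lynx dump: the numbers and the exact row layout (a date followed by the values) are assumptions, and only three rows are used, hence -A 3. In this sample layout the matched header line also goes through cut and yields "EBIT", so the sketch drops that first line with tail.

```shell
#!/bin/sh
# Made-up sample standing in for the lynx dump; only the layout matters.
sample='   Some other text
   Sales EBIT Depreciation Total Net Income EPS Tax Rate (%)
   12/07 300.00 30.00 3.00 20.00 2.00 35.00
   12/06 200.00 20.00 2.00 13.00 1.30 35.00
   12/05 100.00 10.00 1.00 7.00 0.70 35.00
   Footer text here'

# Squeeze blanks to colons, keep the header plus the 3 rows after it,
# take field 3 (field 1 is empty, field 2 is the date), drop the header.
sales=$(printf '%s\n' "$sample" |
    tr -s '[:blank:]' ':' |
    grep -A 3 ':Sales:EBIT:Depreciation:Total:Net:Income:EPS:Tax:Rate:' |
    cut -d : -f3 |
    tail -n +2)

printf '%s\n' "$sales"
```

Against a real dump you would put -A 10 back and read from the saved file instead of the sample variable.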
