Thursday, October 16, 2008

Data Mining - Curl Does the Right Thing


As I was again working on my data mining project the other day, I was tasked with downloading some information from a web-based database so that I could extract some data from it.

I was using curl and netcat to grab the information from the website, but I didn't have a working login script, because this database was secured by an SSO system.

I didn't want to spend time working out how this SSO system worked just yet. Since I had a direct login to the system, I knew I could use Firefox LiveHeaders to grab my SSO cookie and then have curl send it as an additional header to the web-based service, basically letting curl pretend it had already logged in though it had not.

Here's where the -H option comes in handy, to send an extra header when getting a web page:

-H "Cookie: SSO=1%7Cjohn%7C%2F7C20081016165339%7Cpassword%7CLDAP%7is _secretC10000007C%7Cdsa%7CE%2BR"

Bingo. I was in.

Also, this particular web database only accepts Firefox and Internet Explorer as valid web clients, so I needed to tell the remote Web server that I was a different browser.

Here I also used the -H option and told the Web server that curl was actually Firefox, using the following syntax:

-H "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) Gecko/2008092417 Firefox/3.0.3"
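
For anyone scripting this outside the shell, the two extra headers can be sketched in Python with the standard library's urllib.request. This is only an illustration, not what I ran: the URL and the cookie value are made-up placeholders, and the request is built but never sent.

```python
import urllib.request

# Hypothetical placeholders: the real URL and cookie value are not shown here.
URL = "https://db.example.com/report"
SSO_COOKIE = "SSO=value-copied-from-the-browser"
FIREFOX_UA = (
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.3) "
    "Gecko/2008092417 Firefox/3.0.3"
)


def build_request(url, cookie, user_agent):
    """Build (but do not send) a GET request carrying the two extra headers."""
    return urllib.request.Request(
        url,
        headers={"Cookie": cookie, "User-Agent": user_agent},
    )


req = build_request(URL, SSO_COOKIE, FIREFOX_UA)

# Show the headers that would go out with the request.
for name, value in req.header_items():
    print(f"{name}: {value}")

# To actually fetch the page: urllib.request.urlopen(req).read()
```

Each entry in the headers dictionary plays the same role as one -H option on the curl command line.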

These options are great, but when I tried to get the data, my whole terminal filled with garbled output.


After thinking about this for a minute, I realized that the server was doing what I had told it to do: send me compressed data.

I had sent this HTTP header with the request:

Accept-Encoding:gzip,deflate

Oops.

The Web server was expecting me to decompress the data. Curl has a --compressed option that does two things: 1) requests a compressed response and 2) returns the uncompressed document. Wow. Perfect. Curl had done the right thing again!
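
Why the terminal filled with garbage, and what the decompression step amounts to, can be sketched in Python with the standard gzip module. The page body here is made up, and compressing it locally stands in for the server, so no network is involved.

```python
import gzip

# Made-up page body, standing in for what the server would send.
page = b"<html><body>report data</body></html>"

# Roughly what a server does when the client sends Accept-Encoding: gzip.
wire_bytes = gzip.compress(page)

# Printed raw, the body is binary; a gzip stream starts with bytes 0x1f 0x8b.
print(wire_bytes[:2])

# What the client has to do before the body is readable again.
restored = gzip.decompress(wire_bytes)
print(restored.decode("ascii"))
```

Dumping wire_bytes straight to a terminal is what produces the garbled screen; the decompress call is the step that has to happen before the body can be read.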

After that I collected my data and life was good.
