My first GitHub commit

My 2012 summer vacation was three months long, and it started to get very boring after the first few days. So I thought I'd try out GitHub and Python's HTTP interfaces (urllib, urllib2, httplib, Twisted, etc.) by writing a simple crawler, something that had been on my todo list for a very long time. I started with a simple single-threaded crawler to get the list of users and the list of problems on SPOJ. It was pretty simple using urllib2: all you need is urllib2.urlopen(). A byproduct of this was the SPOJ problem selector, whose code I wanted to commit to GitHub as my first project.

It was a simple non-threaded, non-queued crawler, and it crawled pretty slowly. So I decided to thread it: add a global queue that stores the list of URLs, and multiple worker threads that pop URLs from the queue and push back newly discovered ones.

Still, there was a problem. As the number of URLs increases, you can't store all of them in memory. So I had to make a persistent queue by wrapping a simple database with the interface of a queue. This way I made a full-fledged crawler that can crawl many pages, and I added the ability to pause and resume too, using Python shelves.

Then I came across Scrapy, a crawler framework for Python. It has all of these features pre-built, and I needed to write only the parser for each HTML page. I tried tweaking it a lot, but I still liked my pause-and-play crawler better.
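The fetch step really is that short. This is a hedged sketch, not the post's actual code; the post used Python 2's urllib2.urlopen(), and in Python 3 the same call lives in urllib.request (the SPOJ URL in the comment is just an illustration):

```python
# Minimal sketch of the fetch step (Python 3 spelling of urllib2.urlopen).
from urllib.request import urlopen

def fetch(url, timeout=10):
    """Fetch a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")

# e.g. html = fetch("http://www.spoj.com/problems/classical/")
```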
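The global-queue-plus-workers design can be sketched like this. It is a toy version under my own assumptions, not the repository's code: `fetch_links` is a hypothetical stand-in for the real fetch-and-parse step, and the `limit` parameter just keeps the toy bounded:

```python
import queue
import threading

def crawl(seed_urls, fetch_links, num_workers=4, limit=100):
    """Workers pop URLs from a shared queue and push back new ones."""
    q = queue.Queue()
    seen = set(seed_urls)          # URLs already queued, to avoid revisits
    seen_lock = threading.Lock()
    for url in seed_urls:
        q.put(url)

    def worker():
        while True:
            try:
                url = q.get(timeout=0.5)   # exit once the queue drains
            except queue.Empty:
                return
            for link in fetch_links(url):
                with seen_lock:
                    if link not in seen and len(seen) < limit:
                        seen.add(link)
                        q.put(link)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

The lock around `seen` matters: without it, two workers could both decide a URL is new and crawl it twice.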
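The pause-and-resume trick comes down to keeping the queue on disk. Here is a minimal sketch of that idea, assuming a shelve file as the backing store and two stored counters for the head and tail; the names and layout are my own, not the crawler's:

```python
import shelve

class PersistentQueue:
    """A FIFO queue backed by a shelve file, so a crawl can pause and resume."""

    def __init__(self, path):
        self.db = shelve.open(path)
        self.head = self.db.get("head", 0)   # next index to pop
        self.tail = self.db.get("tail", 0)   # next index to push

    def push(self, item):
        self.db[str(self.tail)] = item
        self.tail += 1
        self.db["tail"] = self.tail

    def pop(self):
        if self.head >= self.tail:
            return None                      # queue is empty
        item = self.db[str(self.head)]
        del self.db[str(self.head)]
        self.head += 1
        self.db["head"] = self.head
        return item

    def close(self):
        self.db.close()
```

Closing the shelve is the "pause"; reopening it with the same path picks the crawl up exactly where it left off.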
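The "write only the parser for each HTML page" part is the one piece a framework can't do for you. Scrapy gives you selectors for this; as a self-contained stand-in, the same per-page step can be sketched with the stdlib html.parser (again my own sketch, not Scrapy code):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags: the per-page parsing step."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

In the threaded crawler above, a function like this is exactly what the workers would call on each fetched page to get new URLs to push back onto the queue.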

Here is the link to it:
