My 2012 summer vacation was 3 months long, and it started to become very boring after the first few days. So I thought I'd try out GitHub and Python's HTTP interfaces (urllib, urllib2, httplib, Twisted, etc.) by writing a simple crawler, something that had been on my todo list for a very long time.

I started with a simple single-threaded crawler to get the list of users and the list of problems on SPOJ. It was pretty simple with urllib2; you just need urllib2.urlopen(). A byproduct of this was the SPOJ Problem Selector, whose code I wanted to commit to GitHub as my first project. It was a simple non-threaded, non-queued crawler, and it crawled pretty slowly. So I decided to thread it: add a global queue that stores the list of URLs, and multiple worker threads that pop URLs from the queue and push back the new ones they discover.

Still there was a problem! As the number of URLs grows, you can't store all of them in memory. So I had to make a persistent queue by wrapping a simple database with the interface of a queue. This way I made a full-fledged crawler that can crawl multiple pages, and I added the ability to pause and resume too using Python shelves.

Then I came across Scrapy, a crawler framework for Python. It has all of these features pre-built, and I needed to write only the parser for each HTML page. I tried tweaking it a lot, but I still liked my pause-and-resume crawler better.
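The threaded design described above (a shared queue of URLs, worker threads popping from it and pushing back new links) can be sketched roughly like this. This is a minimal sketch in modern Python, not the actual project code: the fetch and link-extraction functions are passed in as parameters so the sketch stays network-free, and the worker count is a made-up number.

```python
import queue
import threading

NUM_WORKERS = 4  # hypothetical worker count, not from the repo

def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Crawl from seed_urls using a shared queue and worker threads.

    fetch(url) -> page text; extract_links(page) -> list of new URLs.
    """
    url_queue = queue.Queue()
    seen = set(seed_urls)          # avoid crawling the same URL twice
    seen_lock = threading.Lock()   # protects `seen` across workers
    results = {}

    for url in seed_urls:
        url_queue.put(url)

    def worker():
        while True:
            try:
                # A short timeout lets workers exit once the queue drains.
                url = url_queue.get(timeout=0.5)
            except queue.Empty:
                return
            page = fetch(url)
            results[url] = page
            # Push back any newly discovered URLs for other workers.
            for link in extract_links(page):
                with seen_lock:
                    if link not in seen and len(seen) < max_pages:
                        seen.add(link)
                        url_queue.put(link)
            url_queue.task_done()

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The timeout-based shutdown is the simplest way to stop idle workers; a real crawler would more likely use sentinel values or `queue.join()`.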
Here is the link to it: https://github.com/jujojujo2003/SPOJ-Problem-Selector
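For the curious: the persistent-queue idea (wrap on-disk storage with a queue interface, which also gives you pause and resume for free, since the pending URLs survive a restart) might look something like the sketch below, built on the standard shelve module. The class and key names here are mine for illustration, not the ones from the repo.

```python
import shelve

class ShelfQueue:
    """A FIFO queue persisted in a shelve database.

    Items and the head/tail indices live on disk, so the crawler can be
    stopped ("paused") and reopened later ("resumed") without losing URLs.
    """

    def __init__(self, path):
        self.db = shelve.open(path)
        self.head = self.db.get("head", 0)  # next index to pop
        self.tail = self.db.get("tail", 0)  # next index to push

    def put(self, item):
        self.db[str(self.tail)] = item
        self.tail += 1
        self.db["tail"] = self.tail
        self.db.sync()  # flush to disk so a crash loses nothing

    def get(self):
        if self.head == self.tail:
            raise IndexError("queue is empty")
        item = self.db[str(self.head)]
        del self.db[str(self.head)]
        self.head += 1
        self.db["head"] = self.head
        return item

    def __len__(self):
        return self.tail - self.head

    def close(self):
        self.db.close()
```

Closing the shelf and reopening it at the same path picks up exactly where the crawl left off, which is all "pause/resume" really needs.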