Monday, October 14, 2013

Scrapy API: The Creepy Crawlers


Hi, I came across the Scrapy framework a few months ago and have been interested in it ever since. This blog post will show how you can create a web application that uses Scrapy. All in 15 minutes ;)


Let's begin!

  1. Download Scrapy and its dependencies
  2. Create a Scrapy project. If you are familiar with Django's project structure, you will find Scrapy's very easy; the two are almost alike.
    • scrapy startproject <project_name>
  3. Create a spider. Spiders are user-written classes used to scrape information from a domain.
    • To create a spider, subclass scrapy.spider.BaseSpider and define the three main, mandatory attributes: name, start_urls, and the parse() method.
    • Sample Code.

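For reference, the startproject command in step 2 generates a skeleton roughly like the following (file names come from the standard Scrapy template; exact contents vary by version):

```
<project_name>/
    scrapy.cfg              # project configuration file
    <project_name>/
        __init__.py
        items.py            # item definitions (containers for scraped data)
        pipelines.py        # item processing
        settings.py         # project settings
        spiders/            # your spiders live here
            __init__.py
```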
  4. Send your spider to crawl the domain. To put your spider to work, go to the project's top-level directory and run "scrapy crawl <spider_name>".
  5. Extract items. Scrapy uses a mechanism based on XPath expressions called XPath selectors. You can learn how to write XPaths here.
Additional Information:
Scrapy is separated into three parts: spiders, items, and pipelines. Spiders are the scripts that handle the scraping of data from the domains. Items work like models (in Django); these are containers that will be loaded with scraped data. The pipeline is where you process the scraped data, though a pipeline is not required.


Sample project CLICK HERE

That's all! Have fun coding.