Let's scrape the Internet with Scrapy

The Internet is the most universal data resource in history so far, and much of its data is freely available to anyone.

However, because webpages vary so widely in structure, their data is hard to collect and reuse in bulk.

To fetch data from pages, we need to analyse each webpage’s structure and identify the data to extract.

This process is called “web scraping”, and it is one of the most important tasks of any web search engine.

Users can also benefit by building their own scrapers to fetch the data collections they need from this sea of information.

Scrapy introduction:

Scrapy is a popular web crawling framework written in Python. It is composed of well-designed modules that together

provide a simple yet complete web scraping solution. To start using Scrapy, install it with pip:

pip install scrapy

After installation, running the help command shows the following:

$ scrapy -h
Scrapy 2.0.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command

Normally, the following commands are used when developing a web scraper (a sample session follows the list):

  • startproject <project_name> [project_dir]: generates a structured Python project folder to kick off development
  • genspider [options] <name> <domain>: generates a spider from a predefined template; templates exist for basic HTML crawling as well as XML and CSV feeds
  • shell [url|file]: starts the interactive scraping console (IPython, if installed) with the given URL or file already fetched; very useful during development
  • runspider [options] <spider_file>: runs a self-contained spider and can export the scraped items to JSON, CSV and other formats (see the minimal spider sketch below)
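
Putting these commands together, a typical first session might look like the sketch below. The project name quotes_demo and the domain quotes.toscrape.com are placeholders for illustration:

$ scrapy startproject quotes_demo
$ cd quotes_demo
$ scrapy genspider quotes quotes.toscrape.com
$ scrapy shell "http://quotes.toscrape.com/"
# inside a project directory, the extra "crawl" command becomes available:
$ scrapy crawl quotes -o quotes.json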

For more details, the official documentation is always a good reference.
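
As a concrete example of what runspider executes, here is a minimal self-contained spider. It is only a sketch: quotes.toscrape.com (a public scraping sandbox) and the CSS selectors are assumptions chosen for illustration.

# quotes_spider.py -- run with: scrapy runspider quotes_spider.py -o quotes.json
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a public sandbox site; replace with your own target
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item (a plain dict) per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the pagination link, if present, reusing this callback
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)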

Scrapy workflow:

(Figure: Scrapy workflow diagram)

Generally, when a spider starts crawling webpages with Scrapy, the engine operates as follows:

  1. The spider sends requests for the URLs to crawl, and the engine queues them in the scheduler

  2. The scheduler passes the next request to the downloader, with both the request and its response travelling through the downloader middlewares

  3. The spider receives the downloader’s response objects and returns scraped items (and any follow-up requests) through the spider middlewares

  4. The configured item pipelines process the items returned by the spider and export them to external storage (a pipeline sketch follows below)

P.S.: The process repeats until there are no more requests left in the scheduler’s queue.
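
To make step 4 concrete, an item pipeline is simply a class with a process_item method, enabled through the project settings. The class name CleanTextPipeline and the "text" field below are assumptions for illustration:

# pipelines.py -- a sketch of an item pipeline for step 4
# enable it in settings.py with:
#   ITEM_PIPELINES = {"quotes_demo.pipelines.CleanTextPipeline": 300}
from scrapy.exceptions import DropItem


class CleanTextPipeline:
    def process_item(self, item, spider):
        # Drop items that carry no text, strip whitespace from the rest
        if not item.get("text"):
            raise DropItem("missing text field")
        item["text"] = item["text"].strip()
        return item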
