LinuxCommandLibrary

scrapy

TLDR

Create new project

$ scrapy startproject [project_name]

Generate spider

$ scrapy genspider [spider_name] [domain.com]

Run spider

$ scrapy crawl [spider_name]

Run spider and save to file

$ scrapy crawl [spider_name] -o [output.json]

Interactive shell for testing

$ scrapy shell "[https://example.com]"

Check spider contracts

$ scrapy check [spider_name]

List available spiders

$ scrapy list

Fetch URL and show response

$ scrapy fetch [https://example.com]

SYNOPSIS

scrapy command [-o FILE] [-s SETTING=VALUE] [options] [arguments]

DESCRIPTION

Scrapy is a Python framework for web scraping and crawling. It handles requests, parsing, and data extraction with built-in support for following links, handling cookies, and respecting robots.txt.
Projects contain spiders: classes that define how to scrape a site. A spider specifies start URLs, parses responses with CSS or XPath selectors, and yields items or further requests.
The shell command provides an interactive environment for testing: you can experiment with selectors on live pages before writing spider code. The response object it exposes behaves the same as the one passed to spider callbacks.
Items define the data structure being scraped. Item pipelines process scraped data: validation, cleaning, storage to databases or files. Multiple output formats are supported.
Middleware handles request/response processing: user agents, proxies, retries, cookies. Settings control behavior: concurrency, delays, download timeouts, and more.
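A few of these settings as they might appear in a project's settings.py (the values are illustrative, not recommendations):

```python
# settings.py (excerpt)
BOT_NAME = "myproject"

ROBOTSTXT_OBEY = True        # respect robots.txt
CONCURRENT_REQUESTS = 8      # cap on concurrent downloads
DOWNLOAD_DELAY = 0.5         # seconds to wait between requests to the same site
DOWNLOAD_TIMEOUT = 30        # give up on a request after this many seconds
```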
Extensions add functionality: stats collection, memory usage monitoring, automatic throttling (AutoThrottle), and custom signal handlers.

PARAMETERS

startproject NAME

Create a new Scrapy project.

genspider NAME DOMAIN

Generate a spider from a template.

crawl SPIDER

Run a spider.

shell [URL]

Start the interactive shell, optionally fetching URL first.

list

List available spiders in the project.

check [SPIDER]

Run contract checks for a spider.

fetch URL

Fetch URL with the Scrapy downloader and print the response.

view URL

Open URL in a browser as Scrapy sees it.

parse URL

Fetch URL and parse it with the handling spider.

runspider FILE

Run a self-contained spider from a Python file, without a project.

-o FILE, --output FILE

Write scraped items to FILE; the format is inferred from the extension (json, csv, xml).

-s SETTING=VALUE

Override a setting.

-a NAME=VALUE

Pass an argument to the spider.

-t FORMAT, --output-format FORMAT

Output format for -o (e.g. json, csv, xml).

--nolog

Disable logging.

--loglevel LEVEL

Set the log level: DEBUG, INFO, WARNING, ERROR, or CRITICAL.

CAVEATS

JavaScript-rendered content requires integration with a headless browser or rendering service such as Splash, Playwright, or Selenium. Some sites block scrapers via rate limiting or CAPTCHAs, and aggressive scraping may violate a site's terms of service. robots.txt should be respected (new projects enable the ROBOTSTXT_OBEY setting by default). The interactive shell does not persist state between sessions.

HISTORY

Scrapy was created by Pablo Hoffman and Shane Evans at Insophia around 2008. It grew from internal tools to a general-purpose framework. The project became one of the most popular Python scraping tools, with commercial company Scrapinghub (now Zyte) providing support and services.
