scrapy
TLDR
Create new project:
scrapy startproject NAME
SYNOPSIS
scrapy command [-o output] [-s setting=value] [options] [arguments]
DESCRIPTION
Scrapy is a Python framework for web scraping and crawling. It handles requests, parsing, and data extraction with built-in support for following links, handling cookies, and respecting robots.txt.
Projects contain spiders - classes that define how to scrape sites. Spiders specify start URLs, parse responses with CSS/XPath selectors, and yield items or further requests.
The shell command provides interactive testing. You can experiment with selectors on live pages before writing spider code. The response object in the shell exposes the same methods available inside a spider's callbacks.
Items define the data structure being scraped. Item pipelines process scraped data: validation, cleaning, storage to databases or files. Multiple output formats are supported.
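A sketch of an item and a validation pipeline. Scrapy accepts plain dicts, dataclasses (used here), scrapy.Item subclasses, and attrs classes; the Product fields and pipeline logic are invented for illustration (a real pipeline would raise scrapy.exceptions.DropItem rather than ValueError):

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical scraped item.
    name: str
    price: float


class PriceValidationPipeline:
    """Reject items with missing or negative prices, normalize names."""

    def process_item(self, item, spider):
        if item.price is None or item.price < 0:
            # Stand-in for scrapy.exceptions.DropItem.
            raise ValueError(f"invalid price: {item.price}")
        item.name = item.name.strip().title()
        return item
```

Pipelines are enabled through the ITEM_PIPELINES setting, e.g. {"myproject.pipelines.PriceValidationPipeline": 300}, where the integer controls execution order.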
Middleware handles request/response processing: user agents, proxies, retries, cookies. Settings control behavior: concurrency, delays, download timeouts, and more.
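A downloader-middleware sketch for one of the jobs listed above, rotating user agents; the class name and agent strings are invented for illustration:

```python
import random


class RotateUserAgentMiddleware:
    # Hypothetical pool of user-agent strings.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64)",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ]

    def process_request(self, request, spider):
        # Scrapy calls this for every outgoing request; returning None
        # lets processing continue with the (now modified) request.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None
```

Middleware is enabled through the DOWNLOADER_MIDDLEWARES setting, e.g. {"myproject.middlewares.RotateUserAgentMiddleware": 543}.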
Extensions add functionality: stats collection, throttling, autothrottle, and custom callbacks.
PARAMETERS
startproject NAME
Create new Scrapy project.
genspider NAME DOMAIN
Generate spider from template.
crawl SPIDER
Run a spider.
shell [URL]
Interactive shell for testing.
list
List available spiders.
check [SPIDER]
Run contract checks.
fetch URL
Fetch URL and print.
view URL
Open URL in browser.
parse URL
Parse URL with spider.
runspider FILE
Run spider from file.
-o FILE, --output FILE
Output file (json, csv, xml).
-s SETTING=VALUE
Override setting.
-a NAME=VALUE
Spider argument.
-t FORMAT, --output-format FORMAT
Output format.
--nolog
Disable logging.
--loglevel LEVEL
Log level: DEBUG, INFO, WARNING.
CAVEATS
JavaScript-rendered content requires integration with a browser-based renderer such as Splash or Selenium. Some sites block scrapers via rate limiting or CAPTCHAs. Aggressive scraping may violate a site's terms of service. robots.txt should be respected (new projects enable ROBOTSTXT_OBEY by default). The interactive shell doesn't persist state between sessions.
HISTORY
Scrapy was created by Pablo Hoffman and Shane Evans at Insophia around 2008. It grew from internal tools to a general-purpose framework. The project became one of the most popular Python scraping tools, with commercial company Scrapinghub (now Zyte) providing support and services.
SEE ALSO
curl(1), wget(1), beautifulsoup(1), playwright(1)