trafilatura
Extract main text content from web pages
TLDR
Extract text from URL
trafilatura -u url
SYNOPSIS
trafilatura [-u url] [-i file] [--xml] [--json] [options]
DESCRIPTION
trafilatura extracts the main text content from web pages, automatically removing navigation, ads, headers, footers, and other boilerplate elements. It can fetch pages from URLs directly or process local HTML files.
Output is available in plain text, XML, or JSON formats. The tool also extracts metadata such as publication dates, authors, and page titles. Batch processing mode handles multiple URLs from a list file, making it suitable for web scraping and corpus building workflows.
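As a rough illustration of the output formats described above, the following Python sketch builds and runs a single trafilatura call from a script. It uses only the -u, --xml, and --json flags from the SYNOPSIS. The helper names (build_command, extract), the example.org URL, and the assumption that plain text needs no format flag are illustrative and not taken from the tool's documentation.

```python
import shutil
import subprocess

# Format flags as listed in the SYNOPSIS. Assumption: plain text is the
# default output and therefore needs no flag.
FORMAT_FLAGS = {"txt": None, "xml": "--xml", "json": "--json"}


def build_command(url, output_format="txt"):
    """Return the argument list for one trafilatura call (hypothetical helper)."""
    if output_format not in FORMAT_FLAGS:
        raise ValueError(f"unsupported format: {output_format}")
    command = ["trafilatura", "-u", url]
    flag = FORMAT_FLAGS[output_format]
    if flag is not None:
        command.append(flag)
    return command


def extract(url, output_format="txt", timeout=60):
    """Run trafilatura and return its standard output, or None on failure."""
    if shutil.which("trafilatura") is None:
        return None  # the command is not installed or not on PATH
    try:
        result = subprocess.run(
            build_command(url, output_format),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0 or not result.stdout.strip():
        return None  # nothing extracted, or the page could not be fetched
    return result.stdout


if __name__ == "__main__":
    # example.org is a placeholder URL, not a recommended test page.
    print(build_command("https://example.org/article", "json"))
```

Keeping the command construction separate from the subprocess call makes it possible to check the arguments without network access.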
PARAMETERS
-u URL
Input URL.
-i FILE
Input file.
--xml
XML output.
--json
JSON output.
--with-comments
Include comments.
--batch
Batch mode.
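For corpus-building workflows, one possible approach is to drive the -u option from a script, one URL at a time, and record which pages yielded no text. The sketch below does not use the tool's own batch mode. The function names, the one-URL-per-line file layout, and the "#" comment convention are assumptions made for illustration.

```python
import subprocess


def read_url_list(text):
    """Return the non-empty lines of a URL list, skipping '#' comments.

    The '#' comment convention is an assumption of this sketch.
    """
    urls = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def run_cli(url):
    """Call `trafilatura -u URL` and return its output ('' on any failure)."""
    try:
        result = subprocess.run(
            ["trafilatura", "-u", url],
            capture_output=True,
            text=True,
            timeout=60,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""  # command missing, not executable, or too slow
    return result.stdout if result.returncode == 0 else ""


def extract_all(urls, runner=run_cli):
    """Extract every URL; return (texts keyed by URL, list of failed URLs)."""
    results, failed = {}, []
    for url in urls:
        text = runner(url).strip()
        if text:
            results[url] = text
        else:
            failed.append(url)  # extraction varies by site; keep for review
    return results, failed
```

Passing the runner as a parameter allows the loop to be exercised with a stand-in function, so the bookkeeping can be checked without fetching any pages.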
CAVEATS
Requires a Python installation. Extraction quality varies from site to site. Fetching pages by URL requires network access.
HISTORY
trafilatura was created for web scraping and text extraction, removing boilerplate from web pages.
