trafilatura
Extract main text content from web pages
SYNOPSIS
trafilatura [-u url] [-i file] [options]
DESCRIPTION
trafilatura extracts the main text content from web pages, automatically removing navigation, ads, headers, footers, and other boilerplate elements. It can fetch pages from URLs directly or process local HTML files.
Output is available in plain text, CSV, JSON, HTML, Markdown, XML, or XML-TEI formats. The tool also extracts metadata such as publication dates, authors, and page titles. Batch processing handles multiple URLs from a list file, making it suitable for web scraping and corpus building. Link discovery via feeds, sitemaps, and crawling is built in.
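The basic modes described above can be sketched as a few invocations. This is an illustration only: the URL, the file name urls.txt, and the directory extracted/ are placeholders, and show is a hypothetical helper (not part of trafilatura) that prints each command instead of running it unless RUN=1 is set.

```shell
# Hypothetical dry-run helper, not part of trafilatura: prints the command
# unless RUN=1 is set. Running for real needs trafilatura installed and,
# for -u, network access.
show() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Fetch one page and print its main text (placeholder URL).
show trafilatura -u "https://example.org/article"

# Same page as JSON, with metadata included.
show trafilatura -u "https://example.org/article" --json --with-metadata

# Batch mode: urls.txt is assumed to hold one URL per line;
# results are written to the extracted/ directory as Markdown.
show trafilatura -i urls.txt -o extracted/ --output-format markdown
```

With RUN unset the block only echoes the three command lines, so it is safe to paste; set RUN=1 once the commands look right.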
PARAMETERS
-u, --URL URL
Fetch and process a URL.
-i, --input-file FILE
Input file (HTML file or list of URLs for batch processing).
-o, --output-dir DIR
Write results to specified directory.
--output-format FORMAT
Output format: txt, csv, json, html, markdown, xml, xmltei.
--json
JSON output shorthand.
--xml
XML output shorthand.
--csv
CSV output shorthand.
--no-comments
Exclude comments from extraction.
--no-tables
Exclude table elements from extraction.
--with-metadata
Extract and include metadata in output.
--precision
Favor extraction precision (less noise, less text).
--recall
Favor extraction recall (more text, possibly more noise).
-f, --fast
Fast extraction without fallback detection.
--formatting
Include text formatting (bold, italic, etc.).
--links
Include links with targets in output.
--deduplicate
Filter out duplicate documents and sections.
--feed [URL]
Look for feeds or pass feed URL as input.
--sitemap [URL]
Look for sitemaps or enter sitemap URL.
--parallel N
Number of cores/threads for downloads and processing.
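The options listed above can be combined. The following sketch shows three plausible combinations using only flags from this list; the URLs and the corpus/ directory are placeholders, and show is again a hypothetical dry-run helper rather than anything trafilatura provides. How the flags interact in practice may depend on the installed version.

```shell
# Hypothetical dry-run helper, not part of trafilatura: prints the command
# unless RUN=1 is set.
show() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Favor precision and leave out comments and tables for cleaner text.
show trafilatura -u "https://example.org/article" --precision --no-comments --no-tables

# Keep text formatting and link targets, and emit XML.
show trafilatura -u "https://example.org/article" --formatting --links --xml

# Discover pages through a sitemap (placeholder URL), use 4 cores/threads,
# filter duplicates, and write the results to corpus/.
show trafilatura --sitemap "https://example.org/sitemap.xml" --parallel 4 --deduplicate -o corpus/
```

Since --precision and --recall pull in opposite directions, the sketch uses only one of them per command.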
CAVEATS
Requires Python. Extraction quality varies with site structure. Fetching URLs requires network access.
HISTORY
trafilatura was created by Adrien Barbaresi as an academic project for web scraping and text extraction. It is written in Python.
