trafilatura
Extract main content from web pages
TLDR
Extract text from a URL
Extract text and save to a file
Extract text in JSON format
Extract text from multiple URLs listed in a file
Crawl a website using its sitemap
Extract text while preserving HTML formatting
Extract text including comments
Display help
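The invocations below are sketches of the tasks listed above. Flag names follow the PARAMETERS section and upstream documentation but can differ between trafilatura versions (for instance, recent releases use -i/--inputfile for URL lists and --sitemap for sitemap-based crawling); URLs and file names are placeholders, so confirm against trafilatura --help.

# Extract text from a URL
trafilatura --url "https://example.org/article"

# Extract text and save to a file
trafilatura --url "https://example.org/article" > article.txt

# Extract text in JSON format
trafilatura --json --url "https://example.org/article"

# Extract text from multiple URLs listed in a file (option name assumed: -i/--inputfile)
trafilatura -i urls.txt

# Crawl a website using its sitemap (option name assumed: --sitemap)
trafilatura --sitemap "https://example.org/sitemap.xml"

# Extract text while preserving HTML formatting
trafilatura --xml --url "https://example.org/article"

# Extract text including comments (comments are kept by default; --no-comments drops them)
trafilatura --url "https://example.org/article"

# Display help
trafilatura --help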
SYNOPSIS
trafilatura [OPTIONS] [URL_OR_PATH]
trafilatura --url URL [OPTIONS]
trafilatura --file PATH [OPTIONS]
trafilatura --crawl URL [OPTIONS]
PARAMETERS
--url
Specify a URL to process directly.
--file
Process an HTML file from the local filesystem.
--crawl
Initiate a basic crawl starting from the given URL.
--json
Output the extracted content and metadata as JSON.
--xml
Output the extracted content and metadata as XML.
--csv
Output the extracted content and metadata as CSV (one row per processed document).
--output
Specify an output file path for the extracted content.
--config
Load a custom configuration file.
--keep-empty
Keep empty fields in the output.
--no-comments
Do not extract comments from the webpage.
--no-images
Exclude images from the extracted content.
--no-soup
Skip BeautifulSoup cleanup, useful for malformed HTML.
--links
Include links along with their targets in the extracted output.
--silent
Suppress progress messages and warnings.
--verbose
Enable verbose output for debugging.
--help
Display a help message and exit.
--version
Show program's version number and exit.
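A sketch combining several of the options above (the URL is a placeholder; exact flag names can vary between versions, so check trafilatura --help):

# Extract a page as JSON without comments and save the result
trafilatura --url "https://example.org/article" --no-comments --json > article.json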
DESCRIPTION
trafilatura is a Python library and command-line tool for robust web content extraction. Its command-line interface (CLI) extracts the main content, metadata (such as author, date, and title), and comments from web pages while discarding boilerplate such as navigation elements and advertisements. It accepts direct URLs, local HTML files, and lists of URLs, offers basic crawling, and produces output in several formats, including plain text, JSON, XML, and CSV. This makes trafilatura useful for data journalists, researchers, and developers working with web-sourced information.
CAVEATS
trafilatura is a Python library and thus requires a Python environment and pip for installation. While powerful, its web scraping capabilities can be affected by website changes, anti-bot measures, or rate limiting. Always respect robots.txt and website terms of service when scraping. Performance and accuracy can vary depending on the complexity and structure of the target webpage.
INPUT FROM STANDARD INPUT
trafilatura can also process HTML piped to it on standard input, which makes it easy to use in shell pipelines and scripts, as in the example below.
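A minimal pipeline, assuming curl is installed and https://example.org stands in for a real page:

# Fetch a page with curl and extract its main text
curl -s "https://example.org" | trafilatura

# Same, but with JSON output
curl -s "https://example.org" | trafilatura --json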
CONFIGURATION FILES
Advanced users can fine-tune extraction behavior with a custom configuration file passed via --config, which overrides default settings such as download timeouts, minimum extraction and output sizes, and deduplication thresholds.
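For example (my_settings.cfg is a placeholder path; the file must follow trafilatura's settings format):

# Extract with custom settings, e.g. longer timeouts or stricter minimum sizes
trafilatura --config my_settings.cfg --url "https://example.org/article"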
HISTORY
trafilatura was created by Adrien Barbaresi to provide a robust, language-agnostic, and user-friendly way of extracting boilerplate-free main content from web pages. Written in Python, it combines heuristics for identifying and isolating the relevant textual parts of a page, and its development has focused on steadily improving extraction accuracy and keeping up with changes in web design.
SEE ALSO
curl(1): Tool for transferring data from or to a server.
wget(1): Non-interactive network downloader.
lynx(1): A general purpose distributed information browser for the World Wide Web.
pup(1): A command-line HTML processor.
python(1): An interpreted, interactive, object-oriented programming language.


