LinuxCommandLibrary

trafilatura

Extract main content from web pages

TLDR

Extract text from a URL

$ trafilatura [[-u|--URL]] [url]

Extract text and save it to a file (the `-o|--output-dir` option takes a directory, so use shell redirection for a single file)
$ trafilatura [[-u|--URL]] [url] > [path/to/output.txt]

Extract text in JSON format
$ trafilatura [[-u|--URL]] [url] --json

Extract text from multiple URLs listed in a file
$ trafilatura [[-i|--input-file]] [path/to/url_list.txt]

Crawl a website using its sitemap
$ trafilatura --sitemap [url_to_sitemap.xml]

Extract text while preserving HTML formatting
$ trafilatura [[-u|--URL]] [url] --formatting

Extract text, excluding comment sections (comments are included by default)
$ trafilatura [[-u|--URL]] [url] --no-comments

Display help
$ trafilatura [[-h|--help]]

SYNOPSIS

trafilatura [OPTIONS]
trafilatura [OPTIONS] -u URL
trafilatura [OPTIONS] -i INPUT_FILE
trafilatura [OPTIONS] --sitemap URL
trafilatura [OPTIONS] --crawl URL

PARAMETERS

-u, --URL
    Specify a URL to process directly.

-i, --input-file
    Read a list of URLs to process from a file, one per line.

--input-dir
    Process HTML files from a directory on the local filesystem.

-o, --output-dir
    Write extraction results into the given directory instead of standard output.

--sitemap
    Discover pages to process from a sitemap URL.

--crawl
    Initiate a basic crawl starting from the given URL.

--json
    Output the extracted content and metadata as JSON.

--xml
    Output the extracted content and metadata as XML.

--csv
    Output the extracted content and metadata as CSV (for lists of items).

--config-file
    Override standard extraction parameters with a custom configuration file.

--no-comments
    Do not extract comments from the web page.

--no-tables
    Do not extract table content.

--images
    Include image sources and alt text in the output (experimental; images are excluded by default).

--links
    Keep links and their targets in the extracted text (experimental).

--formatting
    Preserve basic text formatting (e.g. bold and italics) in the output.

-v, --verbose
    Enable verbose output for debugging.

-h, --help
    Display a help message and exit.

--version
    Show the program's version number and exit.

DESCRIPTION

trafilatura is a Python package and command-line tool for robust web content extraction. Its CLI extracts the main text, metadata (author, date, title), and comments from web pages, stripping boilerplate such as navigation bars, advertisements, and footers to produce clean, structured data from HTML. It accepts direct URLs, local HTML files, lists of URLs, sitemaps, and basic crawls as input, and can emit plain text, JSON, XML, and CSV. This makes trafilatura well suited to data journalists, researchers, and developers working with web-sourced text.

CAVEATS

trafilatura is a Python library and thus requires a Python environment and pip for installation. While powerful, its web scraping capabilities can be affected by website changes, anti-bot measures, or rate limiting. Always respect robots.txt and website terms of service when scraping. Performance and accuracy can vary depending on the complexity and structure of the target webpage.

INPUT FROM STANDARD INPUT

trafilatura can also process HTML piped to it on standard input, which makes it easy to integrate into shell pipelines. For example: curl -s [url] | trafilatura.

CONFIGURATION FILES

Advanced users can fine-tune extraction behavior with a custom configuration file (an INI-style file modeled on the settings.cfg shipped with the package, passed via --config-file), controlling settings such as download timeouts, minimum extraction and output lengths, and deduplication.
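An illustrative fragment of such a file; the option names below follow the defaults packaged with trafilatura, but check the settings.cfg of your installed version for the authoritative list and values:

```ini
# Illustrative settings.cfg fragment -- option names based on trafilatura's
# packaged defaults; verify against your installed version.
[DEFAULT]
# network behaviour
DOWNLOAD_TIMEOUT = 30
SLEEP_TIME = 5
# extraction thresholds (in characters)
MIN_EXTRACTED_SIZE = 250
MIN_OUTPUT_SIZE = 1
```

Pass it on the command line with: trafilatura --config-file [path/to/settings.cfg] -u [url].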

HISTORY

trafilatura (Italian for "wire drawing") was created by Adrien Barbaresi to provide a robust, language-agnostic, and user-friendly way to extract boilerplate-free main content from web pages. Written in Python, it combines heuristics and rule-based extraction to identify and isolate the relevant text, and its development has focused on steadily improving accuracy and adapting to the evolving landscape of web design.

SEE ALSO

curl(1): Tool for transferring data from or to a server., wget(1): Non-interactive network downloader., lynx(1): A general purpose distributed information browser for the World Wide Web., pup(1): A command-line HTML processor., python(1): An interpreted, interactive, object-oriented programming language.
