LinuxCommandLibrary

trafilatura

Extract main text content from web pages

TLDR

Extract text from URL

$ trafilatura -u [https://example.com]
Extract from a local HTML file (read from stdin)
$ trafilatura < [page.html]
Output as XML
$ trafilatura -u [url] --xml
Exclude comment sections (included by default)
$ trafilatura -u [url] --no-comments
Output as JSON
$ trafilatura -u [url] --json
Process a list of URLs (one per line)
$ trafilatura -i [urls.txt]

SYNOPSIS

trafilatura [-u url] [-i file] [--xml] [--json] [options]

DESCRIPTION

trafilatura extracts the main text content from web pages, automatically removing navigation, ads, headers, footers, and other boilerplate elements. It can fetch pages from URLs directly or process local HTML files.
Output is available in plain text, XML, JSON, CSV, or Markdown. The tool also extracts metadata such as publication dates, authors, and page titles. A file containing one URL per line can be processed in a single run, which makes the tool suitable for web scraping and corpus-building workflows.

PARAMETERS

-u URL
Fetch and process the page at URL.
-i FILE
Read URLs from FILE (one per line) and process them in turn.
--xml
Output as XML, preserving some document structure.
--json
Output as JSON, bundling the extracted text with metadata.
--no-comments
Skip comment sections, which are extracted by default.

CAVEATS

trafilatura is a Python package (installable with pip). Extraction quality varies with page structure; heavily script-generated pages may yield little or no text. Network access is needed when fetching URLs directly.

HISTORY

trafilatura was created by Adrien Barbaresi for web corpus construction, combining boilerplate removal with metadata extraction.

SEE ALSO

curl(1), wget(1), lynx(1)

> TERMINAL_GEAR

Curated for the Linux community
