beautifulsoup
TLDR
Parse HTML and find all links
SYNOPSIS
DESCRIPTION
Beautiful Soup is a Python library for parsing HTML and XML documents. While not a command-line tool itself, it is commonly used in Python one-liners and scripts for web scraping, data extraction, and HTML manipulation.
The library creates a parse tree from HTML documents, allowing navigation, search, and modification of the tree. It works with multiple parsers and handles malformed markup gracefully, making it ideal for scraping real-world websites.
Beautiful Soup provides Pythonic idioms for navigating the parse tree, including iteration, attribute access, and CSS selector support. Combined with requests for HTTP, it forms the foundation of most Python web scraping workflows.
COMMON METHODS
BeautifulSoup(markup, parser)
Parse HTML/XML markup with specified parser.soup.find(name, attrs)
Find first matching element.soup.find_all(name, attrs)
Find all matching elements.soup.select(css_selector)
Find elements using CSS selectors.element.get_text()
Extract text content from element.element['attribute']
Access element attribute value.element.string
Get direct string content of element.soup.prettify()
Return formatted HTML string.
PARSERS
html.parser
Built-in Python parser (no extra dependencies).lxml
Fast C-based parser (requires lxml package).lxml-xml
XML parser using lxml.html5lib
Browser-like parsing (requires html5lib package).
INSTALLATION
CAVEATS
Beautiful Soup is for parsing only; it doesn't fetch web pages (use requests or httpx). JavaScript-rendered content requires a browser automation tool like Selenium. The default html.parser is slower than lxml. Web scraping may violate website terms of service; always check robots.txt and rate-limit requests.
HISTORY
Beautiful Soup was created by Leonard Richardson and first released in 2004. Version 4 (bs4), released in 2012, added support for multiple parsers and Python 3. The library became one of the most popular Python tools for web scraping, valued for its simple API and ability to handle poorly-formed HTML. It remains actively maintained and widely used.


