feat: Add AiCrawler with AI-powered HTML extraction #1964

Open
Mantisus wants to merge 6 commits into apify:master from Mantisus:llm-html-crawler

@Mantisus Mantisus commented Jun 14, 2026

Collaborator

Description

  • Adds AiCrawler - a new HTTP crawler that parses pages with parsel and uses pydantic-ai as the layer for LLM interaction.
  • AiHtmlDistiller is a protocol for distillers that clean HTML and convert it to a compact format (e.g., cleaned HTML, Markdown) for an LLM.
    • AiCleanHtmlDistiller removes comments, noisy attributes, and scripts, returning a compact HTML version.
    • AiSkeletonDistiller extends AiCleanHtmlDistiller by truncating text and collapsing repeated siblings.
  • AiHtmlExtractor is a protocol for extractors that turn a page into structured data using a distiller and an LLM.
    • AiDirectExtractor sends the distilled page to an LLM together with a Pydantic schema describing the target data and returns the validated result.
    • AiSelectorExtractor asks the LLM for CSS selectors once and caches them in a KeyValueStore, so later pages are extracted without an LLM call.
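
The two ideas above (distilling HTML before it reaches the LLM, and asking the LLM for selectors only once) can be illustrated with a minimal standard-library sketch. Every name here (`HtmlDistiller`, `CleanDistiller`, `SelectorCache`, `ask_llm`, the attribute allow-list) is hypothetical and chosen for illustration; it is not the API added in this PR, which builds on parsel, pydantic-ai, and a KeyValueStore. A plain `dict` stands in for the store and a plain callable stands in for the LLM call.

```python
from __future__ import annotations

from html import escape
from html.parser import HTMLParser
from typing import Callable, Protocol


class HtmlDistiller(Protocol):
    """Hypothetical stand-in for the distiller protocol described above."""

    def distill(self, html: str) -> str: ...


class _CleanParser(HTMLParser):
    KEEP_ATTRS = frozenset({"class", "id", "href"})  # assumed allow-list
    SKIP_TAGS = frozenset({"script", "style"})

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag: str, attrs: list) -> None:
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
            return
        if self._skip_depth:
            return
        # Keep only allow-listed attributes; drop style, onclick, data-*, etc.
        kept = "".join(
            f' {name}="{escape(value)}"'
            for name, value in attrs
            if name in self.KEEP_ATTRS and value is not None
        )
        self.parts.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag: str) -> None:
        if tag in self.SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if not self._skip_depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(escape(text))

    # Comments are dropped because handle_comment is left as the default no-op.


class CleanDistiller:
    """Removes comments, scripts, styles, and noisy attributes."""

    def distill(self, html: str) -> str:
        parser = _CleanParser()
        parser.feed(html)
        parser.close()  # flush any buffered text
        return "".join(parser.parts)


class SelectorCache:
    """Asks for selectors once per key, then reuses the stored answer."""

    def __init__(
        self,
        ask_llm: Callable[[str], dict[str, str]],
        store: dict[str, dict[str, str]] | None = None,
    ) -> None:
        self._ask_llm = ask_llm  # stand-in for a real LLM call
        self._store = {} if store is None else store  # stand-in for a KeyValueStore

    def get(self, key: str, distilled_html: str) -> dict[str, str]:
        if key not in self._store:
            self._store[key] = self._ask_llm(distilled_html)
        return self._store[key]
```

In this sketch, calling `SelectorCache.get` twice with the same key invokes `ask_llm` only on the first call; the second page reuses the cached selectors, which mirrors the "later pages are extracted without an LLM call" behaviour described for AiSelectorExtractor. How the PR derives cache keys and handles stale selectors is not shown in the description, so it is left out here.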

Issues

Testing

  • Added new unit tests for AiCrawler, AiCleanHtmlDistiller, AiSkeletonDistiller, AiDirectExtractor, and AiSelectorExtractor.

@Mantisus Mantisus self-assigned this Jun 15, 2026
@Mantisus Mantisus marked this pull request as ready for review June 17, 2026 19:27
Development

Successfully merging this pull request may close these issues.

Add support for AI/LLM-based HTML parsing (selectors)

2 participants