fix: preserve whitespace around inline children in JsonCssExtractionStrategy by andri-didych · Pull Request #1985 · unclecode/crawl4ai

andri-didych · 2026-05-26T11:27:31Z

Summary

JsonCssExtractionStrategy._get_element_text calls
element.get_text(strip=True) without a separator. When the matched
element has nested inline tags, BeautifulSoup strips each text node and
concatenates the results with no separator, so text that renders as
foo bar baz is extracted as foobarbaz. This silently corrupts product
names, headlines, and any other text field whose selector matches an
element with inline children, which is a very common pattern (brand
spans, emphasis, etc.).

This PR switches the call to get_text(separator=" ", strip=True) so
bs4 normalizes inter-node gaps to a single space. The sibling
JsonLxmlExtractionStrategy._get_element_text already does the same
thing via " ".join(...) of every text node — this aligns the CSS
strategy with that behavior.
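
The strip-then-join rule described above can be modeled without either library. The sketch below is a stdlib-only approximation, not the crawl4ai or BeautifulSoup code: `TextNodeCollector` and `joined_text` are hypothetical names, and the model assumes text nodes are stripped individually, empty results are dropped, and the rest are joined with the given separator.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Hypothetical helper: records raw text nodes in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def joined_text(html, separator):
    """Strip every text node, drop empty ones, join with `separator`."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()  # flush any buffered trailing text
    stripped = (node.strip() for node in collector.nodes)
    return separator.join(text for text in stripped if text)


snippet = '<span class="name">Wireless <b>Logitech</b> Mouse M325</span>'
old_style = joined_text(snippet, "")   # models get_text(strip=True)
new_style = joined_text(snippet, " ")  # models get_text(separator=" ", strip=True)
print(old_style)  # WirelessLogitechMouse M325
print(new_style)  # Wireless Logitech Mouse M325
```

Because stripping happens per text node, the space that originally sat next to the inline tag is lost before the join, which is why the separator has to put it back.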

Reproducer

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

html = '<html><body><span class="name">Wireless <b>Logitech</b> Mouse M325</span></body></html>'
schema = {
    "baseSelector": "body",
    "fields": [{"name": "name", "selector": "span.name", "type": "text"}],
}
[record] = JsonCssExtractionStrategy(schema).extract(url="x", html_content=html)

# before:  record["name"] == "WirelessLogitechMouse M325"
# after:   record["name"] == "Wireless Logitech Mouse M325"

Behavior change

Apart from the fix itself, the only observable difference is for
elements whose inline children sit adjacent with no whitespace between
them, e.g. two sibling tags containing a and b:

before: "ab"
after: "a b"

This matches the rendered text and the existing
JsonLxmlExtractionStrategy behavior, but is worth calling out for
anyone with snapshots that depend on the old concatenation. Whitespace
around inline tags (the common case) is now preserved correctly.
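
For anyone auditing snapshots after this change, one possible check is whether an old and a new value differ only in whitespace. This is a minimal sketch, assuming snapshots are plain strings; `differs_only_by_whitespace` is a hypothetical helper and not part of crawl4ai.

```python
def differs_only_by_whitespace(old: str, new: str) -> bool:
    """Return True when both strings are equal once all whitespace is removed."""
    return "".join(old.split()) == "".join(new.split())


# Values taken from the examples in this PR description.
print(differs_only_by_whitespace("ab", "a b"))
print(differs_only_by_whitespace(
    "WirelessLogitechMouse M325", "Wireless Logitech Mouse M325"
))
```

A snapshot pair that passes this check is probably affected only by the separator change, while a pair that fails it differs for some other reason and deserves a closer look.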

Test plan

New regression test in tests/test_jsoncss_text_whitespace.py exercises the strategy's .extract() entrypoint directly (no fixtures, no network) and pins the corrected behavior.
tests/test_source_sibling_selector.py — all 14 pre-existing CSS-strategy tests still pass.
Verified the new test fails on develop without the fix and passes with it.
black --check clean on changed files.

…trategy

`JsonCssExtractionStrategy._get_element_text` calls `element.get_text(strip=True)` without a separator. When the matched element has nested inline tags, BeautifulSoup strips each text node and concatenates the results with no separator, so text that renders as `foo bar baz` is extracted as `foobarbaz`. This silently corrupts product names, headlines, and any other text field whose selector matches an element with inline children.

Pass `separator=" ", strip=True` instead: bs4 normalizes inter-node gaps to a single space and trims leading/trailing whitespace. This mirrors what the sibling `JsonLxmlExtractionStrategy` already does via `" ".join(...)` over every text node.

Add a regression test under `tests/` that exercises the scenario directly through the strategy's `extract()` entrypoint.

andri-didych wants to merge 1 commit into unclecode:develop from andri-didych:fix/jsoncss-text-whitespace
