
fix: preserve whitespace around inline children in JsonCssExtractionStrategy #1985

Open
andri-didych wants to merge 1 commit into unclecode:develop from andri-didych:fix/jsoncss-text-whitespace

@andri-didych

Summary

`JsonCssExtractionStrategy._get_element_text` calls
`element.get_text(strip=True)` without a separator. With nested inline
tags (`<span>foo <b>bar</b> baz</span>`) BeautifulSoup strips each
text node and concatenates them with no separator, yielding
`foobarbaz`. This silently corrupts product names, headlines, and any
other text field whose selector matches an element with inline
children, which is a very common pattern (`<b>` brand spans,
`<strong>` emphasis, `<sup>`/`<sub>`, etc.).

This PR switches the call to `get_text(separator=" ", strip=True)` so
bs4 normalizes inter-node gaps to a single space. The sibling
`JsonLxmlExtractionStrategy._get_element_text` already does the same
thing via `" ".join(...)` of every text node, so this change aligns
the CSS strategy with that behavior.
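The effect of the separator can be illustrated without bs4 installed. The following is a minimal stdlib-only sketch that models the behavior described above (strip each text node, drop empty ones, join with the separator); `TextNodeCollector` and `join_text` are hypothetical helpers for illustration, not part of crawl4ai or BeautifulSoup.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects raw text nodes in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once per run of text between tags.
        self.nodes.append(data)


def join_text(html, separator):
    """Strip each text node, drop empty ones, join with `separator`.

    Assumption: this models get_text(separator=..., strip=True) as
    described in the summary; it is not the bs4 implementation.
    """
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    stripped = (node.strip() for node in collector.nodes)
    return separator.join(s for s in stripped if s)


html = "<span>foo <b>bar</b> baz</span>"
without_separator = join_text(html, "")   # models get_text(strip=True)
with_separator = join_text(html, " ")     # models get_text(separator=" ", strip=True)
print(without_separator)
print(with_separator)
```

The text nodes here are `"foo "`, `"bar"`, and `" baz"`; stripping each one before joining is what discards the spaces that separated the words.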

Reproducer

from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

html = '<html><body><span class="name">Wireless <b>Logitech</b> Mouse M325</span></body></html>'
schema = {
    "baseSelector": "body",
    "fields": [{"name": "name", "selector": "span.name", "type": "text"}],
}
[record] = JsonCssExtractionStrategy(schema).extract(url="x", html_content=html)

# before:  record["name"] == "WirelessLogitechMouse M325"
# after:   record["name"] == "Wireless Logitech Mouse M325"

Behavior change

Apart from the fix itself, the only other observable difference is for
elements whose children sit adjacent with no whitespace between them,
e.g. `<p><b>a</b><b>b</b></p>`:

  • before: "ab"
  • after: "a b"

A browser renders this markup as "ab", so the new output adds a space
that the rendered text does not have. It does match the existing
`JsonLxmlExtractionStrategy` behavior, and it is worth calling out for
anyone with snapshots that depend on the old concatenation. Whitespace
around inline tags (the common case) is now preserved correctly.
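For comparison, the `" ".join(...)` approach attributed to the lxml strategy can be sketched with the standard library. This is an illustration only: it uses `xml.etree.ElementTree` as a stand-in for lxml, `joined_text` is a hypothetical name, and stripping each node before joining is an assumption, since the actual `JsonLxmlExtractionStrategy` code is not shown in this PR.

```python
import xml.etree.ElementTree as ET


def joined_text(markup):
    """Join every non-empty, stripped text node with a single space.

    Assumption: approximates the '" ".join(...) over every text node'
    behavior described for the lxml strategy.
    """
    root = ET.fromstring(markup)
    # itertext() yields the text and tail strings in document order.
    parts = (text.strip() for text in root.itertext())
    return " ".join(part for part in parts if part)


# Adjacent inline children with no whitespace between them.
adjacent = joined_text("<p><b>a</b><b>b</b></p>")
# Inline child surrounded by whitespace (the common case).
spaced = joined_text('<span class="name">Wireless <b>Logitech</b> Mouse M325</span>')
print(adjacent)
print(spaced)
```

Under this model, adjacent children always gain a space, which is the behavior change described above.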

Test plan

  • New regression test in `tests/test_jsoncss_text_whitespace.py` exercises the strategy's `.extract()` entrypoint directly (no fixtures, no network) and pins the corrected behavior.
  • `tests/test_source_sibling_selector.py`: all 14 pre-existing CSS-strategy tests still pass.
  • Verified the new test fails on `develop` without the fix and passes with it.
  • `black --check` clean on changed files.

…trategy

`JsonCssExtractionStrategy._get_element_text` calls
`element.get_text(strip=True)` without a separator. With nested
inline tags (`<span>foo <b>bar</b> baz</span>`) BeautifulSoup
strips each text node and concatenates them with no separator,
yielding `foobarbaz`. This silently corrupts product names,
headlines, and any other text field whose selector matches an
element with inline children.

Pass `separator=" ", strip=True` instead — bs4 normalizes inter-node
gaps to a single space and trims leading/trailing whitespace.
Mirrors what the sibling `JsonLxmlExtractionStrategy` already does
via `" ".join(...)` over every text node.

Add a regression test under `tests/` that exercises the scenario
directly through the strategy's `extract()` entrypoint.