fix: preserve whitespace around inline children in JsonCssExtractionStrategy #1985
Open
andri-didych wants to merge 1 commit into
…trategy `JsonCssExtractionStrategy._get_element_text` calls `element.get_text(strip=True)` without a separator. With nested inline tags (`<span>foo <b>bar</b> baz</span>`) BeautifulSoup strips each text node and concatenates them with no separator, yielding `foobarbaz`. This silently corrupts product names, headlines, and any other text field whose selector matches an element with inline children. Pass `separator=" ", strip=True` instead — bs4 normalizes inter-node gaps to a single space and trims leading/trailing whitespace. Mirrors what the sibling `JsonLxmlExtractionStrategy` already does via `" ".join(...)` over every text node. Add a regression test under `tests/` that exercises the scenario directly through the strategy's `extract()` entrypoint.
Summary
`JsonCssExtractionStrategy._get_element_text` calls `element.get_text(strip=True)` without a separator. With nested inline tags (`<span>foo <b>bar</b> baz</span>`) BeautifulSoup strips each text node and concatenates them with no separator, yielding `foobarbaz`. This silently corrupts product names, headlines, and any other text field whose selector matches an element with inline children, a very common pattern (`<b>` brand spans, `<strong>` emphasis, `<sup>`/`<sub>`, etc.).

This PR switches the call to `get_text(separator=" ", strip=True)` so bs4 normalizes inter-node gaps to a single space. The sibling `JsonLxmlExtractionStrategy._get_element_text` already does the same thing via `" ".join(...)` of every text node; this aligns the CSS strategy with that behavior.
Reproducer
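A minimal sketch of the behavior described in the summary, calling BeautifulSoup directly rather than going through the strategy class. It assumes `bs4` is installed; the `html.parser` backend and the variable names are chosen for illustration only and are not taken from the PR.

```python
from bs4 import BeautifulSoup

# Element with an inline child, as in the summary's example.
html = "<span>foo <b>bar</b> baz</span>"
element = BeautifulSoup(html, "html.parser").find("span")

# Old call: each text node is stripped, then joined with an empty string.
before = element.get_text(strip=True)

# New call: each text node is stripped, then joined with a single space.
after = element.get_text(separator=" ", strip=True)

print(repr(before))
print(repr(after))
```

The two calls differ only in the `separator` argument, so the change is confined to how the stripped text nodes are joined.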
Behavior change
The only observable difference is for elements whose children sit adjacent with no whitespace between them, e.g. `<p><b>a</b><b>b</b></p>`: the old call returns `"ab"`, the new call returns `"a b"`.

This matches the rendered text and the existing `JsonLxmlExtractionStrategy` behavior, but is worth calling out for anyone with snapshots that depend on the old concatenation. Whitespace around inline tags (the common case) is now preserved correctly.
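The two joining rules can be compared side by side with a standard-library-only model. This is a sketch under the assumption that the extraction reduces to "strip each text node, drop empty ones, join with a separator"; `TextNodeCollector`, `text_nodes`, and `join_stripped` are hypothetical helpers, not part of bs4 or the strategy classes.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects raw text nodes in document order (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        self.nodes.append(data)


def text_nodes(html):
    parser = TextNodeCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def join_stripped(html, separator):
    # Strip every text node, drop the empty ones, then join with the separator.
    stripped = (node.strip() for node in text_nodes(html))
    return separator.join(s for s in stripped if s)


adjacent = "<p><b>a</b><b>b</b></p>"          # children with no gap between them
spaced = "<span>foo <b>bar</b> baz</span>"    # the common case

old_adjacent = join_stripped(adjacent, "")    # models the old concatenation
new_adjacent = join_stripped(adjacent, " ")   # models the space-joined result
old_spaced = join_stripped(spaced, "")
new_spaced = join_stripped(spaced, " ")

print(old_adjacent, "->", new_adjacent)
print(old_spaced, "->", new_spaced)
```

Under this model the adjacent-children case is the one where the new output gains a space that was not present in the markup, which is why snapshot-based tests may need updating.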
Test plan
- `tests/test_jsoncss_text_whitespace.py` exercises the strategy's `.extract()` entrypoint directly (no fixtures, no network) and pins the corrected behavior.
- `tests/test_source_sibling_selector.py`: all 14 pre-existing CSS-strategy tests still pass.
- The new test fails on `develop` without the fix and passes with it.
- `black --check` clean on changed files.