Description
When using xdk.Client.stream.posts(...) for the Filtered Stream, the call crashes with UnicodeDecodeError whenever a TCP chunk boundary cuts a multi-byte UTF-8 character (emoji, accents, CJK) in the middle.
This is a classic streaming UTF-8 issue: chunk.decode('utf-8') is called per chunk without using an IncrementalDecoder, so any byte sequence split across two chunks raises the error.
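The failure mode can be shown offline, without the network. The following is a minimal sketch, not xdk's actual code: `decode_per_chunk` and `decode_incrementally` are hypothetical names, and the two-chunk split stands in for whatever boundary the transport happens to produce.

```python
import codecs

def decode_per_chunk(chunks):
    """Decode every chunk on its own; fails when a character is split."""
    return "".join(chunk.decode("utf-8") for chunk in chunks)

def decode_incrementally(chunks):
    """Keep incomplete trailing bytes until the next chunk arrives."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    text = "".join(decoder.decode(chunk) for chunk in chunks)
    # final=True raises if the stream ends in the middle of a character.
    return text + decoder.decode(b"", final=True)

payload = "café".encode("utf-8")      # b"caf\xc3\xa9"
chunks = [payload[:4], payload[4:]]   # boundary falls between 0xc3 and 0xa9

try:
    decode_per_chunk(chunks)
    per_chunk_error = None
except UnicodeDecodeError as exc:
    per_chunk_error = exc.reason      # same reason string as in the logs

recovered = decode_incrementally(chunks)
```

The per-chunk variant fails on the first chunk because it ends on a lead byte with no continuation byte; the incremental decoder buffers that byte and emits the character once the second chunk arrives.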
Reproduction
Hard to reproduce on demand (depends on TCP chunking + a tweet containing a multi-byte char around the chunk boundary), but in production it happens regularly (~44 crashes/day for a stream watching ~1500 tweets/day in French — emojis are very common).
Code that triggers it:
from xdk import Client
from xdk.streaming import StreamConfig

client = Client(bearer_token=BEARER)
cfg = StreamConfig(max_retries=999999, initial_backoff=1.0)
for resp in client.stream.posts(tweet_fields=[...], stream_config=cfg):
    # ...
Error logs
[DISCONNECT] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1023: unexpected end of data
Non-retryable error: Unexpected error: 'utf-8' codec can't decode byte 0xe2 in position 1023: unexpected end of data
Position is frequently 1023 (suggesting a 1024-byte read buffer is being decoded as-is), but I also see positions 572, 754, 63-65.
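Affected bytes: 0xe2 (lead byte of the em-dash, the ellipsis, and many symbol emoji), 0xc3 (accented Latin), 0xef (BOM, emoji variation selector U+FE0F), 0xec (Hangul, i.e. the K in CJK), 0xf0 (4-byte emoji). All of these are UTF-8 lead bytes, which is consistent with a chunk ending before the continuation bytes arrive.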
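To make the two observations above concrete, here is a small sketch. It assumes a 1024-byte read size (inferred only from the position 1023 in the logs, not confirmed from xdk's source), and the helper names are hypothetical. It maps a lead byte to the length of the sequence it starts, and shows that a read ending on a lead byte fails exactly at the last offset of the buffer.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence started by lead_byte."""
    if lead_byte < 0x80:
        return 1                      # ASCII
    if 0xC2 <= lead_byte <= 0xDF:
        return 2                      # e.g. 0xc3: accented Latin
    if 0xE0 <= lead_byte <= 0xEF:
        return 3                      # e.g. 0xe2, 0xec, 0xef
    if 0xF0 <= lead_byte <= 0xF4:
        return 4                      # e.g. 0xf0: 4-byte emoji
    raise ValueError(f"0x{lead_byte:02x} is not a valid UTF-8 lead byte")

READ_SIZE = 1024  # assumption: buffer size suggested by the logged position

def truncation_offset(chunk: bytes):
    """Offset where decoding fails, or None if the chunk decodes cleanly."""
    try:
        chunk.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None

# 1023 ASCII bytes followed by an ellipsis (e2 80 a6): the first read
# contains only the lead byte of the ellipsis.
stream = b"a" * 1023 + "\u2026".encode("utf-8")
first_read = stream[:READ_SIZE]
offset = truncation_offset(first_read)
```

Every byte listed in the logs starts a sequence of two or more bytes, so each one can be the last byte of a read while the rest of the character is still in flight.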
Versions tested
- xdk 0.9.0 (2026-02-28): crash
- xdk 0.8.1 (2026-02-12): same crash
Both versions have the same streaming bug.
Suggested fix
Use a codecs.getincrementaldecoder('utf-8')() (or higher-level io.TextIOWrapper / httpx.Response.iter_lines() pattern) instead of decoding each chunk independently.
Pseudo-fix:
import codecs
import json

def iter_json_lines(raw_byte_stream):
    decoder = codecs.getincrementaldecoder('utf-8')()
    buf = ""
    for chunk in raw_byte_stream:
        buf += decoder.decode(chunk)  # handles split multi-byte chars
        while "\n" in buf:
            line, buf = buf.split("\n", 1)
            if line.strip():  # skip blank keep-alive lines
                yield json.loads(line)
Workaround currently in use
Bypass xdk for streaming and call iter_lines() directly on the response returned by httpx.Client.stream("GET", url). This works because httpx uses an incremental decoder internally for iter_lines/iter_text. xdk is still useful for the REST API.
Environment
- Python 3.13
- Debian (systemd service)
- X API v2 Filtered Stream endpoint (/2/tweets/search/stream)