gh-121109: Fix performance of tarfile reading with "r|*" by TomiBelan · Pull Request #121296 · python/cpython

TomiBelan · 2024-07-02T20:53:25Z

This PR fixes #121109.

Using the test files and test script described in the issue:

| filename | mode | time with PR |
|---|---|---|
| `test.tar.gz` | `r:*` | 1.075s |
| `test.tar.gz` | `r\|*` | 0.812s |
| `test.tar.xz` | `r:*` | 1.066s |
| `test.tar.xz` | `r\|*` | 1.053s |
| `test.tar.bz2` | `r:*` | 0.913s |
| `test.tar.bz2` | `r\|*` | 0.896s |

After this PR, `tf.list()` with `r|*` takes about the same time as with `r:*`, as expected, rather than being orders of magnitude slower.
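
For readers unfamiliar with the two modes, here is a minimal sketch (not the test script from the issue) that builds a small gzip-compressed archive in memory and lists it once in stream mode (`r|*`) and once in seekable mode (`r:*`). The helper names `build_archive` and `member_names` and the archive contents are illustrative assumptions.

```python
import io
import tarfile


def build_archive(n_files=3, size=1024):
    """Build a small .tar.gz in memory (hypothetical helper for this sketch)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for i in range(n_files):
            info = tarfile.TarInfo(name=f"data/{i}.dat")
            info.size = size
            tf.addfile(info, io.BytesIO(b"x" * size))
    return buf.getvalue()


def member_names(archive, mode):
    """Open the archive bytes in the given mode and collect member names."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode=mode) as tf:
        return [member.name for member in tf]


archive = build_archive()
names_stream = member_names(archive, "r|*")  # stream mode: sequential reads only
names_seek = member_names(archive, "r:*")    # seekable mode: may seek in the file
print(names_stream == names_seek)
```

Both modes should see the same members; the PR is only about how long the stream mode takes to get there.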

Issue: tarfile "r|*" (stream mode) is much slower than "r:*" #121109

ghost · 2024-07-02T20:53:27Z

All commit authors signed the Contributor License Agreement.

danifus · 2024-07-04T02:21:38Z

+        if len(t) > size:
+            raise ReadError("decompress() returned too much data")


Do you see any scenario where this would be triggered? Looking at the zlib, bz2 and lzma decompressor docs for max_length, it looks like this shouldn't occur?

I have no issue with it being here but just checking if I'm missing something :)

TomiBelan · 2024-07-04

Right - it can only happen if there is a bug in the zlib, bz2 or lzma decompressor. I haven't checked their C source, but their docs say it should not occur.
It's not a necessary check, but I figured I'd add it just in case.
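
The documented `max_length` behaviour discussed above can be checked with a short sketch like the one below. It is only an illustration, not part of the patch: the `capped_outputs` helper, the payload and the cap of 1000 bytes are assumptions chosen for the example.

```python
import bz2
import lzma
import zlib

payload = b"x" * 100_000  # highly compressible test data
size = 1000               # cap on the bytes returned by one decompress() call


def capped_outputs(payload, size):
    """Decompress with a max_length cap using each stdlib decompressor."""
    return {
        "zlib": zlib.decompressobj().decompress(zlib.compress(payload), size),
        "bz2": bz2.BZ2Decompressor().decompress(
            bz2.compress(payload), max_length=size),
        "lzma": lzma.LZMADecompressor().decompress(
            lzma.compress(payload), max_length=size),
    }


outputs = capped_outputs(payload, size)
for name, out in outputs.items():
    # Per the docs, no output should be longer than the requested cap.
    print(name, len(out), len(out) <= size)
```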

github-actions · 2026-04-17T06:37:01Z

This PR is stale because it has been open for 30 days with no activity.

TomiBelan · 2026-04-17T11:35:42Z

My dearest stale bot, I wish it was only 30 days! 😢

TomiBelan · 2026-05-23T21:30:59Z

Re @serhiy-storchaka in #121109 (comment):

> @TomiBelan, could you please test how your change affects the case of reading files byte-by-byte or by small chunks.

It's not needed. Small reads are already well-exercised by test_tarfile.py, especially StreamReadTest. When I add a print() to _Stream._read(), it shows a variety of size values during the test, e.g. 0, 1, 512, 4096, 7011, 10239, 10240.

This becomes clearer when you realize _Stream is just the outer shell, and the tar format parser itself also needs to read small chunks sometimes.
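
To make the small-read case concrete, here is a minimal sketch of a reader that serves arbitrary `read(size)` calls from a zlib stream by passing the remaining byte count as `max_length`. It is not the code from this PR and not tarfile's `_Stream`; the class name `CappedReader`, its attributes and the 64-byte input chunk size are hypothetical.

```python
import io
import zlib


class CappedReader:
    """Hypothetical reader: never decompresses more than the caller asked for."""

    def __init__(self, fileobj, bufsize=16 * 1024):
        self.fileobj = fileobj
        self.bufsize = bufsize
        self.decomp = zlib.decompressobj()
        self.pending = b""  # compressed input not yet consumed

    def read(self, size):
        chunks = []
        remaining = size
        while remaining > 0 and not self.decomp.eof:
            if not self.pending:
                self.pending = self.fileobj.read(self.bufsize)
            had_input = bool(self.pending)
            # max_length=remaining caps the output of this call.
            out = self.decomp.decompress(self.pending, remaining)
            self.pending = self.decomp.unconsumed_tail
            if len(out) > remaining:
                raise ValueError("decompress() returned too much data")
            if not out and not had_input:
                break  # input exhausted and nothing left to drain
            chunks.append(out)
            remaining -= len(out)
        return b"".join(chunks)


data = b"x" * 100_000 + bytes(range(256))
reader = CappedReader(io.BytesIO(zlib.compress(data)), bufsize=64)
parts = [reader.read(n) for n in (1, 512, 4096, 10_240)]
rest = reader.read(len(data))  # ask for more than is left
print([len(p) for p in parts], len(rest))
```

Leftover compressed input is kept in `pending` (taken from `unconsumed_tail`), so a 1-byte read and a 10240-byte read go through the same path.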

But all right. I also tested it with this script, which succeeded.

rm -rf data; mkdir data; for i in 1 2 3; do head -c1M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test1M.tar.gz data ; tar caf test1M.tar.xz data ; tar caf test1M.tar.bz2 data ; tar caf test1M.tar.zst data
rm -rf data; mkdir data; for i in 1 2 3; do head -c100M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test100M.tar.gz data ; tar caf test100M.tar.xz data ; tar caf test100M.tar.bz2 data ; tar caf test100M.tar.zst data

import sys
import tarfile
for filename in ('test1M.tar.gz', 'test1M.tar.xz', 'test1M.tar.bz2', 'test1M.tar.zst'):
    for mode in ('r|*', 'r:*'):
        for chunk_size in (1, 10000, 500000):
            print('running:', filename, mode, chunk_size, file=sys.stderr)
            with tarfile.open(filename, mode) as tf:
                for tarinfo in tf:
                    if tarinfo.isreg():
                        with tf.extractfile(tarinfo) as extractf:
                            total = 0
                            while True:
                                buf = extractf.read(chunk_size)
                                if not buf: break
                                total += len(buf)
                                assert buf == b'x' * len(buf)
                                assert len(buf) == chunk_size or total == tarinfo.size

Full disclosure: this script does what you asked for, but it isn't actually a very good test. extractfile() returns an io.BufferedReader, so the 1-byte read and the 10000-byte read both become 131072-byte reads.
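
The buffering effect can be seen in isolation with a small sketch. The `RecordingRaw` class and the explicit `buffer_size=8192` are assumptions for illustration; they are not what tarfile uses internally.

```python
import io


class RecordingRaw(io.RawIOBase):
    """Hypothetical raw stream that records the size of every raw read."""

    def __init__(self, data):
        super().__init__()
        self._buf = io.BytesIO(data)
        self.requested = []

    def readable(self):
        return True

    def readinto(self, b):
        self.requested.append(len(b))  # size the buffered layer asked for
        return self._buf.readinto(b)


raw = RecordingRaw(b"x" * 50_000)
reader = io.BufferedReader(raw, buffer_size=8192)

first = reader.read(1)   # triggers one raw read that fills the whole buffer
second = reader.read(1)  # served from the buffer, no new raw read
print(raw.requested)
```

The caller asks for one byte, but the underlying stream only ever sees buffer-sized requests, which is why varying `chunk_size` in the script above does not change what `_Stream` is asked to read.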

And a benchmark:

import sys
import time
import tarfile
for filename in ('test100M.tar.gz', 'test100M.tar.xz', 'test100M.tar.bz2', 'test100M.tar.zst'):
    for mode in ('r|*', 'r:*'):
        print('running:', filename, mode, file=sys.stderr)
        start = time.time()
        with tarfile.open(filename, mode) as tf:
            tf.list()
        print('took', time.time() - start, file=sys.stderr)

In the order the script runs (gz, xz, bz2, zst; `r|*` then `r:*` for each), I got 1.3, 1.2, 1.9, 1.5, 1.1, 1.1, 0.2, 0.2 seconds. (This is a different machine than last time.)

TomiBelan · 2026-05-23T21:52:38Z

I made some changes:

Rebased to main.
Updated the zstd case, which didn't exist when this PR was created. (It has been waiting for review for almost 2 years...)
I rewrote the PR because I find my original patch hard to understand. 😳 This new version completely separates the gzip and non-gzip case. It's longer overall, and some bits are duplicated, but "explicit is better than implicit" - I hope it's clearer and easier to review.
HOWEVER: If a reviewer prefers the old patch (24110fb), I'd be happy to revert e61bccf.

TomiBelan requested a review from ethanfurman as a code owner July 2, 2024 20:53

bedevere-app Bot mentioned this pull request Jul 2, 2024

tarfile "r|*" (stream mode) is much slower than "r:*" #121109

Open

bedevere-app Bot added the awaiting review label Jul 2, 2024

danifus approved these changes Jul 4, 2024

bedevere-app Bot added awaiting core review and removed awaiting review labels Jul 4, 2024

github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 17, 2026

github-actions Bot removed the stale Stale PR or inactive for long period of time. label May 13, 2026

TomiBelan added 2 commits May 23, 2026 21:11

Fix performance of tarfile reading with "r|*"

24110fb

Merge zstd additions

42a7a3d

Refactor by splitting gzip and non-gzip branch

e61bccf

TomiBelan force-pushed the slowtar branch from 5aa9c08 to e61bccf Compare May 23, 2026 21:35

TomiBelan wants to merge 3 commits into python:main from TomiBelan:slowtar
