Skip to content

gh-121109: Fix performance of tarfile reading with "r|*"#121296

Open
TomiBelan wants to merge 3 commits into
python:mainfrom
TomiBelan:slowtar
Open

gh-121109: Fix performance of tarfile reading with "r|*"#121296
TomiBelan wants to merge 3 commits into
python:mainfrom
TomiBelan:slowtar

Conversation

@TomiBelan
Copy link
Copy Markdown

@TomiBelan TomiBelan commented Jul 2, 2024

This PR fixes #121109.

Using the test files and test script described in the issue:

filename mode time with PR
test.tar.gz r:* 1.075s
test.tar.gz r|* 0.812s
test.tar.xz r:* 1.066s
test.tar.xz r|* 1.053s
test.tar.bz2 r:* 0.913s
test.tar.bz2 r|* 0.896s

After this PR, tf.list() of r|* is the same speed as r:*, as expected. Not orders of magnitude slower.

@TomiBelan TomiBelan requested a review from ethanfurman as a code owner July 2, 2024 20:53
@ghost
Copy link
Copy Markdown

ghost commented Jul 2, 2024

All commit authors signed the Contributor License Agreement.
CLA signed

Comment thread Lib/tarfile.py
Comment on lines +555 to +556
if len(t) > size:
raise ReadError("decompress() returned too much data")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you see any scenario where this would be triggered? Looking at the zlib, bz2 and lzma decompressor docs for max_length, it looks like this shouldn't occur?

I have no issue with it being here but just checking if I'm missing something :)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - it can only happen if there is a bug in the zlib, bz2 or lzma decompressor. I haven't checked their C source, but their docs say it should not occur.
It's not a necessary check, but I figured I'd add it just in case.

@github-actions
Copy link
Copy Markdown

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions Bot added the stale Stale PR or inactive for long period of time. label Apr 17, 2026
@TomiBelan
Copy link
Copy Markdown
Author

My dearest stale bot, I wish it was only 30 days! 😢

@github-actions github-actions Bot removed the stale Stale PR or inactive for long period of time. label May 13, 2026
@TomiBelan
Copy link
Copy Markdown
Author

Re @serhiy-storchaka #121109 (comment)

@TomiBelan, could you please test how your change affects the case of reading files byte-by-byte or by small chunks.

It's not needed. Small reads are already well-exercised by test_tarfile.py, especially StreamReadTest. When I add a print() to _Stream._read(), it shows a variety of size values during the test, e.g. 0, 1, 512, 4096, 7011, 10239, 10240.

This becomes clearer when you realize _Stream is just the outer shell, and the tar format parser itself also needs to read small chunks sometimes.

But all right. I also tested it with this script, which succeeded.

rm -rf data; mkdir data; for i in 1 2 3; do head -c1M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test1M.tar.gz data ; tar caf test1M.tar.xz data ; tar caf test1M.tar.bz2 data ; tar caf test1M.tar.zst data
rm -rf data; mkdir data; for i in 1 2 3; do head -c100M /dev/zero | tr '\0' 'x' > data/$i.dat; done
tar caf test100M.tar.gz data ; tar caf test100M.tar.xz data ; tar caf test100M.tar.bz2 data ; tar caf test100M.tar.zst data
import sys
import tarfile
for filename in ('test1M.tar.gz', 'test1M.tar.xz', 'test1M.tar.bz2', 'test1M.tar.zst'):
    for mode in ('r|*', 'r:*'):
        for chunk_size in (1, 10000, 500000):
            print('running:', filename, mode, chunk_size, file=sys.stderr)
            with tarfile.open(filename, mode) as tf:
                for tarinfo in tf:
                    if tarinfo.isreg():
                        with tf.extractfile(tarinfo) as extractf:
                            total = 0
                            while True:
                                buf = extractf.read(chunk_size)
                                if not buf: break
                                total += len(buf)
                                assert buf == b'x' * len(buf)
                                assert len(buf) == chunk_size or total == tarinfo.size

Full disclosure: this script does what you asked for, but it actually isn't a very good test. extractfile() returns a io.BufferedReader. So the 1 byte read and the 10000 byte read both become 131072 byte reads.

And a benchmark:

import sys
import time
import tarfile
for filename in ('test100M.tar.gz', 'test100M.tar.xz', 'test100M.tar.bz2', 'test100M.tar.zst'):
    for mode in ('r|*', 'r:*'):
        print('running:', filename, mode, file=sys.stderr)
        start = time.time()
        with tarfile.open(filename, mode) as tf:
            tf.list()
        print('took', time.time() - start, file=sys.stderr)

I got 1.3, 1.2, 1.9, 1.5, 1.1, 1.1, 0.2, 0.2 seconds. (This is a different machine than last time.)

@TomiBelan
Copy link
Copy Markdown
Author

I made some changes:

  1. Rebased to main.
  2. Updated the zstd case, which didn't exist when this PR was created. (It's waiting for review almost 2 years...)
  3. I rewrote the PR because I find my original patch hard to understand. 😳 This new version completely separates the gzip and non-gzip case. It's longer overall, and some bits are duplicated, but "explicit is better than implicit" - I hope it's clearer and easier to review.
    HOWEVER: If a reviewer prefers the old patch (24110fb), I'd be happy to revert e61bccf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tarfile "r|*" (stream mode) is much slower than "r:*"

2 participants