gh-51067: Add remove() and repack() to ZipFile#134627

Open
danny0838 wants to merge 72 commits into python:main from danny0838:gh-51067-2

danny0838 commented May 24, 2025

This is a revised version of PR #103033, implementing two new methods in zipfile.ZipFile: remove() and repack(), as suggested in this comment.

Features

ZipFile.remove(zinfo_or_arcname)

  • Removes a file entry (by providing a str path or ZipInfo) from the central directory.
  • If there are multiple file entries with the same path, only one is removed when a str path is provided.
  • Returns the removed ZipInfo instance.
  • Supported in modes: 'a', 'w', 'x'.
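The effect of a central-directory-only removal can be sketched on a stock Python without this PR. This is only an approximation: drop_from_central_directory() is a hypothetical stand-in written for illustration, not the code in this PR, and it relies on the private CPython attribute _didModify so that close() rewrites the central directory.

```python
import io
import zipfile


def drop_from_central_directory(zf, arcname):
    """Hypothetical stand-in for the proposed ZipFile.remove().

    Only the bookkeeping is touched; the local file entry stays in the file.
    """
    zinfo = zf.getinfo(arcname)
    zf.filelist.remove(zinfo)
    del zf.NameToInfo[arcname]
    # Assumption: private CPython detail that makes close() rewrite the
    # central directory from the remaining filelist.
    zf._didModify = True
    return zinfo


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:  # ZIP_STORED, so payloads stay readable
    zf.writestr("a.txt", b"payload-of-a")
    zf.writestr("b.txt", b"payload-of-b")
size_before = len(buf.getvalue())

with zipfile.ZipFile(buf, "a") as zf:
    removed = drop_from_central_directory(zf, "a.txt")

raw = buf.getvalue()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    names = zf.namelist()
    b_data = zf.read("b.txt")

stale_payload_present = b"payload-of-a" in raw
```

After the removal the archive no longer lists a.txt, but its local file entry is still present in the file; only the central directory got shorter. That leftover data is what repack() is meant to reclaim.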

ZipFile.repack(removed=None)

  • Physically removes stale local file entry data that is no longer referenced by the central directory.
  • Shrinks the archive file size.
  • If removed is passed (as a sequence of removed ZipInfos), only their corresponding local file entry data are removed.
  • Only supported in mode 'a'.

Rationales

Heuristics Used in repack()

Since stale data is not cleaned up at the time remove() is called, the header information of removed file entries may no longer be available when repack() runs, so it can be technically difficult to determine whether certain stale bytes really belong to previously removed files and are safe to remove.

While local file entries begin with the magic signature PK\x03\x04, this alone is not a reliable indicator. For instance, a self-extracting ZIP file may contain executable code before the actual archive, which could coincidentally include such a signature, especially if it embeds ZIP-based content.
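A small illustration of the signature problem, using only the public zipfile API. The stub bytes are made up for the example; a real self-extractor stub would of course look different.

```python
import io
import zipfile

SIGNATURE = b"PK\x03\x04"


def find_all(data, needle):
    """Return every offset at which needle occurs in data."""
    hits, pos = [], data.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = data.find(needle, pos + 1)
    return hits


# Made-up leading data that happens to contain the local file signature.
stub = b"#!fake-sfx-stub " + SIGNATURE + b" not a real entry\n"

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
archive = stub + buf.getvalue()

hits = find_all(archive, SIGNATURE)
with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    # Offsets of the entries the central directory actually references.
    referenced = {zi.header_offset for zi in zf.infolist()}

spurious = [h for h in hits if h not in referenced]
```

Here spurious contains the offset inside the stub: a naive signature scan would treat it as the start of a local file entry even though the central directory never referenced it.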

To safely reclaim space, repack() assumes that in a normal ZIP file, local file entries are stored consecutively:

  • File entries must not overlap.
    • If any entry’s data overlaps with the next, a BadZipFile error is raised and no changes are made.
  • There should be no extra bytes between entries (or between the last entry and the central directory):
    1. Data before the first referenced entry is removed only when it appears to be a sequence of consecutive entries with no extra following bytes; extra preceding bytes are preserved.
    2. Data between referenced entries is removed only when it appears to be a sequence of consecutive entries with no extra preceding bytes; extra following bytes are preserved.
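The "stored consecutively" assumption can be checked with a rough sketch like the following. It is not the algorithm in _ZipRepacker; find_gaps() and entry_spans() are hypothetical helpers, the sketch assumes entries without data descriptors, and it uses the undocumented ZipFile.start_dir attribute for the start of the central directory.

```python
import io
import struct
import zipfile

# Fixed 30-byte part of a local file header: signature, version, flags,
# method, time, date, CRC, compressed size, uncompressed size,
# file name length, extra field length.
LOCAL_HEADER = struct.Struct("<4s5H3L2H")


def entry_spans(data):
    """Return sorted (start, end, name) spans and the central directory offset."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        spans = []
        for zi in zf.infolist():
            fields = LOCAL_HEADER.unpack_from(data, zi.header_offset)
            if fields[0] != b"PK\x03\x04":
                raise zipfile.BadZipFile(f"bad local header for {zi.filename!r}")
            name_len, extra_len = fields[9], fields[10]
            # Assumption: no data descriptor follows the compressed data.
            end = (zi.header_offset + LOCAL_HEADER.size
                   + name_len + extra_len + zi.compress_size)
            spans.append((zi.header_offset, end, zi.filename))
        return sorted(spans), zf.start_dir  # start_dir is undocumented


def find_gaps(data):
    """Return byte ranges not covered by any referenced entry."""
    spans, cd_start = entry_spans(data)
    gaps, prev_end = [], 0
    for start, end, name in spans + [(cd_start, cd_start, "<central directory>")]:
        if start < prev_end:
            raise zipfile.BadZipFile(f"{name!r} overlaps the previous entry")
        if start > prev_end:
            gaps.append((prev_end, start))
        prev_end = end
    return gaps


buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("a.txt", "b.txt", "c.txt"):
        zf.writestr(name, name.encode() * 10)
plain = buf.getvalue()
stub = b"leading bytes that are not ZIP entries"

gaps_plain = find_gaps(plain)
gaps_prefixed = find_gaps(stub + plain)
```

A freshly written archive has no gaps, while the prefixed one reports a single unreferenced range covering the stub. Deciding whether such a range is a run of stale entries or unrelated data that must be preserved is where the rules above come in.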

Check the doc in the source code of _ZipRepacker.repack() (which is internally called by ZipFile.repack()) for more details.

Supported Modes

There have been opinions that repacking should support modes 'w' and 'x' (e.g. #51067 (comment)).

This is NOT introduced, since those modes do not truncate the file at the end of writing and would not actually shrink the file after a removal. Although we could change this behavior of the existing API, further care would be needed, because modes 'w' and 'x' may be used on an unseekable file and would be broken by such a change. OTOH, mode 'a' is not expected to work with an unseekable file, since an initial seek is made immediately when it is opened.
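The difference between the modes on an unseekable file can be reproduced with current zipfile behavior. WriteOnlyStream is a hypothetical stand-in for something like a pipe or socket; the exact exception type raised in mode 'a' is an implementation detail, so the sketch catches it broadly.

```python
import io
import zipfile


class WriteOnlyStream(io.RawIOBase):
    """Hypothetical unseekable sink that just collects written bytes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def seekable(self):
        return False

    def write(self, b):
        self.chunks.append(bytes(b))
        return len(b)


# Mode 'w' works: zipfile falls back to streaming with data descriptors.
sink = WriteOnlyStream()
with zipfile.ZipFile(sink, "w") as zf:
    zf.writestr("a.txt", b"hello")
written = b"".join(sink.chunks)
with zipfile.ZipFile(io.BytesIO(written)) as zf:
    roundtrip = zf.read("a.txt")

# Mode 'a' needs to seek right away and fails on the same kind of stream.
try:
    zipfile.ZipFile(WriteOnlyStream(), "a")
except (OSError, zipfile.BadZipFile):
    append_failed = True
else:
    append_failed = False
```

Since mode 'w' can stream to such a file, any repacking logic that seeks back and truncates would need a separate code path for it, whereas mode 'a' can already rely on seeking.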



📚 Documentation preview 📚: https://cpython-previews--134627.org.readthedocs.build/
