gh-88500: Reduce memory use of urllib.unquote #96763

Merged: gpshead merged 5 commits into python:main from gpshead:gh/88500/unquote_mem_use on Dec 11, 2022
Conversation

gpshead (Member) commented on Sep 12, 2022

`urllib.unquote_to_bytes` and `urllib.unquote` could both generate O(len(string)) intermediate `bytes` or `str` objects while computing the unquoted final result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM.

This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()`-style operations.
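The single-buffer idea can be sketched roughly as follows. This is a simplified illustration of the technique, not the actual CPython implementation; the function name and structure are hypothetical:

```python
# Sketch: percent-decode by appending into one expanding bytearray,
# instead of materializing a list of split()-produced fragments.
# Hypothetical helper; not the real urllib.parse code.
_HEXDIG = b"0123456789abcdefABCDEF"

def unquote_to_bytes_sketch(string: str) -> bytes:
    data = string.encode("utf-8")
    if b"%" not in data:
        return data  # fast path: nothing to decode
    result = bytearray()  # one growing buffer; no per-fragment objects
    i, n = 0, len(data)
    while i < n:
        byte = data[i]
        if (byte == 0x25 and i + 2 < n  # 0x25 == ord('%')
                and data[i + 1] in _HEXDIG and data[i + 2] in _HEXDIG):
            # Valid %XX escape: append the decoded byte directly.
            result.append(int(data[i + 1:i + 3], 16))
            i += 3
        else:
            result.append(byte)  # literal byte, including a malformed '%'
            i += 1
    return bytes(result)
```

Because every decoded byte lands in the same `bytearray`, the peak extra memory is one buffer plus constant-size temporaries, rather than O(len(string)) intermediate objects.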

Microbenchmarks with antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or lack much unicode or % escaping. But the functions are already quite fast anyway, so this is not a big deal. The slowdown scales linearly with input size, as expected.
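A microbenchmark along these lines can be reproduced as follows (a sketch; the repetition count and exact `timeit` invocation are assumptions, only the `mess` input comes from the comment above):

```python
import timeit
from urllib.parse import unquote, unquote_to_bytes

# Antagonistic input from the description: mixed unicode and % escapes,
# including literal '%' runs and a non-UTF-8 escape (%fe).
mess = "\u0141%%%20a%fe" * 1000

for func in (unquote, unquote_to_bytes):
    t = timeit.timeit(lambda: func(mess), number=200)
    print(f"{func.__name__}: {t:.3f}s for 200 calls")
```

Running this against the interpreter before and after the change is how a 10-20% per-call difference would show up.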

Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.

Observed memory usage is roughly half for `unquote()` and less than a third for `unquote_to_bytes()`, using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.
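As a scriptable complement to eyeballing `/usr/bin/time -v` output, `tracemalloc` can report peak Python-level allocations during a single call (a sketch; note it traces allocator activity, not resident set size, so the numbers are not directly comparable to RSS):

```python
import tracemalloc
from urllib.parse import unquote_to_bytes

# The test input quoted above.
v = "\u0141%01\u0161%20" * 500_000

tracemalloc.start()
unquote_to_bytes(v)
_, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
tracemalloc.stop()
print(f"peak traced allocations: {peak / 2**20:.1f} MiB")
```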

Closes #88500.


gpshead added labels on Sep 16, 2022: type-feature (A feature request or enhancement), performance (Performance or resource usage), stdlib (Standard Library Python modules in the Lib/ directory)
gpshead marked this pull request as ready for review on September 16, 2022 08:28
gpshead commented on Oct 1, 2022

Any thoughts from reviewers?

