gh-88500: Reduce memory use of urllib.unquote #96763
`urllib.unquote_to_bytes` and `urllib.unquote` could both generate `O(len(string))` intermediate `bytes` or `str` objects while computing the unquoted result, depending on the input provided. As Python objects are relatively large, this could consume a lot of RAM. This switches the implementation to using an expanding `bytearray` and a generator internally instead of precomputed `split()`-style operations.
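For illustration, here is a minimal sketch contrasting the two strategies. The helper names (`unquote_split_style`, `unquote_bytearray_style`, `_decode_pair`) are hypothetical and the escape handling is simplified; this is not the actual patch, which also streams the `%`-separated chunks through a generator rather than materializing the full `split()` list.

```python
# Hypothetical sketch contrasting the two decoding strategies; not the
# actual CPython patch.
import string

_HEXDIG = frozenset(string.hexdigits.encode('ascii'))

def _decode_pair(pair: bytes) -> int | None:
    # Return the byte value of a two-hex-digit pair, or None if invalid.
    if len(pair) == 2 and pair[0] in _HEXDIG and pair[1] in _HEXDIG:
        return int(pair, 16)
    return None

def unquote_split_style(s: bytes) -> bytes:
    # Old approach: split() materializes one bytes chunk per '%', so an
    # input with N escapes holds O(N) intermediate bytes objects at once.
    bits = s.split(b'%')
    res = [bits[0]]
    for item in bits[1:]:
        byte = _decode_pair(item[:2])
        if byte is not None:
            res.append(bytes([byte]) + item[2:])  # yet another temporary
        else:
            res.append(b'%' + item)  # invalid escape kept literally
    return b''.join(res)

def unquote_bytearray_style(s: bytes) -> bytes:
    # New approach: append into a single growing bytearray, so no
    # per-escape temporaries accumulate.
    out = bytearray()
    i, n = 0, len(s)
    while i < n:
        if s[i] == 0x25:  # ord('%')
            byte = _decode_pair(s[i + 1:i + 3])
            if byte is not None:
                out.append(byte)
                i += 3
                continue
        out.append(s[i])
        i += 1
    return bytes(out)

assert (unquote_split_style(b'a%20b%%zz')
        == unquote_bytearray_style(b'a%20b%%zz')
        == b'a b%%zz')
```

Both produce the same output; the difference is only in how many intermediate objects are alive at peak.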
Microbenchmarks with some antagonistic inputs like `mess = "\u0141%%%20a%fe"*1000` show this is 10-20% slower for `unquote` and `unquote_to_bytes`, and no different for typical inputs that are short or contain little Unicode or % escaping. The functions are already quite fast, so this is not a big deal; the slowdown scales linearly with input size, as expected.

Memory usage was observed manually using `/usr/bin/time -v` on `python -m timeit` runs of larger inputs. Unit-testing memory consumption is difficult and does not seem worthwhile.

Observed memory usage is ~1/2 for `unquote()` and <1/3 for `unquote_to_bytes()` using `python -m timeit -s 'from urllib.parse import unquote, unquote_to_bytes; v="\u0141%01\u0161%20"*500_000' 'unquote_to_bytes(v)'` as a test.

Closes #88500.
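Not part of the PR's methodology (the numbers above came from `/usr/bin/time -v`), but as a rough in-process cross-check, `tracemalloc` can report peak Python-level allocations for the same input. It only counts allocations made through Python's allocator, so absolute values won't match `/usr/bin/time -v`, but before/after ratios should be comparable:

```python
# Rough in-process check of peak allocations during unquoting; an
# alternative sketch, not the measurement method used for the PR.
import tracemalloc
from urllib.parse import unquote_to_bytes

v = "\u0141%01\u0161%20" * 500_000  # same antagonistic input as the timeit run

tracemalloc.start()
unquote_to_bytes(v)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak traced allocations: {peak / 1_000_000:.1f} MB")
```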
any thoughts from reviewers?