gh-127563: fix UBSan failure in dictobject.c#127568
Conversation
Objects/dictobject.c
Outdated
| PyDictUnicodeEntry *newentries = DK_UNICODE_ENTRIES(newkeys); | ||
| if (oldkeys->dk_nentries == numentries && mp->ma_keys->dk_kind == DICT_KEYS_UNICODE) { | ||
| memcpy(newentries, oldentries, numentries * sizeof(PyDictUnicodeEntry)); | ||
| memcpy(newentries, (const void *)oldentries, numentries * sizeof(PyDictUnicodeEntry)); |
There was a problem hiding this comment.
I don't understand this. The only change is the explicit cast to const void *?
There was a problem hiding this comment.
IIUC this would silence the UBSan misaligned warning, since by definition a void* can be read from any location.
But IMHO this is dangerous, because the UBSan warning given in the issue states that a PyDictUnicodeEntry * is going to be read from 0x5f7d064233d1, with an alignment of 8 (so this seems to stem from a 64 bit build).
Since this results from _DK_ENTRIES, this can only result from dk_log2_index_bytes=0 due to
size_t index = (size_t)1 << dk->dk_log2_index_bytes;
If I read the code correctly, this is only set via new_keys_object which has a protective
assert(log2_size >= PyDict_LOG_MINSIZE);
so that further down in the code dk_log2_index_bytes should never be smaller than PyDict_LOG_MINSIZE=3.
Hence, I assume this is not a debug build.
Luckily, new_keys_object has only four callers, and two of them pass constants (PyDict_LOG_MINSIZE and NEXT_LOG2_SHARED_KEYS_MAX_SIZE=6), this leaves dict_new_presized and dictresize, where the latter unfortunately has many callers :(
Maybe this helps and hopefully I am not completely mistaken ...
There was a problem hiding this comment.
Py_EMPTY_KEYS has dk_log2_index_bytes=0. But I don't know why it is passed to dictresize.
Can we get stacktrace for the UBSan error?
There was a problem hiding this comment.
- When we create an empty dict, we use
Py_EMPTY_KEYSas the keys object - When we first insert into that empty dict, we resize it (via
dictresize) - In that case
oldkeysisPy_EMPTY_KEYSandoldentriesis a misalignedPyDictUnicodeEntry * memcpyis called withnumentries=0and an invalid/misaligned pointer foroldentries.- The UBSan message is confusing: there's not really any misaligned read (it's a zero-sized memcpy), but there's still undefined behavior
- It's undefined behavior to call memcpy with an invalid pointer, even if
countis zero. - It's also undefined behavior to create the misaligned pointer, even if we don't ever dereference. In other words, calling
DK_UNICODE_ENTRIES()onPy_EMPTY_KEYSis UB.
There was a problem hiding this comment.
When we first insert into that empty dict, we resize it (via dictresize)
insert_to_emptydict() is optimized for inserting into dict using Py_EMPTY_KEYS.
stacktrace can help to find where insert_to_emptydict() can be used.
Anyway, dk_log2_index_bytes in Py_EMPTY_DICT can be 3 instead of 0.
There was a problem hiding this comment.
Sorry to be pesky, I knew this from C++
But for https://en.cppreference.com/w/c/language/pointer I do not see this or anything similar.
There was a problem hiding this comment.
See the C11 standard draft, 6.5.6 bullet points 7 and 8. There are probably other references as well.
There was a problem hiding this comment.
I think this is documented in Section 6.5.9.7 from the C20 (emphasis mine):
Two pointers compare equal if and only if [...], both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
There was a problem hiding this comment.
There are also some examples in https://en.cppreference.com/w/c/language/struct involving structs with flexible array members. The examples are probably not authoritative, but I tend to trust the site:
struct s { int n; double d[]; }; // s.d is a flexible array member
double *dp;
...
s2 = malloc(sizeof (struct s) + 6); // same, but UB to access because 2 bytes are
// missing to complete 1 double
dp = &(s2->d[0]); // OK, can take address just fine
*dp = 42; // undefined behavior
Note that constructing dp is okay even if evaluating *dp is UB.
There was a problem hiding this comment.
Yeah - flexible arrays to the rescue. Luckily, we have it here in _dictkeysobject, which is internal, since in C++ they are not (yet) allowed :)
3a3fef4 to
f164f24
Compare
|
I locally tested that Clang is happy by adding |
|
How about this? |
|
I'll merge to fix the UB reports. |
|
GH-127798 is a backport of this pull request to the 3.13 branch. |
|
GH-127813 is a backport of this pull request to the 3.12 branch. |
…-127568) This fixes a UBSan failure (unaligned zero-size memcpy) in `dictobject.c`.
…ythonGH-127568) (pythonGH-127813) This fixes a UBSan failure (unaligned zero-size memcpy) in `dictobject.c`. (cherry picked from commit 9af96f4) (cherry picked from commit 320a1dc) Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
…) (GH-127813) (#135463) [3.12] gh-127563: use `dk_log2_index_bytes=3` in empty dicts (GH-127568) (GH-127813) This fixes a UBSan failure (unaligned zero-size memcpy) in `dictobject.c`. (cherry picked from commit 9af96f4) (cherry picked from commit 320a1dc) Co-authored-by: Bénédikt Tran <10796600+picnixz@users.noreply.github.com>
Objects/dictobject.c#127563