Segfault/UB from expat when re-entering the XML Parser #146169

@stestagg

Crash report

What happened?

Semi-reliable crash with this code (interestingly, Python on macOS doesn't segfault for me, but Docker/Linux aarch64 does so reliably):

from xml.parsers import expat

p = expat.ParserCreate(encoding="utf-16")

def start(name, attrs):
    p.CharacterDataHandler = lambda data: p.Parse(data, 0)

p.StartElementHandler = start

data = b"\xff\xfe<\x00a\x00>\x00x\x00"
for i in range(len(data)):
    try:
        p.Parse(data[i:i+1], i == len(data) - 1)
    except Exception:
        pass

This code /is/ doing some pretty naughty stuff, but the main problem seems to be that the handler is being set to re-enter the parser. The expat docs do say:

To state the obvious: the three parsing functions XML_Parse, XML_ParseBuffer and XML_GetBuffer must not be called from within a handler unless they operate on a separate parser instance, that is, one that did not call the handler. For example, it is OK to call the parsing functions from within an XML_ExternalEntityRefHandler, if they apply to the parser created by XML_ExternalEntityParserCreate.

and I see that the python expat parser code tracks in_callback:

int in_callback; /* Is a callback active? */

So I wonder if we can avoid the segfault by preventing Parse calls when in_callback==true?
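To illustrate the kind of guard I mean, here is a rough pure-Python sketch of the idea. GuardedParser, its raw attribute and its parse method are hypothetical names made up for this example; the real fix would presumably live in the C code and check in_callback there. The sketch simply refuses a second Parse call while one is already running on the same parser:

```python
from xml.parsers import expat


class GuardedParser:
    """Hypothetical wrapper (not part of pyexpat) that rejects re-entrant parsing."""

    def __init__(self, encoding=None):
        # The underlying expat parser; handlers are set on this object.
        self.raw = expat.ParserCreate(encoding=encoding)
        self._active = False  # True while a Parse call is in progress

    def parse(self, data, isfinal=False):
        # Handlers only run during Parse, so "already parsing" implies
        # that we were called from inside a handler of this parser.
        if self._active:
            raise RuntimeError("parser re-entered from within a handler")
        self._active = True
        try:
            return self.raw.Parse(data, isfinal)
        finally:
            self._active = False


gp = GuardedParser()
errors = []


def on_text(data):
    # Attempt the forbidden re-entrant call and record the refusal.
    try:
        gp.parse(data)
    except RuntimeError as exc:
        errors.append(str(exc))


gp.raw.CharacterDataHandler = on_text
status = gp.parse(b"<a>x</a>", True)
print(status, errors)
```

The re-entrant call is rejected before it ever reaches expat, so the outer parse can finish normally. Raising an exception from Parse itself would give users the same behaviour without a wrapper.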

There's also a secondary issue in play here: Parse() seems to call XML_SetEncoding

(void)XML_SetEncoding(self->itself, "utf-8");

without the check outlined in the expat docs:

Set the encoding to be used by the parser. It is equivalent to passing a non-NULL encoding argument to the parser creation functions. It must not be called after XML_Parse or XML_ParseBuffer have been called on the given parser. Returns XML_STATUS_OK on success or XML_STATUS_ERROR on error.

This is almost certainly not going to cause issues unless the encoding is actually changing (not that common), at which point the UB will rear its head as the internal state of the parser becomes inconsistent.
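As a possible user-side workaround, and assuming (I have not verified this for every version) that the XML_SetEncoding call is only reached when Parse() is given a str rather than bytes, one could always encode text before feeding it. feed_bytes below is a hypothetical helper, not an existing API:

```python
from xml.parsers import expat


def feed_bytes(parser, chunk, final=False, encoding="utf-8"):
    """Hypothetical helper: always hand expat bytes, never str.

    Assumption: the XML_SetEncoding call happens only on the str path,
    so passing bytes avoids changing the encoding mid-parse.
    """
    if isinstance(chunk, str):
        chunk = chunk.encode(encoding)
    return parser.Parse(chunk, final)


p = expat.ParserCreate(encoding="utf-8")
seen = []
p.CharacterDataHandler = seen.append

feed_bytes(p, "<a>")
feed_bytes(p, "h\u00e9</a>", final=True)
text = "".join(seen)
print(text)
```

The encoding passed to the helper has to match the one the parser was created with, otherwise the input is simply mis-decoded. This does nothing for the re-entrancy problem itself.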

Dockerfile reproducer
ARG REPO=https://github.com/python/cpython.git
ARG BRANCH=main

FROM ubuntu:24.04

ARG REPO
ARG BRANCH
ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    build-essential git pkg-config \
    libssl-dev libbz2-dev libreadline-dev libsqlite3-dev \
    liblzma-dev libffi-dev zlib1g-dev uuid-dev \
    && rm -rf /var/lib/apt/lists/*

RUN git clone --branch ${BRANCH} --depth 1 \
    ${REPO} /cpython

RUN cd /cpython && \
    ./configure --prefix=/python --without-ensurepip && \
    make -j$(nproc) && \
    make install

# ── TEST SCRIPT ──────────────────────────────────────────────────
RUN cat > /test.py << 'EOF'
from xml.parsers import expat

p = expat.ParserCreate(encoding="utf-16")

def start(name, attrs):
    p.CharacterDataHandler = lambda data: p.Parse(data, 0)

p.StartElementHandler = start

data = b"\xff\xfe<\x00a\x00>\x00x\x00"
for i in range(len(data)):
    try:
        p.Parse(data[i:i+1], i == len(data) - 1)
    except Exception:
        pass
EOF
# ──────────────────────────────────────────────────────────────────────────────

CMD ["/bin/sh", "-c", "uname -m && /python/bin/python3 -VV && /python/bin/python3 /test.py"]

On my PC this gives:

docker run --rm -it expattest
aarch64
Python 3.15.0a7+ (heads/main:52c0186, Mar 19 2026, 13:06:19) [GCC 13.3.0]
52c01864c4778a351e5aa3584e86ed6fd212a5a4
Segmentation fault (core dumped)

CPython versions tested on:

CPython main branch

Operating systems tested on:

macOS

Output from running 'python -VV' on the command line:

Python 3.15.0a7+ (heads/main:52c0186, Mar 19 2026, 13:06:19) [GCC 13.3.0]

Labels: extension-modules (C modules in the Modules dir), topic-XML, type-crash (A hard crash of the interpreter, possibly with a core dump)