Skip to content

Optimize fast_locals and atomic ordering#7289

Merged
youknowone merged 5 commits intoRustPython:mainfrom
youknowone:worktree-perf-step1
Mar 1, 2026
Merged

Optimize fast_locals and atomic ordering#7289
youknowone merged 5 commits intoRustPython:mainfrom
youknowone:worktree-perf-step1

Conversation

@youknowone
Copy link
Member

@youknowone youknowone commented Mar 1, 2026

Summary by CodeRabbit

Release Notes

  • Performance Improvements
    • Optimized synchronization and atomic operations for enhanced runtime efficiency
    • Implemented lock-free storage for local variables
    • Streamlined execution path when tracing is disabled

- inc/inc_by/get: SeqCst → Relaxed
- safe_inc CAS: SeqCst → Relaxed + compare_exchange_weak
- dec: SeqCst → Release + Acquire fence when count drops to 0
- leak CAS: SeqCst → AcqRel/Relaxed + compare_exchange_weak
Replace vec![self_val] + extend(args.args) with
FuncArgs::prepend_arg() to avoid a second heap allocation
on every method call.
Early return in PyCallable::invoke() when use_tracing is false,
avoiding two downcast_ref type checks on every function call.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 1, 2026

📝 Walkthrough

Walkthrough

This PR introduces lock-free frame locals storage via a new FastLocals type using UnsafeCell, replaces mutex-based access patterns with unsafe borrows across call sites, optimizes atomic operation orderings in RefCount to use Relaxed semantics, and adds a fast-path for disabled tracing in callable invocation.

Changes

Cohort / File(s) Summary
Atomic Ordering Optimizations
crates/common/src/refcount.rs
Replaces SeqCst with Relaxed ordering across RefCount methods; adds Release and Acquire/AcqRel fence in deallocation path; switches to compare_exchange_weak variants for reduced synchronization overhead.
Lock-Free Frame Locals Storage
crates/vm/src/frame.rs
Introduces new FastLocals struct wrapping UnsafeCell to replace PyMutex-based locals. Adds Send/Sync impls (with threading), borrow()/borrow_mut() methods, and Traverse impl. Updates Frame and ExecutingFrame field types accordingly.
FastLocals Access Migration
crates/vm/src/builtins/frame.rs, crates/vm/src/builtins/function.rs, crates/vm/src/builtins/super.rs
Replaces mutex .lock() calls with unsafe borrow()/borrow_mut() borrows for fastlocals access; includes safety comments indicating frame state assumptions.
Tracing Fast-Path
crates/vm/src/protocol/callable.rs
Adds short-circuit return in PyCallable::invoke when tracing is disabled, bypassing multi-branch tracing logic for direct call invocation.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Poem

🐰 Hops through memory with careful grace,
Unsafe borrows replace the locks' embrace,
Relaxed atoms dance with less delay,
FastLocals stores frames in a lock-free way,
And when tracing sleeps, the short-paths play!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Optimize fast_locals and atomic ordering' directly and concisely summarizes the primary changes: replacing mutex-based fastlocals with a lock-free FastLocals type and optimizing atomic operation orderings from SeqCst to Relaxed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

github-actions bot commented Mar 1, 2026

Code has been automatically formatted

The code in this PR has been formatted using:

  • cargo fmt --all
    Please pull the latest changes before pushing again:
git pull origin worktree-perf-step1

Eliminate per-instruction mutex lock/unlock overhead for local
variable access. FastLocals uses UnsafeCell with safety guaranteed
by the frame's state mutex and sequential same-thread execution.

Affects 14+ lock() call sites in hot instruction paths (LoadFast,
StoreFast, DeleteFast, and their paired variants).
@youknowone youknowone force-pushed the worktree-perf-step1 branch from 2a5b8ce to 958853f Compare March 1, 2026 09:26
@youknowone youknowone marked this pull request as ready for review March 1, 2026 11:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/common/src/refcount.rs (1)

124-137: ⚠️ Potential issue | 🟠 Major

Use Acquire on successful safe_inc CAS for weak→strong upgrade synchronization.

The safe_inc method is used in weak reference upgrade paths (e.g., PyWeak::upgrade at line 688 in crates/vm/src/object/core.rs). Currently, line 136 uses Ordering::Relaxed on successful CAS, which provides no synchronization guarantees.

When upgrading from weak to strong, the CAS success must acquire visibility of memory written by previous strong reference holders. The dec() method uses Ordering::Release (line 149), establishing a release-acquire synchronization point. The successful safe_inc must use Ordering::Acquire to properly pair with this Release and ensure visibility of modifications made under the previous strong reference.

Proposed patch
             match self.state.compare_exchange_weak(
                 old.as_raw(),
                 new_state.as_raw(),
-                Ordering::Relaxed,
+                Ordering::Acquire,
                 Ordering::Relaxed,
             ) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/common/src/refcount.rs` around lines 124 - 137, The CAS in safe_inc
(the compare_exchange_weak call that uses old.as_raw/new_state.as_raw) currently
uses Ordering::Relaxed for the success case; change the success ordering to
Ordering::Acquire while keeping the failure ordering relaxed, so the weak→strong
upgrade acquires the release from dec() and sees prior writes. Locate safe_inc,
the State::from_raw / add_strong usage and adjust the compare_exchange_weak
success ordering to Acquire.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@crates/vm/src/frame.rs`:
- Around line 96-133: The FastLocals UnsafeCell can be accessed concurrently via
the Python-exposed f_locals() path (which calls locals()), violating the
assumption that access is serialized by with_exec(); to fix, either (A) remove
the unsafe impls unsafe impl Send/Sync for FastLocals and add runtime assertions
to prevent cross-thread access, or (B) gate all fastlocals borrows behind the
frame execution guard/token (have f_locals() acquire the same with_exec() guard
or a borrow token before calling locals()/FastLocals::borrow/borrow_mut), or (C)
add a runtime thread-check that panics on unsynchronized cross-thread access;
update the FastLocals docs to reflect the chosen model and adjust callers
(f_locals, locals, and any direct FastLocals usage) to obtain the guard/token or
to compile without Send/Sync accordingly.

---

Outside diff comments:
In `@crates/common/src/refcount.rs`:
- Around line 124-137: The CAS in safe_inc (the compare_exchange_weak call that
uses old.as_raw/new_state.as_raw) currently uses Ordering::Relaxed for the
success case; change the success ordering to Ordering::Acquire while keeping the
failure ordering relaxed, so the weak→strong upgrade acquires the release from
dec() and sees prior writes. Locate safe_inc, the State::from_raw / add_strong
usage and adjust the compare_exchange_weak success ordering to Acquire.

ℹ️ Review info

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae8c0b3 and 516ff0f.

📒 Files selected for processing (6)
  • crates/common/src/refcount.rs
  • crates/vm/src/builtins/frame.rs
  • crates/vm/src/builtins/function.rs
  • crates/vm/src/builtins/super.rs
  • crates/vm/src/frame.rs
  • crates/vm/src/protocol/callable.rs

Comment on lines +96 to +133
/// Lock-free storage for local variables (localsplus).
///
/// # Safety
/// Access is serialized by the frame's state mutex in `with_exec()`, which
/// prevents concurrent frame execution. Trace callbacks that access `f_locals`
/// run sequentially on the same thread as instruction execution.
pub struct FastLocals {
inner: UnsafeCell<Box<[Option<PyObjectRef>]>>,
}

// SAFETY: Frame execution is serialized by the state mutex.
#[cfg(feature = "threading")]
unsafe impl Send for FastLocals {}
#[cfg(feature = "threading")]
unsafe impl Sync for FastLocals {}

impl FastLocals {
fn new(data: Box<[Option<PyObjectRef>]>) -> Self {
Self {
inner: UnsafeCell::new(data),
}
}

/// # Safety
/// Caller must ensure exclusive access (frame state locked or frame
/// not executing).
#[inline(always)]
pub unsafe fn borrow(&self) -> &[Option<PyObjectRef>] {
unsafe { &*self.inner.get() }
}

/// # Safety
/// Caller must ensure exclusive mutable access.
#[inline(always)]
#[allow(clippy::mut_from_ref)]
pub unsafe fn borrow_mut(&self) -> &mut [Option<PyObjectRef>] {
unsafe { &mut *self.inner.get() }
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "1) Enumerate all fastlocals unsafe borrow call sites"
rg -n 'fastlocals\.borrow(_mut)?\(' \
  crates/vm/src/frame.rs \
  crates/vm/src/builtins/frame.rs \
  crates/vm/src/builtins/function.rs \
  crates/vm/src/builtins/super.rs

echo
echo "2) Inspect synchronization context around main externally reachable paths"
rg -n -A25 -B8 'fn locals\(|fn clear\(|fn f_locals\(|fn with_exec\(' \
  crates/vm/src/frame.rs \
  crates/vm/src/builtins/frame.rs

echo
echo "3) Inspect owner synchronization primitives (load/store vs CAS)"
rg -n 'owner\.(load|store|compare_exchange|compare_exchange_weak)' \
  crates/vm/src/frame.rs \
  crates/vm/src/builtins/frame.rs

Repository: RustPython/RustPython

Length of output: 9197


FastLocals synchronization contract is violated by externally reachable paths; f_locals() getter permits concurrent unsynchronized access.

The doc comment claims access is "serialized by the frame's state mutex in with_exec()", but the f_locals() getter (line 456 in builtins/frame.rs) directly calls locals() which accesses fastlocals without holding that guard (line 395 in frame.rs). The comment at lines 405–406 acknowledges fastlocals is intentionally accessed outside with_exec() to avoid blocking. Since f_locals() is Python-exposed and callable from any thread, and FastLocals is marked Send + Sync, concurrent calls from multiple threads will perform unsynchronized reads/writes to the same UnsafeCell, violating Rust's aliasing guarantees.

Gate fastlocals borrows behind a guard/token tied to frame execution, or document that frames must not be accessed concurrently and remove Send + Sync (or add runtime checks that panic if accessed cross-thread).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@crates/vm/src/frame.rs` around lines 96 - 133, The FastLocals UnsafeCell can
be accessed concurrently via the Python-exposed f_locals() path (which calls
locals()), violating the assumption that access is serialized by with_exec(); to
fix, either (A) remove the unsafe impls unsafe impl Send/Sync for FastLocals and
add runtime assertions to prevent cross-thread access, or (B) gate all
fastlocals borrows behind the frame execution guard/token (have f_locals()
acquire the same with_exec() guard or a borrow token before calling
locals()/FastLocals::borrow/borrow_mut), or (C) add a runtime thread-check that
panics on unsynchronized cross-thread access; update the FastLocals docs to
reflect the chosen model and adjust callers (f_locals, locals, and any direct
FastLocals usage) to obtain the guard/token or to compile without Send/Sync
accordingly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant