
tests: replace timing race in fs-locking test with explicit Events #893

Open
pmarreck wants to merge 1 commit into liquidctl:main from pmarreck:fix-fs-locking-test-race

Conversation

pmarreck commented May 5, 2026

Summary

`tests/test_keyval.py::test_fs_backend_stores_honor_load_store_locking` uses `time.sleep(0.1)` to "ensure" the holder process has taken the lock before the writer process starts. Under any kind of CPU load, that race goes the other way and the test fails:

ps[0].start()
time.sleep(0.1)        # ← assumes ps[0] grabs the lock in <100ms
ps[1].start()
ps[1].join()
elapsed = time.monotonic() - start_time
assert elapsed >= 0.2   # ← timing proxy for "writer was blocked"
ps[0].join()
assert store.load("key") == -1

If `ps[1]` (the writer) wins the race, it stores `-1` first; then `ps[0]` (the incrementer) reads `-1`, sleeps 0.2s, and writes `0`. The final value is `0`, not `-1`:

AssertionError: assert 0 == -1
 +  where 0 = load('key')
 +    where load = <liquidctl.keyval._FilesystemBackend object at 0x...>.load

This reproduces ~100% of the time on a single-shot run when something else on the host is using CPU. Picked up downstream as a hard build failure on NixOS (liquidctl-1.15.0 failing in nixpkgs build).

Fix

Replace the `time.sleep` with two `multiprocessing.Event` objects that explicitly synchronize the two processes:

  • `holding_lock`: signaled by the holder inside the `load_store` closure, i.e. once the lock has actually been acquired.
  • `may_release`: awaited by the holder inside the closure, so the test can keep the lock contended for as long as it likes.

Sequence is now deterministic:

  1. Start holder.
  2. Wait for holding_lock (with timeout).
  3. Holder is guaranteed to be inside the critical section. Start writer.
  4. Signal may_release.
  5. Both processes finish; assert final value is -1.

Also dropped the `assert elapsed >= 0.2` wall-clock check (it was a proxy for "the writer was blocked," which is now enforced directly by the event handshake), and added `exitcode == 0` checks for both children so unrelated child-side failures don't get silently swallowed.
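
For illustration, here is a minimal, self-contained sketch of the handshake shape. It is not the literal diff: the `holder`/`writer` bodies below are stand-ins for the test's real child functions, which do their signaling from inside the backend's `load_store` closure while the file lock is held.

```python
# Illustrative sketch of the event handshake, not the actual patch; the real
# holder signals while it holds the keyval backend's file lock.
import multiprocessing

def holder(holding_lock, may_release):
    # ... acquire the lock / enter the load_store closure here ...
    holding_lock.set()           # tell the parent the lock is now held
    may_release.wait(timeout=5)  # keep the lock contended until released
    # ... write the incremented value and release the lock ...

def writer():
    # ... in the real test: block on the same lock, then store -1 ...
    pass

if __name__ == "__main__":
    holding_lock = multiprocessing.Event()
    may_release = multiprocessing.Event()

    ps = [
        multiprocessing.Process(target=holder, args=(holding_lock, may_release)),
        multiprocessing.Process(target=writer),
    ]
    ps[0].start()
    assert holding_lock.wait(timeout=5)  # holder is inside the critical section
    ps[1].start()                        # writer must now queue behind the lock
    may_release.set()                    # let the holder finish
    for p in ps:
        p.join()
        assert p.exitcode == 0           # surface child-side failures
    # final assertion in the real test: store.load("key") == -1
```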

Verification

| Scenario | Before | After |
| --- | --- | --- |
| Single run, idle machine | passes (usually) | passes |
| 5 sequential runs, idle machine | 4–5 / 5 | 5 / 5 |
| 10 sequential runs under `8 × yes > /dev/null` | 0 / 10 | 10 / 10 |
| Full `tests/test_keyval.py` runtime | 4.10s | 1.72s |

The runtime drop is a free side effect of removing the `time.sleep(0.2)` inside the critical section; it is no longer needed once the test doesn't depend on widening the race window.

Out of scope (happy to follow up)

The two sibling tests in this file follow the same pattern and would benefit from the same treatment:

  • `test_fs_backend_loads_honor_load_store_locking`: uses `time.sleep(0.1)`, and additionally doesn't check child exit codes, so it can silently pass even when its assertion fires inside the child process.
  • `test_fs_backend_load_store_is_atomic`: has an `assert elapsed >= 0.2 * len(ps)` wall-clock assertion.

Kept this PR focused on the one test that's actively breaking downstream builds. If you'd like the others fixed in the same vein, happy to send a follow-up.

Test plan

  • Target test passes 5× sequentially on idle host
  • Target test passes 10× sequentially under 8-core CPU pressure
  • Full tests/test_keyval.py passes (12/12)
  • CI matrix (Linux/macOS/Windows × Python 3.10–3.14)

Commit message

`test_fs_backend_stores_honor_load_store_locking` used `time.sleep(0.1)`
to ensure the holder process took the lock before the writer started.
Under load (busy CI, distro build farms — observed reliably on NixOS
nixpkgs builds), the writer races ahead of the holder, the lock is
never contended, and the final assertion fires:

    assert store.load("key") == -1
    AssertionError: assert 0 == -1

Replace the sleep with two `multiprocessing.Event`s:

  - `holding_lock` is set inside the load_store closure once the lock
    has actually been acquired.
  - `may_release` blocks the holder inside the closure until the test
    explicitly signals it, so the writer is guaranteed to be queued
    behind the lock before the holder lets go.

Also drop the `assert elapsed >= 0.2` wall-clock check, which was a
proxy for "the writer was blocked" — now enforced directly by the
event handshake — and add `exitcode == 0` checks for both children so
unrelated child-side failures stop being silently swallowed.

Test now passes 5/5 on an idle machine and 10/10 under 8 concurrent
`yes` processes pinning all cores. File runtime drops from 4.10s to
1.72s as a side effect (no more 0.2s in-critical sleeps).

The sibling tests in this file (test_fs_backend_loads_honor_load_store_locking,
test_fs_backend_load_store_is_atomic) follow the same time.sleep(0.1)
pattern and would benefit from the same treatment, but are out of scope
for this focused fix. Happy to follow up if of interest.