
tests: replace timing race in fs-locking test with explicit Events #893

Open
pmarreck wants to merge 1 commit into liquidctl:main from pmarreck:fix-fs-locking-test-race

Conversation

pmarreck commented May 5, 2026

Summary

`tests/test_keyval.py::test_fs_backend_stores_honor_load_store_locking` uses `time.sleep(0.1)` to "ensure" the holder process has taken the lock before the writer process starts. Under any kind of CPU load, that race goes the other way and the test fails:

ps[0].start()
time.sleep(0.1)        # ← assumes ps[0] grabs the lock in <100ms
ps[1].start()
ps[1].join()
elapsed = time.monotonic() - start_time
assert elapsed >= 0.2   # ← timing proxy for "writer was blocked"
ps[0].join()
assert store.load("key") == -1

If `ps[1]` (the writer) wins the race, it stores `-1` first; then `ps[0]` (the incrementer) reads `-1`, sleeps 0.2s, and writes `0`. The final value is `0`, not `-1`:

AssertionError: assert 0 == -1
 +  where 0 = load('key')
 +    where load = <liquidctl.keyval._FilesystemBackend object at 0x...>.load

This reproduces ~100% of the time on a single-shot run when something else on the host is using CPU. Picked up downstream as a hard build failure on NixOS (liquidctl-1.15.0 failing in nixpkgs build).

Fix

Replace the `time.sleep` with two `multiprocessing.Event` objects that explicitly synchronize the two processes:

  • `holding_lock`: signaled by the holder inside the `load_store` closure, i.e. once the lock has actually been acquired.
  • `may_release`: awaited by the holder inside the closure, so the test can keep the lock contended for as long as it likes.

Sequence is now deterministic:

  1. Start holder.
  2. Wait for holding_lock (with timeout).
  3. Holder is guaranteed to be inside the critical section. Start writer.
  4. Signal may_release.
  5. Both processes finish; assert final value is -1.

Also dropped the `assert elapsed >= 0.2` wall-clock check (it was a proxy for "the writer was blocked," which is now enforced directly by the event handshake), and added `exitcode == 0` checks for both children so unrelated child-side failures don't get silently swallowed.
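
For illustration, here is a minimal, self-contained sketch of the handshake shape. It is not the literal diff: the `holder`/`writer` bodies below are stand-ins for the test's real child functions, which do their signaling from inside the backend's `load_store` closure while the file lock is held.

```python
# Illustrative sketch of the event handshake, not the actual patch; the real
# holder signals while it holds the keyval backend's file lock.
import multiprocessing

def holder(holding_lock, may_release):
    # ... acquire the lock / enter the load_store closure here ...
    holding_lock.set()           # tell the parent the lock is now held
    may_release.wait(timeout=5)  # keep the lock contended until released
    # ... write the incremented value and release the lock ...

def writer():
    # ... in the real test: block on the same lock, then store -1 ...
    pass

if __name__ == "__main__":
    holding_lock = multiprocessing.Event()
    may_release = multiprocessing.Event()

    ps = [
        multiprocessing.Process(target=holder, args=(holding_lock, may_release)),
        multiprocessing.Process(target=writer),
    ]
    ps[0].start()
    assert holding_lock.wait(timeout=5)  # holder is inside the critical section
    ps[1].start()                        # writer must now queue behind the lock
    may_release.set()                    # let the holder finish
    for p in ps:
        p.join()
        assert p.exitcode == 0           # surface child-side failures
    # final assertion in the real test: store.load("key") == -1
```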

Verification

| Scenario | Before | After |
| --- | --- | --- |
| Single run, idle machine | passes (usually) | passes |
| 5 sequential runs, idle machine | 4–5 / 5 | 5 / 5 |
| 10 sequential runs under `8 × yes > /dev/null` | 0 / 10 | 10 / 10 |
| Full `tests/test_keyval.py` runtime | 4.10s | 1.72s |

The runtime drop is a free side effect of removing the `time.sleep(0.2)` inside the critical section; it is no longer needed once the test doesn't depend on widening the race window.

Out of scope (happy to follow up)

The two sibling tests in this file follow the same pattern and would benefit from the same treatment:

  • `test_fs_backend_loads_honor_load_store_locking`: uses `time.sleep(0.1)`, and additionally doesn't check child exit codes, so it can silently pass even when its assertion fires inside the child process.
  • `test_fs_backend_load_store_is_atomic`: has an `assert elapsed >= 0.2 * len(ps)` wall-clock assertion.

Kept this PR focused on the one test that's actively breaking downstream builds. If you'd like the others fixed in the same vein, happy to send a follow-up.

Test plan

  • Target test passes 5× sequentially on idle host
  • Target test passes 10× sequentially under 8-core CPU pressure
  • Full tests/test_keyval.py passes (12/12)
  • CI matrix (Linux/macOS/Windows × Python 3.10–3.14)

Commit message

`test_fs_backend_stores_honor_load_store_locking` used `time.sleep(0.1)`
to ensure the holder process took the lock before the writer started.
Under load (busy CI, distro build farms — observed reliably on NixOS
nixpkgs builds), the writer races ahead of the holder, the lock is
never contended, and the final assertion fires:

    assert store.load("key") == -1
    AssertionError: assert 0 == -1

Replace the sleep with two `multiprocessing.Event`s:

  - `holding_lock` is set inside the load_store closure once the lock
    has actually been acquired.
  - `may_release` blocks the holder inside the closure until the test
    explicitly signals it, so the writer is guaranteed to be queued
    behind the lock before the holder lets go.

Also drop the `assert elapsed >= 0.2` wall-clock check, which was a
proxy for "the writer was blocked" — now enforced directly by the
event handshake — and add `exitcode == 0` checks for both children so
unrelated child-side failures stop being silently swallowed.

Test now passes 5/5 on an idle machine and 10/10 under 8 concurrent
`yes` processes pinning all cores. File runtime drops from 4.10s to
1.72s as a side effect (no more 0.2s in-critical sleeps).

The sibling tests in this file (test_fs_backend_loads_honor_load_store_locking,
test_fs_backend_load_store_is_atomic) follow the same time.sleep(0.1)
pattern and would benefit from the same treatment, but are out of scope
for this focused fix. Happy to follow up if of interest.