tests: replace timing race in fs-locking test with explicit Events#893
Open
pmarreck wants to merge 1 commit into
Open
tests: replace timing race in fs-locking test with explicit Events#893pmarreck wants to merge 1 commit into
pmarreck wants to merge 1 commit into
Conversation
`test_fs_backend_stores_honor_load_store_locking` used `time.sleep(0.1)`
to ensure the holder process took the lock before the writer started.
Under load (busy CI, distro build farms — observed reliably on NixOS
nixpkgs builds), the writer races ahead of the holder, the lock is
never contended, and the final assertion fires:
assert store.load("key") == -1
AssertionError: assert 0 == -1
Replace the sleep with two `multiprocessing.Event`s:
- `holding_lock` is set inside the load_store closure once the lock
has actually been acquired.
- `may_release` blocks the holder inside the closure until the test
explicitly signals it, so the writer is guaranteed to be queued
behind the lock before the holder lets go.
Also drop the `assert elapsed >= 0.2` wall-clock check, which was a
proxy for "the writer was blocked" — now enforced directly by the
event handshake — and add `exitcode == 0` checks for both children so
unrelated child-side failures stop being silently swallowed.
Test now passes 5/5 on an idle machine and 10/10 under 8 concurrent
`yes` processes pinning all cores. File runtime drops from 4.10s to
1.72s as a side effect (no more 0.2s in-critical sleeps).
The sibling tests in this file (test_fs_backend_loads_honor_load_store_locking,
test_fs_backend_load_store_is_atomic) follow the same time.sleep(0.1)
pattern and would benefit from the same treatment, but are out of scope
for this focused fix. Happy to follow up if of interest.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tests/test_keyval.py::test_fs_backend_stores_honor_load_store_lockingusestime.sleep(0.1)to "ensure" the holder process has taken the lock before the writer process starts. Under any kind of CPU load, that race goes the other way and the test fails:If
ps[1](the writer) wins the race, it stores-1first; thenps[0](the incrementer) reads-1, sleeps 0.2s, writes0. Final value is0, not-1:This reproduces ~100% of the time on a single-shot run when something else on the host is using CPU. Picked up downstream as a hard build failure on NixOS (
liquidctl-1.15.0failing in nixpkgs build).Fix
Replace the
time.sleepwith twomultiprocessing.Eventobjects that explicitly synchronize the two processes:holding_lock: signaled by the holder inside theload_storeclosure, i.e. once the lock has been acquired.may_release: awaited by the holder inside the closure, so the test can keep the lock contended for as long as it likes.Sequence is now deterministic:
holding_lock(with timeout).may_release.-1.Also dropped the
assert elapsed >= 0.2wall-clock check (it was a proxy for "the writer was blocked," which is now enforced directly by the event handshake), and addedexitcode == 0checks for both children so unrelated child-side failures don't get silently swallowed.Verification
yes > /dev/nulltests/test_keyval.pyruntimeThe runtime drop is a free side effect of removing the
time.sleep(0.2)inside the critical section — no longer needed once the test doesn't depend on widening the race window.Out of scope (happy to follow up)
The two sibling tests in this file follow the same pattern and would benefit from the same treatment:
test_fs_backend_loads_honor_load_store_locking— usestime.sleep(0.1), and additionally doesn't check child exit codes, so it can silently pass even when its assertion fires inside the child process.test_fs_backend_load_store_is_atomic— has anassert elapsed >= 0.2 * len(ps)wall-clock assertion.Kept this PR focused on the one test that's actively breaking downstream builds. If you'd like the others fixed in the same vein, happy to send a follow-up.
Test plan
tests/test_keyval.pypasses (12/12)