fix: live pre-merge build-status revalidation (BERTE-602) by charlesprost · Pull Request #268 · scality/bert-e

charlesprost · 2026-05-20T17:04:59Z

Summary

Adds revalidate_build_status(), called immediately before the merge/queue decision in _handle_pull_request. It bypasses BUILD_STATUS_CACHE and fetches live state from the GitHub Actions API, filtering workflow runs strictly to the current w-branch before aggregating.
Adds AggregatedWorkflowRuns.state_for_branch(name): filters and deduplicates within a single branch, preventing a sibling branch sharing the same SHA from producing a false SUCCESSFUL.
Regression tests explicitly replaying the artesca#5155 incident shape (renamed source branch, same SHA on old and new w-branches, old completed/new in-progress → must raise BuildInProgress).

Root cause (artesca#5155, 2026-05-20)

The source branch was renamed; the new w-branch shared its commit SHA with the old w-branch. GitHub's API returned workflow runs for both names. AggregatedWorkflowRuns.state picked the best status across branches, returning SUCCESSFUL because the old w-branch completed. That result was latched into BUILD_STATUS_CACHE (which never overwrites SUCCESSFUL). When the peer approval landed 15 s later, Bert-E read SUCCESSFUL from cache and merged — while the actual push CI on the new w-branch was still running.

What this PR does

bert_e/git_host/github/__init__.py

state_for_branch(branch_name): filters _workflow_runs to branch_name before deduplicating by workflow_id, so a sibling branch cannot mask the result. Logs the filtered run snapshot (branch + runs + conclusion) for post-mortem diagnostics.

bert_e/workflow/gitwaterflow/__init__.py

revalidate_build_status(job, wbranches): live, cache-bypassing final check at the merge decision boundary. Calls get_commit_status() directly (always hits the API), uses state_for_branch for GitHub Actions keys, skips gracefully for hosts without get_commit_status (e.g. Bitbucket).
Called in _handle_pull_request after the interactive confirm and before build_queue_collection.

Test plan

test_artesca5155_renamed_branch_inprogress_blocked — exact incident shape, asserts BuildInProgress
test_artesca5155_old_aggregator_would_have_passed — documents the pre-fix bug: AggregatedWorkflowRuns.state returns SUCCESSFUL for the same data
test_pr5155_shape_raises_build_in_progress — same SHA on w-branch (in_progress) and destination (success)
Full matrix: bypass/no-key/no-method → skipped; clean PR → passes; failed/no-runs → correct exception
state_for_branch unit tests: cross-branch poisoning, dedup within branch, workflow_dispatch exclusion, no-runs case

🤖 Generated with Claude Code

Documents pytest binary location, GitHub release process, and the devdocs release checklist reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

codecov · 2026-05-20T17:24:03Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.51%. Comparing base (24545a4) to head (93143c7).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #268      +/-   ##
==========================================
+ Coverage   89.28%   89.51%   +0.22%     
==========================================
  Files          77       78       +1     
  Lines       10287    10493     +206     
==========================================
+ Hits         9185     9393     +208     
+ Misses       1102     1100       -2

Flag	Coverage Δ
integration	`87.19% <23.80%> (-0.33%)`	⬇️
tests	`87.15% <23.80%> (-0.33%)`	⬇️
tests-BuildFailedTest	`26.68% <4.76%> (-0.12%)`	⬇️
tests-QuickTest	`34.23% <4.76%> (-0.15%)`	⬇️
tests-RepositoryTests	`26.34% <4.76%> (-0.11%)`	⬇️
tests-TaskQueueTests	`51.46% <11.90%> (-0.21%)`	⬇️
tests-TestBertE	`65.38% <23.80%> (-0.22%)`	⬇️
tests-TestQueueing	`53.69% <11.90%> (-0.22%)`	⬇️
tests-api-mock	`15.51% <0.48%> (-0.31%)`	⬇️
tests-noqueue	`77.35% <23.80%> (-0.28%)`	⬇️
tests-noqueue-BuildFailedTest	`26.68% <4.76%> (-0.12%)`	⬇️
tests-noqueue-QuickTest	`34.23% <4.76%> (-0.15%)`	⬇️
tests-noqueue-RepositoryTests	`26.34% <4.76%> (-0.11%)`	⬇️
tests-noqueue-TaskQueueTests	`51.46% <11.90%> (-0.21%)`	⬇️
tests-noqueue-TestBertE	`61.75% <23.80%> (-0.20%)`	⬇️
tests-noqueue-TestQueueing	`26.37% <4.76%> (-0.11%)`	⬇️
tests-server	`28.57% <0.97%> (-0.56%)`	⬇️
unittests	`42.80% <99.51%> (+1.16%)`	⬆️
utests	`27.38% <99.51%> (+1.47%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Introduces revalidate_build_status(), called immediately before the merge/queue decision in _handle_pull_request. Unlike check_build_status, it bypasses BUILD_STATUS_CACHE and fetches live state from the GitHub Actions API, then filters workflow runs strictly to each wbranch's own head_branch before aggregating. Root cause (PR artesca#5155): a push run on the PR branch was still in_progress when bert-e polled. The run shared a SHA with development/4, which had already completed successfully. AggregatedWorkflowRuns.state returned SUCCESSFUL because it grouped by branch then returned SUCCESSFUL if *any* branch succeeded, poisoning the cache entry and hiding the in-progress run on the w-branch. New AggregatedWorkflowRuns.state_for_branch(name) filters and deduplicates within a single branch before calling branch_state(), preventing cross- branch SHA poisoning. revalidate_build_status uses this method. Hosts without get_commit_status (e.g. Bitbucket) are skipped gracefully. Tests added in test_github_build_status.py document the poisoning scenario and verify the per-branch fix. test_revalidate_build_status.py covers the full decision matrix: PR 5155 shape → BuildInProgress, clean PR → passes, bypass/no-key/no-method → skipped, failed/no-runs → appropriate exception. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

matthiasL-scality

Code review — non-blocking observations

matthiasL-scality · 2026-05-20T18:08:27Z

+    )
+
+
+def _make_run(*, run_id, sha, branch, status, conclusion,


Duplicate helper — _make_run is defined identically in test_github_build_status.py:232. Consider moving it to a shared conftest.py so both test modules can import it. Not a blocker, but avoids drift if the run schema changes.

matthiasL-scality · 2026-05-20T18:08:27Z

+        that shares the same SHA cannot cause a false SUCCESSFUL result.
+        Unlike .state, this never returns SUCCESSFUL for a different branch.
+        """
+        conclusion_ranking = {


Duplicate dedup logic — conclusion_ranking and the best-by-workflow_id loop replicate the same logic already in remove_unwanted_workflows (lines 660–671). Consider extracting a private _best_runs_for(runs) helper that both methods can call. Not a blocker.

matthiasL-scality · 2026-05-20T18:08:27Z

    assert worst_status == 'SUCCESSFUL'
+
+
+def revalidate_build_status(job, wbranches):


Non-blocker — long-term fix should be tracked as an improvement.

This function is a sound safety net for the immediate incident. However, the root cause is that BUILD_STATUS_CACHE is keyed only by SHA (BUILD_STATUS_CACHE[build_key][sha]), so two branches sharing the same SHA compete for the same cache slot. The "never overwrite SUCCESSFUL" guard then permanently locks the slot once any branch on that SHA succeeds.

The durable fix would be to include the branch name in the cache key:

# today BUILD_STATUS_CACHE[build_key][sha] # proposed BUILD_STATUS_CACHE[build_key][(sha, branch_name)]

Each w-branch would then have its own independent cache slot; a renamed branch can never inherit a SUCCESSFUL result from its predecessor. This would also allow removing revalidate_build_status (or reducing it to a lightweight assertion) once the cache is correct.

This is not a blocker for this PR. Please open a follow-up Jira improvement ticket to track the cache key refactor.

Add CLAUDE.md with project-specific tooling notes

9c39c38

Documents pytest binary location, GitHub release process, and the devdocs release checklist reference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

charlesprost requested a review from a team as a code owner May 20, 2026 17:05

charlesprost requested a review from matthiasL-scality May 20, 2026 17:05

charlesprost force-pushed the fix/BERTE-602-live-build-status-revalidation branch from cd4b948 to 61b5e95 Compare May 20, 2026 17:13

charlesprost force-pushed the fix/BERTE-602-live-build-status-revalidation branch from 61b5e95 to 93143c7 Compare May 20, 2026 17:34

matthiasL-scality reviewed May 20, 2026

View reviewed changes

matthiasL-scality approved these changes May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: live pre-merge build-status revalidation (BERTE-602)#268

fix: live pre-merge build-status revalidation (BERTE-602)#268
charlesprost wants to merge 2 commits into
mainfrom
fix/BERTE-602-live-build-status-revalidation

charlesprost commented May 20, 2026

Uh oh!

codecov Bot commented May 20, 2026 •

edited

Loading

Uh oh!

matthiasL-scality left a comment

Uh oh!

matthiasL-scality May 20, 2026

Uh oh!

matthiasL-scality May 20, 2026

Uh oh!

matthiasL-scality May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		assert worst_status == 'SUCCESSFUL'


		def revalidate_build_status(job, wbranches):

Conversation

charlesprost commented May 20, 2026

Summary

Root cause (artesca#5155, 2026-05-20)

What this PR does

Test plan

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

matthiasL-scality left a comment

Choose a reason for hiding this comment

Uh oh!

matthiasL-scality May 20, 2026

Choose a reason for hiding this comment

Uh oh!

matthiasL-scality May 20, 2026

Choose a reason for hiding this comment

Uh oh!

matthiasL-scality May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 20, 2026 •

edited

Loading