Hydra tutorial series — Part 6: Troubleshooting & escalation

What to do when the pipeline gets stuck. The menu of choices between development merge, label reset, retry:queued, and rebuild:queued — ordered from cheap to expensive. Last of six short modules.

In part 5 the pipeline was running. Now the final part: what do you do when it doesn't go green? This part gives you the decision tree — ordered from cheapest intervention to most expensive — and the patterns that get needs-input issues moving again.

What does needs-input mean?

needs-input is not a failure state in the sense of "Hydra has failed". It's an explicit escalation: Hydra has done what it could in a single run and has reached a terminal point where it can't proceed without a human decision.

Possible causes:

  • Both reviewers pass but the quality recheck after their fixes is red (a fix introduced a new violation).
  • One of the two reviews ends in :fail.
  • The applier, Axel Pliér, ends in :fail.
  • Build crashed mid-stage (container OOM, rate-limit on all accounts, …).
  • Agent-maxed-out: a persona used up its turn budget.

Your role as a human: decide which of the four options below is the best fit and apply it. Don't blindly slap on another retry:queued in the hope that this time it works.

The intervention ladder

| # | Situation | Intervention | Cost | What you keep |
|---|-----------|--------------|------|---------------|
| 1 | Development has moved on, PR is stale (new lint rules, ADRs, fixtures on development) | gh pr update-branch <N> (or the cron picks it up) | Free | All build/review/applier work. Just refresh dependencies. |
| 2 | Review contracts were adjusted and you want reviewers to judge again without a rebuild | Reset reviewer labels: strip stage labels, set code-review:queued again. PR stays. | 2× reviewer + applier | The builder output. Only the review part rewinds. |
| 3 | Reviewers/applier flagged concrete findings and the builder just needs to polish | Remove needs-input + fail labels first, then add retry:queued (see below) | 1× builder + 2× reviewer + applier | Branch + PR stay. Orchestrator builds feedback.md from hydra.json (unfixed + applier blockers + already-fixed) and dispatches builder in HYDRA_MODE=fix. Single shot, no loop. |
| 4 | Builder output was fundamentally wrong (wrong approach, stub implementation, missed core requirements) | Remove needs-input + fail labels first, then add rebuild:queued (see below) | Full cycle | Orchestrator closes the open PR, hard-resets feature/<issue>/* to development, strips all cycle labels, sets build:queued. Next cycle starts from scratch. |
| 5 | You closed the PR yourself and stale labels are still on the issue | Wait for reconcile.sh (every 10 min) | Next reconcile run | check_stage_without_open_pr sees the inconsistency and automatically sets build:queued. Self-healing. |
| 6 | Pipeline infra is broken (container crash, all tokens exhausted, image not findable) | Escalate to a human (you or a team lead), fix infra, and only THEN pick one of 1-4 above | Time | |

Order: always 1 → 2 → 3 → 4. Take the cheapest recovery first. Reach for rebuild:queued only if the build itself was fundamentally wrong.

retry:queued in detail

The supervisor will not dispatch retry:queued while needs-input is still on the issue. You must remove stale labels before applying the trigger — otherwise the issue sits in the queue forever and nothing moves.
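A pre-flight check can make the blocking labels visible before you add the trigger. The following is a minimal sketch and not part of Hydra: the function name `blocking_labels` is hypothetical, and the set of labels it treats as blocking is simply the set that the cleanup commands in this section remove.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read label names (one per line) on stdin and print
# the ones that should be removed before adding retry:queued.
# Optional $1 is a label prefix such as "wilco" (HYDRA_LABEL_PREFIX);
# with a prefix, both the plain and the prefixed variants are reported.
blocking_labels() {
  local prefix="${1:+$1-}" label
  while IFS= read -r label; do
    case "$label" in
      needs-input | code-review:fail | security-review:fail | \
      "${prefix}needs-input" | "${prefix}code-review:fail" | "${prefix}security-review:fail")
        printf '%s\n' "$label" ;;
    esac
  done
}

# Possible usage, with the same placeholders as elsewhere in this tutorial:
#   gh issue view <N> --repo ConductionNL/<app> --json labels \
#       --jq '.labels[].name' | blocking_labels
```

Empty output means none of these labels is set on the issue.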

Before adding retry:queued, run these commands (use scripts/hydra-label.sh — it keeps the project board in sync):

# Remove stale labels
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove needs-input
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove code-review:fail
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove security-review:fail

# Now apply the trigger
./scripts/hydra-label.sh ConductionNL/<app> <issue> add retry:queued

If you're running with HYDRA_LABEL_PREFIX (e.g. wilco), strip the prefixed variants too:

./scripts/hydra-label.sh ConductionNL/<app> <issue> remove wilco-needs-input
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove wilco-code-review:fail
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove wilco-security-review:fail
./scripts/hydra-label.sh ConductionNL/<app> <issue> add wilco-retry:queued

See the retry-and-rebuild operations guide for the full checklist and rebuild:queued equivalent.
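If you run this cleanup often, the commands above could be wrapped in one function. This is a sketch under assumptions, not a script that ships with Hydra: `hydra_requeue` and the `DRY_RUN` switch are hypothetical names. It also assumes that scripts/hydra-label.sh exits with 0 when asked to remove a label that is not set, and that with a prefix only the prefixed trigger is added.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around scripts/hydra-label.sh.
# With DRY_RUN=1 the commands are printed instead of executed.
_hydra_label() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo ./scripts/hydra-label.sh "$@"
  else
    ./scripts/hydra-label.sh "$@"
  fi
}

# Usage: hydra_requeue <owner/repo> <issue> <retry|rebuild> [label-prefix]
hydra_requeue() {
  local repo="$1" issue="$2" trigger="$3" prefix="${4:+$4-}" label
  case "$trigger" in
    retry | rebuild) ;;
    *) echo "trigger must be retry or rebuild" >&2; return 2 ;;
  esac
  for label in needs-input code-review:fail security-review:fail; do
    _hydra_label "$repo" "$issue" remove "$label" || return 1
    if [ -n "$prefix" ]; then
      _hydra_label "$repo" "$issue" remove "${prefix}${label}" || return 1
    fi
  done
  # The trigger goes on last, so it never sits next to needs-input.
  _hydra_label "$repo" "$issue" add "${prefix}${trigger}:queued"
}
```

Running `DRY_RUN=1 hydra_requeue ConductionNL/<app> <issue> retry` first shows exactly which label commands would run, which is a cheap check before touching a real issue.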

What happens next:

  1. Supervisor picks up retry:queued as a regular queue job.
  2. Orchestrator finds the open PR on feature/<issue>/*.
  3. Orchestrator clones development, copies openspec/changes/<slug>/hydra.json and runs scripts/lib/build-feedback-brief.py → writes feedback.md with (a) applier blockers, (b) unfixed reviewer findings, (c) already-fixed items, (d) Scope list of all flagged files.
  4. Orchestrator flips retry:queued → retry:running, dispatches the builder with HYDRA_MODE=fix and feedback.md mounted at /workspace/feedback.md.
  5. Builder reads the brief, fixes all findings in one pass, restricts itself to Scope files (plus new files explicitly required by a blocker), commits with fix (retry): messages, pushes.
  6. Success: orchestrator strips all review/applier labels + retry:running, sets code-review:queued. Reviewers run again against the fixed code.
  7. Failure: retry:running → needs-input, a comment on the issue explains why and points to rebuild:queued as the next lever.

No loop. One retry:queued = one iteration. If the fix builder pushes and the next review fails again, you can request another retry:queued — but that's then a new single-shot, not an auto-retry.
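The label flow of a single retry can be summarized as a small state function. It is only a reading aid: `retry_next_label` is a hypothetical name, and the real transitions live in the orchestrator.

```shell
#!/usr/bin/env bash
# Hypothetical summary of the retry label flow described above.
# $1 = current label
# $2 = outcome of the fix builder (success|failure); only consulted
#      when the current label is retry:running.
retry_next_label() {
  case "$1" in
    retry:queued) echo "retry:running" ;;
    retry:running)
      case "$2" in
        success) echo "code-review:queued" ;;
        failure) echo "needs-input" ;;
        *) echo "unknown outcome: $2" >&2; return 1 ;;
      esac ;;
    *) echo "not part of the retry flow: $1" >&2; return 1 ;;
  esac
}
```

Note that no branch leads back to retry:queued. That is the "no loop" property: only a human adding the label starts another iteration.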

rebuild:queued in detail

Same label cleanup requirement as retry:queued — the supervisor won't dispatch while needs-input is set:

./scripts/hydra-label.sh ConductionNL/<app> <issue> remove needs-input
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove code-review:fail
./scripts/hydra-label.sh ConductionNL/<app> <issue> remove security-review:fail
./scripts/hydra-label.sh ConductionNL/<app> <issue> add rebuild:queued

Do not close the PR yourself first — the orchestrator closes it as part of the reset sequence. If you already closed it, reconcile.sh (runs every 10 min) will detect the inconsistency and auto-set build:queued.

When is it the gate, and when is it the builder?

A recurring pitfall from our own retrospectives: the issue goes to needs-input 3 times for the same reason — typically spdx-headers: fail or a similar mechanical gate. Each retry:queued reproduces the same pattern.

In that case a fourth retry is not the right move. Be disciplined:

  1. Sanity-check the gate itself. Is this a false positive? (See part 3.) A classic: a grep without a word-boundary that matches more than intended. In that case fix scripts/run-quality.sh or the gate skill.
  2. Hand-fix the mechanical violation. Sometimes it's faster to add the SPDX headers yourself, commit on the feature branch, and call the reviewers again via a label reset (situation 2 above). Time: minutes.
  3. Consider a stronger agent. If the same class of mistake shows up as a "blind spot" across multiple repos, that's a signal that the current builder prompt or the mechanical gate isn't enough. Open an issue in hydra/ to tighten the gate or build a "housekeeping" agent.
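The word-boundary pitfall from step 1 is easy to reproduce. The sketch below uses a made-up gate that is meant to flag the identifier `eval`. The sample lines and the gate are illustrative only and are not taken from scripts/run-quality.sh.

```shell
#!/usr/bin/env bash
# Illustrative input: only the first line really uses eval.
sample='eval "$cmd"
retrieval_count=3
evaluate_score()'

# Without a word boundary, "eval" also matches retrieval and evaluate.
loose_hits=$(printf '%s\n' "$sample" | grep -c 'eval')

# -w only matches eval as a whole word, so the false positives disappear.
strict_hits=$(printf '%s\n' "$sample" | grep -cw 'eval')

echo "loose=$loose_hits strict=$strict_hits"   # prints: loose=3 strict=1
```

If a gate reports violations on lines like the second and third, the gate is wrong and no number of builder retries will turn it green.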

Watch-list patterns

Patterns we've already seen and that are worth comparing against your situation first:

Diagnostics: where do you look?

Quick checklist when an issue goes to needs-input:

# 1. Which labels are currently set?
gh issue view <N> --repo ConductionNL/<app> --json labels --jq '[.labels[].name]'

# 2. What does hydra.json say (if present on the feature branch)?
# (the path is quoted so the shell does not treat "?" as a glob character)
gh api "repos/ConductionNL/<app>/contents/openspec/changes/<slug>/hydra.json?ref=feature/<N>/<slug>" \
    --jq '.content' | base64 -d | jq '.cycles[-1] | {outcome, outcome_reason, pattern_tags}'

# 3. The supervisor log for the relevant time window
grep "issue/<N>" logs/supervisor.log | tail -50

# 4. Pipeline logs (per stage) on the feature branch
gh api "repos/ConductionNL/<app>/contents/openspec/changes/<slug>/pipeline-logs?ref=feature/<N>/<slug>"

hydra.json is your most important source. The outcome_reason and pattern_tags of the last cycle usually explain in a single line what went wrong.
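To see whether an issue keeps escalating for the same reason, you could count how many of the most recent cycles share one outcome_reason. A minimal sketch: `same_reason_streak` is a hypothetical helper, and the usage line assumes that every entry in `.cycles` carries an `outcome_reason` string on a single line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one outcome_reason per line (oldest cycle first)
# and print how many of the most recent cycles share the last reason.
same_reason_streak() {
  awk '
    { reason = $0 "" }                  # force a string comparison
    reason == last { streak++; next }   # same reason as the previous cycle
    { last = reason; streak = 1 }       # a new reason starts a new streak
    END { print streak + 0 }            # prints 0 for empty input
  '
}

# Possible usage, with the same placeholders as the checklist above:
#   gh api "repos/ConductionNL/<app>/contents/openspec/changes/<slug>/hydra.json?ref=feature/<N>/<slug>" \
#       --jq '.content' | base64 -d | jq -r '.cycles[].outcome_reason' | same_reason_streak
```

A result of 3 or more matches the recurring pitfall described earlier in this part, where another retry:queued is unlikely to help.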

When is "human takes over" simply the right call?

Not every needs-input issue deserves an automated fix path. Sometimes the most pragmatic action is: human takes over, fixes it in a commit on the feature branch, pushes, marks resolved.

Examples:

  • A small semantic decision the Builder can't make without context outside the change (e.g. "should this field be optional or required?").
  • An interaction with an external party (open ticket with a vendor).
  • A gate that is a false positive and requires a fix in hydra/ itself (which you won't resolve within one retry cycle).

Hydra is a factory, not an ideology. Take the human shortcut when it's the better call.

Test yourself

Four short questions to check whether you've grasped this part. Each comes with a hint and an answer.

1. In what order do you try the interventions in the ladder, and why?

Hint

One dimension: cost. The other: how much already-done work you lose.

Answer

From cheap to expensive, with the goal of preserving as much already-done work as possible:

  1. gh pr update-branch — free. All build/review/applier work stays, just refresh dependencies.
  2. Reset reviewer labels — costs 2× reviewer + applier. Builder output stays.
  3. retry:queued — costs 1× builder + 2× reviewer + applier. Branch + PR stay; builder fixes scoped via feedback.md.
  4. rebuild:queued — full cycle. PR closes, branch resets, from scratch.
  5. Manual fix — when the loop hangs on a false-positive gate or a human needs to make a semantic decision.

Reach for rebuild:queued only if the original build was fundamentally wrong; otherwise you waste good work.

2. What's the difference between "reset reviewer labels" and retry:queued?

Hint

The question is: is the builder called again, yes or no?

Answer
  • Reset reviewer labels (strip stage labels + set code-review:queued again): the builder is NOT called again. PR + builder output stay exactly as they were; only Juan and Clyde run again. Appropriate when the review contracts were adjusted (new ADR, new gate skill) and you want to re-judge without changing the code.
  • retry:queued: the builder IS called again, in HYDRA_MODE=fix, with feedback.md (from hydra.json) as input. It makes new commits, then reviewers run again. Appropriate when reviewers had legitimate findings and the builder just needs to polish.

3. When does it make more sense to do the fix by hand instead of another retry:queued?

Hint

If the same pattern keeps repeating on the same mechanical gate, throwing more money at it is not the answer.

Answer

When the same issue has gone to needs-input 3 times or more for the same reason — typically a mechanical gate like spdx-headers: fail. Each retry:queued reproduces the same pattern: builder makes the same mistake, reviewer sees it as boilerplate, recheck goes red.

In that case:

  • First check whether it's a false-positive gate (see part 3). If so: tighten gate detection.
  • If not: fix it by hand (e.g. commit SPDX headers yourself on the feature branch) and use a label reset to have the reviewers judge again.

Time: minutes. Cheaper than wasting another Sonnet cycle on something that keeps going wrong.

4. What do you do if you see 3+ identical needs-input comments within a short period?

Hint

This says something about the pipeline itself, not about one issue. What kind of fix is that?

Answer

This is an infra bug, not a retry question. Somewhere in the completion handler a terminal-state guard is missing, causing a tight loop to post a new escalation comment every tick on an already-terminal issue. Example: 134 duplicate comments overnight on 23 April 2026 — the fix was a terminal-state guard before all side-effects (see CLAUDE.md section "Terminal-state guards").

What you do: open an infra issue in hydra/, not another retry:queued. A retry does nothing about the comment storm and may even add more chaos.
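A quick way to confirm a comment storm before filing the infra issue is to count identical comment bodies. The helper below is a hypothetical sketch: `max_duplicate_comments` is not part of Hydra, and it expects one comment body per line, so multi-line comments have to be flattened first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one comment body per line on stdin and print
# the highest number of identical bodies (0 for empty input).
max_duplicate_comments() {
  sort | uniq -c | sort -rn |
    awk 'NR == 1 { print $1 } END { if (NR == 0) print 0 }'
}

# Possible usage; the jq filter flattens each body onto a single line:
#   gh issue view <N> --repo ConductionNL/<app> --json comments \
#       --jq '.comments[].body | gsub("\n"; " ")' | max_duplicate_comments
```

A count of 3 or more for one body points at the pipeline rather than at the issue itself.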

Done — what now?

You've worked through the whole tutorial series. What's sensible next?