Job Attempt Lineage (GitHub-Actions-style retry UI)

Problem

Today every retry of a migration job creates a new jobs row that is only linked to its predecessor through rerun_of_job_id. The UI renders each row as a separate item in the list, so a single logical "import tenant X" can appear as 3 rows (attempt 1 failed, attempt 2 failed, attempt 3 succeeded) with no visual grouping. Operators scrolling the list have to manually correlate jobs by tenant/target/config to understand history.

Target UX — the same pattern GitHub Actions uses for workflow re-runs:

  • One row per logical job lineage (root + all its retries collapsed).
  • The row shows the latest attempt's status and a badge Latest #N.
  • Clicking the row opens the detail panel for the latest attempt; a dropdown lets the operator switch between prior attempts (logs, artifacts, duration).
  • Each attempt keeps its own job_logs and job_artifacts rows; nothing is merged at the data layer.

Current state (baseline)

  • jobs.rerun_of_job_id UUID NULL REFERENCES jobs(id) already persists the predecessor link. Added in 005_jobs_rerun_of_job_id.sql.
  • buildRerunJob sets RerunOfJobID = src.ID for every retry (repository.go).
  • loadInheritedInstanceReplaceResults walks one predecessor only (the direct parent). Repeated retries still work because each retry persists a merged artifact, so the chain's latest predecessor carries the full succeeded-tenant set.

Data model

Two new columns on jobs:

| Column | Type | Meaning |
| --- | --- | --- |
| root_job_id | UUID NOT NULL | The root of the retry chain. For attempt #1 (no predecessor), root_job_id = id (self-reference). For attempt #N, copied from the predecessor's root_job_id. Indexed. |
| attempt_number | INT NOT NULL DEFAULT 1 | 1 for the root, N for the Nth attempt (so the first retry is 2). Denormalised (predecessor's attempt_number + 1). |

Chosen over resolving the lineage with a recursive CTE at query time because:

  • List queries must filter / aggregate by lineage cheaply. A denormalised root lets us GROUP BY root_job_id or fetch latest with a single index hit.
  • attempt_number gives the UI its #N badge without another round-trip.
  • Backfill is a one-time UPDATE; derivations stay local to buildRerunJob.

Migration sketch

```sql
-- +goose Up
-- root_job_id starts nullable so existing rows can be backfilled before NOT NULL is enforced.
ALTER TABLE jobs ADD COLUMN root_job_id UUID;
ALTER TABLE jobs ADD COLUMN attempt_number INT NOT NULL DEFAULT 1;

-- Backfill: walk the rerun_of_job_id chain to the root, count hops.
WITH RECURSIVE chain AS (
  -- Anchor: a job with no predecessor is its own root, attempt 1.
  SELECT id, rerun_of_job_id, id AS root, 1 AS attempt
  FROM jobs
  WHERE rerun_of_job_id IS NULL
  UNION ALL
  -- Step: each rerun inherits its predecessor's root and adds one to the attempt.
  SELECT j.id, j.rerun_of_job_id, c.root, c.attempt + 1
  FROM jobs j
  JOIN chain c ON j.rerun_of_job_id = c.id
)
UPDATE jobs
SET root_job_id = chain.root,
    attempt_number = chain.attempt
FROM chain
WHERE jobs.id = chain.id;

ALTER TABLE jobs ALTER COLUMN root_job_id SET NOT NULL;
CREATE INDEX idx_jobs_root_job_id ON jobs (root_job_id);

-- +goose Down
DROP INDEX IF EXISTS idx_jobs_root_job_id;
ALTER TABLE jobs DROP COLUMN IF EXISTS attempt_number;
ALTER TABLE jobs DROP COLUMN IF EXISTS root_job_id;
```
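
To sanity-check what the backfill should produce, the same walk can be sketched in memory. This is only an illustration in Go: jobRow and backfillLineage are hypothetical names, IDs are plain strings instead of UUIDs, and an empty RerunOfJobID stands in for NULL.

```go
package main

import "fmt"

// jobRow is a hypothetical, trimmed-down view of a jobs row.
// IDs are plain strings here; the real columns are UUIDs.
type jobRow struct {
	ID            string
	RerunOfJobID  string // "" stands in for NULL
	RootJobID     string
	AttemptNumber int
}

// backfillLineage mirrors the recursive CTE: every row without a
// predecessor is seeded as its own root (attempt 1), then rows that
// rerun an already-visited row inherit its root and take its attempt
// number plus one.
func backfillLineage(rows []jobRow) {
	children := make(map[string][]int) // predecessor ID -> indexes of its reruns
	var queue []int
	for i := range rows {
		if rows[i].RerunOfJobID == "" {
			rows[i].RootJobID = rows[i].ID
			rows[i].AttemptNumber = 1
			queue = append(queue, i)
			continue
		}
		pred := rows[i].RerunOfJobID
		children[pred] = append(children[pred], i)
	}
	for len(queue) > 0 {
		parent := rows[queue[0]]
		queue = queue[1:]
		for _, c := range children[parent.ID] {
			rows[c].RootJobID = parent.RootJobID
			rows[c].AttemptNumber = parent.AttemptNumber + 1
			queue = append(queue, c)
		}
	}
}

func main() {
	rows := []jobRow{
		{ID: "a"},
		{ID: "b", RerunOfJobID: "a"},
		{ID: "c", RerunOfJobID: "b"},
		{ID: "d"}, // unrelated job that was never retried
	}
	backfillLineage(rows)
	for _, r := range rows {
		fmt.Printf("%s root=%s attempt=%d\n", r.ID, r.RootJobID, r.AttemptNumber)
	}
}
```

For the chain a, b, c this yields attempts 1, 2, 3 under root a, while d stays its own root at attempt 1. One thing the sketch makes visible: if two existing rows rerun the same predecessor, both receive the same attempt number, and the SQL backfill behaves the same way, so it may be worth checking whether such forks exist before running it.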

Backend

Repository / handler

  • buildRerunJob sets:
    • RootJobID = src.RootJobID (copy from predecessor).
    • AttemptNumber = src.AttemptNumber + 1.
  • CreateJobIfUnderCap (and CreateJob) default the root for a fresh (non-retry) job: when RerunOfJobID == nil, set RootJobID = j.ID right after stampNewJob(j) assigns the UUID, and AttemptNumber = 1.
  • jobInsertArgs / insertJobSQL / scanJob / jobColumns extended (same drift-guard pattern that's already in the file).
  • model.Job gets RootJobID uuid.UUID + AttemptNumber int with JSON tags.
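
The stamping rules above can be sketched roughly as follows. This is a minimal sketch, not the real repository code: the Job struct is a trimmed stand-in for model.Job, IDs are strings rather than uuid.UUID to keep it dependency-free, the helper names stampLineageForNewJob and stampLineageForRerun are hypothetical, and the rerunOfJobId JSON tag is an assumption.

```go
package main

import "fmt"

// Job is a trimmed, hypothetical stand-in for model.Job.
type Job struct {
	ID            string  `json:"id"`
	RerunOfJobID  *string `json:"rerunOfJobId,omitempty"` // tag name is an assumption
	RootJobID     string  `json:"rootJobId"`
	AttemptNumber int     `json:"attemptNumber"`
}

// stampLineageForNewJob covers the fresh-job path. It must run after the
// ID has been assigned: a job with no predecessor roots its own lineage.
func stampLineageForNewJob(j *Job) {
	if j.RerunOfJobID == nil {
		j.RootJobID = j.ID
		j.AttemptNumber = 1
	}
}

// stampLineageForRerun covers the retry path: the new attempt links to
// its predecessor, inherits the predecessor's root, and takes the next
// attempt number.
func stampLineageForRerun(src Job, rerun *Job) {
	srcID := src.ID
	rerun.RerunOfJobID = &srcID
	rerun.RootJobID = src.RootJobID
	rerun.AttemptNumber = src.AttemptNumber + 1
}

func main() {
	root := Job{ID: "job-1"}
	stampLineageForNewJob(&root)

	second := Job{ID: "job-2"}
	stampLineageForRerun(root, &second)

	third := Job{ID: "job-3"}
	stampLineageForRerun(second, &third)

	fmt.Println(third.RootJobID, third.AttemptNumber)
}
```

Because the root is copied rather than recomputed, a retry of a retry never needs to walk the chain; the third attempt here ends up with root job-1 and attempt number 3.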

New endpoint

GET /api/v1/jobs/lineages — lineage-aware list.

Returns one entry per root_job_id, with the latest attempt's summary plus minimal metadata about prior attempts:

```json
[
  {
    "rootJobId": "…",
    "type": "tenant_import",
    "latest": { "id": "…", "status": "succeeded", "attemptNumber": 3, "startedAt": "…", "completedAt": "…" },
    "attempts": [
      { "id": "…", "attemptNumber": 1, "status": "failed", "completedAt": "…" },
      { "id": "…", "attemptNumber": 2, "status": "failed", "completedAt": "…" },
      { "id": "…", "attemptNumber": 3, "status": "succeeded", "completedAt": "…" }
    ],
    "tenantId": "…",
    "tenantName": "…",
    …rest of the latest attempt's config fields needed by the list row
  }
]
```

Implementation: a single SQL query using DISTINCT ON (root_job_id) with ORDER BY root_job_id, attempt_number DESC for the latest summary (PostgreSQL requires the DISTINCT ON expression to lead the ORDER BY), plus a second query (or CTE) for the compact attempt list.
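
As a rough illustration of the response assembly, the sketch below groups already-fetched attempt rows by root and picks the highest attempt as latest, which is the in-memory equivalent of the DISTINCT ON query. The type and function names (AttemptSummary, Lineage, jobRow, groupLineages) are hypothetical, timestamps and tenant fields are left out for brevity, and the SQL constant is an untested sketch rather than the final query.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// latestPerRootSQL sketches the "latest attempt per lineage" query.
// DISTINCT ON keeps the first row per root_job_id in ORDER BY order.
const latestPerRootSQL = `
SELECT DISTINCT ON (root_job_id) *
FROM jobs
ORDER BY root_job_id, attempt_number DESC`

type AttemptSummary struct {
	ID            string `json:"id"`
	AttemptNumber int    `json:"attemptNumber"`
	Status        string `json:"status"`
}

type Lineage struct {
	RootJobID string           `json:"rootJobId"`
	Latest    AttemptSummary   `json:"latest"`
	Attempts  []AttemptSummary `json:"attempts"`
}

// jobRow is a hypothetical, trimmed-down jobs row.
type jobRow struct {
	ID, RootJobID, Status string
	AttemptNumber         int
}

// groupLineages returns one Lineage per root, in first-seen root order,
// with attempts sorted ascending and Latest set to the highest attempt.
func groupLineages(rows []jobRow) []Lineage {
	byRoot := make(map[string]*Lineage)
	var order []string
	for _, r := range rows {
		l, ok := byRoot[r.RootJobID]
		if !ok {
			l = &Lineage{RootJobID: r.RootJobID}
			byRoot[r.RootJobID] = l
			order = append(order, r.RootJobID)
		}
		l.Attempts = append(l.Attempts, AttemptSummary{
			ID: r.ID, AttemptNumber: r.AttemptNumber, Status: r.Status,
		})
	}
	out := make([]Lineage, 0, len(order))
	for _, root := range order {
		l := byRoot[root]
		sort.Slice(l.Attempts, func(i, j int) bool {
			return l.Attempts[i].AttemptNumber < l.Attempts[j].AttemptNumber
		})
		l.Latest = l.Attempts[len(l.Attempts)-1]
		out = append(out, *l)
	}
	return out
}

func main() {
	rows := []jobRow{
		{ID: "j2", RootJobID: "j1", Status: "failed", AttemptNumber: 2},
		{ID: "j1", RootJobID: "j1", Status: "failed", AttemptNumber: 1},
		{ID: "j3", RootJobID: "j1", Status: "succeeded", AttemptNumber: 3},
	}
	out, err := json.MarshalIndent(groupLineages(rows), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The latest attempt is chosen purely by attempt number, not by status, so a running attempt #3 would occupy the latest slot ahead of a terminal attempt #2.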

Existing endpoints

  • GET /jobs / GET /jobs/:id / GET /jobs/:id/logs / GET /jobs/:id/artifacts — unchanged. Attempts keep their own IDs, so selecting a specific attempt in the new UI simply fetches by that attempt's job id.
  • GET /instances/:id/jobs still returns every attempt row; the UI groups the rows by rootJobId client-side so the instance view shows the same one-row-per-lineage layout as the main list.

Frontend

List page

  • Query migrationsApi.listLineages() instead of listJobs() (or a new helper that calls the same underlying endpoint — decided during impl).
  • Row renders latest.status + Latest #N badge + "N attempts" meta. Click opens detail for latest.id.

Detail panel

  • Accept jobId as today.
  • On mount, fetch the job, then fetch its lineage (GET /jobs/lineages?root=<rootJobId> or reuse the list endpoint filtered to one lineage). Show a dropdown in the header listing #1 failed, #2 failed, #3 succeeded with timestamps.
  • Switching attempts swaps jobId via setSelectedJobId — everything below the header (status, logs, artifacts, retry button) re-queries.
  • Retry button stays on the latest attempt only (prevents retrying an older attempt out of order; retry-of-retry is always "retry from latest").
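
The dropdown labels and the retry-latest rule boil down to a small amount of logic. The sketch below expresses it in Go only to stay consistent with the backend sketches; the real code would live in the frontend, and the names attempt, latestAttempt, canRetry and dropdownLabel are hypothetical.

```go
package main

import "fmt"

// attempt is a hypothetical, minimal view of one entry in a lineage.
type attempt struct {
	ID            string
	AttemptNumber int
	Status        string
}

// latestAttempt returns the attempt with the highest attempt number.
// ok is false when the lineage is empty.
func latestAttempt(attempts []attempt) (latest attempt, ok bool) {
	for _, a := range attempts {
		if !ok || a.AttemptNumber > latest.AttemptNumber {
			latest, ok = a, true
		}
	}
	return latest, ok
}

// canRetry reports whether the retry action should be offered for the
// selected attempt: only the latest attempt of the lineage qualifies.
func canRetry(attempts []attempt, selectedID string) bool {
	latest, ok := latestAttempt(attempts)
	return ok && latest.ID == selectedID
}

// dropdownLabel formats one dropdown entry, e.g. "#2 failed".
func dropdownLabel(a attempt) string {
	return fmt.Sprintf("#%d %s", a.AttemptNumber, a.Status)
}

func main() {
	lineage := []attempt{
		{ID: "j1", AttemptNumber: 1, Status: "failed"},
		{ID: "j2", AttemptNumber: 2, Status: "failed"},
		{ID: "j3", AttemptNumber: 3, Status: "succeeded"},
	}
	for _, a := range lineage {
		fmt.Printf("%s retry=%t\n", dropdownLabel(a), canRetry(lineage, a.ID))
	}
}
```

Only the entry for attempt #3 reports retry=true here. The check deliberately ignores status, so any additional condition on the existing retry button (for example, requiring a terminal state) would still apply on top of it.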

Visual reference

Target look (operator's screenshot):

  • Page header: root job display name + global Latest #N dropdown.
  • Right-side popover on the dropdown shows per-attempt status + timestamp + actor.
  • Retry status line ("Re-run triggered now") while a new attempt is queued / running.

Open questions

  1. Should the list include non-terminal attempts in the "latest" slot? Probably yes — if attempt #3 is running, the list row should show it running, not the last terminal attempt. DISTINCT ON ordered by attempt_number DESC gives this for free.
  2. Retry-of-specific-attempt — do we want to allow retrying attempt #2 (skipping #3)? Adds complexity for no real use case I can see. Start with "retry always targets latest"; revisit only if requested.
  3. Cross-type lineage: buildRerunJob never changes type, so lineages are always single-type. The UI can rely on that.
  4. Display limit — if someone chains 30 retries, the attempts dropdown shouldn't scroll forever. Cap the dropdown at 10 most-recent + "view all" link that drops into a modal. Defer to UX pass during impl.

Rollout plan

  1. SQL migration + Job model + repo changes (backfill safe; rows with no predecessor get root_job_id = id and attempt_number = 1, existing retry rows get their chain's root and hop count).
  2. buildRerunJob and CreateJobIfUnderCap/CreateJob populate the new fields on new rows.
  3. New lineage list endpoint behind a feature flag (FEATURE_JOB_LINEAGE_UI) — deployable before the frontend is ready.
  4. Frontend swap on the same flag once the endpoint is in prod.
  5. Remove flag once both sides have been in prod for a week with no regressions.

Estimate

  • Backend: ~1 day (migration + backfill SQL + model/repo + endpoint + tests).
  • Frontend: ~1 day (list aggregation, detail-panel dropdown, retry-latest gating).
  • Manual QA + flag rollout: ~0.5 day.

Total: ~2.5 days for a clean ship.