Job Attempt Lineage (GitHub-Actions-style retry UI)
Problem
Today every retry of a migration job creates a new jobs row that is only linked to its predecessor through rerun_of_job_id. The UI renders each row as a separate item in the list, so a single logical "import tenant X" can appear as 3 rows (attempt 1 failed, attempt 2 failed, attempt 3 succeeded) with no visual grouping. Operators scrolling the list have to manually correlate jobs by tenant/target/config to understand history.
Target UX — the same pattern GitHub Actions uses for workflow re-runs:
- One row per logical job lineage (root + all its retries collapsed).
- The row shows the latest attempt's status and a badge `Latest #N`.
- Clicking the row opens the detail panel for the latest attempt; a dropdown lets the operator switch between prior attempts (logs, artifacts, duration).
- Each attempt keeps its own `job_logs` and `job_artifacts` rows; nothing is merged at the data layer.
Current state (baseline)
- `jobs.rerun_of_job_id UUID NULL REFERENCES jobs(id)` already persists the predecessor link. Added in `005_jobs_rerun_of_job_id.sql`.
- `buildRerunJob` sets `RerunOfJobID = src.ID` for every retry (`repository.go`).
- `loadInheritedInstanceReplaceResults` walks one predecessor only (the direct parent). Repeated retries still work because each retry persists a merged artifact, so the chain's latest predecessor carries the full succeeded-tenant set.
Data model
Two new columns on jobs:
| Column | Type | Meaning |
|---|---|---|
| `root_job_id` | `UUID NOT NULL` | The root of the retry chain. For attempt #1 (no predecessor), `root_job_id = id` (self-reference). For attempt #N, copied from the predecessor's `root_job_id`. Indexed. |
| `attempt_number` | `INT NOT NULL DEFAULT 1` | 1 for the root, N for the Nth attempt in the chain. Denormalised (predecessor's `attempt_number` + 1). |
Chosen over a recursive CTE because:
- List queries must filter / aggregate by lineage cheaply. A denormalised root lets us `GROUP BY root_job_id` or fetch latest with a single index hit.
- `attempt_number` gives the UI its `#N` badge without another round-trip.
- Backfill is a one-time UPDATE; derivations stay local to `buildRerunJob`.
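Both columns are fully derivable from the existing `rerun_of_job_id` links, which is what makes a one-time backfill possible. The sketch below models that derivation in memory as a sanity check of the rule, not as production code. The `Job` struct and `BackfillLineage` are hypothetical names, and string IDs stand in for UUIDs to keep the example dependency-free.

```go
package main

// Job is a hypothetical, trimmed-down stand-in for a jobs row.
// String IDs are an assumption of this sketch; the real column is a UUID.
type Job struct {
	ID            string
	RerunOfJobID  string // "" means no predecessor (a root attempt)
	RootJobID     string
	AttemptNumber int
}

// BackfillLineage fills RootJobID and AttemptNumber for every job by walking
// outward from the roots along the rerun_of_job_id links.
// Jobs whose chain never reaches a root are left untouched.
func BackfillLineage(jobs []*Job) {
	children := make(map[string][]*Job) // predecessor ID -> direct retries
	var queue []*Job
	for _, j := range jobs {
		if j.RerunOfJobID == "" {
			// Root attempt: self-reference, attempt #1.
			j.RootJobID, j.AttemptNumber = j.ID, 1
			queue = append(queue, j)
			continue
		}
		children[j.RerunOfJobID] = append(children[j.RerunOfJobID], j)
	}
	// Breadth-first walk: each retry inherits the root and adds one hop.
	for len(queue) > 0 {
		parent := queue[0]
		queue = queue[1:]
		for _, c := range children[parent.ID] {
			c.RootJobID = parent.RootJobID
			c.AttemptNumber = parent.AttemptNumber + 1
			queue = append(queue, c)
		}
	}
}
```

Input order does not matter: a chain `a <- b <- c` yields attempt numbers 1, 2, 3 under root `a` regardless of how the rows are listed.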
Migration sketch
```sql
-- +goose Up
ALTER TABLE jobs ADD COLUMN root_job_id UUID;
ALTER TABLE jobs ADD COLUMN attempt_number INT NOT NULL DEFAULT 1;

-- Backfill: walk the rerun_of_job_id chain to the root, count hops.
WITH RECURSIVE chain AS (
    SELECT id, rerun_of_job_id, id AS root, 1 AS attempt
    FROM jobs
    WHERE rerun_of_job_id IS NULL
    UNION ALL
    SELECT j.id, j.rerun_of_job_id, c.root, c.attempt + 1
    FROM jobs j
    JOIN chain c ON j.rerun_of_job_id = c.id
)
UPDATE jobs
SET root_job_id = chain.root,
    attempt_number = chain.attempt
FROM chain
WHERE jobs.id = chain.id;

ALTER TABLE jobs ALTER COLUMN root_job_id SET NOT NULL;
CREATE INDEX idx_jobs_root_job_id ON jobs (root_job_id);

-- +goose Down
DROP INDEX IF EXISTS idx_jobs_root_job_id;
ALTER TABLE jobs DROP COLUMN IF EXISTS attempt_number;
ALTER TABLE jobs DROP COLUMN IF EXISTS root_job_id;
```

Backend
Repository / handler
- `buildRerunJob` sets:
  - `RootJobID = src.RootJobID` (copy from predecessor).
  - `AttemptNumber = src.AttemptNumber + 1`.
- `CreateJobIfUnderCap` (and `CreateJob`) default the root for a fresh (non-retry) job: when `RerunOfJobID == nil`, set `RootJobID = j.ID` right after `stampNewJob(j)` assigns the UUID, and `AttemptNumber = 1`.
- `jobInsertArgs` / `insertJobSQL` / `scanJob` / `jobColumns` extended (same drift-guard pattern that's already in the file).
- `model.Job` gets `RootJobID uuid.UUID` + `AttemptNumber int` with JSON tags.
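The two stamping rules above (fresh job vs. retry) can be sketched as a single helper. This is a minimal illustration under stated assumptions: `stampLineage` is a hypothetical name, the `Job` struct is a cut-down stand-in for `model.Job`, string IDs replace `uuid.UUID`, and the JSON tag names are guesses modelled on the response shape later in this doc. The real logic is split between `buildRerunJob` and `CreateJobIfUnderCap` / `CreateJob`.

```go
package main

// Job is a hypothetical, trimmed-down version of model.Job.
// String IDs and the JSON tag names are assumptions of this sketch.
type Job struct {
	ID            string  `json:"id"`
	RerunOfJobID  *string `json:"rerunOfJobId,omitempty"`
	RootJobID     string  `json:"rootJobId"`
	AttemptNumber int     `json:"attemptNumber"`
}

// stampLineage sets the lineage fields on a job whose ID is already assigned.
// src is the predecessor for a retry, or nil for a fresh (non-retry) job.
func stampLineage(j *Job, src *Job) {
	if src == nil {
		// Fresh job: it is the root of its own lineage.
		j.RerunOfJobID = nil
		j.RootJobID = j.ID
		j.AttemptNumber = 1
		return
	}
	// Retry: link to the predecessor, inherit its root, bump the counter.
	prev := src.ID
	j.RerunOfJobID = &prev
	j.RootJobID = src.RootJobID
	j.AttemptNumber = src.AttemptNumber + 1
}
```

Because a retry copies `RootJobID` from its predecessor instead of re-walking the chain, the derivation stays constant-time no matter how long the lineage grows.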
New endpoint
GET /api/v1/jobs/lineages — lineage-aware list.
Returns one entry per root_job_id, with the latest attempt's summary plus minimal metadata about prior attempts:
```json
[
  {
    "rootJobId": "…",
    "type": "tenant_import",
    "latest": { "id": "…", "status": "succeeded", "attemptNumber": 3, "startedAt": "…", "completedAt": "…" },
    "attempts": [
      { "id": "…", "attemptNumber": 1, "status": "failed", "completedAt": "…" },
      { "id": "…", "attemptNumber": 2, "status": "failed", "completedAt": "…" },
      { "id": "…", "attemptNumber": 3, "status": "succeeded", "completedAt": "…" }
    ],
    "tenantId": "…",
    "tenantName": "…",
    …rest of the latest attempt's config fields needed by the list row
  }
]
```

Implementation: a single SQL query using `DISTINCT ON (root_job_id)` with `ORDER BY root_job_id, attempt_number DESC` for the latest summary (PostgreSQL requires the `DISTINCT ON` expression to lead the `ORDER BY`), plus a second query (or CTE) for the compact attempt list.
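For reference, the grouping the endpoint has to produce can be expressed in memory as follows. This is a sketch of the intended semantics, not the planned SQL: `Attempt`, `Lineage` and `GroupLineages` are hypothetical names, only a subset of the response fields is modelled, and the JSON tags simply mirror the example payload above.

```go
package main

import "sort"

// Attempt is a hypothetical, reduced view of one jobs row.
type Attempt struct {
	ID            string `json:"id"`
	RootJobID     string `json:"-"`
	AttemptNumber int    `json:"attemptNumber"`
	Status        string `json:"status"`
}

// Lineage is one list entry: the latest attempt plus all attempts.
type Lineage struct {
	RootJobID string    `json:"rootJobId"`
	Latest    Attempt   `json:"latest"`
	Attempts  []Attempt `json:"attempts"`
}

// GroupLineages collapses attempt rows into one entry per root. Attempts are
// sorted ascending by attempt number and the highest-numbered one becomes
// Latest, whatever its status. Lineages keep first-seen order of their roots.
func GroupLineages(rows []Attempt) []Lineage {
	byRoot := make(map[string][]Attempt)
	var roots []string // first-seen order keeps the output deterministic
	for _, r := range rows {
		if _, seen := byRoot[r.RootJobID]; !seen {
			roots = append(roots, r.RootJobID)
		}
		byRoot[r.RootJobID] = append(byRoot[r.RootJobID], r)
	}
	out := make([]Lineage, 0, len(roots))
	for _, root := range roots {
		attempts := byRoot[root]
		sort.Slice(attempts, func(i, k int) bool {
			return attempts[i].AttemptNumber < attempts[k].AttemptNumber
		})
		out = append(out, Lineage{
			RootJobID: root,
			Latest:    attempts[len(attempts)-1],
			Attempts:  attempts,
		})
	}
	return out
}
```

Picking the highest `AttemptNumber` without filtering on status means a queued or running attempt naturally occupies the `latest` slot.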
Existing endpoints
- `GET /jobs` / `GET /jobs/:id` / `GET /jobs/:id/logs` / `GET /jobs/:id/artifacts`: unchanged. Attempts keep their own IDs, so selecting a specific attempt in the new UI simply fetches by that attempt's job id.
- `GET /instances/:id/jobs`: still returns every attempt row; the UI aggregates per lineage client-side, same as the main list.
Frontend
List page
- Query `migrationsApi.listLineages()` instead of `listJobs()` (or a new helper that calls the same underlying endpoint; decided during impl).
- Row renders `latest.status` + `Latest #N` badge + "N attempts" meta. Click opens detail for `latest.id`.
Detail panel
- Accept `jobId` as today.
- On mount, fetch the job, then fetch the lineage (`GET /jobs/lineages?root=<rootJobId>` or reuse the list endpoint filtered to one lineage). Show dropdown in the header listing `#1 failed`, `#2 failed`, `#3 succeeded` with timestamps.
- Switching attempts swaps `jobId` via `setSelectedJobId`: everything below the header (status, logs, artifacts, retry button) re-queries.
- Retry button stays on the latest attempt only (prevents retrying an older attempt out of order; retry-of-retry is always "retry from latest").
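The retry-latest rule is simple enough to state as a predicate. The sketch below is written in Go only to stay consistent with the backend snippets in this doc; the actual check would live in the frontend, and a server-side guard using the same rule is an assumption, not something this design commits to. `AttemptRef`, `LatestAttemptID` and `CanRetry` are hypothetical names.

```go
package main

// AttemptRef is a hypothetical minimal view of one attempt in a lineage.
type AttemptRef struct {
	ID            string
	AttemptNumber int
}

// LatestAttemptID returns the ID of the attempt with the highest attempt
// number, or "" when the lineage has no attempts.
func LatestAttemptID(attempts []AttemptRef) string {
	latestID, latestN := "", 0
	for _, a := range attempts {
		if a.AttemptNumber > latestN {
			latestID, latestN = a.ID, a.AttemptNumber
		}
	}
	return latestID
}

// CanRetry reports whether the selected attempt is the latest one in its
// lineage, which is the only attempt the retry button is offered on.
func CanRetry(attempts []AttemptRef, selectedID string) bool {
	return selectedID != "" && selectedID == LatestAttemptID(attempts)
}
```

With attempts #1, #2 and #3 loaded, only the ID of #3 passes `CanRetry`; switching the dropdown to #1 or #2 hides the button.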
Visual reference
Target look (operator's screenshot):
- Page header: root job display name + global `Latest #N` dropdown.
- Right-side popover on the dropdown shows per-attempt status + timestamp + actor.
- Retry status line ("Re-run triggered now") while a new attempt is queued / running.
Open questions
- Should the list include non-terminal attempts in the "latest" slot? Probably yes: if attempt #3 is `running`, the list row should show it running, not the last terminal attempt. `DISTINCT ON` ordered by `attempt_number DESC` gives this for free.
- Retry-of-specific-attempt: do we want to allow retrying attempt #2 (skipping #3)? Adds complexity for no real use case I can see. Start with "retry always targets latest"; revisit only if requested.
- Cross-type lineage: `buildRerunJob` never changes type, so lineages are always single-type. UI can rely on that.
- Display limit: if someone chains 30 retries, the attempts dropdown shouldn't scroll forever. Cap the dropdown at 10 most-recent + "view all" link that drops into a modal. Defer to UX pass during impl.
Rollout plan
- SQL migration + Job model + repo changes (backfill safe: existing root rows get `root_job_id = id`, `attempt_number = 1`; existing retries inherit the root and hop count from their `rerun_of_job_id` chain).
- `buildRerunJob` and `CreateJobIfUnderCap` / `CreateJob` populate the new fields on new rows.
- New lineage list endpoint behind a feature flag (`FEATURE_JOB_LINEAGE_UI`), deployable before the frontend is ready.
- Frontend swap on the same flag once the endpoint is in prod.
- Remove flag once both sides have been in prod for a week with no regressions.
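One way to gate the endpoint is to skip route registration entirely while the flag is off. The sketch below assumes `FEATURE_JOB_LINEAGE_UI` is read from the environment as a boolean string and that routes are mounted on a standard-library `http.ServeMux`. Both are assumptions; the service may use a different flag source or router. `lineageEnabled` and `registerLineageRoutes` are hypothetical names.

```go
package main

import (
	"net/http"
	"strconv"
)

// lineageEnabled reports whether the lineage feature flag is on.
// getenv is injected (os.Getenv in production) so the check is testable.
// An unset or unparseable value counts as off.
func lineageEnabled(getenv func(string) string) bool {
	on, err := strconv.ParseBool(getenv("FEATURE_JOB_LINEAGE_UI"))
	return err == nil && on
}

// registerLineageRoutes mounts the lineage list handler only when the flag
// is on, and reports whether it did. With the flag off the path is simply
// not registered, so existing clients see no behaviour change.
func registerLineageRoutes(mux *http.ServeMux, getenv func(string) string, h http.Handler) bool {
	if !lineageEnabled(getenv) {
		return false
	}
	mux.Handle("/api/v1/jobs/lineages", h)
	return true
}
```

Treating a missing or malformed value as off keeps the default safe in environments where the flag has not been configured yet.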
Estimate
- Backend: ~1 day (migration + backfill SQL + model/repo + endpoint + tests).
- Frontend: ~1 day (list aggregation, detail-panel dropdown, retry-latest gating).
- Manual QA + flag rollout: ~0.5 day.
Total: ~2.5 days for a clean ship.

