Known Issues

Tracked bugs and design gaps deferred for dedicated follow-up. Entries stay until fixed or explicitly accepted.


TOCTOU on 3-concurrent job cap

Symptom: Two concurrent POSTs to job-start endpoints can both pass the CountRunningJobs < 3 check and both insert, exceeding the 3-concurrent cap.

Scope: Pre-existing in every start endpoint (internal/handler/migrations/handler.go) — not introduced by rerun work.

Cause: CountRunningJobs runs outside the insert transaction, leaving a race window between the count and the insert.

Severity: Low. The cap is a soft guard, not safety-critical, and the race window is tiny.

Fix options:

  • pg advisory lock keyed on tenant/instance, held around the count+insert (see the sketch after this list)
  • COUNT + INSERT in a single transaction at SERIALIZABLE isolation
  • Unique partial index on (status) WHERE status = 'running'; not trivial since the cap is 3, not 1
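
A minimal sketch of the advisory-lock option, assuming a plain database/sql handle and a per-tenant cap; the jobs table shape is hypothetical, not the actual schema behind CountRunningJobs:

```go
package migrations

import (
	"context"
	"database/sql"
	"errors"
)

// startJobWithCap sketches fix option 1; it is not the current handler code.
// Table and column names are hypothetical stand-ins.
func startJobWithCap(ctx context.Context, db *sql.DB, tenantID int64) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op after a successful Commit

	// Transaction-scoped advisory lock keyed on the tenant: concurrent
	// count+insert attempts for the same key serialize here.
	if _, err := tx.ExecContext(ctx, `SELECT pg_advisory_xact_lock($1)`, tenantID); err != nil {
		return err
	}

	var running int
	if err := tx.QueryRowContext(ctx,
		`SELECT count(*) FROM jobs WHERE tenant_id = $1 AND status = 'running'`,
		tenantID).Scan(&running); err != nil {
		return err
	}
	if running >= 3 {
		return errors.New("concurrent job cap (3) reached")
	}

	if _, err := tx.ExecContext(ctx,
		`INSERT INTO jobs (tenant_id, status) VALUES ($1, 'running')`,
		tenantID); err != nil {
		return err
	}
	return tx.Commit()
}
```

pg_advisory_xact_lock is transaction-scoped, so there is no unlock bookkeeping: the lock drops on commit or rollback either way.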

Status: Deferred. Fix in dedicated PR touching all start endpoints.


instance_replace Resume State Dependency

Symptom: Rerunning an instance_replace job that failed long ago (e.g., >9 months) may lose the original run's progress and re-migrate every tenant from scratch.

Scope: loadInheritedInstanceReplaceResults in executor_instance_replace.go.

Cause: The "what succeeded" list for instance_replace is stored in the job_artifacts table. No SQL-level TTL exists today, but the disk-file cleanup worker (internal/worker/cleanup.go) carries a misleading comment claiming artifacts have MongoDB TTLs. If a global retention policy is ever added to sweep job_artifacts, instance_replace resume logic will break, while other job types (import/clone) stay safe because their resume state lives in the core jobs table.

Severity: Low. No data loss; only redundant work, plus possible naming collisions during cloning if not handled.

Fix options:

  • Promote instance_replace_results to a first-class table or a durable storage location (like Azure Fileshare).
  • Explicitly exempt this artifact type from any future retention policy (a defensive guard is sketched after this list).
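
A minimal sketch of the exemption guard against a hypothetical future retention sweep; the worker shape and the created_at/artifact_type columns are assumptions, since no such sweep exists in the repo today:

```go
package worker

import (
	"context"
	"database/sql"
	"time"
)

// sweepArtifacts is a hypothetical retention sweep over job_artifacts.
// The explicit exclusion keeps instance_replace resume state intact even
// if a blanket retention policy lands later.
func sweepArtifacts(ctx context.Context, db *sql.DB, cutoff time.Time) error {
	_, err := db.ExecContext(ctx,
		`DELETE FROM job_artifacts
		  WHERE created_at < $1
		    AND artifact_type <> 'instance_replace_results'`,
		cutoff)
	return err
}
```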

Status: Accepted as design debt. Monitor Postgres storage growth for job_artifacts.


Migration Advisory Lock Breaks Under Transaction-Pooled PgBouncer

Symptom: On deploy, multiple API/worker replicas raced on goose.Up; the losers failed with a goose_db_version unique-constraint violation (SQLSTATE 23505) and crash-looped until a retry succeeded.

Scope: internal/db/migrate.go — preflight runs migrations from every process at startup.

Cause: The goose v3 legacy API acquires no DB lock, so N replicas each read the pending version, each run the migration body, and each try to INSERT the same version_id row.

Fix applied: Switched to goose.NewProvider(..., WithSessionLocker(lock.NewPostgresSessionLocker())). The session locker uses pg_advisory_lock, a core Postgres feature that needs no extension. Losing replicas block on the lock, observe the already-applied version once they acquire it, and exit cleanly; the wiring is sketched below.
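
A condensed sketch of that wiring, assuming embedded SQL migrations; the embed path and helper name are illustrative, not the exact contents of internal/db/migrate.go:

```go
package db

import (
	"context"
	"database/sql"
	"embed"

	"github.com/pressly/goose/v3"
	"github.com/pressly/goose/v3/database"
	"github.com/pressly/goose/v3/lock"
)

//go:embed migrations/*.sql
var migrationsFS embed.FS

func migrateUp(ctx context.Context, db *sql.DB) error {
	// Session locker built on pg_advisory_lock: exactly one replica runs
	// the pending migrations; the rest block, then see the applied version.
	locker, err := lock.NewPostgresSessionLocker()
	if err != nil {
		return err
	}
	provider, err := goose.NewProvider(
		database.DialectPostgres, db, migrationsFS,
		goose.WithSessionLocker(locker),
	)
	if err != nil {
		return err
	}
	_, err = provider.Up(ctx)
	return err
}
```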

Remaining gotcha: The session locker relies on the same Postgres connection surviving across statements. This breaks if DATABASE_URL points at a transaction-pooled PgBouncer (e.g., Azure Flexible Server's built-in pooler on port 6432, or pgcat in transaction mode). The pool rotates connections between statements; the advisory lock is released behind goose's back, and the serialization guarantee silently disappears.

How to check:

  • Confirm DATABASE_URL uses the direct Postgres port (5432 on Azure), not the pooler port (6432); a startup guard for this is sketched after this list.
  • If the app runtime ever moves to a transaction-pool DSN, keep the migration DSN on 5432 or switch to lock.NewPostgresTableLocker(), which doesn't depend on session persistence.
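
A hypothetical startup guard (not code that exists in the repo) that fails fast when the migration DSN points at the conventional pooler port:

```go
package db

import (
	"fmt"
	"net/url"
)

// checkMigrationDSN rejects a DSN targeting port 6432, the conventional
// PgBouncer/pooler port, since advisory session locks need one Postgres
// connection to survive across statements.
func checkMigrationDSN(dsn string) error {
	u, err := url.Parse(dsn)
	if err != nil {
		return fmt.Errorf("parse DATABASE_URL: %w", err)
	}
	if u.Port() == "6432" {
		return fmt.Errorf("migration DSN for host %s targets port 6432 (pooler); use the direct port 5432", u.Hostname())
	}
	return nil
}
```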

Severity: Low today (we're on 5432). Latent footgun if infra changes.

Status: Monitor connection routing. Revisit if a pooler is introduced between the app and Postgres.