
Tenant Import — Idempotent Rewrite Plan

Problem

The current tenant-import flow is not idempotent. Every non-user document receives a fresh bson.NewObjectID() on insert (internal/migrate/import.go:306), but cross-document _id references (e.g. a parentId in a sub-table pointing at a main-table record's _id) are not rewired. Re-runs also insert duplicate documents because the insert path uses a plain InsertMany rather than upserts.

Consequences:

  • Parent→child relationships break silently after import (the child's parentId still holds the source _id, while the parent now has a fresh one).
  • Re-running the tool against a partially-imported tenant creates duplicates instead of resuming.
  • Only the user collection has a de-duplication path (--reuse-existing-users).

Goals

  1. Idempotent imports: re-running the tool against the same target must converge, not duplicate.
  2. Preserve cross-document _id references after import.
  3. Bounded memory footprint suitable for k8s pods (≤ 512 MB).
  4. Resumable / retry-safe on partial failure.

Algorithm

Classification (three cases per source _id)

For each source document, look up {_id: src._id} in the target collection:

| Case | Target state | Action |
|------|--------------|--------|
| 1 | Exists, tenantId == targetTenantID | Keep _id. No remap. Upsert by {_id, tenantId}. |
| 2 | Does not exist | Keep _id. No remap. Upsert by {_id, tenantId}. |
| 3 | Exists, tenantId != targetTenantID (cross-tenant clash) | Allocate new _id. Record srcHex → newOID in idMap. Upsert. |

Only case 3 produces an idMap entry. Cases 1 and 2 preserve the source _id so references remain intact without any rewriting.
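
The three-way decision can be sketched as a pure function over the target lookup result. The names below (Case, classify, the case constants) are illustrative, not the real codebase types:

```go
package main

import "fmt"

// Case labels for the three-way classification (illustrative names).
type Case int

const (
	CaseKeepExisting Case = 1 // _id exists in target under the same tenant
	CaseKeepAbsent   Case = 2 // _id absent from target
	CaseRemap        Case = 3 // _id exists under a different tenant: allocate new _id
)

// classify decides the case for one source _id. found and foundTenantID
// come from the projected lookup on {_id: {$in: chunk}}.
func classify(found bool, foundTenantID, targetTenantID string) Case {
	switch {
	case !found:
		return CaseKeepAbsent
	case foundTenantID == targetTenantID:
		return CaseKeepExisting
	default:
		return CaseRemap
	}
}

func main() {
	fmt.Println(classify(true, "t1", "t1")) // same tenant: keep _id (case 1)
	fmt.Println(classify(false, "", "t1"))  // absent: keep _id (case 2)
	fmt.Println(classify(true, "t2", "t1")) // cross-tenant clash: remap (case 3)
}
```

Only the last branch allocates, which is what keeps idMap small in the common case.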

Two-pass flow

Pass 1 must complete for every collection before Pass 2 touches the database, because a document in collection A may reference a document in collection B via _id; the deep-walk in Pass 2 needs the full global idMap.

Pass 1 — Classify (no writes):

  1. Walk zip entries. Skip empty collections entirely (don't load the zip entry).
  2. For each non-empty source collection (that isn't in the skip list below):
    • Stream source _ids in chunks of 10 000.
    • For each chunk: target.find({_id: {$in: chunk}}, {_id:1, tenantId:1}).
    • Classify each source _id into case 1 / 2 / 3.
    • Case 3 entries: append to global idMap and per-collection remapSet.
  3. When the user collection is processed via the existing ReuseExistingUsers flow, its OID remappings are written into the same global idMap so non-user documents that reference users are rewired correctly by Pass 2.
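
The 10 000-wide $in batches reduce to a simple chunking helper; the name chunkIDs is illustrative:

```go
package main

import "fmt"

// chunkIDs splits a slice of hex _ids into fixed-size chunks so each
// target lookup stays a bounded {_id: {$in: chunk}} query.
func chunkIDs(ids []string, size int) [][]string {
	var chunks [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		chunks = append(chunks, ids[:n])
		ids = ids[n:]
	}
	return chunks
}

func main() {
	ids := make([]string, 25000)
	fmt.Println(len(chunkIDs(ids, 10000))) // 3 chunks: 10k, 10k, 5k
}
```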

Pass 2 — Apply and upsert:

  1. If len(idMap) == 0, skip the deep-walk entirely (fast path).
  2. Otherwise, for each collection:
    • Re-stream source documents one at a time.
    • If doc._id ∈ remapSet, swap in the new OID from idMap.
    • Deep-walk every field at every depth, replacing any bson.ObjectID value matching an idMap key.
    • Apply tenant-ref rewrite (rewriteTenantRefs, unchanged).
    • Append to batch. On reaching batchSize, issue an unordered bulkWrite of upserts filtered by {_id, tenantId}.
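
The batch-and-flush step can be sketched as below; batcher is an illustrative type, and flush stands in for the unordered bulkWrite of upserts:

```go
package main

import "fmt"

// batcher accumulates rewritten documents and flushes every batchSize docs.
// flush stands in for the unordered bulkWrite of {_id, tenantId} upserts.
type batcher struct {
	batchSize int
	buf       []map[string]any
	flush     func(batch []map[string]any)
}

func (b *batcher) add(doc map[string]any) {
	b.buf = append(b.buf, doc)
	if len(b.buf) >= b.batchSize {
		b.flushNow()
	}
}

// flushNow writes any buffered documents; called once more at end-of-collection
// so the final partial batch is not dropped.
func (b *batcher) flushNow() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	writes := 0
	b := &batcher{batchSize: 2, flush: func(batch []map[string]any) { writes++ }}
	for i := 0; i < 5; i++ {
		b.add(map[string]any{"i": i})
	}
	b.flushNow()
	fmt.Println(writes) // prints 3: two full batches + one partial
}
```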

Scope

Collections skipped by the new algorithm

These collections are not classified, have no _id map entry created, and are not deep-walked in Pass 2:

  • user — owned by the existing ReuseExistingUsers path (email-based dedupe). Its OID remappings still flow into the global idMap for use by other collections.
  • user-session — ephemeral, no meaningful cross-document references.
  • customer — dedupe driven by business keys, not _id.

Deep-walk scope

Two independent walks with different scopes — do not conflate:

| Walk | Purpose | Scope |
|------|---------|-------|
| User remap | Existing email/OID anonymisation | Known field list (remap.go:52-89) + customFields subtree |
| idMap rewrite (new) | Cross-document _id integrity | All fields, every depth — any bson.ObjectID value anywhere in the document |
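
The all-fields walk is a small recursion over the decoded document; here is a dependency-free sketch where a local ObjectID type stands in for bson.ObjectID:

```go
package main

import "fmt"

type ObjectID [12]byte // stand-in for bson.ObjectID

func (o ObjectID) Hex() string { return fmt.Sprintf("%x", o[:]) }

// rewriteIDs walks every field at every depth and swaps any ObjectID
// whose source hex appears in idMap. Maps and slices are mutated in place.
func rewriteIDs(v any, idMap map[string]ObjectID) any {
	switch t := v.(type) {
	case ObjectID:
		if newID, ok := idMap[t.Hex()]; ok {
			return newID
		}
		return t
	case map[string]any:
		for k, val := range t {
			t[k] = rewriteIDs(val, idMap)
		}
		return t
	case []any:
		for i, val := range t {
			t[i] = rewriteIDs(val, idMap)
		}
		return t
	default:
		return v // strings, numbers, etc. pass through untouched
	}
}

func main() {
	src, dst := ObjectID{1}, ObjectID{2}
	doc := map[string]any{
		"parentId": src,
		"nested":   []any{map[string]any{"ref": src}},
	}
	rewriteIDs(doc, map[string]ObjectID{src.Hex(): dst})
	fmt.Println(doc["parentId"] == dst) // true: top-level and nested refs rewired
}
```

Because scratch space is just the recursion stack, the walk is bounded per document, matching the memory profile below.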

idMap storage

  • In-memory map[string]bson.ObjectID by default.
  • Threshold: 500 000 entries (~40 MB in memory). Above this, spill to disk.
  • Disk backend: bbolt (go.etcd.io/bbolt) — pure Go, no CGO, single file.
    • File: <os.TempDir()>/tenant-import-<jobID>.bolt
    • Single bucket idmap. Key = source hex string (24 bytes). Value = 12-byte ObjectID.
    • In-memory LRU cache (100k hot entries) sits in front of bolt to keep Pass 2 deep-walk lookups fast.
  • Cleanup: defer os.Remove(path) at the end of RunImport. Also invoked via panic-recovery wrapper so a crash never leaks the file.

IDMap interface

```go
type IDMap interface {
    Get(srcHex string) (bson.ObjectID, bool)
    Put(srcHex string, newOID bson.ObjectID)
    Len() int
    Close() error // releases bolt resources and deletes file if backed by disk
}
```

Implementations:

  • memIDMap — plain map[string]bson.ObjectID.
  • boltIDMap — bbolt-backed with LRU cache. Constructed when memIDMap.Len() crosses the threshold by draining the map into bolt.

Deep-walk code consumes the interface and is agnostic to storage.
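
The in-memory backend is close to trivial; a sketch, again with a local ObjectID stand-in so it stays dependency-free (the real one returns bson.ObjectID):

```go
package main

import "fmt"

type ObjectID [12]byte // stand-in for bson.ObjectID

// memIDMap is the in-memory IDMap backend: a plain map plus a no-op Close.
type memIDMap struct {
	m map[string]ObjectID
}

func newMemIDMap() *memIDMap { return &memIDMap{m: make(map[string]ObjectID)} }

func (m *memIDMap) Get(srcHex string) (ObjectID, bool) {
	oid, ok := m.m[srcHex]
	return oid, ok
}

func (m *memIDMap) Put(srcHex string, newOID ObjectID) { m.m[srcHex] = newOID }
func (m *memIDMap) Len() int                           { return len(m.m) }
func (m *memIDMap) Close() error                       { return nil } // nothing to release

func main() {
	im := newMemIDMap()
	im.Put("aabb", ObjectID{9})
	oid, ok := im.Get("aabb")
	fmt.Println(ok, oid == (ObjectID{9}), im.Len()) // true true 1
}
```

The spill decision lives outside the backend: the caller watches Len() and drains into boltIDMap when the 500 k threshold is crossed, so neither implementation knows about the other.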

Memory profile

Target: comfortably under 512 MB in a k8s pod for 2 M documents × 300 collections.

| Stage | In-mem working set |
|-------|--------------------|
| Pass 1 chunk | 10 000 OIDs × 12 B ≈ 240 KB + target reply |
| Pass 1 idMap (in-mem) | Case-3 entries × 80 B; capped at 500 k = 40 MB before spill |
| Pass 1 idMap (spilled) | LRU cache 100 k × 80 B = 8 MB |
| Pass 2 batch | batchSize × avg doc size (~5–10 MB at batchSize = 1000) |
| Pass 2 deep-walk scratch | Bounded per document; no accumulation |

Notes:

  • remapSet per collection is typically empty (case 3 is rare). If large, also eligible for bolt spill (deferred; implement only if observed).
  • Source document streaming replaces the current full-collection-into-slice read at import.go:194.

Implementation steps

  1. Streaming doc reader. Replace readJSONLDocs / readBSONDocs with iterator variants (forEachJSONLDoc, forEachBSONDoc) that invoke a callback per document. Keep the slice-returning versions for callers that still need them (tests, user-coll pre-scan).
  2. IDMap interface + backends.
    • memIDMap (trivial).
    • boltIDMap with LRU. Add go.etcd.io/bbolt to go.mod.
    • Migration helper: drainMemToBolt(mem, bolt).
  3. Pass 1 classifier.
    • New file internal/migrate/classify.go.
    • Entry point: func classifyCollections(ctx, zr, targetDB, targetTenantID, skipList) (IDMap, map[string]*RemapSet, error).
    • Per collection: stream _ids, chunk $in queries, populate IDMap and RemapSet.
  4. Integrate user-coll flow.
    • Feed existing ReuseExistingUsers OID remappings into the same IDMap so Pass 2 rewires user references in non-user docs.
    • Disable the post-insert user remap sweep for case 3 when running under the new algorithm (its work is done during Pass 2).
  5. Pass 2 rewriter.
    • Replace the current "assign new _id and insert" block (import.go:300-334).
    • Per doc: remapSet check → _id swap → full-field deep-walk → tenant-ref rewrite → append to batch.
    • Batch write: bulkWrite with ordered: false, filter {_id, tenantId}, update $set: <doc>, upsert: true.
  6. Cleanup.
    • defer idMap.Close() removes bolt file.
    • Panic-recover wrapper in RunImport to guarantee deletion on crash.
  7. Tests.
    • Unit: classifier (three cases), IDMap backends (including mem→bolt spill), deep-walk (nested arrays, customFields, OIDs at depth).
    • Integration: re-run import against same target → second run produces zero new documents, zero reference drift.
    • Cross-tenant clash: pre-seed target with a clashing _id under a different tenant → verify case 3 path rewires correctly.

Progress reporting

The ProgressFn contract extends to cover the two passes:

  • 0–30 %: Pass 1 (classification + user pre-scan).
  • 30–95 %: Pass 2 (streaming import + upserts).
  • 95–100 %: user remap sweep (if any) + cleanup.

Weights are heuristic; adjust after observing real workloads.
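
Mapping a per-stage completion fraction into these bands is a one-liner worth pinning down; overallPercent and the band table are illustrative:

```go
package main

import "fmt"

// overallPercent maps a per-stage fraction (0..1) into the heuristic bands:
// stage 0 = Pass 1 (0–30), stage 1 = Pass 2 (30–95), stage 2 = sweep (95–100).
func overallPercent(stage int, frac float64) float64 {
	bands := [3][2]float64{{0, 30}, {30, 95}, {95, 100}}
	lo, hi := bands[stage][0], bands[stage][1]
	return lo + frac*(hi-lo)
}

func main() {
	fmt.Println(overallPercent(0, 0.5)) // halfway through Pass 1 → 15
	fmt.Println(overallPercent(1, 0.0)) // Pass 2 start → 30
	fmt.Println(overallPercent(2, 1.0)) // cleanup done → 100
}
```

Retuning the weights later then only means editing the band table, not every call site.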

Risks and open questions

  • Target scan cost at scale. Pre-scan issues one $in query per chunk per collection. For 300 collections × 200 chunks = 60 000 queries. Batched and projection-limited, this should be seconds, but needs validation on prod-sized tenants. If hot, consider parallelising per-collection scans with a small worker pool.
  • Bolt sync mode. Use db.NoSync = true. The bolt file is ephemeral (deleted at end of job, never reopened across process restarts). Reads within the same process always see prior writes regardless of NoSync because bbolt serves them from its in-memory B+tree, so Pass 2 lookups are consistent with Pass 1 writes. The only durability NoSync sacrifices is "survive a process crash" — and in that scenario we throw the file away and re-run the job anyway.
  • Business-key dedupe on other collections. Collections beyond user / customer may have unique indexes on non-_id fields (e.g. code, slug). Upsert on _id won't catch these. Not in scope for this change, but flag in the PR description.
  • Upsert with $set: <entire doc> replaces the existing target document wholesale on re-run. Confirm this matches the desired idempotent semantics (rather than merge or skip-if-exists).
  • Indexes. applyIndexesFromZip is already idempotent (MongoDB no-ops on identical index specs). No change needed.

Out of scope (future)

  • Parallel per-collection workers for both passes.
  • Incremental / delta imports (only changed documents).
  • Dedupe on unique business keys across non-user collections.
  • Exposing idMap size / spill state through the worker progress UI.