
Tenant Import — Idempotent Rewrite Plan

Problem

The current tenant-import flow is not idempotent. Every non-user document receives a fresh bson.NewObjectID() on insert (internal/migrate/import.go:306), but cross-document _id references (e.g. a parentId in a sub-table pointing at a main-table record's _id) are not rewired. Re-runs also insert duplicate documents because the insert path uses a plain InsertMany rather than upserts.

Consequences:

  • Parent→child relationships break silently after import (the child's parentId still holds the source _id, while the parent now has a fresh one).
  • Re-running the tool against a partially-imported tenant creates duplicates instead of resuming.
  • Only the user collection has a de-duplication path (--reuse-existing-users).

Goals

  1. Idempotent imports: re-running the tool against the same target must converge, not duplicate.
  2. Preserve cross-document _id references after import.
  3. Bounded memory footprint suitable for k8s pods (≤ 512 MB).
  4. Resumable / retry-safe on partial failure.

Algorithm

Classification (three cases per source _id)

For each source document, look up {_id: src._id} in the target collection:

| Case | Target state | Action |
|------|--------------|--------|
| 1 | Exists, tenantId == targetTenantID | Keep _id. No remap. Upsert by {_id, tenantId}. |
| 2 | Does not exist | Keep _id. No remap. Upsert by {_id, tenantId}. |
| 3 | Exists, tenantId != targetTenantID (cross-tenant clash) | Allocate new _id. Record srcHex → newOID in idMap. Upsert. |

Only case 3 produces an idMap entry. Cases 1 and 2 preserve the source _id so references remain intact without any rewriting.
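
The three-way decision can be sketched as a pure function over the target lookup result. The names below (Case, classify, the case constants) are illustrative, not the real codebase types:

```go
package main

import "fmt"

// Case labels for the three-way classification (illustrative names).
type Case int

const (
	CaseKeepExisting Case = 1 // _id exists in target under the same tenant
	CaseKeepAbsent   Case = 2 // _id absent from target
	CaseRemap        Case = 3 // _id exists under a different tenant: allocate new _id
)

// classify decides the case for one source _id. found and foundTenantID
// come from the projected lookup on {_id: {$in: chunk}}.
func classify(found bool, foundTenantID, targetTenantID string) Case {
	switch {
	case !found:
		return CaseKeepAbsent
	case foundTenantID == targetTenantID:
		return CaseKeepExisting
	default:
		return CaseRemap
	}
}

func main() {
	fmt.Println(classify(true, "t1", "t1")) // same tenant: keep _id (case 1)
	fmt.Println(classify(false, "", "t1"))  // absent: keep _id (case 2)
	fmt.Println(classify(true, "t2", "t1")) // cross-tenant clash: remap (case 3)
}
```

Only the last branch allocates, which is what keeps idMap small in the common case.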

Two-pass flow

Pass 1 must complete for every collection before Pass 2 touches the database, because a document in collection A may reference a document in collection B via _id; the deep-walk in Pass 2 needs the full global idMap.

Pass 1 — Classify (no writes):

  1. Walk zip entries. Skip empty collections entirely (don't load the zip entry).
  2. For each non-empty source collection (that isn't in the skip list below):
    • Stream source _ids in chunks of 10 000.
    • For each chunk: target.find({_id: {$in: chunk}}, {_id:1, tenantId:1}).
    • Classify each source _id into case 1 / 2 / 3.
    • Case 3 entries: append to global idMap and per-collection remapSet.
  3. When the user collection is processed via the existing ReuseExistingUsers flow, its OID remappings are written into the same global idMap so non-user documents that reference users are rewired correctly by Pass 2.
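
The 10 000-wide $in batches reduce to a simple chunking helper; the name chunkIDs is illustrative:

```go
package main

import "fmt"

// chunkIDs splits a slice of hex _ids into fixed-size chunks so each
// target lookup stays a bounded {_id: {$in: chunk}} query.
func chunkIDs(ids []string, size int) [][]string {
	var chunks [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		chunks = append(chunks, ids[:n])
		ids = ids[n:]
	}
	return chunks
}

func main() {
	ids := make([]string, 25000)
	fmt.Println(len(chunkIDs(ids, 10000))) // 3 chunks: 10k, 10k, 5k
}
```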

Pass 2 — Apply and upsert:

  1. If len(idMap) == 0, skip the deep-walk entirely (fast path).
  2. Otherwise, for each collection:
    • Re-stream source documents one at a time.
    • If doc._id ∈ remapSet, swap in the new OID from idMap.
    • Deep-walk every field at every depth, replacing any bson.ObjectID value matching an idMap key.
    • Apply tenant-ref rewrite (rewriteTenantRefs, unchanged).
    • Append to batch. On reaching batchSize, issue an unordered bulkWrite of upserts filtered by {_id, tenantId}.
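
The batch-and-flush step can be sketched as below; batcher is an illustrative type, and flush stands in for the unordered bulkWrite of upserts:

```go
package main

import "fmt"

// batcher accumulates rewritten documents and flushes every batchSize docs.
// flush stands in for the unordered bulkWrite of {_id, tenantId} upserts.
type batcher struct {
	batchSize int
	buf       []map[string]any
	flush     func(batch []map[string]any)
}

func (b *batcher) add(doc map[string]any) {
	b.buf = append(b.buf, doc)
	if len(b.buf) >= b.batchSize {
		b.flushNow()
	}
}

// flushNow writes any buffered documents; called once more at end-of-collection
// so the final partial batch is not dropped.
func (b *batcher) flushNow() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil
}

func main() {
	writes := 0
	b := &batcher{batchSize: 2, flush: func(batch []map[string]any) { writes++ }}
	for i := 0; i < 5; i++ {
		b.add(map[string]any{"i": i})
	}
	b.flushNow()
	fmt.Println(writes) // prints 3: two full batches + one partial
}
```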

Scope

Collections skipped by the new algorithm

These collections are not classified, have no _id map entry created, and are not deep-walked in Pass 2:

  • user — owned by the existing ReuseExistingUsers path (email-based dedupe). Its OID remappings still flow into the global idMap for use by other collections.
  • user-session — ephemeral, no meaningful cross-document references.
  • customer — dedupe driven by business keys, not _id.

Deep-walk scope

Two independent walks with different scopes — do not conflate:

| Walk | Purpose | Scope |
|------|---------|-------|
| User remap | Existing email/OID anonymisation | Known field list (remap.go:52-89) + customFields subtree |
| idMap rewrite (new) | Cross-document _id integrity | All fields, every depth — any bson.ObjectID value anywhere in the document |
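
The all-fields walk is a small recursion over the decoded document; here is a dependency-free sketch where a local ObjectID type stands in for bson.ObjectID:

```go
package main

import "fmt"

type ObjectID [12]byte // stand-in for bson.ObjectID

func (o ObjectID) Hex() string { return fmt.Sprintf("%x", o[:]) }

// rewriteIDs walks every field at every depth and swaps any ObjectID
// whose source hex appears in idMap. Maps and slices are mutated in place.
func rewriteIDs(v any, idMap map[string]ObjectID) any {
	switch t := v.(type) {
	case ObjectID:
		if newID, ok := idMap[t.Hex()]; ok {
			return newID
		}
		return t
	case map[string]any:
		for k, val := range t {
			t[k] = rewriteIDs(val, idMap)
		}
		return t
	case []any:
		for i, val := range t {
			t[i] = rewriteIDs(val, idMap)
		}
		return t
	default:
		return v // strings, numbers, etc. pass through untouched
	}
}

func main() {
	src, dst := ObjectID{1}, ObjectID{2}
	doc := map[string]any{
		"parentId": src,
		"nested":   []any{map[string]any{"ref": src}},
	}
	rewriteIDs(doc, map[string]ObjectID{src.Hex(): dst})
	fmt.Println(doc["parentId"] == dst) // true: top-level and nested refs rewired
}
```

Because scratch space is just the recursion stack, the walk is bounded per document, matching the memory profile below.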

idMap storage

  • In-memory map[string]bson.ObjectID by default.
  • Threshold: 500 000 entries (~40 MB in memory). Above this, spill to disk.
  • Disk backend: bbolt (go.etcd.io/bbolt) — pure Go, no CGO, single file.
    • File: <os.TempDir()>/tenant-import-<jobID>.bolt
    • Single bucket idmap. Key = source hex string (24 bytes). Value = 12-byte ObjectID.
    • In-memory LRU cache (100k hot entries) sits in front of bolt to keep Pass 2 deep-walk lookups fast.
  • Cleanup: defer os.Remove(path) at the end of RunImport. Also invoked via panic-recovery wrapper so a crash never leaks the file.

IDMap interface

```go
type IDMap interface {
    Get(srcHex string) (bson.ObjectID, bool)
    Put(srcHex string, newOID bson.ObjectID)
    Len() int
    Close() error // releases bolt resources and deletes file if backed by disk
}
```

Implementations:

  • memIDMap — plain map[string]bson.ObjectID.
  • boltIDMap — bbolt-backed with LRU cache. Constructed when memIDMap.Len() crosses the threshold by draining the map into bolt.

Deep-walk code consumes the interface and is agnostic to storage.
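
The in-memory backend is close to trivial; a sketch, again with a local ObjectID stand-in so it stays dependency-free (the real one returns bson.ObjectID):

```go
package main

import "fmt"

type ObjectID [12]byte // stand-in for bson.ObjectID

// memIDMap is the in-memory IDMap backend: a plain map plus a no-op Close.
type memIDMap struct {
	m map[string]ObjectID
}

func newMemIDMap() *memIDMap { return &memIDMap{m: make(map[string]ObjectID)} }

func (m *memIDMap) Get(srcHex string) (ObjectID, bool) {
	oid, ok := m.m[srcHex]
	return oid, ok
}

func (m *memIDMap) Put(srcHex string, newOID ObjectID) { m.m[srcHex] = newOID }
func (m *memIDMap) Len() int                           { return len(m.m) }
func (m *memIDMap) Close() error                       { return nil } // nothing to release

func main() {
	im := newMemIDMap()
	im.Put("aabb", ObjectID{9})
	oid, ok := im.Get("aabb")
	fmt.Println(ok, oid == (ObjectID{9}), im.Len()) // true true 1
}
```

The spill decision lives outside the backend: the caller watches Len() and drains into boltIDMap when the 500 k threshold is crossed, so neither implementation knows about the other.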

Memory profile

Target: comfortably under 512 MB in a k8s pod for 2 M documents × 300 collections.

| Stage | In-mem working set |
|-------|--------------------|
| Pass 1 chunk | 10 000 OIDs × 12 B ≈ 240 KB + target reply |
| Pass 1 idMap (in-mem) | Case-3 entries × 80 B; capped at 500 k = 40 MB before spill |
| Pass 1 idMap (spilled) | LRU cache 100 k × 80 B = 8 MB |
| Pass 2 batch | batchSize × avg doc size (~5–10 MB at batchSize = 1000) |
| Pass 2 deep-walk scratch | Bounded per document; no accumulation |

Notes:

  • remapSet per collection is typically empty (case 3 is rare). If large, also eligible for bolt spill (deferred; implement only if observed).
  • Source document streaming replaces the current full-collection-into-slice read at import.go:194.

Implementation steps

  1. Streaming doc reader. Replace readJSONLDocs / readBSONDocs with iterator variants (forEachJSONLDoc, forEachBSONDoc) that invoke a callback per document. Keep the slice-returning versions for callers that still need them (tests, user-coll pre-scan).
  2. IDMap interface + backends.
    • memIDMap (trivial).
    • boltIDMap with LRU. Add go.etcd.io/bbolt to go.mod.
    • Migration helper: drainMemToBolt(mem, bolt).
  3. Pass 1 classifier.
    • New file internal/migrate/classify.go.
    • Entry point: func classifyCollections(ctx, zr, targetDB, targetTenantID, skipList) (IDMap, map[string]*RemapSet, error).
    • Per collection: stream _ids, chunk $in queries, populate IDMap and RemapSet.
  4. Integrate user-coll flow.
    • Feed existing ReuseExistingUsers OID remappings into the same IDMap so Pass 2 rewires user references in non-user docs.
    • Disable the post-insert user remap sweep for case 3 when running under the new algorithm (its work is done during Pass 2).
  5. Pass 2 rewriter.
    • Replace the current "assign new _id and insert" block (import.go:300-334).
    • Per doc: remapSet check → _id swap → full-field deep-walk → tenant-ref rewrite → append to batch.
    • Batch write: bulkWrite with ordered: false, filter {_id, tenantId}, update $set: <doc>, upsert: true.
  6. Cleanup.
    • defer idMap.Close() removes bolt file.
    • Panic-recover wrapper in RunImport to guarantee deletion on crash.
  7. Tests.
    • Unit: classifier (three cases), IDMap backends (including mem→bolt spill), deep-walk (nested arrays, customFields, OIDs at depth).
    • Integration: re-run import against same target → second run produces zero new documents, zero reference drift.
    • Cross-tenant clash: pre-seed target with a clashing _id under a different tenant → verify case 3 path rewires correctly.

Progress reporting

The ProgressFn contract extends to cover the two passes:

  • 0–30 %: Pass 1 (classification + user pre-scan).
  • 30–95 %: Pass 2 (streaming import + upserts).
  • 95–100 %: user remap sweep (if any) + cleanup.

Weights are heuristic; adjust after observing real workloads.
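
Mapping a per-stage completion fraction into these bands is a one-liner worth pinning down; overallPercent and the band table are illustrative:

```go
package main

import "fmt"

// overallPercent maps a per-stage fraction (0..1) into the heuristic bands:
// stage 0 = Pass 1 (0–30), stage 1 = Pass 2 (30–95), stage 2 = sweep (95–100).
func overallPercent(stage int, frac float64) float64 {
	bands := [3][2]float64{{0, 30}, {30, 95}, {95, 100}}
	lo, hi := bands[stage][0], bands[stage][1]
	return lo + frac*(hi-lo)
}

func main() {
	fmt.Println(overallPercent(0, 0.5)) // halfway through Pass 1 → 15
	fmt.Println(overallPercent(1, 0.0)) // Pass 2 start → 30
	fmt.Println(overallPercent(2, 1.0)) // cleanup done → 100
}
```

Retuning the weights later then only means editing the band table, not every call site.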

Risks and open questions

  • Target scan cost at scale. Pre-scan issues one $in query per chunk per collection. For 300 collections × 200 chunks = 60 000 queries. Batched and projection-limited, this should be seconds, but needs validation on prod-sized tenants. If hot, consider parallelising per-collection scans with a small worker pool.
  • Bolt sync mode. Use db.NoSync = true. The bolt file is ephemeral (deleted at end of job, never reopened across process restarts). Reads within the same process always see prior writes regardless of NoSync because bbolt serves them from its in-memory B+tree, so Pass 2 lookups are consistent with Pass 1 writes. The only durability NoSync sacrifices is "survive a process crash" — and in that scenario we throw the file away and re-run the job anyway.
  • Business-key dedupe on other collections. Collections beyond user / customer may have unique indexes on non-_id fields (e.g. code, slug). Upsert on _id won't catch these. Not in scope for this change, but flag in the PR description.
  • Upsert with $set: <entire doc> replaces the existing target document wholesale on re-run. Confirm this matches the desired idempotent semantics (rather than merge or skip-if-exists).
  • Indexes. applyIndexesFromZip is already idempotent (MongoDB no-ops on identical index specs). No change needed.

Out of scope (future)

  • Parallel per-collection workers for both passes.
  • Incremental / delta imports (only changed documents).
  • Dedupe on unique business keys across non-user collections.
  • Exposing idMap size / spill state through the worker progress UI.