Tenant Import — Idempotent Rewrite Plan
Problem
Current tenant-import is not idempotent. Every non-user document receives a fresh `bson.NewObjectID()` on insert (`internal/migrate/import.go:306`), but cross-document `_id` references (e.g. `parentId` in a sub-table pointing at a main-table record's `_id`) are not rewired. Re-runs also insert duplicate documents because the insert path uses a plain `InsertMany`, not upserts.
Consequences:
- Parent→child relationships broken silently after import (the child's `parentId` still holds the source `_id`, but the parent now has a fresh one).
- Re-running the tool against a partially-imported tenant creates duplicates instead of resuming.
- Only the `user` collection has a de-duplication path (`--reuse-existing-users`).
Goals
- Idempotent imports: re-running the tool against the same target must converge, not duplicate.
- Preserve cross-document `_id` references after import.
- Bounded memory footprint suitable for k8s pods (≤ 512 MB).
- Resumable / retry-safe on partial failure.
Algorithm
Classification (three cases per source _id)
For each source document, look up `{_id: src._id}` in the target collection:
| Case | Target state | Action |
|---|---|---|
| 1 | Exists, `tenantId == targetTenantID` | Keep `_id`. No remap. Upsert by `{_id, tenantId}`. |
| 2 | Does not exist | Keep `_id`. No remap. Upsert by `{_id, tenantId}`. |
| 3 | Exists, `tenantId != targetTenantID` (cross-tenant clash) | Allocate new `_id`. Record `srcHex → newOID` in `idMap`. Upsert. |
Only case 3 produces an `idMap` entry. Cases 1 and 2 preserve the source `_id`, so references remain intact without any rewriting.
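The three-way decision can be sketched in Go. This is a minimal illustration, assuming a lookup map built from the chunked target query; the names `classifyID` and `targetOwner` are illustrative, not from the codebase:

```go
package main

import "fmt"

// importCase mirrors the three rows of the classification table.
type importCase int

const (
	caseKeepExisting importCase = iota + 1 // case 1: target tenant already owns the _id
	caseKeepNew                            // case 2: _id absent from target
	caseRemap                              // case 3: cross-tenant clash, allocate new _id
)

// classifyID decides the action for one source _id. targetOwner maps
// _id (hex) -> tenantId for the docs returned by the chunked
// target.find({_id: {$in: chunk}}, {_id: 1, tenantId: 1}) query.
func classifyID(srcHex, targetTenantID string, targetOwner map[string]string) importCase {
	owner, exists := targetOwner[srcHex]
	switch {
	case !exists:
		return caseKeepNew
	case owner == targetTenantID:
		return caseKeepExisting
	default:
		return caseRemap
	}
}

func main() {
	owner := map[string]string{
		"aaa": "tenant-1", // owned by the target tenant
		"bbb": "tenant-2", // owned by another tenant: clash
	}
	fmt.Println(classifyID("aaa", "tenant-1", owner)) // 1
	fmt.Println(classifyID("ccc", "tenant-1", owner)) // 2
	fmt.Println(classifyID("bbb", "tenant-1", owner)) // 3
}
```

Only the third call would add an entry to `idMap`.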
Two-pass flow
Pass 1 must complete for every collection before Pass 2 touches the database, because a document in collection A may reference a document in collection B via `_id`; the deep-walk in Pass 2 needs the full global `idMap`.
Pass 1 — Classify (no writes):
- Walk zip entries. Skip empty collections entirely (don't load the zip entry).
- For each non-empty source collection (that isn't in the skip list below):
  - Stream source `_id`s in chunks of 10 000.
  - For each chunk: `target.find({_id: {$in: chunk}}, {_id: 1, tenantId: 1})`.
  - Classify each source `_id` into case 1 / 2 / 3.
  - Case 3 entries: append to the global `idMap` and the per-collection `remapSet`.
- When the `user` collection is processed via the existing `ReuseExistingUsers` flow, its OID remappings are written into the same global `idMap` so non-user documents that reference users are rewired correctly by Pass 2.
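The chunked streaming above is mechanical but worth pinning down. A dependency-free sketch of the batching and the filter shape (`chunkIDs` and `inFilter` are illustrative names; real code would pass the filter to the Mongo driver):

```go
package main

import "fmt"

// chunkIDs splits the streamed source _ids into fixed-size batches so each
// target lookup stays a single bounded $in query (10 000 per the plan).
func chunkIDs(ids []string, size int) [][]string {
	var chunks [][]string
	for len(ids) > 0 {
		n := size
		if len(ids) < n {
			n = len(ids)
		}
		chunks = append(chunks, ids[:n])
		ids = ids[n:]
	}
	return chunks
}

// inFilter models the shape of the Pass 1 query filter,
// i.e. {_id: {$in: chunk}}.
func inFilter(chunk []string) map[string]any {
	return map[string]any{"_id": map[string]any{"$in": chunk}}
}

func main() {
	ids := make([]string, 25000)
	for i := range ids {
		ids[i] = fmt.Sprintf("id-%05d", i)
	}
	chunks := chunkIDs(ids, 10000)
	fmt.Println(len(chunks))    // 3
	fmt.Println(len(chunks[2])) // 5000 (tail chunk)
}
```

25 000 ids thus cost three queries, keeping both the request size and the reply working set bounded.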
Pass 2 — Apply and upsert:
- If `len(idMap) == 0`, skip the deep-walk entirely (fast path).
- Otherwise, for each collection:
  - Re-stream source documents one at a time.
  - If `doc._id ∈ remapSet`, swap in the new OID from `idMap`.
  - Deep-walk every field at every depth, replacing any `bson.ObjectID` value matching an `idMap` key.
  - Apply tenant-ref rewrite (`rewriteTenantRefs`, unchanged).
  - Append to batch. On reaching `batchSize`, issue an unordered `bulkWrite` of upserts filtered by `{_id, tenantId}`.
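The full-depth walk is the heart of Pass 2. A minimal sketch, assuming documents decoded to `map[string]any`; `oid` is a local stand-in for `bson.ObjectID` so the example stays dependency-free, `rewriteOIDs` is an illustrative name, and the map is keyed by `oid` here for brevity where the plan keys by hex string:

```go
package main

import "fmt"

// oid stands in for bson.ObjectID to keep the sketch dependency-free.
type oid [12]byte

// rewriteOIDs walks every field at every depth and swaps any oid found in
// idMap, mirroring the Pass 2 deep-walk. Scalars pass through untouched.
func rewriteOIDs(v any, idMap map[oid]oid) any {
	switch t := v.(type) {
	case oid:
		if repl, ok := idMap[t]; ok {
			return repl
		}
		return t
	case map[string]any:
		for k, val := range t {
			t[k] = rewriteOIDs(val, idMap)
		}
		return t
	case []any:
		for i, val := range t {
			t[i] = rewriteOIDs(val, idMap)
		}
		return t
	default:
		return v
	}
}

func main() {
	src, repl := oid{1}, oid{9}
	doc := map[string]any{
		"parentId": src,
		"customFields": map[string]any{
			"refs": []any{src, "not-an-oid"},
		},
	}
	rewriteOIDs(doc, map[oid]oid{src: repl})
	fmt.Println(doc["parentId"] == repl) // true
}
```

Because the walk matches on value type rather than field name, references survive no matter where a document buries them.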
Scope
Collections skipped by the new algorithm
These collections are not classified, have no `idMap` entry created, and are not deep-walked in Pass 2:
- `user` — owned by the existing `ReuseExistingUsers` path (email-based dedupe). Its OID remappings still flow into the global `idMap` for use by other collections.
- `user-session` — ephemeral, no meaningful cross-document references.
- `customer` — dedupe driven by business keys, not `_id`.
Deep-walk scope
Two independent walks with different scopes — do not conflate:
| Walk | Purpose | Scope |
|---|---|---|
| User remap | Existing email/OID anonymisation | Known field list (`remap.go:52-89`) + `customFields` subtree |
| idMap rewrite (new) | Cross-document `_id` integrity | All fields, every depth — any `bson.ObjectID` value anywhere in the document |
idMap storage
- In-memory `map[string]bson.ObjectID` by default.
- Threshold: 500 000 entries (~40 MB in memory). Above this, spill to disk.
- Disk backend: bbolt (`go.etcd.io/bbolt`) — pure Go, no CGO, single file.
  - File: `<os.TempDir()>/tenant-import-<jobID>.bolt`
  - Single bucket `idmap`. Key = source hex string (24 bytes). Value = 12-byte `ObjectID`.
  - In-memory LRU cache (100k hot entries) sits in front of bolt to keep Pass 2 deep-walk lookups fast.
- Cleanup: `defer os.Remove(path)` at the end of `RunImport`. Also invoked via a panic-recovery wrapper so a crash never leaks the file.
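The LRU-in-front-of-bolt read path can be sketched without the real dependencies. This model uses a plain bounded map with naive admission where production would use an actual LRU over a bbolt bucket; `cachedMap` and its fields are illustrative:

```go
package main

import "fmt"

// cachedMap models the read path: cache is the bounded hot set sitting in
// front of disk, which stands in for the bolt idmap bucket.
type cachedMap struct {
	cache    map[string][12]byte // hot entries, bounded by capacity
	capacity int
	disk     map[string][12]byte // stand-in for the bolt bucket
	diskHits int                 // counts reads that fell through to disk
}

func (m *cachedMap) Get(srcHex string) ([12]byte, bool) {
	if v, ok := m.cache[srcHex]; ok {
		return v, true // served from the hot set, no bolt read
	}
	v, ok := m.disk[srcHex]
	if ok {
		m.diskHits++
		if len(m.cache) < m.capacity { // naive admission; a real LRU evicts
			m.cache[srcHex] = v
		}
	}
	return v, ok
}

func main() {
	m := &cachedMap{
		cache:    map[string][12]byte{},
		capacity: 2,
		disk:     map[string][12]byte{"aaa": {1}},
	}
	m.Get("aaa") // cache miss, disk hit, promoted
	m.Get("aaa") // served from cache
	fmt.Println(m.diskHits) // 1
}
```

Repeated Pass 2 lookups of the same hot entry thus cost one disk read, not one per occurrence.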
IDMap interface
```go
type IDMap interface {
	Get(srcHex string) (bson.ObjectID, bool)
	Put(srcHex string, newOID bson.ObjectID)
	Len() int
	Close() error // releases bolt resources and deletes file if backed by disk
}
```

Implementations:

- `memIDMap` — plain `map[string]bson.ObjectID`.
- `boltIDMap` — bbolt-backed with LRU cache. Constructed when `memIDMap.Len()` crosses the threshold by draining the map into bolt.
Deep-walk code consumes the interface and is agnostic to storage.
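A sketch of the memory backend and the spill trigger. The disk side is faked with a map where production uses bbolt, `[12]byte` stands in for `bson.ObjectID`, and `maybeSpill` is an illustrative name for the `drainMemToBolt` step:

```go
package main

import "fmt"

// memIDMap is the trivial in-memory backend of the IDMap interface.
type memIDMap struct {
	m map[string][12]byte
}

func (m *memIDMap) Get(k string) ([12]byte, bool) { v, ok := m.m[k]; return v, ok }
func (m *memIDMap) Put(k string, v [12]byte)      { m.m[k] = v }
func (m *memIDMap) Len() int                      { return len(m.m) }
func (m *memIDMap) Close() error                  { return nil }

// maybeSpill drains the in-memory map into the disk backend once the entry
// threshold is crossed, then releases the memory. Returns true if it spilled.
func maybeSpill(mem *memIDMap, disk map[string][12]byte, threshold int) bool {
	if mem.Len() < threshold {
		return false
	}
	for k, v := range mem.m {
		disk[k] = v
	}
	mem.m = map[string][12]byte{} // release the in-memory copy
	return true
}

func main() {
	mem := &memIDMap{m: map[string][12]byte{}}
	for i := 0; i < 5; i++ {
		mem.Put(fmt.Sprintf("id-%d", i), [12]byte{byte(i)})
	}
	disk := map[string][12]byte{}
	fmt.Println(maybeSpill(mem, disk, 3), len(disk), mem.Len()) // true 5 0
}
```

Since callers hold only the interface, the drain can happen behind a single `Put` without any caller changes.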
Memory profile
Target: comfortably under 512 MB in a k8s pod for 2 M documents × 300 collections.
| Stage | In-mem working set |
|---|---|
| Pass 1 chunk | 10 000 OIDs × 12 B = ~240 KB + target reply |
| Pass 1 idMap (in-mem) | Case-3 entries × 80 B; capped at 500 k = 40 MB before spill |
| Pass 1 idMap (spilled) | LRU cache 100 k × 80 B = 8 MB |
| Pass 2 batch | batchSize × avg doc size (~5–10 MB at batch = 1000) |
| Pass 2 deep-walk scratch | Bounded per document; no accumulation |
Notes:
- `remapSet` per collection is typically empty (case 3 is rare). If large, it is also eligible for bolt spill (deferred; implement only if observed).
- Source-document streaming replaces the current full-collection-into-slice read at `import.go:194`.
Implementation steps
- Streaming doc reader. Replace `readJSONLDocs` / `readBSONDocs` with iterator variants (`forEachJSONLDoc`, `forEachBSONDoc`) that invoke a callback per document. Keep the slice-returning versions for callers that still need them (tests, user-coll pre-scan).
- `IDMap` interface + backends.
  - `memIDMap` (trivial).
  - `boltIDMap` with LRU. Add `go.etcd.io/bbolt` to `go.mod`.
  - Migration helper: `drainMemToBolt(mem, bolt)`.
- Pass 1 classifier.
  - New file `internal/migrate/classify.go`.
  - Entry point: `func classifyCollections(ctx, zr, targetDB, targetTenantID, skipList) (IDMap, map[string]*RemapSet, error)`.
  - Per collection: stream `_id`s, chunk `$in` queries, populate `IDMap` and `RemapSet`.
- Integrate user-coll flow.
  - Feed existing `ReuseExistingUsers` OID remappings into the same `IDMap` so Pass 2 rewires user references in non-user docs.
  - Disable the post-insert user remap sweep for case 3 when running under the new algorithm (its work is done during Pass 2).
- Pass 2 rewriter.
  - Replace the current "assign new `_id` and insert" block (`import.go:300-334`).
  - Per doc: `remapSet` check → `_id` swap → full-field deep-walk → tenant-ref rewrite → append to batch.
  - Batch write: `bulkWrite` with `ordered: false`, filter `{_id, tenantId}`, update `$set: <doc>`, `upsert: true`.
- Cleanup.
  - `defer idMap.Close()` removes the bolt file.
  - Panic-recover wrapper in `RunImport` to guarantee deletion on crash.
- Tests.
  - Unit: classifier (three cases), `IDMap` backends (including mem→bolt spill), deep-walk (nested arrays, customFields, OIDs at depth).
  - Integration: re-run import against the same target → second run produces zero new documents, zero reference drift.
  - Cross-tenant clash: pre-seed target with a clashing `_id` under a different tenant → verify the case 3 path rewires correctly.
Progress reporting
ProgressFn contract extends to cover two passes:
- 0–30 %: Pass 1 (classification + user pre-scan).
- 30–95 %: Pass 2 (streaming import + upserts).
- 95–100 %: user remap sweep (if any) + cleanup.
Weights are heuristic; adjust after observing real workloads.
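The banded contract amounts to mapping a pass-local completion fraction into that pass's global window. A sketch with the weights above (`band` is an illustrative name):

```go
package main

import "fmt"

// band maps a pass-local completion fraction (0..1) into the global
// progress window for that pass: Pass 1 = 0-30 %, Pass 2 = 30-95 %,
// sweep/cleanup = 95-100 %.
func band(lo, hi, frac float64) float64 {
	if frac < 0 {
		frac = 0
	}
	if frac > 1 {
		frac = 1
	}
	return lo + (hi-lo)*frac
}

func main() {
	fmt.Println(band(0, 30, 0.5)) // 15: halfway through Pass 1
	fmt.Println(band(30, 95, 1))  // 95: Pass 2 complete
	fmt.Println(band(95, 100, 0)) // 95: sweep just starting
}
```

Retuning the weights later then only changes the `lo`/`hi` constants passed per pass, not the callers reporting fractions.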
Risks and open questions
- Target scan cost at scale. The pre-scan issues one `$in` query per chunk per collection. For 300 collections × 200 chunks = 60 000 queries. Batched and projection-limited, this should take seconds, but needs validation on prod-sized tenants. If hot, consider parallelising per-collection scans with a small worker pool.
- Bolt sync mode. Use `db.NoSync = true`. The bolt file is ephemeral (deleted at end of job, never reopened across process restarts). Reads within the same process always see prior writes regardless of `NoSync` because bbolt serves them from its in-memory B+tree, so Pass 2 lookups are consistent with Pass 1 writes. The only durability `NoSync` sacrifices is "survive a process crash" — and in that scenario we throw the file away and re-run the job anyway.
- Business-key dedupe on other collections. Collections beyond `user`/`customer` may have unique indexes on non-`_id` fields (e.g. `code`, `slug`). Upserting on `_id` won't catch these. Not in scope for this change, but flag it in the PR description.
- Upsert with `$set: <entire doc>` replaces the existing target document wholesale on re-run. Confirm this matches the desired idempotent semantics (rather than merge or skip-if-exists).
- Indexes. `applyIndexesFromZip` is already idempotent (mongo no-ops on identical specs). No change needed.
Out of scope (future)
- Parallel per-collection workers for both passes.
- Incremental / delta imports (only changed documents).
- Dedupe on unique business keys across non-user collections.
- Exposing `idMap` size / spill state through the worker progress UI.

