
[Draft] Use $sum instead of $push to optimise Mongo group queries#311

Draft
suddendust wants to merge 2 commits into main from mongo_fix_groupby_perf


@suddendust suddendust commented Jun 10, 2026

Contributor

Optimize Mongo COUNT aggregation: $sum instead of $push + $size

Summary

The document-store library compiled COUNT(<expr>) for MongoDB into a $push accumulator
that collected every value into an in-memory array, then took its length with $size in a
follow-up $project:

$group:   { "<alias>": { $push: "$entityId" }, _id: { ... } }
$project: { "<alias>": { $size: "$<alias>" }, ... }

This materializes one array element per matching document just to count them. This change makes
COUNT emit a $sum accumulator instead, which needs only O(1) memory per group (no array, no
spill), while preserving the existing semantics:

// COUNT(<constant>)  ->  COUNT(*): count every document in the group
$group:   { _id: { ... }, "<alias>": { $sum: 1 } }
$project: { "<alias>": "$<alias>", ... }

// COUNT(<field/expr>) ->  count only documents where the operand is present (not missing),
//                         matching the old $push behavior (which skipped missing values)
$group:   { _id: { ... }, "<alias>": { $sum: { $cond: [ { $ne: [ { $type: "$<field>" }, "missing" ] }, 1, 0 ] } } }
$project: { "<alias>": "$<alias>", ... }
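
To make the semantics claim concrete, here is a minimal Python sketch that needs no MongoDB server. It is not the library's code: `count_accumulator`, `old_count`, and `new_count` are hypothetical names, and the two evaluators only imitate how the old and new pipelines treat present, null, and missing values under the behavior described above.

```python
from collections import defaultdict


def count_accumulator(operand):
    """Return the $group accumulator for COUNT(operand) (hypothetical helper).

    `operand` is either a constant such as 1, or a "$field" path string.
    """
    if isinstance(operand, str) and operand.startswith("$"):
        # COUNT(<field>): add 1 only when the field is present (null counts).
        guard = {"$ne": [{"$type": operand}, "missing"]}
        return {"$sum": {"$cond": [guard, 1, 0]}}
    # COUNT(<constant>): behaves like COUNT(*).
    return {"$sum": 1}


def old_count(docs, key, field):
    """Imitate the old plan: $push the field per group, then take $size."""
    arrays = defaultdict(list)
    for doc in docs:
        group = arrays[doc[key]]  # the group exists even if nothing is pushed
        if field in doc:          # missing values are skipped, nulls are kept
            group.append(doc[field])
    return {k: len(v) for k, v in arrays.items()}


def new_count(docs, key, field):
    """Imitate the new plan: one integer per group, no array."""
    counts = defaultdict(int)
    for doc in docs:
        counts[doc[key]] += 1 if field in doc else 0
    return dict(counts)


# Illustrative documents (made up for this sketch).
docs = [
    {"type": "a", "entityId": "e1"},
    {"type": "a", "entityId": None},  # null is present, so it is counted
    {"type": "a"},                    # missing, so it is not counted
    {"type": "b", "entityId": "e2"},
    {"type": "c"},                    # group with a count of 0
]

print(count_accumulator(1))
print(count_accumulator("$entityId"))
print(old_count(docs, "type", "entityId"))
print(new_count(docs, "type", "entityId"))
```

Both evaluators return the same per-group counts for these documents. The new one keeps a single integer per group instead of a list.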

Performance Evidence

Both plans below are the same query, same filter, same index (tenantId_isLearnt_index),
on default_db.application_asset_entities for a tenant with ~459K matching documents
(explain("executionStats"), MongoDB 8.0.20). The only difference is the COUNT compilation.
The "after" plan was captured with the $sum: 1 (COUNT(*)) form.

| Metric | Before ($push + $size) | After ($sum) |
| --- | --- | --- |
| executionTimeMillis | 13,753 ms | 7,134 ms |
| $group accumulator | addToArrayCapped(entityId, 104857600) | count() |
| $group stage time (est.) | ~13,555 ms | ~7,084 ms |
| nReturned (groups) | 4 | 4 |

Result: ~48% lower execution time (13.75s -> 7.13s, about 1.9x faster) on this dataset. The
dominant saving is eliminating the per-group array build (estimated $group stage time drops
from ~13.6s to ~7.1s); the $cond presence-guard form for COUNT(<field>) removes the same array
work, so it is expected to get the same class of improvement, although only the $sum: 1 form
was measured here.
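
For reference, the headline percentage can be recomputed from the two executionTimeMillis values reported above. This is plain arithmetic on those two numbers, not a new measurement.

```python
# executionTimeMillis values taken from the explain output reported above.
before_ms = 13_753  # $push + $size
after_ms = 7_134    # $sum

reduction = (before_ms - after_ms) / before_ms  # fraction of time saved
speedup = before_ms / after_ms                  # how many times faster

print(f"time reduction: {reduction:.1%}")  # time reduction: 48.1%
print(f"speedup: {speedup:.2f}x")          # speedup: 1.93x
```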

Out of scope (follow-up)

Note on full index coverage: the most efficient form ($sum: 1, reading no document fields) is
produced for COUNT(<constant>). For COUNT(<field>) the presence guard reads the field, so it
cannot be fully index-covered. The production asset-count query currently uses COUNT(fieldName);
if the intent is "count documents" (entityId is always present), the caller should issue
COUNT(*) / COUNT(1) to enable a fully index-covered, sub-second plan together with the
covering index.
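
The difference between the two forms can be illustrated by listing which document fields each $group stage reads. The sketch below is a simplification under stated assumptions: `referenced_fields` is a hypothetical helper, the group key `isLearnt` and the index key set are guessed from the index name for illustration only, and the subset check says nothing about how the MongoDB planner actually decides coverage.

```python
def referenced_fields(expr):
    """Collect every "$field" path that an aggregation expression reads."""
    if isinstance(expr, str):
        # "$name" is a field path; "$$NAME" is a variable; anything else is a literal.
        if expr.startswith("$") and not expr.startswith("$$"):
            return {expr[1:]}
        return set()
    if isinstance(expr, dict):
        # Keys are operators or output names; only the values can read fields.
        return set().union(*(referenced_fields(v) for v in expr.values()))
    if isinstance(expr, (list, tuple)):
        return set().union(*(referenced_fields(v) for v in expr))
    return set()  # numbers, booleans, None


# Assumed for illustration: group key and index fields are hypothetical.
index_fields = {"tenantId", "isLearnt"}

count_star = {"_id": {"isLearnt": "$isLearnt"}, "count": {"$sum": 1}}
count_field = {
    "_id": {"isLearnt": "$isLearnt"},
    "count": {"$sum": {"$cond": [{"$ne": [{"$type": "$entityId"}, "missing"]}, 1, 0]}},
}

for name, stage in (("COUNT(*)", count_star), ("COUNT(entityId)", count_field)):
    fields = referenced_fields(stage)
    print(name, sorted(fields), "within index fields:", fields <= index_fields)
```

Under these assumptions the $sum: 1 stage reads only the group key, while the presence guard also reads entityId, which is why the constant form is the candidate for a covered plan.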


2 participants