[Draft] Use $sum instead of $push to optimise Mongo group queries by suddendust · Pull Request #311 · hypertrace/document-store

suddendust · 2026-06-10T07:54:37Z

Optimize Mongo `COUNT` aggregation: `$sum` instead of `$push` + `$size`

Summary

The document-store library compiled COUNT(<expr>) for MongoDB into a $push accumulator
that collected every value into an in-memory array, then took its length with $size in a
follow-up $project:

$group:   { "<alias>": { $push: "$entityId" }, _id: { ... } }
$project: { "<alias>": { $size: "$<alias>" }, ... }
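To make the cost of this shape concrete, here is a minimal pure-Python sketch that mimics the two stages on in-memory dicts. It is not the library's code: `old_count` and the sample documents are hypothetical, a missing field is modelled as an absent dict key, and the sketch assumes that `$push` keeps an explicit null while skipping a missing field.

```python
from collections import defaultdict

def old_count(docs, group_field, count_field):
    """Mimic $group with $push, followed by $project with $size."""
    arrays = defaultdict(list)
    for doc in docs:
        key = doc.get(group_field)
        arrays[key]  # touching the key creates the group even if nothing is pushed
        # Assumption: $push skips a missing field but keeps an explicit null (None).
        if count_field in doc:
            arrays[key].append(doc[count_field])
    # $size: the array is only ever used for its length.
    return {key: len(values) for key, values in arrays.items()}

docs = [
    {"type": "api", "entityId": "a"},
    {"type": "api", "entityId": None},  # null value: still pushed
    {"type": "api"},                    # missing field: nothing pushed
    {"type": "db", "entityId": "b"},
]
counts = old_count(docs, "type", "entityId")
print(counts)
```

Every pushed value stays alive until the length is taken, so memory grows with the number of matching documents rather than with the number of groups.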

This materializes one array element per matching document just to count them. This change makes
COUNT emit a $sum accumulator instead, which needs O(1) memory per group (no array to build,
nothing to spill to disk), while preserving the existing semantics:

// COUNT(<constant>)  ->  COUNT(*): count every document in the group
$group:   { _id: { ... }, "<alias>": { $sum: 1 } }
$project: { "<alias>": "$<alias>", ... }

// COUNT(<field/expr>) ->  count only documents where the operand is present (not missing),
//                         matching the old $push behavior (which skipped missing values)
$group:   { _id: { ... }, "<alias>": { $sum: { $cond: [ { $ne: [ { $type: "$<field>" }, "missing" ] }, 1, 0 ] } } }
$project: { "<alias>": "$<alias>", ... }
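The two cases above can be sketched as a small builder that returns the `$group` stage as a plain Python dict. This is an illustration only: the real compiler lives in the Java library and handles general expressions, whereas the hypothetical `count_accumulator` and `group_stage` helpers here treat any string starting with `$` as a field path and everything else as a constant.

```python
def count_accumulator(operand):
    """Return a $group accumulator for COUNT(operand).

    operand is either a field path such as "$entityId" or a constant,
    which is treated as COUNT(*).
    """
    is_field_path = isinstance(operand, str) and operand.startswith("$")
    if not is_field_path:
        # COUNT(<constant>): every document in the group counts once.
        return {"$sum": 1}
    # COUNT(<field>): add 1 only when the field is present (null still counts).
    present = {"$ne": [{"$type": operand}, "missing"]}
    return {"$sum": {"$cond": [present, 1, 0]}}

def group_stage(group_id, alias, operand):
    """Assemble the $group stage; no array-valued accumulator is involved."""
    return {"$group": {"_id": group_id, alias: count_accumulator(operand)}}

star = group_stage({"type": "$type"}, "total", 1)
guarded = group_stage({"type": "$type"}, "total", "$entityId")
print(star)
print(guarded)
```

Because the accumulator already holds the final number, the follow-up `$project` only needs to pass the alias through.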

Performance Evidence

Both plans below are the same query, same filter, same index (tenantId_isLearnt_index),
on default_db.application_asset_entities for a tenant with ~459K matching documents
(explain("executionStats"), MongoDB 8.0.20). The only difference is the COUNT compilation.
The "after" plan was captured with the $sum: 1 (COUNT(*)) form.

| Metric | Before (`$push` + `$size`) | After (`$sum`) |
| --- | --- | --- |
| executionTimeMillis | 13,753 ms | 7,134 ms |
| `$group` accumulator | `addToArrayCapped(entityId, 104857600)` | `count()` |
| `$group` stage time (est.) | ~13,555 ms | ~7,084 ms |
| nReturned (groups) | 4 | 4 |

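Result: ~48% less wall-clock time (13.75s -> 7.13s, roughly 1.9x faster) on this dataset. The
dominant saving is eliminating the per-group array build: the estimated $group stage time drops
from ~13.5s to ~7.1s, and in the "after" plan $group adds only ~80ms on top of its input. The
$cond presence-guard form for COUNT(<field>) removes the same array work, so it should get the
same class of improvement, although it is not the form measured here.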
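For reference, the headline figure can be reproduced from the two `executionTimeMillis` values in the table. The variable names below are just labels for this calculation.

```python
before_ms = 13_753  # executionTimeMillis with $push + $size
after_ms = 7_134    # executionTimeMillis with $sum

reduction = (before_ms - after_ms) / before_ms  # fraction of wall time removed
speedup = before_ms / after_ms                  # how many times faster

print(f"{reduction:.1%} less wall time, {speedup:.2f}x faster")
# prints "48.1% less wall time, 1.93x faster"
```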

Out of scope (follow-up)

Note on full index coverage: the most efficient form ($sum: 1, reading no document fields) is
produced for COUNT(<constant>). For COUNT(<field>) the presence guard reads the field, so it
cannot be fully index-covered. The production asset-count query currently uses COUNT(fieldName);
if the intent is "count documents" (entityId is always present), the caller should issue
COUNT(*) / COUNT(1) to enable a fully index-covered, sub-second plan together with the
covering index.
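The coverage point can be illustrated with a rough check of which document fields a `$group` spec reads. This is a sketch under loud assumptions: `referenced_fields` is a hypothetical helper, the group key `isLearnt` is invented for the example, and `index_fields` is only guessed from the index name `tenantId_isLearnt_index`. Reading no fields outside the index is a necessary condition for a covered plan, not a guarantee that the planner will produce one.

```python
def referenced_fields(expr):
    """Collect field paths ("$a.b" -> "a.b") used as values anywhere in expr."""
    if isinstance(expr, str):
        # "$$VAR" is a system or user variable, not a document field.
        if expr.startswith("$") and not expr.startswith("$$"):
            return {expr[1:]}
        return set()
    if isinstance(expr, dict):
        # Keys are operators or output names; only values can be field paths.
        return set().union(*(referenced_fields(v) for v in expr.values()))
    if isinstance(expr, (list, tuple)):
        return set().union(*(referenced_fields(v) for v in expr))
    return set()

index_fields = {"tenantId", "isLearnt"}  # assumption, inferred from the index name

star = {"_id": {"isLearnt": "$isLearnt"}, "total": {"$sum": 1}}
guarded = {
    "_id": {"isLearnt": "$isLearnt"},
    "total": {"$sum": {"$cond": [{"$ne": [{"$type": "$entityId"}, "missing"]}, 1, 0]}},
}

star_extra = referenced_fields(star) - index_fields        # fields read outside the index
guarded_extra = referenced_fields(guarded) - index_fields
print(star_extra, guarded_extra)
```

Under these assumptions the `$sum: 1` form reads nothing outside the index, while the guarded form still needs `entityId` from each document.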

Use $sum instead of $push to optimise Mongo group queries

2a3af0e

suddendust requested review from avinashkolluru, kotharironak, puneet-traceable, skjindal93 and suresh-prakash as code owners June 10, 2026 07:54

suddendust marked this pull request as draft June 10, 2026 07:55

lrathod approved these changes Jun 10, 2026

WIP

fb83995

suddendust wants to merge 2 commits into main from mongo_fix_groupby_perf
