Skip to content

feat: add theta sketch (part 2)#59

Merged
leerho merged 5 commits into
apache:mainfrom
ZENOTME:binomial_bounds
Jan 8, 2026
Merged

feat: add theta sketch (part 2)#59
leerho merged 5 commits into
apache:mainfrom
ZENOTME:binomial_bounds

Conversation

@ZENOTME

@ZENOTME ZENOTME commented Jan 1, 2026

Copy link
Copy Markdown
Contributor
  • add binomial_bounds to support calculate lower_bound&&upper_bound
  • add get_lower_bound&&get_upper_bound in ThetaSketch

cc @notfilippo @Xuanwo @PsiACE @tisonkun

Comment thread datasketches/src/theta/sketch.rs Outdated
Comment thread datasketches/src/theta/sketch.rs Outdated
@leerho

leerho commented Jan 4, 2026

Copy link
Copy Markdown
Member

Waiting on requested change.

Comment thread datasketches/src/theta/sketch.rs Outdated
Comment on lines +154 to +159
pub fn lower_bound(&self, num_std_devs: u32) -> Result<f64, Error> {
if !self.is_estimation_mode() {
return Ok(self.num_retained() as f64);
}
binomial_bounds::lower_bound(self.num_retained() as u64, self.theta(), num_std_devs)
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a NumStdDev enum:

/// Number of standard deviations for confidence bounds
///
/// This enum specifies the number of standard deviations to use when computing
/// upper and lower bounds for cardinality estimates. Higher values provide wider
/// confidence intervals with greater certainty that the true cardinality falls
/// within the bounds.
#[repr(u8)]
pub enum NumStdDev {
/// One standard deviation (\~68% confidence interval)
One = 1,
/// Two standard deviations (\~95% confidence interval)
Two = 2,
/// Three standard deviations (\~99.7% confidence interval)
Three = 3,
}

You can reuse it by moving the struct to a shared module level and thus we would never error here.

Besides, we may consider where to place these shared structs, including the ResizeFactor. datasketches::{NumStdDev, ResizeFactor} is fair enough now, but perhaps pollute the top-level namespace too much.

Comment thread datasketches/src/binomial_bounds.rs Outdated
Comment on lines +748 to +764
// Comment for this larger test like Java
// for ci in 1..=3 {
// let arr = run_test_aux(2000, ci, 1e-7);
// for j in 0..5 {
// let ratio = arr[j] / STD[i][j];
// assert!(
// (ratio - 1.0).abs() < TOL,
// "ci={}, j={}: expected {}, got {}, ratio={}",
// ci,
// j,
// STD[i][j],
// arr[j],
// ratio
// );
// }
// i += 1;
// }

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much time it costs? I'd prefer not to comment out tests.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't seem to find the source for this commented out test in either C++ or Java . Could you give me a full link?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much time it costs? I'd prefer not to comment out tests.

0.56s(comment out) vs 3.19s in mac m1 pro

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.56s(comment out) vs 3.19s in mac m1 pro

This can be fair enough to include.

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* add binomial_bounds to support calculate lower_bound&&upper_bound
* add get_lower_bound&&get_upper_bound in ThetaSketch
@tisonkun

tisonkun commented Jan 5, 2026

Copy link
Copy Markdown
Member

I'll appreciate it if you can, the next time, avoid force push after reviewers claim in. Then reviewers can easily get the changeset from the last review.

Comment thread datasketches/src/lib.rs Outdated
mod binomial_bounds;
mod codec;
mod hash;
pub mod num_std_dev;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub mod num_std_dev;
mod num_std_dev;

Why pub?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remove it later

Comment thread datasketches/src/lib.rs Outdated
@ZENOTME

ZENOTME commented Jan 5, 2026

Copy link
Copy Markdown
Contributor Author

I'll appreciate it if you can, the next time, avoid force push after reviewers claim in. Then reviewers can easily get the changeset from the last review.

I’ll be more careful next time. If I need to rebase, should I use merge in situations like this instead?

@leerho leerho left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This Looks OK to me. However, there are 2 unresolved requested changes, and the last CI produced errors.

@ZENOTME

ZENOTME commented Jan 6, 2026

Copy link
Copy Markdown
Contributor Author

This Looks OK to me. However, there are 2 unresolved requested changes, and the last CI produced errors.

Looks like ci fail at ./tools/generate_serialization_test_data.py, should we re run to try again. Test locally pass

@leerho

leerho commented Jan 6, 2026

Copy link
Copy Markdown
Member

Ran the tests again, still failed.

@leerho

leerho commented Jan 6, 2026

Copy link
Copy Markdown
Member

I think it would be a good idea to have the workflow_dispatch: trigger on any workflows. It simplifies troubleshooting.

@tisonkun

tisonkun commented Jan 7, 2026

Copy link
Copy Markdown
Member

Error: Expected generated files directory not found at /home/runner/work/datasketches-rust/datasketches-rust/tmp_datasketches_java/serialization_test_data/java_generated_files

Look like some Java side logic changed so that directory has changed as well.

I'll take a look after managing the release vote to the next stage. And perhaps #10 should help in this issue.

I think it would be a good idea to have the workflow_dispatch: trigger on any workflows. It simplifies troubleshooting.

Good point.

@PsiACE

PsiACE commented Jan 7, 2026

Copy link
Copy Markdown
Member

Ran the tests again, still failed.

I included a temporary fix in #62 (Density Sketch) to ensure CI passes.

@tisonkun tisonkun mentioned this pull request Jan 7, 2026
@tisonkun

tisonkun commented Jan 8, 2026

Copy link
Copy Markdown
Member

CI fixed. I'll give another look tomorrow and try to merge it to cut 0.2.0-rc.2/

Signed-off-by: tison <wander4096@gmail.com>

@tisonkun tisonkun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Push a new commit for reorg common types.

Now this looks good to me.

@tisonkun tisonkun requested a review from leerho January 8, 2026 16:09
@tisonkun

tisonkun commented Jan 8, 2026

Copy link
Copy Markdown
Member

cc @leerho @PsiACE you may take another look then.

@PsiACE PsiACE left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. And looking forward to the new release.

@leerho leerho merged commit fdb38b8 into apache:main Jan 8, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants