Skip to content

Benchmark review #16

@rcannood

Description

@rcannood

Hi @ycao6928 ! I let Claude review this code repository, and it came up with several things we need to fix before doing another benchmark run. I reviewed its comments -- there are a few I removed because I know they're okay. I believe the rest we should fix. Would you like to discuss first?


🔴 Urgent: wrong or non-reproducible scores

  • JSD / RMSE are swapped in the downstream metric.
    In src/metrics/downstream/script.R:33-34, ctdeconvolute_rmse is assigned the output of generate_jds() (JSD) and ctdeconcolute_jsd the output of generate_rmse() (RMSE). The two are inverted, which also breaks normalization because config.vsh.yaml gives RMSE max: +Inf and JSD max: 1.

    • Swap the assignments so each ID holds the correct quantity.
    • Fix the typo ctdeconcolute_jsdctdeconvolute_jsd (script + config + metric IDs).
  • Gene-level Pearson is a duplicate of cell-level Pearson.
    In src/metrics/ks_statistic_gene_cell/script.R:202-204 and 229-231, both pearson_*_cells and pearson_*_genes call proxyC::simil(counts, method = "correlation") on the same matrix in the same orientation (row-wise = cell–cell). The gene metric therefore is not gene-level at all.

    • Compute the gene version on the transposed matrix (use margin = 2), matching the original commented-out cor() logic.
  • "Effective library size" is identical to "library size".
    script.R:137-151: both are log1p(rowSums(counts)), so ks_statistic_efflib_size_cells_* always equals ks_statistic_lib_size_cells_*.

    • Define the effective library size properly (e.g. library size × TMM/normalization factor), or remove the metric if redundant.
  • Crash-prone subsampling in the Pearson KS tests.
    script.R:207-208,233-234: sample(as.numeric(pearson_*), 10000) has no set.seed (scores vary between runs) and no replace = TRUE (errors when the pool has fewer than 10000 values, i.e. small datasets).

    • Cap the sample at the population size (or use replace = TRUE).
  • reclassify_simsce hardcodes exactly 4 spatial clusters.
    src/helpers/utils.R:138-147 loops 1:4 over a fixed 4×4 matrix, while generate_sim_spatialCluster uses q = max(spatial_cluster) (utils.R:175). For any dataset without exactly 4 clusters, generate_sim_spatialcluster writes wrong/NA spatial_cluster labels.

    • Make the reclassification dynamic in the number of clusters.

🟠 Should fix: validity, reproducibility, compute

  • The precompute pipeline is computed but unused (and partly duplicated).
    sc_features and precompute_downstream store scFeatures / SVGs / Moran's I / CARD proportions into the processed dataset, but the metrics recompute all of it from scratch (see the commented "use precomputed" lines and the TODO at downstream/script.R:1). BayesSpace clustering and CARD run twice (process step + metric), unseeded, so the two runs can disagree.

    • Either wire the metrics to read the precomputed slots, or drop the precompute steps.
  • Per-cluster simulation loops assume no cluster fails.
    splatter/script.R:27-45 and symsim/script.R:29-81 wrap each cluster in try(), then unconditionally run colnames(simulated_result) <- rownames(input_ordered$obs). A single failed cluster causes a dimension-mismatch hard error (or silent misalignment); a failed first cluster leaves the result NULL.

    • Handle partial/failed clusters explicitly (skip + realign, or fail loudly).
  • symsim uses gene_len_pool without loading it (symsim/script.R:57).

    • Add an explicit data(gene_len_pool).
  • BayesSpace platform hardcoded to "ST" (utils.R:173-174) even when the dataset assay is Visium, which changes the neighbor graph used for clustering.

    • Drive the platform from dataset_assay_spatial.
  • negative_normal emits continuous values as counts (negative_normal/script.R:18), violating the counts: integer contract in file_simulated_dataset.yaml. It may crash integer-expecting metrics (edgeR/SPARK) instead of producing a low score.

    • Emit integer counts (e.g. round/rpois) so the control scores poorly rather than erroring.
  • --base tissue is advertised but unimplemented for everything except SRTsim (all others stop("ONLY domain base")).


🟡 Suggestions: framing & hygiene

  • Metric naming is inaccurate. The code uses ks::kde.test (a kernel-density global two-sample test returning zstat/tstat), not the Kolmogorov–Smirnov statistic. Update the metric IDs/descriptions that say "Kolmogorov-Smirnov statistic".
  • Document the goodness-of-fit framing. There is no train/test split, and the positive control is a copy of the input. State explicitly that the benchmark measures reproduction of the fitted data, not generalization.
  • Document the scaled_mean/scaled_var design. These z-score each dataset before the KS test, so only distribution shape is compared (location/scale removed) — intentional, but non-obvious.
  • Unify metric output conventions. ks_statistic_gene_cell emits both zstat and tstat; ks_statistic_sc_features emits only zstat.
  • De-duplicate helpers. as_finite_kde_input / add_kde_jitter / try_kde_test are copy-pasted into both KS metric scripts → move to src/helpers/utils.R.
  • Fix doc drift. README / _viash.yaml claim "13 simulation methods using 10 distinct STR datasets", but the repo ships 8 simulators + 3 controls and one test dataset; CHANGELOG lists synsim (→ symsim).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions