Benchmark review

Hi @ycao6928 ! I let Claude review this code repository, and it came up with several things we need to fix before doing another benchmark run. I reviewed its comments -- there are a few I removed because I know they're okay. I believe the rest we should fix. Would you like to discuss first?

---

## 🔴 Urgent: wrong or non-reproducible scores

- [x] **JSD / RMSE are swapped in the `downstream` metric.**
  In [`src/metrics/downstream/script.R:33-34`](src/metrics/downstream/script.R#L33-L34), `ctdeconvolute_rmse` is assigned the output of `generate_jds()` (JSD) and `ctdeconcolute_jsd` the output of `generate_rmse()` (RMSE). The two are inverted, which also breaks normalization because [`config.vsh.yaml`](src/metrics/downstream/config.vsh.yaml#L53-L78) gives RMSE `max: +Inf` and JSD `max: 1`.
  - [x] Swap the assignments so each ID holds the correct quantity.
  - [x] Fix the typo `ctdeconcolute_jsd` → `ctdeconvolute_jsd` (script + config + metric IDs).

- [x] **Gene-level Pearson is a duplicate of cell-level Pearson.**
  In [`src/metrics/ks_statistic_gene_cell/script.R:202-204` and `229-231`](src/metrics/ks_statistic_gene_cell/script.R#L200-L235), both `pearson_*_cells` and `pearson_*_genes` call `proxyC::simil(counts, method = "correlation")` on the same matrix in the same orientation (row-wise = cell–cell). The gene metric therefore is not gene-level at all.
  - [x] Compute the gene version on the transposed matrix (use `margin = 2`), matching the original commented-out `cor()` logic.

- [x] **"Effective library size" is identical to "library size".**
  [`script.R:137-151`](src/metrics/ks_statistic_gene_cell/script.R#L137-L151): both are `log1p(rowSums(counts))`, so `ks_statistic_efflib_size_cells_*` always equals `ks_statistic_lib_size_cells_*`.
  - [x] Define the effective library size properly (e.g. library size × TMM/normalization factor), or remove the metric if redundant.

- [x] **Crash-prone subsampling in the Pearson KS tests.**
  [`script.R:207-208,233-234`](src/metrics/ks_statistic_gene_cell/script.R#L207-L208): `sample(as.numeric(pearson_*), 10000)` has no `set.seed` (scores vary between runs) and no `replace = TRUE` (errors when the pool has fewer than 10000 values, i.e. small datasets).
  - [x] Cap the sample at the population size (or use `replace = TRUE`).

- [x] **`reclassify_simsce` hardcodes exactly 4 spatial clusters.**
  [`src/helpers/utils.R:138-147`](src/helpers/utils.R#L138-L147) loops `1:4` over a fixed 4×4 matrix, while `generate_sim_spatialCluster` uses `q = max(spatial_cluster)` ([`utils.R:175`](src/helpers/utils.R#L175)). For any dataset without exactly 4 clusters, [`generate_sim_spatialcluster`](src/process_datasets/generate_sim_spatialcluster/script.R) writes wrong/NA `spatial_cluster` labels.
  - [x] Make the reclassification dynamic in the number of clusters.

---

## 🟠 Should fix: validity, reproducibility, compute

- [ ] **The precompute pipeline is computed but unused (and partly duplicated).**
  [`sc_features`](src/process_datasets/sc_features/script.R) and [`precompute_downstream`](src/process_datasets/precompute_downstream/script.R) store scFeatures / SVGs / Moran's I / CARD proportions into the processed dataset, but the metrics recompute all of it from scratch (see the commented "use precomputed" lines and the `TODO` at [`downstream/script.R:1`](src/metrics/downstream/script.R#L1)). BayesSpace clustering and CARD run **twice** (process step + metric), unseeded, so the two runs can disagree.
  - [ ] Either wire the metrics to read the precomputed slots, or drop the precompute steps.

- [ ] **Per-cluster simulation loops assume no cluster fails.**
  [`splatter/script.R:27-45`](src/methods/splatter/script.R#L27-L45) and [`symsim/script.R:29-81`](src/methods/symsim/script.R#L29-L81) wrap each cluster in `try()`, then unconditionally run `colnames(simulated_result) <- rownames(input_ordered$obs)`. A single failed cluster causes a dimension-mismatch hard error (or silent misalignment); a failed *first* cluster leaves the result `NULL`.
  - [ ] Handle partial/failed clusters explicitly (skip + realign, or fail loudly).

- [ ] **`symsim` uses `gene_len_pool` without loading it** ([`symsim/script.R:57`](src/methods/symsim/script.R#L57)).
  - [ ] Add an explicit `data(gene_len_pool)`.

- [ ] **BayesSpace platform hardcoded to `"ST"`** ([`utils.R:173-174`](src/helpers/utils.R#L173-L174)) even when the dataset assay is `Visium`, which changes the neighbor graph used for clustering.
  - [ ] Drive the platform from `dataset_assay_spatial`.

- [ ] **`negative_normal` emits continuous values as `counts`** ([`negative_normal/script.R:18`](src/control_methods/negative_normal/script.R#L18)), violating the `counts: integer` contract in [`file_simulated_dataset.yaml`](src/api/file_simulated_dataset.yaml). It may crash integer-expecting metrics (edgeR/SPARK) instead of producing a low score.
  - [ ] Emit integer counts (e.g. round/rpois) so the control scores poorly rather than erroring.

- [ ] **`--base tissue` is advertised but unimplemented** for everything except SRTsim (all others `stop("ONLY domain base")`).
  - [ ] Implement `tissue` where feasible, or remove the choice from [`comp_method.yaml`](src/api/comp_method.yaml#L21).

---

## 🟡 Suggestions: framing & hygiene

- [ ] **Metric naming is inaccurate.** The code uses `ks::kde.test` (a kernel-density global two-sample test returning `zstat`/`tstat`), not the Kolmogorov–Smirnov statistic. Update the metric IDs/descriptions that say "Kolmogorov-Smirnov statistic".
- [ ] **Document the goodness-of-fit framing.** There is no train/test split, and the positive control is a copy of the input. State explicitly that the benchmark measures reproduction of the fitted data, not generalization.
- [ ] **Document the `scaled_mean`/`scaled_var` design.** These z-score each dataset before the KS test, so only distribution *shape* is compared (location/scale removed) — intentional, but non-obvious.
- [ ] **Unify metric output conventions.** `ks_statistic_gene_cell` emits both `zstat` and `tstat`; `ks_statistic_sc_features` emits only `zstat`.
- [ ] **De-duplicate helpers.** `as_finite_kde_input` / `add_kde_jitter` / `try_kde_test` are copy-pasted into both KS metric scripts → move to `src/helpers/utils.R`.
- [ ] **Fix doc drift.** README / `_viash.yaml` claim "13 simulation methods using 10 distinct STR datasets", but the repo ships 8 simulators + 3 controls and one test dataset; CHANGELOG lists `synsim` (→ `symsim`).


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark review #16

🔴 Urgent: wrong or non-reproducible scores

🟠 Should fix: validity, reproducibility, compute

🟡 Suggestions: framing & hygiene

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Benchmark review #16

Description

🔴 Urgent: wrong or non-reproducible scores

🟠 Should fix: validity, reproducibility, compute

🟡 Suggestions: framing & hygiene

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions