You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hi @ycao6928 ! I let Claude review this code repository, and it came up with several things we need to fix before doing another benchmark run. I reviewed its comments -- there are a few I removed because I know they're okay. I believe the rest we should fix. Would you like to discuss first?
🔴 Urgent: wrong or non-reproducible scores
JSD / RMSE are swapped in the downstream metric.
In src/metrics/downstream/script.R:33-34, ctdeconvolute_rmse is assigned the output of generate_jds() (JSD) and ctdeconcolute_jsd the output of generate_rmse() (RMSE). The two are inverted, which also breaks normalization because config.vsh.yaml gives RMSE max: +Inf and JSD max: 1.
Swap the assignments so each ID holds the correct quantity.
Gene-level Pearson is a duplicate of cell-level Pearson.
In src/metrics/ks_statistic_gene_cell/script.R:202-204 and 229-231, both pearson_*_cells and pearson_*_genes call proxyC::simil(counts, method = "correlation") on the same matrix in the same orientation (row-wise = cell–cell). The gene metric therefore is not gene-level at all.
Compute the gene version on the transposed matrix (use margin = 2), matching the original commented-out cor() logic.
"Effective library size" is identical to "library size". script.R:137-151: both are log1p(rowSums(counts)), so ks_statistic_efflib_size_cells_* always equals ks_statistic_lib_size_cells_*.
Define the effective library size properly (e.g. library size × TMM/normalization factor), or remove the metric if redundant.
Crash-prone subsampling in the Pearson KS tests. script.R:207-208,233-234: sample(as.numeric(pearson_*), 10000) has no set.seed (scores vary between runs) and no replace = TRUE (errors when the pool has fewer than 10000 values, i.e. small datasets).
Cap the sample at the population size (or use replace = TRUE).
reclassify_simsce hardcodes exactly 4 spatial clusters. src/helpers/utils.R:138-147 loops 1:4 over a fixed 4×4 matrix, while generate_sim_spatialCluster uses q = max(spatial_cluster) (utils.R:175). For any dataset without exactly 4 clusters, generate_sim_spatialcluster writes wrong/NA spatial_cluster labels.
Make the reclassification dynamic in the number of clusters.
🟠 Should fix: validity, reproducibility, compute
The precompute pipeline is computed but unused (and partly duplicated). sc_features and precompute_downstream store scFeatures / SVGs / Moran's I / CARD proportions into the processed dataset, but the metrics recompute all of it from scratch (see the commented "use precomputed" lines and the TODO at downstream/script.R:1). BayesSpace clustering and CARD run twice (process step + metric), unseeded, so the two runs can disagree.
Either wire the metrics to read the precomputed slots, or drop the precompute steps.
Per-cluster simulation loops assume no cluster fails. splatter/script.R:27-45 and symsim/script.R:29-81 wrap each cluster in try(), then unconditionally run colnames(simulated_result) <- rownames(input_ordered$obs). A single failed cluster causes a dimension-mismatch hard error (or silent misalignment); a failed first cluster leaves the result NULL.
Handle partial/failed clusters explicitly (skip + realign, or fail loudly).
BayesSpace platform hardcoded to "ST" (utils.R:173-174) even when the dataset assay is Visium, which changes the neighbor graph used for clustering.
Drive the platform from dataset_assay_spatial.
negative_normal emits continuous values as counts (negative_normal/script.R:18), violating the counts: integer contract in file_simulated_dataset.yaml. It may crash integer-expecting metrics (edgeR/SPARK) instead of producing a low score.
Emit integer counts (e.g. round/rpois) so the control scores poorly rather than erroring.
--base tissue is advertised but unimplemented for everything except SRTsim (all others stop("ONLY domain base")).
Implement tissue where feasible, or remove the choice from comp_method.yaml.
🟡 Suggestions: framing & hygiene
Metric naming is inaccurate. The code uses ks::kde.test (a kernel-density global two-sample test returning zstat/tstat), not the Kolmogorov–Smirnov statistic. Update the metric IDs/descriptions that say "Kolmogorov-Smirnov statistic".
Document the goodness-of-fit framing. There is no train/test split, and the positive control is a copy of the input. State explicitly that the benchmark measures reproduction of the fitted data, not generalization.
Document the scaled_mean/scaled_var design. These z-score each dataset before the KS test, so only distribution shape is compared (location/scale removed) — intentional, but non-obvious.
Unify metric output conventions.ks_statistic_gene_cell emits both zstat and tstat; ks_statistic_sc_features emits only zstat.
De-duplicate helpers.as_finite_kde_input / add_kde_jitter / try_kde_test are copy-pasted into both KS metric scripts → move to src/helpers/utils.R.
Fix doc drift. README / _viash.yaml claim "13 simulation methods using 10 distinct STR datasets", but the repo ships 8 simulators + 3 controls and one test dataset; CHANGELOG lists synsim (→ symsim).
Hi @ycao6928 ! I let Claude review this code repository, and it came up with several things we need to fix before doing another benchmark run. I reviewed its comments -- there are a few I removed because I know they're okay. I believe the rest we should fix. Would you like to discuss first?
🔴 Urgent: wrong or non-reproducible scores
JSD / RMSE are swapped in the
downstreammetric.In
src/metrics/downstream/script.R:33-34,ctdeconvolute_rmseis assigned the output ofgenerate_jds()(JSD) andctdeconcolute_jsdthe output ofgenerate_rmse()(RMSE). The two are inverted, which also breaks normalization becauseconfig.vsh.yamlgives RMSEmax: +Infand JSDmax: 1.ctdeconcolute_jsd→ctdeconvolute_jsd(script + config + metric IDs).Gene-level Pearson is a duplicate of cell-level Pearson.
In
src/metrics/ks_statistic_gene_cell/script.R:202-204and229-231, bothpearson_*_cellsandpearson_*_genescallproxyC::simil(counts, method = "correlation")on the same matrix in the same orientation (row-wise = cell–cell). The gene metric therefore is not gene-level at all.margin = 2), matching the original commented-outcor()logic."Effective library size" is identical to "library size".
script.R:137-151: both arelog1p(rowSums(counts)), soks_statistic_efflib_size_cells_*always equalsks_statistic_lib_size_cells_*.Crash-prone subsampling in the Pearson KS tests.
script.R:207-208,233-234:sample(as.numeric(pearson_*), 10000)has noset.seed(scores vary between runs) and noreplace = TRUE(errors when the pool has fewer than 10000 values, i.e. small datasets).replace = TRUE).reclassify_simscehardcodes exactly 4 spatial clusters.src/helpers/utils.R:138-147loops1:4over a fixed 4×4 matrix, whilegenerate_sim_spatialClusterusesq = max(spatial_cluster)(utils.R:175). For any dataset without exactly 4 clusters,generate_sim_spatialclusterwrites wrong/NAspatial_clusterlabels.🟠 Should fix: validity, reproducibility, compute
The precompute pipeline is computed but unused (and partly duplicated).
sc_featuresandprecompute_downstreamstore scFeatures / SVGs / Moran's I / CARD proportions into the processed dataset, but the metrics recompute all of it from scratch (see the commented "use precomputed" lines and theTODOatdownstream/script.R:1). BayesSpace clustering and CARD run twice (process step + metric), unseeded, so the two runs can disagree.Per-cluster simulation loops assume no cluster fails.
splatter/script.R:27-45andsymsim/script.R:29-81wrap each cluster intry(), then unconditionally runcolnames(simulated_result) <- rownames(input_ordered$obs). A single failed cluster causes a dimension-mismatch hard error (or silent misalignment); a failed first cluster leaves the resultNULL.symsimusesgene_len_poolwithout loading it (symsim/script.R:57).data(gene_len_pool).BayesSpace platform hardcoded to
"ST"(utils.R:173-174) even when the dataset assay isVisium, which changes the neighbor graph used for clustering.dataset_assay_spatial.negative_normalemits continuous values ascounts(negative_normal/script.R:18), violating thecounts: integercontract infile_simulated_dataset.yaml. It may crash integer-expecting metrics (edgeR/SPARK) instead of producing a low score.--base tissueis advertised but unimplemented for everything except SRTsim (all othersstop("ONLY domain base")).tissuewhere feasible, or remove the choice fromcomp_method.yaml.🟡 Suggestions: framing & hygiene
ks::kde.test(a kernel-density global two-sample test returningzstat/tstat), not the Kolmogorov–Smirnov statistic. Update the metric IDs/descriptions that say "Kolmogorov-Smirnov statistic".scaled_mean/scaled_vardesign. These z-score each dataset before the KS test, so only distribution shape is compared (location/scale removed) — intentional, but non-obvious.ks_statistic_gene_cellemits bothzstatandtstat;ks_statistic_sc_featuresemits onlyzstat.as_finite_kde_input/add_kde_jitter/try_kde_testare copy-pasted into both KS metric scripts → move tosrc/helpers/utils.R._viash.yamlclaim "13 simulation methods using 10 distinct STR datasets", but the repo ships 8 simulators + 3 controls and one test dataset; CHANGELOG listssynsim(→symsim).