Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning

Authors: Da Kuang, Guanwen Qiu, Junhyong Kim

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework... We introduce a Lineage Reconstruction Benchmark comprising (a) synthetic datasets based on Brownian motion with independent noise and spurious signals, (b) lineage-resolved scRNA-seq datasets. Experimental results on the benchmark demonstrate that CellTreeQM efficiently reconstructs lineage structures under weak supervision and limited data...
Researcher Affiliation | Academia | 1 Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA; 2 Department of Biology, University of Pennsylvania, Philadelphia, USA. Correspondence to: Da Kuang <EMAIL>, Junhyong Kim <EMAIL>.
Pseudocode | No | The paper describes methods and workflows in prose and figures (e.g., Figure 2 for the CellTreeQM workflow, Figure 11 for the model architecture), but it contains no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and benchmarks are available at: https://kuang-da.github.io/CellTreeQM-page
Open Datasets | Yes | We introduce a Lineage Reconstruction Benchmark comprising (a) synthetic datasets modeled via Brownian motion with independent noise and spurious signals and (b) lineage-resolved single-cell RNA sequencing datasets... Lineage-Resolved C. elegans Dataset: Among model organisms, C. elegans is uniquely suited for benchmarking lineage reconstruction because its embryonic cell lineage is invariant. We curate three subsets of increasing size (C. elegans Small, Mid, and Large) from the transcriptomic atlases of Packer et al. (2019) and Large et al. (2024), containing 102, 183, and 295 leaves, respectively.
Dataset Splits | No | In D.2.4, 'SUPERVISED SETTING', the paper states: 'To evaluate lineage reconstruction methods in a supervised setting, we generate two sets of signal variables: one for training and one for testing (Fig. 10).' The 'Partial-Labeled Leaves Setting' notes that 'only a subset of leaves (e.g., 30%, 50%, or 80%) have known lineage information'. However, the paper provides no specific percentages or counts for standard train/validation/test splits of the overall C. elegans datasets, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU models, or memory specifications used for the experiments; it mentions only the 'high computational cost' of some experiments.
Software Dependencies | No | The paper mentions a 'deep learning framework based on transformer architectures', 'Transformer encoder blocks', and 'fully connected layers', but it names no software with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | H. Training Setups... Supervised Setting. Simulation: We use a Transformer with 8 layers and 2 attention heads. The projection layer and hidden layer are both 256-dimensional, and the model outputs 128-dimensional embeddings. We apply a data dropout of 0.3 and a metric dropout of 0.2. When necessary, we set the gate regularization weight to 5. Real data: The Transformer also has 8 layers and 2 attention heads, but with 1024-dimensional projection and hidden layers, producing 128-dimensional outputs. Here, we use a data dropout of 0.1 and a metric dropout of 0.1. When needed, the gate regularization weight is 0.01.
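The synthetic setting quoted under Open Datasets (Brownian-motion signal plus independent noise) can be sketched as a toy generator. This is an illustrative sketch, not the authors' benchmark code; the function and parameter names are hypothetical:

```python
import numpy as np

def simulate_brownian_lineage(depth, n_signal, n_noise,
                              step_sigma=1.0, noise_sigma=1.0, rng=None):
    """Toy Brownian-motion features on a complete binary lineage tree.

    Signal dimensions drift by an independent Gaussian step at each cell
    division, so leaf-to-leaf distances track tree distance; noise
    dimensions are drawn i.i.d. per leaf and carry no lineage information.
    """
    rng = np.random.default_rng(rng)
    states = [np.zeros(n_signal)]                  # root state
    for _ in range(depth):                         # each division yields two
        states = [s + rng.normal(0.0, step_sigma, n_signal)  # independently
                  for s in states for _ in range(2)]         # drifted children
    signal = np.stack(states)                      # (2**depth, n_signal)
    noise = rng.normal(0.0, noise_sigma, (len(states), n_noise))
    return np.concatenate([signal, noise], axis=1)

leaves = simulate_brownian_lineage(depth=5, n_signal=10, n_noise=40, rng=0)
# leaves.shape → (32, 50): 2**5 leaves, 10 signal + 40 noise dimensions
```

A reconstruction method is then scored by how well a tree inferred from distances between these leaf feature vectors matches the generating binary tree.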
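The supervised-simulation architecture quoted in the Experiment Setup row (8 layers, 2 heads, 256-dimensional projection/hidden layers, 128-dimensional embeddings) can be sketched as a model skeleton. The paper names no framework, so this assumes PyTorch; `CellEncoder` and its arguments are hypothetical, and the paper's data dropout, metric dropout, and gate-regularization components are not reproduced here:

```python
import torch.nn as nn

class CellEncoder(nn.Module):
    """Skeleton matching the reported supervised-simulation setup:
    8 layers, 2 heads, 256-d projection/hidden, 128-d embeddings."""

    def __init__(self, in_dim, d_model=256, n_layers=8, n_heads=2,
                 d_hidden=256, d_out=128, dropout=0.2):
        super().__init__()
        self.project = nn.Linear(in_dim, d_model)      # projection layer
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_hidden,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, d_out)          # embedding output

    def forward(self, x):                              # x: (batch, cells, in_dim)
        return self.head(self.encoder(self.project(x)))
```

For the real-data setting the quoted text would correspond to `d_model=1024, d_hidden=1024` with the same depth, head count, and 128-dimensional output.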