Reconstructing Cell Lineage Trees from Phenotypic Features with Metric Learning

Authors: Da Kuang, Guanwen Qiu, Junhyong Kim

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that CellTreeQM recovers lineage structures with minimal supervision and limited data, offering a scalable framework... We introduce a Lineage Reconstruction Benchmark comprising (a) synthetic datasets based on Brownian motion with independent noise and spurious signals, (b) lineage-resolved scRNA-seq datasets. Experimental results on the benchmark demonstrate that CellTreeQM efficiently reconstructs lineage structures under weak supervision and limited data...
Researcher Affiliation | Academia | 1 Department of Computer and Information Science, University of Pennsylvania, Philadelphia, USA; 2 Department of Biology, University of Pennsylvania, Philadelphia, USA. Correspondence to: Da Kuang <EMAIL>, Junhyong Kim <EMAIL>.
Pseudocode | No | The paper describes methods and workflows in prose and figures (e.g., Figure 2 for the CellTreeQM workflow, Figure 11 for the model architecture), but it contains no explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | Code and benchmarks are available at: https://kuang-da.github.io/CellTreeQM-page
Open Datasets | Yes | We introduce a Lineage Reconstruction Benchmark comprising (a) synthetic datasets modeled via Brownian motion with independent noise and spurious signals and (b) lineage-resolved single-cell RNA sequencing datasets... Lineage-Resolved C. elegans Dataset: Among model organisms, C. elegans is uniquely suited for benchmarking lineage reconstruction because its embryonic cell lineage is invariant. We curate three subsets of increasing size (C. elegans Small, Mid, and Large) from the transcriptomic atlases of Packer et al. (2019) and Large et al. (2024), containing 102, 183, and 295 leaves, respectively.
Dataset Splits | No | In D.2.4, 'SUPERVISED SETTING', the paper states: 'To evaluate lineage reconstruction methods in a supervised setting, we generate two sets of signal variables: one for training and one for testing (Fig. 10).' The 'Partial-Labeled Leaves Setting' notes that 'only a subset of leaves (e.g., 30%, 50%, or 80%) have known lineage information'. However, the paper provides no specific percentages or counts for standard train/validation/test splits of the overall C. elegans datasets, nor does it reference predefined splits with citations.
Hardware Specification | No | The paper does not provide hardware details such as GPU models, CPU models, or memory specifications used for the experiments; it mentions only the 'high computational cost' of some experiments.
Software Dependencies | No | The paper mentions a 'deep learning framework based on transformer architectures', 'Transformer encoder blocks', and 'fully connected layers', but it names no software with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | H. Training Setups... Supervised Setting. Simulation: We use a Transformer with 8 layers and 2 attention heads. The projection layer and hidden layer are both 256-dimensional, and the model outputs 128-dimensional embeddings. We apply a data dropout of 0.3 and a metric dropout of 0.2. When necessary, we set the gate regularization weight to 5. Real data: The Transformer also has 8 layers and 2 attention heads, but with 1024-dimensional projection and hidden layers, producing 128-dimensional outputs. Here, we use a data dropout of 0.1 and a metric dropout of 0.1. When needed, the gate regularization weight is 0.01.
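The synthetic setting quoted under Open Datasets (Brownian-motion signal plus independent noise) can be sketched as a toy generator. This is an illustrative sketch, not the authors' benchmark code; the function and parameter names are hypothetical:

```python
import numpy as np

def simulate_brownian_lineage(depth, n_signal, n_noise,
                              step_sigma=1.0, noise_sigma=1.0, rng=None):
    """Toy Brownian-motion features on a complete binary lineage tree.

    Signal dimensions drift by an independent Gaussian step at each cell
    division, so leaf-to-leaf distances track tree distance; noise
    dimensions are drawn i.i.d. per leaf and carry no lineage information.
    """
    rng = np.random.default_rng(rng)
    states = [np.zeros(n_signal)]                  # root state
    for _ in range(depth):                         # each division yields two
        states = [s + rng.normal(0.0, step_sigma, n_signal)  # independently
                  for s in states for _ in range(2)]         # drifted children
    signal = np.stack(states)                      # (2**depth, n_signal)
    noise = rng.normal(0.0, noise_sigma, (len(states), n_noise))
    return np.concatenate([signal, noise], axis=1)

leaves = simulate_brownian_lineage(depth=5, n_signal=10, n_noise=40, rng=0)
# leaves.shape → (32, 50): 2**5 leaves, 10 signal + 40 noise dimensions
```

A reconstruction method is then scored by how well a tree inferred from distances between these leaf feature vectors matches the generating binary tree.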
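The supervised-simulation architecture quoted in the Experiment Setup row (8 layers, 2 heads, 256-dimensional projection/hidden layers, 128-dimensional embeddings) can be sketched as a model skeleton. The paper names no framework, so this assumes PyTorch; `CellEncoder` and its arguments are hypothetical, and the paper's data dropout, metric dropout, and gate-regularization components are not reproduced here:

```python
import torch.nn as nn

class CellEncoder(nn.Module):
    """Skeleton matching the reported supervised-simulation setup:
    8 layers, 2 heads, 256-d projection/hidden, 128-d embeddings."""

    def __init__(self, in_dim, d_model=256, n_layers=8, n_heads=2,
                 d_hidden=256, d_out=128, dropout=0.2):
        super().__init__()
        self.project = nn.Linear(in_dim, d_model)      # projection layer
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_hidden,
            dropout=dropout, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, d_out)          # embedding output

    def forward(self, x):                              # x: (batch, cells, in_dim)
        return self.head(self.encoder(self.project(x)))
```

For the real-data setting the quoted text would correspond to `d_model=1024, d_hidden=1024` with the same depth, head count, and 128-dimensional output.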