Bootstrapping Self-Improvement of Language Model Programs for Zero-Shot Schema Matching
Authors: Nabeel Seedat, Mihaela van der Schaar
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches, highlighting its potential to accelerate data integration and interoperability of ML-ready data. We conduct experiments on the MIMIC-OMOP and Synthea-OMOP datasets, which are the standard benchmark datasets used in prior schema matching works (Sheetrit et al., 2024; Zhang et al., 2023b; Narayan et al., 2022; Zhang et al., 2023a; 2021). These datasets are real-world healthcare schema matching datasets and have been widely adopted due to their complexity and their reflection of real-world schema matching challenges. |
| Researcher Affiliation | Collaboration | Nabeel Seedat 1 2 Mihaela van der Schaar 1 1Department of Applied Mathematics and Theoretical Physics, University of Cambridge 2Foundational Machine Learning Research, Thomson Reuters. Correspondence to: Nabeel Seedat <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 (Optimize LM program L): Input: set of evaluation queries Deval = {e1, e2, ..., en}; Output: set of top-n demonstrations Ddemo. ... Algorithm 3 (Matchmaker: Schema Matching with Self-Improving Compositional Language Model Programs): Require: source schema Ss, target schema St; Ensure: schema matches M. |
| Open Source Code | Yes | 2https://github.com/seedatnabeel/Matchmaker or https://github.com/vanderschaarlab/Matchmaker |
| Open Datasets | Yes | We conduct experiments on the MIMIC-OMOP and Synthea-OMOP datasets, which are the standard benchmark datasets used in prior schema matching works (Sheetrit et al., 2024; Zhang et al., 2023b; Narayan et al., 2022; Zhang et al., 2023a; 2021). ... Open-source data: https://github.com/meniData1/MIMIC_2_OMOP ... Open-source data: https://github.com/JZCS2018/SMAT/tree/main/datasets/omap/ |
| Dataset Splits | Yes | Note that no specific train-test sets are used as in supervised learning, since we perform the schema matching task in a zero-shot manner. ... In our experiments, we assess two variants, given that labeled training data for schema matching is hard to access: (i) 20-80: 20% train and 80% test, and (ii) 50-50: 50% train and 50% test. |
| Hardware Specification | Yes | All experiments are run on a single NVIDIA A4000 GPU with 20 GB of VRAM. |
| Software Dependencies | Yes | The model version used as the LLM was GPT-4-1106, with the following settings: ... We use ColBERTv2 (Santhanam et al., 2022) as the embedding model ... All LLM baselines use GPT-4 (0613) (OpenAI, 2023) as the backbone for fair comparison to the original works and to isolate the gains of the system not tied to the LLM. |
| Experiment Setup | Yes | GPT-4 Hyper-parameters. The model version used as the LLM was GPT-4-1106, with the following settings: `{"temperature": 0.5, "max_tokens": 1024, "top_p": 1, "frequency_penalty": 0, "presence_penalty": 0, "n": 1}` |
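The experiment-setup row above quotes the exact GPT-4 sampling settings used. As a minimal sketch, the snippet below assembles those settings into a chat-completion request payload; the `build_request` helper and the `gpt-4-1106-preview` model identifier are assumptions for illustration (the table only says "GPT-4-1106"), not the authors' code.

```python
# Settings quoted in the Experiment Setup row above.
GPT4_SETTINGS = {
    "temperature": 0.5,
    "max_tokens": 1024,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "n": 1,
}

def build_request(prompt: str, model: str = "gpt-4-1106-preview") -> dict:
    """Assemble a chat-completion request payload using the paper's settings.

    The model string is a hypothetical API identifier; substitute whichever
    GPT-4-1106 alias your API endpoint exposes.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **GPT4_SETTINGS,
    }

req = build_request("Match source column 'dob' to the target schema.")
print(req["temperature"])  # 0.5
```

A payload like this can be passed directly as keyword arguments to an OpenAI-style chat-completions client; keeping the settings in one dict makes it easy to report them verbatim, as the checklist row does.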
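The dataset-splits row notes that Matchmaker itself is zero-shot, while the supervised baselines use 20-80 and 50-50 train-test variants. A minimal sketch of producing such splits is below; the `split` helper and the toy `pairs` list are illustrative assumptions, not the authors' code.

```python
import random

def split(items, train_frac, seed=0):
    """Shuffle items deterministically and cut them into train/test
    at the given training fraction (e.g. 0.20 for the 20-80 variant)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * train_frac)
    return shuffled[:k], shuffled[k:]

# Toy stand-in for labeled schema-match pairs.
pairs = [f"match_{i}" for i in range(100)]

train_20, test_80 = split(pairs, 0.20)  # 20-80 variant
train_50, test_50 = split(pairs, 0.50)  # 50-50 variant
print(len(train_20), len(test_80))  # 20 80
print(len(train_50), len(test_50))  # 50 50
```

Fixing the shuffle seed keeps the two variants comparable across baseline runs, which matters when labeled schema-matching data is as scarce as the row describes.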