InkSight: Offline-to-Online Handwriting Conversion by Teaching Vision-Language Models to Read and Write
Authors: Blagoj Mitrevski, Arina Rak, Julian Schnitzler, Chengkun Li, Andrii Maksai, Jesse Berent, Claudiu Cristian Musat
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our human evaluation reveals that 87% of the samples produced by our model on the challenging HierText dataset are considered a valid tracing of the input image and 67% look like a pen trajectory traced by a human. In this section, we discuss the training datasets and implementation details, and present the qualitative and quantitative results, followed by an ablation study on training tasks and design choices. |
| Researcher Affiliation | Collaboration | Blagoj Mitrevski*, Arina Rak*, Julian Schnitzler*, Chengkun Li*, Andrii Maksai (project lead), Jesse Berent, Claudiu Musat. Google DeepMind; EPFL (work done as student researcher). *First authors, random order decided by AEA tool. Project lead contact: EMAIL |
| Pseudocode | No | The paper describes methods and processes through textual explanations and diagrams (e.g., Figure 2 for Full Page System, Figure 3 and Table 1 for training tasks), but does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures that structure procedural steps like code. |
| Open Source Code | Yes | GitHub: https://github.com/google-research/inksight Hugging Face: [link] Project page: [link] |
| Open Datasets | Yes | As public OCR training data, we use RIMES (Augustin et al., 2006; Grosicki et al., 2009), HierText (Long et al., 2022; 2023), IMGUR5K (Krishnan et al., 2023), ICDAR 15 historical documents (Murdock et al., 2015), and IAM (Marti & Bunke, 1999). As public digital ink training data, we use VNOnDB (Nguyen et al., 2018), SCUT-Couch (Li et al., 2008), and DeepWriting (Aksan et al., 2018). |
| Dataset Splits | Yes | To automatically assess our models, we evaluate the quality of derendering on the test splits of 3 OCR datasets: IAM (testset_f, 17.6k samples), IMGUR5K (~23.7k samples), and HierText. For HierText, which was not originally designed around handwritten samples, we apply the same filtering as to the OCR training data, and additionally only consider words marked as handwritten (~1.3k samples). Additionally, we performed a small annotation campaign, asking people to trace 200 samples from the HierText test set. |
| Hardware Specification | Yes | With frozen ViT encoders, the training of a 340M parameter model (such as Small-i or Small-p) takes 33h on 64 TPU v5e chips and the training of a 1B parameter Large-i model takes 105h on 64 TPU v5e chips, both run on an internal TPU v5e cluster and should be reproducible with public GCP TPU v5e. We benchmarked inference speed on three different GPU devices by measuring tokens processed per second over five runs. The Titan RTX achieved an average of 114.43 tokens/s, the T4 averaged 47.14 tokens/s, and the V100 averaged 139.14 tokens/s. |
| Software Dependencies | No | Similar to PaLI models (Chen et al., 2022a; 2023a;b), our models together with the training mixtures are implemented in JAX/Flax (Bradbury et al., 2018) using the open-source T5X, SeqIO (Roberts et al., 2023) and Flaxformer (Heek et al., 2023) frameworks. While these frameworks are named and cited, specific version numbers for them are not provided. |
| Experiment Setup | Yes | For training, we use the Adafactor optimizer (Shazeer & Stern, 2018) with β1 = 0, second-order moment of 0.8, and language-model-style teacher forcing with a softmax cross-entropy loss. For the learning rate schedule, we use a linear decay scheduler with a peak learning rate of 0.01, 5k steps of warmup, a linear decay with a decay factor of 3e-6, and a dropout of 0.1 on both the ViT encoder and the mT5 encoder-decoder. We train our models for 340k steps with batch size 512. |
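The learning-rate schedule quoted in the Experiment Setup row can be sketched as a plain function. This is a minimal sketch, not the authors' implementation: it assumes the "decay factor of 3e-6" denotes the final learning rate after linear decay, and the function name and total-step default (340k, from the same row) are illustrative.

```python
def learning_rate(step: int,
                  peak_lr: float = 0.01,      # peak LR from the paper
                  warmup_steps: int = 5_000,  # 5k warmup steps
                  total_steps: int = 340_000, # total training steps
                  final_lr: float = 3e-6):    # assumed meaning of "decay factor 3e-6"
    """Linear warmup to peak_lr, then linear decay to final_lr."""
    if step < warmup_steps:
        # ramp up linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # fraction of the decay phase completed
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr + frac * (final_lr - peak_lr)
```

In a JAX/Flax training loop such as the one the paper describes, an equivalent schedule would typically be built with `optax.join_schedules` over two `optax.linear_schedule` segments and passed to `optax.adafactor` (whose `decay_rate=0.8` matches the second-order moment above).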