NatureLM-audio: an Audio-Language Foundation Model for Bioacoustics

Authors: David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets a new state of the art on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we release our model weights, benchmark data, and open-source the code for model training and benchmark data generation.
Researcher Affiliation | Industry | David Robinson, Marius Miron, Masato Hagiwara, Olivier Pietquin (Earth Species Project)
Pseudocode | No | The paper describes the model architecture and training method using textual descriptions and mathematical equations (e.g., equations 1-3) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | To advance bioacoustics research, we release our model weights, benchmark data, and open-source the code for model training and benchmark data generation. Project page: https://earthspecies.github.io/naturelm-audio-demo/
Open Datasets | Yes | To train an audio-text model for bioacoustics, we compile a diverse dataset of text-audio pairs (Table 1). The data is collected through a combination of prompting on existing audio datasets, generating new LLM-generated text labels, and mixing new, procedurally augmented audio data. The dataset is comprised of bioacoustic recordings, general audio, speech, and music datasets. Table 1 lists many datasets used, such as 'CAP WavCaps (Mei et al., 2024)', 'CLS NSynth (Engel et al., 2017)', 'CLS LibriTTS (Zen et al., 2019)', 'CLS, DET, CAP Xeno-canto (Vellinga & Planqué, 2015)', and 'CLS, DET, CAP iNaturalist (iNaturalist)'.
Dataset Splits | Yes | To evaluate generalization, we create hold-out splits for Xeno-canto, iNaturalist, Animal Sound Archive, and Watkins datasets, used solely for benchmarking. unseen-species: 200 species held out from AnimalSpeak (Robinson et al., 2024). unseen-genus: We hold out entire genera whose family is well-represented (at least 250 training examples), totaling 101 unique species. unseen-family: We hold out entire families whose class is well-represented (at least 250 training examples), totaling 36 unique species and representing the hardest generalization setting.
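The unseen-genus criterion quoted above can be sketched in code. This is an illustrative reconstruction from the description only; the record fields, the `hold_out_genera` helper, the random sampling step, and the default arguments are assumptions, not the authors' actual pipeline:

```python
import random
from collections import Counter, defaultdict

def hold_out_genera(records, min_family_examples=250, n_holdout=1, seed=0):
    """Split records so that entire genera are unseen at test time.

    A genus is eligible for hold-out only if its family would still keep
    at least `min_family_examples` training examples after removal,
    mirroring the 'unseen-genus' criterion described in the paper.
    `records` is a list of dicts with 'genus' and 'family' keys (assumed
    schema for illustration).
    """
    family_counts = Counter(r["family"] for r in records)
    genus_counts = defaultdict(int)
    genus_family = {}
    for r in records:
        genus_counts[r["genus"]] += 1
        genus_family[r["genus"]] = r["family"]

    # A genus qualifies if removing it leaves its family well-represented.
    eligible = [g for g, n in genus_counts.items()
                if family_counts[genus_family[g]] - n >= min_family_examples]

    rng = random.Random(seed)
    held = set(rng.sample(eligible, min(n_holdout, len(eligible))))

    train = [r for r in records if r["genus"] not in held]
    test = [r for r in records if r["genus"] in held]
    return train, test, held
```

The same shape of filter, applied at the family level against a class-level count, would give the unseen-family split.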
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models (e.g., NVIDIA A100), CPU models (e.g., Intel Xeon), or specific cloud instances. It mentions models like 'BEATs' and 'Llama 3.1 8B' but not the underlying hardware they ran on.
Software Dependencies | No | The paper mentions several models and architectures used, such as 'BEATs (Chen et al., 2023)', 'Q-Former (Li et al., 2023)', 'Llama 3.1 8B (Dubey et al., 2024a)', and 'LoRA (Hu et al., 2022)'. However, it does not specify version numbers for any software dependencies or libraries like Python, PyTorch, TensorFlow, or CUDA.
Experiment Setup | Yes | We initialize the audio encoder weights using an existing BEATs checkpoint (BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt) and fully fine-tune it... We initialize the LLM from Llama-3.1-8B-Instruct and apply LoRA to all attention layers (rank: 32, alpha: 32, dropout: 0.1). We follow the proposed two-stage training strategy. In both stages, we use a linear warmup followed by a cosine learning rate schedule, with a peak learning rate of 9.0 × 10^-5 and an end learning rate of 2.0 × 10^-5. We use a batch size of 128 and run the first stage for 5.0 × 10^5 steps and the second stage for 1.6 × 10^6 steps. For inference, we use beam search with two beams, a repetition penalty of 1.0, and a length penalty of 1.0.
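The linear-warmup-then-cosine schedule quoted above can be written as a small function. The peak and end learning rates match the quoted values; the warmup length is an assumed placeholder, since the quote does not state it:

```python
import math

def lr_at(step, total_steps, warmup_steps=2_000, peak=9.0e-5, end=2.0e-5):
    """Linear warmup to `peak`, then cosine decay to `end`.

    Sketch of the schedule shape described in the paper; `warmup_steps`
    is an assumption, not a reported value.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak * step / warmup_steps
    # Cosine decay from peak (progress=0) down to end (progress=1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return end + 0.5 * (peak - end) * (1.0 + math.cos(math.pi * progress))
```

For the first training stage as quoted, `total_steps` would be 5.0 × 10^5; the schedule starts at 0, reaches 9.0 × 10^-5 at the end of warmup, and decays smoothly to 2.0 × 10^-5 at the final step.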