PooDLe🐩: Pooled and dense self-supervised learning from naturalistic videos
Authors: Alex N. Wang, Christopher Hoang, Yuwen Xiong, Yann LeCun, Mengye Ren
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method with experiments on the BDD100K driving video dataset and the Walking Tours first-person video dataset, demonstrating its ability to capture spatial understanding from a dense objective and semantic understanding via a pooled representation objective. |
| Researcher Affiliation | Collaboration | 1 New York University, 2 Meta |
| Pseudocode | No | The paper describes its methods using textual explanations, mathematical equations, and diagrams (Figure 2, Figure 3), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a project website link (https://agenticlearning.ai/poodle/) but does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | We pretrain PooDLe on raw videos from BDD100K (Yu et al., 2020) and Walking Tours (WT) (Venkataramanan et al., 2024) and evaluate them on semantic segmentation and object detection benchmarks. The BDD100K pretrained model is evaluated on in-distribution tasks as well as Cityscapes (Cordts et al., 2016), and the Walking Tours model on ADE20K (Agrawal et al., 2015) and our newly proposed Walking Tours Semantic benchmark. We also ablate our combination of loss functions and decoder components, as well as the effects of crop area and input resolutions. ... Flow is predicted using a supervised off-the-shelf RAFT model or an unsupervised UFlow (Jonschkowski et al., 2020) model that we train ourselves. For unsupervised training, we exactly follow UFlow and train on the KITTI (Geiger et al., 2013) dataset before finetuning on BDD100K (Yu et al., 2020) for 100,000 steps on daytime-only videos. ... We use torchvision for supervised ImageNet (IN1K) and weights released online for ImageNet-pretrained DINO. |
| Dataset Splits | Yes | BDD (Yu et al., 2020) consists of 100,000 dashcam driving videos... We pretrain with the 70,000 videos in the training split and evaluate on the dataset's semantic segmentation and object detection tasks. ... For each training epoch on WT, we divide each video into 10-second clips and randomly sample two frames 0.5 seconds apart from each clip... We utilize the 25,910 frames sourced from WT-all as the training set and the 6,170 frames sourced from the 3 new videos as the validation set. |
| Hardware Specification | Yes | The full model is trained on 16 A100s and takes about 30h for 100 epochs on BDD100K, or 18min per epoch. ... Ablation-sized experiments were run on 2 or 4 H100/A100 GPUs for a total of 40 epochs, taking 20-40h depending on the configuration. |
| Software Dependencies | No | The paper mentions several software components like ResNet-50, DeepLab v1, UperNet, Faster R-CNN, RAFT, UFlow, and OpenSeeD, often citing papers where they are introduced. However, it does not provide specific version numbers for these software components or underlying frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We use a ResNet-50 (R50) (He et al., 2016a) as our feature encoder, with the dense projector and predictor networks following FlowE (Xiong et al., 2021) and pooled counterparts following BYOL (Grill et al., 2020). ... When training on BDD, we sample two frames that are 0.5-1 seconds apart (Δt ∈ {15...30}) from each video. We then take two large crops from the same image coordinates of area [0.16, 0.45] of the original image and resize them to 512×1024 pixels before applying augmentations. ... For the local objective, we sample K = 6 subcrop pairs with a crop area of [0.05, 0.3] of the initial crop, resized to 192×192 for both BDD and WT. For subcrops, random spatial jitter is applied as 10% of the initial crop's height and width. ... AdamW is used as the optimizer with a weight decay of 0.01. A learning rate of 5e-4 is used with 32 GPUs and 4 image pairs per GPU for a total batch size of 128. Cosine learning rate decay is used with a schedule for 300 epochs, despite early termination due to compute limitations. LR warmup is used for 2 training epochs. ... We use the same occlusion formulation as DDFlow (Liu et al., 2019) and parameters α₁ = 0.1, α₂ = 0.5. |
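The frame-pair and crop sampling quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: function names, the dict layout, and the 30 fps assumption (implied by 0.5-1 s mapping to Δt ∈ {15...30}) are our own; only the numeric ranges come from the paper.

```python
import random

FPS = 30  # assumed: 0.5-1 s apart maps to a 15-30 frame gap, implying 30 fps


def sample_frame_offset(min_sec=0.5, max_sec=1.0, fps=FPS):
    """Pick a frame gap Delta-t in {15, ..., 30} for the two BDD frames."""
    return random.randint(int(min_sec * fps), int(max_sec * fps))


def sample_crop_params(area_range, out_hw, jitter_frac=0.0):
    """Describe one crop: relative area drawn from area_range, resize
    target out_hw, and spatial jitter as a fraction of the source
    crop's height/width (hypothetical parameterization)."""
    return {
        "area": random.uniform(*area_range),
        "out_hw": out_hw,
        "jitter_frac": jitter_frac,
    }


# Two large paired crops: area in [0.16, 0.45], resized to 512x1024 (BDD).
global_crop = sample_crop_params((0.16, 0.45), (512, 1024))

# K = 6 subcrop pairs for the local objective: area in [0.05, 0.3] of the
# initial crop, resized to 192x192, with 10% spatial jitter.
subcrops = [
    sample_crop_params((0.05, 0.3), (192, 192), jitter_frac=0.1)
    for _ in range(6)
]
```

In this reading, the global crops feed the dense flow-aligned objective at full 512×1024 resolution, while the six jittered 192×192 subcrops feed the pooled (BYOL-style) objective.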