Abstract:
Video data is critical for learning systems that can perceive, interact with, and reason about the world. Going beyond images, videos provide useful cues about motion, causality, and even geometry, while often comprising multiple modalities of data. Yet, they also heighten modeling challenges such as computational and annotation costs. This holds for both discriminative modeling (e.g., activity recognition, video question answering) and generative modeling (e.g., video generation or editing), leading to efficiency bottlenecks across video pipelines, from training to inference. In this thesis, we introduce techniques that mitigate such inefficiencies, focusing on (1) making inference faster, and (2) training with freely-available signals.
First, we look into the inference pipelines of two challenging video modeling setups, namely fine-grained activity recognition (i.e., temporal activity detection) and diffusion-based video editing. In activity detection, traditional video models sample inputs at fixed temporal resolutions and thus consume redundant information. To avoid this, we introduce Coarse-Fine Networks, which sub-sample inputs dynamically in time by learning the importance of each frame, significantly reducing the compute footprint (see the sketch below). In video editing, on the other hand, typical state-of-the-art models incur heavy memory and computational costs to generate temporally-coherent frames, in the form of diffusion inversion or cross-frame attention. To alleviate such latency overheads, we introduce Object-Centric Diffusion, which allocates fewer computations to background regions that are often unedited or arguably less important for perceptual quality, achieving up to 10x zero-shot speed-ups at comparable synthesis quality.
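To make the importance-based temporal sub-sampling idea concrete, below is a minimal PyTorch sketch that scores pooled per-frame features and keeps only the top-k frames. The module name, shapes, scorer design, and hard top-k selection are illustrative assumptions, not the exact Coarse-Fine Networks architecture.

```python
# Minimal sketch of importance-based temporal sub-sampling (illustrative only;
# not the exact Coarse-Fine Networks formulation). Names and shapes are assumptions.
import torch
import torch.nn as nn


class FrameImportanceSampler(nn.Module):
    """Scores each frame and keeps only the top-k most important ones."""

    def __init__(self, feat_dim: int, keep_frames: int):
        super().__init__()
        self.keep_frames = keep_frames
        # Lightweight per-frame scorer over pooled frame features.
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 2),
            nn.ReLU(),
            nn.Linear(feat_dim // 2, 1),
        )

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, time, feat_dim), e.g. pooled per-frame backbone features.
        scores = self.scorer(frame_feats).squeeze(-1)          # (batch, time)
        topk = torch.topk(scores, k=self.keep_frames, dim=1).indices
        topk, _ = torch.sort(topk, dim=1)                      # keep temporal order
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        sampled = torch.gather(frame_feats, 1, idx)            # (batch, keep_frames, feat_dim)
        return sampled, scores


if __name__ == "__main__":
    feats = torch.randn(2, 64, 256)                            # 64 frames, 256-d features
    sampler = FrameImportanceSampler(feat_dim=256, keep_frames=16)
    sampled, scores = sampler(feats)
    print(sampled.shape)                                       # torch.Size([2, 16, 256])
```

Note that hard top-k selection is non-differentiable, so learning frame importance end-to-end would in practice require a differentiable relaxation or an auxiliary signal; the sketch only conveys the compute-saving intent of dropping low-importance frames.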
Next, we investigate the training pipelines of video models in terms of their annotation costs. To avoid the need for expensive frame-level labels when pretraining for activity detection, we introduce Weakly-guided Self-supervised Detection Pretraining. It leverages weak video-level labels to design a pretext task that emulates detection (i.e., per-frame prediction); a sketch of this idea follows below. While requiring no extra annotations, our approach outperforms prior art at the same training budget. Going further, we introduce Video-conditioned Text Representations, a video vision-language model (VLM) that supports learning from auxiliary semantic concepts (given as text) without any annotations. Starting from an image VLM, we not only augment visual embeddings with temporal information, but also adapt text embeddings to video by grounding them in the visual modality. Our method shines especially in challenging setups where language can be more revealing than vision (e.g., few-shot recognition), while benefiting from label-free semantics.
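As a rough illustration of supervising per-frame predictions with only weak video-level labels, the following PyTorch sketch pools frame-level logits into a video-level prediction and trains against the weak label. The head, mean-pooling aggregation, and cross-entropy loss are illustrative assumptions, not the exact Weakly-guided Self-supervised Detection Pretraining objective.

```python
# Minimal sketch of pretraining frame-level predictions from weak video-level
# labels (illustrative only; pooling and loss choices are assumptions).
import torch
import torch.nn as nn


class FrameLevelHead(nn.Module):
    """Predicts a class score per frame; supervised only through video-level labels."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, time, feat_dim) -> per-frame logits (batch, time, classes)
        return self.classifier(frame_feats)


def weak_label_loss(frame_logits: torch.Tensor, video_labels: torch.Tensor) -> torch.Tensor:
    # Aggregate per-frame logits into a video-level prediction (mean pooling here),
    # then compare against the weak video-level label.
    video_logits = frame_logits.mean(dim=1)                    # (batch, classes)
    return nn.functional.cross_entropy(video_logits, video_labels)


if __name__ == "__main__":
    head = FrameLevelHead(feat_dim=256, num_classes=20)
    feats = torch.randn(4, 32, 256)                            # 4 videos, 32 frames each
    labels = torch.randint(0, 20, (4,))                        # weak video-level labels
    loss = weak_label_loss(head(feats), labels)
    loss.backward()
    print(float(loss))
```

The key point conveyed is that the per-frame head is trained without any frame-level annotation: the only supervision flows through the aggregated video-level prediction, emulating the structure of detection during pretraining.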
Dates
Wednesday, April 24, 2024 - 03:30pm to 04:30pm
Location
NCS 120
Event Title
Ph.D. Proposal Defense: 'Towards Efficient Video Understanding and Generation: Free Training Signals to Faster Inference' - Kumara Kahatapitiya