Abstract:
Temporal action segmentation is an important computer vision task with a wide range of applications, such as human activity analysis and video surveillance. We treat the problem as dense classification over time and decompose it into two essential and complementary objectives: individual reasoning for per-frame classification and temporally continuous reasoning for per-action classification. We introduce a novel two-stage window-based Transformer that addresses these two objectives separately and efficiently. In the first stage, an action prediction network uses attention with dilation to enlarge the receptive field without extra computation cost. In the second stage, a lightweight denoising bottleneck structure re-represents the imperfect predictions and eliminates over-segmentation. Finally, a simple but effective temporal label shift handles label ambiguity near action boundaries and further improves performance.
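To illustrate why dilation can enlarge the receptive field of window-based attention at no extra cost, the following is a minimal sketch of one possible way to group frame indices into dilated windows. The abstract does not specify the exact grouping scheme, so the function name `dilated_windows` and its parameters `num_frames`, `window`, and `dilation` are hypothetical and chosen only for illustration.

```python
def dilated_windows(num_frames, window, dilation):
    """Partition frame indices into attention windows.

    Illustrative assumption, not the paper's confirmed scheme: each window
    holds at most `window` frames sampled every `dilation` frames, and every
    frame belongs to exactly one window.
    """
    if window <= 0 or dilation <= 0:
        raise ValueError("window and dilation must be positive")

    groups = []
    block = window * dilation  # number of frames covered by one set of interleaved windows
    for start in range(0, num_frames, block):
        for offset in range(dilation):
            # Take every `dilation`-th frame, starting at `start + offset`.
            indices = [start + offset + k * dilation for k in range(window)]
            # Drop indices that fall past the end of the video.
            indices = [i for i in indices if i < num_frames]
            if indices:
                groups.append(indices)
    return groups


# Without dilation, each window covers 4 consecutive frames.
print(dilated_windows(8, 4, 1))
# With dilation 2, each window still holds 4 frames but spans 7 frames.
print(dilated_windows(8, 4, 2))
```

In this sketch the number of windows and the number of frames per window are the same for both settings when the video length is a multiple of `window * dilation`, so the attention cost per layer is unchanged, while the temporal span of each full window grows from `window` frames to `(window - 1) * dilation + 1` frames.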