Action Recognition from Single Timestamp Supervision in Untrimmed Videos

CVPR 2019


Labelling the start and end times of actions in long untrimmed videos is not only expensive, but often ambiguous. In this work, we use cheaper, roughly aligned labels: a single timestamp per action, close to (but not necessarily within) the action instance (see the coloured dots).

We propose a method that starts from sampling distributions initialised at these single timestamps. The initial distributions (top) may overlap and contain background frames. We iteratively refine the location and temporal extent of these distributions (bottom) using the classifier response during training.

We evaluate our method on three datasets for fine-grained recognition: THUMOS, BEOID and EPIC-Kitchens. Our work demonstrates that the sampling distributions converge to the location and extent of discriminative action segments.

Authors

  • Davide Moltisanti - University of Bristol, Visual Information Laboratory
  • Sanja Fidler - University of Toronto; NVIDIA; Vector Institute
  • Dima Damen - University of Bristol, Visual Information Laboratory

Abstract

Recognising actions in videos typically requires the start and end times of each action to be known. This labelling is not only subjective, but also expensive to acquire.
We propose a method that uses single timestamps located around each action instance, in untrimmed videos, as opposed to the standard start/end times. We replace expensive action bounds with sampling distributions initialised from these timestamps. We then use the classifier's response to iteratively update the sampling distributions. We demonstrate that these distributions converge to the location and extent of discriminative action segments.
We evaluate our method on three datasets for fine-grained recognition and show that single timestamps offer a reasonable compromise between recognition performance and labelling effort, performing comparably to full temporal supervision.


Updating the sampling distribution using the classifier response. Example from action 'open fridge' in EPIC-Kitchens.

Video

Poster

Bibtex

@InProceedings{moltisanti19action,
author = "Moltisanti, Davide and Fidler, Sanja and Damen, Dima",
title = "{A}ction {R}ecognition from {S}ingle {T}imestamp {S}upervision in {U}ntrimmed {V}ideos",
booktitle = "Computer Vision and Pattern Recognition (CVPR)",
year = "2019"
}

Acknowledgement

Research supported by: EPSRC LOCATE (EP/N033779/1) and EPSRC Doctoral Training Programme at the University of Bristol.

Part of this work was carried out during Davide's summer internship at the University of Toronto, working with Dr. Fidler.