ST-CLIP

Spatio-Temporal Context Prompting for Zero-Shot Action Detection

1National Tsing Hua University, Taiwan   2NVIDIA
WACV 2025

Abstract

Spatio-temporal action detection encompasses the tasks of localizing and classifying individual actions within a video. Recent works aim to enhance this process by incorporating interaction modeling, which captures the relationship between people and their surrounding context. However, these approaches have primarily focused on fully-supervised learning, and the current limitation lies in the lack of generalization capability to recognize unseen action categories. In this paper, we aim to adapt the pretrained image-language models to detect unseen actions. To this end, we propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. Meanwhile, our Context Prompting module will utilize contextual information to prompt labels, thereby enhancing the generation of more representative text features. Moreover, to address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism which employs pretrained visual knowledge to find each person’s interest context tokens, and then these tokens will be used for prompting to generate text features tailored to each individual. To evaluate the ability to detect unseen actions, we propose a comprehensive benchmark on J-HMDB, UCF101-24, and AVA datasets. The experiments show that our method achieves superior results compared to previous approaches and can be further extended to multi-action videos, bringing it closer to real-world applications.

ST-CLIP framework

We first extract person tokens for the person bounding boxes detected in each frame. Then, we perform temporal modeling on the neighboring frames to obtain the context tokens. After that, we leverage CLIP’s visual knowledge to perform person-context interaction on these tokens. In addition, we use the attention weights in each encoder layer to find the interest tokens for each person, and the Context Prompting layer uses these visual tokens to prompt the class names. Finally, the cosine similarities between the person-context relational tokens and the label prompting features determine the action classification scores.
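
The final scoring step can be illustrated with a small sketch. This is not the authors' implementation: the function names, the dictionary layout, and the toy 3-dimensional features are all assumptions made for illustration. It only shows how a cosine similarity between one person's feature and a set of per-class text features yields classification scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def classify_person(person_feature, text_features):
    """Score one person against every class.

    person_feature: stand-in for a person-context relational token.
    text_features: hypothetical dict mapping class name -> text feature
                   prompted for this particular person.
    """
    return {
        name: cosine_similarity(person_feature, feature)
        for name, feature in text_features.items()
    }


# Toy features, made up for illustration only.
person = [1.0, 0.0, 1.0]
prompts = {"run": [1.0, 0.0, 1.0], "sit": [0.0, 1.0, 0.0]}

scores = classify_person(person, prompts)
best = max(scores, key=scores.get)
```

Because the text features are tailored to each person, a sketch like this would be called once per detected person with that person's own set of prompted features.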

Proposed Benchmarks

We establish benchmarks for zero-shot spatio-temporal action detection on three popular datasets: J-HMDB, UCF101-24, and AVA. For J-HMDB and UCF101-24, we use 75% of the action classes for training and the remaining 25% for testing, and we conduct cross-validation over several train/test label splits. For AVA, we randomly select a subset of training videos such that none of them contains samples of a particular set of classes; these missing classes are then treated as the unseen classes for evaluation. During evaluation, we run inference over all classes in the validation videos but report performance only on the unseen classes. The following are the testing classes in each label split.
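
A minimal sketch of how such label splits could be generated, under stated assumptions: the random seeded shuffle, the rounding rule, the helper names, and the toy class and video names are all hypothetical, and the actual splits used in the benchmark are the ones listed on this page, not the output of this code. The second helper shows one possible reading of the AVA protocol, keeping only training videos that contain none of the held-out classes.

```python
import random


def split_classes(class_names, train_fraction=0.75, seed=0):
    """Split class names into (base, unseen) lists for one label split."""
    rng = random.Random(seed)
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return sorted(shuffled[:n_train]), sorted(shuffled[n_train:])


def videos_without_classes(video_labels, unseen_classes):
    """Keep videos whose labels share nothing with the unseen classes.

    video_labels: hypothetical dict mapping video id -> list of class names.
    """
    unseen = set(unseen_classes)
    return sorted(
        video for video, labels in video_labels.items()
        if not unseen & set(labels)
    )


# Toy class names, made up for illustration only.
classes = ["catch", "climb", "jump", "kick", "pour", "run", "sit", "walk"]

# Several label splits for cross-validation, one per seed.
splits = [split_classes(classes, seed=s) for s in range(3)]

# Toy AVA-style filtering: drop any video that contains an unseen class.
video_labels = {"v1": ["walk", "sit"], "v2": ["walk", "jump"], "v3": ["sit"]}
kept = videos_without_classes(video_labels, ["jump"])
```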

Experimental results

1. baseline: For a frame with detected individuals, the baseline uses the pretrained image encoder of CLIP to extract the image feature of the frame. It then computes the cosine similarities between this feature and the text features of each class name, and uses them as the action classification scores for these individuals.

2. baseline (person crop): Building on baseline 1, we crop each detected person from the frame to obtain a separate image feature per person for classification.

3. iCLIP: Classifies actions separately for each individual. (Its classification unit is the same as ours.)

4. ViCLIP: We also experiment with a video-language model. We extract the video feature map using a video encoder. Then, for each person, we apply ROIAlign with their bounding box to obtain the person feature for classification. (Its classification unit is the same as ours.)

5. Video classification methods (ActionCLIP, A5, X-CLIP, Vita-CLIP): Since each video in J-HMDB and UCF101-24 contains only a single action, these methods first classify the entire video into an action class and then treat all detected individuals in the video as performing that action. For AVA, we narrow the scope of classification from the entire video to tracklets.

6. Ours with the assumption of single-action video: We perform soft voting on each person’s classification score, extending our method to suit this scenario.
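
The soft voting in item 6 can be sketched as follows. This assumes soft voting means averaging the per-person score vectors and taking the highest-scoring class as the video-level action; the page does not spell out the exact aggregation, and the function name and toy scores are hypothetical.

```python
def soft_vote(per_person_scores):
    """Average per-person class scores and pick the top class.

    per_person_scores: list of dicts, one per detected person, each
                       mapping class name -> classification score.
    Returns (video_label, averaged_scores).
    """
    if not per_person_scores:
        raise ValueError("need at least one person's scores")
    classes = list(per_person_scores[0].keys())
    count = len(per_person_scores)
    averaged = {
        name: sum(scores[name] for scores in per_person_scores) / count
        for name in classes
    }
    video_label = max(averaged, key=averaged.get)
    return video_label, averaged


# Toy scores, made up for illustration only. Two of three people lean
# slightly toward "jump", but one is very confident about "run".
people = [
    {"run": 0.9, "jump": 0.1},
    {"run": 0.4, "jump": 0.6},
    {"run": 0.45, "jump": 0.55},
]

label, averaged = soft_vote(people)
```

In this toy case a hard majority vote would pick "jump", while averaging the scores picks "run", which is the practical difference between the two voting schemes.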

We present the experimental results for unseen classes in each label split. We also calculate the harmonic mean (H) of the average performance of both base and unseen classes.
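
For reference, the harmonic mean of two scores B (base) and U (unseen) is H = 2BU / (B + U). The sketch below computes it; the function name and the example numbers are made up and are not results from the paper.

```python
def harmonic_mean(base_score, unseen_score):
    """Harmonic mean H = 2 * B * U / (B + U) of two non-negative scores."""
    total = base_score + unseen_score
    if total == 0:
        return 0.0  # both scores are zero
    return 2 * base_score * unseen_score / total


# Illustrative values only: a large gap between base and unseen
# performance pulls H toward the lower of the two.
h = harmonic_mean(50.0, 25.0)
```

Unlike the arithmetic mean, H stays low whenever either score is low, so a method cannot score well by excelling only on base classes.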

BibTeX

@article{huang2024spatio,
  title={Spatio-Temporal Context Prompting for Zero-Shot Action Detection},
  author={Huang, Wei-Jhe and Chen, Min-Hung and Lai, Shang-Hong},
  journal={arXiv preprint arXiv:2408.15996},
  year={2024}
}