mostafabahaa25/hdtm-group-activity-recognition


Deep Learning Project for Volleyball Activity Recognition

An implementation of the seminal CVPR 2016 paper "A Hierarchical Deep Temporal Model for Group Activity Recognition."

Volleyball Activities

Implemented Paper

| Paper | Year | Original Paper | Original Implementation | Key Points |
|---|---|---|---|---|
| CVPR 16 | 2016 | Paper | Implementation | Two-stage hierarchical LSTM for group activity recognition |

Abstract

This project re-implements and extends the work presented in the original paper on group activity recognition using deep temporal modeling. The central idea of the study is that the temporal evolution of group activities can be inferred from the motion and actions of individual participants. To replicate and enhance this approach, a two-stage hierarchical LSTM model was implemented: the first LSTM captures temporal dynamics at the individual (player) level, while the second LSTM aggregates these representations to model group-level behavior. The model was trained and evaluated on both the Collective Activity Dataset and the Volleyball Dataset, achieving improved performance and demonstrating the effectiveness of hierarchical temporal modeling for structured activity understanding.

Model

B8

Figure 1: The figure illustrates the concept of group activity recognition using a hierarchical temporal model. In this framework, each individual within a scene is represented by a person-level LSTM, which captures the temporal dynamics of their actions. These individual representations are then integrated through a scene-level LSTM, allowing the model to learn and infer the overall group activity from collective temporal patterns.

B8

Figure 2: A detailed illustration of the implemented model architecture. Given tracklets of K players, each tracklet is first processed through a CNN backbone to extract spatial features, which are then passed to a person-level LSTM to model the temporal dynamics of individual actions. The resulting temporal representations from all players in the scene are pooled to form a unified scene representation. This aggregated feature is subsequently fed into a scene-level LSTM, which captures inter-player relationships and predicts the overall team activity.
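The pipeline in Figure 2 can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repository's code: the class name `HierarchicalLSTM`, the hidden sizes, the use of max pooling over players, and the choice to feed precomputed CNN features (instead of running the backbone inside the module) are all assumptions made to keep the example short.

```python
import torch
import torch.nn as nn


class HierarchicalLSTM(nn.Module):
    """Hypothetical sketch: person LSTM -> pool over players -> scene LSTM."""

    def __init__(self, feat_dim=2048, person_hidden=512, scene_hidden=512, num_classes=8):
        super().__init__()
        # Stage 1: temporal model shared by all players.
        self.person_lstm = nn.LSTM(feat_dim, person_hidden, batch_first=True)
        # Stage 2: temporal model over the pooled scene representation.
        self.scene_lstm = nn.LSTM(person_hidden, scene_hidden, batch_first=True)
        self.classifier = nn.Linear(scene_hidden, num_classes)

    def forward(self, feats):
        # feats: (batch, K players, T frames, feat_dim) precomputed CNN features.
        b, k, t, d = feats.shape
        person_out, _ = self.person_lstm(feats.reshape(b * k, t, d))
        person_out = person_out.reshape(b, k, t, -1)
        # Pool over the player axis (max pooling is an assumption here).
        scene_in = person_out.max(dim=1).values  # (batch, T, person_hidden)
        scene_out, _ = self.scene_lstm(scene_in)
        # Classify the group activity from the last time step.
        return self.classifier(scene_out[:, -1])


# Small dimensions so the example runs quickly on CPU.
model = HierarchicalLSTM(feat_dim=32, person_hidden=16, scene_hidden=16)
logits = model(torch.randn(2, 12, 9, 32))  # 2 clips, 12 players, 9 frames
print(logits.shape)
```

The default of 8 output classes matches the group activity classes listed in the dataset section; the 12 players and 9 frames in the usage lines are placeholder values.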

B8

Figure 3: Earlier baseline models discarded important spatial arrangement information when aggregating player features. In the updated implementation, a two-group pooling mechanism was introduced to preserve the spatial configuration of players by separating features based on team affiliation. This enhancement enables the model to better capture team-wise spatial dynamics, leading to improved group activity recognition performance.
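A rough sketch of the two-group pooling idea in plain Python is shown here. It assumes that team affiliation can be approximated by sorting players by the x-coordinate of their bounding box and splitting the sorted list in half; the function names `max_pool` and `team_pool` and the toy feature values are hypothetical.

```python
def max_pool(vectors):
    """Element-wise max over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*vectors)]


def team_pool(player_features, player_x):
    """Pool each team separately, then concatenate left and right.

    Assumption: the left half of the players (by x-coordinate) forms one
    team and the right half forms the other.
    """
    order = sorted(range(len(player_features)), key=lambda i: player_x[i])
    half = len(order) // 2
    left = [player_features[i] for i in order[:half]]
    right = [player_features[i] for i in order[half:]]
    return max_pool(left) + max_pool(right)


# Four toy players with 2-dimensional features and made-up x positions.
features = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.4], [0.7, 0.6]]
x_positions = [50, 900, 120, 700]
scene_vector = team_pool(features, x_positions)
print(scene_vector)  # [0.3, 0.9, 0.8, 0.6]
```

Because the two pooled vectors are concatenated in a fixed left/right order, the scene representation keeps track of which side of the court a feature came from, which a single pool over all players would discard.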

Key Contributions

  1. Enhanced Baseline Architectures: Redesigned and optimized the baseline models by integrating modern deep convolutional backbones (e.g., replacing AlexNet with ResNet50), resulting in improved feature representation and generalization.

  2. Significant Performance Gains: Achieved consistently higher accuracies across all baselines compared to the original study. Notably, the final baseline attained 91% accuracy, outperforming the paper’s reported 81.9%, highlighting the effectiveness of the proposed modifications.

  3. Introduction of a Novel Baseline (Baseline9): Developed an additional baseline that reached 91% accuracy without relying on a temporal module, demonstrating that strong spatial modeling alone can yield competitive results.

  4. Modern Framework Migration: Re-implemented the entire experimental pipeline using PyTorch, replacing the outdated Caffe framework to enable improved reproducibility, extensibility, and integration with current research workflows.

Accuracy and Improvement Over the Paper

| Baseline | Accuracy (Paper) | Accuracy (Our Implementation) |
|---|---|---|
| B1-Image Classification | 66.7% | 77% |
| B2-Person Classification | 64.6% | skipped |
| B3-Fine-tuned Person Classification | 68.1% | 75% |
| B4-Temporal Model with Image Features | 63.1% | 77% |
| B5-Temporal Model with Person Features | 67.6% | skipped |
| B6-Two-stage Model without LSTM 1 | 74.7% | 78% |
| B7-Two-stage Model without LSTM 2 | 80.2% | 81% |
| B8-Two-stage Hierarchical Model (1 group) | 70.3% | 82% |
| B8-Two-stage Hierarchical Model (2 groups) | 81.9% | 90% |
| B9-Fine-Tuned Team Spatial Classification | N/A (new baseline) | 91% |
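The gain over the paper for each baseline follows directly from the table above. The short sketch below computes it in percentage points for the baselines that have both numbers; the dictionary keys are shortened labels chosen for this example.

```python
# (paper accuracy, our accuracy) in percent, copied from the table above.
results = {
    "B1": (66.7, 77.0),
    "B3": (68.1, 75.0),
    "B4": (63.1, 77.0),
    "B6": (74.7, 78.0),
    "B7": (80.2, 81.0),
    "B8 (1 group)": (70.3, 82.0),
    "B8 (2 groups)": (81.9, 90.0),
}

# Improvement in percentage points, rounded to one decimal place.
gains = {name: round(ours - paper, 1) for name, (paper, ours) in results.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

For example, B8 with two groups moves from 81.9% to 90%, a gain of 8.1 points, while B7 gains only 0.8 points.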

Key Takeaways

  1. Enhanced Baseline Performance: Achieved substantial improvements in baseline accuracy, reaching up to 91%, compared to the 81.9% reported in the original paper, demonstrating the effectiveness of the refined architectures and training strategies.

  2. Transition to a Modern Framework: Re-implemented the entire system in PyTorch, providing a more flexible, extensible, and research-friendly environment than the original Caffe implementation.

  3. Introduction of New Baselines: Developed additional baseline models, such as Baseline9, which achieved 91% accuracy without relying on temporal modeling, showcasing the robustness of purely spatial representations.

  4. Comprehensive Ablation Study: Conducted an in-depth ablation study across multiple baselines, systematically analyzing architectural variations and quantifying their impact on overall performance.

  5. Hierarchical Temporal Modeling: Employed a two-stage hierarchical LSTM framework to effectively model both individual-level dynamics and group-level interactions, enhancing temporal understanding of complex scenes.

  6. Team-Aware Pooling Mechanism: Introduced a team-wise pooling strategy to explicitly differentiate between opposing sides, thereby reducing inter-team confusion and improving activity classification accuracy.

  7. Expanded and Annotated Dataset: Utilized a rich volleyball dataset containing frame-level annotations, bounding boxes, and both individual and group activity labels to support reproducible and detailed analysis.

  8. Configurable Experimental Setup: Integrated YAML-based configuration files, allowing efficient and transparent control over hyperparameters, model architecture, and training routines.

  9. Training Optimization and Visualization Tools: Incorporated early stopping, metric tracking, and visualization modules—such as confusion matrices and classification reports—to facilitate informed model evaluation and comparison.

  10. Scalable and Modular Architecture: Designed the system following a modular and scalable structure, enabling future extensions, experiments, and integration with other research pipelines.

Installation

  1. Clone the repository:

    git clone https://github.com/mostafabahaa25/hdtm-group-activity-recognition
    cd hdtm-group-activity-recognition

  2. Install the required dependencies:

    pip install -r requirements.txt

Dataset

The experiments were conducted using the Volleyball Dataset introduced in the original study. This dataset provides a well-structured benchmark for group activity recognition and includes:

  • Videos: 55 publicly available volleyball match videos sourced from YouTube, capturing diverse team interactions and gameplay scenarios.

  • Annotated Frames: 4,830 manually annotated frames, each containing player bounding boxes, individual action labels, and group activity annotations, enabling fine-grained spatial–temporal analysis at both the player and scene levels.

Dataset Labels

Group Activity Classes

| Class | Instances |
|---|---|
| Right set | 644 |
| Right spike | 623 |
| Right pass | 801 |
| Right winpoint | 295 |
| Left winpoint | 367 |
| Left pass | 826 |
| Left spike | 642 |
| Left set | 633 |

Action Classes

| Class | Instances |
|---|---|
| Waiting | 3601 |
| Setting | 1332 |
| Digging | 2333 |
| Falling | 1241 |
| Spiking | 1216 |
| Blocking | 2458 |
| Jumping | 341 |
| Moving | 5121 |
| Standing | 38696 |

Dataset Splits

  • Training Set: 24 videos. Together with the validation set, this covers roughly 2/3 of the 55 videos.
    • Train Videos: 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54.
  • Validation Set: 15 videos.
    • Validation Videos: 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51.
  • Test Set: 16 videos (roughly 1/3 of the videos).
    • Test Videos: 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47.
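The video IDs listed above can be written down as a mapping and sanity-checked for overlap. The sketch below is only an illustration; the variable name `splits` is hypothetical and the repository may store the split differently.

```python
# Video IDs per split, copied from the lists above.
splits = {
    "train": [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39,
              40, 41, 42, 48, 50, 52, 53, 54],
    "val": [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51],
    "test": [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47],
}

all_ids = [video for ids in splits.values() for video in ids]

# No video may appear in more than one split.
assert len(all_ids) == len(set(all_ids))
# Together the splits cover videos 0..54, i.e. all 55 videos.
assert sorted(all_ids) == list(range(55))

sizes = {name: len(ids) for name, ids in splits.items()}
print(sizes)  # {'train': 24, 'val': 15, 'test': 16}
```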

Dataset Sample

B8

The dataset is available for download from the Deep Activity Rec repository on GitHub, or on Kaggle.

Features

  • Comprehensive Baseline Implementations: Includes multiple experimental baselines (Baseline1, Baseline3, Baseline4, Baseline6, Baseline7, Baseline8, and Baseline9), each designed to explore different architectural and temporal modeling strategies.

  • Configurable Experimental Setup: Employs YAML-based configuration files to facilitate transparent, reproducible, and easily adjustable experimentation across various hyperparameters and model architectures.

  • Early Stopping Strategy: Integrates a built-in early stopping mechanism to prevent overfitting and ensure optimal model convergence based on validation performance.

  • Performance Monitoring and Visualization: Provides detailed evaluation metrics, including confusion matrices and classification reports, for comprehensive model assessment and interpretability.

  • Modular and Scalable Architecture: Designed with a modular codebase that supports scalability, enabling effortless integration of new components and future research extensions.
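The early stopping strategy mentioned in the feature list can be illustrated with a small helper. This is a generic sketch under assumptions, not the project's implementation: the class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the toy accuracy history are all hypothetical.

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch; return True when training should stop."""
        if val_accuracy > self.best + self.min_delta:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
history = [0.70, 0.75, 0.74, 0.75, 0.80]  # made-up validation accuracies
stopped_at = None
for epoch, accuracy in enumerate(history):
    if stopper.step(accuracy):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)  # 3 0.75
```

With a patience of 2, training stops at epoch 3 because epochs 2 and 3 do not beat the best value of 0.75, so the later 0.80 is never reached. A larger patience trades extra training time for a lower chance of stopping too early.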

Ablation Study

Baseline Analysis and Insights

B1 - Image Classification

  • Description: Fine-tuned a ResNet50 model for frame-level classification, treating each frame independently without temporal modeling.

  • Insights: Demonstrated strong performance on static scene recognition but failed to capture the temporal dependencies critical for sequential group activities.

  • Key Features: Frame-level classification; absence of temporal or structural context.

B3 - Fine-Tuned Person Classification

  • Description: Fine-tuned ResNet50 on cropped player regions to perform individual action classification, followed by feature pooling for group activity recognition.

  • Insights: Improved granularity by focusing on individual players; however, performance remained limited due to the lack of explicit temporal modeling between frames.

  • Key Features: Person-centric classification; pooled representations without temporal aggregation.

B4 - Temporal Model with Image Features

  • Description: Integrated an LSTM to capture temporal dependencies while continuing to use global image-level features as input.

  • Insights: Enabled sequential understanding of visual cues, yet suffered from the absence of structured player representations and spatial relationships.

  • Key Features: LSTM-based temporal modeling; global image features without explicit player modeling.

B6 - Two-stage Model Without Player-Level LSTM

  • Description: Omitted the player-level LSTM, retaining only the scene-level LSTM while relying on features extracted from individual players.

  • Insights: Scene-level modeling effectively captured holistic activity trends but at the expense of fine-grained temporal details at the player level.

  • Key Features: Scene-level temporal modeling; absence of player-level temporal learning.

B7 - Two-stage Model without LSTM 2

  • Description: Retained the player-level LSTM while removing the scene-level temporal module.

  • Insights: Preserved temporal consistency for individual players but struggled to integrate collective scene dynamics necessary for accurate group activity recognition.

  • Key Features: Player-level temporal modeling; no global scene-level temporal aggregation.

B8 - Two-stage Hierarchical Model

  • Description: Incorporates both player-level and scene-level LSTMs to construct a hierarchical temporal architecture, enabling the model to jointly capture individual actions and collective group behaviors over time.

  • Insights: Demonstrates the effectiveness of hierarchical temporal modeling by successfully integrating fine-grained player dynamics with high-level scene context, resulting in a more coherent understanding of group activities.

  • Key Features: Hierarchical LSTM framework; joint modeling of individual and team-level temporal dependencies.

B8 - Two-stage Hierarchical Model with Team Pooling

  • Description: Adds team-wise pooling before applying scene-level LSTM.

  • Insights: Reduces confusion between left and right teams, improving classification.

  • Key Features: Team-wise pooling, hierarchical scene modeling.

B9 - Fine-Tuned Team Spatial Classification

  • Description: Fine-tunes ResNet50 on individual player actions before pooling team representations.

  • Insights: Reaches accuracy on par with the best temporal model without using any temporal module, by leveraging fine-grained person representations.

  • Key Features: ResNet50-based person classification, Team-wise pooling, optimized scene classification.

Baselines Implementation Comparison

Overview

This table outlines the progression of different baseline models, highlighting their implementation improvements and accuracy as measured in our implementation.

| Baseline Model | Implementation | Accuracy (Our Implementation) |
|---|---|---|
| B1 - Image Classification | Fine-tune ResNet50 on image level → Classify group activity. | 78% |
| B2 - Person Classification | Extract person features (ResNet50 without fine-tuning) → Pool features over players → Classify group activity. Skipped because it involves no fine-tuning. | N/A |
| B3 - Fine-tuned Person Classification | Fine-tune ResNet50 on cropped person actions → Extract features → Pool features over players → Classify group activity. | 76% |
| B4 - Temporal Model with Image Features | Based on B1 → Extract image features → Apply LSTM for temporal modeling → Classify group activity. | 80% |
| B5 - Temporal Model with Person Features | Based on B2 → Apply LSTM for player-level modeling → Pool features → Classify group activity. Skipped because B2 was skipped and the same idea is applied in B7. | N/A |
| B6 - Two-stage Model without LSTM 1 | Based on B3 → Extract person features → Pool features → Apply LSTM for scene-level modeling → Classify group activity. | 81% |
| B7 - Two-stage Model without LSTM 2 | Based on B3 → Extract person features → Apply LSTM for player-level modeling → Pool features → Classify group activity. | 88% |
| B8 - Two-stage Hierarchical Model | Based on B3 → Extract person features → Apply LSTM for player-level modeling → Pool features over players → Apply LSTM for scene-level modeling → Classify group activity. | 89.20% |
| B8 - Two-stage Hierarchical Model with Team Pooling | Based on B7 → Extract person features → Apply LSTM for player-level modeling → Pool features per team → Concatenate both teams → Apply LSTM for scene-level modeling → Classify group activity. | 93% |
| B9 - Fine-Tuned Team Spatial Classification | Fine-tune ResNet50 on cropped person actions → Extract player features → Pool features per team → Classify group activity. | 92% |

Key Takeaways

  • Baseline 1 → 3: Early models focus on frame-based CNN classification before shifting to person-level classification.
  • Baseline 4 → 5: Introduces LSTM-based temporal modeling for both image and player-level features.
  • Baseline 6 → 7: Evaluates the effects of removing person-level or scene-level LSTMs.
  • Baseline 8 → 9: Moves toward hierarchical team-aware pooling and an end-to-end structured classification approach.

Evaluation Metrics & Observations

Baseline 6 - Two-stage Model without LSTM 1 : (Accuracy: ~78%)

B6

  • L-set and r-set recognition reached 89% recall, benefiting from scene-level representations.
  • Pass actions remain a weak point (r-pass at 67% recall), showing that removing person-level LSTM impacts individual action recognition.
  • Balanced macro and weighted accuracy scores, indicating overall improvement in scene-level understanding.
  • L-winpoint performance jumped to 83% recall, meaning the model is now effectively distinguishing game-ending actions.

Baseline 7 - Two-stage Model without LSTM 2 : (Accuracy: ~81%)

B7

  • Pass recognition significantly improved (l-pass: 86.3%, r-pass: 83.3% recall) compared to earlier baselines.
  • Spike actions remain highly distinguishable (l-spike: 87.3%, r-spike: 86.1%), indicating robust temporal modeling.
  • Winpoint actions are weaker (l_winpoint: 53.9%, r_winpoint: 56.3%), suggesting some confusion in game-ending states.
  • Strong macro and weighted averages (~79%), proving that hierarchical structure helps even without scene-level LSTM.

Baseline 8 - Two-stage Hierarchical Model : (Accuracy: ~82%)

B8

  • Pass actions maintain strong recognition (r-pass: 85.2% recall), improving from B7.
  • Winpoint classification improves (l_winpoint: 64.7%), reducing confusion in match-ending events.
  • Team interactions are still not explicitly modeled, leaving room for improvement.

Baseline 8 - Two-stage Hierarchical Model with Team Pooling : (Accuracy: ~90%)

B8

  • Highest overall performance so far, with a macro average of 0.90.
  • Team-aware pooling significantly improves winpoint actions (l_winpoint: 94.1%, r_winpoint: 87.4%).
  • Better precision-recall balance across all activity classes.
  • Spike and pass actions remain dominant at 88–96% accuracy, indicating the success of structured representation.
  • Minimal misclassification, highlighting the model’s strong team-aware learning.

Baseline 9 - Fine-Tuned Team Spatial Classification : (Accuracy: ~92%)

B9

  • Very close to B8 with Team Pooling in overall performance (91%).
  • Pass and spike actions maintain high precision and recall, ensuring smooth team-based action understanding.
  • Team-wise pooling of fine-tuned person features proves highly effective even without a temporal module.

Key Takeaways

  1. Pass action recognition improves consistently, peaking at ~92% recall in B8 with Team Pooling.
  2. Winpoint classification struggles in early models but reaches 93% in B9, proving the importance of structured team representation.
  3. Spiking actions remain robust across all baselines, with minor refinements from B7 onward.
  4. Hierarchical modeling (B7,B8) yields the best results, demonstrating the effectiveness of structured feature learning.
  5. Team pooling (B8 with team separation) plays a crucial role in reducing left/right confusion and boosting final performance.

Configuration

Model configurations are stored in the configs/ directory. Adjust parameters such as learning rate, batch size, and number of epochs by editing the relevant .yml file.
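One way to picture how values from a .yml file could override defaults is sketched below. Everything here is an assumption for illustration: the `TrainConfig` fields, their default values, and the `apply_overrides` helper are hypothetical, and the `parsed_yaml` dictionary stands in for whatever mapping a YAML parser would return for a config file.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical defaults; the real config keys may differ.
    learning_rate: float = 1e-4
    batch_size: int = 8
    num_epochs: int = 20


def apply_overrides(config, overrides):
    """Return a copy of `config` with values from `overrides` applied."""
    known = {f.name for f in fields(config)}
    unknown = set(overrides) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return replace(config, **overrides)


# Stand-in for the mapping obtained by parsing a file under configs/.
parsed_yaml = {"learning_rate": 3e-4, "num_epochs": 40}
config = apply_overrides(TrainConfig(), parsed_yaml)
print(config)
```

Keys that are not overridden keep their defaults, and rejecting unknown keys guards against a misspelled hyperparameter being ignored during a long training run.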

Evaluation

Evaluation is performed automatically after training. Results include metrics like confusion matrices and classification reports, which are saved in the results/ directory.
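The per-class recall and macro average figures quoted in this README are derived from a confusion matrix. The plain-Python sketch below shows that relationship on made-up predictions; the helper names and the toy labels are hypothetical, and the project itself may rely on a library for these metrics.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_recall(matrix):
    """Recall of class i = correct predictions for i / all samples of class i."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


labels = ["l-pass", "r-pass", "l-winpoint"]
# Made-up ground truth and predictions for ten clips.
y_true = ["l-pass"] * 4 + ["r-pass"] * 4 + ["l-winpoint"] * 2
y_pred = ["l-pass", "l-pass", "l-pass", "r-pass",
          "r-pass", "r-pass", "r-pass", "r-pass",
          "l-winpoint", "l-pass"]

matrix = confusion_matrix(y_true, y_pred, labels)
recalls = per_class_recall(matrix)

# Macro average treats every class equally.
macro_recall = sum(recalls) / len(recalls)
# Weighted average weights each class by its number of samples.
total = sum(sum(row) for row in matrix)
weighted_recall = sum(r * sum(row) for r, row in zip(recalls, matrix)) / total

print(matrix)   # [[3, 1, 0], [0, 4, 0], [1, 0, 1]]
print(recalls)  # [0.75, 1.0, 0.5]
```

In this toy case the macro recall is 0.75 and the weighted recall is 0.8, which equals overall accuracy. A gap between the two indicates that smaller classes, such as the winpoint classes here, are recognized less reliably than the frequent ones.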

Logging and Outputs

Logs and model outputs are organized into timestamped folders within the results/ directory for easy tracking of experiments.
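A timestamped run folder can be created with the standard library as sketched below. The folder naming pattern, the `make_run_dir` helper, and the placeholder file name are assumptions; the example writes into a temporary directory rather than the real results/ directory.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def make_run_dir(results_root, run_name):
    """Create results_root/<run_name>_<YYYYmmdd_HHMMSS> and return its path."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(results_root) / f"{run_name}_{stamp}"
    # exist_ok=False so two runs never write into the same folder.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


with tempfile.TemporaryDirectory() as root:
    run_dir = make_run_dir(root, "baseline8")
    # Placeholder output; a real run would save reports and plots here.
    (run_dir / "classification_report.txt").write_text("placeholder\n")
    created = sorted(p.name for p in run_dir.iterdir())

print(run_dir.name, created)
```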

About

Re-implementation and extension of the CVPR 2016 paper, “A Hierarchical Deep Temporal Model for Group Activity Recognition.” This project reproduces the original two-stage LSTM architecture in PyTorch with enhanced baselines, modernized components, and improved performance.
