An implementation of the seminal CVPR 2016 paper: "A Hierarchical Deep Temporal Model for Group Activity Recognition."
| Paper | Year | Original Paper | Original Implementation | Key Points |
|---|---|---|---|---|
| CVPR 16 | 2016 | Paper | Implementation | Two-stage hierarchical LSTM for group activity recognition |
This project re-implements and extends the work presented in the original paper on group activity recognition using deep temporal modeling. The central idea of the study is that the temporal evolution of group activities can be inferred from the motion and actions of individual participants. To replicate and enhance this approach, a two-stage hierarchical LSTM model was implemented: the first LSTM captures temporal dynamics at the individual (player) level, while the second LSTM aggregates these representations to model group-level behavior. The model was trained and evaluated on both the Collective Activity Dataset and the Volleyball Dataset, achieving improved performance and demonstrating the effectiveness of hierarchical temporal modeling for structured activity understanding.
Figure 1: The figure illustrates the concept of group activity recognition using a hierarchical temporal model. In this framework, each individual within a scene is represented by a person-level LSTM, which captures the temporal dynamics of their actions. These individual representations are then integrated through a scene-level LSTM, allowing the model to learn and infer the overall group activity from collective temporal patterns.
Figure 2: A detailed illustration of the implemented model architecture. Given tracklets of K players, each tracklet is first processed through a CNN backbone to extract spatial features, which are then passed to a person-level LSTM to model the temporal dynamics of individual actions. The resulting temporal representations from all players in the scene are pooled to form a unified scene representation. This aggregated feature is subsequently fed into a scene-level LSTM, which captures inter-player relationships and predicts the overall team activity.
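The pipeline described in Figure 2 can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repository's actual code: the class name, hidden sizes, and the choice of max-pooling over players are assumptions, and random tensors stand in for the CNN backbone features to keep the example small.

```python
import torch
import torch.nn as nn


class HierarchicalGroupModel(nn.Module):
    """Person-level LSTM -> pooling over players -> scene-level LSTM (illustrative)."""

    def __init__(self, feat_dim=2048, person_hidden=512, scene_hidden=512, num_classes=8):
        super().__init__()
        # Hidden sizes are assumptions; 8 matches the left/right set, spike,
        # pass, and winpoint group activities discussed in this README.
        self.person_lstm = nn.LSTM(feat_dim, person_hidden, batch_first=True)
        self.scene_lstm = nn.LSTM(person_hidden, scene_hidden, batch_first=True)
        self.classifier = nn.Linear(scene_hidden, num_classes)

    def forward(self, feats):
        # feats: (batch, K players, T frames, feat_dim) CNN features per player crop
        b, k, t, d = feats.shape
        # Run the person-level LSTM on every tracklet independently
        person_out, _ = self.person_lstm(feats.reshape(b * k, t, d))
        person_out = person_out.reshape(b, k, t, -1)
        # Max-pool over players at each time step -> (batch, T, person_hidden)
        scene_in = person_out.max(dim=1).values
        scene_out, _ = self.scene_lstm(scene_in)
        # Predict the group activity from the last time step
        return self.classifier(scene_out[:, -1])


# Small dimensions so the sketch runs quickly: 2 clips, 12 players, 9 frames
model = HierarchicalGroupModel(feat_dim=64, person_hidden=32, scene_hidden=32)
logits = model(torch.randn(2, 12, 9, 64))
```

Here `logits` has one row per clip and one column per group activity class. In a real setup the random input would be replaced by features extracted from the cropped player tracklets.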
Figure 3: Earlier baseline models discarded important spatial arrangement information when aggregating player features. In the updated implementation, a two-group pooling mechanism was introduced to preserve the spatial configuration of players by separating features based on team affiliation. This enhancement enables the model to better capture team-wise spatial dynamics, leading to improved group activity recognition performance.
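A minimal sketch of the two-group pooling idea is shown below. How team affiliation is determined is not stated here, so this example assumes players can be split into teams by sorting their bounding-box centres from left to right and halving the list; the function name `team_pool` and the use of max-pooling are likewise hypothetical.

```python
import torch


def team_pool(person_feats, x_centers):
    """Pool players per team and concatenate the two team descriptors.

    person_feats: (K, D) tensor, one feature vector per player
    x_centers:    (K,) tensor, horizontal centre of each player's bounding box
    Assumption: the left half of the sorted players forms one team.
    """
    order = torch.argsort(x_centers)              # players from left to right
    half = person_feats.shape[0] // 2
    left = person_feats[order[:half]].max(dim=0).values
    right = person_feats[order[half:]].max(dim=0).values
    return torch.cat([left, right], dim=0)        # shape (2 * D,)


feats = torch.tensor([[1.0, 0.0], [0.0, 5.0], [3.0, 1.0], [2.0, 2.0]])
x = torch.tensor([10.0, 300.0, 20.0, 400.0])
pooled = team_pool(feats, x)
```

Unlike a single pool over all players, the concatenated vector keeps the left and right team descriptors in separate positions, so the classifier can tell which side performed an action.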
- Enhanced Baseline Architectures: Redesigned and optimized the baseline models by integrating modern deep convolutional backbones (e.g., replacing AlexNet with ResNet50), resulting in improved feature representation and generalization.
- Significant Performance Gains: Achieved consistently higher accuracies across all baselines compared to the original study. Notably, the final baseline attained 91% accuracy, outperforming the paper’s reported 81.9%, highlighting the effectiveness of the proposed modifications.
- Introduction of a Novel Baseline (Baseline9): Developed an additional baseline that reached 91% accuracy without relying on a temporal module, demonstrating that strong spatial modeling alone can yield competitive results.
- Modern Framework Migration: Re-implemented the entire experimental pipeline using PyTorch, replacing the outdated Caffe framework to enable improved reproducibility, extensibility, and integration with current research workflows.
| Baseline | Accuracy (Paper) | Accuracy (Our Implementation) |
|---|---|---|
| B1-Image Classification | 66.7% | 77% |
| B2-Person Classification | 64.6% | skipped |
| B3-Fine-tuned Person Classification | 68.1% | 75% |
| B4-Temporal Model with Image Features | 63.1% | 77% |
| B5-Temporal Model with Person Features | 67.6% | skipped |
| B6-Two-stage Model without LSTM 1 | 74.7% | 78% |
| B7-Two-stage Model without LSTM 2 | 80.2% | 81% |
| B8-Two-stage Hierarchical Model(1 group) | 70.3% | 82% |
| B8-Two-stage Hierarchical Model(2 groups) | 81.9% | 90% |
| B9-Fine-Tuned Team Spatial Classification | New-Baseline | 91% |
- Enhanced Baseline Performance: Achieved substantial improvements in baseline accuracy, reaching up to 91%, compared to the 81.9% reported in the original paper, demonstrating the effectiveness of the refined architectures and training strategies.
- Transition to a Modern Framework: Re-implemented the entire system in PyTorch, providing a more flexible, extensible, and research-friendly environment than the original Caffe implementation.
- Introduction of New Baselines: Developed additional baseline models, such as Baseline9, which achieved 91% accuracy without relying on temporal modeling, showcasing the robustness of purely spatial representations.
- Comprehensive Ablation Study: Conducted an in-depth ablation study across multiple baselines, systematically analyzing architectural variations and quantifying their impact on overall performance.
- Hierarchical Temporal Modeling: Employed a two-stage hierarchical LSTM framework to model both individual-level dynamics and group-level interactions, enhancing temporal understanding of complex scenes.
- Team-Aware Pooling Mechanism: Introduced a team-wise pooling strategy to explicitly differentiate between opposing sides, thereby reducing inter-team confusion and improving activity classification accuracy.
- Expanded and Annotated Dataset: Utilized a rich volleyball dataset containing frame-level annotations, bounding boxes, and both individual and group activity labels to support reproducible and detailed analysis.
- Configurable Experimental Setup: Integrated YAML-based configuration files, allowing efficient and transparent control over hyperparameters, model architecture, and training routines.
- Training Optimization and Visualization Tools: Incorporated early stopping, metric tracking, and visualization modules (such as confusion matrices and classification reports) to facilitate informed model evaluation and comparison.
- Scalable and Modular Architecture: Designed the system following a modular and scalable structure, enabling future extensions, experiments, and integration with other research pipelines.
- Clone the repository: `git clone https://github.com/mostafabahaa25/hdtm-group-activity-recognition`, then `cd hdtm-group-activity-recognition`
- Install the required dependencies: `pip install -r requirements.txt`
The experiments were conducted using the Volleyball Dataset introduced in the original study. This dataset provides a well-structured benchmark for group activity recognition and includes:
- Videos: 55 publicly available volleyball match videos sourced from YouTube, capturing diverse team interactions and gameplay scenarios.
- Annotated Frames: 4,830 manually annotated frames, each containing player bounding boxes, individual action labels, and group activity annotations, enabling fine-grained spatial–temporal analysis at both the player and scene levels.
- Training Set: 24 videos (together with the validation set, roughly 2/3 of the 55 videos).
- Train Videos: 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54.
- Validation Set: 15 videos.
- Validation Videos: 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51.
- Test Set: 16 videos (roughly 1/3 of the videos).
- Test Videos: 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47.
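The split above can be written down as plain Python constants, which makes it easy to check that the three sets are disjoint and cover all 55 videos. The constant names and the `split_of` helper are illustrative and not taken from the repository.

```python
# Video IDs copied from the split listed above
TRAIN_VIDEOS = [1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32,
                36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54]
VAL_VIDEOS = [0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51]
TEST_VIDEOS = [4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47]


def split_of(video_id):
    """Return 'train', 'val', or 'test' for a given video ID."""
    if video_id in TRAIN_VIDEOS:
        return "train"
    if video_id in VAL_VIDEOS:
        return "val"
    if video_id in TEST_VIDEOS:
        return "test"
    raise ValueError(f"unknown video id: {video_id}")


all_videos = sorted(TRAIN_VIDEOS + VAL_VIDEOS + TEST_VIDEOS)
# Every ID from 0 to 54 appears exactly once across the three sets
assert all_videos == list(range(55))
```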
The dataset is available for download from the Deep Activity Rec repository on GitHub, or on Kaggle.
- Comprehensive Baseline Implementations: Includes multiple experimental baselines (Baseline1, Baseline3, Baseline4, Baseline5, Baseline6, Baseline7, Baseline8, and Baseline9), each designed to explore different architectural and temporal modeling strategies.
- Configurable Experimental Setup: Employs YAML-based configuration files to facilitate transparent, reproducible, and easily adjustable experimentation across various hyperparameters and model architectures.
- Early Stopping Strategy: Integrates a built-in early stopping mechanism to prevent overfitting and ensure optimal model convergence based on validation performance.
- Performance Monitoring and Visualization: Provides detailed evaluation metrics, including confusion matrices and classification reports, for comprehensive model assessment and interpretability.
- Modular and Scalable Architecture: Designed with a modular codebase that supports scalability, enabling effortless integration of new components and future research extensions.
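As an illustration of the early stopping strategy mentioned in the list above, a patience-based stopper could look like the following. This is a generic sketch, not the repository's implementation: the class name, the `patience` and `min_delta` parameters, and the choice of validation loss as the monitored metric are assumptions.

```python
class EarlyStopping:
    """Stop training when the validation loss stops improving (illustrative)."""

    def __init__(self, patience=5, min_delta=0.0):
        self.patience = patience      # epochs to wait without improvement
        self.min_delta = min_delta    # minimum decrease that counts as improvement
        self.best = None
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if self.best is None or val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
# The loss improves twice, then fails to improve for two epochs in a row
decisions = [stopper.step(loss) for loss in [0.9, 0.7, 0.71, 0.72]]
```

In a training loop, `step` would be called once per epoch after validation, and the loop would break as soon as it returns `True`.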
B1 - Image Classification:
- Description: Fine-tuned a ResNet50 model for frame-level classification, treating each frame independently without temporal modeling.
- Insights: Demonstrated strong performance on static scene recognition but failed to capture the temporal dependencies critical for sequential group activities.
- Key Features: Frame-level classification; absence of temporal or structural context.
B3 - Fine-tuned Person Classification:
- Description: Fine-tuned ResNet50 on cropped player regions to perform individual action classification, followed by feature pooling for group activity recognition.
- Insights: Improved granularity by focusing on individual players; however, performance remained limited due to the lack of explicit temporal modeling between frames.
- Key Features: Person-centric classification; pooled representations without temporal aggregation.
B4 - Temporal Model with Image Features:
- Description: Integrated an LSTM to capture temporal dependencies while continuing to use global image-level features as input.
- Insights: Enabled sequential understanding of visual cues, yet suffered from the absence of structured player representations and spatial relationships.
- Key Features: LSTM-based temporal modeling; global image features without explicit player modeling.
B6 - Two-stage Model without LSTM 1:
- Description: Omitted the player-level LSTM, retaining only the scene-level LSTM while relying on features extracted from individual players.
- Insights: Scene-level modeling effectively captured holistic activity trends but at the expense of fine-grained temporal details at the player level.
- Key Features: Scene-level temporal modeling; absence of player-level temporal learning.
B7 - Two-stage Model without LSTM 2:
- Description: Retained the player-level LSTM while removing the scene-level temporal module.
- Insights: Preserved temporal consistency for individual players but struggled to integrate collective scene dynamics necessary for accurate group activity recognition.
- Key Features: Player-level temporal modeling; no global scene-level temporal aggregation.
B8 - Two-stage Hierarchical Model:
- Description: Incorporates both player-level and scene-level LSTMs to construct a hierarchical temporal architecture, enabling the model to jointly capture individual actions and collective group behaviors over time.
- Insights: Demonstrates the effectiveness of hierarchical temporal modeling by successfully integrating fine-grained player dynamics with high-level scene context, resulting in a more coherent understanding of group activities.
- Key Features: Hierarchical LSTM framework; joint modeling of individual and team-level temporal dependencies.
B8 - Two-stage Hierarchical Model with Team Pooling:
- Description: Adds team-wise pooling before applying the scene-level LSTM.
- Insights: Reduces confusion between left and right teams, improving classification.
- Key Features: Team-wise pooling; hierarchical scene modeling.
B9 - Fine-Tuned Team Spatial Classification:
- Description: Fine-tunes ResNet50 on individual player actions before pooling team representations.
- Insights: Reaches accuracy close to the best hierarchical model without any temporal module, by leveraging fine-grained person representations.
- Key Features: ResNet50-based person classification; team-wise pooling; optimized scene classification.
This table outlines the progression of different baseline models, highlighting their implementation improvements and accuracy as measured in our implementation.
| Baseline Model | Baselines Implementation | Accuracy (Our Implementation) |
|---|---|---|
| B1 - Image Classification | Fine-tune ResNet50 On Image Level → Classify group activity. | 78% |
| B2 - Person Classification | Extract person features (ResNet50 without fine-tuning) → Pool features over players → Classify group activity. Skipped because it involves no fine-tuning. | N/A |
| B3 - Fine-tuned Person Classification | Fine-tune ResNet50 on Cropped Person Actions → Extract features → Pool features over players → Classify group activity. | 76% |
| B4 - Temporal Model with Image Features | Based on B1 → Extract image features → Apply LSTM for temporal modeling → Classify group activity. | 80% |
| B5 - Temporal Model with Person Features | Based on B2 → Apply LSTM for player-level modeling → Pool features → Classify group activity. Skipped because B2 was skipped and the same idea is covered by B7. | N/A |
| B6 - Two-stage Model without LSTM 1 | Based on B3 → Extract person features → Pool features → Apply LSTM for scene-level modeling → Classify group activity. | 81% |
| B7 - Two-stage Model without LSTM 2 | Based on B3 → Extract person features → Apply LSTM for player-level modeling → Pool features → Classify group activity. | 88% |
| B8 - Two-stage Hierarchical Model | Based on B3 → Extract person features → Apply LSTM for player-level modeling → Pool features over players → Apply LSTM for scene-level modeling → Classify group activity. | 89.20% |
| B8 - Two-stage Hierarchical Model with Team Pooling | Based on B7 → Extract person features → Apply LSTM for player-level modeling → Pool features per team → Concatenate Both Teams → Apply LSTM for scene-level modeling → Classify group activity. | 93% |
| B9 - Fine-Tuned Team Spatial Classification | Fine-tune ResNet50 on Cropped Person Actions → Extract player features → Pool features per team → Classify group activity. | 92% |
- Baseline 1 → 3: Early models focus on frame-based CNN classification before shifting to person-level classification.
- Baseline 4 → 5: Introduces LSTM-based temporal modeling for both image and player-level features.
- Baseline 6 → 7: Evaluates the effects of removing person-level or scene-level LSTMs.
- Baseline 8 → 9: Moves toward hierarchical team-aware pooling and an end-to-end structured classification approach.
- l-set and r-set recognition reached 89% recall, benefiting from scene-level representations.
- Pass actions remain a weak point (r-pass at 67% recall), showing that removing the person-level LSTM hurts individual action recognition.
- Balanced macro and weighted accuracy scores, indicating overall improvement in scene-level understanding.
- l-winpoint performance jumped to 83% recall, meaning the model is now effectively distinguishing game-ending actions.
- Pass recognition significantly improved (l-pass: 86.3%, r-pass: 83.3% recall) compared to earlier baselines.
- Spike actions remain highly distinguishable (l-spike: 87.3%, r-spike: 86.1%), indicating robust temporal modeling.
- Winpoint actions are weaker (l_winpoint: 53.9%, r_winpoint: 56.3%), suggesting some confusion in game-ending states.
- Strong macro and weighted averages (~79%), proving that hierarchical structure helps even without scene-level LSTM.
- Pass actions maintain strong recognition (r-pass: 85.2% recall), improving from B7.
- Winpoint classification improves (l_winpoint: 64.7%), reducing confusion in match-ending events.
- Team interactions are still not explicitly modeled, leaving room for improvement.
- Highest overall performance so far, with a macro average of 0.90.
- Team-aware pooling significantly improves winpoint actions (l_winpoint: 94.1%, r_winpoint: 87.4%).
- Better precision-recall balance across all activity classes.
- Spike and pass actions remain dominant at 88–96% accuracy, indicating the success of structured representation.
- Minimal misclassification, highlighting the model’s strong team-aware learning.
- Very close to B8 with Team Pooling in overall performance (91%).
- Pass and spike actions maintain high precision and recall, ensuring smooth team-based action understanding.
- The structured, team-aware approach proves highly effective, performing on par with the best hierarchical model.
- Pass action recognition improves consistently, peaking at ~92% recall in B8 with Team Pooling.
- Winpoint classification struggles in early models but reaches 93% in B9, proving the importance of structured team representation.
- Spiking actions remain robust across all baselines, with minor refinements from B7 onward.
- Hierarchical modeling (B7,B8) yields the best results, demonstrating the effectiveness of structured feature learning.
- Team pooling (B8 with team separation) plays a crucial role in reducing left/right confusion and boosting final performance.
Model configurations are stored in the configs/ directory. Adjust parameters such as learning rate, batch size, and number of epochs by editing the relevant .yml file.
Evaluation is performed automatically after training. Results include metrics like confusion matrices and classification reports, which are saved in the results/ directory.
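The per-class recall and macro-average figures quoted in the analysis above can be derived from a confusion matrix. The following pure-Python sketch shows the arithmetic on a toy two-class matrix; the function names are illustrative and the numbers are made up for the example, not taken from the results.

```python
def per_class_recall(confusion):
    """Recall per class, where confusion[i][j] counts true class i predicted as j."""
    recalls = []
    for i, row in enumerate(confusion):
        total = sum(row)  # number of samples whose true class is i
        recalls.append(row[i] / total if total else 0.0)
    return recalls


def macro_average(values):
    """Unweighted mean over classes, so every class counts equally."""
    return sum(values) / len(values)


# Toy example: class 0 has 8 of 10 correct, class 1 has 9 of 10 correct
confusion = [[8, 2],
             [1, 9]]
recalls = per_class_recall(confusion)
macro_recall = macro_average(recalls)
```

A weighted average would instead multiply each class's recall by its share of the samples, which matters when classes such as winpoint are rarer than others.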
Logs and model outputs are organized into timestamped folders within the results/ directory for easy tracking of experiments.











