Deep Learning Project for Volleyball Activity Recognition

An implementation of the seminal CVPR 2016 paper "A Hierarchical Deep Temporal Model for Group Activity Recognition."

Implemented Paper

| Paper | Year | Original Paper | Original Implementation | Key Points |
| --- | --- | --- | --- | --- |
| CVPR 16 | 2016 | Paper | Implementation | Two-stage hierarchical LSTM for group activity recognition |

Abstract

This project re-implements and extends the work presented in the original paper on group activity recognition using deep temporal modeling. The central idea of the study is that the temporal evolution of group activities can be inferred from the motion and actions of individual participants. To replicate and enhance this approach, a two-stage hierarchical LSTM model was implemented: the first LSTM captures temporal dynamics at the individual (player) level, while the second LSTM aggregates these representations to model group-level behavior. The model was trained and evaluated on both the Collective Activity Dataset and the Volleyball Dataset, achieving improved performance and demonstrating the effectiveness of hierarchical temporal modeling for structured activity understanding.

Model

Figure 1: The figure illustrates the concept of group activity recognition using a hierarchical temporal model. In this framework, each individual within a scene is represented by a person-level LSTM, which captures the temporal dynamics of their actions. These individual representations are then integrated through a scene-level LSTM, allowing the model to learn and infer the overall group activity from collective temporal patterns.

Figure 2: A detailed illustration of the implemented model architecture. Given tracklets of K players, each tracklet is first processed through a CNN backbone to extract spatial features, which are then passed to a person-level LSTM to model the temporal dynamics of individual actions. The resulting temporal representations from all players in the scene are pooled to form a unified scene representation. This aggregated feature is subsequently fed into a scene-level LSTM, which captures inter-player relationships and predicts the overall team activity.

Figure 3: Earlier baseline models discarded important spatial arrangement information when aggregating player features. In the updated implementation, a two-group pooling mechanism was introduced to preserve the spatial configuration of players by separating features based on team affiliation. This enhancement enables the model to better capture team-wise spatial dynamics, leading to improved group activity recognition performance.
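
The two-group pooling idea can be sketched in PyTorch as follows. This is a minimal illustration, not the repository's code: the function name `team_pool`, the tensor shapes, and the rule that assigns players to a side by sorting on the horizontal centre of their bounding box are all assumptions. Each side is max-pooled separately and the two pooled vectors are concatenated, so the result keeps a left/right ordering that a single pool over all players would discard.

```python
import torch


def team_pool(player_feats: torch.Tensor, x_centers: torch.Tensor) -> torch.Tensor:
    """Max-pool player features separately for the left and right halves.

    player_feats: (batch, num_players, feat_dim)
    x_centers:    (batch, num_players) horizontal box centres (assumed input)
    returns:      (batch, 2 * feat_dim), left pool followed by right pool
    """
    batch, num_players, feat_dim = player_feats.shape
    half = num_players // 2

    # Order players left-to-right so the first half is the left-side group.
    order = x_centers.argsort(dim=1)                      # (batch, num_players)
    index = order.unsqueeze(-1).expand(-1, -1, feat_dim)  # (batch, num_players, feat_dim)
    sorted_feats = player_feats.gather(1, index)

    left = sorted_feats[:, :half].max(dim=1).values
    right = sorted_feats[:, half:].max(dim=1).values
    return torch.cat([left, right], dim=1)


feats = torch.randn(4, 12, 512)  # 4 frames, 12 players, 512-d features (assumed sizes)
xs = torch.rand(4, 12)           # normalised x-centres
print(team_pool(feats, xs).shape)
```

Splitting at the midpoint assumes an equal number of detected players on each side; frames with missing detections would need padding or a different assignment rule.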

Key Contributions

Enhanced Baseline Architectures: Redesigned and optimized the baseline models by integrating modern deep convolutional backbones (e.g., replacing AlexNet with ResNet50), resulting in improved feature representation and generalization.
Significant Performance Gains: Achieved consistently higher accuracies across all baselines compared to the original study. Notably, the final baseline attained 91% accuracy, outperforming the paper’s reported 81.9%, highlighting the effectiveness of the proposed modifications.
Introduction of a Novel Baseline (Baseline9): Developed an additional baseline that reached 91% accuracy without relying on a temporal module, demonstrating that strong spatial modeling alone can yield competitive results.
Modern Framework Migration: Re-implemented the entire experimental pipeline using PyTorch, replacing the outdated Caffe framework to enable improved reproducibility, extensibility, and integration with current research workflows.

Accuracy and Improvement Over the Paper

| Baseline | Accuracy (Paper) | Accuracy (Our Implementation) |
| --- | --- | --- |
| B1 - Image Classification | 66.7% | 77% |
| B2 - Person Classification | 64.6% | Skipped |
| B3 - Fine-tuned Person Classification | 68.1% | 75% |
| B4 - Temporal Model with Image Features | 63.1% | 77% |
| B5 - Temporal Model with Person Features | 67.6% | Skipped |
| B6 - Two-stage Model without LSTM 1 | 74.7% | 78% |
| B7 - Two-stage Model without LSTM 2 | 80.2% | 81% |
| B8 - Two-stage Hierarchical Model (1 group) | 70.3% | 82% |
| B8 - Two-stage Hierarchical Model (2 groups) | 81.9% | 90% |
| B9 - Fine-Tuned Team Spatial Classification | N/A (new baseline) | 91% |

Key Takeaways

Enhanced Baseline Performance: Achieved substantial improvements in baseline accuracy, reaching up to 91%, compared to the 81.9% reported in the original paper, demonstrating the effectiveness of the refined architectures and training strategies.
Transition to a Modern Framework: Re-implemented the entire system in PyTorch, providing a more flexible, extensible, and research-friendly environment than the original Caffe implementation.
Introduction of New Baselines: Developed additional baseline models, such as Baseline9, which achieved 91% accuracy without relying on temporal modeling, showcasing the robustness of purely spatial representations.
Comprehensive Ablation Study: Conducted an in-depth ablation study across multiple baselines, systematically analyzing architectural variations and quantifying their impact on overall performance.
Hierarchical Temporal Modeling: Employed a two-stage hierarchical LSTM framework to effectively model both individual-level dynamics and group-level interactions, enhancing temporal understanding of complex scenes.
Team-Aware Pooling Mechanism: Introduced a team-wise pooling strategy to explicitly differentiate between opposing sides, thereby reducing inter-team confusion and improving activity classification accuracy.
Expanded and Annotated Dataset: Utilized a rich volleyball dataset containing frame-level annotations, bounding boxes, and both individual and group activity labels to support reproducible and detailed analysis.
Configurable Experimental Setup: Integrated YAML-based configuration files, allowing efficient and transparent control over hyperparameters, model architecture, and training routines.
Training Optimization and Visualization Tools: Incorporated early stopping, metric tracking, and visualization modules—such as confusion matrices and classification reports—to facilitate informed model evaluation and comparison.
Scalable and Modular Architecture: Designed the system following a modular and scalable structure, enabling future extensions, experiments, and integration with other research pipelines.

Installation

Clone the repository:

```
git clone https://github.com/mostafabahaa25/hdtm-group-activity-recognition
cd hdtm-group-activity-recognition
```

Install the required dependencies:
```
pip install -r requirements.txt
```

Dataset

The experiments were conducted using the Volleyball Dataset introduced in the original study. This dataset provides a well-structured benchmark for group activity recognition and includes:

Videos: 55 publicly available volleyball match videos sourced from YouTube, capturing diverse team interactions and gameplay scenarios.
Annotated Frames: 4,830 manually annotated frames, each containing player bounding boxes, individual action labels, and group activity annotations, enabling fine-grained spatial–temporal analysis at both the player and scene levels.

Dataset Labels

Group Activity Classes

| Class | Instances |
| --- | --- |
| Right set | 644 |
| Right spike | 623 |
| Right pass | 801 |
| Right winpoint | 295 |
| Left winpoint | 367 |
| Left pass | 826 |
| Left spike | 642 |
| Left set | 633 |

Action Classes

| Class | Instances |
| --- | --- |
| Waiting | 3601 |
| Setting | 1332 |
| Digging | 2333 |
| Falling | 1241 |
| Spiking | 1216 |
| Blocking | 2458 |
| Jumping | 341 |
| Moving | 5121 |
| Standing | 38696 |
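
The action labels are heavily skewed: Standing alone makes up about 69% of all instances. The sketch below computes each class's share and an inverse-frequency weight from the counts in the table. Whether this project applies any class weighting is not stated here, so treat the weighting formula as one common option rather than a description of the training code.

```python
# Action-class counts copied from the table above.
action_counts = {
    "Waiting": 3601, "Setting": 1332, "Digging": 2333, "Falling": 1241,
    "Spiking": 1216, "Blocking": 2458, "Jumping": 341, "Moving": 5121,
    "Standing": 38696,
}

total = sum(action_counts.values())
num_classes = len(action_counts)

# Inverse-frequency weights: total / (num_classes * count). Perfectly balanced
# classes would all get 1.0; rarer classes get weights above 1.0.
class_weights = {name: total / (num_classes * n) for name, n in action_counts.items()}

for name, n in sorted(action_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:<9} share={n / total:6.1%}  weight={class_weights[name]:.2f}")
```

Weights of this form could be passed to a weighted cross-entropy loss for the person-level action classifier.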

Dataset Splits

Training Set: 24 videos (training and validation together make up roughly 2/3 of the videos).
- Train Videos: 1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39, 40, 41, 42, 48, 50, 52, 53, 54.
Validation Set: 15 videos.
- Validation Videos: 0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51.
Test Set: 16 videos (roughly 1/3 of the videos).
- Test Videos: 4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47.
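
For convenience, the split lists above can be written as Python sets together with a sanity check that they are disjoint and cover all 55 videos. The names `TRAIN`, `VAL`, `TEST`, and `split_of` are illustrative and may not match the identifiers used in the repository.

```python
# Video IDs copied from the split lists above.
TRAIN = {1, 3, 6, 7, 10, 13, 15, 16, 18, 22, 23, 31, 32, 36, 38, 39,
         40, 41, 42, 48, 50, 52, 53, 54}
VAL = {0, 2, 8, 12, 17, 19, 24, 26, 27, 28, 30, 33, 46, 49, 51}
TEST = {4, 5, 9, 11, 14, 20, 21, 25, 29, 34, 35, 37, 43, 44, 45, 47}


def split_of(video_id: int) -> str:
    """Return the name of the split a video ID belongs to."""
    for name, ids in (("train", TRAIN), ("val", VAL), ("test", TEST)):
        if video_id in ids:
            return name
    raise ValueError(f"unknown video id: {video_id}")


# The three splits should be pairwise disjoint and together cover videos 0..54.
assert not (TRAIN & VAL or TRAIN & TEST or VAL & TEST)
assert TRAIN | VAL | TEST == set(range(55))
print(len(TRAIN), len(VAL), len(TEST))
```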

Dataset Sample

The dataset is available for download from the GitHub Deep Activity Rec repository or on Kaggle.

Features

Comprehensive Baseline Implementations: Includes multiple experimental baselines (Baseline1, Baseline3, Baseline4, Baseline6, Baseline7, Baseline8, and Baseline9; Baseline2 and Baseline5 were skipped), each designed to explore different architectural and temporal modeling strategies.
Configurable Experimental Setup: Employs YAML-based configuration files to facilitate transparent, reproducible, and easily adjustable experimentation across various hyperparameters and model architectures.
Early Stopping Strategy: Integrates a built-in early stopping mechanism to prevent overfitting and ensure optimal model convergence based on validation performance.
Performance Monitoring and Visualization: Provides detailed evaluation metrics, including confusion matrices and classification reports, for comprehensive model assessment and interpretability.
Modular and Scalable Architecture: Designed with a modular codebase that supports scalability, enabling effortless integration of new components and future research extensions.
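
As an illustration of the early stopping feature listed above, here is a minimal, framework-independent sketch. The class name `EarlyStopping`, the `patience` and `min_delta` parameters, and the choice of validation loss as the monitored quantity are assumptions; the repository's own mechanism may monitor a different metric.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([0.90, 0.70, 0.72, 0.71, 0.69]):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch}")
        break
```

In practice the best checkpoint would be saved whenever `step` records an improvement, so that training can be rolled back to it after stopping.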

Ablation Study

Baseline Analysis and Insights

B1 - Image Classification

Description: Fine-tuned a ResNet50 model for frame-level classification, treating each frame independently without temporal modeling.
Insights: Demonstrated strong performance on static scene recognition but failed to capture the temporal dependencies critical for sequential group activities.
Key Features: Frame-level classification; absence of temporal or structural context.

B3 - Fine-Tuned Person Classification

Description: Fine-tuned ResNet50 on cropped player regions to perform individual action classification, followed by feature pooling for group activity recognition.
Insights: Improved granularity by focusing on individual players; however, performance remained limited due to the lack of explicit temporal modeling between frames.
Key Features: Person-centric classification; pooled representations without temporal aggregation.

B4 - Temporal Model with Image Features

Description: Integrated an LSTM to capture temporal dependencies while continuing to use global image-level features as input.
Insights: Enabled sequential understanding of visual cues, yet suffered from the absence of structured player representations and spatial relationships.
Key Features: LSTM-based temporal modeling; global image features without explicit player modeling.

B6 - Two-stage Model Without Player-Level LSTM

Description: Omitted the player-level LSTM, retaining only the scene-level LSTM while relying on features extracted from individual players.
Insights: Scene-level modeling effectively captured holistic activity trends but at the expense of fine-grained temporal details at the player level.
Key Features: Scene-level temporal modeling; absence of player-level temporal learning.

B7 - Two-stage Model without LSTM 2

Description: Retained the player-level LSTM while removing the scene-level temporal module.
Insights: Preserved temporal consistency for individual players but struggled to integrate collective scene dynamics necessary for accurate group activity recognition.
Key Features: Player-level temporal modeling; no global scene-level temporal aggregation.

B8 - Two-stage Hierarchical Model

Description: Incorporates both player-level and scene-level LSTMs to construct a hierarchical temporal architecture, enabling the model to jointly capture individual actions and collective group behaviors over time.
Insights: Demonstrates the effectiveness of hierarchical temporal modeling by successfully integrating fine-grained player dynamics with high-level scene context, resulting in a more coherent understanding of group activities.
Key Features: Hierarchical LSTM framework; joint modeling of individual and team-level temporal dependencies.
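
The two-stage structure can be sketched in PyTorch as below. This is a simplified illustration under stated assumptions, not the repository's model: it takes pre-extracted per-player CNN features as input, uses max pooling over players, and classifies from the last time step. The class name `HierarchicalLSTM`, the hidden sizes, and the input shape (12 players, 9 frames) are hypothetical; 2048 is the size of ResNet50's pooled feature vector.

```python
import torch
from torch import nn


class HierarchicalLSTM(nn.Module):
    """Person-level LSTM -> pooling over players -> scene-level LSTM -> classifier."""

    def __init__(self, feat_dim=2048, person_hidden=512, scene_hidden=512, num_classes=8):
        super().__init__()
        self.person_lstm = nn.LSTM(feat_dim, person_hidden, batch_first=True)
        self.scene_lstm = nn.LSTM(person_hidden, scene_hidden, batch_first=True)
        self.classifier = nn.Linear(scene_hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, players, time, feat_dim), CNN features for each player crop
        b, k, t, d = x.shape

        # Stage 1: run the person LSTM over each player's tracklet independently.
        person_out, _ = self.person_lstm(x.reshape(b * k, t, d))
        person_out = person_out.reshape(b, k, t, -1)

        # Pool over the player axis to get one scene vector per time step.
        scene_in = person_out.max(dim=1).values       # (batch, time, person_hidden)

        # Stage 2: the scene LSTM models how the pooled representation evolves.
        scene_out, _ = self.scene_lstm(scene_in)
        return self.classifier(scene_out[:, -1])      # logits from the last time step


model = HierarchicalLSTM()
logits = model(torch.randn(2, 12, 9, 2048))  # 2 clips, 12 players, 9 frames
print(logits.shape)
```

Dropping `person_lstm` or `scene_lstm` from this sketch corresponds in spirit to the B6 and B7 ablations described above.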

B8 - Two-stage Hierarchical Model with Team Pooling

Description: Adds team-wise pooling before applying scene-level LSTM.
Insights: Reduces confusion between left and right teams, improving classification.
Key Features: Team-wise pooling, hierarchical scene modeling.

B9 - Fine-Tuned Team Spatial Classification

Description: Fine-tunes ResNet50 on individual player actions before pooling team representations.
Insights: Reaches accuracy comparable to the full hierarchical model without any temporal module, by leveraging fine-grained person representations.
Key Features: ResNet50-based person classification, Team-wise pooling, optimized scene classification.

Baselines Implementation Comparison

Overview

This table outlines the progression of different baseline models, highlighting their implementation improvements and accuracy as measured in our implementation.

| Baseline Model | Implementation | Accuracy (Our Implementation) |
| --- | --- | --- |
| B1 - Image Classification | Fine-tune ResNet50 at image level → Classify group activity. | 78% |
| B2 - Person Classification | Extract person features (ResNet50 without fine-tuning) → Pool features over players → Classify group activity. Skipped because it involves no fine-tuning. | N/A |
| B3 - Fine-tuned Person Classification | Fine-tune ResNet50 on cropped person actions → Extract features → Pool features over players → Classify group activity. | 76% |
| B4 - Temporal Model with Image Features | Based on B1 → Extract image features → Apply LSTM for temporal modeling → Classify group activity. | 80% |
| B5 - Temporal Model with Person Features | Based on B2 → Apply LSTM for player-level modeling → Pool features → Classify group activity. Skipped because B2 was skipped and the same idea is applied in B7. | N/A |
| B6 - Two-stage Model without LSTM 1 | Based on B3 → Extract person features → Pool features → Apply LSTM for scene-level modeling → Classify group activity. | 81% |
| B7 - Two-stage Model without LSTM 2 | Based on B3 → Extract person features → Apply LSTM for player-level modeling → Pool features → Classify group activity. | 88% |
| B8 - Two-stage Hierarchical Model | Based on B3 → Extract person features → Apply LSTM for player-level modeling → Pool features over players → Apply LSTM for scene-level modeling → Classify group activity. | 89.20% |
| B8 - Two-stage Hierarchical Model with Team Pooling | Based on B7 → Extract person features → Apply LSTM for player-level modeling → Pool features per team → Concatenate both teams → Apply LSTM for scene-level modeling → Classify group activity. | 93% |
| B9 - Fine-Tuned Team Spatial Classification | Fine-tune ResNet50 on cropped person actions → Extract player features → Pool features per team → Classify group activity. | 92% |

Key Takeaways

Baseline 1 → 3: Early models focus on frame-based CNN classification before shifting to person-level classification.
Baseline 4 → 5: Introduces LSTM-based temporal modeling for both image and player-level features.
Baseline 6 → 7: Evaluates the effects of removing person-level or scene-level LSTMs.
Baseline 8 → 9: Moves toward hierarchical team-aware pooling and an end-to-end structured classification approach.

Evaluation Metrics & Observations

Baseline 6 - Two-stage Model without LSTM 1 : (Accuracy: ~78%)

L-set and r-set recognition reached 89% recall, benefiting from scene-level representations.
Pass actions remain a weak point (r-pass at 67% recall), showing that removing person-level LSTM impacts individual action recognition.
Balanced macro and weighted accuracy scores, indicating overall improvement in scene-level understanding.
L-winpoint performance jumped to 83% recall, meaning the model is now effectively distinguishing game-ending actions.

Baseline 7 - Two-stage Model without LSTM 2 : (Accuracy: ~81%)

Pass recognition significantly improved (l-pass: 86.3%, r-pass: 83.3% recall) compared to earlier baselines.
Spike actions remain highly distinguishable (l-spike: 87.3%, r-spike: 86.1%), indicating robust temporal modeling.
Winpoint actions are weaker (l_winpoint: 53.9%, r_winpoint: 56.3%), suggesting some confusion in game-ending states.
Strong macro and weighted averages (~79%), proving that hierarchical structure helps even without scene-level LSTM.

Baseline 8 - Two-stage Hierarchical Model : (Accuracy: ~82%)

Pass actions maintain strong recognition (r-pass: 85.2% recall), improving from B7.
Winpoint classification improves (l_winpoint: 64.7%), reducing confusion in match-ending events.
Team interactions are still not explicitly modeled, leaving room for improvement.

Baseline 8 - Two-stage Hierarchical Model with Team Pooling : (Accuracy: ~90%)

Highest overall performance so far, with a macro average of 0.90.
Team-aware pooling significantly improves winpoint actions (l_winpoint: 94.1%, r_winpoint: 87.4%).
Better precision-recall balance across all activity classes.
Spike and pass actions remain dominant at 88–96% accuracy, indicating the success of structured representation.
Minimal misclassification, highlighting the model’s strong team-aware learning.

Baseline 9 - Fine-Tuned Team Spatial Classification : (Accuracy: ~92%)

Very close to B8 with Team Pooling in overall performance (91%).
Pass and spike actions maintain high precision and recall, ensuring smooth team-based action understanding.
Team-aware pooling of fine-tuned person features proves highly effective on its own, matching the hierarchical temporal model without any LSTM.

Key Takeaways

Pass action recognition improves consistently, peaking at ~92% recall in B8 with Team Pooling.
Winpoint classification struggles in early models but reaches 93% in B9, proving the importance of structured team representation.
Spiking actions remain robust across all baselines, with minor refinements from B7 onward.
Hierarchical modeling (B7, B8) yields the best results among the temporal models, demonstrating the effectiveness of structured feature learning.
Team pooling (B8 with team separation) plays a crucial role in reducing left/right confusion and boosting final performance.

Configuration

Model configurations are stored in the config/ directory. Adjust parameters such as learning rate, batch size, and number of epochs by editing the relevant .yml file.
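
One way to consume such a file is to parse it into a small typed object so that misspelled keys fail early. The sketch below is an assumption-laden example: the `TrainConfig` class and its field names and defaults (`learning_rate`, `batch_size`, `num_epochs`) are hypothetical and may not match the keys in the repository's .yml files. `load_config` relies on PyYAML's `yaml.safe_load`.

```python
from dataclasses import dataclass, fields


@dataclass
class TrainConfig:
    # Hypothetical keys and defaults; the real .yml files may use different names.
    learning_rate: float = 1e-4
    batch_size: int = 32
    num_epochs: int = 30

    @classmethod
    def from_dict(cls, raw: dict) -> "TrainConfig":
        """Build a config from a parsed YAML mapping, rejecting unknown keys."""
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            raise KeyError(f"unknown config keys: {sorted(unknown)}")
        return cls(**raw)


def load_config(path: str) -> TrainConfig:
    """Read a YAML file (requires PyYAML) and validate it against TrainConfig."""
    import yaml  # imported lazily so the dataclass works without PyYAML

    with open(path, "r", encoding="utf-8") as fh:
        return TrainConfig.from_dict(yaml.safe_load(fh) or {})


cfg = TrainConfig.from_dict({"learning_rate": 3e-4, "batch_size": 16})
print(cfg)
```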

Evaluation

Evaluation is performed automatically after training. Results include metrics like confusion matrices and classification reports, which are saved in the results/ directory.
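
To make the reported metrics concrete, the following dependency-free sketch builds a confusion matrix and derives per-class recall with macro and support-weighted averages. The labels are toy values, not results from this project, and the function names are illustrative; the repository's evaluation utilities may be implemented differently.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def recall_report(cm):
    """Per-class recall plus macro and support-weighted averages."""
    supports = [sum(row) for row in cm]
    recalls = [row[i] / s if s else 0.0 for i, (row, s) in enumerate(zip(cm, supports))]
    macro = sum(recalls) / len(recalls)
    weighted = sum(r * s for r, s in zip(recalls, supports)) / sum(supports)
    return recalls, macro, weighted


# Toy labels for three classes (not real results).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
recalls, macro, weighted = recall_report(cm)
print(cm, recalls, round(macro, 3), round(weighted, 3))
```

The macro average treats every class equally, while the support-weighted recall equals overall accuracy, which is why the two can diverge when classes such as the winpoint activities are rarer than the others.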

Logging and Outputs

Logs and model outputs are organized into timestamped folders within the results/ directory for easy tracking of experiments.
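
A timestamped output folder can be created with the standard library alone. The helper below is a hypothetical sketch; the function name `make_run_dir`, the default arguments, and the `name_YYYYMMDD-HHMMSS` naming scheme are assumptions rather than the repository's actual layout.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(root: str = "results", name: str = "baseline8") -> Path:
    """Create and return a fresh folder such as results/baseline8_20250101-120000."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    run_dir = Path(root) / f"{name}_{stamp}"
    # exist_ok=False makes an accidental reuse of an existing run folder an error.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir
```

Calling `make_run_dir()` at the start of training returns a path under which logs, checkpoints, and evaluation reports for that run can be written. Two calls within the same second would collide and raise `FileExistsError`.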
