- Project Overview
  - Purpose
  - Database Description
  - Project Structure
- Setup Requirements
  - Prerequisites
  - System Requirements
  - Required Software
  - Python Dependencies
- Database Setup
  - Schema Creation
  - Table Creation
  - Data Import Process
- Data Processing Pipeline
  - CSV File Structure
  - Data Transformation
  - Data Loading Process
  - Handling Encodings
- SQL Examples
  - Basic Queries
  - Advanced Examples
  - Performance Optimization
- Project Structure
  - Directory Layout
  - File Descriptions
  - Key Components
- Usage Guide
  - Installation Steps
  - Configuration
  - Running the Scripts
  - Troubleshooting
- Contributing
  - Guidelines
  - Development Setup
  - Testing
- License & Attribution
  - License Information
  - Data Source Credits
  - Acknowledgments
This project serves as a practical guide for learning advanced SQL concepts using the AdventureWorks database. It includes data processing scripts, database setup, and example queries for real-world business scenarios.
AdventureWorks represents a fictional bicycle manufacturer, featuring:
- Multiple business areas (Sales, Production, Purchasing, etc.)
- Complex relationships between entities
- Real-world business scenarios
- Rich dataset for advanced SQL practice
The complete database structure is visualized in the project's Entity-Relationship (ER) diagram.
The database consists of five main schemas:
- Person: Customer and person contact information
- HumanResources: Employee-related data
- Production: Product details and inventory
- Purchasing: Vendor and purchase order information
- Sales: Customer, sales orders, and store data
The tables are created using a structured SQL script (ddLsql/create_tables.sql) that:
- Sets up initial requirements:

      -- Enable UUID generation
      CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

      -- Create Schemas
      CREATE SCHEMA IF NOT EXISTS Person;
      CREATE SCHEMA IF NOT EXISTS HumanResources;
      CREATE SCHEMA IF NOT EXISTS Production;
      CREATE SCHEMA IF NOT EXISTS Purchasing;
      CREATE SCHEMA IF NOT EXISTS Sales;
      CREATE SCHEMA IF NOT EXISTS dbo;
- Implements comprehensive data types:
  - SERIAL for auto-incrementing IDs
  - UUID for unique identifiers
  - Timestamps for date tracking
  - Decimal for precise financial calculations
  - Various string types (VARCHAR, TEXT)
- Enforces data integrity through:
  - Primary and Foreign Keys
  - CHECK constraints
  - DEFAULT values
  - NOT NULL constraints
  - UNIQUE constraints
The project is organized into two main sections:
- Data Processing & Setup
- SQL Examples & Tutorials
- Linux/Unix-based operating system
- Python 3.8 or higher
- PostgreSQL 12 or higher
- Minimum 4GB RAM
- 2GB free disk space
- Internet connection for initial setup
- PostgreSQL Server
- Python 3.8 or higher
- pip (Python package manager)
pandas
psycopg2-binary
numpy
- Database schemas are defined in ddLsql/create_schema.sql
- Includes schemas for:
  - Person
  - Production
  - Sales
  - Purchasing
  - HumanResources
Tables are created using ddLsql/create_tables.sql, which follows a systematic approach:
- Schema Organization:
  - Tables are grouped by business function
  - Each schema represents a distinct business area
  - Logical separation of concerns
- Table Dependencies:
  - Tables are created in order of their dependencies
  - Foreign key relationships are properly established
  - Referential integrity is maintained
- Key Examples:

      -- Person tables
      CREATE TABLE Person.Person (
          BusinessEntityID INT PRIMARY KEY,
          PersonType CHAR(2) NOT NULL,
          NameStyle BOOLEAN NOT NULL DEFAULT FALSE,
          -- ... additional columns
      );

      -- HumanResources tables
      CREATE TABLE HumanResources.Employee (
          BusinessEntityID INT PRIMARY KEY,
          NationalIDNumber VARCHAR(15) NOT NULL,
          -- ... additional columns
          FOREIGN KEY (BusinessEntityID) REFERENCES Person.Person
      );
- Data Integrity:
  - CHECK constraints for data validation
  - DEFAULT values for standard fields
  - UNIQUE constraints where needed
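The idea of creating tables in dependency order can be sketched in Python as a small topological sort. This is only an illustration, not code from the repository: the function name `creation_order` and the three-table example graph are assumptions, and the real order is whatever ddLsql/create_tables.sql already encodes.

```python
from collections import deque


def creation_order(dependencies):
    """Order tables so each one comes after every table its foreign keys reference.

    `dependencies` maps a table name to the set of table names it references.
    """
    # Register every table, including ones that only appear as FK targets.
    tables = set(dependencies)
    for parents in dependencies.values():
        tables.update(parents)

    # Unresolved parents per table; self-references do not block creation.
    remaining = {t: set(dependencies.get(t, ())) - {t} for t in tables}
    children = {t: set() for t in tables}
    for table, parents in remaining.items():
        for parent in parents:
            children[parent].add(table)

    # Start with tables that reference nothing (sorted for a stable result).
    ready = deque(sorted(t for t, parents in remaining.items() if not parents))
    order = []
    while ready:
        table = ready.popleft()
        order.append(table)
        for child in sorted(children[table]):
            remaining[child].discard(table)
            if not remaining[child]:
                ready.append(child)

    if len(order) != len(tables):
        raise ValueError("circular foreign key dependency detected")
    return order


# Illustrative subset of the dependency graph (assumed, not read from the DDL).
example = {
    "Person.Person": set(),
    "HumanResources.Employee": {"Person.Person"},
    "Sales.SalesPerson": {"HumanResources.Employee"},
}
print(creation_order(example))
```

The same ordering is useful when loading data, since parent rows must exist before the child rows that reference them.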
The data import process is handled by populate_table.py, which:
- Reads CSV files from the data/ directory
- Handles various encodings (UTF-8, UTF-16)
- Manages data type conversions
- Maintains referential integrity
The data/ directory contains 68 CSV files, including:
- Business entity data
- Product information
- Sales records
- Employee data
- Geographic information
Handled by populate_table.py:
- Automatic data type conversion
- NULL value handling
- Date/time format standardization
- UUID processing
- Boolean value normalization
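The transformations listed above could look roughly like the following standard-library sketch. It is not the code in populate_table.py: the helper names, the set of NULL markers, and the accepted boolean spellings are all assumptions chosen for illustration.

```python
import uuid
from datetime import datetime

# Assumed spellings; adjust to whatever the CSV files actually contain.
NULL_MARKERS = {"", "NULL", "null", "\\N"}
TRUE_VALUES = {"1", "t", "true", "y", "yes"}
FALSE_VALUES = {"0", "f", "false", "n", "no"}


def to_null(value):
    """Map empty or NULL-like strings to None; otherwise return the stripped text."""
    if value is None:
        return None
    text = value.strip()
    return None if text in NULL_MARKERS else text


def to_bool(value):
    """Normalize common boolean spellings to True/False (None stays None)."""
    text = to_null(value)
    if text is None:
        return None
    lowered = text.lower()
    if lowered in TRUE_VALUES:
        return True
    if lowered in FALSE_VALUES:
        return False
    raise ValueError(f"not a recognized boolean: {value!r}")


def to_uuid(value):
    """Return a canonical lowercase UUID string, accepting optional braces."""
    text = to_null(value)
    if text is None:
        return None
    return str(uuid.UUID(text.strip("{}")))


def to_timestamp(value):
    """Parse an ISO-style 'YYYY-MM-DD HH:MM:SS' string into a datetime."""
    text = to_null(value)
    if text is None:
        return None
    return datetime.fromisoformat(text)
```

Keeping each conversion in its own small function makes it easy to apply them per column and to test them in isolation.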
The loading process:
- Reads CSV files with proper encoding
- Transforms data to match PostgreSQL types
- Loads data in chunks for better performance
- Handles foreign key constraints
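Chunked loading can be sketched as below. This is a minimal illustration under stated assumptions: `iter_chunks`, `load_in_chunks`, and the `insert_batch` callback are hypothetical names, and the default chunk size of 1000 is arbitrary rather than the value used by populate_table.py.

```python
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most `chunk_size` rows from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def load_in_chunks(rows, insert_batch, chunk_size=1000):
    """Pass rows to `insert_batch` one chunk at a time; return the row count."""
    total = 0
    for chunk in iter_chunks(rows, chunk_size):
        insert_batch(chunk)  # e.g. a function wrapping cursor.executemany(...)
        total += len(chunk)
    return total


# Usage with a stand-in for the database call:
batches = []
print(load_in_chunks(range(5), batches.append, chunk_size=2), batches)
```

Because `insert_batch` is just a callable, the same loop works with a real database cursor or with a list during testing.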
The project includes sophisticated encoding handling for CSV files, particularly in populate_table.py. Different files require different encoding approaches:
- UTF-16 LE Files:

      utf_16_encodings_file = (
          'BusinessEntityAddress', 'Employee', 'Person', 'EmailAddress',
          'Password', 'PersonPhone', 'PhoneNumberType', 'ProductPhoto',
          'BusinessEntity', 'ProductModel', 'CountryRegionCurrency', 'Store',
          'Illustration', 'JobCandidate', 'Document', 'ProductDescription'
      )
- UTF-8 Files:

      utf_8_encodings_file = ('ProductReview', 'Product', 'Location')
- Encoding Detection Issues:
  - Problem: Character encoding mismatch causing garbled data
  - Solution: Script attempts multiple encodings in order:

        encodings_to_try = ['cp1252', 'utf-8-sig', 'utf-16', 'latin-1']
- Null Bytes in Data:
  - Problem: Null bytes (\x00) causing parsing errors
  - Solution: Automatic removal of null bytes:

        cleaned_string = decoded_string.replace('\x00', '')
- BOM (Byte Order Mark) Issues:
  - Problem: BOM interfering with data parsing
  - Solution: Using 'utf-8-sig' encoding where appropriate
Memory Efficiency:
- Problem: Large files causing memory issues
- Solution: Using
low_memory=Falsein pandas for reliable parsing:read_csv_params = { 'sep': '\t', 'header': None, 'low_memory': False }
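The fallback decoding and null-byte cleanup described above can be sketched as follows. The function name `decode_with_fallback` is an assumption and the actual logic in populate_table.py may differ. Note that latin-1 can decode any byte sequence, so it always succeeds and belongs last; a successful decode therefore does not prove the encoding was the right one.

```python
def decode_with_fallback(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes `raw` cleanly.

    Null bytes are stripped from the decoded text before it is returned.
    """
    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeError:
            continue  # try the next candidate
        return text.replace("\x00", ""), encoding
    raise ValueError("none of the candidate encodings could decode the data")


# A UTF-16 sample with a BOM; 'utf-8-sig' fails on it, 'utf-16' succeeds.
sample = "1\tRoad Bike\n".encode("utf-16")
text, used = decode_with_fallback(sample, ["utf-8-sig", "utf-16", "latin-1"])
print(repr(text), used)
```

Passing the candidate list as a parameter keeps the order explicit, so stricter encodings can be tried before permissive ones.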
If you encounter encoding issues:
- Check File Type:

      file -i your_file.csv            # Check file encoding
      hexdump -C -n 32 your_file.csv   # View file header bytes
- Manual Encoding Override:
  - Modify the encoding lists in populate_table.py
  - Add specific files to appropriate encoding groups
- Data Validation:
  - Use the debug output to verify correct character encoding
  - Check for data integrity after import
  - Verify special characters are preserved
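The header-byte check can also be done from Python. The sketch below inspects the first bytes of a file for a byte order mark; `sniff_bom` and `sniff_file` are hypothetical helpers, not part of the project, and a file without a BOM simply yields `None`, so this complements rather than replaces the encoding lists.

```python
import codecs

# Longer marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def sniff_bom(header):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, encoding in _BOMS:
        if header.startswith(bom):
            return encoding
    return None


def sniff_file(path, size=4):
    """Read the first `size` bytes of `path` in binary mode and check for a BOM."""
    with open(path, "rb") as handle:
        return sniff_bom(handle.read(size))
```

The returned names ('utf-8-sig', 'utf-16', 'utf-32') are codecs that consume the BOM while decoding, so the mark does not leak into the first field.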
- Binary reading mode for better encoding handling
- StringIO for efficient in-memory processing
- Chunk-based processing for large files
- Explicit error handling with fallback options
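Binary reading combined with in-memory parsing can be sketched with the standard library alone. This is an assumption-laden illustration: the project itself uses pandas, the helper names are invented, and the tab delimiter is taken from the `sep` setting shown earlier.

```python
import csv
import io


def parse_tsv_bytes(raw, encoding, delimiter="\t"):
    """Decode raw bytes and parse them in memory into a list of row lists."""
    text = raw.decode(encoding).replace("\x00", "")
    # newline="" leaves line endings untouched so the csv module can handle them.
    buffer = io.StringIO(text, newline="")
    return list(csv.reader(buffer, delimiter=delimiter))


def read_rows(path, encoding):
    """Open the file in binary mode so no implicit decoding happens on read."""
    with open(path, "rb") as handle:
        return parse_tsv_bytes(handle.read(), encoding)


raw = "1\tRoad Bike\r\n2\tHelmet\r\n".encode("utf-16")
print(parse_tsv_bytes(raw, "utf-16"))
```

Reading bytes first keeps the decoding step explicit, which makes it straightforward to retry with another encoding when the first attempt fails.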
Located in the example/ directory:
- Data retrieval operations
- Filtering and sorting
- Joins and relationships
- Aggregation functions
Complex SQL operations demonstrating:
- Window functions
- Common Table Expressions (CTEs)
- Recursive queries
- Performance optimization techniques
- Indexing strategies
- Query optimization techniques
- Best practices for complex queries
advanced_sql_tutorial/
├── data/ # CSV data files
├── ddLsql/ # SQL schema and table definitions
├── example/ # SQL example queries
├── images/ # Documentation images
├── populate_table.py # Main data loading script
├── populate2.py # Additional loading utilities
└── readme.md # Project documentation
- populate_table.py: Main data processing and loading script
- create_schema.sql: Database schema definitions
- create_tables.sql: Table creation scripts
- example/*.sql: Example SQL queries and tutorials
- Data Processing Scripts
- Database Setup Files
- Example Queries
- Documentation
- Clone the repository
- Install PostgreSQL
- Install Python dependencies
- Create the database
- Run schema and table creation scripts
- Update database configuration in populate_table.py:

      DB_CONFIG = {
          "host": "localhost",
          "port": "5433",
          "dbname": "postgres",
          "user": "data_eng",
          "password": "12345pP"
      }
- Create database schemas:

      psql -U data_eng -d postgres -f ddLsql/create_schema.sql
- Create tables:

      psql -U data_eng -d postgres -f ddLsql/create_tables.sql
- Load data:

      python populate_table.py

Common issues and solutions:
- Encoding errors: Check file encoding and use appropriate parameters
- Memory issues: Adjust chunk size in data loading
- Permission errors: Verify database user privileges
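Connection and permission problems are often easier to diagnose when the settings in DB_CONFIG can be changed without editing the script. One possible approach, shown here only as a sketch, reads the standard libpq environment variables (PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD) and falls back to the values from the configuration step. The function `load_db_config` is hypothetical and not part of populate_table.py.

```python
import os

# Fallback values taken from the DB_CONFIG example; the password is left out
# on purpose so it is never hard-coded.
DEFAULTS = {
    "host": "localhost",
    "port": "5433",
    "dbname": "postgres",
    "user": "data_eng",
}


def load_db_config(environ=None):
    """Build a connection settings dict from environment variables."""
    env = os.environ if environ is None else environ
    return {
        "host": env.get("PGHOST", DEFAULTS["host"]),
        "port": env.get("PGPORT", DEFAULTS["port"]),
        "dbname": env.get("PGDATABASE", DEFAULTS["dbname"]),
        "user": env.get("PGUSER", DEFAULTS["user"]),
        "password": env.get("PGPASSWORD", ""),
    }


print(load_db_config({"PGPORT": "5432"})["port"])
```

The resulting dictionary has the same keys as DB_CONFIG, so it could be passed to `psycopg2.connect(**load_db_config())` in place of the hard-coded values.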
- Follow PEP 8 style guide for Python code
- Document all SQL queries
- Include tests for new features
- Fork the repository
- Create a virtual environment
- Install development dependencies
- Create a feature branch
- Test new SQL queries
- Verify data integrity
- Check performance impact
This project is licensed under the MIT License.
- Original database design by Microsoft
- AdventureWorks sample database
- Official Dataset: Microsoft AdventureWorks Sample Databases
- Download Link: AdventureWorks2019.bak
- Direct download:

      wget https://github.com/Microsoft/sql-server-samples/releases/download/adventureworks/AdventureWorks2019.bak
- Microsoft for the original AdventureWorks database (2019)
- PostgreSQL community
- Contributors to the project
- Microsoft SQL Server Samples team for maintaining the dataset