Advanced SQL Tutorial with AdventureWorks Database

Project Overview
- Purpose
- Database Description
- Project Structure
Setup Requirements
- Prerequisites
- System Requirements
- Required Software
- Python Dependencies
Database Setup
- Schema Creation
- Table Creation
- Data Import Process
Data Processing Pipeline
- CSV File Structure
- Data Transformation
- Data Loading Process
- Handling Encodings
SQL Examples
- Basic Queries
- Advanced Examples
- Performance Optimization
Project Structure
- Directory Layout
- File Descriptions
- Key Components
Usage Guide
- Installation Steps
- Configuration
- Running the Scripts
- Troubleshooting
Contributing
- Guidelines
- Development Setup
- Testing
License & Attribution
- License Information
- Data Source Credits
- Acknowledgments

1. Project Overview

Purpose

This project serves as a practical guide for learning advanced SQL concepts using the AdventureWorks database. It includes data processing scripts, database setup, and example queries for real-world business scenarios.

Database Description

AdventureWorks represents a fictional bicycle manufacturer, featuring:

Multiple business areas (Sales, Production, Purchasing, etc.)
Complex relationships between entities
Real-world business scenarios
Rich dataset for advanced SQL practice

The complete database structure is visualized in the Entity-Relationship Diagram: View Full ER Diagram

The database consists of five main schemas:

Person: Customer and person contact information
HumanResources: Employee-related data
Production: Product details and inventory
Purchasing: Vendor and purchase order information
Sales: Customer, sales orders, and store data

Table Creation Details

The tables are created using a structured SQL script (ddLsql/create_tables.sql) that:

Sets up initial requirements:

-- Enable UUID generation
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

-- Create Schemas
CREATE SCHEMA IF NOT EXISTS Person;
CREATE SCHEMA IF NOT EXISTS HumanResources;
CREATE SCHEMA IF NOT EXISTS Production;
CREATE SCHEMA IF NOT EXISTS Purchasing;
CREATE SCHEMA IF NOT EXISTS Sales;
CREATE SCHEMA IF NOT EXISTS dbo;

Implements comprehensive data types:
- SERIAL for auto-incrementing IDs
- UUID for unique identifiers
- Timestamps for date tracking
- Decimal for precise financial calculations
- Various string types (VARCHAR, TEXT)
Enforces data integrity through:
- Primary and Foreign Keys
- CHECK constraints
- DEFAULT values
- NOT NULL constraints
- UNIQUE constraints

Project Structure

The project is organized into two main sections:

Data Processing & Setup
SQL Examples & Tutorials

2. Setup Requirements

Prerequisites

Linux/Unix-based operating system
Python 3.8 or higher
PostgreSQL 12 or higher

System Requirements

Minimum 4GB RAM
2GB free disk space
Internet connection for initial setup

Required Software

PostgreSQL Server
Python 3.x
pip (Python package manager)

Python Dependencies

pandas
psycopg2-binary
numpy

3. Database Setup

Schema Creation

Database schemas are defined in ddLsql/create_schema.sql
Includes schemas for:
- Person
- Production
- Sales
- Purchasing
- HumanResources

Table Creation

Tables are created using ddLsql/create_tables.sql, which follows a systematic approach:

Schema Organization:
- Tables are grouped by business function
- Each schema represents a distinct business area
- Logical separation of concerns
Table Dependencies:
- Tables are created in order of their dependencies
- Foreign key relationships are properly established
- Referential integrity is maintained

Key Examples:

-- Person tables
CREATE TABLE Person.Person (
    BusinessEntityID INT PRIMARY KEY,
    PersonType CHAR(2) NOT NULL,
    NameStyle BOOLEAN NOT NULL DEFAULT FALSE,
    -- ... additional columns
);

-- HumanResources tables
CREATE TABLE HumanResources.Employee (
    BusinessEntityID INT PRIMARY KEY,
    NationalIDNumber VARCHAR(15) NOT NULL,
    -- ... additional columns
    FOREIGN KEY (BusinessEntityID) REFERENCES Person.Person
);

Data Integrity:
- CHECK constraints for data validation
- DEFAULT values for standard fields
- UNIQUE constraints where needed

Data Import Process

The data import process is handled by populate_table.py, which:

Reads CSV files from the data/ directory
Handles various encodings (UTF-8, UTF-16)
Manages data type conversions
Maintains referential integrity

4. Data Processing Pipeline

CSV File Structure

The data/ directory contains 68 CSV files, including:

Business entity data
Product information
Sales records
Employee data
Geographic information

Data Transformation

Handled by populate_table.py:

Automatic data type conversion
NULL value handling
Date/time format standardization
UUID processing
Boolean value normalization

Data Loading Process

The loading process:

Reads CSV files with proper encoding
Transforms data to match PostgreSQL types
Loads data in chunks for better performance
Handles foreign key constraints

Handling Encodings and Common Issues

The project includes sophisticated encoding handling for CSV files, particularly in populate_table.py. Different files require different encoding approaches:

File-Specific Encodings

UTF-16 LE Files:

utf_16_encodings_file = (
    'BusinessEntityAddress', 'Employee', 'Person', 'EmailAddress',
    'Password', 'PersonPhone', 'PhoneNumberType', 'ProductPhoto',
    'BusinessEntity', 'ProductModel', 'CountryRegionCurrency', 'Store',
    'Illustration', 'JobCandidate', 'Document', 'ProductDescription'
)

UTF-8 Files:

utf_8_encodings_file = (
    'ProductReview',
    'Product',
    'Location'
)

Common Issues and Solutions

Encoding Detection Issues:
- Problem: Character encoding mismatch causing garbled data
- Solution: Script attempts multiple encodings in order:
```
encodings_to_try = [
    'cp1252',
    'utf-8-sig',
    'utf-16',
    'latin-1'
]
```
Null Bytes in Data:
- Problem: Null bytes (\x00) causing parsing errors
- Solution: Automatic removal of null bytes:
```
cleaned_string = decoded_string.replace('\x00', '')
```
BOM (Byte Order Mark) Issues:
- Problem: BOM interfering with data parsing
- Solution: Using 'utf-8-sig' encoding where appropriate
Memory Efficiency:
- Problem: Large files causing memory issues
- Solution: Using low_memory=False in pandas for reliable parsing:
```
read_csv_params = {
    'sep': '\t',
    'header': None,
    'low_memory': False
}
```

Troubleshooting Guide

If you encounter encoding issues:

Check File Type:

file -i your_file.csv    # Check file encoding
hexdump -C -n 32 your_file.csv  # View file header bytes

Manual Encoding Override:
- Modify the encoding lists in populate_table.py
- Add specific files to appropriate encoding groups
Data Validation:
- Use the debug output to verify correct character encoding
- Check for data integrity after import
- Verify special characters are preserved

Performance Considerations

Binary reading mode for better encoding handling
StringIO for efficient in-memory processing
Chunk-based processing for large files
Explicit error handling with fallback options

5. SQL Examples

Basic Queries

Located in example/ directory:

Data retrieval operations
Filtering and sorting
Joins and relationships
Aggregation functions

Advanced Examples

Complex SQL operations demonstrating:

Window functions
Common Table Expressions (CTEs)
Recursive queries
Performance optimization techniques

Performance Optimization

Indexing strategies
Query optimization techniques
Best practices for complex queries

6. Project Structure

Directory Layout

advanced_sql_tutorial/
├── data/               # CSV data files
├── ddLsql/            # SQL schema and table definitions
├── example/           # SQL example queries
├── images/            # Documentation images
├── populate_table.py  # Main data loading script
├── populate2.py       # Additional loading utilities
└── readme.md         # Project documentation

File Descriptions

populate_table.py: Main data processing and loading script
create_schema.sql: Database schema definitions
create_tables.sql: Table creation scripts
example/*.sql: Example SQL queries and tutorials

Key Components

Data Processing Scripts
Database Setup Files
Example Queries
Documentation

7. Usage Guide

Installation Steps

Clone the repository
Install PostgreSQL
Install Python dependencies
Create the database
Run schema and table creation scripts

Configuration

Update database configuration in populate_table.py:

DB_CONFIG = {
    "host": "localhost",
    "port": "5433",
    "dbname": "postgres",
    "user": "data_eng",
    "password": "12345pP"
}

Running the Scripts

Create database schemas:

psql -U data_eng -d postgres -f ddLsql/create_schema.sql

Create tables:

psql -U data_eng -d postgres -f ddLsql/create_tables.sql

Load data:

python populate_table.py

Troubleshooting

Common issues and solutions:

Encoding errors: Check file encoding and use appropriate parameters
Memory issues: Adjust chunk size in data loading
Permission errors: Verify database user privileges

8. Contributing

Guidelines

Follow PEP 8 style guide for Python code
Document all SQL queries
Include tests for new features

Development Setup

Fork the repository
Create a virtual environment
Install development dependencies
Create a feature branch

Testing

Test new SQL queries
Verify data integrity
Check performance impact

9. License & Attribution

License Information

This project is licensed under MIT License.

Data Source Credits

Original database design by Microsoft
AdventureWorks sample database
Official Dataset: Microsoft AdventureWorks Sample Databases
Download Link: AdventureWorks2019.bak

Download Instructions

Direct download:

wget https://github.com/Microsoft/sql-server-samples/releases/download/adventureworks/AdventureWorks2019.bak

Alternative sources:
- Microsoft Docs - AdventureWorks Installation Guide
- GitHub Release Page

Acknowledgments

Microsoft for the original AdventureWorks database (2019)
PostgreSQL community
Contributors to the project
Microsoft SQL Server Samples team for maintaining the dataset

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
ddLsql		ddLsql
example		example
images		images
.gitignore		.gitignore
populate2.py		populate2.py
populate_table.py		populate_table.py
readme.md		readme.md
test_encoding_script.py		test_encoding_script.py

Folders and files

Latest commit

History

Repository files navigation

Advanced SQL Tutorial with AdventureWorks Database

Table of Contents

1. Project Overview

Purpose

Database Description

Table Creation Details

Project Structure

2. Setup Requirements

Prerequisites

System Requirements

Required Software

Python Dependencies

3. Database Setup

Schema Creation

Table Creation

Data Import Process

4. Data Processing Pipeline

CSV File Structure

Data Transformation

Data Loading Process

Handling Encodings and Common Issues

File-Specific Encodings

Common Issues and Solutions

Troubleshooting Guide

Performance Considerations

5. SQL Examples

Basic Queries

Advanced Examples

Performance Optimization

6. Project Structure

Directory Layout

File Descriptions

Key Components

7. Usage Guide

Installation Steps

Configuration

Running the Scripts

Troubleshooting

8. Contributing

Guidelines

Development Setup

Testing

9. License & Attribution

License Information

Data Source Credits

Download Instructions

Acknowledgments

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages