Skip to content

ashwaniYDV/toy-sql

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ToySQL — A SQL Database Engine from Scratch in Go

A toy SQL database engine built from scratch in Go, born out of curiosity to understand how SQL actually works under the hood. What happens when you type a query? How does a lexer break text into tokens? What does a parser do with them? How does a query planner decide what to execute? This project explores all of that by building each piece from scratch — no external SQL parser libraries, just hand-written Go code from tokenizer to disk. It's not a production database, but it's enough to really learn how the pieces fit together.

Table of Contents


Features

  • Complete SQL Pipeline: Lexer → Parser → AST → Query Planner → Execution Engine → Storage
  • DDL Support: CREATE DATABASE, CREATE TABLE, USE, SHOW DATABASES, SHOW TABLES, DESCRIBE
  • DML Support: INSERT, SELECT, UPDATE, DELETE
  • WHERE Filtering: Six comparison operators (=, >, <, >=, <=, !=) with AND/OR logical connectors
  • ORDER BY: Ascending (default) and descending sorting on any column
  • LIMIT: Result set truncation
  • GROUP BY: Grouping with aggregate functions (COUNT, SUM, AVG, MIN, MAX)
  • HAVING: Post-aggregation filtering on grouped results
  • INNER JOIN: Nested loop join with equality condition and table aliases
  • Persistent Storage: JSON file-based storage that survives between sessions
  • Interactive REPL: readline-powered CLI with history support
  • Comprehensive Testing: Unit tests + property-based tests using the rapid library
  • Type System: INT (64-bit signed integer) and STRING (variable-length text)

Quick Start

Prerequisites

  • Go 1.23 or later

Build & Run

# Clone and build
go build -o toysql ./cmd/toysql

# Run the interactive REPL
./toysql

Example Session

toysql> CREATE DATABASE company;
Database 'company' created

toysql> USE company;
Using database 'company'

toysql> CREATE TABLE employees (id INT, name STRING, age INT, department STRING);
Table 'employees' created

toysql> INSERT INTO employees VALUES (1, 'Alice', 30, 'Engineering');
1 row inserted

toysql> INSERT INTO employees VALUES (2, 'Bob', 25, 'Marketing');
1 row inserted

toysql> INSERT INTO employees VALUES (3, 'Charlie', 35, 'Engineering');
1 row inserted

toysql> SELECT name, age FROM employees WHERE department = 'Engineering' ORDER BY age DESC;
name    | age
--------+----
Charlie | 35
Alice   | 30
(2 row(s))

toysql> SELECT department, COUNT(*) AS cnt, AVG(age) AS avg_age FROM employees GROUP BY department;
department  | cnt | avg_age
------------+-----+--------
Engineering | 2   | 32
Marketing   | 1   | 25
(2 row(s))

toysql> CREATE TABLE departments (dept_name STRING, budget INT);
Table 'departments' created

toysql> INSERT INTO departments VALUES ('Engineering', 500000);
1 row inserted

toysql> INSERT INTO departments VALUES ('Marketing', 300000);
1 row inserted

toysql> INSERT INTO departments VALUES ('Sales', 200000);
1 row inserted

toysql> SELECT name, dept_name, budget FROM employees INNER JOIN departments AS d ON employees.department = d.dept_name;
name    | dept_name   | budget
--------+-------------+-------
Alice   | Engineering | 500000
Bob     | Marketing   | 300000
Charlie | Engineering | 500000
(3 row(s))

toysql> exit

Architecture Overview

ToySQL follows a layered pipeline architecture where each component transforms its input and passes it to the next stage. This mirrors how production database engines work, making each layer independently testable and understandable.

graph TB
    subgraph "User Interface"
        REPL[Client REPL<br/><i>cmd/toysql</i>]
    end

    subgraph "Frontend Processing"
        LEX[Lexer<br/><i>pkg/lexer</i>]
        PAR[Parser<br/><i>pkg/parser</i>]
    end

    subgraph "Semantic Layer"
        AST[AST<br/><i>pkg/ast</i>]
        CAT[Catalog Manager<br/><i>pkg/catalog</i>]
        PLN[Query Planner<br/><i>pkg/planner</i>]
    end

    subgraph "Execution Layer"
        ENG[Execution Engine<br/><i>pkg/engine</i>]
        direction LR
        TS[TableScan]
        FIL[Filter]
        PROJ[Projection]
        SORT[Sort]
        LIM[Limit]
        GRP[GroupBy]
        JOIN[Join]
    end

    subgraph "Persistence Layer"
        STO[Storage Engine<br/><i>pkg/storage</i>]
        FS[JSON Files on Disk<br/><i>data/databases/</i>]
    end

    REPL -->|SQL text| LEX
    LEX -->|Token stream| PAR
    PAR -->|AST nodes| AST
    AST --> PLN
    CAT -->|Schema metadata| PLN
    PLN -->|Execution plan| ENG
    ENG --> TS
    ENG --> FIL
    ENG --> PROJ
    ENG --> SORT
    ENG --> LIM
    ENG --> GRP
    ENG --> JOIN
    ENG -->|Read/Write| STO
    STO -->|File I/O| FS
    ENG -->|Result| REPL
Loading

Query Processing Pipeline

Every SQL statement flows through the same pipeline. Here's the detailed data transformation at each stage:

flowchart LR
    A["SQL Text<br/><code>SELECT name FROM users WHERE age > 21</code>"] --> B["Token Stream<br/><code>[SELECT, name, FROM, users, WHERE, age, >, 21]</code>"]
    B --> C["AST Node<br/><code>SelectStmt{Columns, Table, Where}</code>"]
    C --> D["Execution Plan<br/><code>SelectPlan{validated + schema}</code>"]
    D --> E["Result<br/><code>{Columns: [name], Rows: [[Alice], [Bob]]}</code>"]

    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bfb,stroke:#333
    style D fill:#fbb,stroke:#333
    style E fill:#ff9,stroke:#333
Loading

SELECT Query Execution Flow

For the most complex operation — a SELECT with all clauses — the execution engine applies operators in a specific order:

flowchart TD
    A[TableScan] --> B[Join<br/><i>INNER JOIN</i>]
    B --> C[Filter<br/><i>WHERE clause</i>]
    C --> D[GroupBy<br/><i>GROUP BY + aggregates</i>]
    D --> E[Having Filter<br/><i>HAVING clause</i>]
    E --> F[Sort<br/><i>ORDER BY</i>]
    F --> G[Limit<br/><i>LIMIT N</i>]
    G --> H[Projection<br/><i>SELECT columns</i>]

    A -.->|"All rows from table"| B
    B -.->|"Combined rows from both tables"| C
    C -.->|"Rows matching WHERE"| D
    D -.->|"Grouped rows with aggregates"| E
    E -.->|"Groups matching HAVING"| F
    F -.->|"Sorted rows"| G
    G -.->|"Truncated rows"| H
    H -.->|"Final columns only"| I[Result to REPL]
Loading

INSERT Execution Flow

flowchart LR
    A[INSERT AST] --> B[Planner validates:<br/>- Table exists<br/>- Column count matches<br/>- Types match]
    B --> C[Engine builds row map]
    C --> D[Storage.InsertRow<br/>validates & appends]
    D --> E[Persist JSON to disk]
    E --> F["Result: 1 row inserted"]
Loading

DDL Execution Flow

flowchart LR
    A[DDL AST<br/><i>CREATE/USE/SHOW/DESCRIBE</i>] --> B[Planner validates:<br/>- DB exists/not exists<br/>- Table exists/not exists]
    B --> C{DDL Type}
    C -->|CREATE DATABASE| D[Catalog + Storage<br/>create directory + metadata.json]
    C -->|CREATE TABLE| E[Catalog + Storage<br/>create table JSON file]
    C -->|USE| F[Catalog sets<br/>active database]
    C -->|SHOW/DESCRIBE| G[Catalog returns<br/>metadata lists]
    D --> H[Confirmation message]
    E --> H
    F --> H
    G --> I[Formatted table output]
Loading

Component Deep Dive

1. Lexer (pkg/lexer)

The lexer performs lexical analysis, converting raw SQL text into a stream of classified tokens. It operates character-by-character using rune-based scanning for proper Unicode support.

stateDiagram-v2
    [*] --> Scanning
    Scanning --> Whitespace: space/tab/newline
    Whitespace --> Scanning: skip

    Scanning --> Identifier: letter or _
    Identifier --> Identifier: letter/digit/_
    Identifier --> KeywordCheck: non-ident char
    KeywordCheck --> EmitKeyword: matches keyword map
    KeywordCheck --> EmitIdentifier: no match

    Scanning --> IntLiteral: digit
    IntLiteral --> IntLiteral: digit
    IntLiteral --> EmitInt: non-digit

    Scanning --> StringLiteral: single quote
    StringLiteral --> StringLiteral: any char except '
    StringLiteral --> EmitString: closing quote
    StringLiteral --> Error: EOF without closing quote

    Scanning --> TwoCharOp: >, <, !
    TwoCharOp --> EmitOp: followed by =
    TwoCharOp --> EmitSingleOp: not followed by =

    Scanning --> EmitPunct: , ; ( ) * .
    Scanning --> Error: unrecognized char
Loading

Key characteristics:

  • Case-insensitive keywords: SELECT, select, SeLeCt all produce TOKEN_SELECT
  • Position tracking: Every token carries 1-based line and column numbers
  • Two-character operators: >=, <=, != are recognized as single tokens
  • Max limits: Identifiers ≤ 128 chars, strings ≤ 65,535 chars
  • Round-trip property: Tokenize → PrettyPrint → Tokenize produces identical token sequences

Token Types:

Category Tokens
Keywords SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, USE, SHOW, DESCRIBE, FROM, WHERE, INTO, VALUES, SET, TABLE, DATABASE, DATABASES, TABLES, AND, OR, ORDER, BY, LIMIT, GROUP, HAVING, JOIN, INNER, ON, AS, INT, STRING, ASC, DESC, COUNT, SUM, AVG, MIN, MAX
Literals IDENTIFIER, INT_LITERAL, STRING_LITERAL
Operators =, >, <, >=, <=, !=
Punctuation ,, ;, (, ), *, .
Special EOF

2. Parser (pkg/parser)

The parser uses recursive descent with one token of lookahead. Each SQL statement type has a dedicated parsing function. The parser validates grammar structure but does not perform semantic validation — that responsibility belongs to the Query Planner.

flowchart TD
    ENTRY[Parse Entry Point] --> DISPATCH{Leading Token?}

    DISPATCH -->|CREATE| CREATE_BRANCH{Next Token?}
    CREATE_BRANCH -->|DATABASE| CREATE_DB[parseCreateDatabase]
    CREATE_BRANCH -->|TABLE| CREATE_TBL[parseCreateTable]

    DISPATCH -->|USE| USE[parseUse]
    DISPATCH -->|SHOW| SHOW_BRANCH{Next Token?}
    SHOW_BRANCH -->|DATABASES| SHOW_DB[parseShowDatabases]
    SHOW_BRANCH -->|TABLES| SHOW_TBL[parseShowTables]

    DISPATCH -->|DESCRIBE| DESCRIBE[parseDescribe]
    DISPATCH -->|INSERT| INSERT[parseInsert]
    DISPATCH -->|SELECT| SELECT[parseSelect]
    DISPATCH -->|DELETE| DELETE[parseDelete]
    DISPATCH -->|UPDATE| UPDATE[parseUpdate]

    DISPATCH -->|Other| ERROR[ParseError:<br/>unexpected token]

    SELECT --> SEL_COLS[parseSelectColumns]
    SEL_COLS --> SEL_FROM[expect FROM + table]
    SEL_FROM --> SEL_JOIN[optional: parseJoin]
    SEL_JOIN --> SEL_WHERE[optional: parseWhere]
    SEL_WHERE --> SEL_GROUP[optional: parseGroupBy]
    SEL_GROUP --> SEL_HAVING[optional: parseHaving]
    SEL_HAVING --> SEL_ORDER[optional: parseOrderBy]
    SEL_ORDER --> SEL_LIMIT[optional: parseLimit]
Loading

Parsing characteristics:

  • Error recovery: No error recovery is attempted — first error halts parsing
  • Trailing token detection: After a valid statement + optional semicolon, extra tokens produce an error
  • Round-trip property: Parse → ASTPrint → Parse produces structurally identical ASTs
  • WHERE parsing: Supports AND with lower precedence than comparison operators

3. Abstract Syntax Tree (pkg/ast)

The AST provides a strongly-typed Go representation of parsed SQL statements. Downstream components (planner, engine) pattern-match on these types.

classDiagram
    class Statement {
        <<interface>>
        +statementNode()
    }

    class Expr {
        <<interface>>
        +exprNode()
    }

    class Value {
        <<interface>>
        +valueNode()
    }

    Statement <|-- CreateDatabaseStmt
    Statement <|-- UseDatabaseStmt
    Statement <|-- CreateTableStmt
    Statement <|-- ShowDatabasesStmt
    Statement <|-- ShowTablesStmt
    Statement <|-- DescribeStmt
    Statement <|-- InsertStmt
    Statement <|-- SelectStmt
    Statement <|-- DeleteStmt
    Statement <|-- UpdateStmt

    Expr <|-- ComparisonExpr
    Expr <|-- AndExpr
    Expr <|-- OrExpr

    Value <|-- IntValue
    Value <|-- StringValue
    Value <|-- ColumnRef

    SelectStmt --> JoinClause
    SelectStmt --> OrderByClause
    SelectStmt --> SelectColumn
    SelectStmt --> Expr

    InsertStmt --> Value
    UpdateStmt --> Assignment
    Assignment --> Value

    class SelectStmt {
        +Columns []SelectColumn
        +Table string
        +Join *JoinClause
        +Where Expr
        +GroupBy []string
        +Having Expr
        +OrderBy *OrderByClause
        +Limit *int
    }

    class ComparisonExpr {
        +Left string
        +Operator string
        +Right Value
    }

    class JoinClause {
        +Table string
        +Alias string
        +LeftCol string
        +LeftTable string
        +RightCol string
        +RightTable string
    }
Loading

4. Catalog Manager (pkg/catalog)

The Catalog Manager maintains an in-memory registry of databases and table schemas, backed by the filesystem. It provides the metadata needed by the Query Planner for semantic validation.

flowchart TD
    subgraph "Catalog State (In-Memory)"
        DB_MAP["databases map<br/>{db_name → {table_name → []ColumnDef}}"]
        ACTIVE["activeDB string<br/>(session state)"]
    end

    subgraph "Operations"
        CREATE_DB[CreateDatabase] --> DB_MAP
        USE_DB[UseDatabase] --> ACTIVE
        CREATE_TBL[CreateTable] --> DB_MAP
        GET_SCHEMA[GetTableSchema] --> DB_MAP
        LIST_DB[ListDatabases] --> DB_MAP
        LIST_TBL[ListTables] --> DB_MAP
    end

    subgraph "Filesystem (On Startup)"
        DISK["data/databases/<br/>├── db1/<br/>│   ├── metadata.json<br/>│   ├── users.json<br/>│   └── orders.json<br/>└── db2/<br/>    └── metadata.json"]
    end

    DISK -->|"loadFromDisk()"| DB_MAP
Loading

Key behaviors:

  • Loads existing metadata from disk at initialization
  • Database and table lists are always returned in alphabetical order
  • Validates all preconditions: no DB selected, table already exists, DB not found, etc.
  • Active database is session-scoped (in-memory only)

5. Query Planner (pkg/planner)

The Query Planner validates AST nodes against catalog metadata and produces typed execution plans. It acts as the semantic analysis layer — catching errors that the parser's grammar checks cannot detect.

flowchart TD
    AST[AST Statement] --> VALIDATE{Validation Checks}

    VALIDATE --> V1[Table exists?]
    VALIDATE --> V2[Columns exist in schema?]
    VALIDATE --> V3[Types compatible?]
    VALIDATE --> V4[Database selected?]
    VALIDATE --> V5[HAVING has GROUP BY?]
    VALIDATE --> V6[LIMIT > 0?]
    VALIDATE --> V7[No SUM/AVG on STRING?]

    V1 -->|Yes| PLAN[Produce Plan Node]
    V2 -->|Yes| PLAN
    V3 -->|Yes| PLAN
    V4 -->|Yes| PLAN
    V5 -->|Yes| PLAN
    V6 -->|Yes| PLAN
    V7 -->|Yes| PLAN

    V1 -->|No| ERR1[ErrTableNotFound]
    V2 -->|No| ERR2[ErrColumnNotFound]
    V3 -->|No| ERR3[ErrTypeMismatch]
    V4 -->|No| ERR4[ErrNoDatabaseSelected]
    V5 -->|No| ERR5[ErrHavingWithoutGroup]
    V6 -->|No| ERR6[ErrLimitNotPositive]
    V7 -->|No| ERR7[ErrAggregateOnString]

    PLAN --> PLAN_TYPES{Plan Type}
    PLAN_TYPES --> P1[CreateDatabasePlan]
    PLAN_TYPES --> P2[CreateTablePlan]
    PLAN_TYPES --> P3[InsertPlan<br/><i>+schema +columns +values</i>]
    PLAN_TYPES --> P4[SelectPlan<br/><i>+schema +joinSchema</i>]
    PLAN_TYPES --> P5[DeletePlan<br/><i>+schema +where</i>]
    PLAN_TYPES --> P6[UpdatePlan<br/><i>+schema +assignments</i>]
Loading

The planner enriches plans with schema information that the execution engine needs for type-safe operations (filtering, sorting, aggregation).


6. Execution Engine (pkg/engine)

The execution engine takes validated plans and executes them using a set of relational operators. Each operator transforms a row set and passes it downstream.

flowchart LR
    subgraph "Operators"
        direction TB
        TS["**TableScan**<br/>Reads all rows via<br/>Storage.ScanRows"]
        JOIN_OP["**Join**<br/>Nested loop join<br/>on equality condition"]
        FILTER["**Filter**<br/>Evaluates WHERE conditions<br/>6 operators + AND/OR"]
        GROUP["**GroupBy**<br/>Groups rows, computes<br/>COUNT/SUM/AVG/MIN/MAX"]
        HAVING["**Having**<br/>Filters groups by<br/>aggregate condition"]
        SORT_OP["**Sort**<br/>Orders by column<br/>ASC or DESC"]
        LIMIT_OP["**Limit**<br/>Truncates to<br/>N rows"]
        PROJ["**Projection**<br/>Selects only<br/>requested columns"]
    end

    TS --> JOIN_OP --> FILTER --> GROUP --> HAVING --> SORT_OP --> LIMIT_OP --> PROJ
Loading

Operator details:

Operator Responsibility Algorithm
TableScan Load all rows from a table Sequential file read
Join Combine rows from two tables Nested loop (O(n×m))
Filter Apply WHERE conditions Linear scan with expression evaluation
GroupBy Group rows + compute aggregates Hash-based grouping, preserves insertion order
Having Filter groups by aggregate values Linear scan over groups
Sort Order rows by column Go's sort.SliceStable (stable sort)
Limit Truncate result set Slice truncation
Projection Select output columns Column extraction per row

Comparison semantics:

  • INT columns: numeric comparison (int64)
  • STRING columns: case-sensitive lexicographic (byte-by-byte) comparison
  • Cross-type comparisons (INT vs STRING): return a type mismatch error

7. Storage Engine (pkg/storage)

The storage engine provides CRUD operations backed by JSON files on the filesystem. Each database is a directory, each table is a JSON file.

flowchart TD
    subgraph "StorageEngine Interface"
        CREATE_DB_S[CreateDatabase]
        CREATE_TBL_S[CreateTable]
        INSERT[InsertRow]
        SCAN[ScanRows]
        DELETE_S[DeleteRows]
        UPDATE_S[UpdateRows]
    end

    subgraph "Filesystem Layout"
        ROOT["data/"]
        DBS["databases/"]
        DB1["mydb/"]
        META["metadata.json"]
        TBL1["users.json"]
        TBL2["orders.json"]

        ROOT --> DBS
        DBS --> DB1
        DB1 --> META
        DB1 --> TBL1
        DB1 --> TBL2
    end

    CREATE_DB_S -->|"mkdir + write metadata.json"| DB1
    CREATE_TBL_S -->|"write {schema, rows:[]}"| TBL1
    INSERT -->|"load, validate, append, persist"| TBL1
    SCAN -->|"load, normalize types, return rows"| TBL1
    DELETE_S -->|"load, filter out matches, persist"| TBL1
    UPDATE_S -->|"load, modify matches, persist"| TBL1
Loading

Data integrity features:

  • Type validation on INSERT (column count + type match)
  • JSON numbers are normalized from float64int64 for INT columns (Go's json.Unmarshal quirk)
  • Atomic persistence: full file rewrite on each mutation
  • Error reporting: specific error codes for DB not found, table not found, type mismatch

8. Client REPL (cmd/toysql)

The interactive REPL ties all components together and presents results in formatted tables.

flowchart TD
    START[Start Application] --> INIT[Initialize:<br/>StorageEngine → CatalogManager → QueryPlanner → ExecutionEngine]
    INIT --> PROMPT[Display Prompt<br/><code>toysql></code>]
    PROMPT --> READ[Read Input Line]
    READ --> CHECK_EXIT{exit or quit?}
    CHECK_EXIT -->|Yes| DONE[Exit with code 0]
    CHECK_EXIT -->|No| CHECK_EMPTY{Empty/whitespace?}
    CHECK_EMPTY -->|Yes| PROMPT
    CHECK_EMPTY -->|No| PIPELINE[Execute Pipeline:<br/>Tokenize → Parse → Plan → Execute]
    PIPELINE --> CHECK_ERR{Error?}
    CHECK_ERR -->|Yes| DISPLAY_ERR[Display error message]
    CHECK_ERR -->|No| FORMAT{Result type?}
    FORMAT -->|SELECT rows| TABLE[Format as aligned table<br/>with headers + separator]
    FORMAT -->|DDL success| MSG[Display confirmation message]
    FORMAT -->|DML success| COUNT[Display rows affected]
    TABLE --> PROMPT
    MSG --> PROMPT
    COUNT --> PROMPT
    DISPLAY_ERR --> PROMPT
Loading

REPL features:

  • readline-powered input with command history (persisted in .toysql_history)
  • Ctrl+C interrupts current line (no crash), Ctrl+D exits
  • Aligned table output with automatic column width calculation
  • Row count footer for SELECT results (e.g., (3 row(s)))

Supported SQL Syntax

Data Definition Language (DDL)

-- Database management
CREATE DATABASE <name>;
USE <name>;
SHOW DATABASES;

-- Table management
CREATE TABLE <name> (<col1> <type>, <col2> <type>, ...);
SHOW TABLES;
DESCRIBE <table>;

Data Manipulation Language (DML)

-- Insert data
INSERT INTO <table> VALUES (<val1>, <val2>, ...);
INSERT INTO <table> (<col1>, <col2>) VALUES (<val1>, <val2>);

-- Query data
SELECT * FROM <table>;
SELECT <col1>, <col2> FROM <table>;
SELECT <col1> FROM <table> WHERE <condition>;
SELECT <col1> FROM <table> WHERE <cond1> AND <cond2>;
SELECT <col1> FROM <table> ORDER BY <col> [ASC|DESC];
SELECT <col1> FROM <table> LIMIT <n>;

-- Aggregation
SELECT <col>, COUNT(*), SUM(<col>), AVG(<col>), MIN(<col>), MAX(<col>)
  FROM <table> GROUP BY <col>;
SELECT <col>, COUNT(*) AS cnt FROM <table> GROUP BY <col> HAVING cnt > 5;

-- Joins
SELECT t1.name, t2.amount
  FROM orders AS t1
  INNER JOIN payments AS t2 ON t1.id = t2.order_id;

-- Update data
UPDATE <table> SET <col1> = <val1>, <col2> = <val2> WHERE <condition>;

-- Delete data
DELETE FROM <table>;
DELETE FROM <table> WHERE <condition>;

WHERE Clause Operators

Operator Description Example
= Equal to WHERE age = 25
!= Not equal to WHERE status != 'active'
> Greater than WHERE salary > 50000
< Less than WHERE age < 30
>= Greater than or equal WHERE score >= 90
<= Less than or equal WHERE price <= 100
AND Logical AND WHERE age > 21 AND city = 'NYC'
OR Logical OR WHERE dept = 'Eng' OR dept = 'Sales'

Data Types

ToySQL supports two column types:

Type Go Representation Range Description
INT int64 0 to 9,223,372,036,854,775,807 64-bit signed integer
STRING string up to 65,535 characters Variable-length text

Comparison rules:

  • INT columns: numeric ordering
  • STRING columns: case-sensitive lexicographic (byte-by-byte) ordering, where uppercase letters precede lowercase

Storage Format

Directory Layout

data/
└── databases/
    ├── company/
    │   ├── metadata.json
    │   ├── employees.json
    │   └── departments.json
    └── analytics/
        ├── metadata.json
        └── events.json

Database Metadata (metadata.json)

{
  "name": "company",
  "created_at": "2024-06-15T10:30:00Z"
}

Table File (employees.json)

{
  "schema": [
    {"name": "id", "type": "INT"},
    {"name": "name", "type": "STRING"},
    {"name": "age", "type": "INT"},
    {"name": "department", "type": "STRING"}
  ],
  "rows": [
    {"id": 1, "name": "Alice", "age": 30, "department": "Engineering"},
    {"id": 2, "name": "Bob", "age": 25, "department": "Marketing"}
  ]
}

Error Handling

ToySQL uses a unified error type (SQLError) across all pipeline stages, providing structured error codes and optional source position information.

flowchart TD
    subgraph "Error Categories"
        direction TB
        LEX_ERR["**Lexer Errors (1xx)**<br/>• Unrecognized character<br/>• Unterminated string<br/>• Integer overflow<br/>• Identifier too long<br/>• String too long"]
        PARSE_ERR["**Parser Errors (2xx)**<br/>• Unexpected token<br/>• Trailing tokens"]
        PLAN_ERR["**Planner Errors (3xx)**<br/>• No database selected<br/>• Database/table not found<br/>• Database/table exists<br/>• Column not found<br/>• Type mismatch<br/>• HAVING without GROUP BY<br/>• LIMIT not positive<br/>• SUM/AVG on STRING"]
        STORE_ERR["**Storage Errors (4xx)**<br/>• Database dir not found<br/>• Table file not found<br/>• Column count mismatch<br/>• Insert type mismatch<br/>• File I/O error"]
    end

    subgraph "Error Structure"
        ERR["SQLError<br/>• Code: ErrorCode<br/>• Message: string<br/>• Pos: *Position (line, col)"]
    end

    LEX_ERR --> ERR
    PARSE_ERR --> ERR
    PLAN_ERR --> ERR
    STORE_ERR --> ERR
Loading

Error propagation strategy:

  • Errors at any pipeline stage immediately halt processing and bubble up to the REPL
  • The REPL displays the error and continues accepting input (never crashes)
  • Each statement is atomic: it either fully succeeds or has no effect
  • Error messages include context (column names, table names, positions) to be human-readable and actionable

Testing Strategy

ToySQL employs a dual testing approach combining deterministic unit tests with property-based tests for comprehensive correctness validation.

Testing Pyramid

flowchart TD
    subgraph "Integration Tests"
        E2E["End-to-end pipeline tests<br/><i>cmd/toysql/integration_test.go</i>"]
    end

    subgraph "Property-Based Tests (rapid)"
        P1["Lexer round-trip"]
        P2["Parser round-trip"]
        P3["Case-insensitive keywords"]
        P4["String literal extraction"]
        P5["Integer literal extraction"]
        P6["Whitespace independence"]
        P7["Identifier classification"]
        P8["Catalog schema round-trip"]
        P9["Storage insert-scan round-trip"]
        P10["Filter correctness"]
        P11["Sort correctness"]
        P12["GroupBy uniqueness"]
        P13["JOIN correctness"]
    end

    subgraph "Unit Tests"
        U1["Each error condition"]
        U2["Each SQL statement type"]
        U3["Each operator behavior"]
        U4["Edge cases & boundary values"]
    end
Loading

Property-Based Tests

Property-based tests verify universal correctness properties across randomly generated inputs using the rapid library. Each property test runs a minimum of 100 iterations with generated inputs.

Property Description Validates
Lexer round-trip tokenize → pretty-print → tokenize = same tokens Req 1.8, 1.9
Parser round-trip parse → print → parse = same AST Req 2.13, 2.14
Case-insensitive keywords Any case permutation produces same token type Req 1.7
String literal extraction Token literal = original string without quotes Req 1.3
Integer literal extraction Token literal = string representation of integer Req 1.4
Whitespace independence Extra whitespace doesn't change token types/literals Req 1.5
Identifier classification Non-keyword ident strings produce IDENTIFIER token Req 1.10, 1.12
Schema round-trip CreateTable + GetTableSchema returns same schema Req 3.4, 3.9
List sorting Database/table lists are alphabetically sorted Req 3.7, 3.8
Insert-scan round-trip Inserted rows are all retrievable with same values Req 4.5, 4.8
Delete correctness Only matching rows removed, count is correct Req 4.9
Update correctness Only matching rows modified, count is correct Req 4.10
Filter correctness Correct rows pass each comparison operator Req 6.1–6.10
AND conjunction AND returns intersection of individual filters Req 6.7
ORDER BY Output is sorted by specified column/direction Req 7.1, 7.2
LIMIT Result has at most N rows Req 7.3
GROUP BY uniqueness One row per unique value combination Req 7.4
Aggregate correctness COUNT/SUM/AVG/MIN/MAX compute correctly Req 7.5
JOIN correctness Returns exactly matching row pairs Req 8.1

Running Tests

# Run all tests
go test ./...

# Run tests with verbose output
go test -v ./...

# Run only property tests
go test -v -run Property ./...

# Run tests for a specific package
go test -v ./pkg/engine/...

Project Structure

toysql/
├── cmd/
│   └── toysql/
│       ├── main.go                    # REPL entry point and pipeline wiring
│       ├── formatter.go               # Result formatting (aligned tables)
│       ├── formatter_test.go          # Unit tests for formatting
│       ├── formatter_property_test.go # Property tests for formatting
│       ├── integration_test.go        # End-to-end pipeline tests
│       └── main_test.go              # REPL behavior tests
├── pkg/
│   ├── ast/
│   │   ├── ast.go                    # All AST node type definitions
│   │   └── ast_test.go              # AST interface tests
│   ├── catalog/
│   │   ├── catalog.go               # Catalog Manager implementation
│   │   ├── catalog_test.go          # Unit tests
│   │   └── catalog_property_test.go # Property tests (schema round-trip, sorting)
│   ├── engine/
│   │   ├── engine.go                # Execution Engine (dispatch + INSERT/SELECT/DELETE/UPDATE)
│   │   ├── engine_test.go           # Unit tests
│   │   ├── engine_property_test.go  # Property tests
│   │   ├── filter.go               # Filter operator (WHERE + HAVING evaluation)
│   │   ├── filter_test.go          # Unit tests for filtering
│   │   ├── filter_property_test.go # Property tests for filter correctness
│   │   ├── groupby.go             # GroupBy operator (aggregation)
│   │   ├── groupby_test.go        # Unit tests for GROUP BY
│   │   ├── groupby_property_test.go # Property tests for grouping
│   │   ├── join.go                # Join operator (nested loop INNER JOIN)
│   │   ├── join_test.go           # Unit tests for JOIN
│   │   ├── join_property_test.go  # Property tests for JOIN
│   │   ├── sort.go               # Sort + Limit operators
│   │   ├── sort_test.go          # Unit tests for ORDER BY/LIMIT
│   │   └── sort_property_test.go # Property tests for sorting
│   ├── errors/
│   │   ├── errors.go             # Unified error types and error codes
│   │   └── errors_test.go       # Error formatting tests
│   ├── lexer/
│   │   ├── lexer.go             # Lexer (tokenizer) implementation
│   │   ├── lexer_test.go        # Unit tests
│   │   ├── lexer_property_test.go # Property tests (round-trip, keywords, etc.)
│   │   ├── printer.go           # Pretty printer (tokens → SQL text)
│   │   └── printer_test.go     # Printer unit tests
│   ├── parser/
│   │   ├── parser.go           # Parser entry point and statement dispatch
│   │   ├── parser_ddl.go      # DDL parsing (CREATE, SHOW, USE, DESCRIBE)
│   │   ├── parser_ddl_test.go # DDL parser tests
│   │   ├── parser_dml.go      # DML parsing (INSERT, SELECT, DELETE, UPDATE)
│   │   ├── parser_dml_test.go # DML parser tests
│   │   ├── parser_test.go     # General parser tests
│   │   ├── parser_unit_test.go # Parser unit test cases
│   │   ├── parser_property_test.go # Property tests (round-trip)
│   │   ├── helpers.go         # Parser helper functions (peek, advance, expect)
│   │   ├── helpers_test.go    # Helper tests
│   │   ├── printer.go        # AST printer (AST → tokens)
│   │   └── printer_test.go   # AST printer tests
│   ├── planner/
│   │   ├── planner.go        # Query Planner interface + plan types
│   │   ├── planner_ddl.go    # DDL planning (validation + plan creation)
│   │   ├── planner_dml.go    # DML planning (validation + plan creation)
│   │   ├── helpers.go        # Planner helper functions
│   │   └── planner_test.go   # Planner tests
│   └── storage/
│       ├── interface.go           # StorageEngine interface definition
│       ├── json_storage.go        # JSON file-based implementation
│       ├── json_storage_test.go   # Unit tests
│       └── json_storage_property_test.go # Property tests (CRUD round-trips)
├── data/
│   └── .gitkeep               # Placeholder for runtime database files
├── go.mod                     # Go module definition
├── go.sum                     # Dependency checksums
└── README.md                  # This file

Design Decisions

Why Go?

Go was chosen for its simplicity, strong typing, excellent standard library, and built-in testing framework. The single-binary compilation makes distribution trivial.

Why JSON for Storage?

JSON files make the storage layer inspectable and debuggable. You can open any table file in a text editor to see exactly what's stored. While not performant for large datasets, it perfectly serves the educational goal of transparency.

Why No External SQL Parser?

Hand-writing the lexer and parser is the core educational value. It demonstrates:

  • How tokenization works (lexer state machine)
  • How recursive descent parsing produces ASTs
  • How grammar rules map to code structure
  • How error reporting with positions works

Why Property-Based Testing?

Property-based tests catch edge cases that hand-written unit tests miss. They verify invariants like "for ALL valid inputs, X holds" rather than just "for THESE specific inputs, X holds." Key examples:

  • Round-trip properties ensure lossless transformations
  • Filter correctness verifies all 6 operators against arbitrary data
  • Sorting properties verify output ordering for random datasets

Why Nested Loop Join?

The nested loop algorithm is O(n×m) but is the simplest to implement correctly and understand. For educational purposes, clarity trumps performance. A real database would use hash joins or merge joins for large tables.

Why Single-Threaded?

Removing concurrency concerns keeps the codebase focused on database fundamentals. There's no need for locks, transaction isolation, or MVCC — just pure pipeline processing.

Limitations

Limitation Reason
No indexes Educational simplicity; all scans are sequential
No transactions / ACID Single-threaded with atomic file writes is sufficient
No NULL support Simplifies type system and comparison logic
Two types only (INT, STRING) Minimal type system covers most concepts
No subqueries Keeps parser and planner complexity manageable
No concurrent access Avoids locking/MVCC complexity
Full table rewrite on mutation Simple persistence; no WAL or page-based storage

Dependencies

Package Purpose
pgregory.net/rapid Property-based testing framework
github.com/chzyer/readline REPL line editing and history

Both are minimal, well-maintained libraries with no transitive dependency chains.


License

This project is built for educational purposes. Feel free to study, modify, and learn from it.

About

A toy SQL database engine built from scratch in Go, born out of curiosity to understand how SQL actually works under the hood.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages