ToySQL — A SQL Database Engine from Scratch in Go

A toy SQL database engine built from scratch in Go, born out of curiosity to understand how SQL actually works under the hood. What happens when you type a query? How does a lexer break text into tokens? What does a parser do with them? How does a query planner decide what to execute? This project explores all of that by building each piece from scratch — no external SQL parser libraries, just hand-written Go code from tokenizer to disk. It's not a production database, but it's enough to really learn how the pieces fit together.

Features

Complete SQL Pipeline: Lexer → Parser → AST → Query Planner → Execution Engine → Storage
DDL Support: CREATE DATABASE, CREATE TABLE, USE, SHOW DATABASES, SHOW TABLES, DESCRIBE
DML Support: INSERT, SELECT, UPDATE, DELETE
WHERE Filtering: Six comparison operators (=, >, <, >=, <=, !=) with AND/OR logical connectors
ORDER BY: Ascending (default) and descending sorting on any column
LIMIT: Result set truncation
GROUP BY: Grouping with aggregate functions (COUNT, SUM, AVG, MIN, MAX)
HAVING: Post-aggregation filtering on grouped results
INNER JOIN: Nested loop join with equality condition and table aliases
Persistent Storage: JSON file-based storage that survives between sessions
Interactive REPL: readline-powered CLI with history support
Comprehensive Testing: Unit tests + property-based tests using the rapid library
Type System: INT (64-bit signed integer) and STRING (variable-length text)

Quick Start

Prerequisites

Go 1.23 or later

Build & Run

# Clone and build
go build -o toysql ./cmd/toysql

# Run the interactive REPL
./toysql

Example Session

toysql> CREATE DATABASE company;
Database 'company' created

toysql> USE company;
Using database 'company'

toysql> CREATE TABLE employees (id INT, name STRING, age INT, department STRING);
Table 'employees' created

toysql> INSERT INTO employees VALUES (1, 'Alice', 30, 'Engineering');
1 row inserted

toysql> INSERT INTO employees VALUES (2, 'Bob', 25, 'Marketing');
1 row inserted

toysql> INSERT INTO employees VALUES (3, 'Charlie', 35, 'Engineering');
1 row inserted

toysql> SELECT name, age FROM employees WHERE department = 'Engineering' ORDER BY age DESC;
name    | age
--------+----
Charlie | 35
Alice   | 30
(2 row(s))

toysql> SELECT department, COUNT(*) AS cnt, AVG(age) AS avg_age FROM employees GROUP BY department;
department  | cnt | avg_age
------------+-----+--------
Engineering | 2   | 32
Marketing   | 1   | 25
(2 row(s))

toysql> CREATE TABLE departments (dept_name STRING, budget INT);
Table 'departments' created

toysql> INSERT INTO departments VALUES ('Engineering', 500000);
1 row inserted

toysql> INSERT INTO departments VALUES ('Marketing', 300000);
1 row inserted

toysql> INSERT INTO departments VALUES ('Sales', 200000);
1 row inserted

toysql> SELECT name, dept_name, budget FROM employees INNER JOIN departments AS d ON employees.department = d.dept_name;
name    | dept_name   | budget
--------+-------------+-------
Alice   | Engineering | 500000
Bob     | Marketing   | 300000
Charlie | Engineering | 500000
(3 row(s))

toysql> exit

Architecture Overview

ToySQL follows a layered pipeline architecture where each component transforms its input and passes it to the next stage. This mirrors how production database engines work, making each layer independently testable and understandable.

graph TB
    subgraph "User Interface"
        REPL[Client REPL<br/><i>cmd/toysql</i>]
    end

    subgraph "Frontend Processing"
        LEX[Lexer<br/><i>pkg/lexer</i>]
        PAR[Parser<br/><i>pkg/parser</i>]
    end

    subgraph "Semantic Layer"
        AST[AST<br/><i>pkg/ast</i>]
        CAT[Catalog Manager<br/><i>pkg/catalog</i>]
        PLN[Query Planner<br/><i>pkg/planner</i>]
    end

    subgraph "Execution Layer"
        ENG[Execution Engine<br/><i>pkg/engine</i>]
        direction LR
        TS[TableScan]
        FIL[Filter]
        PROJ[Projection]
        SORT[Sort]
        LIM[Limit]
        GRP[GroupBy]
        JOIN[Join]
    end

    subgraph "Persistence Layer"
        STO[Storage Engine<br/><i>pkg/storage</i>]
        FS[JSON Files on Disk<br/><i>data/databases/</i>]
    end

    REPL -->|SQL text| LEX
    LEX -->|Token stream| PAR
    PAR -->|AST nodes| AST
    AST --> PLN
    CAT -->|Schema metadata| PLN
    PLN -->|Execution plan| ENG
    ENG --> TS
    ENG --> FIL
    ENG --> PROJ
    ENG --> SORT
    ENG --> LIM
    ENG --> GRP
    ENG --> JOIN
    ENG -->|Read/Write| STO
    STO -->|File I/O| FS
    ENG -->|Result| REPL

Query Processing Pipeline

Every SQL statement flows through the same pipeline. Here's the detailed data transformation at each stage:

flowchart LR
    A["SQL Text<br/><code>SELECT name FROM users WHERE age > 21</code>"] --> B["Token Stream<br/><code>[SELECT, name, FROM, users, WHERE, age, >, 21]</code>"]
    B --> C["AST Node<br/><code>SelectStmt{Columns, Table, Where}</code>"]
    C --> D["Execution Plan<br/><code>SelectPlan{validated + schema}</code>"]
    D --> E["Result<br/><code>{Columns: [name], Rows: [[Alice], [Bob]]}</code>"]

    style A fill:#f9f,stroke:#333
    style B fill:#bbf,stroke:#333
    style C fill:#bfb,stroke:#333
    style D fill:#fbb,stroke:#333
    style E fill:#ff9,stroke:#333

SELECT Query Execution Flow

For the most complex operation — a SELECT with all clauses — the execution engine applies operators in a specific order:

flowchart TD
    A[TableScan] --> B[Join<br/><i>INNER JOIN</i>]
    B --> C[Filter<br/><i>WHERE clause</i>]
    C --> D[GroupBy<br/><i>GROUP BY + aggregates</i>]
    D --> E[Having Filter<br/><i>HAVING clause</i>]
    E --> F[Sort<br/><i>ORDER BY</i>]
    F --> G[Limit<br/><i>LIMIT N</i>]
    G --> H[Projection<br/><i>SELECT columns</i>]

    A -.->|"All rows from table"| B
    B -.->|"Combined rows from both tables"| C
    C -.->|"Rows matching WHERE"| D
    D -.->|"Grouped rows with aggregates"| E
    E -.->|"Groups matching HAVING"| F
    F -.->|"Sorted rows"| G
    G -.->|"Truncated rows"| H
    H -.->|"Final columns only"| I[Result to REPL]

INSERT Execution Flow

flowchart LR
    A[INSERT AST] --> B[Planner validates:<br/>- Table exists<br/>- Column count matches<br/>- Types match]
    B --> C[Engine builds row map]
    C --> D[Storage.InsertRow<br/>validates & appends]
    D --> E[Persist JSON to disk]
    E --> F["Result: 1 row inserted"]

DDL Execution Flow

flowchart LR
    A[DDL AST<br/><i>CREATE/USE/SHOW/DESCRIBE</i>] --> B[Planner validates:<br/>- DB exists/not exists<br/>- Table exists/not exists]
    B --> C{DDL Type}
    C -->|CREATE DATABASE| D[Catalog + Storage<br/>create directory + metadata.json]
    C -->|CREATE TABLE| E[Catalog + Storage<br/>create table JSON file]
    C -->|USE| F[Catalog sets<br/>active database]
    C -->|SHOW/DESCRIBE| G[Catalog returns<br/>metadata lists]
    D --> H[Confirmation message]
    E --> H
    F --> H
    G --> I[Formatted table output]

Component Deep Dive

1. Lexer (`pkg/lexer`)

The lexer performs lexical analysis, converting raw SQL text into a stream of classified tokens. It operates character-by-character using rune-based scanning for proper Unicode support.

stateDiagram-v2
    [*] --> Scanning
    Scanning --> Whitespace: space/tab/newline
    Whitespace --> Scanning: skip

    Scanning --> Identifier: letter or _
    Identifier --> Identifier: letter/digit/_
    Identifier --> KeywordCheck: non-ident char
    KeywordCheck --> EmitKeyword: matches keyword map
    KeywordCheck --> EmitIdentifier: no match

    Scanning --> IntLiteral: digit
    IntLiteral --> IntLiteral: digit
    IntLiteral --> EmitInt: non-digit

    Scanning --> StringLiteral: single quote
    StringLiteral --> StringLiteral: any char except '
    StringLiteral --> EmitString: closing quote
    StringLiteral --> Error: EOF without closing quote

    Scanning --> TwoCharOp: >, <, !
    TwoCharOp --> EmitOp: followed by =
    TwoCharOp --> EmitSingleOp: not followed by =

    Scanning --> EmitPunct: , ; ( ) * .
    Scanning --> Error: unrecognized char

Key characteristics:

Case-insensitive keywords: SELECT, select, SeLeCt all produce TOKEN_SELECT
Position tracking: Every token carries 1-based line and column numbers
Two-character operators: >=, <=, != are recognized as single tokens
Max limits: Identifiers ≤ 128 chars, strings ≤ 65,535 chars
Round-trip property: Tokenize → PrettyPrint → Tokenize produces identical token sequences

Token Types:

Category	Tokens
Keywords	`SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP`, `USE`, `SHOW`, `DESCRIBE`, `FROM`, `WHERE`, `INTO`, `VALUES`, `SET`, `TABLE`, `DATABASE`, `DATABASES`, `TABLES`, `AND`, `OR`, `ORDER`, `BY`, `LIMIT`, `GROUP`, `HAVING`, `JOIN`, `INNER`, `ON`, `AS`, `INT`, `STRING`, `ASC`, `DESC`, `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
Literals	`IDENTIFIER`, `INT_LITERAL`, `STRING_LITERAL`
Operators	`=`, `>`, `<`, `>=`, `<=`, `!=`
Punctuation	`,`, `;`, `(`, `)`, `*`, `.`
Special	`EOF`

2. Parser (`pkg/parser`)

The parser uses recursive descent with one token of lookahead. Each SQL statement type has a dedicated parsing function. The parser validates grammar structure but does not perform semantic validation — that responsibility belongs to the Query Planner.

flowchart TD
    ENTRY[Parse Entry Point] --> DISPATCH{Leading Token?}

    DISPATCH -->|CREATE| CREATE_BRANCH{Next Token?}
    CREATE_BRANCH -->|DATABASE| CREATE_DB[parseCreateDatabase]
    CREATE_BRANCH -->|TABLE| CREATE_TBL[parseCreateTable]

    DISPATCH -->|USE| USE[parseUse]
    DISPATCH -->|SHOW| SHOW_BRANCH{Next Token?}
    SHOW_BRANCH -->|DATABASES| SHOW_DB[parseShowDatabases]
    SHOW_BRANCH -->|TABLES| SHOW_TBL[parseShowTables]

    DISPATCH -->|DESCRIBE| DESCRIBE[parseDescribe]
    DISPATCH -->|INSERT| INSERT[parseInsert]
    DISPATCH -->|SELECT| SELECT[parseSelect]
    DISPATCH -->|DELETE| DELETE[parseDelete]
    DISPATCH -->|UPDATE| UPDATE[parseUpdate]

    DISPATCH -->|Other| ERROR[ParseError:<br/>unexpected token]

    SELECT --> SEL_COLS[parseSelectColumns]
    SEL_COLS --> SEL_FROM[expect FROM + table]
    SEL_FROM --> SEL_JOIN[optional: parseJoin]
    SEL_JOIN --> SEL_WHERE[optional: parseWhere]
    SEL_WHERE --> SEL_GROUP[optional: parseGroupBy]
    SEL_GROUP --> SEL_HAVING[optional: parseHaving]
    SEL_HAVING --> SEL_ORDER[optional: parseOrderBy]
    SEL_ORDER --> SEL_LIMIT[optional: parseLimit]

Parsing characteristics:

Error recovery: No error recovery is attempted — first error halts parsing
Trailing token detection: After a valid statement + optional semicolon, extra tokens produce an error
Round-trip property: Parse → ASTPrint → Parse produces structurally identical ASTs
WHERE parsing: Supports AND with lower precedence than comparison operators

3. Abstract Syntax Tree (`pkg/ast`)

The AST provides a strongly-typed Go representation of parsed SQL statements. Downstream components (planner, engine) pattern-match on these types.

classDiagram
    class Statement {
        <<interface>>
        +statementNode()
    }

    class Expr {
        <<interface>>
        +exprNode()
    }

    class Value {
        <<interface>>
        +valueNode()
    }

    Statement <|-- CreateDatabaseStmt
    Statement <|-- UseDatabaseStmt
    Statement <|-- CreateTableStmt
    Statement <|-- ShowDatabasesStmt
    Statement <|-- ShowTablesStmt
    Statement <|-- DescribeStmt
    Statement <|-- InsertStmt
    Statement <|-- SelectStmt
    Statement <|-- DeleteStmt
    Statement <|-- UpdateStmt

    Expr <|-- ComparisonExpr
    Expr <|-- AndExpr
    Expr <|-- OrExpr

    Value <|-- IntValue
    Value <|-- StringValue
    Value <|-- ColumnRef

    SelectStmt --> JoinClause
    SelectStmt --> OrderByClause
    SelectStmt --> SelectColumn
    SelectStmt --> Expr

    InsertStmt --> Value
    UpdateStmt --> Assignment
    Assignment --> Value

    class SelectStmt {
        +Columns []SelectColumn
        +Table string
        +Join *JoinClause
        +Where Expr
        +GroupBy []string
        +Having Expr
        +OrderBy *OrderByClause
        +Limit *int
    }

    class ComparisonExpr {
        +Left string
        +Operator string
        +Right Value
    }

    class JoinClause {
        +Table string
        +Alias string
        +LeftCol string
        +LeftTable string
        +RightCol string
        +RightTable string
    }

4. Catalog Manager (`pkg/catalog`)

The Catalog Manager maintains an in-memory registry of databases and table schemas, backed by the filesystem. It provides the metadata needed by the Query Planner for semantic validation.

flowchart TD
    subgraph "Catalog State (In-Memory)"
        DB_MAP["databases map<br/>{db_name → {table_name → []ColumnDef}}"]
        ACTIVE["activeDB string<br/>(session state)"]
    end

    subgraph "Operations"
        CREATE_DB[CreateDatabase] --> DB_MAP
        USE_DB[UseDatabase] --> ACTIVE
        CREATE_TBL[CreateTable] --> DB_MAP
        GET_SCHEMA[GetTableSchema] --> DB_MAP
        LIST_DB[ListDatabases] --> DB_MAP
        LIST_TBL[ListTables] --> DB_MAP
    end

    subgraph "Filesystem (On Startup)"
        DISK["data/databases/<br/>├── db1/<br/>│   ├── metadata.json<br/>│   ├── users.json<br/>│   └── orders.json<br/>└── db2/<br/>    └── metadata.json"]
    end

    DISK -->|"loadFromDisk()"| DB_MAP

Key behaviors:

Loads existing metadata from disk at initialization
Database and table lists are always returned in alphabetical order
Validates all preconditions: no DB selected, table already exists, DB not found, etc.
Active database is session-scoped (in-memory only)

5. Query Planner (`pkg/planner`)

The Query Planner validates AST nodes against catalog metadata and produces typed execution plans. It acts as the semantic analysis layer — catching errors that the parser's grammar checks cannot detect.

flowchart TD
    AST[AST Statement] --> VALIDATE{Validation Checks}

    VALIDATE --> V1[Table exists?]
    VALIDATE --> V2[Columns exist in schema?]
    VALIDATE --> V3[Types compatible?]
    VALIDATE --> V4[Database selected?]
    VALIDATE --> V5[HAVING has GROUP BY?]
    VALIDATE --> V6[LIMIT > 0?]
    VALIDATE --> V7[No SUM/AVG on STRING?]

    V1 -->|Yes| PLAN[Produce Plan Node]
    V2 -->|Yes| PLAN
    V3 -->|Yes| PLAN
    V4 -->|Yes| PLAN
    V5 -->|Yes| PLAN
    V6 -->|Yes| PLAN
    V7 -->|Yes| PLAN

    V1 -->|No| ERR1[ErrTableNotFound]
    V2 -->|No| ERR2[ErrColumnNotFound]
    V3 -->|No| ERR3[ErrTypeMismatch]
    V4 -->|No| ERR4[ErrNoDatabaseSelected]
    V5 -->|No| ERR5[ErrHavingWithoutGroup]
    V6 -->|No| ERR6[ErrLimitNotPositive]
    V7 -->|No| ERR7[ErrAggregateOnString]

    PLAN --> PLAN_TYPES{Plan Type}
    PLAN_TYPES --> P1[CreateDatabasePlan]
    PLAN_TYPES --> P2[CreateTablePlan]
    PLAN_TYPES --> P3[InsertPlan<br/><i>+schema +columns +values</i>]
    PLAN_TYPES --> P4[SelectPlan<br/><i>+schema +joinSchema</i>]
    PLAN_TYPES --> P5[DeletePlan<br/><i>+schema +where</i>]
    PLAN_TYPES --> P6[UpdatePlan<br/><i>+schema +assignments</i>]

The planner enriches plans with schema information that the execution engine needs for type-safe operations (filtering, sorting, aggregation).

6. Execution Engine (`pkg/engine`)

The execution engine takes validated plans and executes them using a set of relational operators. Each operator transforms a row set and passes it downstream.

flowchart LR
    subgraph "Operators"
        direction TB
        TS["**TableScan**<br/>Reads all rows via<br/>Storage.ScanRows"]
        JOIN_OP["**Join**<br/>Nested loop join<br/>on equality condition"]
        FILTER["**Filter**<br/>Evaluates WHERE conditions<br/>6 operators + AND/OR"]
        GROUP["**GroupBy**<br/>Groups rows, computes<br/>COUNT/SUM/AVG/MIN/MAX"]
        HAVING["**Having**<br/>Filters groups by<br/>aggregate condition"]
        SORT_OP["**Sort**<br/>Orders by column<br/>ASC or DESC"]
        LIMIT_OP["**Limit**<br/>Truncates to<br/>N rows"]
        PROJ["**Projection**<br/>Selects only<br/>requested columns"]
    end

    TS --> JOIN_OP --> FILTER --> GROUP --> HAVING --> SORT_OP --> LIMIT_OP --> PROJ

Operator details:

Operator	Responsibility	Algorithm
TableScan	Load all rows from a table	Sequential file read
Join	Combine rows from two tables	Nested loop (O(n×m))
Filter	Apply WHERE conditions	Linear scan with expression evaluation
GroupBy	Group rows + compute aggregates	Hash-based grouping, preserves insertion order
Having	Filter groups by aggregate values	Linear scan over groups
Sort	Order rows by column	Go's `sort.SliceStable` (stable sort)
Limit	Truncate result set	Slice truncation
Projection	Select output columns	Column extraction per row

Comparison semantics:

INT columns: numeric comparison (int64)
STRING columns: case-sensitive lexicographic (byte-by-byte) comparison
Cross-type comparisons (INT vs STRING): return a type mismatch error

7. Storage Engine (`pkg/storage`)

The storage engine provides CRUD operations backed by JSON files on the filesystem. Each database is a directory, each table is a JSON file.

flowchart TD
    subgraph "StorageEngine Interface"
        CREATE_DB_S[CreateDatabase]
        CREATE_TBL_S[CreateTable]
        INSERT[InsertRow]
        SCAN[ScanRows]
        DELETE_S[DeleteRows]
        UPDATE_S[UpdateRows]
    end

    subgraph "Filesystem Layout"
        ROOT["data/"]
        DBS["databases/"]
        DB1["mydb/"]
        META["metadata.json"]
        TBL1["users.json"]
        TBL2["orders.json"]

        ROOT --> DBS
        DBS --> DB1
        DB1 --> META
        DB1 --> TBL1
        DB1 --> TBL2
    end

    CREATE_DB_S -->|"mkdir + write metadata.json"| DB1
    CREATE_TBL_S -->|"write {schema, rows:[]}"| TBL1
    INSERT -->|"load, validate, append, persist"| TBL1
    SCAN -->|"load, normalize types, return rows"| TBL1
    DELETE_S -->|"load, filter out matches, persist"| TBL1
    UPDATE_S -->|"load, modify matches, persist"| TBL1

Data integrity features:

Type validation on INSERT (column count + type match)
JSON numbers are normalized from float64 → int64 for INT columns (Go's json.Unmarshal quirk)
Atomic persistence: full file rewrite on each mutation
Error reporting: specific error codes for DB not found, table not found, type mismatch

8. Client REPL (`cmd/toysql`)

The interactive REPL ties all components together and presents results in formatted tables.

flowchart TD
    START[Start Application] --> INIT[Initialize:<br/>StorageEngine → CatalogManager → QueryPlanner → ExecutionEngine]
    INIT --> PROMPT[Display Prompt<br/><code>toysql></code>]
    PROMPT --> READ[Read Input Line]
    READ --> CHECK_EXIT{exit or quit?}
    CHECK_EXIT -->|Yes| DONE[Exit with code 0]
    CHECK_EXIT -->|No| CHECK_EMPTY{Empty/whitespace?}
    CHECK_EMPTY -->|Yes| PROMPT
    CHECK_EMPTY -->|No| PIPELINE[Execute Pipeline:<br/>Tokenize → Parse → Plan → Execute]
    PIPELINE --> CHECK_ERR{Error?}
    CHECK_ERR -->|Yes| DISPLAY_ERR[Display error message]
    CHECK_ERR -->|No| FORMAT{Result type?}
    FORMAT -->|SELECT rows| TABLE[Format as aligned table<br/>with headers + separator]
    FORMAT -->|DDL success| MSG[Display confirmation message]
    FORMAT -->|DML success| COUNT[Display rows affected]
    TABLE --> PROMPT
    MSG --> PROMPT
    COUNT --> PROMPT
    DISPLAY_ERR --> PROMPT

REPL features:

readline-powered input with command history (persisted in .toysql_history)
Ctrl+C interrupts current line (no crash), Ctrl+D exits
Aligned table output with automatic column width calculation
Row count footer for SELECT results (e.g., (3 row(s)))

Supported SQL Syntax

Data Definition Language (DDL)

-- Database management
CREATE DATABASE <name>;
USE <name>;
SHOW DATABASES;

-- Table management
CREATE TABLE <name> (<col1> <type>, <col2> <type>, ...);
SHOW TABLES;
DESCRIBE <table>;

Data Manipulation Language (DML)

-- Insert data
INSERT INTO <table> VALUES (<val1>, <val2>, ...);
INSERT INTO <table> (<col1>, <col2>) VALUES (<val1>, <val2>);

-- Query data
SELECT * FROM <table>;
SELECT <col1>, <col2> FROM <table>;
SELECT <col1> FROM <table> WHERE <condition>;
SELECT <col1> FROM <table> WHERE <cond1> AND <cond2>;
SELECT <col1> FROM <table> ORDER BY <col> [ASC|DESC];
SELECT <col1> FROM <table> LIMIT <n>;

-- Aggregation
SELECT <col>, COUNT(*), SUM(<col>), AVG(<col>), MIN(<col>), MAX(<col>)
  FROM <table> GROUP BY <col>;
SELECT <col>, COUNT(*) AS cnt FROM <table> GROUP BY <col> HAVING cnt > 5;

-- Joins
SELECT t1.name, t2.amount
  FROM orders AS t1
  INNER JOIN payments AS t2 ON t1.id = t2.order_id;

-- Update data
UPDATE <table> SET <col1> = <val1>, <col2> = <val2> WHERE <condition>;

-- Delete data
DELETE FROM <table>;
DELETE FROM <table> WHERE <condition>;

WHERE Clause Operators

Operator	Description	Example
`=`	Equal to	`WHERE age = 25`
`!=`	Not equal to	`WHERE status != 'active'`
`>`	Greater than	`WHERE salary > 50000`
`<`	Less than	`WHERE age < 30`
`>=`	Greater than or equal	`WHERE score >= 90`
`<=`	Less than or equal	`WHERE price <= 100`
`AND`	Logical AND	`WHERE age > 21 AND city = 'NYC'`
`OR`	Logical OR	`WHERE dept = 'Eng' OR dept = 'Sales'`

Data Types

ToySQL supports two column types:

Type	Go Representation	Range	Description
`INT`	`int64`	0 to 9,223,372,036,854,775,807	64-bit signed integer
`STRING`	`string`	up to 65,535 characters	Variable-length text

Comparison rules:

INT columns: numeric ordering
STRING columns: case-sensitive lexicographic (byte-by-byte) ordering, where uppercase letters precede lowercase

Storage Format

Directory Layout

data/
└── databases/
    ├── company/
    │   ├── metadata.json
    │   ├── employees.json
    │   └── departments.json
    └── analytics/
        ├── metadata.json
        └── events.json

Database Metadata (`metadata.json`)

{
  "name": "company",
  "created_at": "2024-06-15T10:30:00Z"
}

Table File (`employees.json`)

{
  "schema": [
    {"name": "id", "type": "INT"},
    {"name": "name", "type": "STRING"},
    {"name": "age", "type": "INT"},
    {"name": "department", "type": "STRING"}
  ],
  "rows": [
    {"id": 1, "name": "Alice", "age": 30, "department": "Engineering"},
    {"id": 2, "name": "Bob", "age": 25, "department": "Marketing"}
  ]
}

Error Handling

ToySQL uses a unified error type (SQLError) across all pipeline stages, providing structured error codes and optional source position information.

flowchart TD
    subgraph "Error Categories"
        direction TB
        LEX_ERR["**Lexer Errors (1xx)**<br/>• Unrecognized character<br/>• Unterminated string<br/>• Integer overflow<br/>• Identifier too long<br/>• String too long"]
        PARSE_ERR["**Parser Errors (2xx)**<br/>• Unexpected token<br/>• Trailing tokens"]
        PLAN_ERR["**Planner Errors (3xx)**<br/>• No database selected<br/>• Database/table not found<br/>• Database/table exists<br/>• Column not found<br/>• Type mismatch<br/>• HAVING without GROUP BY<br/>• LIMIT not positive<br/>• SUM/AVG on STRING"]
        STORE_ERR["**Storage Errors (4xx)**<br/>• Database dir not found<br/>• Table file not found<br/>• Column count mismatch<br/>• Insert type mismatch<br/>• File I/O error"]
    end

    subgraph "Error Structure"
        ERR["SQLError<br/>• Code: ErrorCode<br/>• Message: string<br/>• Pos: *Position (line, col)"]
    end

    LEX_ERR --> ERR
    PARSE_ERR --> ERR
    PLAN_ERR --> ERR
    STORE_ERR --> ERR

Error propagation strategy:

Errors at any pipeline stage immediately halt processing and bubble up to the REPL
The REPL displays the error and continues accepting input (never crashes)
Each statement is atomic: it either fully succeeds or has no effect
Error messages include context (column names, table names, positions) to be human-readable and actionable

Testing Strategy

ToySQL employs a dual testing approach combining deterministic unit tests with property-based tests for comprehensive correctness validation.

Testing Pyramid

flowchart TD
    subgraph "Integration Tests"
        E2E["End-to-end pipeline tests<br/><i>cmd/toysql/integration_test.go</i>"]
    end

    subgraph "Property-Based Tests (rapid)"
        P1["Lexer round-trip"]
        P2["Parser round-trip"]
        P3["Case-insensitive keywords"]
        P4["String literal extraction"]
        P5["Integer literal extraction"]
        P6["Whitespace independence"]
        P7["Identifier classification"]
        P8["Catalog schema round-trip"]
        P9["Storage insert-scan round-trip"]
        P10["Filter correctness"]
        P11["Sort correctness"]
        P12["GroupBy uniqueness"]
        P13["JOIN correctness"]
    end

    subgraph "Unit Tests"
        U1["Each error condition"]
        U2["Each SQL statement type"]
        U3["Each operator behavior"]
        U4["Edge cases & boundary values"]
    end

Property-Based Tests

Property-based tests verify universal correctness properties across randomly generated inputs using the rapid library. Each property test runs a minimum of 100 iterations with generated inputs.

Property	Description	Validates
Lexer round-trip	tokenize → pretty-print → tokenize = same tokens	Req 1.8, 1.9
Parser round-trip	parse → print → parse = same AST	Req 2.13, 2.14
Case-insensitive keywords	Any case permutation produces same token type	Req 1.7
String literal extraction	Token literal = original string without quotes	Req 1.3
Integer literal extraction	Token literal = string representation of integer	Req 1.4
Whitespace independence	Extra whitespace doesn't change token types/literals	Req 1.5
Identifier classification	Non-keyword ident strings produce IDENTIFIER token	Req 1.10, 1.12
Schema round-trip	CreateTable + GetTableSchema returns same schema	Req 3.4, 3.9
List sorting	Database/table lists are alphabetically sorted	Req 3.7, 3.8
Insert-scan round-trip	Inserted rows are all retrievable with same values	Req 4.5, 4.8
Delete correctness	Only matching rows removed, count is correct	Req 4.9
Update correctness	Only matching rows modified, count is correct	Req 4.10
Filter correctness	Correct rows pass each comparison operator	Req 6.1–6.10
AND conjunction	AND returns intersection of individual filters	Req 6.7
ORDER BY	Output is sorted by specified column/direction	Req 7.1, 7.2
LIMIT	Result has at most N rows	Req 7.3
GROUP BY uniqueness	One row per unique value combination	Req 7.4
Aggregate correctness	COUNT/SUM/AVG/MIN/MAX compute correctly	Req 7.5
JOIN correctness	Returns exactly matching row pairs	Req 8.1

Running Tests

# Run all tests
go test ./...

# Run tests with verbose output
go test -v ./...

# Run only property tests
go test -v -run Property ./...

# Run tests for a specific package
go test -v ./pkg/engine/...

Project Structure

toysql/
├── cmd/
│   └── toysql/
│       ├── main.go                    # REPL entry point and pipeline wiring
│       ├── formatter.go               # Result formatting (aligned tables)
│       ├── formatter_test.go          # Unit tests for formatting
│       ├── formatter_property_test.go # Property tests for formatting
│       ├── integration_test.go        # End-to-end pipeline tests
│       └── main_test.go              # REPL behavior tests
├── pkg/
│   ├── ast/
│   │   ├── ast.go                    # All AST node type definitions
│   │   └── ast_test.go              # AST interface tests
│   ├── catalog/
│   │   ├── catalog.go               # Catalog Manager implementation
│   │   ├── catalog_test.go          # Unit tests
│   │   └── catalog_property_test.go # Property tests (schema round-trip, sorting)
│   ├── engine/
│   │   ├── engine.go                # Execution Engine (dispatch + INSERT/SELECT/DELETE/UPDATE)
│   │   ├── engine_test.go           # Unit tests
│   │   ├── engine_property_test.go  # Property tests
│   │   ├── filter.go               # Filter operator (WHERE + HAVING evaluation)
│   │   ├── filter_test.go          # Unit tests for filtering
│   │   ├── filter_property_test.go # Property tests for filter correctness
│   │   ├── groupby.go             # GroupBy operator (aggregation)
│   │   ├── groupby_test.go        # Unit tests for GROUP BY
│   │   ├── groupby_property_test.go # Property tests for grouping
│   │   ├── join.go                # Join operator (nested loop INNER JOIN)
│   │   ├── join_test.go           # Unit tests for JOIN
│   │   ├── join_property_test.go  # Property tests for JOIN
│   │   ├── sort.go               # Sort + Limit operators
│   │   ├── sort_test.go          # Unit tests for ORDER BY/LIMIT
│   │   └── sort_property_test.go # Property tests for sorting
│   ├── errors/
│   │   ├── errors.go             # Unified error types and error codes
│   │   └── errors_test.go       # Error formatting tests
│   ├── lexer/
│   │   ├── lexer.go             # Lexer (tokenizer) implementation
│   │   ├── lexer_test.go        # Unit tests
│   │   ├── lexer_property_test.go # Property tests (round-trip, keywords, etc.)
│   │   ├── printer.go           # Pretty printer (tokens → SQL text)
│   │   └── printer_test.go     # Printer unit tests
│   ├── parser/
│   │   ├── parser.go           # Parser entry point and statement dispatch
│   │   ├── parser_ddl.go      # DDL parsing (CREATE, SHOW, USE, DESCRIBE)
│   │   ├── parser_ddl_test.go # DDL parser tests
│   │   ├── parser_dml.go      # DML parsing (INSERT, SELECT, DELETE, UPDATE)
│   │   ├── parser_dml_test.go # DML parser tests
│   │   ├── parser_test.go     # General parser tests
│   │   ├── parser_unit_test.go # Parser unit test cases
│   │   ├── parser_property_test.go # Property tests (round-trip)
│   │   ├── helpers.go         # Parser helper functions (peek, advance, expect)
│   │   ├── helpers_test.go    # Helper tests
│   │   ├── printer.go        # AST printer (AST → tokens)
│   │   └── printer_test.go   # AST printer tests
│   ├── planner/
│   │   ├── planner.go        # Query Planner interface + plan types
│   │   ├── planner_ddl.go    # DDL planning (validation + plan creation)
│   │   ├── planner_dml.go    # DML planning (validation + plan creation)
│   │   ├── helpers.go        # Planner helper functions
│   │   └── planner_test.go   # Planner tests
│   └── storage/
│       ├── interface.go           # StorageEngine interface definition
│       ├── json_storage.go        # JSON file-based implementation
│       ├── json_storage_test.go   # Unit tests
│       └── json_storage_property_test.go # Property tests (CRUD round-trips)
├── data/
│   └── .gitkeep               # Placeholder for runtime database files
├── go.mod                     # Go module definition
├── go.sum                     # Dependency checksums
└── README.md                  # This file

Design Decisions

Why Go?

Go was chosen for its simplicity, strong typing, excellent standard library, and built-in testing framework. The single-binary compilation makes distribution trivial.

Why JSON for Storage?

JSON files make the storage layer inspectable and debuggable. You can open any table file in a text editor to see exactly what's stored. While not performant for large datasets, it perfectly serves the educational goal of transparency.

Why No External SQL Parser?

Hand-writing the lexer and parser is the core educational value. It demonstrates:

How tokenization works (lexer state machine)
How recursive descent parsing produces ASTs
How grammar rules map to code structure
How error reporting with positions works

Why Property-Based Testing?

Property-based tests catch edge cases that hand-written unit tests miss. They verify invariants like "for ALL valid inputs, X holds" rather than just "for THESE specific inputs, X holds." Key examples:

Round-trip properties ensure lossless transformations
Filter correctness verifies all 6 operators against arbitrary data
Sorting properties verify output ordering for random datasets

Why Nested Loop Join?

The nested loop algorithm is O(n×m) but is the simplest to implement correctly and understand. For educational purposes, clarity trumps performance. A real database would use hash joins or merge joins for large tables.

Why Single-Threaded?

Removing concurrency concerns keeps the codebase focused on database fundamentals. There's no need for locks, transaction isolation, or MVCC — just pure pipeline processing.

Limitations

Limitation	Reason
No indexes	Educational simplicity; all scans are sequential
No transactions / ACID	Single-threaded with atomic file writes is sufficient
No NULL support	Simplifies type system and comparison logic
Two types only (INT, STRING)	Minimal type system covers most concepts
No subqueries	Keeps parser and planner complexity manageable
No concurrent access	Avoids locking/MVCC complexity
Full table rewrite on mutation	Simple persistence; no WAL or page-based storage

Dependencies

Package	Purpose
`pgregory.net/rapid`	Property-based testing framework
`github.com/chzyer/readline`	REPL line editing and history

Both are minimal, well-maintained libraries with no transitive dependency chains.

License

This project is built for educational purposes. Feel free to study, modify, and learn from it.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.idea		.idea
.kiro/specs/toy-sql-engine		.kiro/specs/toy-sql-engine
cmd/toysql		cmd/toysql
data		data
pkg		pkg
README.md		README.md
go.mod		go.mod
go.sum		go.sum
toysql		toysql

Folders and files

Latest commit

History

Repository files navigation

ToySQL — A SQL Database Engine from Scratch in Go

Table of Contents

Features

Quick Start

Prerequisites

Build & Run

Example Session

Architecture Overview

Query Processing Pipeline

SELECT Query Execution Flow

INSERT Execution Flow

DDL Execution Flow

Component Deep Dive

1. Lexer (pkg/lexer)

2. Parser (pkg/parser)

3. Abstract Syntax Tree (pkg/ast)

4. Catalog Manager (pkg/catalog)

5. Query Planner (pkg/planner)

6. Execution Engine (pkg/engine)

7. Storage Engine (pkg/storage)

8. Client REPL (cmd/toysql)

Supported SQL Syntax

Data Definition Language (DDL)

Data Manipulation Language (DML)

WHERE Clause Operators

Data Types

Storage Format

Directory Layout

Database Metadata (metadata.json)

Table File (employees.json)

Error Handling

Testing Strategy

Testing Pyramid

Property-Based Tests

Running Tests

Project Structure

Design Decisions

Why Go?

Why JSON for Storage?

Why No External SQL Parser?

Why Property-Based Testing?

Why Nested Loop Join?

Why Single-Threaded?

Limitations

Dependencies

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Lexer (`pkg/lexer`)

2. Parser (`pkg/parser`)

3. Abstract Syntax Tree (`pkg/ast`)

4. Catalog Manager (`pkg/catalog`)

5. Query Planner (`pkg/planner`)

6. Execution Engine (`pkg/engine`)

7. Storage Engine (`pkg/storage`)

8. Client REPL (`cmd/toysql`)

Database Metadata (`metadata.json`)

Table File (`employees.json`)

Packages