A toy SQL database engine built from scratch in Go, born out of curiosity to understand how SQL actually works under the hood. What happens when you type a query? How does a lexer break text into tokens? What does a parser do with them? How does a query planner decide what to execute? This project explores all of that by building each piece from scratch — no external SQL parser libraries, just hand-written Go code from tokenizer to disk. It's not a production database, but it's enough to really learn how the pieces fit together.
- Features
- Quick Start
- Architecture Overview
- Query Processing Pipeline
- Component Deep Dive
- Supported SQL Syntax
- Data Types
- Storage Format
- Error Handling
- Testing Strategy
- Project Structure
- Design Decisions
- Complete SQL Pipeline: Lexer → Parser → AST → Query Planner → Execution Engine → Storage
- DDL Support:
CREATE DATABASE,CREATE TABLE,USE,SHOW DATABASES,SHOW TABLES,DESCRIBE - DML Support:
INSERT,SELECT,UPDATE,DELETE - WHERE Filtering: Six comparison operators (
=,>,<,>=,<=,!=) withAND/ORlogical connectors - ORDER BY: Ascending (default) and descending sorting on any column
- LIMIT: Result set truncation
- GROUP BY: Grouping with aggregate functions (
COUNT,SUM,AVG,MIN,MAX) - HAVING: Post-aggregation filtering on grouped results
- INNER JOIN: Nested loop join with equality condition and table aliases
- Persistent Storage: JSON file-based storage that survives between sessions
- Interactive REPL: readline-powered CLI with history support
- Comprehensive Testing: Unit tests + property-based tests using the
rapidlibrary - Type System:
INT(64-bit signed integer) andSTRING(variable-length text)
- Go 1.23 or later
# Clone and build
go build -o toysql ./cmd/toysql
# Run the interactive REPL
./toysqltoysql> CREATE DATABASE company;
Database 'company' created
toysql> USE company;
Using database 'company'
toysql> CREATE TABLE employees (id INT, name STRING, age INT, department STRING);
Table 'employees' created
toysql> INSERT INTO employees VALUES (1, 'Alice', 30, 'Engineering');
1 row inserted
toysql> INSERT INTO employees VALUES (2, 'Bob', 25, 'Marketing');
1 row inserted
toysql> INSERT INTO employees VALUES (3, 'Charlie', 35, 'Engineering');
1 row inserted
toysql> SELECT name, age FROM employees WHERE department = 'Engineering' ORDER BY age DESC;
name | age
--------+----
Charlie | 35
Alice | 30
(2 row(s))
toysql> SELECT department, COUNT(*) AS cnt, AVG(age) AS avg_age FROM employees GROUP BY department;
department | cnt | avg_age
------------+-----+--------
Engineering | 2 | 32
Marketing | 1 | 25
(2 row(s))
toysql> CREATE TABLE departments (dept_name STRING, budget INT);
Table 'departments' created
toysql> INSERT INTO departments VALUES ('Engineering', 500000);
1 row inserted
toysql> INSERT INTO departments VALUES ('Marketing', 300000);
1 row inserted
toysql> INSERT INTO departments VALUES ('Sales', 200000);
1 row inserted
toysql> SELECT name, dept_name, budget FROM employees INNER JOIN departments AS d ON employees.department = d.dept_name;
name | dept_name | budget
--------+-------------+-------
Alice | Engineering | 500000
Bob | Marketing | 300000
Charlie | Engineering | 500000
(3 row(s))
toysql> exitToySQL follows a layered pipeline architecture where each component transforms its input and passes it to the next stage. This mirrors how production database engines work, making each layer independently testable and understandable.
graph TB
subgraph "User Interface"
REPL[Client REPL<br/><i>cmd/toysql</i>]
end
subgraph "Frontend Processing"
LEX[Lexer<br/><i>pkg/lexer</i>]
PAR[Parser<br/><i>pkg/parser</i>]
end
subgraph "Semantic Layer"
AST[AST<br/><i>pkg/ast</i>]
CAT[Catalog Manager<br/><i>pkg/catalog</i>]
PLN[Query Planner<br/><i>pkg/planner</i>]
end
subgraph "Execution Layer"
ENG[Execution Engine<br/><i>pkg/engine</i>]
direction LR
TS[TableScan]
FIL[Filter]
PROJ[Projection]
SORT[Sort]
LIM[Limit]
GRP[GroupBy]
JOIN[Join]
end
subgraph "Persistence Layer"
STO[Storage Engine<br/><i>pkg/storage</i>]
FS[JSON Files on Disk<br/><i>data/databases/</i>]
end
REPL -->|SQL text| LEX
LEX -->|Token stream| PAR
PAR -->|AST nodes| AST
AST --> PLN
CAT -->|Schema metadata| PLN
PLN -->|Execution plan| ENG
ENG --> TS
ENG --> FIL
ENG --> PROJ
ENG --> SORT
ENG --> LIM
ENG --> GRP
ENG --> JOIN
ENG -->|Read/Write| STO
STO -->|File I/O| FS
ENG -->|Result| REPL
Every SQL statement flows through the same pipeline. Here's the detailed data transformation at each stage:
flowchart LR
A["SQL Text<br/><code>SELECT name FROM users WHERE age > 21</code>"] --> B["Token Stream<br/><code>[SELECT, name, FROM, users, WHERE, age, >, 21]</code>"]
B --> C["AST Node<br/><code>SelectStmt{Columns, Table, Where}</code>"]
C --> D["Execution Plan<br/><code>SelectPlan{validated + schema}</code>"]
D --> E["Result<br/><code>{Columns: [name], Rows: [[Alice], [Bob]]}</code>"]
style A fill:#f9f,stroke:#333
style B fill:#bbf,stroke:#333
style C fill:#bfb,stroke:#333
style D fill:#fbb,stroke:#333
style E fill:#ff9,stroke:#333
For the most complex operation — a SELECT with all clauses — the execution engine applies operators in a specific order:
flowchart TD
A[TableScan] --> B[Join<br/><i>INNER JOIN</i>]
B --> C[Filter<br/><i>WHERE clause</i>]
C --> D[GroupBy<br/><i>GROUP BY + aggregates</i>]
D --> E[Having Filter<br/><i>HAVING clause</i>]
E --> F[Sort<br/><i>ORDER BY</i>]
F --> G[Limit<br/><i>LIMIT N</i>]
G --> H[Projection<br/><i>SELECT columns</i>]
A -.->|"All rows from table"| B
B -.->|"Combined rows from both tables"| C
C -.->|"Rows matching WHERE"| D
D -.->|"Grouped rows with aggregates"| E
E -.->|"Groups matching HAVING"| F
F -.->|"Sorted rows"| G
G -.->|"Truncated rows"| H
H -.->|"Final columns only"| I[Result to REPL]
flowchart LR
A[INSERT AST] --> B[Planner validates:<br/>- Table exists<br/>- Column count matches<br/>- Types match]
B --> C[Engine builds row map]
C --> D[Storage.InsertRow<br/>validates & appends]
D --> E[Persist JSON to disk]
E --> F["Result: 1 row inserted"]
flowchart LR
A[DDL AST<br/><i>CREATE/USE/SHOW/DESCRIBE</i>] --> B[Planner validates:<br/>- DB exists/not exists<br/>- Table exists/not exists]
B --> C{DDL Type}
C -->|CREATE DATABASE| D[Catalog + Storage<br/>create directory + metadata.json]
C -->|CREATE TABLE| E[Catalog + Storage<br/>create table JSON file]
C -->|USE| F[Catalog sets<br/>active database]
C -->|SHOW/DESCRIBE| G[Catalog returns<br/>metadata lists]
D --> H[Confirmation message]
E --> H
F --> H
G --> I[Formatted table output]
The lexer performs lexical analysis, converting raw SQL text into a stream of classified tokens. It operates character-by-character using rune-based scanning for proper Unicode support.
stateDiagram-v2
[*] --> Scanning
Scanning --> Whitespace: space/tab/newline
Whitespace --> Scanning: skip
Scanning --> Identifier: letter or _
Identifier --> Identifier: letter/digit/_
Identifier --> KeywordCheck: non-ident char
KeywordCheck --> EmitKeyword: matches keyword map
KeywordCheck --> EmitIdentifier: no match
Scanning --> IntLiteral: digit
IntLiteral --> IntLiteral: digit
IntLiteral --> EmitInt: non-digit
Scanning --> StringLiteral: single quote
StringLiteral --> StringLiteral: any char except '
StringLiteral --> EmitString: closing quote
StringLiteral --> Error: EOF without closing quote
Scanning --> TwoCharOp: >, <, !
TwoCharOp --> EmitOp: followed by =
TwoCharOp --> EmitSingleOp: not followed by =
Scanning --> EmitPunct: , ; ( ) * .
Scanning --> Error: unrecognized char
Key characteristics:
- Case-insensitive keywords:
SELECT,select,SeLeCtall produceTOKEN_SELECT - Position tracking: Every token carries 1-based line and column numbers
- Two-character operators:
>=,<=,!=are recognized as single tokens - Max limits: Identifiers ≤ 128 chars, strings ≤ 65,535 chars
- Round-trip property: Tokenize → PrettyPrint → Tokenize produces identical token sequences
Token Types:
| Category | Tokens |
|---|---|
| Keywords | SELECT, INSERT, UPDATE, DELETE, CREATE, DROP, USE, SHOW, DESCRIBE, FROM, WHERE, INTO, VALUES, SET, TABLE, DATABASE, DATABASES, TABLES, AND, OR, ORDER, BY, LIMIT, GROUP, HAVING, JOIN, INNER, ON, AS, INT, STRING, ASC, DESC, COUNT, SUM, AVG, MIN, MAX |
| Literals | IDENTIFIER, INT_LITERAL, STRING_LITERAL |
| Operators | =, >, <, >=, <=, != |
| Punctuation | ,, ;, (, ), *, . |
| Special | EOF |
The parser uses recursive descent with one token of lookahead. Each SQL statement type has a dedicated parsing function. The parser validates grammar structure but does not perform semantic validation — that responsibility belongs to the Query Planner.
flowchart TD
ENTRY[Parse Entry Point] --> DISPATCH{Leading Token?}
DISPATCH -->|CREATE| CREATE_BRANCH{Next Token?}
CREATE_BRANCH -->|DATABASE| CREATE_DB[parseCreateDatabase]
CREATE_BRANCH -->|TABLE| CREATE_TBL[parseCreateTable]
DISPATCH -->|USE| USE[parseUse]
DISPATCH -->|SHOW| SHOW_BRANCH{Next Token?}
SHOW_BRANCH -->|DATABASES| SHOW_DB[parseShowDatabases]
SHOW_BRANCH -->|TABLES| SHOW_TBL[parseShowTables]
DISPATCH -->|DESCRIBE| DESCRIBE[parseDescribe]
DISPATCH -->|INSERT| INSERT[parseInsert]
DISPATCH -->|SELECT| SELECT[parseSelect]
DISPATCH -->|DELETE| DELETE[parseDelete]
DISPATCH -->|UPDATE| UPDATE[parseUpdate]
DISPATCH -->|Other| ERROR[ParseError:<br/>unexpected token]
SELECT --> SEL_COLS[parseSelectColumns]
SEL_COLS --> SEL_FROM[expect FROM + table]
SEL_FROM --> SEL_JOIN[optional: parseJoin]
SEL_JOIN --> SEL_WHERE[optional: parseWhere]
SEL_WHERE --> SEL_GROUP[optional: parseGroupBy]
SEL_GROUP --> SEL_HAVING[optional: parseHaving]
SEL_HAVING --> SEL_ORDER[optional: parseOrderBy]
SEL_ORDER --> SEL_LIMIT[optional: parseLimit]
Parsing characteristics:
- Error recovery: No error recovery is attempted — first error halts parsing
- Trailing token detection: After a valid statement + optional semicolon, extra tokens produce an error
- Round-trip property: Parse → ASTPrint → Parse produces structurally identical ASTs
- WHERE parsing: Supports
ANDwith lower precedence than comparison operators
The AST provides a strongly-typed Go representation of parsed SQL statements. Downstream components (planner, engine) pattern-match on these types.
classDiagram
class Statement {
<<interface>>
+statementNode()
}
class Expr {
<<interface>>
+exprNode()
}
class Value {
<<interface>>
+valueNode()
}
Statement <|-- CreateDatabaseStmt
Statement <|-- UseDatabaseStmt
Statement <|-- CreateTableStmt
Statement <|-- ShowDatabasesStmt
Statement <|-- ShowTablesStmt
Statement <|-- DescribeStmt
Statement <|-- InsertStmt
Statement <|-- SelectStmt
Statement <|-- DeleteStmt
Statement <|-- UpdateStmt
Expr <|-- ComparisonExpr
Expr <|-- AndExpr
Expr <|-- OrExpr
Value <|-- IntValue
Value <|-- StringValue
Value <|-- ColumnRef
SelectStmt --> JoinClause
SelectStmt --> OrderByClause
SelectStmt --> SelectColumn
SelectStmt --> Expr
InsertStmt --> Value
UpdateStmt --> Assignment
Assignment --> Value
class SelectStmt {
+Columns []SelectColumn
+Table string
+Join *JoinClause
+Where Expr
+GroupBy []string
+Having Expr
+OrderBy *OrderByClause
+Limit *int
}
class ComparisonExpr {
+Left string
+Operator string
+Right Value
}
class JoinClause {
+Table string
+Alias string
+LeftCol string
+LeftTable string
+RightCol string
+RightTable string
}
The Catalog Manager maintains an in-memory registry of databases and table schemas, backed by the filesystem. It provides the metadata needed by the Query Planner for semantic validation.
flowchart TD
subgraph "Catalog State (In-Memory)"
DB_MAP["databases map<br/>{db_name → {table_name → []ColumnDef}}"]
ACTIVE["activeDB string<br/>(session state)"]
end
subgraph "Operations"
CREATE_DB[CreateDatabase] --> DB_MAP
USE_DB[UseDatabase] --> ACTIVE
CREATE_TBL[CreateTable] --> DB_MAP
GET_SCHEMA[GetTableSchema] --> DB_MAP
LIST_DB[ListDatabases] --> DB_MAP
LIST_TBL[ListTables] --> DB_MAP
end
subgraph "Filesystem (On Startup)"
DISK["data/databases/<br/>├── db1/<br/>│ ├── metadata.json<br/>│ ├── users.json<br/>│ └── orders.json<br/>└── db2/<br/> └── metadata.json"]
end
DISK -->|"loadFromDisk()"| DB_MAP
Key behaviors:
- Loads existing metadata from disk at initialization
- Database and table lists are always returned in alphabetical order
- Validates all preconditions: no DB selected, table already exists, DB not found, etc.
- Active database is session-scoped (in-memory only)
The Query Planner validates AST nodes against catalog metadata and produces typed execution plans. It acts as the semantic analysis layer — catching errors that the parser's grammar checks cannot detect.
flowchart TD
AST[AST Statement] --> VALIDATE{Validation Checks}
VALIDATE --> V1[Table exists?]
VALIDATE --> V2[Columns exist in schema?]
VALIDATE --> V3[Types compatible?]
VALIDATE --> V4[Database selected?]
VALIDATE --> V5[HAVING has GROUP BY?]
VALIDATE --> V6[LIMIT > 0?]
VALIDATE --> V7[No SUM/AVG on STRING?]
V1 -->|Yes| PLAN[Produce Plan Node]
V2 -->|Yes| PLAN
V3 -->|Yes| PLAN
V4 -->|Yes| PLAN
V5 -->|Yes| PLAN
V6 -->|Yes| PLAN
V7 -->|Yes| PLAN
V1 -->|No| ERR1[ErrTableNotFound]
V2 -->|No| ERR2[ErrColumnNotFound]
V3 -->|No| ERR3[ErrTypeMismatch]
V4 -->|No| ERR4[ErrNoDatabaseSelected]
V5 -->|No| ERR5[ErrHavingWithoutGroup]
V6 -->|No| ERR6[ErrLimitNotPositive]
V7 -->|No| ERR7[ErrAggregateOnString]
PLAN --> PLAN_TYPES{Plan Type}
PLAN_TYPES --> P1[CreateDatabasePlan]
PLAN_TYPES --> P2[CreateTablePlan]
PLAN_TYPES --> P3[InsertPlan<br/><i>+schema +columns +values</i>]
PLAN_TYPES --> P4[SelectPlan<br/><i>+schema +joinSchema</i>]
PLAN_TYPES --> P5[DeletePlan<br/><i>+schema +where</i>]
PLAN_TYPES --> P6[UpdatePlan<br/><i>+schema +assignments</i>]
The planner enriches plans with schema information that the execution engine needs for type-safe operations (filtering, sorting, aggregation).
The execution engine takes validated plans and executes them using a set of relational operators. Each operator transforms a row set and passes it downstream.
flowchart LR
subgraph "Operators"
direction TB
TS["**TableScan**<br/>Reads all rows via<br/>Storage.ScanRows"]
JOIN_OP["**Join**<br/>Nested loop join<br/>on equality condition"]
FILTER["**Filter**<br/>Evaluates WHERE conditions<br/>6 operators + AND/OR"]
GROUP["**GroupBy**<br/>Groups rows, computes<br/>COUNT/SUM/AVG/MIN/MAX"]
HAVING["**Having**<br/>Filters groups by<br/>aggregate condition"]
SORT_OP["**Sort**<br/>Orders by column<br/>ASC or DESC"]
LIMIT_OP["**Limit**<br/>Truncates to<br/>N rows"]
PROJ["**Projection**<br/>Selects only<br/>requested columns"]
end
TS --> JOIN_OP --> FILTER --> GROUP --> HAVING --> SORT_OP --> LIMIT_OP --> PROJ
Operator details:
| Operator | Responsibility | Algorithm |
|---|---|---|
| TableScan | Load all rows from a table | Sequential file read |
| Join | Combine rows from two tables | Nested loop (O(n×m)) |
| Filter | Apply WHERE conditions | Linear scan with expression evaluation |
| GroupBy | Group rows + compute aggregates | Hash-based grouping, preserves insertion order |
| Having | Filter groups by aggregate values | Linear scan over groups |
| Sort | Order rows by column | Go's sort.SliceStable (stable sort) |
| Limit | Truncate result set | Slice truncation |
| Projection | Select output columns | Column extraction per row |
Comparison semantics:
INTcolumns: numeric comparison (int64)STRINGcolumns: case-sensitive lexicographic (byte-by-byte) comparison- Cross-type comparisons (INT vs STRING): return a type mismatch error
The storage engine provides CRUD operations backed by JSON files on the filesystem. Each database is a directory, each table is a JSON file.
flowchart TD
subgraph "StorageEngine Interface"
CREATE_DB_S[CreateDatabase]
CREATE_TBL_S[CreateTable]
INSERT[InsertRow]
SCAN[ScanRows]
DELETE_S[DeleteRows]
UPDATE_S[UpdateRows]
end
subgraph "Filesystem Layout"
ROOT["data/"]
DBS["databases/"]
DB1["mydb/"]
META["metadata.json"]
TBL1["users.json"]
TBL2["orders.json"]
ROOT --> DBS
DBS --> DB1
DB1 --> META
DB1 --> TBL1
DB1 --> TBL2
end
CREATE_DB_S -->|"mkdir + write metadata.json"| DB1
CREATE_TBL_S -->|"write {schema, rows:[]}"| TBL1
INSERT -->|"load, validate, append, persist"| TBL1
SCAN -->|"load, normalize types, return rows"| TBL1
DELETE_S -->|"load, filter out matches, persist"| TBL1
UPDATE_S -->|"load, modify matches, persist"| TBL1
Data integrity features:
- Type validation on INSERT (column count + type match)
- JSON numbers are normalized from
float64→int64for INT columns (Go'sjson.Unmarshalquirk) - Atomic persistence: full file rewrite on each mutation
- Error reporting: specific error codes for DB not found, table not found, type mismatch
The interactive REPL ties all components together and presents results in formatted tables.
flowchart TD
START[Start Application] --> INIT[Initialize:<br/>StorageEngine → CatalogManager → QueryPlanner → ExecutionEngine]
INIT --> PROMPT[Display Prompt<br/><code>toysql></code>]
PROMPT --> READ[Read Input Line]
READ --> CHECK_EXIT{exit or quit?}
CHECK_EXIT -->|Yes| DONE[Exit with code 0]
CHECK_EXIT -->|No| CHECK_EMPTY{Empty/whitespace?}
CHECK_EMPTY -->|Yes| PROMPT
CHECK_EMPTY -->|No| PIPELINE[Execute Pipeline:<br/>Tokenize → Parse → Plan → Execute]
PIPELINE --> CHECK_ERR{Error?}
CHECK_ERR -->|Yes| DISPLAY_ERR[Display error message]
CHECK_ERR -->|No| FORMAT{Result type?}
FORMAT -->|SELECT rows| TABLE[Format as aligned table<br/>with headers + separator]
FORMAT -->|DDL success| MSG[Display confirmation message]
FORMAT -->|DML success| COUNT[Display rows affected]
TABLE --> PROMPT
MSG --> PROMPT
COUNT --> PROMPT
DISPLAY_ERR --> PROMPT
REPL features:
- readline-powered input with command history (persisted in
.toysql_history) - Ctrl+C interrupts current line (no crash), Ctrl+D exits
- Aligned table output with automatic column width calculation
- Row count footer for SELECT results (e.g.,
(3 row(s)))
-- Database management
CREATE DATABASE <name>;
USE <name>;
SHOW DATABASES;
-- Table management
CREATE TABLE <name> (<col1> <type>, <col2> <type>, ...);
SHOW TABLES;
DESCRIBE <table>;-- Insert data
INSERT INTO <table> VALUES (<val1>, <val2>, ...);
INSERT INTO <table> (<col1>, <col2>) VALUES (<val1>, <val2>);
-- Query data
SELECT * FROM <table>;
SELECT <col1>, <col2> FROM <table>;
SELECT <col1> FROM <table> WHERE <condition>;
SELECT <col1> FROM <table> WHERE <cond1> AND <cond2>;
SELECT <col1> FROM <table> ORDER BY <col> [ASC|DESC];
SELECT <col1> FROM <table> LIMIT <n>;
-- Aggregation
SELECT <col>, COUNT(*), SUM(<col>), AVG(<col>), MIN(<col>), MAX(<col>)
FROM <table> GROUP BY <col>;
SELECT <col>, COUNT(*) AS cnt FROM <table> GROUP BY <col> HAVING cnt > 5;
-- Joins
SELECT t1.name, t2.amount
FROM orders AS t1
INNER JOIN payments AS t2 ON t1.id = t2.order_id;
-- Update data
UPDATE <table> SET <col1> = <val1>, <col2> = <val2> WHERE <condition>;
-- Delete data
DELETE FROM <table>;
DELETE FROM <table> WHERE <condition>;| Operator | Description | Example |
|---|---|---|
= |
Equal to | WHERE age = 25 |
!= |
Not equal to | WHERE status != 'active' |
> |
Greater than | WHERE salary > 50000 |
< |
Less than | WHERE age < 30 |
>= |
Greater than or equal | WHERE score >= 90 |
<= |
Less than or equal | WHERE price <= 100 |
AND |
Logical AND | WHERE age > 21 AND city = 'NYC' |
OR |
Logical OR | WHERE dept = 'Eng' OR dept = 'Sales' |
ToySQL supports two column types:
| Type | Go Representation | Range | Description |
|---|---|---|---|
INT |
int64 |
0 to 9,223,372,036,854,775,807 | 64-bit signed integer |
STRING |
string |
up to 65,535 characters | Variable-length text |
Comparison rules:
- INT columns: numeric ordering
- STRING columns: case-sensitive lexicographic (byte-by-byte) ordering, where uppercase letters precede lowercase
data/
└── databases/
├── company/
│ ├── metadata.json
│ ├── employees.json
│ └── departments.json
└── analytics/
├── metadata.json
└── events.json
{
"name": "company",
"created_at": "2024-06-15T10:30:00Z"
}{
"schema": [
{"name": "id", "type": "INT"},
{"name": "name", "type": "STRING"},
{"name": "age", "type": "INT"},
{"name": "department", "type": "STRING"}
],
"rows": [
{"id": 1, "name": "Alice", "age": 30, "department": "Engineering"},
{"id": 2, "name": "Bob", "age": 25, "department": "Marketing"}
]
}ToySQL uses a unified error type (SQLError) across all pipeline stages, providing structured error codes and optional source position information.
flowchart TD
subgraph "Error Categories"
direction TB
LEX_ERR["**Lexer Errors (1xx)**<br/>• Unrecognized character<br/>• Unterminated string<br/>• Integer overflow<br/>• Identifier too long<br/>• String too long"]
PARSE_ERR["**Parser Errors (2xx)**<br/>• Unexpected token<br/>• Trailing tokens"]
PLAN_ERR["**Planner Errors (3xx)**<br/>• No database selected<br/>• Database/table not found<br/>• Database/table exists<br/>• Column not found<br/>• Type mismatch<br/>• HAVING without GROUP BY<br/>• LIMIT not positive<br/>• SUM/AVG on STRING"]
STORE_ERR["**Storage Errors (4xx)**<br/>• Database dir not found<br/>• Table file not found<br/>• Column count mismatch<br/>• Insert type mismatch<br/>• File I/O error"]
end
subgraph "Error Structure"
ERR["SQLError<br/>• Code: ErrorCode<br/>• Message: string<br/>• Pos: *Position (line, col)"]
end
LEX_ERR --> ERR
PARSE_ERR --> ERR
PLAN_ERR --> ERR
STORE_ERR --> ERR
Error propagation strategy:
- Errors at any pipeline stage immediately halt processing and bubble up to the REPL
- The REPL displays the error and continues accepting input (never crashes)
- Each statement is atomic: it either fully succeeds or has no effect
- Error messages include context (column names, table names, positions) to be human-readable and actionable
ToySQL employs a dual testing approach combining deterministic unit tests with property-based tests for comprehensive correctness validation.
flowchart TD
subgraph "Integration Tests"
E2E["End-to-end pipeline tests<br/><i>cmd/toysql/integration_test.go</i>"]
end
subgraph "Property-Based Tests (rapid)"
P1["Lexer round-trip"]
P2["Parser round-trip"]
P3["Case-insensitive keywords"]
P4["String literal extraction"]
P5["Integer literal extraction"]
P6["Whitespace independence"]
P7["Identifier classification"]
P8["Catalog schema round-trip"]
P9["Storage insert-scan round-trip"]
P10["Filter correctness"]
P11["Sort correctness"]
P12["GroupBy uniqueness"]
P13["JOIN correctness"]
end
subgraph "Unit Tests"
U1["Each error condition"]
U2["Each SQL statement type"]
U3["Each operator behavior"]
U4["Edge cases & boundary values"]
end
Property-based tests verify universal correctness properties across randomly generated inputs using the rapid library. Each property test runs a minimum of 100 iterations with generated inputs.
| Property | Description | Validates |
|---|---|---|
| Lexer round-trip | tokenize → pretty-print → tokenize = same tokens | Req 1.8, 1.9 |
| Parser round-trip | parse → print → parse = same AST | Req 2.13, 2.14 |
| Case-insensitive keywords | Any case permutation produces same token type | Req 1.7 |
| String literal extraction | Token literal = original string without quotes | Req 1.3 |
| Integer literal extraction | Token literal = string representation of integer | Req 1.4 |
| Whitespace independence | Extra whitespace doesn't change token types/literals | Req 1.5 |
| Identifier classification | Non-keyword ident strings produce IDENTIFIER token | Req 1.10, 1.12 |
| Schema round-trip | CreateTable + GetTableSchema returns same schema | Req 3.4, 3.9 |
| List sorting | Database/table lists are alphabetically sorted | Req 3.7, 3.8 |
| Insert-scan round-trip | Inserted rows are all retrievable with same values | Req 4.5, 4.8 |
| Delete correctness | Only matching rows removed, count is correct | Req 4.9 |
| Update correctness | Only matching rows modified, count is correct | Req 4.10 |
| Filter correctness | Correct rows pass each comparison operator | Req 6.1–6.10 |
| AND conjunction | AND returns intersection of individual filters | Req 6.7 |
| ORDER BY | Output is sorted by specified column/direction | Req 7.1, 7.2 |
| LIMIT | Result has at most N rows | Req 7.3 |
| GROUP BY uniqueness | One row per unique value combination | Req 7.4 |
| Aggregate correctness | COUNT/SUM/AVG/MIN/MAX compute correctly | Req 7.5 |
| JOIN correctness | Returns exactly matching row pairs | Req 8.1 |
# Run all tests
go test ./...
# Run tests with verbose output
go test -v ./...
# Run only property tests
go test -v -run Property ./...
# Run tests for a specific package
go test -v ./pkg/engine/...toysql/
├── cmd/
│ └── toysql/
│ ├── main.go # REPL entry point and pipeline wiring
│ ├── formatter.go # Result formatting (aligned tables)
│ ├── formatter_test.go # Unit tests for formatting
│ ├── formatter_property_test.go # Property tests for formatting
│ ├── integration_test.go # End-to-end pipeline tests
│ └── main_test.go # REPL behavior tests
├── pkg/
│ ├── ast/
│ │ ├── ast.go # All AST node type definitions
│ │ └── ast_test.go # AST interface tests
│ ├── catalog/
│ │ ├── catalog.go # Catalog Manager implementation
│ │ ├── catalog_test.go # Unit tests
│ │ └── catalog_property_test.go # Property tests (schema round-trip, sorting)
│ ├── engine/
│ │ ├── engine.go # Execution Engine (dispatch + INSERT/SELECT/DELETE/UPDATE)
│ │ ├── engine_test.go # Unit tests
│ │ ├── engine_property_test.go # Property tests
│ │ ├── filter.go # Filter operator (WHERE + HAVING evaluation)
│ │ ├── filter_test.go # Unit tests for filtering
│ │ ├── filter_property_test.go # Property tests for filter correctness
│ │ ├── groupby.go # GroupBy operator (aggregation)
│ │ ├── groupby_test.go # Unit tests for GROUP BY
│ │ ├── groupby_property_test.go # Property tests for grouping
│ │ ├── join.go # Join operator (nested loop INNER JOIN)
│ │ ├── join_test.go # Unit tests for JOIN
│ │ ├── join_property_test.go # Property tests for JOIN
│ │ ├── sort.go # Sort + Limit operators
│ │ ├── sort_test.go # Unit tests for ORDER BY/LIMIT
│ │ └── sort_property_test.go # Property tests for sorting
│ ├── errors/
│ │ ├── errors.go # Unified error types and error codes
│ │ └── errors_test.go # Error formatting tests
│ ├── lexer/
│ │ ├── lexer.go # Lexer (tokenizer) implementation
│ │ ├── lexer_test.go # Unit tests
│ │ ├── lexer_property_test.go # Property tests (round-trip, keywords, etc.)
│ │ ├── printer.go # Pretty printer (tokens → SQL text)
│ │ └── printer_test.go # Printer unit tests
│ ├── parser/
│ │ ├── parser.go # Parser entry point and statement dispatch
│ │ ├── parser_ddl.go # DDL parsing (CREATE, SHOW, USE, DESCRIBE)
│ │ ├── parser_ddl_test.go # DDL parser tests
│ │ ├── parser_dml.go # DML parsing (INSERT, SELECT, DELETE, UPDATE)
│ │ ├── parser_dml_test.go # DML parser tests
│ │ ├── parser_test.go # General parser tests
│ │ ├── parser_unit_test.go # Parser unit test cases
│ │ ├── parser_property_test.go # Property tests (round-trip)
│ │ ├── helpers.go # Parser helper functions (peek, advance, expect)
│ │ ├── helpers_test.go # Helper tests
│ │ ├── printer.go # AST printer (AST → tokens)
│ │ └── printer_test.go # AST printer tests
│ ├── planner/
│ │ ├── planner.go # Query Planner interface + plan types
│ │ ├── planner_ddl.go # DDL planning (validation + plan creation)
│ │ ├── planner_dml.go # DML planning (validation + plan creation)
│ │ ├── helpers.go # Planner helper functions
│ │ └── planner_test.go # Planner tests
│ └── storage/
│ ├── interface.go # StorageEngine interface definition
│ ├── json_storage.go # JSON file-based implementation
│ ├── json_storage_test.go # Unit tests
│ └── json_storage_property_test.go # Property tests (CRUD round-trips)
├── data/
│ └── .gitkeep # Placeholder for runtime database files
├── go.mod # Go module definition
├── go.sum # Dependency checksums
└── README.md # This file
Go was chosen for its simplicity, strong typing, excellent standard library, and built-in testing framework. The single-binary compilation makes distribution trivial.
JSON files make the storage layer inspectable and debuggable. You can open any table file in a text editor to see exactly what's stored. While not performant for large datasets, it perfectly serves the educational goal of transparency.
Hand-writing the lexer and parser is the core educational value. It demonstrates:
- How tokenization works (lexer state machine)
- How recursive descent parsing produces ASTs
- How grammar rules map to code structure
- How error reporting with positions works
Property-based tests catch edge cases that hand-written unit tests miss. They verify invariants like "for ALL valid inputs, X holds" rather than just "for THESE specific inputs, X holds." Key examples:
- Round-trip properties ensure lossless transformations
- Filter correctness verifies all 6 operators against arbitrary data
- Sorting properties verify output ordering for random datasets
The nested loop algorithm is O(n×m) but is the simplest to implement correctly and understand. For educational purposes, clarity trumps performance. A real database would use hash joins or merge joins for large tables.
Removing concurrency concerns keeps the codebase focused on database fundamentals. There's no need for locks, transaction isolation, or MVCC — just pure pipeline processing.
| Limitation | Reason |
|---|---|
| No indexes | Educational simplicity; all scans are sequential |
| No transactions / ACID | Single-threaded with atomic file writes is sufficient |
| No NULL support | Simplifies type system and comparison logic |
| Two types only (INT, STRING) | Minimal type system covers most concepts |
| No subqueries | Keeps parser and planner complexity manageable |
| No concurrent access | Avoids locking/MVCC complexity |
| Full table rewrite on mutation | Simple persistence; no WAL or page-based storage |
| Package | Purpose |
|---|---|
pgregory.net/rapid |
Property-based testing framework |
github.com/chzyer/readline |
REPL line editing and history |
Both are minimal, well-maintained libraries with no transitive dependency chains.
This project is built for educational purposes. Feel free to study, modify, and learn from it.