blint-db

blint-db v2 is a family of pre-compiled SQLite databases generated by building open-source projects and ingesting raw blint 3 metadata for each produced binary.

Unlike the earlier symbol-only database, v2 stores:

  • project and build provenance
  • per-binary metadata summaries
  • normalized symbols from multiple blint buckets
  • dependency evidence
  • optional disassembly fingerprints (assembly_hash, instruction_hash, function metrics)
  • compact callgraph metadata

The goal is to make blint-db a stronger package-identification corpus for blint sbom, especially for native libraries where exact package metadata is absent.

blint-db is published via GitHub Packages and Hugging Face datasets.

Each generated database is compacted before the run completes. The build pipeline checkpoints and truncates WAL state, runs VACUUM, and records SQLite page and freelist statistics in the provenance sidecar.

Why v2

blint 3 exposes much richer metadata than the old deep SBOM properties used by the original blint-db pipeline. v2 therefore ingests the primary metadata contract directly instead of flattening everything through CycloneDX properties.

That means the database can capture:

  • functions, symtab_symbols, dynamic_symbols, imports, and exports
  • build_info and security_properties
  • llvm_target_tuple, binary_type, and machine/format hints
  • disassembled_functions hashes and behavior metrics when --disassemble is enabled
  • top-level callgraph metadata without storing full assembly text

Main use cases

Use blint-db v2 to:

  • improve component identification for binaries analyzed by blint
  • build architecture-aware symbol and function-hash corpora
  • compare stripped/unstripped or versioned binary families
  • support ML or heuristic matching experiments on binary metadata

Database schema

The schema is versioned through SchemaMeta and currently targets schema version 2.

SchemaMeta

Tracks schema/version metadata.

Projects

Stores the logical source project.

Key fields:

  • name
  • purl
  • ecosystem
  • metadata_json
  • source_sbom_json

Builds

Stores one concrete build context for a project.

Key fields:

  • build_system
  • target_os
  • target_arch
  • target_triplet
  • llvm_target_tuple
  • build_mode
  • optimization
  • is_stripped
  • metadata_json

Binaries

Stores one binary artifact produced by a build.

Key fields:

  • file_path
  • relative_path
  • binary_type
  • exe_type
  • machine_type
  • llvm_target_tuple
  • file hashes (sha256, sha1, md5)
  • build_info_json
  • security_properties_json
  • callgraph_json
  • sanitized metadata_json

The stored metadata intentionally omits the heavy raw disassembled_functions payload. Hashes and metrics are normalized into FunctionFingerprints instead.

Symbols

Normalized symbol inventory across multiple blint metadata buckets.

Each row tracks:

  • name
  • source (functions, imports, symtab_symbols, dynamic_symbols, etc.)
  • optional address and size
  • is_imported
  • is_exported
  • is_function
  • is_variable
  • metadata_json
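
As an illustration of how several metadata buckets could collapse into uniform rows, here is a minimal sketch. It is not the blint-db implementation: the entry shape (a list of dicts with name, address, and size, or plain strings) and the bucket-to-flag mapping are assumptions made for the example, and normalize_symbols is a hypothetical helper.

```python
# Hypothetical mapping from a blint metadata bucket to the row flags it implies.
# The real ingestion logic may derive these flags differently.
BUCKET_FLAGS = {
    "functions": {"is_function": True},
    "symtab_symbols": {},
    "dynamic_symbols": {},
    "imports": {"is_imported": True},
    "exports": {"is_exported": True},
}


def normalize_symbols(metadata):
    """Flatten per-bucket symbol lists into uniform row dicts."""
    rows = []
    for source, flags in BUCKET_FLAGS.items():
        for entry in metadata.get(source) or []:
            # Tolerate both plain strings and dict entries (assumed shapes).
            if isinstance(entry, str):
                entry = {"name": entry}
            name = entry.get("name")
            if not name:
                continue  # skip unnamed entries
            rows.append({
                "name": name,
                "source": source,
                "address": entry.get("address"),
                "size": entry.get("size"),
                "is_imported": flags.get("is_imported", False),
                "is_exported": flags.get("is_exported", False),
                "is_function": flags.get("is_function", False),
                "is_variable": False,  # not inferred in this sketch
            })
    return rows
```

Keeping the originating bucket in source means the same name can appear once per bucket, which preserves the evidence trail instead of merging it away.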

Dependencies

Normalized dependency evidence from:

  • dynamic_entries
  • libraries
  • import_dependencies
  • go_dependencies
  • rust_dependencies
  • dotnet_dependencies

FunctionFingerprints

Stores normalized disassembly-derived function fingerprints when available.

Key fields:

  • function_key
  • name
  • address
  • rva_or_address
  • assembly_hash
  • instruction_hash
  • instruction_count
  • function_type
  • boolean feature flags (has_indirect_call, has_pac, etc.)
  • instruction_metrics_json
  • register and call-target summaries
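
To show how instruction_hash values could drive project identification, the sketch below builds a toy in-memory database and ranks projects by the number of distinct matching hashes. The table layout is deliberately simplified and partly assumed: the id, binary_id, and project_id columns are hypothetical, and the real schema links binaries to projects through Builds. For real lookups, prefer the helpers in blint_db.handlers.sqlite_handler.

```python
import sqlite3


def build_toy_db():
    """Create a tiny in-memory stand-in for the real schema (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Projects (id INTEGER PRIMARY KEY, name TEXT, purl TEXT);
        CREATE TABLE Binaries (id INTEGER PRIMARY KEY, project_id INTEGER, file_path TEXT);
        CREATE TABLE FunctionFingerprints (
            id INTEGER PRIMARY KEY, binary_id INTEGER, name TEXT, instruction_hash TEXT);
    """)
    conn.executemany("INSERT INTO Projects VALUES (?, ?, ?)", [
        (1, "demo-a", "pkg:generic/demo-a@1.0.0"),
        (2, "demo-b", "pkg:generic/demo-b@1.0.0"),
    ])
    conn.executemany("INSERT INTO Binaries VALUES (?, ?, ?)", [
        (1, 1, "libdemo-a.so"),
        (2, 2, "libdemo-b.so"),
    ])
    conn.executemany("INSERT INTO FunctionFingerprints VALUES (?, ?, ?, ?)", [
        (1, 1, "parse", "h1"),
        (2, 1, "render", "h2"),
        (3, 2, "encode", "h3"),
        (4, 2, "copy", "h1"),
    ])
    return conn


def rank_projects_by_hash(conn, instruction_hashes):
    """Return (name, purl, matched) rows, best match first."""
    hashes = sorted(set(instruction_hashes))
    if not hashes:
        return []
    placeholders = ",".join("?" for _ in hashes)
    sql = f"""
        SELECT p.name, p.purl, COUNT(DISTINCT f.instruction_hash) AS matched
        FROM FunctionFingerprints f
        JOIN Binaries b ON b.id = f.binary_id
        JOIN Projects p ON p.id = b.project_id
        WHERE f.instruction_hash IN ({placeholders})
        GROUP BY p.id
        ORDER BY matched DESC, p.name
    """
    return conn.execute(sql, hashes).fetchall()
```

Counting distinct hashes rather than rows keeps a project from scoring higher merely because the same function was ingested from several binaries.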

SourceGraphs

Registers a source-level callgraph produced by a source analyzer (for example rusi for Rust). Identified by a stable source_key so re-ingesting updates the existing row.

Key fields:

  • source_key
  • project_id (optional link to Projects)
  • name
  • purl
  • tool and tool_schema_version
  • node_count and edge_count

CallGraphNodes

Stores callgraph nodes for both source and binary graphs, discriminated by graph_kind (source or binary). owner_id is the source_graph_id for source graphs and the binary_id for binary graphs. The canon_name column holds the canonical, generic-free, hash-free function name that both sides join on, computed by blint's canonicalization.

Key fields:

  • graph_kind, owner_id, node_ref
  • canon_name, raw_name, address, kind, is_local
  • features_json

CallGraphEdges

Stores callgraph edges for both source and binary graphs (same graph_kind and owner_id discriminators), with src_ref, dst_ref, edge_type, and confidence.

These three tables let an unknown binary be matched against many stored source graphs at once. See the callgraph matching section below.
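
The join on canon_name can be sketched with plain sqlite3. This is an illustrative query over a toy corpus, not the shipped matcher: the SourceGraphs.id column and the query_names temporary table are assumptions, while graph_kind, owner_id, node_ref, and canon_name follow the field lists above.

```python
import sqlite3


def build_toy_corpus():
    """Create a small in-memory corpus with two source graphs and one binary node."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE SourceGraphs (id INTEGER PRIMARY KEY, source_key TEXT, purl TEXT);
        CREATE TABLE CallGraphNodes (
            graph_kind TEXT, owner_id INTEGER, node_ref TEXT, canon_name TEXT);
    """)
    conn.executemany("INSERT INTO SourceGraphs VALUES (?, ?, ?)", [
        (1, "demo-a@1.0.0", "pkg:cargo/demo-a@1.0.0"),
        (2, "demo-b@2.0.0", "pkg:cargo/demo-b@2.0.0"),
    ])
    conn.executemany("INSERT INTO CallGraphNodes VALUES (?, ?, ?, ?)", [
        ("source", 1, "n1", "demo_a::parse"),
        ("source", 1, "n2", "demo_a::render"),
        ("source", 1, "n3", "main"),
        ("source", 2, "n1", "main"),
        ("binary", 1, "n1", "demo_b::only_in_binary"),
    ])
    return conn


def rank_source_graphs(conn, canon_names):
    """Rank stored source graphs by the number of shared canonical names."""
    names = sorted({name for name in canon_names if name})
    if not names:
        return []
    # A temporary table avoids SQLite's bound-parameter limit for large binaries.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS query_names (canon_name TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM query_names")
    conn.executemany("INSERT INTO query_names VALUES (?)", [(n,) for n in names])
    return conn.execute("""
        SELECT g.purl, COUNT(DISTINCT n.canon_name) AS shared_functions
        FROM CallGraphNodes n
        JOIN query_names q ON q.canon_name = n.canon_name
        JOIN SourceGraphs g ON g.id = n.owner_id
        WHERE n.graph_kind = 'source'
        GROUP BY g.id
        ORDER BY shared_functions DESC, g.purl
    """).fetchall()
```

The graph_kind filter matters because owner_id values for source and binary graphs come from different tables and can collide.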

Storage and compaction policy

blint-db keeps write-time performance reasonable while still producing compact final artifacts.

Current SQLite tuning includes:

  • page_size=4096
  • auto_vacuum=INCREMENTAL
  • journal_mode=WAL
  • journal_size_limit=1048576
  • temp_store=MEMORY
  • secure_delete=OFF

At the end of each ingest or ecosystem build command, blint-db performs:

  • PRAGMA wal_checkpoint(TRUNCATE)
  • PRAGMA optimize
  • VACUUM
  • PRAGMA incremental_vacuum

This keeps the shipped database file small without forcing slow full compaction after every individual binary insert.
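
The tuning and end-of-run sequence above can be reproduced with Python's standard sqlite3 module. The following is a sketch under stated assumptions rather than the pipeline's actual code: open_tuned, compact, and run are hypothetical helper names, and the returned statistics only approximate what the provenance sidecar records.

```python
import sqlite3


def run(conn, statement):
    """Execute a statement and drain its rows so it is fully finished."""
    return conn.execute(statement).fetchall()


def open_tuned(path):
    # isolation_level=None selects autocommit mode; VACUUM cannot run
    # inside an open transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    # page_size and auto_vacuum must be set before the first table is created.
    for statement in (
        "PRAGMA page_size=4096",
        "PRAGMA auto_vacuum=INCREMENTAL",
        "PRAGMA journal_mode=WAL",
        "PRAGMA journal_size_limit=1048576",
        "PRAGMA temp_store=MEMORY",
        "PRAGMA secure_delete=OFF",
    ):
        run(conn, statement)
    return conn


def compact(conn):
    """Apply the end-of-run sequence and report page statistics."""
    for statement in (
        "PRAGMA wal_checkpoint(TRUNCATE)",
        "PRAGMA optimize",
        "VACUUM",
        "PRAGMA incremental_vacuum",
    ):
        run(conn, statement)
    return {
        "page_size": run(conn, "PRAGMA page_size")[0][0],
        "page_count": run(conn, "PRAGMA page_count")[0][0],
        "freelist_count": run(conn, "PRAGMA freelist_count")[0][0],
    }
```

Draining each result set is deliberate: PRAGMA incremental_vacuum does its work as rows are stepped, and VACUUM can fail if another statement is still in progress on the connection.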

CLI overview

The CLI has three primary modes:

1. Ingest a single binary or metadata file

blint-db --db-file blint-v2.db ingest \
  --project-name demo \
  --project-purl pkg:generic/demo@1.0.0 \
  --ecosystem manual \
  --build-system manual \
  -i /path/to/libdemo.so

Ingest an already-generated metadata JSON file instead:

blint-db --db-file blint-v2.db ingest \
  --project-name demo \
  --metadata-file /path/to/libdemo-metadata.json

Enable disassembly-aware ingestion for a live binary analysis run:

blint-db --db-file blint-v2.db --disassemble ingest \
  --project-name demo \
  -i /path/to/libdemo.so

2. Build corpus databases from package ecosystems

Meson / wrapdb:

blint-db --db-file blint-v2.db --clean-start --disassemble build-meson

For a smaller real corpus that is useful for local SBOM end-to-end checks:

export BLINT_DB_MESON_STRIP=0
blint-db --db-file ./temp/meson-small.db --clean-start --disassemble build-meson -s zlib bzip2

vcpkg:

blint-db --db-file blint-v2.db --clean-start --disassemble build-vcpkg

Homebrew (macOS):

blint-db --db-file blint-v2.db --clean-start --disassemble build-homebrew

Cargo / crates.io:

blint-db --db-file blint-v2.db --clean-start build-cargo

Conan Center / C and C++:

blint-db --db-file blint-v2.db --clean-start build-conan

The default Cargo corpus is a curated, version-pinned manifest in blint_db/inputs/cargo-crates.csv. Rows can repeat the same crate@version under different named feature_profile values when you want to capture multiple native build shapes for one release. Use -s/--select-project after the build-cargo subcommand when you want to build exact crates manually:

blint-db --db-file blint-v2.db --clean-start build-cargo -s choose@1.3.7 hexyl@0.17.0

For a small curated Homebrew subset spanning C, C++, Rust, Go, and Swift formulas:

blint-db --db-file blint-v2.db --clean-start --disassemble -f build-homebrew

For the small curated Cargo subset used by the smoke-test workflow:

blint-db --db-file blint-v2.db --clean-start -f build-cargo

The Conan Center corpus is also curated and package-oriented because Conan Center is best consumed here through a pinned manifest rather than remote-wide enumeration. The packaged manifest lives in blint_db/inputs/conan-center-packages.csv and supports repeated reference rows under named configuration selectors. The default matrix explicitly carries both shared-release and static-debug variants for the top native packages.

Build explicit Conan package variants:

blint-db --db-file blint-v2.db --clean-start build-conan -s fmt/11.2.0#shared-release zlib/1.3.1#static-debug

For the smaller curated Conan subset used by smoke workflows:

blint-db --db-file blint-v2.db --clean-start -f build-conan

Each ecosystem build also emits a provenance sidecar JSON next to the database by default (for example blint.metadata.json when the database is blint.db). That sidecar includes final table counts and SQLite size statistics after compaction. Its projects block also records selected_count, attempted_count, success_count, failure_count, status_counts, and build_failures. Each projects.build_failures[] entry is a flattened per-project failure record derived from projects.outcomes[].failure, with stable keys such as selector, project_name, ecosystem, build_system, status, stage, and message, plus optional fields like returncode and exception_type when available.
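
A short sketch of consuming the sidecar follows. The key names come from the description above; the helper names are hypothetical, and sidecar_path generalizes the single documented example (blint.db to blint.metadata.json), which is an assumption about the naming rule.

```python
import json
from collections import Counter
from pathlib import Path


def sidecar_path(db_file):
    # Assumed rule: replace the database suffix with ".metadata.json".
    return Path(db_file).with_suffix(".metadata.json")


def load_sidecar(db_file):
    """Read the provenance sidecar that sits next to a database file."""
    return json.loads(sidecar_path(db_file).read_text(encoding="utf-8"))


def summarize_sidecar(sidecar):
    """Condense the projects block into counts and a per-stage failure tally."""
    projects = sidecar.get("projects", {})
    failures = projects.get("build_failures", [])
    return {
        "selected": projects.get("selected_count", 0),
        "succeeded": projects.get("success_count", 0),
        "failed": projects.get("failure_count", 0),
        "failures_by_stage": dict(
            Counter(f.get("stage", "unknown") for f in failures)),
        "failed_selectors": sorted(f.get("selector", "") for f in failures),
    }
```

Grouping by stage is a quick way to tell whether a run failed mostly while fetching, building, or ingesting, before reading individual messages.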

Legacy aliases are still accepted for workflow continuity:

blint-db --clean-start -Z1 --disassemble
blint-db --clean-start -Z2 --disassemble

3. Build a binary-to-source callgraph corpus and match an unknown binary

blint-db can store both the binary callgraph that blint recovers and the source callgraph that a source analyzer produces, then identify an unknown binary by how many canonical function names it shares with each stored source graph. This is the corpus-scale version of the single-pair blint callgraph-match command. The current source analyzer is rusi, for Rust.

Generate a list of the most downloaded crates from crates.io and persist it as a curated-schema CSV (defaults to blint_db/inputs/cargo-top-crates.csv):

blint-db gen-cargo-top-crates --count 100 --output blint_db/inputs/cargo-top-crates.csv

Build a corpus from those crates with both the binary callgraph (requires --disassemble) and the source callgraph. The rusi command is supplied with --rusi-cmd, or through the BLINT_DB_RUSI_CMD (or RUSI_CMD) environment variable. The base command is whatever runs rusi in your environment, for example a path to a built rusi binary or cargo run -p rusi-cli -- from a rusi checkout:

export BLINT_DB_CARGO_CRATES_FILE=./blint_db/inputs/cargo-top-crates.csv
blint-db --db-file blint-v2.db --clean-start build-cargo \
  --disassemble \
  --with-source-callgraph \
  --rusi-cmd "/path/to/rusi"

--with-source-callgraph runs rusi over each crate's extracted source and ingests the source callgraph linked to the same project and package URL as the binary. It still works for library-only crates that produce no binary, since the source graph is independent of the build artifacts.
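
Because the rusi base command may be a multi-word string such as cargo run -p rusi-cli --, a wrapper script that prepares it needs to split it into an argument list. The sketch below shows one way to do that; resolve_rusi_cmd is a hypothetical helper, and the precedence (flag value, then BLINT_DB_RUSI_CMD, then RUSI_CMD) is an assumption rather than documented behavior.

```python
import os
import shlex


def resolve_rusi_cmd(cli_value=None, environ=None):
    """Return the rusi base command as an argv list, or None if unset."""
    environ = os.environ if environ is None else environ
    # Assumed precedence: explicit flag value first, then the two variables.
    raw = (
        cli_value
        or environ.get("BLINT_DB_RUSI_CMD")
        or environ.get("RUSI_CMD")
    )
    if not raw:
        return None
    # shlex.split honors shell-style quoting without invoking a shell.
    return shlex.split(raw)
```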

Identify an unknown binary against the corpus. Pass a binary to disassemble, or a pre-generated blint metadata JSON:

blint-db --db-file blint-v2.db match-callgraph --input ./some-binary --limit 10
blint-db --db-file blint-v2.db match-callgraph --metadata-file ./some-binary-metadata.json

The command prints the source graphs ranked by the number of shared canonical function names, for example:

Binary functions: 22173 (named: 22173). Top source matches:
  pkg:cargo/wasm-tools@1.247.0  shared_functions=2682 source_functions=15079 tool=rusi

Matching by name is reliable for unstripped binaries. For stripped binaries the single-pair blint callgraph-match command can additionally recover some functions by call structure. See the docs/CALLGRAPH_MATCH.md document in the blint repository for the algorithm, configuration, and limitations.

Python usage

The low-level ingestion helpers live in blint_db.ingest. Lookup helpers live in blint_db.handlers.sqlite_handler.

Example:

from blint_db.ingest import ingest_binary_file

ingest_binary_file(
    "/path/to/libdemo.so",
    db_file="blint-v2.db",
    project_name="demo",
    project_purl="pkg:generic/demo@1.0.0",
    ecosystem="manual",
    build_system="manual",
    disassemble=True,
)

Project-level lookup example:

from blint_db.handlers.sqlite_handler import (
    lookup_project_function_hash_matches,
    lookup_project_symbol_matches,
)

lookup_project_symbol_matches(
    ["SSL_read", "SSL_write", "EVP_EncryptInit_ex"],
    db_file="blint-v2.db",
)

lookup_project_function_hash_matches(
    instruction_hashes=["0123456789abcdef..."],
    db_file="blint-v2.db",
)

These project-level helpers aggregate matches across all binaries belonging to the same stored project and return project_id, project_name, and project_purl along with match counts.

Callgraph corpus example. Register a source callgraph and identify a binary against the corpus by shared canonical function names:

import json
from blint_db.handlers.blint_handler import collect_blint_metadata
from blint_db.handlers.callgraph_handler import extract_binary_callgraph
from blint_db.handlers.sqlite_handler import match_canon_names_against_source_corpus
from blint_db.ingest import ingest_source_callgraph

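# Load the source callgraph and close the file handle promptly.
with open("callgraph.json", encoding="utf-8") as fp:
    source_callgraph = json.load(fp)

ingest_source_callgraph(
    source_callgraph=source_callgraph,
    source_key="wasm-tools@1.247.0",
    db_file="blint-v2.db",
    name="wasm-tools",
    purl="pkg:cargo/wasm-tools@1.247.0",
    tool="rusi",
)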

metadata = collect_blint_metadata("/path/to/binary", disassemble=True)
canon_names = [n["canon_name"] for n in extract_binary_callgraph(metadata)["nodes"]]
matches = match_canon_names_against_source_corpus(canon_names, db_file="blint-v2.db")

match_binary_against_source_corpus(binary_id, ...) is the equivalent for a binary already ingested into the database.

Build provenance example:

from blint_db.utils.provenance import write_run_metadata

write_run_metadata(
    command="build-meson",
    db_file="blint.db",
    disassemble=True,
    selected_projects=["zlib", "bzip2"],
)

Disassembly requirements

Disassembly fingerprints depend on nyxstone in addition to blint itself.

In addition to Python dependencies, runners need an LLVM 18 toolchain available. The repository workflows install LLVM explicitly before running disassembly-enabled database builds.

Local validation

Run the test suite from the repository root:

cd /path/to/blint-db
python -m pytest -q

If you want to install the nyxstone-enabled dependency set locally, use the extended extra exposed by blint-db.

For the normal blint-db workflow you do not need a separate blint checkout: uv will install the pinned blint source declared in blint-db.

On macOS with Homebrew LLVM 18, nyxstone 0.1.1 also needs explicit compiler/linker flags in addition to NYXSTONE_LLVM_PREFIX:

cd /path/to/blint-db
export PATH="/opt/homebrew/opt/llvm@18/bin:/opt/homebrew/bin:$PATH"
export NYXSTONE_LLVM_PREFIX=/opt/homebrew/opt/llvm@18
export CC=/opt/homebrew/opt/llvm@18/bin/clang
export CXX=/opt/homebrew/opt/llvm@18/bin/clang++
export CXXFLAGS='-std=c++17'
export LDFLAGS='-L/opt/homebrew/lib -Wl,-rpath,/opt/homebrew/lib'
uv sync --upgrade --all-extras --all-groups --all-packages -p 3.13

Homebrew corpus runs use brew directly and can optionally force source builds or reinstalls:

export BLINT_DB_HOMEBREW_BUILD_FROM_SOURCE=1
export BLINT_DB_HOMEBREW_REINSTALL_EXISTING=1
uv run blint-db --clean-start --db-file ./blint.db build-homebrew -s fmt ripgrep xcbeautify

Cargo corpus runs fetch exact crates from crates.io, verify the published crate checksum, and build them in isolated Cargo home/target directories under BLINT_DB_BOOTSTRAP_PATH:

export BLINT_DB_CARGO_CRATES_FILE=./blint_db/inputs/cargo-crates.csv
export BLINT_DB_CARGO_PROFILE=release
uv run blint-db --clean-start --db-file ./blint.db build-cargo -s choose@1.3.7 b3sum@1.8.5

To relocate all curated manifests together, set BLINT_DB_INPUTS_DIR to a directory that contains cargo-crates.csv, homebrew-formulas.csv, and/or conan-center-packages.csv.

Conan corpus runs create an isolated CONAN_HOME, resolve the selected package graph, deploy package artifacts into a build-local folder, and then ingest the resulting binaries into blint-db:

export BLINT_DB_CONAN_PACKAGES_FILE=./blint_db/inputs/conan-center-packages.csv
export BLINT_DB_CONAN_REMOTE=conancenter
export BLINT_DB_CONAN_BUILD_TYPE=Release
uv run blint-db --clean-start --db-file ./blint.db build-conan -s fmt/11.2.0#shared-release libcurl/8.14.1#shared-release

Regenerating curated manifests

blint-db includes helper scripts to create curated input CSVs with configurable limits.

Homebrew formulas from formulae.brew.sh analytics:

cd /path/to/blint-db
python scripts/generate_homebrew_manifest.py --limit 25 --output ./blint_db/inputs/homebrew-formulas.csv

Cargo crates from crates.io download ranking and optional search/category filters:

cd /path/to/blint-db
python scripts/generate_cargo_manifest.py --limit 25 --category command-line-utilities --include-dev-profile --output ./blint_db/inputs/cargo-crates.csv

Conan Center references from the built-in ranked seed list, or refreshed through conan search when available:

cd /path/to/blint-db
python scripts/generate_conan_manifest.py --limit 20 --resolve-with-conan --output ./blint_db/inputs/conan-center-packages.csv

Notes on data sources:

  • Homebrew generation uses formulae.brew.sh analytics and per-formula JSON metadata.
  • Cargo generation uses crates.io ranking/search APIs.
  • Conan generation uses a deterministic ranked seed list by default, with optional conan search resolution because Conan Center does not expose a simple public “top packages” index comparable to Homebrew analytics or crates.io downloads.

Useful Cargo-specific overrides:

  • BLINT_DB_CARGO_EXECUTABLE
  • BLINT_DB_CARGO_CRATES_FILE
  • BLINT_DB_CARGO_PROFILE
  • BLINT_DB_CARGO_TARGET
  • BLINT_DB_CARGO_FETCH_ARGS
  • BLINT_DB_CARGO_BUILD_ARGS

Useful Conan-specific overrides:

  • BLINT_DB_CONAN_EXECUTABLE
  • BLINT_DB_CONAN_PACKAGES_FILE
  • BLINT_DB_CONAN_REMOTE
  • BLINT_DB_CONAN_REMOTE_URL
  • BLINT_DB_CONAN_BUILD_TYPE
  • BLINT_DB_CONAN_HOST_PROFILE
  • BLINT_DB_CONAN_BUILD_PROFILE
  • BLINT_DB_CONAN_DEPLOYER
  • BLINT_DB_CONAN_GRAPH_ARGS
  • BLINT_DB_CONAN_INSTALL_ARGS

The Cargo curated CSV accepts these columns:

  • crate
  • version
  • feature_profile (optional selector suffix used as crate@version#feature_profile)
  • profile
  • default_features
  • features
  • bins
  • package
  • target
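
As a quick way to preview which selectors a manifest would yield, the sketch below reads the CSV with the standard csv module and builds crate@version or crate@version#feature_profile strings. cargo_selectors is a hypothetical helper, and it assumes the file has a header row using the column names listed above.

```python
import csv
import io


def cargo_selectors(csv_text):
    """Build selector strings from the text of a curated Cargo manifest."""
    selectors = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        crate = (row.get("crate") or "").strip()
        version = (row.get("version") or "").strip()
        if not crate or not version:
            continue  # skip incomplete rows
        selector = f"{crate}@{version}"
        feature_profile = (row.get("feature_profile") or "").strip()
        if feature_profile:
            selector += f"#{feature_profile}"
        selectors.append(selector)
    return selectors
```

For example, a row for hexyl 0.17.0 with a feature_profile named minimal (an illustrative name) would yield hexyl@0.17.0#minimal.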

The Conan curated CSV accepts these columns:

  • reference
  • configuration (optional selector suffix used as name/version#configuration)
  • settings
  • options
  • conf
  • package_type
  • shared
  • build_type
  • target_os
  • target_arch
  • artifact_roots
  • notes
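
The name/version#configuration selector form can be split with plain string operations. This is a minimal sketch: parse_conan_selector and ConanSelector are hypothetical names, and references that carry a user/channel part or a recipe revision are deliberately not handled.

```python
from typing import NamedTuple, Optional


class ConanSelector(NamedTuple):
    name: str
    version: str
    configuration: Optional[str]


def parse_conan_selector(selector):
    """Split 'name/version#configuration' into its parts."""
    reference, _, configuration = selector.strip().partition("#")
    name, sep, version = reference.partition("/")
    if not sep or not name or not version:
        raise ValueError(f"expected name/version[#configuration]: {selector!r}")
    # An absent or empty suffix means no named configuration was requested.
    return ConanSelector(name, version, configuration or None)
```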

The curated Homebrew smoke subset is defined in blint_db/inputs/homebrew-formulas.csv. The curated Conan Center smoke/full subset is defined in blint_db/inputs/conan-center-packages.csv.

If you explicitly want to validate blint-db against a checked-out local blint workspace instead of the pinned source, install blint-db first and then overlay the local repository editably:

cd /path/to/blint-db
uv sync --all-extras --dev
uv pip install --python .venv/bin/python --editable /path/to/blint

Real end-to-end validation with blint

A reproducible local check is available directly in the blint repo:

cd /path/to/blint
python tests/scripts/validate_blintdb_small_corpus.py --ecosystems meson
python tests/scripts/validate_blintdb_small_corpus.py --ecosystems vcpkg
python tests/scripts/validate_blintdb_small_corpus.py --ecosystems homebrew

The script reads tests/data/blintdb-small-corpus.json, builds the requested databases, retains the relevant artifacts, runs blint sbom in normal and deep modes, and writes a JSON summary under .tmp-blintdb-small-corpus/.

If you want to run the Meson flow manually, the equivalent steps are:

cd /path/to/blint-db
export BLINT_DB_MESON_STRIP=0
python -m blint_db.cli --clean-start --db-file ./temp/meson-small.db --disassemble build-meson -s zlib bzip2

Build artifacts are removed by the CLI after ingestion by default. To keep the Meson/vcpkg build outputs for analyst-side validation or debugging, pass --retain-build-artifacts. Otherwise, rebuild the selected projects in place when you need the binaries:

cd /path/to/blint-db
python - <<'PY'
from blint_db.handlers.language_handlers.meson_handler import find_meson_executables, meson_build

for project_name in ("zlib", "bzip2"):
    meson_build(project_name)
    for binary in find_meson_executables(project_name):
        print(binary)
PY

Then point blint at the generated database:

cd /path/to/blint
mkdir -p ./.tmp-meson-db
cp /path/to/blint-db/temp/meson-small.db ./.tmp-meson-db/blint.db
export BLINTDB_HOME=$PWD/.tmp-meson-db
export USE_BLINTDB=true

poetry run blint sbom -i /path/to/libz.1.dylib --use-blintdb --stdout
poetry run blint sbom -i /path/to/libz.1.dylib --use-blintdb --deep --stdout

For binaries that were built into the corpus itself, expect near-exact package identification in both modes. In deep mode, the matched component should also carry non-zero internal:blintdb_matched_instruction_hash_count or internal:blintdb_matched_assembly_hash_count evidence.
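
The deep-mode expectation can be checked mechanically once the SBOM has been saved as JSON. The sketch below assumes the evidence is exposed as CycloneDX component properties (name/value pairs) and only inspects top-level components; components_with_hash_evidence is a hypothetical helper, not part of blint.

```python
EVIDENCE_PROPERTIES = (
    "internal:blintdb_matched_instruction_hash_count",
    "internal:blintdb_matched_assembly_hash_count",
)


def components_with_hash_evidence(bom):
    """Return (identifier, counts) for components with non-zero hash evidence."""
    matched = []
    for component in bom.get("components", []):
        counts = {}
        for prop in component.get("properties", []):
            if prop.get("name") not in EVIDENCE_PROPERTIES:
                continue
            try:
                counts[prop["name"]] = int(prop.get("value", 0))
            except (TypeError, ValueError):
                continue  # ignore values that are not integers
        if any(value > 0 for value in counts.values()):
            identifier = component.get("purl") or component.get("name")
            matched.append((identifier, counts))
    return matched
```

An empty result for a binary that was built into the corpus suggests that the database was generated without --disassemble or that blint did not pick it up.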

Workflow notes

The GitHub workflows under .github/workflows/ are designed for long-running dataset builds and install LLVM before running uv sync so nyxstone can be built safely for --disassemble on Linux and macOS runners.

Each workflow-generated ecosystem database is accompanied by a metadata sidecar JSON describing the build context, installed tool versions, selected projects, repository commits, and final table counts. ORAS publication includes both the SQLite database and the metadata JSON, and the Linux Hugging Face uploads publish the same pair of files.

Funding

This project is funded through NGI Zero Core, a fund established by NLnet with financial support from the European Commission's Next Generation Internet program. Learn more at the NLnet project page.


Citation

@misc{blint-db,
  author = {Team AppThreat},
  month = Mar,
  title = {{AppThreat blint-db}},
  howpublished = {{https://huggingface.co/datasets/AppThreat/blint-db}},
  year = {2026}
}
