security-kg

Convert security data from 16 sources into Subject-Predicate-Object (SPO) knowledge-graph triples in Parquet format.

Sources: ATT&CK · CAPEC · CWE · CVE · CPE · D3FEND · ATLAS · CAR · ENGAGE · EPSS · KEV · Vulnrichment · GHSA · Sigma · ExploitDB · MISP Galaxies

Data Flow

---
config:
  layout: dagre
  theme: neo
---
flowchart LR
    STIX["ATT&CK STIX JSON"]:::src --> CONV["convert.py"]:::conv
    CXML["CAPEC XML"]:::src --> CONV
    WXML["CWE XML"]:::src --> CONV
    CVEJ["CVE JSON 5.x"]:::src --> CONV
    CPEJ["CPE JSON"]:::src --> CONV
    D3FJ["D3FEND JSON-LD"]:::src --> CONV
    ATLY["ATLAS YAML"]:::src --> CONV
    CARY["CAR YAML"]:::src --> CONV
    ENGJ["ENGAGE JSON"]:::src --> CONV
    EPSC["EPSS CSV"]:::src --> CONV
    KEVJ["KEV JSON"]:::src --> CONV
    VULJ["Vulnrichment JSON"]:::src --> CONV
    GHSJ["GHSA JSON"]:::src --> CONV
    SIGY["Sigma YAML"]:::src --> CONV
    EDBC["ExploitDB CSV"]:::src --> CONV
    MSPJ["MISP Galaxy JSON"]:::src --> CONV

    CONV --> ATK["enterprise / mobile / ics / attack-all"]:::out --> CMB["combined.parquet"]:::conv
    CONV --> CAP["capec"]:::out --> CMB
    CONV --> CW["cwe"]:::out --> CMB
    CONV --> CVE["cve"]:::out --> CMB
    CONV --> CPE["cpe"]:::out --> CMB
    CONV --> D3F["d3fend"]:::out --> CMB
    CONV --> ATL["atlas"]:::out --> CMB
    CONV --> CAR["car"]:::out --> CMB
    CONV --> ENG["engage"]:::out --> CMB
    CONV --> EPS["epss"]:::out --> CMB
    CONV --> KEV["kev"]:::out --> CMB
    CONV --> VUL["vulnrichment"]:::out --> CMB
    CONV --> GHS["ghsa"]:::out --> CMB
    CONV --> SIG["sigma"]:::out --> CMB
    CONV --> EDB["exploitdb"]:::out --> CMB
    CONV --> MSG["misp_galaxy"]:::out --> CMB

    CMB --> HF["HuggingFace Hub"]:::hf

    classDef src fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef conv fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef out fill:#fef3c7,stroke:#f59e0b,color:#78350f
    classDef hf fill:#d1fae5,stroke:#10b981,color:#064e3b

Knowledge Graph Structure

---
config:
  layout: dagre
  theme: neo
---
graph LR
    %% ATT&CK core
    C[Campaign]:::attack -->|attributed-to| G[Group]:::attack
    C -->|uses| T[Technique]:::attack
    G -->|uses| T
    G -->|uses| SW[Malware / Tool]:::attack
    SW -->|uses| T
    ST[Sub-technique]:::attack -->|subtechnique-of| T
    T -->|belongs-to-tactic| TAC[Tactic]:::attack
    MIT[Mitigation]:::attack -->|mitigates| T
    DC[DataComponent]:::attack -->|detects| T

    %% Defense & detection → Technique
    DT[DefensiveTechnique]:::d3fend -->|counters| T
    AN[Analytic]:::car -->|detects-technique| T
    AN -->|maps-to-d3fend| DT
    EA[EngagementActivity]:::engage -->|engages-technique| T
    AT[ATLAS Technique]:::atlas -->|related-attack-technique| T

    %% MISP Galaxy → ATT&CK + threat context
    TA[ThreatActor]:::misp -->|related-attack-id| T
    TA -->|targets-country| CTR[Country]:::misp
    TA -->|targets-sector| SEC[Sector]:::misp

    %% CAPEC ↔ CWE bridge
    AP[Attack Pattern]:::capec -->|maps-to-technique| T
    AP -->|related-weakness| W[Weakness]:::cwe
    W -->|related-attack-pattern| AP

    %% Vulnerability chain
    V[Vulnerability]:::cve -->|related-weakness| W
    V -->|affects-cpe| P[Platform]:::cpe
    V -.->|epss-score| ES((EPSS)):::epss
    V -.->|kev| KE((KEV)):::kev

    classDef attack fill:#dbeafe,stroke:#3b82f6,color:#1e3a5f
    classDef capec fill:#fef3c7,stroke:#f59e0b,color:#78350f
    classDef cwe fill:#fce7f3,stroke:#ec4899,color:#831843
    classDef cve fill:#fee2e2,stroke:#ef4444,color:#7f1d1d
    classDef cpe fill:#e0e7ff,stroke:#6366f1,color:#312e81
    classDef d3fend fill:#d1fae5,stroke:#10b981,color:#064e3b
    classDef car fill:#fef9c3,stroke:#eab308,color:#713f12
    classDef engage fill:#ede9fe,stroke:#8b5cf6,color:#4c1d95
    classDef atlas fill:#cffafe,stroke:#06b6d4,color:#164e63
    classDef epss fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef kev fill:#f3f4f6,stroke:#6b7280,color:#374151
    classDef misp fill:#fdf2f8,stroke:#db2777,color:#831843

Legend: Blue = ATT&CK · Amber = CAPEC · Pink = CWE · Red = CVE · Indigo = CPE · Green = D3FEND · Cyan = ATLAS · Yellow = CAR · Violet = ENGAGE · Fuchsia = MISP Galaxies · Gray = EPSS / KEV

Usage

# Install dependencies
pip install -r requirements.txt

# Convert everything (all 16 sources) and produce combined.parquet
python src/convert.py

# Convert only ATT&CK
python src/convert.py --sources attack

# Convert a single ATT&CK domain
python src/convert.py --sources attack --domains enterprise

# Convert only CAPEC and CWE (skip others)
python src/convert.py --sources capec cwe

# Convert CVE, EPSS, and KEV together
python src/convert.py --sources cve epss kev

# Skip combined.parquet generation
python src/convert.py --no-combined

# Run individual converters standalone
python src/convert_attack.py
python src/convert_capec.py
python src/convert_cve.py
python src/convert_kev.py

# Use Parquet v1 format for backward compatibility (default is v2)
python src/convert.py --parquet-format v1

Source files are cached in source/ by default. Files are versioned using Last-Modified or ETag headers and only re-downloaded when the source has been updated. Sources that don't provide version headers are always re-downloaded.
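The freshness rule described above can be sketched as a small pure function. This is an illustration only, not the project's actual code: the name `needs_download`, the shape of the stored metadata (a dict keyed by lowercase header name), and the choice to treat any changed validator as "updated" are all assumptions.

```python
from typing import Mapping, Optional


def needs_download(
    cached: Optional[Mapping[str, str]],
    remote_headers: Mapping[str, str],
) -> bool:
    """Return True when the cached copy should be refreshed (hypothetical helper)."""
    # Nothing cached yet: always download.
    if cached is None:
        return True

    # HTTP header names are case-insensitive, so normalise them first.
    remote = {name.lower(): value for name, value in remote_headers.items()}
    validators = [
        (name, remote[name])
        for name in ("etag", "last-modified")
        if name in remote
    ]

    # No version headers at all: the source cannot be validated, so re-download.
    if not validators:
        return True

    # Any validator that differs from the stored value means the source changed.
    return any(cached.get(name) != value for name, value in validators)
```

In this sketch, the headers would typically come from a HEAD request against the source URL, and the matching values would be stored next to the cached file after each successful download.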

Output goes to output/:

| File | Source | Est. Triples |
|---|---|---|
| enterprise.parquet | ATT&CK Enterprise | ~40-50K |
| mobile.parquet | ATT&CK Mobile | ~5-7K |
| ics.parquet | ATT&CK ICS | ~4-5K |
| attack-all.parquet | ATT&CK combined (deduplicated) | ~50-60K |
| capec.parquet | CAPEC attack patterns | ~8-10K |
| cwe.parquet | CWE weaknesses | ~14-16K |
| cve.parquet | CVE vulnerabilities | ~3-4M |
| cpe.parquet | CPE platform enumeration | ~10-15M |
| d3fend.parquet | D3FEND defensive techniques | ~8-10K |
| atlas.parquet | ATLAS AI/ML techniques | ~1-2K |
| car.parquet | CAR analytics | ~1-2K |
| engage.parquet | ENGAGE adversary engagement | ~1-2K |
| epss.parquet | EPSS exploit prediction scores | ~600-700K |
| kev.parquet | KEV known exploited vulns | ~15-20K |
| vulnrichment.parquet | CISA Vulnrichment (SSVC, CVSS, CWE) | ~500K-1M |
| ghsa.parquet | GitHub Security Advisories | ~300-400K |
| sigma.parquet | Sigma detection rules | ~30-40K |
| exploitdb.parquet | ExploitDB public exploits | ~300-400K |
| misp_galaxy.parquet | MISP Galaxy clusters | ~100-200K |
| combined.parquet | All sources merged (deduplicated) | ~15-20M |
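As a rough illustration of what "merged (deduplicated)" means for SPO data, the sketch below combines several triple lists and drops exact duplicates. It is not the converter's implementation: `merge_triples` is a hypothetical helper, the triples are placeholder values, and the real pipeline operates on Parquet files rather than in-memory tuples.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def merge_triples(*sources: Iterable[Triple]) -> List[Triple]:
    """Concatenate triple collections, keeping the first copy of each triple."""
    seen = set()
    merged: List[Triple] = []
    for source in sources:
        for triple in source:
            # A triple is a duplicate only if all three parts match exactly.
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


# Placeholder data: the same triple appears in two per-domain outputs.
enterprise = [("T-EXAMPLE-1", "belongs-to-tactic", "TA-EXAMPLE-1")]
mobile = [
    ("T-EXAMPLE-1", "belongs-to-tactic", "TA-EXAMPLE-1"),
    ("T-EXAMPLE-2", "belongs-to-tactic", "TA-EXAMPLE-1"),
]
combined = merge_triples(enterprise, mobile)
print(len(combined))  # 2
```

This is why a merged file can hold fewer triples than the sum of its inputs: overlapping rows are counted once.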

Cross-Source Links

ATT&CK <──> CAPEC <──> CWE <──> CVE <──> CPE
  ^                              ^
  ├── D3FEND (counters)          ├── EPSS (scores)
  ├── ATLAS (AI parallel)        ├── KEV (exploited)
  ├── CAR (detects)              ├── Vulnrichment (SSVC/CVSS)
  ├── ENGAGE (engages)           ├── GHSA (advisories)
  ├── Sigma (detects)            ├── Sigma (related CVE)
  └── MISP Galaxies (cross-refs) └── ExploitDB (exploits)
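To make the CVE to CWE to CAPEC to ATT&CK chain concrete, here is a minimal sketch that follows a sequence of predicates across a set of triples. The predicate names are taken from the Knowledge Graph Structure diagram; the identifiers are placeholders rather than real records, the helper names are hypothetical, and the exact ID formats used in the published dataset are not assumed here.

```python
from collections import defaultdict

# Placeholder triples shaped like (subject, predicate, object).
triples = [
    ("CVE-EXAMPLE-1", "related-weakness", "CWE-EXAMPLE-1"),
    ("CWE-EXAMPLE-1", "related-attack-pattern", "CAPEC-EXAMPLE-1"),
    ("CAPEC-EXAMPLE-1", "maps-to-technique", "T-EXAMPLE-1"),
    ("CAPEC-EXAMPLE-1", "maps-to-technique", "T-EXAMPLE-2"),
    ("CVE-EXAMPLE-2", "affects-cpe", "CPE-EXAMPLE-1"),
]


def build_index(rows):
    """Index objects by (subject, predicate) for constant-time hops."""
    index = defaultdict(set)
    for subject, predicate, obj in rows:
        index[(subject, predicate)].add(obj)
    return index


def follow(index, start, predicates):
    """Follow each predicate in order, starting from a single node."""
    frontier = {start}
    for predicate in predicates:
        frontier = {
            obj
            for node in frontier
            for obj in index.get((node, predicate), set())
        }
    return frontier


index = build_index(triples)
CHAIN = ["related-weakness", "related-attack-pattern", "maps-to-technique"]
print(sorted(follow(index, "CVE-EXAMPLE-1", CHAIN)))
# ['T-EXAMPLE-1', 'T-EXAMPLE-2']
```

The same pattern applies to any path in the diagram by changing the predicate list; a start node with no matching edges yields an empty set.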

Tests

# Unit tests (no network access required)
python -m pytest tests/ -v --ignore=tests/test_integration.py

# Integration tests (download real ATT&CK data)
python -m pytest tests/test_integration.py -v

# All tests
python -m pytest tests/ -v

Visualizer

Explore the Parquet files interactively at security-kg-viz.

Cross-Source Analysis Notebook

The cross-source visualizations notebook demonstrates 16 analyses that are only possible because all 16 sources are merged into a single graph, including SSVC patch prioritization, defensive gap analysis, kill chain coverage, exploit weaponization timelines, and supply chain risk scoring.

pip install -e ".[viz]"
jupyter notebook tests/cross_source_visualizations.ipynb

HuggingFace Dataset

The dataset is published at s0u9ata/security-kg on HuggingFace Hub and auto-updated weekly via GitHub Actions.

See the dataset card for schema details, example queries, and usage with the datasets library.

Future Data Sources

The following sources were researched and evaluated for inclusion. They are deferred for now but may be added in future versions.

High-Value Deferred Sources

| Source | Format | Why Deferred |
|---|---|---|
| EUVD | JSON | EU vulnerability database, structured, CVE-linked. New (launched 2025), API still maturing. |
| OSV | JSON | Google's open-source vulnerability DB with bulk download. Focused on software packages rather than CVE-level vulnerabilities. |

International Sources Investigated

| Source | Country | Status |
|---|---|---|
| JVN iPedia | Japan | RSS feeds available, CVE-linked, bilingual (JP/EN). Limited bulk structured data access. |
| ThaiCERT | Thailand | 504 APT group threat cards, structured. Niche coverage, limited API. |
| CNNVD / CNVD | China | Access restrictions for non-Chinese IPs, data quality concerns, significant latency vs NVD. |
| KrCERT / KNVD | South Korea | Limited public API, Korean-language only. |
| BSI | Germany | Advisories available, German-language, no bulk structured feed. |
| ANSSI | France | Advisories and IOC reports, French-language, limited machine-readable data. |
| CERT-In | India | CVE CNA, publishes advisories but no bulk structured data download. |
| AusCERT | Australia | RSS feeds available, English-language. Limited structured data beyond advisories. |
| CERT-EU | EU | Threat landscape reports, limited machine-readable data. |
| BDU (FSTEC) | Russia | Poor data quality, slow updates, access restrictions. |

Specialized / Niche Sources

| Source | Why Not Included |
|---|---|
| MAEC | Malware attribute enumeration. Sparse community adoption, limited structured data available. |
| OVAL | Compliance-focused XML definitions. Very large, focused on system configuration rather than threat context. |
| CCE | Configuration enumeration (Excel format). Narrow scope, limited cross-linking potential. |

Source Licensing & Attribution

This project is licensed under Apache 2.0. The underlying source data is provided under various licenses as detailed below.

| Source | License | Attribution |
|---|---|---|
| ATT&CK | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CAPEC | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CWE | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CVE | Custom permissive (MITRE) | © The MITRE Corporation. CVE® is a registered trademark of The MITRE Corporation. |
| CPE / NVD | Public domain (NIST) | This product uses data from the NVD API but is not endorsed or certified by the NVD. |
| D3FEND | MIT License | © The MITRE Corporation. MITRE D3FEND™ is a trademark of The MITRE Corporation. |
| ATLAS | Apache 2.0 | © MITRE. |
| CAR | Apache 2.0 | © The MITRE Corporation. |
| ENGAGE | Apache 2.0 (GitHub repo) / Custom restrictive (website ToU) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. Note: the GitHub repo is licensed Apache 2.0, but the website terms restrict use to internal/non-commercial purposes. Clarification pending with MITRE. |
| EPSS | Custom permissive (FIRST) | Jacobs, Romanosky, Edwards, Roytman, Adjerid (2021), Exploit Prediction Scoring System, Digital Threats Research and Practice, 2(3). See first.org/epss. |
| KEV | Public domain (U.S. Gov) | Source: CISA Known Exploited Vulnerabilities Catalog. |
| Vulnrichment | CC0 1.0 Universal | Source: CISA Vulnrichment. |
| GHSA | CC BY 4.0 | Source: GitHub Advisory Database. Licensed under CC BY 4.0. |
| Sigma | Detection Rule License 1.1 | Source: SigmaHQ. Licensed under DRL 1.1. Rule author attribution is preserved in triples. |
| ExploitDB | GPLv2+ | Source: OffSec ExploitDB. Derived factual metadata (IDs, CVE mappings, dates) extracted under GPLv2+. |
| MISP Galaxies | CC0 1.0 / BSD 2-Clause | Source: MISP Project. Dual-licensed under CC0 1.0 and BSD 2-Clause. |

License

Apache 2.0 — see Source Licensing & Attribution for individual source terms.