---
language:
- en
license: apache-2.0
task_categories:
- graph-ml
tags:
- knowledge-graph
- cybersecurity
- mitre-attack
- capec
- cwe
- cve
- cpe
- d3fend
- atlas
- car
- engage
- epss
- kev
- vulnrichment
- ghsa
- sigma
- exploitdb
- misp-galaxy
- stix
- threat-intelligence
- triples
pretty_name: Security Knowledge Graph Triples (ATT&CK / CAPEC / CWE / CVE / CPE / D3FEND / ATLAS / CAR / ENGAGE / EPSS / KEV / Vulnrichment / GHSA / Sigma / ExploitDB / MISP Galaxies)
size_categories:
- 10M<n<100M
configs:
- config_name: enterprise
  data_files:
  - split: train
    path: data/enterprise.parquet
  default: true
- config_name: mobile
  data_files:
  - split: train
    path: data/mobile.parquet
- config_name: ics
  data_files:
  - split: train
    path: data/ics.parquet
- config_name: attack-all
  data_files:
  - split: train
    path: data/attack-all.parquet
- config_name: capec
  data_files:
  - split: train
    path: data/capec.parquet
- config_name: cwe
  data_files:
  - split: train
    path: data/cwe.parquet
- config_name: cve
  data_files:
  - split: train
    path: data/cve.parquet
- config_name: cpe
  data_files:
  - split: train
    path: data/cpe.parquet
- config_name: d3fend
  data_files:
  - split: train
    path: data/d3fend.parquet
- config_name: atlas
  data_files:
  - split: train
    path: data/atlas.parquet
- config_name: car
  data_files:
  - split: train
    path: data/car.parquet
- config_name: engage
  data_files:
  - split: train
    path: data/engage.parquet
- config_name: epss
  data_files:
  - split: train
    path: data/epss.parquet
- config_name: kev
  data_files:
  - split: train
    path: data/kev.parquet
- config_name: vulnrichment
  data_files:
  - split: train
    path: data/vulnrichment.parquet
- config_name: ghsa
  data_files:
  - split: train
    path: data/ghsa.parquet
- config_name: sigma
  data_files:
  - split: train
    path: data/sigma.parquet
- config_name: exploitdb
  data_files:
  - split: train
    path: data/exploitdb.parquet
- config_name: misp_galaxy
  data_files:
  - split: train
    path: data/misp_galaxy.parquet
- config_name: combined
  data_files:
  - split: train
    path: data/combined.parquet
dataset_info:
  features:
  - name: subject
    dtype: string
  - name: predicate
    dtype: string
  - name: object
    dtype: string
---

Security Knowledge Graph Triples

Security data from 16 sources represented as Subject-Predicate-Object (SPO) triples in Parquet format, ready for knowledge-graph construction, graph-ML, RAG pipelines, and threat-intelligence analysis.

Sources: ATT&CK · CAPEC · CWE · CVE · CPE · D3FEND · ATLAS · CAR · ENGAGE · EPSS · KEV · Vulnrichment · GHSA · Sigma · ExploitDB · MISP Galaxies

Last updated: 2026-04-06T13:23:46Z

Quick Start

from datasets import load_dataset

ds = load_dataset("s0u9ata/security-kg", "enterprise")
print(ds["train"][0])
# {'subject': 'T1059.001', 'predicate': 'rdf:type', 'object': 'Technique'}

Configurations

| Config | Description | Est. Triples | Status |
| --- | --- | --- | --- |
| enterprise (default) | Enterprise ATT&CK | 42,041 | Current |
| mobile | Mobile ATT&CK | 5,307 | Current |
| ics | ICS ATT&CK | 3,756 | Current |
| attack-all | ATT&CK combined (deduplicated) | 49,622 | Current |
| capec | CAPEC attack patterns | 8,114 | Current |
| cwe | CWE weaknesses | 14,565 | Current |
| cve | CVE vulnerabilities | 3,546,666 | Current |
| cpe | CPE platform enumeration | 12,399,534 | Current |
| d3fend | D3FEND defensive techniques | 8,154 | Current |
| atlas | ATLAS AI/ML techniques | 1,420 | Current |
| car | CAR analytics | 1,617 | Current |
| engage | ENGAGE adversary engagement | 1,464 | Current |
| epss | EPSS exploit prediction scores | 649,788 | Current |
| kev | KEV known exploited vulns | 17,054 | Current |
| vulnrichment | CISA Vulnrichment (SSVC, CVSS, CWE enrichment) | 656,237 | Current |
| ghsa | GitHub Security Advisories | 327,142 | Current |
| sigma | Sigma detection rules | 32,750 | Current |
| exploitdb | ExploitDB public exploits | 346,303 | Current |
| misp_galaxy | MISP Galaxy threat intelligence clusters | 177,294 | Current |
| combined | All sources merged (deduplicated) | 18,237,724 | Current |

Knowledge Graph Structure

     Group  Campaign
        \      /
          uses
           |
           v
      TECHNIQUE -----> Tactic
        ^  ^  ^
        |  |  |
        |  |  +-- D3FEND (counters)
        |  |  +-- CAR (detects)
        |  |  +-- Sigma (detects)
        |  |  +-- ENGAGE (engages)
        |  |  +-- ATLAS (related)
        |  |  +-- MISP Galaxies (cross-refs)
        |  |
        |  +-- Mitigation (mitigates)
        |  +-- DataComponent (detects)
        |
        +-- maps-to -- CAPEC
                         |
                  related-weakness
                         |
                         v
                        CWE
                         ^
                         |
                  related-weakness
                         |
                        CVE ----> CPE
                         ^
                         |
                   EPSS (score)
                   KEV (exploited)
                   GHSA (advisory)
                   Vulnrichment (SSVC)
                   ExploitDB (exploit)
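
The diagram can be walked one hop at a time. The following is a minimal, stdlib-only sketch of the technique ← CAPEC → CWE ← CVE path. The sample triples and the helper names `index_triples` and `cves_for_technique` are hypothetical and only illustrate the shape of the data; real rows would come from the combined config.

```python
from collections import defaultdict

# Illustrative triples only (assumption); real rows come from the "combined" config.
TRIPLES = [
    ("CAPEC-66", "maps-to-technique", "T1190"),
    ("CAPEC-66", "related-weakness", "CWE-89"),
    ("CVE-2024-1234", "related-weakness", "CWE-89"),
    ("CVE-2024-1234", "affects-cpe", "cpe:2.3:a:apache:httpd:*"),
    ("CWE-79", "rdf:type", "Weakness"),
]

def index_triples(triples):
    """Build (subject, predicate) -> objects and (object, predicate) -> subjects indexes."""
    forward, reverse = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        forward[(s, p)].add(o)
        reverse[(o, p)].add(s)
    return forward, reverse

def cves_for_technique(technique_id, forward, reverse):
    """Walk technique <- CAPEC -> CWE <- CVE and return the reachable CVE IDs."""
    cves = set()
    for capec in reverse.get((technique_id, "maps-to-technique"), ()):
        for cwe in forward.get((capec, "related-weakness"), ()):
            # related-weakness is emitted by several sources, so keep only CVE subjects
            cves.update(
                s for s in reverse.get((cwe, "related-weakness"), ())
                if s.startswith("CVE-")
            )
    return cves

forward, reverse = index_triples(TRIPLES)
print(sorted(cves_for_technique("T1190", forward, reverse)))
```

Indexing in both directions matters here because the path mixes outgoing edges (CAPEC to CWE) with incoming ones (CVE to CWE).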

Schema

Each row is a single triple with three string columns:

| Column | Description | Examples |
| --- | --- | --- |
| subject | Entity ID | T1059.001, G0016, CAPEC-66, CWE-79, CVE-2024-1234, cpe:2.3:a:apache:httpd:*, D3-FE, AML.T0000, CAR-2024-01-001, EAC0001, GHSA-xxxx-yyyy-zzzz, EDB-16929 |
| predicate | Property name or relationship type | rdf:type, name, uses, mitigates, epss-score, counters, ssvc-exploitation, exploits-cve, detects-technique |
| object | Value or target entity ID | Technique, PowerShell, T1059, CWE-89, 0.97500, SecurityAdvisory, SigmaRule, Exploit |
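
Because every fact is a separate row, a common first step is to regroup triples into one record per entity. The sketch below shows one way to do that with the standard library; the sample rows and the `to_records` helper are hypothetical, not part of the dataset or its converter.

```python
from collections import defaultdict

# Illustrative rows in the (subject, predicate, object) shape described above.
ROWS = [
    ("T1059.001", "rdf:type", "Technique"),
    ("T1059.001", "name", "PowerShell"),
    ("T1059.001", "platform", "Windows"),
    ("T1059.001", "subtechnique-of", "T1059"),
    ("G0016", "uses", "T1059.001"),
]

def to_records(rows):
    """Group triples by subject; each predicate maps to a list because predicates can repeat."""
    records = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in rows:
        records[subject][predicate].append(obj)
    # Convert the nested defaultdicts to plain dicts so missing keys raise KeyError
    return {s: dict(props) for s, props in records.items()}

records = to_records(ROWS)
print(records["T1059.001"]["name"])
print(records["G0016"])
```

With pandas, `df.itertuples(index=False, name=None)` yields plain tuples that can be passed to `to_records` directly.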

Predicate Reference

ATT&CK Entity Properties

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Entity type | Technique, Group, Malware, Tool, Tactic, Mitigation, Campaign, DataSource, DataComponent |
| name | Display name | PowerShell |
| description | Full description text | Adversaries may abuse PowerShell... |
| platform | Applicable platform | Windows, Linux, macOS |
| domain | ATT&CK domain | enterprise-attack |
| alias | Alternative name | Cozy Bear |
| is-subtechnique | Whether entity is a sub-technique | True, False |
| belongs-to-tactic | Tactic ATT&CK ID | TA0002 |
| shortname | Tactic shortname | credential-access |
| url | ATT&CK website URL | https://attack.mitre.org/techniques/T1059/001 |
| created / modified | Timestamps | 2020-01-14 17:18:32... |

ATT&CK Relationship Predicates

| Predicate | Typical subject / object | Example |
| --- | --- | --- |
| uses | Group/Campaign/Software / Technique | G0016 / T1059.001 |
| mitigates | Mitigation / Technique | M1049 / T1059.001 |
| subtechnique-of | Sub-technique / Parent technique | T1059.001 / T1059 |
| detects | DataComponent / Technique | DC0001 / T1059.001 |
| attributed-to | Campaign / Group | C0018 / G0016 |

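Relationship predicates compose. For example, `uses` can be combined with `subtechnique-of` to roll group usage up to parent techniques. This is a minimal sketch under stated assumptions: the triples are illustrative (the G0007 rows are placeholders, not asserted facts about that group) and `groups_per_parent` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative triples; the G0007 rows are placeholders for demonstration only.
TRIPLES = [
    ("G0016", "uses", "T1059.001"),
    ("G0007", "uses", "T1059.003"),
    ("G0007", "uses", "T1059"),
    ("T1059.001", "subtechnique-of", "T1059"),
    ("T1059.003", "subtechnique-of", "T1059"),
]

def groups_per_parent(triples):
    """Map each parent technique to the sorted groups using it or any of its sub-techniques."""
    parent = {s: o for s, p, o in triples if p == "subtechnique-of"}
    usage = defaultdict(set)
    for s, p, o in triples:
        # Group IDs start with "G"; campaigns and software also emit "uses" triples
        if p == "uses" and s.startswith("G"):
            usage[parent.get(o, o)].add(s)
    return {tech: sorted(groups) for tech, groups in usage.items()}

print(groups_per_parent(TRIPLES))
```
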
CAPEC Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | AttackPattern | AttackPattern |
| name / description | Display name / full text | SQL Injection |
| abstraction / status | Level / status | Standard, Stable |
| likelihood / severity | Attack likelihood / severity | High |
| child-of | Parent attack pattern | CAPEC-248 |
| related-weakness | Related CWE | CWE-89 |
| maps-to-technique | Mapped ATT&CK technique | T1190 |

CWE Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Weakness | Weakness |
| name / description | Display name / full text | Cross-site Scripting (XSS) |
| abstraction / status | Level / status | Base, Stable |
| likelihood-of-exploit | Exploitation likelihood | High |
| child-of | Parent weakness | CWE-74 |
| related-attack-pattern | Related CAPEC | CAPEC-86 |
| platform | Applicable platform | JavaScript |
| consequence-scope / consequence-impact | Impact | Confidentiality, Read Data |
| introduction-phase | Introduction phase | Implementation |

CVE Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Vulnerability | Vulnerability |
| state | CVE state | PUBLISHED |
| description | English description | A remote code execution... |
| date-published / date-updated | Timestamps | 2024-01-15T00:00:00.000Z |
| assigner | Assigning organization | microsoft |
| vendor / product | Affected vendor/product | Microsoft, Windows |
| affects-cpe | Affected CPE string | cpe:2.3:o:microsoft:windows_10:* |
| platform | Affected platform | x64 |
| related-weakness | Related CWE | CWE-79 |
| cvss-base-score / cvss-severity | CVSS metrics | 9.8, CRITICAL |

CPE Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Platform | Platform |
| part | CPE part type | application, operating_system, hardware |
| vendor / product / version | Components | apache, httpd, 2.4.51 |
| title | English display name | Apache HTTP Server 2.4.51 |
| created / modified | Timestamps | 2021-10-07 |

D3FEND Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | DefensiveTechnique or OffensiveTechnique | DefensiveTechnique |
| name / definition | Display name / definition | File Encryption |
| synonym | Alternative name | Disk Encryption |
| child-of | Parent technique | PlatformHardening |
| counters | Countered offensive technique | T1059 |

ATLAS Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Tactic, Technique, CaseStudy, Mitigation | Technique |
| name / description | Display name / full text | ML Supply Chain Compromise |
| maturity | Technique maturity | Reviewed |
| belongs-to-tactic | Parent tactic | AML.TA0001 |
| subtechnique-of | Parent technique | AML.T0000 |
| related-attack-technique | Linked ATT&CK technique | T1195 |
| related-attack-tactic | Linked ATT&CK tactic | TA0001 |
| uses-technique | Case study technique | AML.T0000 |
| mitigates | Mitigated technique | AML.T0000 |

CAR Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Analytic | Analytic |
| title / description | Analytic name / full text | Suspicious PowerShell Commands |
| platform | Applicable platform | Windows |
| information-domain | Information domain | Host |
| analytic-type | Type of analytic | Situational Awareness |
| detects-technique | Detected ATT&CK technique | T1059 |
| detects-subtechnique | Detected sub-technique | T1059.001 |
| covers-tactic | Covered ATT&CK tactic | Execution |
| maps-to-d3fend | Linked D3FEND technique | D3-PSA |

ENGAGE Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | EngagementActivity or AdversaryVulnerability | EngagementActivity |
| name / description | Display name / full text | Software Manipulation |
| engages-technique | Engaged ATT&CK technique | T1001 |
| vulnerability-of | ATT&CK technique this adversary vulnerability applies to | T1001 |
| addresses-vulnerability | Addressed adversary vulnerability | EAV0001 |

EPSS Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| epss-score | Exploit probability (0-1) | 0.97500 |
| epss-percentile | Score percentile (0-1) | 0.99900 |

KEV Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | KnownExploitedVulnerability | KnownExploitedVulnerability |
| kev-vendor / kev-product | Affected vendor/product | Microsoft, Windows |
| kev-name / kev-description | Vulnerability name/description | Windows Privilege Escalation |
| kev-date-added / kev-due-date | Dates | 2024-01-15 |
| kev-required-action | Required remediation action | Apply updates per vendor instructions. |
| kev-ransomware-use | Ransomware campaign use | Known, Unknown |
| related-weakness | Related CWE | CWE-269 |

Vulnrichment Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| ssvc-exploitation | SSVC exploitation status | active, poc, none |
| ssvc-automatable | Whether exploitation is automatable | yes, no |
| ssvc-technical-impact | Technical impact level | total, partial |
| adp-cvss-base-score | CISA-analyzed CVSS base score | 9.8 |
| adp-cvss-severity | CISA-analyzed CVSS severity | CRITICAL |
| adp-related-weakness | CISA-assigned CWE | CWE-79 |
| adp-affects-cpe | CISA-assigned CPE | cpe:2.3:o:microsoft:windows_10:* |

GHSA Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | SecurityAdvisory | SecurityAdvisory |
| summary | Advisory summary | XSS vulnerability in example-package |
| date-published / date-modified | Timestamps | 2024-01-15T00:00:00Z |
| severity | Severity level | HIGH, MODERATE, LOW, CRITICAL |
| related-cve | Associated CVE | CVE-2024-1234 |
| related-weakness | Associated CWE | CWE-79 |
| cvss-vector | CVSS v3 vector string | CVSS:3.1/AV:N/AC:L/... |
| affects-package | Affected package (ecosystem/name) | npm/example-package |
| fixed-in | Fixed version for package (ecosystem/name@version) | npm/example-package@2.0.1 |
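
The `fixed-in` value packs three fields into one string, so it usually needs splitting before analysis. The sketch below is one hedged way to do it; `parse_fixed_in` is a hypothetical helper, and it assumes the first `/` ends the ecosystem and the last `@` starts the version. Whether scoped package names appear in the data in exactly this form is an assumption.

```python
def parse_fixed_in(value):
    """Split 'ecosystem/name@version' into (ecosystem, name, version).

    Assumes the first '/' ends the ecosystem and the last '@' starts the
    version, so scoped names such as '@scope/pkg' survive intact.
    """
    ecosystem, sep, rest = value.partition("/")
    if not sep:
        raise ValueError(f"missing ecosystem separator: {value!r}")
    name, sep, version = rest.rpartition("@")
    if not sep or not name:
        raise ValueError(f"missing version separator: {value!r}")
    return ecosystem, name, version

print(parse_fixed_in("npm/example-package@2.0.1"))
print(parse_fixed_in("npm/@scope/pkg@1.4.0"))
```

Splitting from the right for the version is the design choice that keeps a leading `@` in a package name from being mistaken for the version separator.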

Sigma Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | SigmaRule | SigmaRule |
| title / description | Rule name / full text | Suspicious PowerShell Download |
| status | Rule maturity | stable, test, experimental |
| level | Detection severity | critical, high, medium, low, informational |
| author / date | Rule author / creation date | Security Researcher, 2024-01-15 |
| logsource-category | Log source category | process_creation, network_connection |
| logsource-product | Log source product | windows, linux |
| logsource-service | Log source service | sshd, sysmon |
| detects-technique | Detected ATT&CK technique | T1059.001 |
| related-cve | Related CVE | CVE-2024-1234 |

ExploitDB Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Exploit | Exploit |
| description | Exploit description | Apache HTTP Server RCE |
| date-published | Publication date | 2024-01-15 |
| author | Exploit author | Metasploit |
| exploit-type | Exploit category | remote, local, dos, webapps |
| platform | Target platform | linux, windows, aix |
| verified | Verified by OffSec | True |
| exploits-cve | Exploited CVE | CVE-2024-1234 |

MISP Galaxy Predicates

| Predicate | Description | Example object value |
| --- | --- | --- |
| rdf:type | Galaxy entity type | ThreatActor, Ransomware, Botnet, RAT |
| name | Display name | APT1 |
| description | Full description | (text) |
| galaxy | Galaxy cluster type | threat-actor, ransomware |
| synonym | Alternative name | Comment Crew |
| country | Country code (ISO 3166-1) | CN |
| cfr-suspected-state-sponsor | Suspected state sponsor | China |
| targets-country | Targeted country | United States |
| targets-sector | Targeted sector | Government |
| attribution-confidence | Confidence level | 50 |
| similar-to | Similar/duplicate entity | misp:<uuid> |
| uses | Uses technique/tool | misp:<uuid> |
| used-by | Used by actor | misp:<uuid> |
| variant-of | Variant relationship | misp:<uuid> |
| targets | Targets entity | misp:<uuid> |
| attributed-to | Attributed to entity | misp:<uuid> |
| misp-related | Generic relationship | misp:<uuid> |
| related-attack-id | Cross-link to ATT&CK | T1059.001, G0006 |

Dataset Creation

Source Data

| Source | Feed | Format |
| --- | --- | --- |
| ATT&CK | mitre-attack/attack-stix-data | STIX 2.1 JSON |
| CAPEC | capec_latest.xml | XML |
| CWE | cwec_latest.xml.zip | XML (ZIP) |
| CVE | CVEProject/cvelistV5 | JSON 5.x (ZIP) |
| CPE | nvdcpe-2.0.tar.gz | JSON (tar.gz) |
| D3FEND | d3fend.json | JSON-LD |
| ATLAS | ATLAS.yaml | YAML |
| CAR | mitre-attack/car | YAML (ZIP) |
| ENGAGE | attack_mapping.json | JSON |
| EPSS | epss_scores-current.csv.gz | CSV (gzip) |
| KEV | known_exploited_vulnerabilities.json | JSON |
| Vulnrichment | cisagov/vulnrichment | JSON 5.x (ZIP) |
| GHSA | github/advisory-database | OSV JSON (ZIP) |
| Sigma | SigmaHQ/sigma | YAML (ZIP) |
| ExploitDB | files_exploits.csv | CSV |
| MISP Galaxies | MISP/misp-galaxy | JSON (ZIP) |

Conversion Pipeline

The converter downloads source data, extracts entity property triples and relationship triples, and writes them as Parquet files. The source code and full documentation are at:

github.com/S0UGATA/security-kg

To regenerate or update this dataset:

git clone https://github.com/S0UGATA/security-kg.git
cd security-kg
pip install -r requirements.txt
python src/convert.py

This produces fresh Parquet files in output/ from the latest data across all 16 sources.

Visualizer

Explore the Parquet files interactively at security-kg-viz.

Use Cases

  • Knowledge Graph Construction: Load triples into Neo4j, RDFLib, or NetworkX for graph queries
  • Graph ML: Train graph neural networks (GNNs) on security data structure for link prediction
  • RAG / LLM Grounding: Use triples as structured context for retrieval-augmented generation
  • Threat Intelligence: Query relationships between groups, techniques, vulnerabilities, and mitigations
  • Vulnerability Prioritization: Combine SSVC, EPSS, KEV, and ExploitDB data for risk-based triage
  • Defensive Gap Analysis: Find heavily-used ATT&CK techniques with insufficient detection coverage
  • Supply Chain Risk: Score open-source packages by linking GHSA advisories to CVE/EPSS/KEV enrichment
  • Security Automation: Programmatically map detections to techniques to tactics
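
As an illustration of the security-automation use case, the sketch below maps detections to techniques and then to tactics using the predicates documented above (`detects-technique`, `detects-subtechnique`, `belongs-to-tactic`, `name`). The triples are illustrative, the subject `sigma:example-rule` is a made-up placeholder rather than a real rule ID format, and `detections_by_tactic` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative triples; "sigma:example-rule" is a made-up placeholder ID.
TRIPLES = [
    ("sigma:example-rule", "rdf:type", "SigmaRule"),
    ("sigma:example-rule", "detects-technique", "T1059.001"),
    ("CAR-2024-01-001", "detects-subtechnique", "T1059.001"),
    ("T1059.001", "belongs-to-tactic", "TA0002"),
    ("TA0002", "name", "Execution"),
]

DETECTION_PREDICATES = ("detects-technique", "detects-subtechnique")

def detections_by_tactic(triples):
    """Return {tactic name or ID: sorted detection IDs} via detection -> technique -> tactic."""
    tactics_of = defaultdict(set)
    names = {}
    for s, p, o in triples:
        if p == "belongs-to-tactic":
            tactics_of[s].add(o)
        elif p == "name":
            names[s] = o  # if a subject has several names, the last one wins
    coverage = defaultdict(set)
    for s, p, o in triples:
        if p in DETECTION_PREDICATES:
            for tactic in tactics_of.get(o, ()):
                # Fall back to the tactic ID when no name triple is present
                coverage[names.get(tactic, tactic)].add(s)
    return {tactic: sorted(ids) for tactic, ids in coverage.items()}

print(detections_by_tactic(TRIPLES))
```

Tactics that never appear as a key in the result have no mapped detections in the input, which is a quick way to spot coverage gaps.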

Cross-Source Analysis Notebook

The repository includes a Jupyter notebook with 16 cross-source analyses and visualizations built on combined.parquet — covering SSVC patch prioritization, defensive gap analysis, kill chain tactic coverage, exploit weaponization timelines, ransomware CWE pipelines, supply chain package risk, and more.

Example Queries

SSVC Patch Prioritization (Vulnrichment + EPSS + KEV)

import pandas as pd
from datasets import load_dataset

# Load combined graph for cross-source queries
ds = load_dataset("s0u9ata/security-kg", "combined")
df = ds["train"].to_pandas()

# Build SSVC triage matrix: exploitation status × automatable × EPSS score
ssvc = df[df.predicate == "ssvc-exploitation"][["subject", "object"]].rename(columns={"object": "exploitation"})
auto = df[df.predicate == "ssvc-automatable"][["subject", "object"]].rename(columns={"object": "automatable"})
epss = df[df.predicate == "epss-score"][["subject", "object"]].copy()
epss["epss"] = epss.object.astype(float)

triage = ssvc.merge(auto, on="subject").merge(epss[["subject", "epss"]], on="subject")

# Highest priority: actively exploited + automatable + high EPSS
critical = triage[(triage.exploitation == "active") & (triage.automatable == "yes") & (triage.epss > 0.9)]
print(f"Immediate action: {len(critical)} CVEs")

Defensive Gap Analysis (ATT&CK + Sigma + D3FEND + CAR)

# Find ATT&CK techniques heavily used by APT groups but poorly covered by detections
uses = df[(df.predicate == "uses") & df.subject.str.startswith("G")]
group_usage = uses.groupby("object").subject.nunique().rename("groups_using")

# Count detections (Sigma + CAR, via detects-technique) and D3FEND countermeasures per technique
sigma = df[df.predicate == "detects-technique"].groupby("object").subject.nunique().rename("detections")
# D3FEND links to ATT&CK through the "counters" predicate (see D3FEND Predicates)
d3fend = df[df.predicate == "counters"].groupby("object").subject.nunique().rename("defenses")

coverage = pd.DataFrame(group_usage).join(sigma).join(d3fend).fillna(0)
gaps = coverage[(coverage.groups_using > 10) & (coverage.detections < 5)]
print(f"High-usage, low-detection techniques: {len(gaps)}")

Supply Chain Risk (GHSA + CVE + EPSS + KEV + ExploitDB)

# Score open-source packages by aggregating risk from linked CVEs
ghsa_cve = df[df.predicate == "related-cve"][["subject", "object"]].rename(columns={"subject": "ghsa", "object": "cve"})
packages = df[df.predicate == "affects-package"][["subject", "object"]].rename(columns={"subject": "ghsa", "object": "pkg"})

epss_scores = df[df.predicate == "epss-score"][["subject", "object"]].copy()
epss_scores["epss"] = epss_scores.object.astype(float)

kev_cves = set(df[(df.predicate == "rdf:type") & (df.object == "KnownExploitedVulnerability")].subject)
exploit_cves = set(df[df.predicate == "exploits-cve"].object)

# Join package → GHSA → CVE → enrichment
risk = packages.merge(ghsa_cve, on="ghsa").merge(epss_scores[["subject", "epss"]], left_on="cve", right_on="subject")
risk["in_kev"] = risk.cve.isin(kev_cves)
risk["has_exploit"] = risk.cve.isin(exploit_cves)
risk["ecosystem"] = risk.pkg.str.split("/").str[0]

# Top ecosystems by high-risk CVE count
high_risk = risk[(risk.epss > 0.5) | risk.in_kev | risk.has_exploit]
print(high_risk.groupby("ecosystem").cve.nunique().sort_values(ascending=False).head(10))

CAPEC → CWE → CVE (Attack Pattern Chain)

capec = load_dataset("s0u9ata/security-kg", "capec")["train"].to_pandas()
cve = load_dataset("s0u9ata/security-kg", "cve")["train"].to_pandas()

# Find CWEs related to SQL Injection (CAPEC-66)
cwe_ids = capec[(capec.subject == "CAPEC-66") & (capec.predicate == "related-weakness")].object.tolist()

# Find CVEs with those CWEs
for cwe_id in cwe_ids:
    related_cves = cve[(cve.predicate == "related-weakness") & (cve.object == cwe_id)].subject.unique()
    print(f"{cwe_id}: {len(related_cves)} CVEs")

D3FEND (Defensive Taxonomy)

ds = load_dataset("s0u9ata/security-kg", "d3fend")
df = ds["train"].to_pandas()

# All 497 defensive techniques in the D3FEND taxonomy
defenses = df[(df.predicate == "rdf:type") & (df.object == "DefensiveTechnique")]
print(f"Defensive techniques: {len(defenses)}")

# Find children of a category (e.g., all techniques under Network Traffic Analysis)
children = df[(df.predicate == "child-of") & (df.object == "NetworkTrafficAnalysis")].subject.tolist()

# Get their names
names = df[df.predicate == "name"][["subject", "object"]]
print(names[names.subject.isin(children)].to_string(index=False))

Source Licensing & Attribution

This dataset is published under the Apache 2.0 license. The underlying source data is provided under various licenses as detailed below. By using this dataset, you agree to comply with each source's respective terms.

| Source | License | Attribution |
| --- | --- | --- |
| ATT&CK | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CAPEC | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CWE | Custom royalty-free (MITRE) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. |
| CVE | Custom permissive (MITRE) | © The MITRE Corporation. CVE® is a registered trademark of The MITRE Corporation. |
| CPE / NVD | Public domain (NIST) | This product uses data from the NVD API but is not endorsed or certified by the NVD. |
| D3FEND | MIT License | © The MITRE Corporation. MITRE D3FEND™ is a trademark of The MITRE Corporation. |
| ATLAS | Apache 2.0 | © MITRE. |
| CAR | Apache 2.0 | © The MITRE Corporation. |
| ENGAGE | Apache 2.0 (GitHub repo) / Custom restrictive (website ToU) | © The MITRE Corporation. Reproduced and distributed with the permission of The MITRE Corporation. Note: the GitHub repo is licensed Apache 2.0, but the website terms restrict use to internal/non-commercial purposes. Clarification pending with MITRE. |
| EPSS | Custom permissive (FIRST) | Jacobs, Romanosky, Edwards, Roytman, Adjerid (2021), Exploit Prediction Scoring System, Digital Threats Research and Practice, 2(3). See first.org/epss. |
| KEV | Public domain (U.S. Gov) | Source: CISA Known Exploited Vulnerabilities Catalog. |
| Vulnrichment | CC0 1.0 Universal | Source: CISA Vulnrichment. |
| GHSA | CC BY 4.0 | Source: GitHub Advisory Database. Licensed under CC BY 4.0. |
| Sigma | Detection Rule License 1.1 | Source: SigmaHQ. Licensed under DRL 1.1. Rule author attribution is preserved in triples. |
| ExploitDB | GPLv2+ | Source: OffSec ExploitDB. Derived factual metadata (IDs, CVE mappings, dates) extracted under GPLv2+. |
| MISP Galaxies | CC0 1.0 / BSD 2-Clause | Source: MISP Project. Dual-licensed under CC0 1.0 and BSD 2-Clause. |

License

Apache 2.0 — see Source Licensing & Attribution for individual source terms.