Ghostlight is a Data Security Posture Management (DSPM) tool that scans cloud storage, VMs, databases, and Git repositories to detect sensitive data and classify findings against GDPR, HIPAA, and PCI-DSS, and to flag exposed secrets.
See the system architecture and data flow in readmes/architecture.md.
## Features
- PII: Email, Phone, SSN, Aadhaar, PAN, Passport, Driver License, IBAN, IP addresses, Coordinates
- PHI: Medical Record Numbers, NPI, Medicare ID, Medical Records
- PCI: Credit Card Numbers
- Secrets: AWS keys, Google API keys, GitHub tokens, Slack tokens, JWT, PEM keys, database connection strings, API keys, passwords
- Intellectual Property: Private keys, JWT tokens, API paths
- Multi-framework compliance classification (GDPR, HIPAA, PCI-DSS)
- Sensitivity scoring based on pattern weights
- Exposure risk assessment (encryption, public access, versioning)
- Combined risk scoring with actionable factors
- Text files: .txt, .md, .csv, .log, .json, .yaml, .xml, .html
- Code files: .js, .py, .java, .cpp, .go, .rs, .sh, .sql
- Documents: .pdf (layout-aware), .docx, .xlsx, .csv
- Archives: .zip, .tar/.gz/.tgz/.bz2, .7z (sampled contents)
- Images OCR: .png, .jpg/.jpeg, .bmp, .tiff, .webp (requires tesseract)
- Binary detection and automatic skipping
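Binary detection is commonly done by sampling the start of a file and checking for NUL bytes or a high share of control bytes. The sketch below illustrates that general heuristic; it is not Ghostlight's actual implementation, and the name `looks_binary` and the 30% threshold are assumptions made for this example.

```python
# Hypothetical helper, not part of Ghostlight's API.
# Bytes considered "text": common control characters, printable ASCII,
# and high bytes (so UTF-8 encoded text is not flagged).
TEXT_BYTES = bytes({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x7F)) | set(range(0x80, 0x100)))


def looks_binary(sample: bytes, threshold: float = 0.30) -> bool:
    """Return True if a byte sample is probably binary content."""
    if not sample:
        return False  # treat empty files as text
    if b"\x00" in sample:
        return True  # NUL bytes almost never appear in text files
    # Delete every text byte; whatever remains is non-text.
    non_text = sample.translate(None, TEXT_BYTES)
    return len(non_text) / len(sample) > threshold


if __name__ == "__main__":
    print(looks_binary(b"plain text\n"))
    print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00"))
```

A scanner built this way would read only the first few kilobytes of each file before deciding whether to skip it.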
- S3 bucket public access analysis
- Encryption status (SSE, KMS)
- Versioning and logging configuration
- ACL and policy evaluation
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m ghostlight --help
python -m pip install -e .
ghostlight --help
- Connector setup index: readmes/INDEX.md
- JSON output guide: output_json.md
Per-connector setup guides (alphabetical):
- Amazon S3
- AWS Aggregate (RDS+S3+EC2)
- Azure Blob Storage
- Confluence
- CouchDB
- EC2
- Filesystem
- Firebase Firestore
- Git
- Google Cloud Storage
- Google Drive
- Google Drive Workspace
- Jira
- MongoDB
- MySQL
- PostgreSQL
- RDS
- Redis
- Slack
- Text
- Virtual Machines over SSH
docker pull ayush1136/ghostlight
Run (basic):
docker run --rm \
ayush1136/ghostlight \
--help
Run a scan and write results to host (recommended):
mkdir -p ./scan_result
docker run --rm \
-v "$(pwd)/scan_result:/out" \
ayush1136/ghostlight \
scan --scanner fs --target /app \
--format json --output /out/fs.json
source .venv/bin/activate
# Filesystem (single file or directory)
ghostlight scan --scanner fs --target ./myfile.txt --format table
ghostlight scan --scanner fs --target /path/to/dir --format json --output results.json
# Git repository (local)
ghostlight scan --scanner git --target /path/to/repo --format md --output report.md
# Git repository (public remote)
ghostlight scan --scanner git --target https://github.com/user/repo.git --format json
# Git repository (private - GitHub)
export GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://github.com/user/private-repo.git
# Git repository (private - SSH)
ghostlight scan --scanner git --target git@github.com:user/private-repo.git
# Virtual Machine (remote via SSH - supports recursive directory scanning)
ghostlight scan --scanner vm --target "user@hostname:/path/to/scan" --format table
ghostlight scan --scanner vm --target "root@192.168.1.100:/" --format json
# S3 (requires AWS credentials in environment or ~/.aws/credentials)
ghostlight scan --scanner s3 --target my-bucket/prefix --format json --output s3.json
# Azure Blob (connection string|container/prefix)
ghostlight scan --scanner azure --target "<conn>|container/prefix"
# Jira (issues & descriptions)
ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"
# Confluence (pages & blog posts)
# Target: confluence://BASE_URL[:/wiki]:EMAIL:API_TOKEN:SPACEKEY[?cql=URL_ENCODED_CQL]
# Examples:
# Personal space (~accountId) or normal space key; personal spaces are auto-quoted in CQL.
ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:you@example.com:ATLTOKEN:SPACEKEY" --format json --output confluence.json
# Optional custom CQL (env or inline):
# GHOSTLIGHT_CONFLUENCE_CQL='space="${SPACE}" AND type in (page,blogpost) AND lastmodified >= -30d ORDER BY lastmodified DESC'
# Inline: confluence://...:SPACEKEY?cql=space%3D%22SPACEKEY%22%20AND%20type%3Dpage
# RDS (AWS RDS instances)
export AWS_PROFILE=myprofile # or set AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY
export RDS_USERNAME=admin
export RDS_PASSWORD=yourpassword
ghostlight scan --scanner rds --target "rds://mydb-instance" # auto-detect engine/db, auto tables
ghostlight scan --scanner rds --target "rds://mydb-instance/postgres:mydb:" # explicit engine/db, auto tables
ghostlight scan --scanner rds --target "rds://mydb-instance/mysql:appdb:users,orders" --list-tables --show-sql
# Postgres (direct connection via DSN URL)
ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require" --list-tables --show-sql --sample-rows 1000
# AWS Comprehensive (auto-discovers ALL AWS resources: RDS + S3 + EC2)
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxx
export RDS_USERNAME=admin
export RDS_PASSWORD=dbpassword
ghostlight scan --scanner aws --target all --format json --output aws-full-scan.json
# AWS Specific resources
ghostlight scan --scanner aws --target rds,s3 --format table
ghostlight scan --scanner aws --target ec2 --format md
# EC2 (individual instance via SSM)
ghostlight scan --scanner ec2 --target i-1234567890abcdef0 --format table
Core
- Filesystem (fs): ghostlight scan --scanner fs --target /path/to/dir
- Git (git): ghostlight scan --scanner git --target https://github.com/user/repo.git
Cloud Storage
- Amazon S3 (s3): ghostlight scan --scanner s3 --target my-bucket/prefix
- Google Cloud Storage (gcs): ghostlight scan --scanner gcs --target my-bucket
- Azure Blob (azure): ghostlight scan --scanner azure --target "<conn>|container/prefix"
SaaS
- Google Drive (gdrive): ghostlight scan --scanner gdrive --target default
- GDrive Workspace (gdrive_workspace): ghostlight scan --scanner gdrive_workspace --target /path/to/delegated.json
- Slack (slack): ghostlight scan --scanner slack --target "xoxb-...:C12345"
- Jira (jira): ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"
- Confluence (confluence): ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY"
Compute
Databases
- AWS RDS (rds): ghostlight scan --scanner rds --target "rds://my-instance-id"
- PostgreSQL (postgres): ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
- MySQL (mysql): ghostlight scan --scanner mysql --target "mysql://user:pass@host:3306/db"
Tips
- Use --list-tables (DB scans) to print discovered tables.
- Use --show-sql to log executed SQL.
- Use --strict to aggressively reduce false positives (requires multiple detections or matches).
- Tune --min-entropy (default 3.5) for secrets; raise to reduce noise.
- Increase --sample-bytes for deeper content sampling.
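To build intuition for the --min-entropy threshold, the sketch below computes Shannon entropy in bits per character, a standard measure for telling random-looking secrets from ordinary words. Whether Ghostlight computes entropy exactly this way is an assumption; the function names here are illustrative only.

```python
import math
from collections import Counter

MIN_ENTROPY = 3.5  # documented default of --min-entropy


def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def passes_entropy(candidate: str, min_entropy: float = MIN_ENTROPY) -> bool:
    """Keep a candidate secret only if it is random-looking enough."""
    return shannon_entropy(candidate) >= min_entropy


if __name__ == "__main__":
    for value in ("password", "abcdefghijklmnop"):
        print(value, round(shannon_entropy(value), 2), passes_entropy(value))
```

Under this measure a dictionary word such as "password" scores 2.75 bits per character and is dropped at the default threshold, which is why raising the value reduces noise at the cost of missing low-entropy secrets.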
Enable AI-based filtering by setting GHOSTLIGHT_AI_FILTER:
- Values: auto (default), ollama, openai, anthropic, off
- One-shot example:
GHOSTLIGHT_AI_FILTER=auto ghostlight scan --scanner fs --target /path/to/dir --format json --output results.json
- For local free AI, install Ollama and pull a model (e.g., ollama pull llama3.2).
The Confluence scanner searches pages (and optionally blog posts) using CQL and classifies content for PII/PHI/PCI/Secrets.
Prerequisites:
- Atlassian Cloud account email and API token:
  - Create token: https://id.atlassian.com/manage-profile/security/api-tokens
- Confluence base URL, often ends with /wiki.
Target format:
confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY[?cql=URL_ENCODED_CQL]
Notes:
- Personal spaces like ~accountId are automatically quoted in the default CQL.
- The default CQL is bounded to one space to avoid unbounded-query errors: space = "SPACEKEY" AND type=page ORDER BY lastmodified DESC.
- You can override CQL with GHOSTLIGHT_CONFLUENCE_CQL (supports the ${SPACE} macro) or inline with ?cql=....
- A connection test runs before scanning, and each page title is logged as it is scanned.
- Pagination is cursor-aware and loop-safe.
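The inline ?cql= value must be URL-encoded. A minimal sketch of assembling a target string with Python's standard library follows; the helper name `build_confluence_target` is hypothetical and simply concatenates the fields in the documented order.

```python
from urllib.parse import quote


def build_confluence_target(base_url, email, token, space, cql=None):
    """Assemble a confluence:// target; percent-encode the optional CQL."""
    target = f"confluence://{base_url}:{email}:{token}:{space}"
    if cql is not None:
        # safe="" also encodes "/" so the whole query is escaped
        target += "?cql=" + quote(cql, safe="")
    return target


if __name__ == "__main__":
    print(build_confluence_target(
        "https://your-domain.atlassian.net/wiki",
        "EMAIL", "API_TOKEN", "SPACEKEY",
        cql='space="SPACEKEY" AND type=page',
    ))
```

Because the fields are separated by colons, this sketch assumes the email, token, and space key do not themselves contain a colon.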
Examples:
export CONF_EMAIL="you@company.com"
export CONF_TOKEN="atlassian_api_token_here"
export CONF_SPACE="ENG"
ghostlight scan --scanner confluence \
--target "confluence://https://your-domain.atlassian.net/wiki:${CONF_EMAIL}:${CONF_TOKEN}:${CONF_SPACE}" \
--format json --output confluence.json
# Custom CQL via env (scans pages and blogposts updated in last 30d)
GHOSTLIGHT_CONFLUENCE_CQL='space="${SPACE}" AND type in (page,blogpost) AND lastmodified >= -30d ORDER BY lastmodified DESC' \
ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:${CONF_EMAIL}:${CONF_TOKEN}:${CONF_SPACE}"
JSON output enrichments:
- title, last_updated
- num_detections, num_matches, bucket_match_counts, pattern_match_counts, top_exact_matches
The AWS scanner automatically discovers and scans the supported AWS resource types (RDS, S3, and EC2) using your AWS credentials.
- RDS: PostgreSQL, MySQL, MariaDB instances
- S3: All buckets and objects
- EC2: Running instances via SSM Session Manager (no SSH keys needed!)
# Install boto3
pip install boto3
# Configure AWS credentials
aws configure
# OR
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_DEFAULT_REGION=us-east-1
# For RDS scanning
export RDS_USERNAME=admin
export RDS_PASSWORD=your_db_password
# For EC2 scanning (SSM must be configured on instances)
# No additional credentials needed - uses AWS credentials
Scan everything:
ghostlight scan --scanner aws --target all --format json --output aws-scan.json
Scan specific resource types:
# Only RDS databases
ghostlight scan --scanner aws --target rds --format table
# Only S3 buckets
ghostlight scan --scanner aws --target s3 --format json
# Only EC2 instances
ghostlight scan --scanner aws --target ec2 --format md
# RDS and S3 (skip EC2)
ghostlight scan --scanner aws --target rds,s3 --format json
RDS:
- Auto-discovers all RDS instances in the region
- Scans tables for PII, PHI, PCI, Secrets
- Includes RDS configuration risk assessment
S3:
- Auto-discovers all S3 buckets in the account
- Scans all objects in each bucket
- Checks bucket security configuration (public access, encryption)
EC2:
- Auto-discovers all running EC2 instances
- Scans via SSM Session Manager (no SSH keys required!)
- Scans /var/log, /etc, /home, /opt by default
- Detects secrets in configuration files and logs
Your AWS user/role needs these permissions:
- sts:GetCallerIdentity
- rds:DescribeDBInstances
- s3:ListAllMyBuckets, s3:ListBucket, s3:GetObject
- ec2:DescribeInstances
- ssm:DescribeInstanceInformation, ssm:SendCommand, ssm:GetCommandInvocation
See AWS_COMPREHENSIVE_SCANNING.md for the complete IAM policy.
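As a rough illustration, the actions listed above can be assembled into an IAM policy document. This is only a sketch: the statement ID and the wildcard Resource are assumptions for the example, and the policy in AWS_COMPREHENSIVE_SCANNING.md remains the authoritative version.

```python
import json

# Actions taken from the permission list above.
ACTIONS = [
    "sts:GetCallerIdentity",
    "rds:DescribeDBInstances",
    "s3:ListAllMyBuckets",
    "s3:ListBucket",
    "s3:GetObject",
    "ec2:DescribeInstances",
    "ssm:DescribeInstanceInformation",
    "ssm:SendCommand",
    "ssm:GetCommandInvocation",
]


def build_policy(actions):
    """Return an IAM policy document allowing the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GhostlightScan",  # assumed name, any alphanumeric Sid works
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",  # assumption: narrow to specific ARNs in practice
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(ACTIONS), indent=2))
```

Note that ssm:SendCommand allows command execution on instances, so scoping its Resource to the instances you intend to scan is worth considering.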
for region in us-east-1 us-west-2 eu-west-1; do
export AWS_DEFAULT_REGION=$region
ghostlight scan --scanner aws --target all \
--format json --output "aws-scan-${region}.json"
done
For a detailed AWS scanning guide, see AWS_COMPREHENSIVE_SCANNING.md.
# Create Personal Access Token at: https://github.com/settings/tokens
# Scopes needed: repo (full access)
export GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://github.com/user/private-repo.git
# Create token at: Settings > Access Tokens (read_repository scope)
export GITLAB_TOKEN=YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://gitlab.com/user/private-repo.git
# Create App Password at: Settings > App passwords (Repositories: Read)
export BITBUCKET_USERNAME=your_username
export BITBUCKET_TOKEN=your_app_password
ghostlight scan --scanner git --target https://bitbucket.org/user/private-repo.git
# Configure SSH key once
ssh-keygen -t ed25519 -C "your_email@example.com"
# Add public key to GitHub/GitLab/Bitbucket settings
# Then use SSH URLs
ghostlight scan --scanner git --target git@github.com:user/private-repo.git
The VM scanner connects to remote virtual machines over SSH and scans files and directories for sensitive data. It supports both single files and recursive directory scanning.
- SSH Access - You need SSH access to the target VM:
# Test SSH connection first
ssh user@hostname
- SSH Key Setup (recommended):
# Generate SSH key if you don't have one
ssh-keygen -t ed25519 -C "your_email@example.com"
# Copy public key to VM
ssh-copy-id user@hostname
- Python Dependencies:
pip install paramiko
Scan from root directory (requires appropriate permissions):
# Recursively scan from root
ghostlight scan --scanner vm \
--target "root@192.168.1.100:/" \
--format json --output vm-root-scan.json
Scan specific directories:
# Scan /etc and /var/log directories
ghostlight scan --scanner vm \
--target "ubuntu@myvm.example.com:/etc,/var/log" \
--format table
# Scan home directory
ghostlight scan --scanner vm \
--target "user@hostname:/home/user" \
--format md --output vm-report.md
Scan specific files:
# Scan individual configuration files
ghostlight scan --scanner vm \
--target "admin@prod-server:/etc/config.json,/var/app/secrets.yaml" \
--format json
Scan application directories:
# Scan web application directory
ghostlight scan --scanner vm \
--target "deploy@webserver:/var/www/html" \
--format table
# Scan multiple application folders
ghostlight scan --scanner vm \
--target "app@server:/opt/app1,/opt/app2,/home/app/logs" \
--format json --output app-scan.json
Target format:
user@hostname:/path1,/path2,/path3
Where:
user = SSH username
hostname = VM hostname or IP address
paths = Comma-separated file or directory paths (directories are scanned recursively)
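The target format above can be split into its parts with a few string operations. The following is an illustrative sketch rather than Ghostlight's own parser; `VMTarget` and `parse_vm_target` are hypothetical names, and IPv6 hosts are not handled.

```python
from typing import NamedTuple


class VMTarget(NamedTuple):
    user: str
    host: str
    paths: list


def parse_vm_target(target: str) -> VMTarget:
    """Split 'user@hostname:/path1,/path2' into user, host, and paths."""
    login, colon, path_part = target.partition(":")
    user, at, host = login.partition("@")
    if not (colon and at and user and host):
        raise ValueError(f"expected user@hostname:/path[,/path...], got {target!r}")
    # Drop empty entries so a trailing comma is tolerated.
    paths = [p.strip() for p in path_part.split(",") if p.strip()]
    if not paths:
        raise ValueError("at least one path is required")
    return VMTarget(user, host, paths)


if __name__ == "__main__":
    print(parse_vm_target("app@server:/opt/app1,/opt/app2,/home/app/logs"))
```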
- Recursive Directory Traversal: Automatically discovers all files in specified directories
- Smart Filtering: Skips binary files (.jpg, .png, .zip, .exe, etc.)
- Size Limits: Respects --max-file-mb setting (default: 20 MB)
- Hidden Files: Automatically skips hidden files and directories (starting with .)
- Common Ignores: Skips node_modules, __pycache__, .git, venv, .venv, .cache
- Detects: PII, PHI, PCI, secrets in all scanned text files
- Risk Scoring: Includes sensitivity scoring and risk assessment
- Read-Only: Only performs read operations via SFTP
- Sampling: Reads up to --sample-bytes per file (default: 2048 bytes)
- Encryption: Uses SSH/SFTP protocol (encrypted by default)
- Permissions: Respects file system permissions (files you can't read are skipped)
- Performance: Large directories may take time; consider scanning specific subdirectories
1. Password-based (interactive):
# Will prompt for password
ghostlight scan --scanner vm --target "user@hostname:/path"
2. SSH key-based (recommended):
# No password prompt if key is in default location (~/.ssh/id_rsa or ~/.ssh/id_ed25519)
ghostlight scan --scanner vm --target "user@hostname:/path"
3. SSH config file:
# Add to ~/.ssh/config
Host myvm
HostName 192.168.1.100
User admin
IdentityFile ~/.ssh/vm_key
# Then use the alias
ghostlight scan --scanner vm --target "admin@myvm:/var/app"
Development Server Audit:
ghostlight scan --scanner vm \
--target "developer@dev-server:/home/developer,/var/www,/opt/projects" \
--format md --output dev-audit.md
Production Server Security Scan:
ghostlight scan --scanner vm \
--target "admin@prod-01:/var/log,/etc,/opt/applications" \
--format json --output prod-security-scan.json
Configuration Files Audit:
ghostlight scan --scanner vm \
--target "ops@server:/etc/nginx,/etc/mysql,/etc/redis,/etc/app" \
--format table
The RDS scanner connects to AWS RDS instances (PostgreSQL, MySQL, MariaDB) and scans tables for sensitive data.
- AWS Credentials - Configure AWS CLI or set environment variables:
aws configure
# Or manually:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
- IAM Permissions - Your AWS user/role needs:
rds:DescribeDBInstances
- Database Credentials:
export RDS_USERNAME=admin
export RDS_PASSWORD=your_db_password
- Network Access - Ensure:
- RDS security group allows inbound from your IP
- Or run from EC2/Lambda in same VPC
Scan specific tables:
export RDS_USERNAME=admin
export RDS_PASSWORD=mypassword
ghostlight scan --scanner rds \
--target "rds://my-postgres-instance/postgres:mydb:users,orders,payments" \
--format json --output rds-scan.json
Auto-discover and scan all tables:
# Omit table list to scan all tables (up to 50)
ghostlight scan --scanner rds \
--target "rds://my-mysql-prod/mysql:appdb:" \
--format table
Target Format:
rds://INSTANCE_ID/ENGINE:DATABASE:TABLE1,TABLE2,TABLE3
Where:
INSTANCE_ID = RDS instance identifier (from AWS console)
ENGINE = postgres, mysql, or mariadb
DATABASE = Database name to scan
TABLES = Comma-separated table names (optional, auto-discovers if empty)
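To make the optional parts of this format concrete, here is a small parsing sketch. It is not Ghostlight's parser; `parse_rds_target` is a hypothetical helper that treats a missing engine, database, or table list as "auto-detect", matching the behavior described for the rds:// targets in this README.

```python
ENGINES = {"postgres", "mysql", "mariadb"}


def parse_rds_target(target: str) -> dict:
    """Parse rds://INSTANCE_ID[/ENGINE:DATABASE:TABLE1,TABLE2]."""
    prefix = "rds://"
    if not target.startswith(prefix):
        raise ValueError("target must start with rds://")
    instance_id, _, spec = target[len(prefix):].partition("/")
    if not instance_id:
        raise ValueError("missing RDS instance identifier")
    engine = database = None
    tables = []
    if spec:
        parts = spec.split(":", 2)
        engine = parts[0] or None
        if len(parts) > 1:
            database = parts[1] or None
        if len(parts) > 2:
            # An empty table list means "auto-discover tables".
            tables = [t.strip() for t in parts[2].split(",") if t.strip()]
    if engine is not None and engine not in ENGINES:
        raise ValueError(f"unsupported engine: {engine}")
    return {"instance_id": instance_id, "engine": engine,
            "database": database, "tables": tables}


if __name__ == "__main__":
    print(parse_rds_target("rds://app-mysql/mysql:production:customers,orders"))
```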
Examples:
# PostgreSQL RDS
rds://prod-postgres/postgres:analytics:user_events,transactions
# MySQL RDS
rds://app-mysql/mysql:production:customers,orders
# Auto-discover tables
rds://dev-db/postgres:testdb:
- Samples up to 100 rows per table
- Detects PII, PHI, PCI, secrets in table data
- Reports row counts, column names
- Includes RDS instance metadata
- Risk scoring based on data sensitivity + RDS config
- Uses read-only queries (SELECT)
- Credentials are never logged or stored
- Samples limited data (configurable via --sample-bytes)
- Supports SSL/TLS connections (default for RDS)
- Connector setup index: readmes/INDEX.md
- JSON output guide: output_json.md
- Architecture and data flow: readmes/architecture.md
- Ghostlight applies context-aware filters to reduce false positives (e.g., phone vs timestamp, credit-card Luhn checks, JWT validation).
- Use --strict and --min-entropy to tune precision; see per-connector guides for details.
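The Luhn check mentioned above is a standard checksum for card numbers: it rejects most digit strings that merely look like a credit card. A minimal sketch of the algorithm follows; the function name `luhn_valid` is illustrative and how Ghostlight wires the check into its filters is not shown here.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep ASCII digits only, so spaces and dashes are ignored.
    digits = [int(ch) for ch in number if "0" <= ch <= "9"]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # well-known test card number
    print(luhn_valid("4111 1111 1111 1112"))
```

A 16-digit order ID or timestamp run will usually fail this checksum, which is how a pattern match can be downgraded before it is reported as PCI data.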
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.