AyushAggarwal1/ghostlight

ghostlight

Data Security Posture Management tool that scans cloud storage, VMs, databases, and git repositories to detect sensitive data and classify findings against GDPR, HIPAA, PCI, and secrets exposure.

Quick Commands

  • Filesystem: ghostlight scan --scanner fs --target /path/to/dir
  • Git: ghostlight scan --scanner git --target https://github.com/user/repo.git
  • Amazon S3: ghostlight scan --scanner s3 --target my-bucket/prefix
  • Google Cloud Storage: ghostlight scan --scanner gcs --target my-bucket
  • Azure Blob: ghostlight scan --scanner azure --target "<conn>|container/prefix"
  • Google Drive: ghostlight scan --scanner gdrive --target default
  • GDrive Workspace: ghostlight scan --scanner gdrive_workspace --target /path/to/delegated.json
  • Slack: ghostlight scan --scanner slack --target "xoxb-...:C12345"
  • VM over SSH: ghostlight scan --scanner vm --target "user@host:/etc,/var/log"
  • AWS RDS: ghostlight scan --scanner rds --target "rds://my-instance-id"
  • PostgreSQL: ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
  • MySQL: ghostlight scan --scanner mysql --target "mysql://user:pass@host:3306/db"
  • Jira: ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"
  • Confluence: ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY"

Architecture

See the system architecture and data flow in readmes/architecture.md.

Preview: Ghostlight data flow diagram

Features

Sensitive Data Detection

  • PII: Email, Phone, SSN, Aadhaar, PAN, Passport, Driver License, IBAN, IP addresses, Coordinates
  • PHI: Medical Record Numbers, NPI, Medicare ID, Medical Records
  • PCI: Credit Card Numbers
  • Secrets: AWS keys, Google API keys, GitHub tokens, Slack tokens, JWT, PEM keys, database connection strings, API keys, passwords
  • Intellectual Property: Private keys, JWT tokens, API paths

Classification & Risk Scoring

  • Multi-framework compliance classification (GDPR, HIPAA, PCI-DSS)
  • Sensitivity scoring based on pattern weights
  • Exposure risk assessment (encryption, public access, versioning)
  • Combined risk scoring with actionable factors
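
As a rough illustration of how pattern weights and exposure checks could combine into one score, here is a minimal sketch. The weights, the `Exposure` fields, and the combining formula are all hypothetical and chosen only for this example; they are not Ghostlight's actual values or code.

```python
from dataclasses import dataclass

# Hypothetical per-category weights (assumption; not Ghostlight's real weights).
PATTERN_WEIGHTS = {"secrets": 1.0, "pci": 0.9, "phi": 0.8, "pii": 0.6}


@dataclass
class Exposure:
    """Hypothetical exposure facts for the scanned resource."""
    encrypted: bool
    public: bool
    versioned: bool


def sensitivity_score(match_counts: dict[str, int]) -> float:
    """Return the weight of the heaviest category that has at least one match."""
    present = [PATTERN_WEIGHTS.get(cat, 0.0)
               for cat, count in match_counts.items() if count > 0]
    return max(present, default=0.0)


def exposure_score(exp: Exposure) -> tuple[float, list[str]]:
    """Return an exposure score in [0, 1] plus the factors that raised it."""
    score = 0.0
    factors: list[str] = []
    if exp.public:
        score += 0.6
        factors.append("publicly accessible")
    if not exp.encrypted:
        score += 0.3
        factors.append("not encrypted at rest")
    if not exp.versioned:
        score += 0.1
        factors.append("versioning disabled")
    return min(score, 1.0), factors


def combined_risk(match_counts: dict[str, int], exp: Exposure) -> tuple[float, list[str]]:
    """Scale sensitivity by exposure; sensitive data keeps half its score even when unexposed."""
    sens = sensitivity_score(match_counts)
    exp_score, factors = exposure_score(exp)
    return round(sens * (0.5 + 0.5 * exp_score), 3), factors


if __name__ == "__main__":
    risk, why = combined_risk({"pci": 2}, Exposure(encrypted=False, public=True, versioned=True))
    print(risk, why)
```

Returning the factor list alongside the number is what makes the score actionable: a reader can see which exposure setting to change first.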

Multi-Format Support

  • Text files: .txt, .md, .csv, .log, .json, .yaml, .xml, .html
  • Code files: .js, .py, .java, .cpp, .go, .rs, .sh, .sql
  • Documents: .pdf (layout-aware), .docx, .xlsx, .csv
  • Archives: .zip, .tar/.gz/.tgz/.bz2, .7z (sampled contents)
  • Images OCR: .png, .jpg/.jpeg, .bmp, .tiff, .webp (requires tesseract)
  • Binary detection and automatic skipping
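
Binary detection is commonly done with a byte-level heuristic on a leading sample of the file. The sketch below shows one such heuristic (a NUL byte, or a high share of control bytes); the function name and the 30% threshold are assumptions for illustration, not Ghostlight's implementation.

```python
# Bytes treated as "text": common control characters, printable ASCII,
# and high bytes (so UTF-8 encoded text is not misclassified).
_TEXT_BYTES = bytes({7, 8, 9, 10, 12, 13, 27} | set(range(0x20, 0x7F)) | set(range(0x80, 0x100)))


def looks_binary(sample: bytes, threshold: float = 0.30) -> bool:
    """Heuristically decide whether a leading sample of a file is binary."""
    if not sample:
        return False  # an empty file is treated as text
    if b"\x00" in sample:
        return True  # NUL bytes almost never appear in text files
    # Delete every text byte; whatever remains is suspicious.
    non_text = sample.translate(None, _TEXT_BYTES)
    return len(non_text) / len(sample) > threshold


if __name__ == "__main__":
    print(looks_binary(b"key = value\n"))
    print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
```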

Cloud Configuration Checks

  • S3 bucket public access analysis
  • Encryption status (SSE, KMS)
  • Versioning and logging configuration
  • ACL and policy evaluation

Quickstart

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m ghostlight --help
python -m pip install -e .
ghostlight --help

Documentation

Per-connector setup guides (alphabetical):

Docker

docker pull ayush1136/ghostlight

Run (basic):

docker run --rm \
  ayush1136/ghostlight \
  --help

Run a scan and write results to host (recommended):

mkdir -p ./scan_result
docker run --rm \
  -v "$(pwd)/scan_result:/out" \
  ayush1136/ghostlight \
  scan --scanner fs --target /app \
  --format json --output /out/fs.json

Usage

source .venv/bin/activate

# Filesystem (single file or directory)
ghostlight scan --scanner fs --target ./myfile.txt --format table
ghostlight scan --scanner fs --target /path/to/dir --format json --output results.json

# Git repository (local)
ghostlight scan --scanner git --target /path/to/repo --format md --output report.md

# Git repository (public remote)
ghostlight scan --scanner git --target https://github.com/user/repo.git --format json

# Git repository (private - GitHub)
export GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://github.com/user/private-repo.git

# Git repository (private - SSH)
ghostlight scan --scanner git --target git@github.com:user/private-repo.git

# Virtual Machine (remote via SSH - supports recursive directory scanning)
ghostlight scan --scanner vm --target "user@hostname:/path/to/scan" --format table
ghostlight scan --scanner vm --target "root@192.168.1.100:/" --format json

# S3 (requires AWS credentials in environment or ~/.aws/credentials)
ghostlight scan --scanner s3 --target my-bucket/prefix --format json --output s3.json

# Azure Blob (connection string|container/prefix)
ghostlight scan --scanner azure --target "<conn>|container/prefix"

# Jira (issues & descriptions)
ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"

# Confluence (pages & blog posts)
# Target: confluence://BASE_URL[:/wiki]:EMAIL:API_TOKEN:SPACEKEY[?cql=URL_ENCODED_CQL]
# Examples:
#   Personal space (~accountId) or normal space key; personal spaces are auto-quoted in CQL.
ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:you@example.com:ATLTOKEN:SPACEKEY" --format json --output confluence.json
# Optional custom CQL (env or inline):
#   GHOSTLIGHT_CONFLUENCE_CQL='space="${SPACE}" AND type in (page,blogpost) AND lastmodified >= -30d ORDER BY lastmodified DESC'
# Inline: confluence://...:SPACEKEY?cql=space%3D%22SPACEKEY%22%20AND%20type%3Dpage

# RDS (AWS RDS instances)
export AWS_PROFILE=myprofile  # or set AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY
export RDS_USERNAME=admin
export RDS_PASSWORD=yourpassword
ghostlight scan --scanner rds --target "rds://mydb-instance"               # auto-detect engine/db, auto tables
ghostlight scan --scanner rds --target "rds://mydb-instance/postgres:mydb:" # explicit engine/db, auto tables
ghostlight scan --scanner rds --target "rds://mydb-instance/mysql:appdb:users,orders" --list-tables --show-sql

# Postgres (direct connection via DSN URL)
ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require" --list-tables --show-sql --sample-rows 1000

# AWS Comprehensive (auto-discovers ALL AWS resources: RDS + S3 + EC2)
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxx
export RDS_USERNAME=admin
export RDS_PASSWORD=dbpassword
ghostlight scan --scanner aws --target all --format json --output aws-full-scan.json

# AWS Specific resources
ghostlight scan --scanner aws --target rds,s3 --format table
ghostlight scan --scanner aws --target ec2 --format md

# EC2 (individual instance via SSM)
ghostlight scan --scanner ec2 --target i-1234567890abcdef0 --format table

Supported Scanners

Core

  • Filesystem (fs): ghostlight scan --scanner fs --target /path/to/dir
  • Git (git): ghostlight scan --scanner git --target https://github.com/user/repo.git

Cloud Storage

  • Amazon S3 (s3): ghostlight scan --scanner s3 --target my-bucket/prefix
  • Google Cloud Storage (gcs): ghostlight scan --scanner gcs --target my-bucket
  • Azure Blob (azure): ghostlight scan --scanner azure --target "<conn>|container/prefix"

SaaS

  • Google Drive (gdrive): ghostlight scan --scanner gdrive --target default
  • GDrive Workspace (gdrive_workspace): ghostlight scan --scanner gdrive_workspace --target /path/to/delegated.json
  • Slack (slack): ghostlight scan --scanner slack --target "xoxb-...:C12345"
  • Jira (jira): ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"
  • Confluence (confluence): ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY"

Compute

  • VM over SSH (vm): ghostlight scan --scanner vm --target "user@host:/etc,/var/log"

Databases

  • AWS RDS (rds): ghostlight scan --scanner rds --target "rds://my-instance-id"
  • PostgreSQL (postgres): ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
  • MySQL (mysql): ghostlight scan --scanner mysql --target "mysql://user:pass@host:3306/db"

Tips

  • Use --list-tables (DB scans) to print discovered tables.
  • Use --show-sql to log executed SQL.
  • Use --strict to aggressively reduce false positives (requires multiple detections or matches).
  • Tune --min-entropy (default 3.5) for secrets; raise to reduce noise.
  • Increase --sample-bytes for deeper content sampling.
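
To build intuition for --min-entropy, the sketch below computes Shannon entropy in bits per character, which is the usual measure behind entropy thresholds for secrets. It assumes the flag compares against this quantity; the helper names are hypothetical and this is not Ghostlight's code.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def passes_min_entropy(candidate: str, min_entropy: float = 3.5) -> bool:
    """Keep a candidate secret only if it looks random enough."""
    return shannon_entropy(candidate) >= min_entropy


if __name__ == "__main__":
    for candidate in ("password", "0123456789abcdef"):
        print(candidate, round(shannon_entropy(candidate), 2), passes_min_entropy(candidate))
```

Dictionary words repeat characters and score low, while random tokens spread across many characters and score high, so raising the threshold drops more word-like matches.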

AI False-Positive Reduction (optional)

Enable AI-based filtering by setting GHOSTLIGHT_AI_FILTER:

  • Values: auto (default), ollama, openai, anthropic, off
  • One-shot example:
    GHOSTLIGHT_AI_FILTER=auto ghostlight scan --scanner fs --target /path/to/dir --format json --output results.json
  • For local free AI, install Ollama and pull a model (e.g., ollama pull llama3.2).

Confluence Scanner

The Confluence scanner searches pages (and optionally blog posts) using CQL and classifies content for PII/PHI/PCI/Secrets.

Prerequisites:

  • Atlassian Cloud account email and API token:
    • Create token: https://id.atlassian.com/manage-profile/security/api-tokens
  • Confluence base URL, often ends with /wiki.

Target format:

confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY[?cql=URL_ENCODED_CQL]
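
Because the base URL itself contains colons, the target is easiest to split from the right. The sketch below is one way to parse it, assuming the email, API token, and space key contain no colons and that the optional query is exactly ?cql=...; `parse_confluence_target` and `ConfluenceTarget` are hypothetical names, not part of Ghostlight.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class ConfluenceTarget(NamedTuple):
    base_url: str
    email: str
    api_token: str
    space_key: str
    cql: Optional[str]


def parse_confluence_target(target: str) -> ConfluenceTarget:
    """Split confluence://BASE_URL:EMAIL:API_TOKEN:SPACEKEY[?cql=...] into fields."""
    prefix = "confluence://"
    if not target.startswith(prefix):
        raise ValueError("target must start with confluence://")
    body, _, encoded_cql = target[len(prefix):].partition("?cql=")
    # Split from the right so the colons in "https://" stay in the base URL.
    parts = body.rsplit(":", 3)
    if len(parts) != 4 or not all(parts):
        raise ValueError("expected BASE_URL:EMAIL:API_TOKEN:SPACEKEY")
    base_url, email, api_token, space_key = parts
    cql = unquote(encoded_cql) if encoded_cql else None
    return ConfluenceTarget(base_url, email, api_token, space_key, cql)


if __name__ == "__main__":
    print(parse_confluence_target(
        "confluence://https://your-domain.atlassian.net/wiki:you@example.com:ATLTOKEN:ENG"
    ))
```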

Notes:

  • Personal spaces like ~accountId are automatically quoted in the default CQL.
  • The default CQL is bounded so Confluence does not reject it as an unbounded query: space = "SPACEKEY" AND type=page ORDER BY lastmodified DESC.
  • You can override CQL with GHOSTLIGHT_CONFLUENCE_CQL (supports ${SPACE} macro) or inline ?cql=....
  • A connection test runs before scanning, and each page title is logged as it is scanned.
  • Pagination is cursor-aware and loop-safe.
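
The ${SPACE} macro and the inline ?cql= form can be reproduced with a few lines of Python, which is handy when preparing a custom query. This is a sketch of the string handling only; `build_cql` and `inline_cql_suffix` are hypothetical helpers, and the default template is copied from the note above.

```python
from typing import Optional
from urllib.parse import quote

# Default query from the notes above, with the space key as a ${SPACE} macro.
DEFAULT_CQL = 'space = "${SPACE}" AND type=page ORDER BY lastmodified DESC'


def build_cql(space_key: str, template: Optional[str] = None) -> str:
    """Expand the ${SPACE} macro in a CQL template (default template if none is given)."""
    return (template or DEFAULT_CQL).replace("${SPACE}", space_key)


def inline_cql_suffix(cql: str) -> str:
    """URL-encode a CQL string so it can be appended to the target as ?cql=..."""
    return "?cql=" + quote(cql, safe="")


if __name__ == "__main__":
    print(build_cql("ENG"))
    print(inline_cql_suffix('space="SPACEKEY" AND type=page'))
```

Keeping the space key inside double quotes in the template is what lets personal spaces such as ~accountId work without extra escaping.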

Examples:

export CONF_EMAIL="you@company.com"
export CONF_TOKEN="atlassian_api_token_here"
export CONF_SPACE="ENG"

ghostlight scan --scanner confluence \
  --target "confluence://https://your-domain.atlassian.net/wiki:${CONF_EMAIL}:${CONF_TOKEN}:${CONF_SPACE}" \
  --format json --output confluence.json

# Custom CQL via env (scans pages and blogposts updated in last 30d)
GHOSTLIGHT_CONFLUENCE_CQL='space="${SPACE}" AND type in (page,blogpost) AND lastmodified >= -30d ORDER BY lastmodified DESC' \
ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:${CONF_EMAIL}:${CONF_TOKEN}:${CONF_SPACE}"

JSON output enrichments:

  • title, last_updated
  • num_detections, num_matches, bucket_match_counts, pattern_match_counts, top_exact_matches

AWS Comprehensive Scanning

The AWS scanner uses your AWS credentials to automatically discover and scan the supported resource types (RDS, S3, and EC2) in your account.

Supported Resources

  • RDS: PostgreSQL, MySQL, MariaDB instances
  • S3: All buckets and objects
  • EC2: Running instances via SSM Session Manager (no SSH keys needed!)

Prerequisites

# Install boto3
pip install boto3

# Configure AWS credentials
aws configure
# OR
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_DEFAULT_REGION=us-east-1

# For RDS scanning
export RDS_USERNAME=admin
export RDS_PASSWORD=your_db_password

# For EC2 scanning (SSM must be configured on instances)
# No additional credentials needed - uses AWS credentials

Usage Examples

Scan everything:

ghostlight scan --scanner aws --target all --format json --output aws-scan.json

Scan specific resource types:

# Only RDS databases
ghostlight scan --scanner aws --target rds --format table

# Only S3 buckets
ghostlight scan --scanner aws --target s3 --format json

# Only EC2 instances
ghostlight scan --scanner aws --target ec2 --format md

# RDS and S3 (skip EC2)
ghostlight scan --scanner aws --target rds,s3 --format json

What Gets Discovered & Scanned

RDS:

  • Auto-discovers all RDS instances in the region
  • Scans tables for PII, PHI, PCI, Secrets
  • Includes RDS configuration risk assessment

S3:

  • Auto-discovers all S3 buckets in the account
  • Scans all objects in each bucket
  • Checks bucket security configuration (public access, encryption)

EC2:

  • Auto-discovers all running EC2 instances
  • Scans via SSM Session Manager (no SSH keys required!)
  • Scans /var/log, /etc, /home, /opt by default
  • Detects secrets in configuration files and logs

IAM Permissions Required

Your AWS user/role needs these permissions:

  • sts:GetCallerIdentity
  • rds:DescribeDBInstances
  • s3:ListAllMyBuckets, s3:ListBucket, s3:GetObject
  • ec2:DescribeInstances
  • ssm:DescribeInstanceInformation, ssm:SendCommand, ssm:GetCommandInvocation

See AWS_COMPREHENSIVE_SCANNING.md for complete IAM policy.
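
For orientation, the actions listed above can be assembled into an IAM policy document as follows. This is only a sketch: the Sid and the wildcard Resource are assumptions made for brevity, and the complete policy in AWS_COMPREHENSIVE_SCANNING.md takes precedence.

```python
import json

# Actions copied from the list above.
GHOSTLIGHT_ACTIONS = [
    "sts:GetCallerIdentity",
    "rds:DescribeDBInstances",
    "s3:ListAllMyBuckets",
    "s3:ListBucket",
    "s3:GetObject",
    "ec2:DescribeInstances",
    "ssm:DescribeInstanceInformation",
    "ssm:SendCommand",
    "ssm:GetCommandInvocation",
]


def build_policy() -> dict:
    """Build a minimal IAM policy document that allows the listed actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GhostlightScan",  # hypothetical statement ID
                "Effect": "Allow",
                "Action": GHOSTLIGHT_ACTIONS,
                "Resource": "*",  # assumption: narrow this to specific ARNs where possible
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(), indent=2))
```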

Multi-Region Scanning

for region in us-east-1 us-west-2 eu-west-1; do
  export AWS_DEFAULT_REGION=$region
  ghostlight scan --scanner aws --target all \
    --format json --output "aws-scan-${region}.json"
done

📖 For a detailed AWS scanning guide, see AWS_COMPREHENSIVE_SCANNING.md.

Authentication for Private Repositories

GitHub

# Create Personal Access Token at: https://github.com/settings/tokens
# Scopes needed: repo (full access)
export GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://github.com/user/private-repo.git

GitLab

# Create token at: Settings > Access Tokens (read_repository scope)
export GITLAB_TOKEN=YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://gitlab.com/user/private-repo.git

Bitbucket

# Create App Password at: Settings > App passwords (Repositories: Read)
export BITBUCKET_USERNAME=your_username
export BITBUCKET_TOKEN=your_app_password
ghostlight scan --scanner git --target https://bitbucket.org/user/private-repo.git

SSH Authentication (All providers)

# Configure SSH key once
ssh-keygen -t ed25519 -C "your_email@example.com"
# Add public key to GitHub/GitLab/Bitbucket settings
# Then use SSH URLs
ghostlight scan --scanner git --target git@github.com:user/private-repo.git

Scanning Virtual Machines (Remote via SSH)

The VM scanner connects to remote Virtual Machines via SSH and scans files/directories for sensitive data. It supports both single files and recursive directory scanning.

Prerequisites

  1. SSH Access - You need SSH access to the target VM:
# Test SSH connection first
ssh user@hostname
  2. SSH Key Setup (recommended):
# Generate SSH key if you don't have one
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy public key to VM
ssh-copy-id user@hostname
  3. Python Dependencies:
pip install paramiko

Usage Examples

Scan from root directory (requires appropriate permissions):

# Recursively scan from root
ghostlight scan --scanner vm \
  --target "root@192.168.1.100:/" \
  --format json --output vm-root-scan.json

Scan specific directories:

# Scan /etc and /var/log directories
ghostlight scan --scanner vm \
  --target "ubuntu@myvm.example.com:/etc,/var/log" \
  --format table

# Scan home directory
ghostlight scan --scanner vm \
  --target "user@hostname:/home/user" \
  --format md --output vm-report.md

Scan specific files:

# Scan individual configuration files
ghostlight scan --scanner vm \
  --target "admin@prod-server:/etc/config.json,/var/app/secrets.yaml" \
  --format json

Scan application directories:

# Scan web application directory
ghostlight scan --scanner vm \
  --target "deploy@webserver:/var/www/html" \
  --format table

# Scan multiple application folders
ghostlight scan --scanner vm \
  --target "app@server:/opt/app1,/opt/app2,/home/app/logs" \
  --format json --output app-scan.json

Target Format

user@hostname:/path1,/path2,/path3

Where:
  user     = SSH username
  hostname = VM hostname or IP address
  paths    = Comma-separated file or directory paths (directories are scanned recursively)
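
A target in this format can be split with two partitions and a comma split. The sketch below assumes the hostname contains no colon (so bare IPv6 addresses are out of scope); `parse_vm_target` and `VMTarget` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class VMTarget(NamedTuple):
    user: str
    host: str
    paths: list[str]


def parse_vm_target(target: str) -> VMTarget:
    """Split user@hostname:/path1,/path2 into user, host, and a list of paths."""
    login, colon, path_spec = target.partition(":")
    user, at, host = login.partition("@")
    if not (colon and at and user and host):
        raise ValueError("expected user@hostname:/path1,/path2")
    # Drop empty entries so a trailing comma does not produce an empty path.
    paths = [p.strip() for p in path_spec.split(",") if p.strip()]
    if not paths:
        raise ValueError("at least one path is required")
    return VMTarget(user, host, paths)


if __name__ == "__main__":
    print(parse_vm_target("ubuntu@myvm.example.com:/etc,/var/log"))
```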

What Gets Scanned

  • Recursive Directory Traversal: Automatically discovers all files in specified directories
  • Smart Filtering: Skips binary files (.jpg, .png, .zip, .exe, etc.)
  • Size Limits: Respects --max-file-mb setting (default: 20 MB)
  • Hidden Files: Automatically skips hidden files and directories (starting with .)
  • Common Ignores: Skips node_modules, __pycache__, .git, venv, .venv, .cache
  • Detects: PII, PHI, PCI, secrets in all scanned text files
  • Risk Scoring: Includes sensitivity scoring and risk assessment
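
The skip rules above amount to a per-file predicate. Here is a minimal sketch of such a predicate using the names and defaults listed in this section; the extension set is only the subset spelled out above (the "etc." is left open), and `should_scan` is a hypothetical helper, not Ghostlight's code.

```python
from pathlib import PurePosixPath

# Directory names and extensions taken from the list above.
IGNORED_DIRS = {"node_modules", "__pycache__", ".git", "venv", ".venv", ".cache"}
BINARY_EXTENSIONS = {".jpg", ".png", ".zip", ".exe"}


def should_scan(path: str, size_bytes: int, max_file_mb: int = 20) -> bool:
    """Return True if a remote file passes the hidden/ignore/binary/size filters."""
    p = PurePosixPath(path)
    parts = [part for part in p.parts if part != "/"]
    # Skip hidden files and directories, and common dependency/cache folders.
    if any(part.startswith(".") or part in IGNORED_DIRS for part in parts):
        return False
    # Skip files whose extension marks them as binary.
    if p.suffix.lower() in BINARY_EXTENSIONS:
        return False
    # Enforce the size limit (mirrors --max-file-mb).
    return size_bytes <= max_file_mb * 1024 * 1024


if __name__ == "__main__":
    print(should_scan("/etc/nginx/nginx.conf", 2048))
    print(should_scan("/home/user/.ssh/id_rsa", 2048))
```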

Security & Performance Notes

  • Read-Only: Only performs read operations via SFTP
  • Sampling: Reads up to --sample-bytes per file (default: 2048 bytes)
  • Encryption: Uses SSH/SFTP protocol (encrypted by default)
  • Permissions: Respects file system permissions (files you can't read are skipped)
  • Performance: Large directories may take time; consider scanning specific subdirectories

SSH Authentication Options

1. Password-based (interactive):

# Will prompt for password
ghostlight scan --scanner vm --target "user@hostname:/path"

2. SSH key-based (recommended):

# No password prompt if key is in default location (~/.ssh/id_rsa or ~/.ssh/id_ed25519)
ghostlight scan --scanner vm --target "user@hostname:/path"

3. SSH config file:

# Add to ~/.ssh/config
Host myvm
    HostName 192.168.1.100
    User admin
    IdentityFile ~/.ssh/vm_key

# Then use the alias
ghostlight scan --scanner vm --target "admin@myvm:/var/app"

Common Scenarios

Development Server Audit:

ghostlight scan --scanner vm \
  --target "developer@dev-server:/home/developer,/var/www,/opt/projects" \
  --format md --output dev-audit.md

Production Server Security Scan:

ghostlight scan --scanner vm \
  --target "admin@prod-01:/var/log,/etc,/opt/applications" \
  --format json --output prod-security-scan.json

Configuration Files Audit:

ghostlight scan --scanner vm \
  --target "ops@server:/etc/nginx,/etc/mysql,/etc/redis,/etc/app" \
  --format table

Scanning AWS RDS Databases

The RDS scanner connects to AWS RDS instances (PostgreSQL, MySQL, MariaDB) and scans tables for sensitive data.

Prerequisites

  1. AWS Credentials - Configure AWS CLI or set environment variables:
aws configure
# Or manually:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
  2. IAM Permissions - Your AWS user/role needs:

    • rds:DescribeDBInstances
  3. Database Credentials:

export RDS_USERNAME=admin
export RDS_PASSWORD=your_db_password
  4. Network Access - Ensure:
    • RDS security group allows inbound from your IP
    • Or run from EC2/Lambda in same VPC

Usage Examples

Scan specific tables:

export RDS_USERNAME=admin
export RDS_PASSWORD=mypassword
ghostlight scan --scanner rds \
  --target "rds://my-postgres-instance/postgres:mydb:users,orders,payments" \
  --format json --output rds-scan.json

Auto-discover and scan all tables:

# Omit table list to scan all tables (up to 50)
ghostlight scan --scanner rds \
  --target "rds://my-mysql-prod/mysql:appdb:" \
  --format table

Target Format:

rds://INSTANCE_ID/ENGINE:DATABASE:TABLE1,TABLE2,TABLE3

Where:
  INSTANCE_ID = RDS instance identifier (from AWS console)
  ENGINE      = postgres, mysql, or mariadb
  DATABASE    = Database name to scan
  TABLES      = Comma-separated table names (optional, auto-discovers if empty)
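
The format above, including the bare rds://INSTANCE_ID form used elsewhere in this README for auto-detection, can be parsed as sketched below. The `RDSTarget` type and `parse_rds_target` function are hypothetical, and the engine whitelist simply mirrors the three engines named above.

```python
from typing import NamedTuple, Optional

SUPPORTED_ENGINES = {"postgres", "mysql", "mariadb"}


class RDSTarget(NamedTuple):
    instance_id: str
    engine: Optional[str]    # None means "auto-detect"
    database: Optional[str]  # None means "auto-detect"
    tables: list[str]        # empty means "auto-discover"


def parse_rds_target(target: str) -> RDSTarget:
    """Split rds://INSTANCE_ID[/ENGINE:DATABASE:TABLE1,TABLE2] into fields."""
    prefix = "rds://"
    if not target.startswith(prefix):
        raise ValueError("target must start with rds://")
    instance_id, _, spec = target[len(prefix):].partition("/")
    if not instance_id:
        raise ValueError("missing RDS instance identifier")
    if not spec:
        return RDSTarget(instance_id, None, None, [])
    engine, _, rest = spec.partition(":")
    database, _, table_spec = rest.partition(":")
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"unsupported engine: {engine!r}")
    if not database:
        raise ValueError("missing database name")
    tables = [t.strip() for t in table_spec.split(",") if t.strip()]
    return RDSTarget(instance_id, engine, database, tables)


if __name__ == "__main__":
    print(parse_rds_target("rds://app-mysql/mysql:production:customers,orders"))
```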

Examples:

# PostgreSQL RDS
rds://prod-postgres/postgres:analytics:user_events,transactions

# MySQL RDS
rds://app-mysql/mysql:production:customers,orders

# Auto-discover tables
rds://dev-db/postgres:testdb:

What Gets Scanned

  • Samples up to 100 rows per table
  • Detects PII, PHI, PCI, secrets in table data
  • Reports row counts, column names
  • Includes RDS instance metadata
  • Risk scoring based on data sensitivity + RDS config

Security Notes

  • Uses read-only queries (SELECT)
  • Credentials are never logged or stored
  • Samples limited data (configurable via --sample-bytes)
  • Supports SSL/TLS connections (default for RDS)

Further Reading

Notes

  • Ghostlight applies context-aware filters to reduce false positives (e.g., phone vs timestamp, credit-card Luhn checks, JWT validation).
  • Use --strict and --min-entropy to tune precision; see per-connector guides for details.
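
As an example of one such filter, the Luhn checksum rejects most digit runs that merely look like card numbers. The sketch below is the standard algorithm; the 13 to 19 digit length bound and the function name are assumptions for illustration rather than Ghostlight's exact rules.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdecimal()]
    # Typical payment card numbers are 13 to 19 digits long.
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(luhn_valid("4111 1111 1111 1112"))
```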

License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.

About

Your data has secrets. GHOSTLIGHT (🕵️‍♂️) exposes them.
