ghostlight

Data Security Posture Management tool that scans cloud storage, VMs, databases, and git repositories to detect sensitive data and classify findings against GDPR, HIPAA, PCI, and secrets exposure.

Quick Commands

Icons	Tool	Command
	Filesystem	`ghostlight scan --scanner fs --target /path/to/dir`
	Git	`ghostlight scan --scanner git --target https://github.com/user/repo.git`
	Amazon S3	`ghostlight scan --scanner s3 --target my-bucket/prefix`
	Google Cloud Storage	`ghostlight scan --scanner gcs --target my-bucket`
	Azure Blob	`ghostlight scan --scanner azure --target "conn/container/prefix"`
	Google Drive	`ghostlight scan --scanner gdrive --target default`
	GDrive Workspace	`ghostlight scan --scanner gdrive_workspace --target /path/to/delegated.json`
	Slack	`ghostlight scan --scanner slack --target "xoxb-...:C12345"`
	VM over SSH	`ghostlight scan --scanner vm --target "user@host:/etc,/var/log"`
	AWS RDS	`ghostlight scan --scanner rds --target "rds://my-instance-id"`
	PostgreSQL	`ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"`
	MySQL	`ghostlight scan --scanner mysql --target "mysql://user:pass@host:3306/db"`
	Jira	`ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"`
	Confluence	`ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY"`

Architecture

See the system architecture and data flow in readmes/architecture.md.

Preview:

## Features

Sensitive Data Detection

PII: Email, Phone, SSN, Aadhaar, PAN, Passport, Driver License, IBAN, IP addresses, Coordinates
PHI: Medical Record Numbers, NPI, Medicare ID, Medical Records
PCI: Credit Card Numbers
Secrets: AWS keys, Google API keys, GitHub tokens, Slack tokens, JWT, PEM keys, database connection strings, API keys, passwords
Intellectual Property: Private keys, JWT tokens, API paths

Classification & Risk Scoring

Multi-framework compliance classification (GDPR, HIPAA, PCI-DSS)
Sensitivity scoring based on pattern weights
Exposure risk assessment (encryption, public access, versioning)
Combined risk scoring with actionable factors

Multi-Format Support

Text files: .txt, .md, .csv, .log, .json, .yaml, .xml, .html
Code files: .js, .py, .java, .cpp, .go, .rs, .sh, .sql
Documents: .pdf (layout-aware), .docx, .xlsx, .csv
Archives: .zip, .tar/.gz/.tgz/.bz2, .7z (sampled contents)
Images OCR: .png, .jpg/.jpeg, .bmp, .tiff, .webp (requires tesseract)
Binary detection and automatic skipping

Cloud Configuration Checks

S3 bucket public access analysis
Encryption status (SSE, KMS)
Versioning and logging configuration
ACL and policy evaluation

Quickstart

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m ghostlight --help
python -m pip install -e .
ghostlight --help

Documentation

Connector setup index: readmes/INDEX.md
JSON output guide: output_json.md

Per-connector setup guides (alphabetical):

Docker

docker pull ayush1136/ghostlight

Run (basic):

docker run --rm \
  ayush1136/ghostlight \
  --help

Run a scan and write results to host (recommended):

mkdir -p ./scan_result
docker run --rm \
  -v $(pwd)/scan_result:/out \
  ayush1136/ghostlight \
  scan --scanner fs --target /app \
  --format json --output /out/fs.json

Usage

source .venv/bin/activate

# Filesystem (single file or directory)
ghostlight scan --scanner fs --target ./myfile.txt --format table
ghostlight scan --scanner fs --target /path/to/dir --format json --output results.json

# Git repository (local)
ghostlight scan --scanner git --target /path/to/repo --format md --output report.md

# Git repository (public remote)
ghostlight scan --scanner git --target https://github.com/user/repo.git --format json

# Git repository (private - GitHub)
export GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://github.com/user/private-repo.git

# Git repository (private - SSH)
ghostlight scan --scanner git --target git@github.com:user/private-repo.git

# Virtual Machine (remote via SSH - supports recursive directory scanning)
ghostlight scan --scanner vm --target "user@hostname:/path/to/scan" --format table
ghostlight scan --scanner vm --target "root@192.168.1.100:/" --format json

# S3 (requires AWS credentials in environment or ~/.aws/credentials)
ghostlight scan --scanner s3 --target my-bucket/prefix --format json --output s3.json

# Azure Blob (connection string|container/prefix)
ghostlight scan --scanner azure --target "<conn>|container/prefix"

# Jira (issues & descriptions)
ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"

# Confluence (pages & blog posts)
# Target: confluence://BASE_URL[:/wiki]:EMAIL:API_TOKEN:SPACEKEY[?cql=URL_ENCODED_CQL]
# Examples:
#   Personal space (~accountId) or normal space key; personal spaces are auto-quoted in CQL.
ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:you@example.com:ATLTOKEN:SPACEKEY" --format json --output confluence.json
# Optional custom CQL (env or inline):
#   GHOSTLIGHT_CONFLUENCE_CQL='space="${SPACE}" AND type in (page,blogpost) AND lastmodified >= -30d ORDER BY lastmodified DESC'
# Inline: confluence://...:SPACEKEY?cql=space%3D%22SPACEKEY%22%20AND%20type%3Dpage

# RDS (AWS RDS instances)
export AWS_PROFILE=myprofile  # or set AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY
export RDS_USERNAME=admin
export RDS_PASSWORD=yourpassword
ghostlight scan --scanner rds --target "rds://mydb-instance"               # auto-detect engine/db, auto tables
ghostlight scan --scanner rds --target "rds://mydb-instance/postgres:mydb:" # explicit engine/db, auto tables
ghostlight scan --scanner rds --target "rds://mydb-instance/mysql:appdb:users,orders" --list-tables --show-sql

# Postgres (direct connection via DSN URL)
ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require" --list-tables --show-sql --sample-rows 1000

# AWS Comprehensive (auto-discovers ALL AWS resources: RDS + S3 + EC2)
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxx
export RDS_USERNAME=admin
export RDS_PASSWORD=dbpassword
ghostlight scan --scanner aws --target all --format json --output aws-full-scan.json

# AWS Specific resources
ghostlight scan --scanner aws --target rds,s3 --format table
ghostlight scan --scanner aws --target ec2 --format md

# EC2 (individual instance via SSM)
ghostlight scan --scanner ec2 --target i-1234567890abcdef0 --format table

Supported Scanners

Core

Filesystem (fs): ghostlight scan --scanner fs --target /path/to/dir
Git (git): ghostlight scan --scanner git --target https://github.com/user/repo.git

Cloud Storage

Amazon S3 (s3): ghostlight scan --scanner s3 --target my-bucket/prefix
Google Cloud Storage (gcs): ghostlight scan --scanner gcs --target my-bucket
Azure Blob (azure): ghostlight scan --scanner azure --target "<conn>|container/prefix"

SaaS

Google Drive (gdrive): ghostlight scan --scanner gdrive --target default
GDrive Workspace (gdrive_workspace): ghostlight scan --scanner gdrive_workspace --target /path/to/delegated.json
Slack (slack): ghostlight scan --scanner slack --target "xoxb-...:C12345"
Jira (jira): ghostlight scan --scanner jira --target "jira://https://your-domain.atlassian.net:EMAIL:API_TOKEN:PROJECT"
Confluence (confluence): ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY"

Compute

VM over SSH (vm): ghostlight scan --scanner vm --target "user@host:/etc,/var/log"

Databases

AWS RDS (rds): ghostlight scan --scanner rds --target "rds://my-instance-id"
PostgreSQL (postgres): ghostlight scan --scanner postgres --target "postgresql://user:pass@host:5432/db?sslmode=require"
MySQL (mysql): ghostlight scan --scanner mysql --target "mysql://user:pass@host:3306/db"

Tips

Use --list-tables (DB scans) to print discovered tables.
Use --show-sql to log executed SQL.
Use --strict to aggressively reduce false positives (requires multiple detections or matches).
Tune --min-entropy (default 3.5) for secrets; raise to reduce noise.
Increase --sample-bytes for deeper content sampling.

AI False-Positive Reduction (optional)

Enable AI-based filtering by setting GHOSTLIGHT_AI_FILTER:

Values: auto (default), ollama, openai, anthropic, off

One-shot example:

GHOSTLIGHT_AI_FILTER=auto ghostlight scan --scanner fs --target /path/to/dir --format json --output results.json

For local free AI, install Ollama and pull a model (e.g., ollama pull llama3.2).

Confluence Scanner

The Confluence scanner searches pages (and optionally blog posts) using CQL and classifies content for PII/PHI/PCI/Secrets.

Prerequisites:

Atlassian Cloud account email and API token:
- Create token: https://id.atlassian.com/manage-profile/security/api-tokens
Confluence base URL, often ends with /wiki.

Target format:

confluence://https://your-domain.atlassian.net/wiki:EMAIL:API_TOKEN:SPACEKEY[?cql=URL_ENCODED_CQL]

Notes:

Personal spaces like ~accountId are automatically quoted in the default CQL.
Default bounded CQL avoids unbounded errors: space = "SPACEKEY" AND type=page ORDER BY lastmodified DESC.
You can override CQL with GHOSTLIGHT_CONFLUENCE_CQL (supports ${SPACE} macro) or inline ?cql=....
Connection test runs before scanning, and the page title is logged as it scans.
Pagination is cursor-aware and loop-safe.

Examples:

export CONF_EMAIL="you@company.com"
export CONF_TOKEN="atlassian_api_token_here"
export CONF_SPACE="ENG"

ghostlight scan --scanner confluence \
  --target "confluence://https://your-domain.atlassian.net/wiki:${CONF_EMAIL}:${CONF_TOKEN}:${CONF_SPACE}" \
  --format json --output confluence.json

# Custom CQL via env (scans pages and blogposts updated in last 30d)
GHOSTLIGHT_CONFLUENCE_CQL='space="${SPACE}" AND type in (page,blogpost) AND lastmodified >= -30d ORDER BY lastmodified DESC' \
ghostlight scan --scanner confluence --target "confluence://https://your-domain.atlassian.net/wiki:${CONF_EMAIL}:${CONF_TOKEN}:${CONF_SPACE}"

JSON output enrichments:

title, last_updated
num_detections, num_matches, bucket_match_counts, pattern_match_counts, top_exact_matches

AWS Comprehensive Scanning

The AWS scanner automatically discovers and scans ALL your AWS resources using AWS credentials.

Supported Resources

RDS: PostgreSQL, MySQL, MariaDB instances
S3: All buckets and objects
EC2: Running instances via SSM Session Manager (no SSH keys needed!)

Prerequisites

# Install boto3
pip install boto3

# Configure AWS credentials
aws configure
# OR
export AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
export AWS_DEFAULT_REGION=us-east-1

# For RDS scanning
export RDS_USERNAME=admin
export RDS_PASSWORD=your_db_password

# For EC2 scanning (SSM must be configured on instances)
# No additional credentials needed - uses AWS credentials

Usage Examples

Scan everything:

ghostlight scan --scanner aws --target all --format json --output aws-scan.json

Scan specific resource types:

# Only RDS databases
ghostlight scan --scanner aws --target rds --format table

# Only S3 buckets
ghostlight scan --scanner aws --target s3 --format json

# Only EC2 instances
ghostlight scan --scanner aws --target ec2 --format md

# RDS and S3 (skip EC2)
ghostlight scan --scanner aws --target rds,s3 --format json

What Gets Discovered & Scanned

RDS:

Auto-discovers all RDS instances in the region
Scans tables for PII, PHI, PCI, Secrets
Includes RDS configuration risk assessment

S3:

Auto-discovers all S3 buckets in the account
Scans all objects in each bucket
Checks bucket security configuration (public access, encryption)

EC2:

Auto-discovers all running EC2 instances
Scans via SSM Session Manager (no SSH keys required!)
Scans /var/log, /etc, /home, /opt by default
Detects secrets in configuration files and logs

IAM Permissions Required

Your AWS user/role needs these permissions:

sts:GetCallerIdentity
rds:DescribeDBInstances
s3:ListAllMyBuckets, s3:ListBucket, s3:GetObject
ec2:DescribeInstances
ssm:DescribeInstanceInformation, ssm:SendCommand, ssm:GetCommandInvocation

See AWS_COMPREHENSIVE_SCANNING.md for complete IAM policy.

Multi-Region Scanning

for region in us-east-1 us-west-2 eu-west-1; do
  export AWS_DEFAULT_REGION=$region
  ghostlight scan --scanner aws --target all \
    --format json --output "aws-scan-${region}.json"
done

📖 For detailed AWS scanning guide, see: AWS_COMPREHENSIVE_SCANNING.md

Authentication for Private Repositories

GitHub

# Create Personal Access Token at: https://github.com/settings/tokens
# Scopes needed: repo (full access)
export GITHUB_TOKEN=ghp_YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://github.com/user/private-repo.git

GitLab

# Create token at: Settings > Access Tokens (read_repository scope)
export GITLAB_TOKEN=YOUR_TOKEN_HERE
ghostlight scan --scanner git --target https://gitlab.com/user/private-repo.git

Bitbucket

# Create App Password at: Settings > App passwords (Repositories: Read)
export BITBUCKET_USERNAME=your_username
export BITBUCKET_TOKEN=your_app_password
ghostlight scan --scanner git --target https://bitbucket.org/user/private-repo.git

SSH Authentication (All providers)

# Configure SSH key once
ssh-keygen -t ed25519 -C "your_email@example.com"
# Add public key to GitHub/GitLab/Bitbucket settings
# Then use SSH URLs
ghostlight scan --scanner git --target git@github.com:user/private-repo.git

Scanning Virtual Machines (Remote via SSH)

The VM scanner connects to remote Virtual Machines via SSH and scans files/directories for sensitive data. It supports both single files and recursive directory scanning.

Prerequisites

SSH Access - You need SSH access to the target VM:

# Test SSH connection first
ssh user@hostname

SSH Key Setup (recommended):

# Generate SSH key if you don't have one
ssh-keygen -t ed25519 -C "your_email@example.com"

# Copy public key to VM
ssh-copy-id user@hostname

Python Dependencies:

pip install paramiko

Usage Examples

Scan from root directory (requires appropriate permissions):

# Recursively scan from root
ghostlight scan --scanner vm \
  --target "root@192.168.1.100:/" \
  --format json --output vm-root-scan.json

Scan specific directories:

# Scan /etc and /var/log directories
ghostlight scan --scanner vm \
  --target "ubuntu@myvm.example.com:/etc,/var/log" \
  --format table

# Scan home directory
ghostlight scan --scanner vm \
  --target "user@hostname:/home/user" \
  --format md --output vm-report.md

Scan specific files:

# Scan individual configuration files
ghostlight scan --scanner vm \
  --target "admin@prod-server:/etc/config.json,/var/app/secrets.yaml" \
  --format json

Scan application directories:

# Scan web application directory
ghostlight scan --scanner vm \
  --target "deploy@webserver:/var/www/html" \
  --format table

# Scan multiple application folders
ghostlight scan --scanner vm \
  --target "app@server:/opt/app1,/opt/app2,/home/app/logs" \
  --format json --output app-scan.json

Target Format

user@hostname:/path1,/path2,/path3

Where:
  user     = SSH username
  hostname = VM hostname or IP address
  paths    = Comma-separated file or directory paths (directories are scanned recursively)

What Gets Scanned

Recursive Directory Traversal: Automatically discovers all files in specified directories
Smart Filtering: Skips binary files (.jpg, .png, .zip, .exe, etc.)
Size Limits: Respects --max-file-mb setting (default: 20 MB)
Hidden Files: Automatically skips hidden files and directories (starting with .)
Common Ignores: Skips node_modules, __pycache__, .git, venv, .venv, .cache
Detects: PII, PHI, PCI, secrets in all scanned text files
Risk Scoring: Includes sensitivity scoring and risk assessment

Security & Performance Notes

Read-Only: Only performs read operations via SFTP
Sampling: Reads up to --sample-bytes per file (default: 2048 bytes)
Encryption: Uses SSH/SFTP protocol (encrypted by default)
Permissions: Respects file system permissions (files you can't read are skipped)
Performance: Large directories may take time; consider scanning specific subdirectories

SSH Authentication Options

1. Password-based (interactive):

# Will prompt for password
ghostlight scan --scanner vm --target "user@hostname:/path"

2. SSH key-based (recommended):

# No password prompt if key is in default location (~/.ssh/id_rsa or ~/.ssh/id_ed25519)
ghostlight scan --scanner vm --target "user@hostname:/path"

3. SSH config file:

# Add to ~/.ssh/config
Host myvm
    HostName 192.168.1.100
    User admin
    IdentityFile ~/.ssh/vm_key

# Then use the alias
ghostlight scan --scanner vm --target "admin@myvm:/var/app"

Common Scenarios

Development Server Audit:

ghostlight scan --scanner vm \
  --target "developer@dev-server:/home/developer,/var/www,/opt/projects" \
  --format md --output dev-audit.md

Production Server Security Scan:

ghostlight scan --scanner vm \
  --target "admin@prod-01:/var/log,/etc,/opt/applications" \
  --format json --output prod-security-scan.json

Configuration Files Audit:

ghostlight scan --scanner vm \
  --target "ops@server:/etc/nginx,/etc/mysql,/etc/redis,/etc/app" \
  --format table

Scanning AWS RDS Databases

The RDS scanner connects to AWS RDS instances (PostgreSQL, MySQL, MariaDB) and scans tables for sensitive data.

Prerequisites

AWS Credentials - Configure AWS CLI or set environment variables:

aws configure
# Or manually:
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1

IAM Permissions - Your AWS user/role needs:
- rds:DescribeDBInstances
Database Credentials:

export RDS_USERNAME=admin
export RDS_PASSWORD=your_db_password

Network Access - Ensure:
- RDS security group allows inbound from your IP
- Or run from EC2/Lambda in same VPC

Usage Examples

Scan specific tables:

export RDS_USERNAME=admin
export RDS_PASSWORD=mypassword
ghostlight scan --scanner rds \
  --target "rds://my-postgres-instance/postgres:mydb:users,orders,payments" \
  --format json --output rds-scan.json

Auto-discover and scan all tables:

# Omit table list to scan all tables (up to 50)
ghostlight scan --scanner rds \
  --target "rds://my-mysql-prod/mysql:appdb:" \
  --format table

Target Format:

rds://INSTANCE_ID/ENGINE:DATABASE:TABLE1,TABLE2,TABLE3

Where:
  INSTANCE_ID = RDS instance identifier (from AWS console)
  ENGINE      = postgres, mysql, or mariadb
  DATABASE    = Database name to scan
  TABLES      = Comma-separated table names (optional, auto-discovers if empty)

Examples:

# PostgreSQL RDS
rds://prod-postgres/postgres:analytics:user_events,transactions

# MySQL RDS
rds://app-mysql/mysql:production:customers,orders

# Auto-discover tables
rds://dev-db/postgres:testdb:

What Gets Scanned

Samples up to 100 rows per table
Detects PII, PHI, PCI, secrets in table data
Reports row counts, column names
Includes RDS instance metadata
Risk scoring based on data sensitivity + RDS config

Security Notes

Uses read-only queries (SELECT)
Credentials are never logged or stored
Samples limited data (configurable via --sample-bytes)
Supports SSL/TLS connections (default for RDS)

Notes

Ghostlight applies context-aware filters to reduce false positives (e.g., phone vs timestamp, credit-card Luhn checks, JWT validation).
Use --strict and --min-entropy to tune precision; see per-connector guides for details.

License

This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 48 Commits
.github		.github
ghostlight		ghostlight
readmes		readmes
tests		tests
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
docker_overview.md		docker_overview.md
output_json.md		output_json.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

ghostlight

Quick Commands

Architecture

Sensitive Data Detection

Classification & Risk Scoring

Multi-Format Support

Cloud Configuration Checks

Quickstart

Documentation

Docker

Usage

Supported Scanners

AI False-Positive Reduction (optional)

Confluence Scanner

AWS Comprehensive Scanning

Supported Resources

Prerequisites

Usage Examples

What Gets Discovered & Scanned

IAM Permissions Required

Multi-Region Scanning

Authentication for Private Repositories

GitHub

GitLab

Bitbucket

SSH Authentication (All providers)

Scanning Virtual Machines (Remote via SSH)

Prerequisites

Usage Examples

Target Format

What Gets Scanned

Security & Performance Notes

SSH Authentication Options

Common Scenarios

Scanning AWS RDS Databases

Prerequisites

Usage Examples

What Gets Scanned

Security Notes

Further Reading

Notes

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages