Test Cases and Validation
StegX maintains a rigorous continuous integration test suite and formal statistical benchmarks to validate both functional correctness and steganographic invisibility. This page documents the testing methodology, the mathematical definitions behind each statistical test, and the concrete results.
The pytest suite is organized into four directories:
| Directory | Scope | What It Validates |
|---|---|---|
| `tests/unit/` | Individual functions | KDF output correctness, GF(2⁸) arithmetic, header pack/unpack round-trips, compression codec identity, SecureBuffer zeroization |
| `tests/integration/` | End-to-end pipeline | Full encode → decode cycle across PNG, BMP, TIFF, WebP; all embedding methods; all compression codecs |
| `tests/security/` | Adversarial scenarios | HMAC corruption detection, wrong-password rejection, brute-force timing validation, panic destruction completeness |
| `tests/system/` | CLI interface | Argument parsing, exit codes, stdin/stdout piping, shell completion validation |
pip install -r requirements/dev.txt
python -m pytest tests/ -v --tb=short
AEAD Forgery Detection:
The test suite intentionally corrupts individual bits within the AEAD authentication tag and verifies that `decrypt_data()` raises `AuthenticationFailure` rather than returning corrupted plaintext. This exercises the ciphertext-integrity guarantee that resistance to adaptive chosen-ciphertext attacks (CCA2) depends on.
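The bit-flipping check described above can be sketched with the standard library alone. This is an illustration, not StegX's test: an HMAC-SHA256 tag stands in for the AEAD tag, the message is left unencrypted, and `seal`, `open_sealed`, `count_rejected_tag_flips`, and the `AuthenticationFailure` class defined here are hypothetical names for this sketch.

```python
import hashlib
import hmac
import secrets

TAG_LEN = 32  # HMAC-SHA256 output size in bytes


class AuthenticationFailure(Exception):
    """Raised on tag mismatch (defined here for the sketch only)."""


def seal(key: bytes, message: bytes) -> bytes:
    """Append an HMAC-SHA256 tag to the message (stand-in for an AEAD tag)."""
    return message + hmac.new(key, message, hashlib.sha256).digest()


def open_sealed(key: bytes, blob: bytes) -> bytes:
    """Verify the tag in constant time and refuse to return unauthenticated data."""
    message, tag = blob[:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(key, message, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise AuthenticationFailure("tag mismatch")
    return message


def count_rejected_tag_flips(key: bytes, blob: bytes) -> int:
    """Flip each tag bit in turn and count how many corrupted blobs are rejected."""
    rejected = 0
    for bit in range(TAG_LEN * 8):
        corrupted = bytearray(blob)
        corrupted[len(blob) - TAG_LEN + bit // 8] ^= 1 << (bit % 8)
        try:
            open_sealed(key, bytes(corrupted))
        except AuthenticationFailure:
            rejected += 1
    return rejected


key = secrets.token_bytes(32)
blob = seal(key, b"secret payload")
rejected = count_rejected_tag_flips(key, blob)
print(f"{rejected} of {TAG_LEN * 8} single-bit tag corruptions rejected")
```

A passing run rejects every one of the 256 single-bit corruptions while the untouched blob still opens.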
Deterministic Payload Recovery: Payloads of varying sizes (1 byte, 1 KB, 100 KB, 1 MB, 50 MB) are encoded with each combination of:
- Embedding method: LSB Matching, LSB Replacement, Matrix Hamming
- Compression codec: zstd, brotli, lzma, zlib, bz2, none
- Cost map: Laplacian, HILL, disabled
- Cipher mode: single (AES-GCM), dual (AES-GCM + ChaCha20)
After decoding, the output is verified byte-for-byte against the original using SHA-256 digest comparison.
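The combinatorial round-trip check can be sketched as below. It covers only the compression stage and only the codecs in the Python standard library (zstd and brotli need third-party packages). The names `CODECS`, `PAYLOAD_SIZES`, `round_trip_ok`, and `run_matrix` are assumptions for this sketch, and the embedding and encryption stages of the real pipeline are omitted.

```python
import bz2
import hashlib
import itertools
import lzma
import os
import zlib

# Codec name -> (compress, decompress). "none" is the identity codec.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
    "none": (lambda data: data, lambda data: data),
}
PAYLOAD_SIZES = [1, 1024, 100 * 1024]  # bytes; larger sizes omitted for speed


def round_trip_ok(payload: bytes, codec: str) -> bool:
    """Compress, decompress, and compare SHA-256 digests of input and output."""
    compress, decompress = CODECS[codec]
    recovered = decompress(compress(payload))
    return hashlib.sha256(recovered).digest() == hashlib.sha256(payload).digest()


def run_matrix() -> dict:
    """Run every (size, codec) combination on a fresh random payload."""
    results = {}
    for size, codec in itertools.product(PAYLOAD_SIZES, CODECS):
        results[(size, codec)] = round_trip_ok(os.urandom(size), codec)
    return results


results = run_matrix()
failures = [combo for combo, ok in results.items() if not ok]
print(f"{len(results)} combinations, {len(failures)} failures")
```

In a pytest suite the same matrix would normally be expressed with `pytest.mark.parametrize`, so that each combination is reported as its own test case.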
Argon2id Timing Validation:
The test measures the wall-clock time of derive_master_key() with the default parameters and asserts that it exceeds a minimum threshold (e.g., 50ms), confirming that the memory-hard computation is actually being performed and not short-circuited.
Shamir Round-Trip:
Random payloads are split into shares with a reconstruction threshold of $K$. The test verifies:
- Any $K$ shares reconstruct the original payload exactly.
- Any $K-1$ shares fail to produce the correct payload (information-theoretic security).
- Shares with inconsistent thresholds or duplicate x-coordinates are rejected.
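The round-trip properties above can be illustrated with a compact Shamir scheme over GF(2⁸). This is a sketch under assumptions: it uses the AES reduction polynomial 0x11B and x-coordinates 1..n, neither of which is confirmed by this page, and `split`/`combine` are hypothetical names rather than StegX functions.

```python
import secrets


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def gf_inv(a: int) -> int:
    """Inverse of a nonzero element, using a^254 = a^-1 in GF(2^8)."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def split(secret: bytes, k: int, n: int) -> list:
    """Split secret into n shares (x, data); any k of them reconstruct it."""
    if not 1 <= k <= n <= 255:
        raise ValueError("need 1 <= k <= n <= 255")
    shares = [(x, bytearray()) for x in range(1, n + 1)]
    for byte in secret:
        # Random polynomial of degree k-1 whose constant term is the secret byte.
        coeffs = [byte] + [secrets.randbelow(256) for _ in range(k - 1)]
        for x, data in shares:
            y = 0
            for c in reversed(coeffs):  # Horner evaluation at x
                y = gf_mul(y, x) ^ c
            data.append(y)
    return [(x, bytes(data)) for x, data in shares]


def combine(shares: list) -> bytes:
    """Lagrange-interpolate every byte position at x = 0."""
    xs = [x for x, _ in shares]
    if len(set(xs)) != len(xs):
        raise ValueError("duplicate x-coordinates")
    weights = []
    for j, xj in enumerate(xs):
        num, den = 1, 1
        for m, xm in enumerate(xs):
            if m != j:
                num = gf_mul(num, xm)
                den = gf_mul(den, xm ^ xj)  # subtraction is XOR in GF(2^8)
        weights.append(gf_mul(num, gf_inv(den)))
    secret = bytearray()
    for i in range(len(shares[0][1])):
        acc = 0
        for w, (_, data) in zip(weights, shares):
            acc ^= gf_mul(w, data[i])
        secret.append(acc)
    return bytes(secret)


secret = secrets.token_bytes(32)
shares = split(secret, k=3, n=5)
print(combine(shares[:3]) == secret, combine(shares[:2]) == secret)
```

With only $K-1$ shares the interpolation still returns bytes, but they match the secret only by chance (probability $256^{-32}$ for a 32-byte payload), which is why the test compares the output instead of expecting an exception.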
Panic Destruction:
A stego image is created, then destroy_real_region_in_place() is called. The test verifies:
- The real region's LSBs have been overwritten with random data.
- The decoy region remains intact and extractable (in decoy mode).
- The original stego file has been atomically replaced.
- The `shred` command was invoked on the original (Linux only).
SecureBuffer Zeroization:
A SecureBuffer is created with known key material. After .close(), the test reads the underlying bytearray and confirms every byte is zero.
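The zeroization check can be sketched as follows. The class below is a hypothetical stand-in written for this example, not StegX's `SecureBuffer`; the `view()` accessor in particular is an assumption.

```python
class SecureBuffer:
    """Hypothetical zeroize-on-close buffer backed by a mutable bytearray."""

    def __init__(self, data: bytes):
        # A bytearray can be overwritten in place; an immutable bytes object cannot.
        self._buf = bytearray(data)

    def view(self) -> bytearray:
        """Return the underlying bytearray itself (not a copy)."""
        return self._buf

    def close(self) -> None:
        """Overwrite every byte with zero."""
        for i in range(len(self._buf)):
            self._buf[i] = 0

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()


key_material = bytes(range(1, 33))
buf = SecureBuffer(key_material)
raw = buf.view()          # keep a reference to the same underlying storage
buf.close()
print(all(b == 0 for b in raw))
```

Note that the `bytes` object passed to the constructor is a separate copy that this class cannot wipe, so a test of this kind only proves that the buffer's own storage was cleared.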
Purpose: Detect Pairs of Values (PoV) artifacts caused by LSB substitution.
Background: In a natural image, pixel values
Mathematical Definition:
where
Under the null hypothesis (no steganography), pixel value pairs have naturally unequal frequencies, producing a low
StegX Benchmark Results:
| Tool | Embedding Method | $\chi^2$ statistic | Detection |
|---|---|---|---|
| Steghide | Sequential LSB | 119,531.0 | Detected |
| StegX (standard) | Non-Linear LSB Matching | 4,209.3 | Borderline |
| StegX (`--extreme`) | Matrix Hamming | 1,187.78 | Undetected |
| Clean image (control) | None | 1,024.6 | Baseline |
Analysis: StegX with Matrix Embedding produces a $\chi^2$ statistic of 1,187.78, close to the clean-image baseline of 1,024.6. Three factors contribute:
- Matrix Embedding modifies only $\frac{n}{n+1}$ of blocks (87.5% for $n = 7$), and within those, only 1 bit per block.
- LSB Matching uses ±1 perturbation rather than forced replacement, avoiding the PoV artifact entirely.
- Adaptive cost-map filtering restricts embedding to high-texture regions where pixel value distributions are already noisy.
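The difference between forced replacement and ±1 matching can be made concrete with a short sketch. It is an illustration of the two textbook techniques, not StegX's embedder: the function names are assumptions, and position shuffling and cost maps are left out.

```python
import random


def lsb_replace(pixels, bits):
    """Overwrite each LSB with the message bit; a pixel never leaves its (2i, 2i+1) pair."""
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit
    return stego


def lsb_match(pixels, bits, rng):
    """If the LSB disagrees with the message bit, add or subtract 1 at random."""
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if (stego[i] & 1) != bit:
            if stego[i] == 0:
                stego[i] = 1          # cannot go below 0
            elif stego[i] == 255:
                stego[i] = 254        # cannot go above 255
            else:
                stego[i] += rng.choice((-1, 1))
    return stego


rng = random.Random(0)
pixels = [rng.randrange(256) for _ in range(1000)] + [0, 255]
bits = [rng.randrange(2) for _ in pixels]

replaced = lsb_replace(pixels, bits)
matched = lsb_match(pixels, bits, rng)

# Count pixels that moved into a different (2i, 2i+1) pair.
crossed_replace = sum((p >> 1) != (s >> 1) for p, s in zip(pixels, replaced))
crossed_match = sum((p >> 1) != (s >> 1) for p, s in zip(pixels, matched))
print(crossed_replace, crossed_match)
```

Both methods leave every LSB equal to its message bit and change each pixel by at most 1. Replacement only ever shifts counts within a value pair, which is the regularity PoV tests look for, whereas matching also moves pixels across pair boundaries.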
Purpose: Detect regions of suspiciously uniform randomness that indicate encrypted data.
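Mathematical Definition:
$$H = -\sum_{i=0}^{255} p_i \log_2 p_i$$
where $p_i$ is the relative frequency of byte value $i$ in the region being analyzed (terms with $p_i = 0$ contribute zero).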
Properties:
- A completely uniform distribution (all 256 values equally likely) yields $H_{\max} = \log_2(256) = 8.0$ bits/byte.
- Natural images exhibit $H \in [6.5, 7.8]$ depending on texture complexity.
- Encrypted ciphertext exhibits $H \approx 7.99$, i.e. near-perfect randomness.
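The entropy figures above can be reproduced with a few lines of standard-library Python. The helper names and the window parameters are assumptions for this sketch; it operates on raw bytes rather than decoded image channels.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from the byte-value histogram."""
    if not data:
        return 0.0
    n = len(data)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(data).values())


def window_entropies(data: bytes, window: int, step: int) -> list:
    """Entropy of each sliding window, as a forensic scanner might compute it."""
    return [
        shannon_entropy(data[start:start + window])
        for start in range(0, len(data) - window + 1, step)
    ]


uniform = bytes(range(256)) * 4      # every value equally likely
flat = bytes([200]) * 1024           # a perfectly flat region
profile = window_entropies(flat + uniform, window=256, step=256)
print(shannon_entropy(uniform), shannon_entropy(flat), profile)
```

In the windowed profile, the flat region scores 0 bits/byte and the uniform region scores 8, which is the kind of localized jump a sliding-window scan is designed to flag.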
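The Attack Vector: Forensic tools scan the image in sliding windows. If a localized region that should have low entropy, such as a flat sky, instead shows entropy close to that of ciphertext, the region is flagged as likely containing hidden data.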
How StegX Defeats This:
StegX's Laplacian/HILL cost maps exclude flat regions entirely. Data is embedded only in high-texture areas where the natural entropy is already high, so the embedded ciphertext does not stand out against its surroundings.
Furthermore, Matrix Embedding modifies so few bits (
This is orders of magnitude below the measurement precision of any steganalysis tool.
Purpose: Quantify visual degradation between the original cover image and the stego image, accounting for human visual perception.
Mathematical Definition:
$$\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where:
- $\mu_x, \mu_y$ are the mean pixel intensities of the original and stego image patches
- $\sigma_x^2, \sigma_y^2$ are the variances
- $\sigma_{xy}$ is the covariance
- $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$ are stabilization constants ($L = 255$ for 8-bit images, $K_1 = 0.01$, $K_2 = 0.03$)
Interpretation:
- $\text{SSIM} = 1.0$: Identical images
- $\text{SSIM} \geq 0.99$: Visually indistinguishable
- $\text{SSIM} < 0.95$: Noticeable artifacts
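To make the definition concrete, here is a minimal single-window SSIM in pure Python. It is a simplification: production implementations compute SSIM over local (typically Gaussian-weighted) windows and average the results, so the values will not match the benchmark figures exactly. The function name `global_ssim` is an assumption for this sketch.

```python
def global_ssim(x, y, L=255, K1=0.01, K2=0.03):
    """SSIM over one window covering the whole (flattened) image."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) * (a - mu_x) for a in x) / n
    var_y = sum((b - mu_y) * (b - mu_y) for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (K1 * L) ** 2
    c2 = (K2 * L) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x * mu_x + mu_y * mu_y + c1) * (var_x + var_y + c2)
    return numerator / denominator


cover = [10, 50, 200, 130, 90, 255, 0, 17]
stego = [11, 50, 199, 130, 90, 254, 0, 18]   # four pixels changed by +/-1
print(global_ssim(cover, cover), global_ssim(cover, stego))
```

Identical inputs give exactly 1.0, and the four ±1 changes in this toy example lower the score only in the fifth decimal place.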
StegX Benchmark:
| Configuration | Payload Size | SSIM |
|---|---|---|
| LSB Matching, Laplacian | 10 KB in 1920×1080 | 0.999987 |
| LSB Matching, Laplacian | 100 KB in 1920×1080 | 0.99994 |
| Matrix Hamming, HILL | 10 KB in 1920×1080 | 0.999998 |
| Matrix Hamming, HILL | 100 KB in 1920×1080 | 0.99997 |
All configurations maintain $\text{SSIM} > 0.9999$, well above the 0.99 level at which images are visually indistinguishable.
Purpose: Complementary metric to SSIM, measuring the ratio of maximum possible signal power to noise power.
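Mathematical Definition:
$$\text{PSNR} = 10 \log_{10}\left(\frac{\text{MAX}_I^2}{\text{MSE}}\right)$$
where $\text{MAX}_I = 255$ for 8-bit images and MSE is the Mean Squared Error between the original image $x$ and the stego image $y$, both of size $M \times N$:
$$\text{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left(x_{ij} - y_{ij}\right)^2$$
Since LSB modifications change pixel values by at most ±1, the MSE is bounded by $\text{MSE} \leq 1$, so $\text{PSNR} \geq 10 \log_{10}(255^2) \approx 48.13$ dB even if every pixel were modified.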
For Matrix Embedding at full capacity (
PSNR values above 50 dB are considered imperceptible. StegX consistently exceeds this threshold.
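The relationship between the fraction of modified pixels and PSNR can be checked numerically. The sketch below uses the standard MSE and PSNR definitions on flattened pixel lists; the function names and the toy images are assumptions.

```python
import math


def mse(x, y):
    """Mean squared error between two equally sized pixel sequences."""
    if not x or len(x) != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    err = mse(x, y)
    if err == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / err)


cover = [100] * 64
every_pixel_changed = [101] * 64            # worst case for +/-1 embedding
one_pixel_changed = [101] + [100] * 63      # 1 of 64 pixels changed

print(round(psnr(cover, every_pixel_changed), 2))
print(round(psnr(cover, one_pixel_changed), 2))
```

Changing every pixel by 1 gives the 48.13 dB floor. Changing 1 pixel in 64 raises PSNR by $10 \log_{10} 64 \approx 18.06$ dB, which shows why sparse embedding lands well above 50 dB.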
| Tool | Technique | Result Against StegX |
|---|---|---|
| stegseek | Brute-force + Steghide format detection | Fails completely (StegX uses a different container format and Argon2id KDF) |
| zsteg | LSB analysis, PoV detection, entropy scanning | No patterns found (Non-Linear embedding + adaptive filtering) |
| binwalk | Signature scanning, entropy analysis | Clean output (encrypted data has no recognizable signatures) |
| exiftool | Metadata inspection | Metadata clean (StegX strips all EXIF/PNG metadata on save) |
A stego image is created with a known 8-character password. A brute-force attack is simulated by calling derive_master_key() in a loop with random passwords and measuring throughput.
| Tool | KDF | Iterations/Memory | Passwords/sec (single core) | Time to exhaust 8-char alphanumeric keyspace ($62^8 \approx 2.18 \times 10^{14}$) |
|---|---|---|---|---|
| Steghide | MD5 | 1 | 20,000,000+ | ~126 days |
| OpenStego | PBKDF2-SHA256 | 1,000 | 500,000+ | ~14 years |
| StegX (PBKDF2 mode) | PBKDF2-SHA256 | 600,000 | ~830 | ~8,300 years |
| StegX (default) | Argon2id | t=3, m=64MB | ~9 | ~$7.7 \times 10^{5}$ years |
The Argon2id configuration makes dictionary attacks and even targeted brute-force attacks computationally infeasible against passwords with reasonable entropy.
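The throughput measurement described above can be sketched as follows. Because Argon2id is not in the Python standard library, this example substitutes `hashlib.pbkdf2_hmac` with a deliberately small iteration count as the KDF under test. The helper names, the salt, and the 62-character alphabet are assumptions, and the measured rate depends entirely on the machine running it.

```python
import hashlib
import secrets
import string
import time

ALPHABET = string.ascii_letters + string.digits   # 62 symbols
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def random_password(length: int = 8) -> str:
    """Draw a random alphanumeric guess."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def measure_throughput(kdf, attempts: int) -> float:
    """Call the KDF on random guesses and return guesses per second."""
    start = time.perf_counter()
    for _ in range(attempts):
        kdf(random_password().encode())
    return attempts / (time.perf_counter() - start)


def years_to_exhaust(keyspace: int, guesses_per_second: float) -> float:
    """Worst-case time to try every key at a fixed single-core rate."""
    return keyspace / guesses_per_second / SECONDS_PER_YEAR


def pbkdf2_stand_in(password: bytes) -> bytes:
    """Stand-in KDF for the sketch: PBKDF2-SHA256 with 10,000 iterations."""
    return hashlib.pbkdf2_hmac("sha256", password, b"demo-salt", 10_000)


rate = measure_throughput(pbkdf2_stand_in, attempts=20)
keyspace = len(ALPHABET) ** 8
print(f"{rate:.0f} guesses/s -> {years_to_exhaust(keyspace, rate):.1f} years for 62^8 keys")
```

The estimate scales linearly with attacker parallelism, so a single-core figure should be read as an upper bound on cracking time, not a prediction.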
A comprehensive steganalysis resistance evaluation was conducted using 10 independent detection methods spanning three categories: classical statistical attacks, information-theoretic similarity measures, and machine-learning/deep-learning classifiers.
Test Environment:
| Component | Local (CPU) | Cloud (GPU) |
|---|---|---|
| Hardware | Intel CPU, 16GB RAM | Google Colab, Tesla T4 16GB VRAM |
| Dataset | 30 image pairs per mode | 500 image pairs |
| Payload | 512 bytes random binary | 256 bytes random binary |
| Image size | 512×512 RGB PNG | 256×256 RGB PNG |
| Script | `tests/steganalysis/run_full_steganalysis.py` | `tests/steganalysis/colab_cnn_steganalysis.py` |
Embedding modes tested:
- Standard: LSB matching (±1) with PRNG-shuffled pixel positions
- Adaptive: Laplacian cost-map filtered embedding (high-edge regions only)
- Matrix: F5-style Hamming(7,3) matrix embedding (0.29 modifications per message bit)
- Adaptive + Matrix: Combined mode (strongest configuration)
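The 0.29 figure for the Matrix mode follows directly from how Hamming-based matrix embedding works: 3 message bits are carried by 7 cover LSBs with at most one flip. The sketch below is a textbook version of that construction with hypothetical function names; it is not taken from the StegX source.

```python
from itertools import product


def syndrome(lsbs) -> int:
    """XOR of the 1-based positions whose LSB is 1; this is the extracted message."""
    value = 0
    for position, bit in enumerate(lsbs, start=1):
        if bit:
            value ^= position
    return value


def embed_block(lsbs, message: int) -> list:
    """Embed a 3-bit message (0..7) into 7 LSBs by flipping at most one of them."""
    if len(lsbs) != 7 or not 0 <= message < 8:
        raise ValueError("need 7 cover bits and a message in 0..7")
    stego = list(lsbs)
    flip = syndrome(stego) ^ message
    if flip:                       # flip == 0 means the block already encodes the message
        stego[flip - 1] ^= 1
    return stego


def average_changes() -> float:
    """Average number of flipped bits over every cover block and every message."""
    total = 0
    for cover in product((0, 1), repeat=7):
        for message in range(8):
            stego = embed_block(cover, message)
            total += sum(c != s for c, s in zip(cover, stego))
    return total / (2 ** 7 * 8)


per_block = average_changes()
print(per_block, round(per_block / 3, 2))
```

Seven of the eight possible messages require one flip and one requires none, giving 7/8 changes per block, or about 0.29 changes per embedded bit.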
Data integrity controls:
- All cover images are procedurally generated with controlled randomness (seeded PRNG) for reproducibility
- Cover/stego pairs from the same source image are never split across train/test partitions (GroupKFold for ML, image-level indexing for CNN), eliminating data leakage
- Balanced dataset: 50% cover, 50% stego in every experiment
- Test set contains only unseen images — no image appears in both training and evaluation
Four classical statistical attacks were applied independently to each cover and stego image. Detection significance was assessed via two-sided Mann-Whitney U tests comparing the distributions of cover statistics against stego statistics. A p-value above 0.05 indicates no statistically significant difference (undetected).
Chi-Square (χ²) Analysis: Measures the deviation of Pairs of Values (PoV) from expected uniformity in the LSB plane. Classic LSB replacement creates detectable asymmetry; LSB matching (±1) eliminates it.
RS Analysis: Classifies pixel blocks into Regular, Singular, and Unusable groups under positive and negative flipping masks. A discrepancy between R and S group counts reveals hidden data.
Sample Pair Analysis (SPA): Estimates embedding rate by counting close pixel pairs (|p₁ − p₂| ≤ 1) and comparing observed ratios to theoretical baselines.
Shannon Entropy Deviation: Computes per-channel Shannon entropy H = −Σ pᵢ log₂(pᵢ) and measures the absolute difference between cover and stego.
| Test | Metric | Standard | Adaptive | Matrix | Adaptive+Matrix |
|---|---|---|---|---|---|
| Chi-Square | p-value | 1.000 | 1.000 | 1.000 | 1.000 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
| RS Analysis | p-value | 0.959 | 0.751 | 0.994 | 0.784 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
| Sample Pair | p-value | 0.888 | 0.871 | 0.906 | 0.929 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
| Entropy | Δ | 0.000264 | 0.000249 | 0.000166 | 0.000150 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
All p-values are far above the 0.05 significance threshold, so none of the tests distinguishes cover from stego images at this payload size. The chi-square p-value of 1.000 is consistent with LSB matching (±1) not producing the PoV asymmetry that classical detectors exploit.
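The significance test used for these verdicts can be sketched without SciPy. The version below is a simplified two-sided Mann-Whitney U test using the normal approximation, with average ranks for ties but no tie or continuity correction, so its p-values can differ slightly from `scipy.stats.mannwhitneyu`. The sample data is synthetic.

```python
import math


def mann_whitney_u(a, b):
    """Return (U statistic for sample a, two-sided p-value by normal approximation)."""
    n1, n2 = len(a), len(b)
    pooled = sorted([(value, 0) for value in a] + [(value, 1) for value in b])
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2          # mean of the 1-based ranks i+1 .. j
        members_of_a = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_a += average_rank * members_of_a
        i = j
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


cover_stats = [float(v) for v in range(30)]
similar_stats = [v + 0.5 for v in cover_stats]     # heavily overlapping distribution
shifted_stats = [v + 100 for v in cover_stats]     # clearly separated distribution

u_same, p_same = mann_whitney_u(cover_stats, similar_stats)
u_diff, p_diff = mann_whitney_u(cover_stats, shifted_stats)
print(p_same > 0.05, p_diff < 0.05)
```

A verdict of UNDETECTED corresponds to the first case, where the cover and stego statistics overlap so heavily that the test cannot reject the hypothesis that they come from the same distribution.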
Three information-theoretic metrics were computed between each cover/stego pair to quantify pixel-level distortion.
PSNR (Peak Signal-to-Noise Ratio): Measures pixel-level fidelity. PSNR > 50 dB is considered imperceptible to human vision.
SSIM (Structural Similarity Index): Measures structural degradation. SSIM > 0.999 indicates no perceivable structural change.
KL Divergence: Measures the statistical distance between the pixel-value probability distributions of the cover and stego images. KL < 0.001 indicates the distributions are effectively identical.
| Metric | Standard | Adaptive | Matrix | Adaptive+Matrix | Threshold |
|---|---|---|---|---|---|
| PSNR | 72.28 dB | 72.30 dB | 74.20 dB | 74.23 dB | > 50 dB |
| SSIM | 0.999998 | 0.999998 | 0.999999 | 0.999999 | > 0.999 |
| KL Divergence | 7.04×10⁻⁶ | 8.30×10⁻⁶ | 4.09×10⁻⁶ | 4.52×10⁻⁶ | < 0.001 |
| Verdict | IMPERCEPTIBLE | IMPERCEPTIBLE | IMPERCEPTIBLE | IMPERCEPTIBLE | — |
All metrics clear their imperceptibility thresholds by a wide margin. Matrix Embedding mode achieves the highest PSNR (74+ dB) and the lowest KL divergence because it modifies fewer pixels per embedded bit.
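The KL divergence between cover and stego pixel histograms can be computed as sketched below. The logarithm base (2, giving bits) and the small floor applied to empty stego bins are assumptions of this example; this page does not say which conventions the benchmark script uses.

```python
import math


def histogram(pixels, bins: int = 256) -> list:
    """Count occurrences of each pixel value."""
    counts = [0] * bins
    for value in pixels:
        counts[value] += 1
    return counts


def kl_divergence(cover_counts, stego_counts, eps: float = 1e-12) -> float:
    """D_KL(cover || stego) in bits, with empty stego bins floored at eps."""
    cover_total = sum(cover_counts)
    stego_total = sum(stego_counts)
    divergence = 0.0
    for c, s in zip(cover_counts, stego_counts):
        p = c / cover_total
        q = max(s / stego_total, eps)
        if p > 0:                      # 0 * log(0 / q) is taken as 0
            divergence += p * math.log2(p / q)
    return divergence


cover = [0, 0, 1, 1]
stego = [0, 1, 1, 1]
same = kl_divergence(histogram(cover), histogram(cover))
different = kl_divergence(histogram(cover), histogram(stego))
print(same, round(different, 4))
```

Identical histograms give exactly 0, and the toy pair above gives about 0.2075 bits, many orders of magnitude larger than the values reported in the table.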
Feature extraction: SRM-like (Spatial Rich Model) features were extracted from each image, including first-order and second-order residual histograms (horizontal and vertical), Laplacian residual histograms, LSB statistics, and PoV ratios — yielding a high-dimensional feature vector per image.
Classifiers:
- Random Forest: 200 trees, max depth 10
- Gradient Boosting: 100 trees, max depth 5
Cross-validation: 5-Fold GroupKFold, where the group key is the source image index. This guarantees that a cover image and its corresponding stego image are always in the same fold, preventing the classifier from memorizing image-specific textures rather than steganographic artifacts.
| Classifier | Standard | Adaptive | Matrix | Adaptive+Matrix |
|---|---|---|---|---|
| Random Forest | 50.0% ± 5.3% | 55.0% ± 14.5% | 50.0% ± 5.3% | 48.3% ± 6.2% |
| Gradient Boosting | 48.3% ± 9.7% | 51.7% ± 3.3% | 50.0% ± 5.3% | 55.0% ± 4.1% |
| Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
All accuracies lie close to 50% (the random-guessing baseline), indicating that these SRM-like features carry no exploitable signal for distinguishing StegX stego images from clean covers at this payload and sample size.
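The leakage control provided by grouped cross-validation can be illustrated with a small pure-Python splitter. It is a simplified round-robin stand-in for scikit-learn's `GroupKFold`, which assigns groups to folds differently; the function name and the toy dataset are assumptions.

```python
def group_k_fold(groups, n_splits: int):
    """Yield (train_indices, test_indices) such that no group spans both sides."""
    unique_groups = sorted(set(groups))
    if n_splits > len(unique_groups):
        raise ValueError("more splits than groups")
    for fold in range(n_splits):
        test_groups = set(unique_groups[fold::n_splits])   # round-robin assignment
        test = [i for i, g in enumerate(groups) if g in test_groups]
        train = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train, test


# 30 source images, each contributing one cover (label 0) and one stego (label 1).
groups = [image for image in range(30) for _ in (0, 1)]
labels = [label for _ in range(30) for label in (0, 1)]

for train, test in group_k_fold(groups, n_splits=5):
    overlap = {groups[i] for i in train} & {groups[i] for i in test}
    stego_in_test = sum(labels[i] for i in test)
    print(len(train), len(test), stego_in_test, len(overlap))
```

Every fold keeps a cover and its stego counterpart on the same side of the split, and each test fold stays balanced at 6 covers and 6 stego images.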
Model: SRNet (Deep Residual Network for Steganalysis), a purpose-built CNN architecture for spatial-domain image steganalysis. 11,511,234 trainable parameters.
Dataset: 500 cover images + 500 stego images (Adaptive + Matrix Embedding mode). Split at the image level: 350 pairs for training, 75 pairs for validation, 75 pairs for testing. Zero overlap between splits was programmatically verified.
Training configuration:
| Parameter | Value |
|---|---|
| GPU | NVIDIA Tesla T4 (16GB VRAM) |
| Optimizer | Adam (lr = 2×10⁻⁴, weight decay = 10⁻⁵) |
| LR Schedule | StepLR (step=25, γ=0.5) |
| Epochs | 60 |
| Batch size | 16 |
| Loss | CrossEntropyLoss |
| Train augmentation | RandomCrop(256), RandomHorizontalFlip, RandomVerticalFlip |
| Test transform | CenterCrop(256) |
Results:
| Metric | Value | Interpretation |
|---|---|---|
| Test Accuracy | 50.0% | Equivalent to random guessing |
| AUC-ROC | 0.4981 | No discriminative power (0.5 = random) |
| Best Val Accuracy | 51.3% | No meaningful improvement over 60 epochs |
| Final Train Loss | 0.693 | = ln(2), the cross-entropy of predicting 0.5 for both classes |
| Val Accuracy Trend | Flat at 50% | Model failed to learn any steganographic signal |
| Data Leakage | 0 pairs | Image-level split verified |
Validation accuracy stayed at chance level across all 60 training epochs, peaking at 51.3%. The training loss settled at ln(2) ≈ 0.693, the cross-entropy of a binary classifier that assigns probability 0.5 to both classes, which means the network never fit even the training set better than a constant guess. In this experiment, a purpose-built deep-learning steganalysis architecture trained specifically on StegX output did not extract any distinguishing features from the embedded images.
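The ln(2) figure can be verified directly from the definition of binary cross-entropy. This short check uses hypothetical helper names and a toy balanced label set.

```python
import math


def binary_cross_entropy(p_stego: float, label: int) -> float:
    """Negative log-likelihood of the true label given the predicted P(stego)."""
    return -math.log(p_stego if label == 1 else 1.0 - p_stego)


def mean_loss(predictions, labels) -> float:
    """Average cross-entropy over a dataset."""
    losses = [binary_cross_entropy(p, y) for p, y in zip(predictions, labels)]
    return sum(losses) / len(losses)


labels = [0, 1] * 50                                   # balanced cover/stego labels
uninformed = mean_loss([0.5] * len(labels), labels)    # always predicts 50/50
informed = mean_loss([0.9 if y else 0.1 for y in labels], labels)
print(round(uninformed, 3), round(informed, 3))
```

A model that always outputs 0.5 scores ln(2) ≈ 0.693 regardless of the labels, while any model that has learned a real signal scores lower (about 0.105 in this toy case).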
Local statistical + ML tests (CPU, ~20 minutes):
pip install scikit-learn scipy numpy pillow
cd StegX
python tests/steganalysis/run_full_steganalysis.py --num-images 30 --modes standard adaptive matrix adaptive_matrix
CNN deep learning test (Colab GPU, ~2-3 hours):
- Open Google Colab
- Upload `tests/steganalysis/colab_cnn_steganalysis.py`
- Set Runtime → GPU (T4)
- Run All
| Category | Methods | Verdict |
|---|---|---|
| Classical Steganalysis | Chi-Square, RS Analysis, Sample Pair Analysis, Entropy Deviation | All UNDETECTED |
| Image Quality | PSNR, SSIM, KL Divergence | All IMPERCEPTIBLE / INDISTINGUISHABLE |
| Machine Learning | Random Forest + Gradient Boosting (SRM features, GroupKFold) | All ~50% accuracy (random) |
| Deep Learning | SRNet CNN (11.5M params, 60 epochs, T4 GPU) | 50.0% accuracy, AUC 0.498 |
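All 10 methods returned UNDETECTED, chance-level, or IMPERCEPTIBLE verdicts. The classical, image-quality, and machine-learning tests covered all 4 embedding modes, while the SRNet test covered the Adaptive + Matrix mode only. Under the tested conditions (payloads of 256 to 512 bytes in procedurally generated covers), StegX v2.0 output was not distinguishable from clean images by any of the classical, machine-learning, or deep-learning steganalysis techniques evaluated.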