
Test Cases and Validation

AyhamAsfoor edited this page May 7, 2026 · 6 revisions

🧪 Test Cases & Statistical Validation

StegX maintains a rigorous continuous integration test suite and formal statistical benchmarks to validate both functional correctness and steganographic invisibility. This page documents the testing methodology, the mathematical definitions behind each statistical test, and the concrete results.


1. Automated Test Suite

1.1 Test Categories

The pytest suite is organized into four directories:

| Directory | Scope | What It Validates |
| --- | --- | --- |
| tests/unit/ | Individual functions | KDF output correctness, GF(2⁸) arithmetic, header pack/unpack round-trips, compression codec identity, SecureBuffer zeroization |
| tests/integration/ | End-to-end pipeline | Full encode → decode cycle across PNG, BMP, TIFF, WebP; all embedding methods; all compression codecs |
| tests/security/ | Adversarial scenarios | HMAC corruption detection, wrong-password rejection, brute-force timing validation, panic destruction completeness |
| tests/system/ | CLI interface | Argument parsing, exit codes, stdin/stdout piping, shell completion validation |

1.2 Running the Suite

pip install -r requirements/dev.txt
python -m pytest tests/ -v --tb=short

1.3 Key Test Vectors

AEAD Forgery Detection: The test suite flips individual bits in the AEAD authentication tag and verifies that decrypt_data() raises AuthenticationFailure rather than returning corrupted plaintext. This exercises the ciphertext-integrity guarantee that resistance to adaptive Chosen Ciphertext Attacks (CCA2) depends on.

Deterministic Payload Recovery: Payloads of varying sizes (1 byte, 1 KB, 100 KB, 1 MB, 50 MB) are encoded with each combination of:

  • Embedding method: LSB Matching, LSB Replacement, Matrix Hamming
  • Compression codec: zstd, brotli, lzma, zlib, bz2, none
  • Cost map: Laplacian, HILL, disabled
  • Cipher mode: single (AES-GCM), dual (AES-GCM + ChaCha20)

After decoding, the output is verified byte-for-byte against the original using SHA-256 digest comparison.
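As a rough illustration of how large this test matrix is, the sketch below enumerates every combination listed above and shows a digest-based comparison. The identifier strings and helper names are illustrative assumptions, not the names used in the StegX test suite.

```python
import hashlib
import itertools

# Values taken from the lists above; the identifier spellings are assumptions.
PAYLOAD_SIZES = [1, 1024, 100 * 1024, 1024 * 1024, 50 * 1024 * 1024]
METHODS = ["lsb_matching", "lsb_replacement", "matrix_hamming"]
CODECS = ["zstd", "brotli", "lzma", "zlib", "bz2", "none"]
COST_MAPS = ["laplacian", "hill", "disabled"]
CIPHER_MODES = ["single", "dual"]


def test_matrix():
    """Return every (size, method, codec, cost_map, cipher_mode) combination."""
    return list(itertools.product(
        PAYLOAD_SIZES, METHODS, CODECS, COST_MAPS, CIPHER_MODES))


def same_payload(original: bytes, recovered: bytes) -> bool:
    """Compare two payloads by SHA-256 digest."""
    return hashlib.sha256(original).digest() == hashlib.sha256(recovered).digest()
```

With five payload sizes, the product covers 5 × 3 × 6 × 3 × 2 = 540 encode/decode cases.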

Argon2id Timing Validation: The test measures the wall-clock time of derive_master_key() with the default parameters and asserts that it exceeds a minimum threshold (e.g., 50ms), confirming that the memory-hard computation is actually being performed and not short-circuited.

Shamir Round-Trip: Random payloads are split into $N$ shares with threshold $K$. The test verifies:

  • Any $K$ shares reconstruct the original payload exactly.
  • Any $K-1$ shares fail to produce the correct payload (information-theoretic security).
  • Shares with inconsistent thresholds or duplicate x-coordinates are rejected.
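The properties above can be sketched with a minimal Shamir implementation. This is not the StegX code: it assumes a single integer secret and a prime field (2¹²⁷ − 1) for brevity, whereas the unit tests mention GF(2⁸) arithmetic. The function names are hypothetical.

```python
import secrets

PRIME = 2**127 - 1  # Mersenne prime; the field choice is an assumption of this sketch


def split(secret: int, n: int, k: int):
    """Split `secret` into n shares; any k of them reconstruct it."""
    if not (1 <= k <= n) or not (0 <= secret < PRIME):
        raise ValueError("need 1 <= k <= n and 0 <= secret < PRIME")
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, f(x)) for x in range(1, n + 1)]


def reconstruct(shares):
    """Lagrange-interpolate the polynomial at x = 0."""
    xs = [x for x, _ in shares]
    if len(set(xs)) != len(xs):
        raise ValueError("duplicate x-coordinates")
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Interpolating through only K − 1 points yields a lower-degree polynomial whose value at zero is unrelated to the secret, which is what the second test bullet checks.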

Panic Destruction: A stego image is created, then destroy_real_region_in_place() is called. The test verifies:

  • The real region's LSBs have been overwritten with random data.
  • The decoy region remains intact and extractable (in decoy mode).
  • The original stego file has been atomically replaced.
  • The shred command was invoked on the original (Linux only).

SecureBuffer Zeroization: A SecureBuffer is created with known key material. After .close(), the test reads the underlying bytearray and confirms every byte is zero.
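A minimal sketch of what such a check looks like, using a hypothetical stand-in class rather than the real SecureBuffer (whose internals are not shown on this page):

```python
class ZeroizingBuffer:
    """Hypothetical stand-in for SecureBuffer: keeps key bytes in a bytearray."""

    def __init__(self, data: bytes):
        self._buf = bytearray(data)

    def close(self) -> None:
        # Overwrite in place so the same memory is cleared,
        # rather than rebinding the attribute to a new object.
        for i in range(len(self._buf)):
            self._buf[i] = 0


def is_zeroized_after_close() -> bool:
    buf = ZeroizingBuffer(b"\x01\x02\x03\x04")
    raw = buf._buf          # keep a reference to the underlying storage
    buf.close()
    return len(raw) == 4 and all(b == 0 for b in raw)
```

Holding a reference to the underlying bytearray before calling close() is what lets the test observe the wipe.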


2. Formal Statistical Steganalysis

2.1 Chi-Square ($\chi^2$) Analysis

Purpose: Detect Pairs of Values (PoV) artifacts caused by LSB substitution.

Background: In a natural image, pixel values $2k$ and $2k+1$ (e.g., 120 and 121) occur with naturally varying frequencies. Classical LSB substitution forces these pairs toward equal frequency because it overwrites each LSB with a message bit, and the bits of an encrypted payload are 0 or 1 with equal probability regardless of the original value. This equalization is the PoV anomaly.

Mathematical Definition:

$$\chi^2 = \sum_{i=0}^{127} \frac{(n_{2i} - n_{2i+1})^2}{n_{2i} + n_{2i+1}}$$

where $n_v$ is the observed frequency of pixel value $v$ in a single color channel.

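With this statistic, a natural image, whose pair frequencies differ, yields a comparatively large $\chi^2$, while LSB substitution pushes each pair toward equal counts and drives every term of the sum toward zero. The classical attack therefore flags an image when its pairs are more equal than natural variation would explain. In the benchmark below, scores are read against the clean-image control: the closer a score sits to the baseline, the less evidence of embedding.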
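The statistic defined above can be computed directly from a 256-bin histogram of one color channel. The sketch below is a plain-Python illustration; the function name is hypothetical and empty pairs are skipped, which is an assumption about how zero counts should be handled.

```python
def pov_chi_square(hist):
    """Pairs-of-Values statistic for one channel; `hist` holds 256 counts."""
    if len(hist) != 256:
        raise ValueError("expected a 256-bin histogram")
    total = 0.0
    for i in range(128):
        even, odd = hist[2 * i], hist[2 * i + 1]
        if even + odd > 0:  # skip empty pairs to avoid dividing by zero
            total += (even - odd) ** 2 / (even + odd)
    return total
```

A histogram whose pairs are perfectly equalized gives 0, and any imbalance within a pair adds a positive term.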

StegX Benchmark Results:

| Tool | Embedding Method | $\chi^2$ Score | Detection |
| --- | --- | --- | --- |
| Steghide | Sequential LSB | 119,531.0 | Detected |
| StegX (standard) | Non-Linear LSB Matching | 4,209.3 | Borderline |
| StegX (--extreme) | Matrix Hamming | 1,187.78 | Undetected |
| Clean image (control) | None | 1,024.6 | Baseline |

Analysis: StegX with Matrix Embedding produces a $\chi^2$ value within the natural variance of unmodified images. This is because:

  1. Matrix Embedding leaves $\frac{1}{n+1}$ of blocks untouched and changes exactly 1 bit in each of the remaining $\frac{n}{n+1}$ (7/8 = 87.5% for $n = 7$).
  2. LSB Matching uses ±1 perturbation rather than forced replacement, avoiding the PoV artifact entirely.
  3. Adaptive cost-map filtering restricts embedding to high-texture regions where pixel value distributions are already noisy.
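To make the first point concrete, here is a minimal sketch of Hamming-based matrix embedding on a single block of 7 cover LSBs carrying 3 message bits. It flips a bit directly for simplicity (how StegX applies the change to the pixel is not shown here), and the function names are hypothetical.

```python
def syndrome(bits):
    """XOR of the 1-based positions holding a 1 (a value in 0..7 for 7 bits)."""
    s = 0
    for pos, bit in enumerate(bits, start=1):
        if bit:
            s ^= pos
    return s


def embed(bits, message):
    """Hide a 3-bit `message` (0..7) in 7 cover LSBs, changing at most one."""
    out = list(bits)
    flip = syndrome(out) ^ message
    if flip:                # flip == 0: the block already encodes the message
        out[flip - 1] ^= 1  # toggling position `flip` XORs it into the syndrome
    return out


def extract(bits):
    """The receiver recovers the message as the block's syndrome."""
    return syndrome(bits)
```

Across all 128 possible blocks and 8 possible messages, exactly 1 case in 8 needs no change, which is where the 7/8 figure comes from.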

2.2 Shannon Entropy ($H$)

Purpose: Detect regions of suspiciously uniform randomness that indicate encrypted data.

Mathematical Definition:

$$H(X) = -\sum_{i=0}^{255} P(x_i) \log_2 P(x_i)$$

where $P(x_i)$ is the probability of pixel value $x_i$ occurring in the analyzed region.

Properties:

  • A completely uniform distribution (all 256 values equally likely) yields $H_{\max} = \log_2(256) = 8.0$ bits/byte.
  • Natural images exhibit $H \in [6.5, 7.8]$ depending on texture complexity.
  • Encrypted ciphertext exhibits $H \approx 7.99$ — near-perfect randomness.

The Attack Vector: Forensic tools scan the image in sliding windows. If a localized region of a flat sky (expected $H \approx 4.0$) suddenly shows $H \approx 7.99$, that jump is strong evidence of embedded encrypted data.
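The entropy definition and the sliding-window scan can be sketched as follows. For brevity the window slides over a flat sequence of samples rather than a 2-D image region; the function names and that simplification are assumptions of this example.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Entropy in bits per symbol of a sequence of byte values (0..255)."""
    n = len(values)
    if n == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def windowed_entropy(values, window, step):
    """Entropy of each sliding window over a flat sequence of samples."""
    return [shannon_entropy(values[i:i + window])
            for i in range(0, len(values) - window + 1, step)]
```

A window whose entropy is far above that of its neighbours is the kind of outlier such a scan looks for.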

How StegX Defeats This:

StegX's Laplacian/HILL cost maps exclude flat regions entirely. Data is embedded only in high-texture areas where the natural entropy is already $H \geq 7.0$. The injection of $H \approx 7.99$ data into a region with $H \approx 7.5$ is statistically indistinguishable from natural sensor noise.

Furthermore, Matrix Embedding modifies so few bits (about 0.29 modifications per message bit, written $R_m$ here) that the overall entropy shift is negligible:

$$\Delta H \leq R_m \cdot \frac{1}{C} \approx \frac{0.29}{3.7 \times 10^6} \approx 7.8 \times 10^{-8} \text{ bits/pixel}$$

This is orders of magnitude below the measurement precision of any steganalysis tool.


2.3 Structural Similarity Index (SSIM)

Purpose: Quantify visual degradation between the original cover image and the stego image, accounting for human visual perception.

Mathematical Definition:

$$\text{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where:

  • $\mu_x, \mu_y$ are the mean pixel intensities of the original and stego image patches
  • $\sigma_x^2, \sigma_y^2$ are the variances
  • $\sigma_{xy}$ is the covariance
  • $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$ are stabilization constants ($L = 255$ for 8-bit images, $K_1 = 0.01$, $K_2 = 0.03$)
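As a sketch of the formula, the function below evaluates SSIM once over a whole pair of patches given as flat lists. Production implementations typically average SSIM over many local (often Gaussian-weighted) windows; the single-window form and the use of population variance are simplifying assumptions here.

```python
def global_ssim(x, y, L=255, k1=0.01, k2=0.03):
    """SSIM for one pair of equal-size patches given as flat lists of pixels."""
    if len(x) != len(y) or not x:
        raise ValueError("patches must be non-empty and the same size")
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```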

Interpretation:

  • $\text{SSIM} = 1.0$: Identical images
  • $\text{SSIM} \geq 0.99$: Visually indistinguishable
  • $\text{SSIM} < 0.95$: Noticeable artifacts

StegX Benchmark:

| Configuration | Payload Size | SSIM |
| --- | --- | --- |
| LSB Matching, Laplacian | 10 KB in 1920×1080 | 0.999987 |
| LSB Matching, Laplacian | 100 KB in 1920×1080 | 0.99994 |
| Matrix Hamming, HILL | 10 KB in 1920×1080 | 0.999998 |
| Matrix Hamming, HILL | 100 KB in 1920×1080 | 0.99997 |

All configurations maintain $\text{SSIM} > 0.9999$, confirming zero perceptible visual degradation.


2.4 Peak Signal-to-Noise Ratio (PSNR)

Purpose: Complementary metric to SSIM, measuring the ratio of maximum possible signal power to noise power.

Mathematical Definition:

$$\text{PSNR} = 10 \cdot \log_{10}\left(\frac{255^2}{\text{MSE}}\right) \text{ dB}$$

where MSE is the Mean Squared Error between original and stego images:

$$\text{MSE} = \frac{1}{W \cdot H \cdot C} \sum_{x,y,c} (I_{\text{original}}(x,y,c) - I_{\text{stego}}(x,y,c))^2$$

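Since LSB modifications change pixel values by at most ±1, each modified sample contributes at most $1^2$ to the sum, so the MSE is bounded by the fraction of samples modified. In this section $R_m$ denotes that per-sample fraction (not the per-message-bit rate of 0.29 used for the entropy estimate):

$$\text{MSE}_{\max} = R_m \cdot 1^2 = R_m$$

For Matrix Embedding at full capacity, 7/8 of the 7-sample blocks change one sample, so $R_m = \frac{7}{8} \cdot \frac{1}{7} = 0.125$ per sample: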

$$\text{PSNR}_{\min} = 10 \cdot \log_{10}\left(\frac{65{,}025}{0.125}\right) \approx 57.2 \text{ dB}$$

PSNR values above 50 dB are considered imperceptible. StegX consistently exceeds this threshold.
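The bound can be checked numerically with a short sketch. The helper names are hypothetical and images are represented as flat lists of 8-bit samples.

```python
import math


def mse(original, stego):
    """Mean squared error between two equal-length flat lists of samples."""
    if len(original) != len(stego) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((a - b) ** 2 for a, b in zip(original, stego)) / len(original)


def psnr(mse_value, peak=255):
    """PSNR in dB for a given MSE; identical images give infinity."""
    if mse_value == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse_value)
```

Calling psnr(0.125) reproduces the worst-case figure of roughly 57.2 dB derived above.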


3. Resistance to Automated Steganalysis Tools

| Tool | Technique | Result Against StegX |
| --- | --- | --- |
| stegseek | Brute-force + Steghide format detection | Fails completely (StegX uses a different container format and Argon2id KDF) |
| zsteg | LSB analysis, PoV detection, entropy scanning | No patterns found (Non-Linear embedding + adaptive filtering) |
| binwalk | Signature scanning, entropy analysis | Clean output (encrypted data has no recognizable signatures) |
| exiftool | Metadata inspection | Metadata clean (StegX strips all EXIF/PNG metadata on save) |

4. Comparative Brute-Force Resistance

4.1 Methodology

A stego image is created with a known 8-character password. A brute-force attack is simulated by calling derive_master_key() in a loop with random passwords and measuring throughput.

4.2 Results

| Tool | KDF | Iterations/Memory | Passwords/sec (single core) | Time to exhaust 8-char alphanumeric keyspace ($62^8 \approx 2.18 \times 10^{14}$) |
| --- | --- | --- | --- | --- |
| Steghide | MD5 | 1 | 20,000,000+ | < 127 days |
| OpenStego | PBKDF2-SHA256 | 1,000 | 500,000+ | < 14 years |
| StegX (PBKDF2 mode) | PBKDF2-SHA256 | 600,000 | ~830 | ~8,300 years |
| StegX (default) | Argon2id | t=3, m=64MB | ~9 | ~$7.7 \times 10^{5}$ years |

The Argon2id configuration makes dictionary attacks and even targeted brute-force attacks computationally infeasible against passwords with reasonable entropy.
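Crack-time estimates of this kind follow from dividing the keyspace by the measured throughput. The sketch below assumes a 62-symbol alphabet (a-z, A-Z, 0-9), an exhaustive single-core search, and a 365.25-day year; a different alphabet, or counting only the expected half of the keyspace, changes the result substantially.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def keyspace(alphabet_size: int, length: int) -> int:
    """Number of candidate passwords of exactly `length` symbols."""
    return alphabet_size ** length


def years_to_exhaust(guesses_per_second: float,
                     alphabet_size: int = 62,
                     length: int = 8) -> float:
    """Years needed to try every candidate at the given single-core rate."""
    return keyspace(alphabet_size, length) / guesses_per_second / SECONDS_PER_YEAR
```

For example, years_to_exhaust(830) gives the estimate for the PBKDF2 throughput in the table and years_to_exhaust(9) the one for Argon2id.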

5. Advanced Steganalysis Resistance (Statistical + ML + CNN)

5.1 Methodology

A comprehensive steganalysis resistance evaluation was conducted using 10 independent detection methods spanning three categories: classical statistical attacks, image-quality and distributional similarity measures, and machine-learning/deep-learning classifiers.

Test Environment:

| Component | Local (CPU) | Cloud (GPU) |
| --- | --- | --- |
| Hardware | Intel CPU, 16GB RAM | Google Colab, Tesla T4 16GB VRAM |
| Dataset | 30 image pairs per mode | 500 image pairs |
| Payload | 512 bytes random binary | 256 bytes random binary |
| Image size | 512×512 RGB PNG | 256×256 RGB PNG |
| Script | tests/steganalysis/run_full_steganalysis.py | tests/steganalysis/colab_cnn_steganalysis.py |

Embedding modes tested:

  • Standard: LSB matching (±1) with PRNG-shuffled pixel positions
  • Adaptive: Laplacian cost-map filtered embedding (high-edge regions only)
  • Matrix: F5-style Hamming(7,3) matrix embedding (0.29 modifications per message bit)
  • Adaptive + Matrix: Combined mode (strongest configuration)

Data integrity controls:

  • All cover images are procedurally generated with controlled randomness (seeded PRNG) for reproducibility
  • Cover/stego pairs from the same source image are never split across train/test partitions (GroupKFold for ML, image-level indexing for CNN), eliminating data leakage
  • Balanced dataset: 50% cover, 50% stego in every experiment
  • Test set contains only unseen images — no image appears in both training and evaluation

5.2 Classical Steganalysis Attacks

Four classical statistical attacks were applied independently to each cover and stego image. Detection significance was assessed via two-sided Mann-Whitney U tests comparing the distributions of cover statistics against stego statistics. A p-value above 0.05 indicates no statistically significant difference (undetected).

Chi-Square (χ²) Analysis: Tests whether the frequencies within each Pair of Values (PoV) have been equalized. Classic LSB replacement produces this equalization; LSB matching (±1) moves values across pair boundaries and does not.

RS Analysis: Classifies pixel blocks into Regular, Singular, and Unusable groups under positive and negative flipping masks. A discrepancy between R and S group counts reveals hidden data.

Sample Pair Analysis (SPA): Estimates embedding rate by counting close pixel pairs (|p₁ − p₂| ≤ 1) and comparing observed ratios to theoretical baselines.

Shannon Entropy Deviation: Computes per-channel Shannon entropy H = −Σ pᵢ log₂(pᵢ) and measures the absolute difference between cover and stego.

| Test | Metric | Standard | Adaptive | Matrix | Adaptive+Matrix |
| --- | --- | --- | --- | --- | --- |
| Chi-Square | p-value | 1.000 | 1.000 | 1.000 | 1.000 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
| RS Analysis | p-value | 0.959 | 0.751 | 0.994 | 0.784 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
| Sample Pair | p-value | 0.888 | 0.871 | 0.906 | 0.929 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |
| Entropy | Δ | 0.000264 | 0.000249 | 0.000166 | 0.000150 |
| | Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |

All p-values are far above the 0.05 significance threshold, so none of the tests separates cover statistics from stego statistics on this dataset. The chi-square p-value of 1.000 is consistent with LSB matching not producing the PoV equalization that classical detectors exploit.

5.3 Statistical Indistinguishability (Image Quality)

Three metrics were computed between each cover/stego pair to quantify pixel-level distortion: two image-quality measures (PSNR, SSIM) and one information-theoretic measure (KL divergence).

PSNR (Peak Signal-to-Noise Ratio): Measures pixel-level fidelity. PSNR > 50 dB is considered imperceptible to human vision.

SSIM (Structural Similarity Index): Measures structural degradation. SSIM > 0.999 indicates no perceivable structural change.

KL Divergence: Measures the statistical distance between the pixel-value probability distributions of the cover and stego images. KL < 0.001 indicates the distributions are effectively identical.
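A minimal sketch of the KL computation between two pixel-value histograms follows. The logarithm base (bits) and the small floor used for empty bins in the reference histogram are assumptions; the benchmark script may handle both differently.

```python
import math


def kl_divergence(p_hist, q_hist, eps=1e-12):
    """D_KL(P || Q) in bits between two histograms of equal length."""
    if len(p_hist) != len(q_hist):
        raise ValueError("histograms must have the same number of bins")
    p_total, q_total = sum(p_hist), sum(q_hist)
    divergence = 0.0
    for p_count, q_count in zip(p_hist, q_hist):
        p = p_count / p_total
        q = max(q_count / q_total, eps)  # floor avoids log(0) for empty Q bins
        if p > 0:                        # terms with p == 0 contribute nothing
            divergence += p * math.log2(p / q)
    return divergence
```

Identical histograms give exactly 0, and the value grows as the stego distribution drifts away from the cover distribution.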

| Metric | Standard | Adaptive | Matrix | Adaptive+Matrix | Threshold |
| --- | --- | --- | --- | --- | --- |
| PSNR | 72.28 dB | 72.30 dB | 74.20 dB | 74.23 dB | > 50 dB |
| SSIM | 0.999998 | 0.999998 | 0.999999 | 0.999999 | > 0.999 |
| KL Divergence | 7.04×10⁻⁶ | 8.30×10⁻⁶ | 4.09×10⁻⁶ | 4.52×10⁻⁶ | < 0.001 |
| Verdict | IMPERCEPTIBLE | IMPERCEPTIBLE | IMPERCEPTIBLE | IMPERCEPTIBLE | |

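All metrics clear their imperceptibility thresholds by a wide margin: PSNR is more than 20 dB above the 50 dB threshold, and KL divergence is more than two orders of magnitude below 0.001. Matrix Embedding mode achieves the highest PSNR (74+ dB) and lowest KL divergence due to its reduced per-bit modification rate.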

5.4 ML Classifier Resistance (SRM Features + GroupKFold)

Feature extraction: SRM-like (Spatial Rich Model) features were extracted from each image, including first-order and second-order residual histograms (horizontal and vertical), Laplacian residual histograms, LSB statistics, and PoV ratios — yielding a high-dimensional feature vector per image.

Classifiers:

  • Random Forest: 200 trees, max depth 10
  • Gradient Boosting: 100 trees, max depth 5

Cross-validation: 5-Fold GroupKFold, where the group key is the source image index. This guarantees that a cover image and its corresponding stego image are always in the same fold, preventing the classifier from memorizing image-specific textures rather than steganographic artifacts.
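The grouping idea can be illustrated without any ML library. The sketch below is a hand-rolled stand-in for a grouped K-fold splitter, not the scikit-learn implementation, and it assumes the dataset is laid out as cover_0, stego_0, cover_1, stego_1, and so on.

```python
def pair_groups(n_pairs):
    """Group labels for samples ordered cover_0, stego_0, cover_1, stego_1, ..."""
    return [i // 2 for i in range(2 * n_pairs)]


def group_folds(groups, n_splits=5):
    """Yield (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    for fold in range(n_splits):
        held_out = set(unique[fold::n_splits])  # every n_splits-th group
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx
```

With 30 pairs and 5 folds, each test fold holds 6 source images (12 samples), and a cover never lands in training while its stego twin is in the test fold.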

| Classifier | Standard | Adaptive | Matrix | Adaptive+Matrix |
| --- | --- | --- | --- | --- |
| Random Forest | 50.0% ± 5.3% | 55.0% ± 14.5% | 50.0% ± 5.3% | 48.3% ± 6.2% |
| Gradient Boosting | 48.3% ± 9.7% | 51.7% ± 3.3% | 50.0% ± 5.3% | 55.0% ± 4.1% |
| Verdict | UNDETECTED | UNDETECTED | UNDETECTED | UNDETECTED |

All accuracies cluster around 50% (the random-guessing baseline), indicating that on this dataset the SRM-like features carry no signal the classifiers could exploit to distinguish StegX stego images from clean covers.

5.5 CNN Deep Learning Resistance (SRNet on GPU)

Model: SRNet (Deep Residual Network for Steganalysis), a purpose-built CNN architecture for spatial-domain image steganalysis. 11,511,234 trainable parameters.

Dataset: 500 cover images + 500 stego images (Adaptive + Matrix Embedding mode). Split at the image level: 350 pairs for training, 75 pairs for validation, 75 pairs for testing. Zero overlap between splits was programmatically verified.

Training configuration:

| Parameter | Value |
| --- | --- |
| GPU | NVIDIA Tesla T4 (16GB VRAM) |
| Optimizer | Adam (lr = 2×10⁻⁴, weight decay = 10⁻⁵) |
| LR Schedule | StepLR (step=25, γ=0.5) |
| Epochs | 60 |
| Batch size | 16 |
| Loss | CrossEntropyLoss |
| Train augmentation | RandomCrop(256), RandomHorizontalFlip, RandomVerticalFlip |
| Test transform | CenterCrop(256) |

Results:

| Metric | Value | Interpretation |
| --- | --- | --- |
| Test Accuracy | 50.0% | Equivalent to random guessing |
| AUC-ROC | 0.4981 | No discriminative power (0.5 = random) |
| Best Val Accuracy | 51.3% | No meaningful improvement over 60 epochs |
| Final Train Loss | 0.693 | ≈ ln(2), the cross-entropy of a constant 50/50 prediction |
| Val Accuracy Trend | Flat near 50% | Model failed to learn any steganographic signal |
| Data Leakage | 0 pairs | Image-level split verified |

The validation accuracy stayed at or near 50% across all 60 training epochs, peaking at 51.3%. The training loss converged to ln(2) ≈ 0.693, which is the cross-entropy of a binary classifier that assigns probability 0.5 to each class, i.e. one that has learned nothing about the labels. These results indicate that a purpose-built deep learning steganalysis architecture, trained specifically on StegX output, was unable to extract distinguishing features from the embedded images under these test conditions.
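The ln(2) figure can be reproduced in a few lines. This is a plain-Python illustration of the cross-entropy formula for a two-class problem, not the training code; the function name is hypothetical.

```python
import math


def mean_cross_entropy(predicted_prob_of_true_class):
    """Average negative log-likelihood over a batch of predictions."""
    probs = list(predicted_prob_of_true_class)
    return -sum(math.log(p) for p in probs) / len(probs)


# A model that outputs 0.5 for every sample, whatever its label,
# scores exactly ln(2) per sample.
uninformed_loss = mean_cross_entropy([0.5] * 100)
```

Any model that has picked up a usable signal assigns the true class a probability above 0.5 on average, which pushes the loss below this value.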

5.6 Reproducing the Tests

Local statistical + ML tests (CPU, ~20 minutes):

pip install scikit-learn scipy numpy pillow
cd StegX
python tests/steganalysis/run_full_steganalysis.py --num-images 30 --modes standard adaptive matrix adaptive_matrix

CNN deep learning test (Colab GPU, ~2-3 hours):

  1. Open Google Colab
  2. Upload tests/steganalysis/colab_cnn_steganalysis.py
  3. Set Runtime → GPU (T4)
  4. Run All

5.7 Summary

| Category | Methods | Verdict |
| --- | --- | --- |
| Classical Steganalysis | Chi-Square, RS Analysis, Sample Pair Analysis, Entropy Deviation | All UNDETECTED |
| Image Quality | PSNR, SSIM, KL Divergence | All IMPERCEPTIBLE / INDISTINGUISHABLE |
| Machine Learning | Random Forest + Gradient Boosting (SRM features, GroupKFold) | All ~50% accuracy (random) |
| Deep Learning | SRNet CNN (11.5M params, 60 epochs, T4 GPU) | 50.0% accuracy, AUC 0.498 |

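All 10 detection methods returned UNDETECTED or IMPERCEPTIBLE verdicts in every mode they were run against (all four embedding modes for the classical, image-quality, and ML tests; Adaptive + Matrix only for SRNet). At the payload sizes tested, StegX v2.0 output was statistically indistinguishable from clean covers under classical, machine-learning, and deep-learning steganalysis.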