Practical guidance for implementing APTS Reporting requirements. Each section provides a brief implementation approach, key considerations, and common pitfalls.
Note: This guide is informative, not normative. Recommended defaults and example values are suggested starting points; the Reporting README contains the authoritative requirements. Where this guide and the README differ, the README governs.
Implementation: Attach raw technical evidence (logs, screenshots, payloads, network traces) to every finding. Clearly demarcate any AI-generated summaries or analysis as distinct from raw evidence.
Key Considerations:
- Evidence must be reproducible by the client and independent reviewers
- Storage should support secure chain-of-custody for forensic purposes
Common Pitfalls:
- Including only AI summaries without raw technical data
- Mixing evidence with unsubstantiated interpretations
Implementation: Re-execute the exploitation steps for all Critical and High findings before report finalization using automated workflows. Tag each finding as "Confirmed" or "Unconfirmed" with a reproduction timestamp and the methodology used.
Key Considerations:
- Automation must simulate human exploitation paths, not just tool-based detection
- Unconfirmed findings require escalation for manual review
Common Pitfalls:
- Assuming initial detection is sufficient without re-verification
- Automating reproduction without documenting the specific steps used
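As a concrete illustration, the tagging step above can be sketched as a small record type. This is a sketch only; the field names and the `tag_reproduction` helper are illustrative, not part of the requirement:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ReproductionRecord:
    """Outcome of one automated re-execution of a finding."""
    finding_id: str
    status: str         # "Confirmed" or "Unconfirmed"
    reproduced_at: str  # ISO-8601 UTC timestamp
    methodology: str    # exact steps or tool invocation that was replayed

def tag_reproduction(finding_id: str, success: bool, methodology: str) -> ReproductionRecord:
    """Tag a finding after re-execution; anything not reproduced stays
    "Unconfirmed" and should be escalated for manual review."""
    return ReproductionRecord(
        finding_id=finding_id,
        status="Confirmed" if success else "Unconfirmed",
        reproduced_at=datetime.now(timezone.utc).isoformat(),
        methodology=methodology,
    )
```

Persisting the methodology string alongside the status addresses the second pitfall: the reproduction steps travel with the verdict.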
Implementation: Assign 0-100% confidence scores based on measurable factors (evidence completeness, reproduction success, tool consistency, manual verification). Document the scoring model used in the assessment methodology.
Key Considerations:
- Confidence scoring must be independent from severity/impact ratings
- Scoring methodology should be transparently disclosed to clients
Common Pitfalls:
- Conflating confidence with severity or criticality
- Using subjective confidence scores without documented factors
Confidence Scoring Methodology:
A recommended confidence scoring approach uses weighted factors:
| Factor | Weight | Description |
|---|---|---|
| Evidence quality | 30% | Direct evidence (exploit success) scores 100%; indirect evidence (version banner match) scores 50%; inference only scores 20% |
| Independent confirmation | 25% | Confirmed by 2+ independent methods scores 100%; single method scores 50% |
| Environmental factors | 20% | Default configuration scores 100%; custom/hardened environment scores 60% |
| Historical accuracy | 15% | Platform's historical true positive rate for this vulnerability class |
| Recency | 10% | Evidence gathered within 1 hour scores 100%; decays to 50% after 24 hours |
Confidence = Sum(factor_score * weight)
Thresholds: Confirmed (>= 90%), High Confidence (70-89%), Medium Confidence (50-69%), Low Confidence (< 50%).
Findings below 50% confidence SHOULD be flagged as "Unconfirmed" and excluded from the executive summary unless the customer requests full disclosure.
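A minimal sketch of this scoring model in Python, assuming factor scores are expressed on a 0-100 scale; the weights and thresholds come from the table above, while the example factor values are hypothetical:

```python
# Weights from the factor table above; factor scores are on a 0-100 scale.
WEIGHTS = {
    "evidence_quality": 0.30,
    "independent_confirmation": 0.25,
    "environmental_factors": 0.20,
    "historical_accuracy": 0.15,
    "recency": 0.10,
}

def confidence(factor_scores: dict) -> float:
    """Confidence = Sum(factor_score * weight), yielding 0-100."""
    return sum(factor_scores[name] * w for name, w in WEIGHTS.items())

def confidence_label(score: float) -> str:
    """Thresholds from the methodology above."""
    if score >= 90:
        return "Confirmed"
    if score >= 70:
        return "High Confidence"
    if score >= 50:
        return "Medium Confidence"
    return "Low Confidence"

# Example: exploit succeeded, two independent methods, default config,
# strong historical accuracy, evidence under an hour old.
example = {
    "evidence_quality": 100,
    "independent_confirmation": 100,
    "environmental_factors": 100,
    "historical_accuracy": 90,
    "recency": 100,
}
score = confidence(example)  # ~98.5, labeled "Confirmed"
```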
Implementation: Maintain a cryptographically signed chain linking each finding to its discovery method, tool output, and operator actions. Use timestamped logs with digital signatures for forensic accountability.
Key Considerations:
- Provenance data must be immutable and timestamped
- Chain should include tool version, configuration, and execution parameters
Common Pitfalls:
- Losing tool output or discovery artifacts mid-assessment
- Failing to document who verified each finding and when
Implementation Aid: See the Evidence Package Manifest appendix for an illustrative machine-readable structure that links one finding to its raw artifacts, provenance events, and review state.
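One way to sketch the provenance chain is as hash-linked, timestamped entries. Note this sketch uses only a SHA-256 hash chain for tamper evidence; a production system would additionally apply real digital signatures (for example Ed25519 via a library such as `cryptography`), since the standard library provides none. The event field names are illustrative:

```python
import hashlib
import json
from datetime import datetime, timezone

def append_event(chain: list, event: dict) -> dict:
    """Append a timestamped provenance event, hash-linked to its predecessor.
    `event` would carry tool name/version, configuration, operator, and an
    output digest; a production system would also sign each entry."""
    prev_hash = chain[-1]["entry_hash"] if chain else "0" * 64
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": prev_hash,
        "event": event,
    }
    entry["entry_hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    chain.append(entry)
    return entry

def verify_chain(chain: list) -> bool:
    """Recompute every link; any tampering breaks all later hashes."""
    prev = "0" * 64
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if digest != entry["entry_hash"]:
            return False
        prev = entry["entry_hash"]
    return True
```

Because each entry hashes its predecessor, altering any event invalidates every subsequent link, which is what makes the chain useful for forensic accountability.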
Implementation: Bind all evidence to the discovery chain using SHA-256 hashing of raw artifacts. Provide hash values in the report with instructions for client-side verification against original evidence files.
Key Considerations:
- Hashing must cover complete artifacts (logs, captures, transcripts)
- Include hash computation methodology in technical appendix
Common Pitfalls:
- Hashing only summaries instead of raw data
- Failing to provide verification instructions for clients
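A sketch of the hashing step, streaming each complete raw artifact through SHA-256 rather than hashing any summary. The directory layout is an assumption; clients can independently verify each listed digest with a standard tool such as `sha256sum` against the original files:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the full raw artifact through SHA-256 (never a summary)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def evidence_manifest(evidence_dir: Path) -> dict:
    """Map each artifact filename to its digest for the technical appendix."""
    return {
        p.name: sha256_file(p)
        for p in sorted(evidence_dir.iterdir())
        if p.is_file()
    }
```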
Implementation: Document and disclose the methodology for identifying and filtering false positives. Include per-severity false positive rates measured across the assessment (for example, "2 of 50 Medium findings verified as false positives").
Key Considerations:
- FP rates should be calculated and reported per severity level
- Methodology must explain how false positives were identified
Common Pitfalls:
- Omitting FP metrics from reports entirely
- Reporting inflated accuracy without disclosing how false positives were determined
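The per-severity calculation itself is simple arithmetic. A sketch, assuming each finding has been reduced to a (severity, is_false_positive) pair after verification:

```python
from collections import Counter

def fp_rates(findings) -> dict:
    """Per-severity false positive rate: verified FPs divided by total
    findings at that severity."""
    totals, fps = Counter(), Counter()
    for severity, is_fp in findings:
        totals[severity] += 1
        if is_fp:
            fps[severity] += 1
    return {sev: fps[sev] / totals[sev] for sev in totals}

# The "2 of 50 Medium findings" example above:
sample = [("Medium", True)] * 2 + [("Medium", False)] * 48
rates = fp_rates(sample)  # {"Medium": 0.04}
```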
Implementation: Designate reviewers independent from the automated tool chain to manually reproduce a statistically significant sample of Critical findings (minimum 80% coverage).
Key Considerations:
- Sample size should be defensible (for example, 100% for <5 findings, 80% for larger sets)
- Validation should occur before report delivery to allow remediation
Common Pitfalls:
- Validating findings after initial report without time for remediation
- Using the same operator who ran tools for validation
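The sample-size rule suggested above (100% for fewer than 5 findings, otherwise a coverage fraction) can be sketched as follows; the `seed` parameter is an addition of this sketch, present only so auditors can reproduce the selection:

```python
import math
import random

def validation_sample(critical_ids, coverage=0.80, seed=None):
    """Select Critical findings for independent manual validation:
    all of them when there are fewer than 5, otherwise ceil(coverage * n)."""
    if len(critical_ids) < 5:
        return list(critical_ids)
    k = math.ceil(coverage * len(critical_ids))
    return random.Random(seed).sample(list(critical_ids), k)
```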
Implementation: Provide a coverage matrix mapping tested vulnerability classes to CWE/OWASP, clearly marking tested/excluded/partially-tested areas. Explain scope limitations and why certain vectors were excluded.
Key Considerations:
- Coverage should account for architecture constraints (for example, "client-side testing excluded per scope")
- Matrix should enable clients to assess residual risk from untested areas
Common Pitfalls:
- Implying complete coverage when only a subset of vulnerability classes were tested
- Failing to explain why certain CWEs were out-of-scope
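One possible machine-readable shape for the coverage matrix; the vulnerability classes, CWE/OWASP mappings, and notes below are illustrative examples, not a recommended taxonomy:

```python
# Status values: "tested", "partial", "excluded".
COVERAGE_MATRIX = [
    # (vulnerability class, CWE, OWASP Top 10 (2021), status, notes)
    ("SQL Injection", "CWE-89",  "A03:2021", "tested",   ""),
    ("Stored XSS",    "CWE-79",  "A03:2021", "tested",   ""),
    ("DOM XSS",       "CWE-79",  "A03:2021", "excluded", "client-side testing excluded per scope"),
    ("SSRF",          "CWE-918", "A10:2021", "partial",  "internal address ranges out of scope"),
]

def coverage_summary(matrix) -> dict:
    """Counts per status so clients can gauge residual risk at a glance."""
    summary = {"tested": 0, "partial": 0, "excluded": 0}
    for _cls, _cwe, _owasp, status, _notes in matrix:
        summary[status] += 1
    return summary
```

Carrying a notes column per row is what lets the report explain *why* a class was excluded rather than silently omitting it.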
Implementation: Document the methodology used to assess false negative risk. Publish per-vulnerability-class FN rate estimates based on control gaps, tool detection limits, and known blind spots.
Key Considerations:
- FN rates should be disclosed for each major vulnerability class
- Methodology should explain benchmarking approach and limitations
Common Pitfalls:
- Omitting FN estimates or providing only a single aggregate figure
- Failing to explain which vulnerability classes have higher FN risk
Implementation: Conduct quarterly benchmarks against controlled vulnerable environments (for example, DVWA, WebGoat) to validate detection accuracy. Document and trend results across tool versions and assessment methodologies.
Key Considerations:
- Benchmarks must use consistent, reproducible vulnerable targets
- Results should inform improvements to detection rules and methodologies
Common Pitfalls:
- Running benchmarks inconsistently or with varying targets
- Failing to correlate benchmark results with real assessment performance
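Because benchmark targets like DVWA have a known set of planted vulnerabilities, per-class detection rates and FN-rate estimates fall out directly from the run results. A sketch, with entirely hypothetical numbers:

```python
def benchmark_rates(results: dict) -> dict:
    """Per-class detection rate and FN-rate estimate from a controlled
    target whose planted vulnerabilities are known in advance.
    `results` maps vulnerability class -> (detected, planted)."""
    out = {}
    for vuln_class, (detected, planted) in results.items():
        detection = detected / planted
        out[vuln_class] = {
            "detection_rate": detection,
            "fn_rate_estimate": 1 - detection,
        }
    return out

# Hypothetical quarterly run against a known-vulnerable target:
run = {"SQL Injection": (9, 10), "Stored XSS": (7, 10), "IDOR": (4, 10)}
rates = benchmark_rates(run)  # IDOR FN-rate estimate is ~0.6
```

Trending these per-class figures across tool versions is what connects benchmarking (above) to the per-class FN disclosure requirement.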
Implementation: Provide a non-technical executive summary covering risk posture, overall findings distribution, coverage achieved, and key remediation priorities. Include context for risk-based decision making without requiring technical expertise.
Key Considerations:
- Summary should be actionable for C-level stakeholders
- Include trend data and comparison to industry baselines where available
Common Pitfalls:
- Writing executive summaries that are too technical
- Omitting business-context framing of findings
Implementation: For each finding, provide step-by-step remediation guidance mapped to effort categories (quick-fix, short-term, long-term). Prioritize findings by risk and implementability to guide resource allocation.
Key Considerations:
- Remediation steps should be specific to the assessed environment
- Effort estimates should account for organizational factors (for example, change control requirements)
Common Pitfalls:
- Generic remediation advice that does not address the specific context of the assessed environment
- Prioritization by severity alone without considering effort or business impact
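One way to combine risk with effort is a simple risk-per-effort ordering. The numeric weights below are illustrative placeholders, not calibrated values, and real prioritization would also fold in business impact:

```python
# Illustrative placeholder weights, not calibrated values.
SEVERITY_RISK = {"Critical": 10, "High": 7, "Medium": 4, "Low": 1}
EFFORT_COST = {"quick-fix": 1, "short-term": 3, "long-term": 9}

def remediation_order(findings):
    """Order findings by risk per unit of effort, so a quick-fix High
    can outrank a long-term Critical when resources are constrained.
    Each finding is (finding_id, severity, effort_category)."""
    return sorted(
        findings,
        key=lambda f: SEVERITY_RISK[f[1]] / EFFORT_COST[f[2]],
        reverse=True,
    )

plan = remediation_order([
    ("F-1", "Critical", "long-term"),  # 10/9, lowest ratio
    ("F-2", "High", "quick-fix"),      # 7/1, highest ratio
    ("F-3", "Medium", "short-term"),   # 4/3
])
# plan order: F-2, F-3, F-1
```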
Implementation: Document engagement timeline, any interruptions or scope changes, percentage of planned scope tested, and areas left untested or partially tested with reasons documented.
Key Considerations:
- Timeline should show actual compared to planned dates for phased assessments
- Scope changes must be documented with impact on coverage
Common Pitfalls:
- Omitting scope gaps from reports
- Failing to explain why certain areas were not completed
Implementation: For repeat clients, compare findings across engagements to identify new vulnerabilities, resolved issues, and persistent findings. Include trend metrics (closure rate, mean time to remediation, recurrence rate).
Key Considerations:
- Trend analysis should enable clients to assess remediation effectiveness
- Persistent findings may indicate systemic control issues requiring deeper investigation
Common Pitfalls:
- Treating each engagement as independent without cross-engagement analysis
- Failing to investigate reasons why findings persist across multiple assessments
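The three trend metrics can be computed directly from finding identifiers across engagements. In this sketch, "recurrence" is interpreted as the share of current findings already seen previously; that is one reasonable definition among several, and the function signature is illustrative:

```python
def trend_metrics(previous_ids: set, current_ids: set, remediation_days: list) -> dict:
    """Cross-engagement metrics for a repeat client.
    closure_rate: share of last engagement's findings no longer present.
    recurrence_rate: share of current findings also seen previously.
    mttr_days: mean time to remediation for findings closed in between."""
    closed = previous_ids - current_ids
    persistent = previous_ids & current_ids
    return {
        "closure_rate": len(closed) / len(previous_ids) if previous_ids else 0.0,
        "recurrence_rate": len(persistent) / len(current_ids) if current_ids else 0.0,
        "mttr_days": sum(remediation_days) / len(remediation_days) if remediation_days else None,
    }
```

Matching findings across engagements by a stable identifier (rather than by title) is the hard part in practice; the persistent set is also the natural input for investigating systemic control issues.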
Implementation: Maintain data fidelity through all post-assessment processes: deduplication logic, tenant isolation in multi-client environments, delivery tracking, and audit logs. Implement cryptographic signing for final deliverables.
Key Considerations:
- Deduplication must preserve original evidence and discovery context
- Delivery logs should provide proof of integrity to recipients
Common Pitfalls:
- Losing original discovery context during deduplication
- Failing to verify tenant isolation in multi-tenant assessment platforms
- Omitting tamper-evident controls from final report deliverables
Tier 1 (implement before any autonomous pentesting begins): RP-006 (false positive rate disclosure), RP-008 (vulnerability coverage disclosure), RP-011 (executive summary and risk overview).
Start with RP-011 (executive summary) and RP-008 (coverage disclosure). Customers need these in every report. Add RP-006 (false positive disclosure) to establish trust in findings.
Tier 2 (implement within first 3 engagements): RP-001 (evidence-based finding validation), RP-002 (automated reproduction of critical findings), RP-003 (confidence scoring), RP-004 (finding provenance chain), RP-005 (cryptographic evidence integrity), RP-009 (false negative rate disclosure), RP-012 (remediation guidance), RP-013 (engagement SLA compliance), RP-014 (trend analysis, SHOULD), RP-015 (downstream pipeline integrity, SHOULD).
Prioritize RP-001 and RP-002 (evidence and reproduction) first. Findings without evidence are worthless. Then add RP-003 and RP-004 (confidence scoring, provenance) for auditability, and RP-005 (cryptographic integrity) for tamper-proof evidence chains.
Tier 3 (implement based on assessment maturity): RP-007 (independent finding validation during assessment), RP-010 (detection effectiveness benchmarking, SHOULD).