fix: prevent cross-workspace fact leaks in chat citations#5

Closed
outbounder wants to merge 2 commits into main from 61-cross-workspaces-leaks
Conversation

@outbounder
Member

Summary

  • Enforce workspace boundary checks in chat citation hydration so only facts from the active workspace are returned to clients.
  • Prevent model-provided usedFacts IDs from leaking cross-workspace fact content through Fact.findById lookups.
  • Update SPEC chat flow docs to explicitly document server-side workspace validation for cited facts.

Test plan

  • Verify that chat.sendMessage now filters cited facts by fact.workspace_id === ctx.workspaceId.
  • Run lints for edited files via IDE diagnostics (chat.ts, docs/SPEC.md).
  • Optional manual QA in /chat: verify cited facts appear for current workspace and do not appear for foreign workspace IDs.
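The boundary check described in the summary can be sketched as follows. This is an illustrative sketch, not the PR's actual code: only Fact.findById, fact.workspace_id, and ctx.workspaceId come from this PR; the function name, the Fact shape, and the Map standing in for database lookups are assumptions.

```typescript
// Hypothetical sketch of the server-side check: hydrate only facts that
// belong to the active workspace, dropping any model-provided IDs that
// resolve to another workspace's data.
interface Fact {
  id: string;
  workspace_id: string;
  content: string;
}

// resolveCitedFacts is an illustrative name; the PR implements this logic
// inside chat.sendMessage, where Fact.findById performs the lookup.
function resolveCitedFacts(
  usedFactIds: string[],
  factsById: Map<string, Fact>, // stands in for Fact.findById lookups
  activeWorkspaceId: string,
): Fact[] {
  const cited: Fact[] = [];
  for (const id of usedFactIds) {
    const fact = factsById.get(id);
    // Silently skip unknown IDs and facts from foreign workspaces, so a
    // model-provided ID can never surface another workspace's content.
    if (fact && fact.workspace_id === activeWorkspaceId) {
      cited.push(fact);
    }
  }
  return cited;
}
```

The key design point is that the filter runs server-side after hydration, so the check holds even if the model fabricates or replays fact IDs from another workspace.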

Made with Cursor

Prevent chat responses from leaking cross-workspace fact content by validating every cited fact against the active workspace before returning it to the UI.

Made-with: Cursor
Ignore `.env.production` in git to prevent accidental commits of production database credentials from local development environments.

Made-with: Cursor
altras added a commit that referenced this pull request Mar 30, 2026
…dress blog critique

## Major Additions

### 1. MS MARCO Passage Ranking Benchmark
- bench_msmarco.py (1,019 lines): Full benchmark with MRR, Recall@k, NDCG@k
- tests/test_msmarco_metrics.py (537 lines): 34 comprehensive unit tests
- demos/demo_msmarco.py (324 lines): Interactive demo
- docs/MSMARCO_USAGE.md + MSMARCO_QUICKREF.md: Complete documentation
- examples/example_msmarco_usage.sh: 8 usage examples

### 2. Statistical Analysis Framework
- statistical_analysis.py (19KB): 5 statistical tests
  - compute_confidence_interval() - Parametric 95% CI
  - paired_t_test() - Compare continuous metrics
  - mcnemar_test() - Compare binary outcomes
  - bootstrap_confidence_interval() - Robust CI
  - effect_size_cohens_d() - Practical significance
- BenchmarkAnalysis class for comprehensive analysis
- tests/test_statistical_analysis.py: 40+ unit tests
- 3 documentation files (~30KB): Full guide, quick reference, README
- 3 demo scripts (~31KB): Feature demos, integration examples, verification
- Updated requirements-bench.txt with scipy>=1.11.0
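The bootstrap_confidence_interval() listed above is the most implementation-sensitive of the five tests, so a minimal sketch of the percentile-bootstrap idea may help. Note the sketch is in TypeScript for illustration while statistical_analysis.py is Python, and every name here (bootstrapCI, the seeded LCG) is ours, not the repo's.

```typescript
// Illustrative percentile-bootstrap confidence interval for a sample mean.
// A deterministic linear congruential generator replaces Math.random so
// runs are reproducible.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Percentile bootstrap: resample the data with replacement many times,
// then take the middle (1 - alpha) mass of the resampled means.
function bootstrapCI(
  data: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 42,
): [number, number] {
  const rand = lcg(seed);
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    const sample = data.map(() => data[Math.floor(rand() * data.length)]);
    means.push(mean(sample));
  }
  means.sort((a, b) => a - b);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.ceil((1 - alpha / 2) * resamples) - 1];
  return [lo, hi];
}
```

Unlike the parametric CI, this makes no normality assumption, which is why it is listed above as the "robust" option.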

### 3. HotpotQA Scale-Up to 500+ Questions
- Enhanced bench_hotpotqa.py:
  - Support for 20 to 500+ questions
  - Multiple sampling methods (random, first, stratified)
  - Batch processing for memory efficiency
  - Statistical analysis integration
  - Progress estimation with ETA
  - Intermediate result saving
- Updated docs/HOTPOTQA_USAGE.md with performance estimates
- docs/STATISTICAL_ANALYSIS_GUIDE.md: Statistical interpretation
- QUICK_REFERENCE.md: One-page command reference
- test_enhancements.py: Verification script
- examples/: run_statistical_benchmark.sh, cross_validation.sh

## Blog Post Critique Response

### 4. Fairness Audit (Red Flag #1)
**VERDICT: Comparison is FAIR**
- Both systems use identical extractive answer generation
- docs/FAIRNESS_AUDIT_REPORT.md (11.4 KB): Detailed analysis
- docs/FAIRNESS_FIX_PROPOSAL.md (20.6 KB): Architectural improvements
- docs/FAIRNESS_AUDIT_SUMMARY.md (4.4 KB): TL;DR

### 5. Revised Blog Post (Red Flags #2-10)
- docs/BLOG_POST_REVISED.md: Scientific version addressing all 9 red flags:
  - #2: HotpotQA example clearly labeled as illustrative
  - #3: Added detailed graph evidence with side-by-side comparison
  - #4: Lead with absolute improvements (+15.0pp not +50%)
  - #5: Added confidence intervals, p-values, Cohen's d, sample sizes
  - #6: Narrowed reindexing claim to specific systems
  - #7: Explicit freshness source of truth and success criteria
  - #8: Clarified latency measurement scope
  - #9: Moved RAGAS to Future Work, marked "(not yet implemented)"
  - #10: Removed marketing language, added Limitations section
- docs/BLOG_POST_CHANGES.md: Side-by-side audit trail

### 6. Comprehensive Methodology Documentation
- docs/METHODOLOGY.md (8,900+ lines): Complete scientific methodology
  - Answer generation methods (both systems)
  - Latency measurement details
  - Freshness benchmark protocol
  - HotpotQA multi-hop reasoning
  - MS MARCO passage ranking
  - Statistical analysis methods
  - Reproducibility guidelines
- docs/EXAMPLE_CASE_STUDY.md (1,200+ lines): Worked example
- docs/LIMITATIONS.md (1,600+ lines): Honest limitations, threats to validity
- docs/FAQ.md (1,500+ lines): 20+ questions with detailed answers
- docs/README.md: Documentation index

## Summary

- ~3,000 lines: MS MARCO benchmark (3rd dataset)
- ~95KB: Statistical analysis framework
- ~13,200 lines: Methodology documentation
- Enhanced HotpotQA to support 500+ questions
- All 10 blog post red flags addressed
- Production-ready, scientifically rigorous benchmark suite

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@altras altras force-pushed the 61-cross-workspaces-leaks branch from 5c34f7b to 76c5869 on March 30, 2026 at 16:23
altras commented Mar 30, 2026

Fixed in the benchmarking suite PR — commit 583a501 adds workspace isolation to REST API endpoints.

@altras altras closed this Mar 30, 2026
