130 changes: 130 additions & 0 deletions docs/development/third-party-integration-audit.md
@@ -0,0 +1,130 @@
# ScholarAIO Third-Party Integration Quality Audit

This document records the quality, reachability, and output validation status of the third-party integrations, APIs, CLIs, and optional toolchains supported by ScholarAIO.

Integrations are evaluated at the workflow boundary: CLI/skill entrypoints, provider implementations, setup diagnostics, output formatting, fallback behavior, and failure handling. A config test or a broadly named unit-test file is not enough evidence to mark an integration surface as `good`.

---

## 1. Quality Matrix

| Integration / Surface | Category | Status | Verification Path / Test Evidence | Observed Result / Config & Version Boundaries |
| :--- | :--- | :--- | :--- | :--- |
| **qt-web-extractor (HTTP & MCP)** | Web / Agent | **needs-cleanup** | `extract_web` / `tests/test_webtools_source.py` | Sanitized output successfully resolves table-cell code fence corruption. Boundaries: `webextract.transport` (HTTP/MCP), `webextract.base_url`, `webextract.mcp_url`, `webextract.api_key`. |
| **GUILessBingSearch** | Web / Agent | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **MinerU Local API** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **MinerU Cloud CLI** | Parsing | **good** | `test_mineru.py` | Handles `mineru-open-api` subprocess calls; enforces filename constraints safely. Boundaries: `ingest.mineru_api_key` / `MINERU_TOKEN`, `mineru-open-api` CLI package. |
| **Paper2Any MCP Sidecar** | Parsing/MCP | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Docling Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **PyMuPDF Fallback** | Parsing | **good** | `test_pdf_fallback.py` | Robust extraction fallback when default parser fails. Boundaries: `pymupdf` / `fitz` dependency presence. |
| **arXiv Search (Atom API)** | Discovery | **good** | `test_arxiv_source.py` | Atom XML parser is stable; query filters match client expectations. Boundaries: HTTP requests to `export.arxiv.org`. |
| **arXiv PDF Download** | Discovery | **good** | `test_arxiv_source.py` | Enforces `RATE_LIMIT_DELAY = 3.0` spacing in `batch_download`. Boundaries: `https://arxiv.org/pdf/` endpoint. |
| **OpenAlex Explore** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Crossref / Semantic Scholar** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Zotero SQLite Import** | Import/Export | **good** | `test_workspace.py` | Parsed SQLite columns correctly map to `PaperMetadata`. Boundaries: Local `zotero.sqlite` database schema layout. |
| **Zotero Web API** | Import/Export | **usable-with-caveats** | `fetch_zotero_api` / `import-zotero` | pyzotero retrieves metadata; linked/external attachments are skipped by design. Boundaries: `zotero.api_key`, `zotero.library_id`, `ZOTERO_API_KEY`, `ZOTERO_LIBRARY_ID`. |
| **EndNote / RIS** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **USPTO ODP / PPubs** | Patents | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **OpenAI-compatible Chat API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Anthropic Messages API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Google Gemini API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Zhipu API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **vLLM / Ollama Local** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Sentence-transformers Embeddings** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **FAISS Vector / BERTopic** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **MarkItDown Office Ingest** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Office PPTX / DOCX Libraries** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Mermaid / DOT Rendering** | Diagram | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Scientific Toolref (Quantum ESPRESSO, etc.)** | Toolref | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **AmberTools / PyMOL** | Scientific | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **rsync / SSH Backup** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Setup Diagnostics** | System | **good** | `test_setup.py` | Reports dependency presence and credential state in bilingual strings. Boundaries: CLI `scholaraio setup check` / `run_check` path. |
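
The Setup Diagnostics row reports dependency presence and credential state. The sketch below shows one way such checks can be written with the standard library only. The helper names (`check_dependency`, `check_cli`, `credential_state`) are hypothetical and are not taken from `setup.py`; only the `mineru-open-api` executable name and the `MINERU_TOKEN` variable come from the matrix above.

```python
import importlib.util
import os
import shutil


def check_dependency(module_name: str) -> bool:
    """Return True if a top-level module is importable, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def check_cli(executable: str) -> bool:
    """Return True if an executable is found on PATH."""
    return shutil.which(executable) is not None


def credential_state(env_var: str, environ=None) -> str:
    """Report whether a credential is present, without revealing its value."""
    env = os.environ if environ is None else environ
    return "set" if env.get(env_var) else "missing"


if __name__ == "__main__":
    # Hypothetical report layout; the real `run_check` output may differ.
    print("fitz module:", check_dependency("fitz"))
    print("mineru-open-api CLI:", check_cli("mineru-open-api"))
    print("MINERU_TOKEN:", credential_state("MINERU_TOKEN"))
```

Reporting only `set` or `missing` keeps secrets out of diagnostic output, which matters when users paste the report into an issue.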

---

## 2. Detailed Integration Audits (Workflow Boundary Analysis)

### 2.1 qt-web-extractor (HTTP & MCP)
* **CLI/Skill Entrypoint**:
    * CLI: `scholaraio webextract <url>` (implemented in `cmd_webextract` inside [web.py](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/interfaces/cli/web.py))
    * Skill: `.claude/skills/webextract`
* **Provider/Service Implementation Path**:
    * [webtools.py:extract_web](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/providers/webtools.py#L613-L673)
* **Setup Diagnostics**:
    * Tested via `scholaraio setup check`, which calls `_optional_webtool_detail` inside [setup.py](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/services/setup.py#L617-L665); that helper runs `check_webextract_service` to verify that the HTTP/MCP endpoint responds.
* **Output Quality & Validation**:
    * Outputs parsed GFM Markdown. `_clean_table_code_fences` protects output quality by collapsing malformed block-level code fences inside Wikipedia/infobox table cells into inline code, which fixes broken table rendering.
    * Verified via raw/cleaned fixtures: [wikipedia_infobox_bad.md](file:///c:/Users/hp/Desktop/Scholara_oss/tests/fixtures/wikipedia_infobox_bad.md) and [wikipedia_infobox_clean.md](file:///c:/Users/hp/Desktop/Scholara_oss/tests/fixtures/wikipedia_infobox_clean.md).
* **Fallback Behavior**:
    * Configured via `webextract.transport` (HTTP or MCP). With the HTTP transport, a failed connection produces a hint to switch to MCP or to run the setup checks.
* **Failure Handling**:
    * Unreachable HTTP endpoints raise `WebExtractServiceUnavailableError`, returning a clean user-facing hint with exit code `1`.
    * API/server errors raise `WebExtractError`, which surfaces as a warning or error message instead of a generic crash.
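
The failure-handling contract above can be sketched as a thin CLI wrapper that turns the two exception types into exit codes. This is an illustration only: the exception classes are redefined locally so the snippet runs on its own, the subclass relationship between them is an assumption, `run_webextract` is a hypothetical name, and the exit code for `WebExtractError` is assumed to be `1` (the audit only states it for the unavailable-service case).

```python
import sys


class WebExtractError(RuntimeError):
    """Stand-in for the provider's API/server error."""


class WebExtractServiceUnavailableError(WebExtractError):
    """Stand-in for the unreachable-endpoint error (hierarchy is assumed)."""


def run_webextract(url, extract, err=sys.stderr) -> int:
    """Call `extract(url)` and map known failures to an exit code."""
    try:
        result = extract(url)
    except WebExtractServiceUnavailableError as exc:
        # The specific subclass must be caught before its parent class.
        print(f"service unavailable: {exc}", file=err)
        print("hint: run `scholaraio setup check` or switch transport", file=err)
        return 1
    except WebExtractError as exc:
        print(f"extraction failed: {exc}", file=err)
        return 1  # assumed exit code
    print(result.get("text", ""))
    return 0
```

Passing the extractor in as a callable keeps the wrapper testable without a running qt-web-extractor service.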

### 2.2 MinerU Cloud CLI (`mineru-open-api`)
* **CLI/Skill Entrypoint**:
    * CLI: `scholaraio ingest <pdf>`, or the parser CLI entrypoint in `scholaraio/providers/mineru.py`.
    * Skill: `.claude/skills/ingest`
* **Provider/Service Implementation Path**:
    * [mineru.py:convert_pdf_cloud](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/providers/mineru.py#L702-L810)
* **Setup Diagnostics**:
    * Checked under `scholaraio setup check` via `_detect_mineru`, which verifies that `mineru-open-api` is on the system path (`shutil.which`) and reads the credential values.
* **Output Quality & Validation**:
    * Translates PDF structures to Markdown with images/formulas.
    * Sanitizes cloud upload filenames via `_cloud_safe_pdf_name` so that platform-specific characters cannot crash the extraction.
    * Handles chunk merging when a large PDF is parsed in multiple parts.
* **Fallback Behavior**:
    * When MinerU is missing or fails, ingestion falls back to the alternatives listed in the `pdf_fallback_order` configuration option (e.g. `["docling", "pymupdf"]`).
* **Failure Handling**:
    * Subprocess timeouts (`subprocess.TimeoutExpired`) are caught.
    * Non-zero return codes from `mineru-open-api` raise descriptive errors containing stderr output.
    * Retries use exponential backoff; the number of `attempts` is derived from `mineru_upload_retries`.
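
A minimal sketch of the two safeguards described above: filename sanitization before upload and a subprocess call retried with exponential backoff. Both functions are hypothetical stand-ins, not the code in `mineru.py`; the sanitization rule (conservative ASCII set) and the backoff schedule (`base_delay * 2 ** attempt`) are assumptions chosen for illustration.

```python
import re
import subprocess
import time


def cloud_safe_name(name: str) -> str:
    """Replace characters outside a conservative ASCII set (assumed rule)."""
    stem, dot, ext = name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = name, ""
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("._") or "document"
    return f"{safe}.{ext}" if ext else safe


def run_with_retries(cmd, *, attempts=3, base_delay=1.0, timeout=600.0,
                     sleep=time.sleep, run=subprocess.run):
    """Run `cmd`, retrying on timeout or non-zero exit with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            proc = run(cmd, capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired as exc:
            last_error = RuntimeError(f"timed out after {exc.timeout}s")
        else:
            if proc.returncode == 0:
                return proc.stdout
            # Keep stderr in the message so the failure is diagnosable.
            last_error = RuntimeError(
                f"exit code {proc.returncode}: {proc.stderr.strip()}"
            )
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise last_error
```

The `sleep` and `run` parameters exist only so the retry schedule can be tested without waiting or invoking the real CLI.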

### 2.3 PyMuPDF Fallback (`fitz`)
* **CLI/Skill Entrypoint**:
    * CLI: Invoked automatically during PDF ingestion when MinerU fails, or manually by setting `pdf_preferred_parser: pymupdf`.
* **Provider/Service Implementation Path**:
    * [pdf_fallback.py:run_pymupdf](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/providers/pdf_fallback.py#L142-L160)
* **Setup Diagnostics**:
    * Checked in `scholaraio setup check` via `_check_dep_group("fitz")`.
* **Output Quality & Validation**:
    * Extracts flat plaintext page by page, with page headers (`## Page N\n\n`). It lacks complex block structure but serves as a highly reliable baseline.
* **Fallback Behavior**:
    * Acts as the last resort in the fallback parser chain, since it has no model or server dependencies.
* **Failure Handling**:
    * Catches general exceptions and formats them into error messages, so a page-level crash or file read error is skipped without aborting the ingest pipeline.
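
The page-by-page output format and the skip-on-error behavior can be sketched as follows. `pages_to_markdown` is a hypothetical helper, not `run_pymupdf` itself; it accepts any iterable of objects exposing `get_text()`, which is the method PyMuPDF pages provide, so the snippet runs without `fitz` installed.

```python
import logging
from typing import Iterable, Protocol

logger = logging.getLogger(__name__)


class PageLike(Protocol):
    def get_text(self) -> str: ...


def pages_to_markdown(pages: Iterable[PageLike]) -> str:
    """Render each page under a `## Page N` header, skipping pages that fail."""
    chunks = []
    for number, page in enumerate(pages, start=1):
        try:
            text = page.get_text()
        except Exception as exc:  # broad on purpose: one bad page must not abort
            logger.warning("skipping page %d: %s", number, exc)
            continue
        chunks.append(f"## Page {number}\n\n{text.strip()}\n")
    return "\n".join(chunks)
```

With PyMuPDF available, the document object returned by `fitz.open(path)` can be passed directly, since iterating it yields pages. Page numbers keep their original positions, so a skipped page leaves a visible gap in the numbering.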

### 2.4 arXiv Search & PDF Download
* **CLI/Skill Entrypoint**:
    * CLI: `scholaraio search --arxiv` (runs `cmd_search` inside [search.py](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/interfaces/cli/search.py)) and `scholaraio paper fetch` to retrieve PDFs.
    * Skill: `.claude/skills/search`, `.claude/skills/paper-guided-reading`
* **Provider/Service Implementation Path**:
    * [arxiv.py](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/providers/arxiv.py) (`_query_arxiv_api`, `download_arxiv_pdf`, and `batch_download`).
* **Setup Diagnostics**:
    * Setup checks verify the internet connection and the reachability of the arXiv query export endpoints.
* **Output Quality & Validation**:
    * Parses response XML via `defusedxml.ElementTree` to prevent XML External Entity (XXE) vulnerabilities, mapping properties directly to `ArxivPaper` dataclasses.
    * Filters results client-side (`_filter_search_results`) on the author, title, and abstract fields to tighten the results returned by arXiv's loose matching API.
* **Fallback Behavior**:
    * If the arXiv endpoint is offline, logs a standard warning and returns empty results rather than crashing.
* **Failure Handling**:
    * A requests session is mounted with a custom `urllib3` retry adapter to handle transient `429`, `502`, `503`, and `504` status codes automatically.
    * Enforces a polite rate-limit delay of `RATE_LIMIT_DELAY = 3.0` seconds between successive paper downloads in batch mode.
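
Two of the behaviors above, client-side filtering and rate-limited batch downloads, can be sketched with the standard library. The `ArxivPaper` fields, `filter_results`, and this `batch_download` signature are assumptions made for the example; only the `RATE_LIMIT_DELAY = 3.0` constant comes from the audit.

```python
import time
from dataclasses import dataclass, field

RATE_LIMIT_DELAY = 3.0  # seconds between downloads, as stated in the audit


@dataclass
class ArxivPaper:
    # Hypothetical field set; the real dataclass may differ.
    arxiv_id: str
    title: str
    authors: list = field(default_factory=list)
    abstract: str = ""


def filter_results(papers, *, title=None, author=None):
    """Keep papers whose fields contain the given substrings (case-insensitive)."""
    def matches(paper):
        if title and title.lower() not in paper.title.lower():
            return False
        if author and not any(author.lower() in a.lower() for a in paper.authors):
            return False
        return True
    return [p for p in papers if matches(p)]


def batch_download(ids, download, *, delay=RATE_LIMIT_DELAY, sleep=time.sleep):
    """Download each id in order, pausing `delay` seconds between requests."""
    results = {}
    for index, arxiv_id in enumerate(ids):
        if index > 0:
            sleep(delay)  # no pause before the first request
        try:
            results[arxiv_id] = download(arxiv_id)
        except Exception:
            results[arxiv_id] = None  # record the failure, keep going
    return results
```

Sleeping before every request except the first spaces out N downloads with N-1 pauses, so a single download is not slowed down at all.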

### 2.5 Zotero Integration (Web API & Local SQLite Import)
* **CLI/Skill Entrypoint**:
    * CLI: `scholaraio import-zotero` command (`cmd_import_zotero` inside [import_zotero.py](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/interfaces/cli/import_zotero.py)).
    * Skill: `.claude/skills/import-zotero`
* **Provider/Service Implementation Path**:
    * [zotero.py](file:///c:/Users/hp/Desktop/Scholara_oss/scholaraio/providers/zotero.py) (`fetch_zotero_api` for the cloud Web API, `parse_zotero_local` for local SQLite databases).
* **Setup Diagnostics**:
    * Checked in `setup.py` by verifying that Zotero API credentials are present.
* **Output Quality & Validation**:
    * Maps Zotero types (e.g. `journalArticle`, `preprint`) to standard `PaperMetadata` types.
    * Locates the corresponding PDF attachments and copies them into the import directory.
* **Fallback Behavior**:
    * Supports local SQLite database import via `--local <path/to/sqlite>` if API keys are missing or the API is unreachable.
    * Skips unresolvable attachments/links instead of failing the import.
* **Failure Handling**:
    * Catches `ImportError` on `pyzotero` and prompts users to install the optional dependency.
    * Attachment download failures are caught per item: a warning is logged and parsing continues with the rest of the collection.
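
The type mapping and per-item failure handling described above might look like the sketch below. `ZOTERO_TYPE_MAP`, its target values, `map_item_type`, and `import_items` are all hypothetical; the real mapping to `PaperMetadata` types lives in `zotero.py` and may use different names and values. `itemType` is the field name Zotero itself uses for an item's type.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical target values, chosen only for illustration.
ZOTERO_TYPE_MAP = {
    "journalArticle": "journal-article",
    "preprint": "preprint",
    "conferencePaper": "conference-paper",
}


def map_item_type(zotero_type: str) -> str:
    """Map a Zotero item type to a metadata type, defaulting to 'other'."""
    return ZOTERO_TYPE_MAP.get(zotero_type, "other")


def import_items(items, fetch_attachment):
    """Convert items to records; an attachment failure never aborts the run."""
    imported = []
    for item in items:
        record = {
            "title": item.get("title", ""),
            "paper_type": map_item_type(item.get("itemType", "")),
        }
        try:
            record["pdf"] = fetch_attachment(item)
        except Exception as exc:  # per-item guard: log and continue
            logger.warning("attachment skipped for %r: %s", record["title"], exc)
            record["pdf"] = None
        imported.append(record)
    return imported
```

Keeping the metadata record even when its attachment fails means a flaky download costs one PDF rather than the whole collection.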
94 changes: 71 additions & 23 deletions scholaraio/providers/webtools.py
@@ -572,6 +572,48 @@ def _extract_web_mcp(url: str, *, cfg: Config | None, timeout: float) -> dict:
     }
 
 
+def _clean_table_code_fences(text: str) -> str:
+    """Sanitize Markdown table cells that contain block-level code blocks/fences.
+
+    Transforms:
+        | Col | ```\nval\n``` |
+    Into:
+        | Col | `val` |
+    """
+    if not text:
+        return ""
+
+    # Pattern to match a code block inside a table cell (bounded by pipes)
+    pattern = re.compile(
+        r"\|([^|]*?)```(?:[a-zA-Z0-9_-]*)\n(.*?)\n\s*```([^|]*?)\|",
+        re.DOTALL
+    )
+
+    def replace_match(match):
+        full_match = match.group(0)
+        if re.search(r"\n\s*\n", full_match):
+            return full_match
+
+        before = match.group(1).replace("\n", " ").strip()
+        code_content = match.group(2).replace("\n", " ").strip()
+        after = match.group(3).replace("\n", " ").strip()
+
+        # Format the code content as inline code
+        inline_code = f"`{code_content}`" if code_content else ""
+
+        # Assemble the cleaned cell components
+        parts = [p for p in (before, inline_code, after) if p]
+        cleaned_cell = " " + " ".join(parts) + " "
+        return f"|{cleaned_cell}|"
+
+    cleaned = text
+    prev = ""
+    while cleaned != prev:
+        prev = cleaned
+        cleaned = pattern.sub(replace_match, cleaned)
+    return cleaned
+
+
 def extract_web(
     url: str,
     *,
@@ -600,33 +642,39 @@ def extract_web(
     """
     transport = _get_webextract_transport(cfg)
     if transport == "mcp":
-        return _extract_web_mcp(url, cfg=cfg, timeout=timeout)
-    if transport != "http":
-        raise WebExtractError(f"未知 webextract transport: {transport}")
+        res = _extract_web_mcp(url, cfg=cfg, timeout=timeout)
+    else:
+        if transport != "http":
+            raise WebExtractError(f"未知 webextract transport: {transport}")
 
-    base_url = _get_webextract_base_url(cfg)
-    if not check_webextract_service(cfg, timeout=3.0):
-        raise WebExtractServiceUnavailableError(
-            f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行"
+        base_url = _get_webextract_base_url(cfg)
+        if not check_webextract_service(cfg, timeout=3.0):
+            raise WebExtractServiceUnavailableError(
+                f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行"
+            )
+
+        body: dict[str, object] = {"url": url}
+        if pdf is not None:
+            body["pdf"] = pdf
+        if include_html:
+            body["include_html"] = include_html
+
+        api_key = _get_webextract_api_key(cfg) or ""
+        req = Request(
+            f"{base_url}/extract",
+            data=json.dumps(body).encode("utf-8"),
+            headers=_headers(api_key),
+            method="POST",
         )
+        try:
+            res = _load_json_response(req, timeout=int(timeout), error_prefix="提取失败")
+        except RuntimeError as e:
+            raise WebExtractError(str(e)) from e
 
-    body: dict[str, object] = {"url": url}
-    if pdf is not None:
-        body["pdf"] = pdf
-    if include_html:
-        body["include_html"] = include_html
+    if isinstance(res, dict) and "text" in res and res["text"]:
+        res["text"] = _clean_table_code_fences(res["text"])
 
-    api_key = _get_webextract_api_key(cfg) or ""
-    req = Request(
-        f"{base_url}/extract",
-        data=json.dumps(body).encode("utf-8"),
-        headers=_headers(api_key),
-        method="POST",
-    )
-    try:
-        return _load_json_response(req, timeout=int(timeout), error_prefix="提取失败")
-    except RuntimeError as e:
-        raise WebExtractError(str(e)) from e
+    return res
 
 
 def extract_and_display(
10 changes: 10 additions & 0 deletions tests/fixtures/wikipedia_infobox_bad.md
@@ -0,0 +1,10 @@
| 性别 | 男 |
| 出生 | ```
1902年8月28日
``` |
| 逝世 | ```
1993年11月24日
``` |
| 国籍 | ```
中华人民共和国
``` |
4 changes: 4 additions & 0 deletions tests/fixtures/wikipedia_infobox_clean.md
@@ -0,0 +1,4 @@
| 性别 | 男 |
| 出生 | `1902年8月28日` |
| 逝世 | `1993年11月24日` |
| 国籍 | `中华人民共和国` |
72 changes: 72 additions & 0 deletions tests/test_webtools_source.py
@@ -781,3 +781,75 @@ def fake_urlopen(req, timeout=0):
         assert result["title"] == "Page"
         captured = capsys.readouterr()
         assert "markdown body" in captured.out
+
+    def test_clean_table_code_fences_with_fixtures(self):
+        import pathlib
+        from scholaraio.providers.webtools import _clean_table_code_fences
+
+        fixtures_dir = pathlib.Path(__file__).parent / "fixtures"
+        bad_path = fixtures_dir / "wikipedia_infobox_bad.md"
+        clean_path = fixtures_dir / "wikipedia_infobox_clean.md"
+
+        assert bad_path.exists()
+        assert clean_path.exists()
+
+        bad_text = bad_path.read_text(encoding="utf-8")
+        expected_clean_text = clean_path.read_text(encoding="utf-8")
+
+        cleaned_text = _clean_table_code_fences(bad_text)
+        assert cleaned_text.strip() == expected_clean_text.strip()
+
+    def test_clean_table_code_fences_ignores_normal_structures(self):
+        from scholaraio.providers.webtools import _clean_table_code_fences
+
+        # Test normal code block outside table should not be changed
+        normal_code = (
+            "Here is a code snippet:\n"
+            "```python\n"
+            "def test():\n"
+            " return True\n"
+            "```\n"
+            "And here is normal text."
+        )
+        assert _clean_table_code_fences(normal_code) == normal_code
+
+        # Test normal table with inline code should not be changed
+        normal_table = (
+            "| Column 1 | Column 2 |\n"
+            "| --- | --- |\n"
+            "| `inline code` | value |\n"
+        )
+        assert _clean_table_code_fences(normal_table) == normal_table
+
+        # Test standalone code block between tables should not be changed
+        standalone_between_tables = (
+            "| A | B |\n"
+            "| --- | --- |\n"
+            "| one | two |\n\n"
+            "```python\n"
+            "print(1)\n"
+            "```\n\n"
+            "| C | D |\n"
+            "| --- | --- |\n"
+            "| three | four |\n"
+        )
+        assert _clean_table_code_fences(standalone_between_tables) == standalone_between_tables
+
+    def test_extract_web_applies_cleanup_http(self, monkeypatch):
+        # Verify that HTTP path runs the clean helper
+        def fake_urlopen(req, timeout=0):
+            return _FakeResponse({
+                "title": "Page",
+                "text": "| 性别 |\n| 出生 | ```\n1902\n``` |"
+            })
+
+        def fake_check_service(cfg, timeout=3.0):
+            return True
+
+        monkeypatch.setattr("scholaraio.providers.webtools.urlopen", fake_urlopen)
+        monkeypatch.setattr("scholaraio.providers.webtools.check_webextract_service", fake_check_service)
+
+        from scholaraio.providers.webtools import extract_web
+
+        res = extract_web("https://example.com")
+        assert res["text"] == "| 性别 |\n| 出生 | `1902` |"
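
The tests above cover single-line cell content. As a standalone illustration of how the same regular expression treats a multi-line, language-tagged fence, here is a condensed copy of the cleanup logic. `clean_cell_fences` is a hypothetical name for this copy; it reuses the pattern from the diff and is meant for experimentation, not as a replacement for `_clean_table_code_fences`.

```python
import re

# Same pattern as in the diff: a fenced block between two table pipes.
_CELL_FENCE = re.compile(
    r"\|([^|]*?)```(?:[a-zA-Z0-9_-]*)\n(.*?)\n\s*```([^|]*?)\|",
    re.DOTALL,
)


def _inline(match):
    """Collapse one fenced cell into inline code, unless it spans a blank line."""
    if re.search(r"\n\s*\n", match.group(0)):
        return match.group(0)  # a blank line means a standalone block; leave it
    before, code, after = (
        match.group(i).replace("\n", " ").strip() for i in (1, 2, 3)
    )
    parts = [p for p in (before, f"`{code}`" if code else "", after) if p]
    return "| " + " ".join(parts) + " |"


def clean_cell_fences(text: str) -> str:
    """Apply the substitution repeatedly until the text stops changing."""
    previous = None
    while text != previous:
        previous = text
        text = _CELL_FENCE.sub(_inline, text)
    return text


if __name__ == "__main__":
    sample = "| Key | ```python\na = 1\nb = 2\n``` |\n"
    print(clean_cell_fences(sample), end="")
    # prints: | Key | `a = 1 b = 2` |
```

Note that the newlines inside the fence are replaced by spaces, so multi-line code loses its line structure. That is acceptable for infobox values, but it would be lossy for real code samples placed in table cells.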