78 changes: 78 additions & 0 deletions docs/development/third-party-integration-audit.md
@@ -0,0 +1,78 @@
# ScholarAIO Third-Party Integration Quality Audit

This document records the quality, reachability, and output validation status of the third-party integrations, APIs, CLIs, and optional toolchains supported by ScholarAIO.

Integrations are evaluated at the workflow boundary: CLI/skill entrypoints, provider implementations, setup diagnostics, output formatting, fallback behavior, and failure handling. A config test or a broadly named unit-test file is not enough evidence to mark an integration surface as `good`.

Status is intentionally conservative:

- **good**: workflow-boundary evidence exists, including commands, representative output, and failure handling.
- **partially-reviewed**: code-level or fixture evidence exists, but live workflow evidence is still missing.
- **not-yet-reviewed**: inventory only; no quality claim is made.
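
The three status levels above can be read as a small decision rule. The sketch below is one possible encoding, not part of ScholarAIO: `Evidence`, its field names, and `classify` are hypothetical, and treating any partial evidence as `partially-reviewed` is an interpretation of the definitions rather than something the audit states.

```python
from dataclasses import dataclass, fields
from enum import Enum


class Status(str, Enum):
    GOOD = "good"
    PARTIALLY_REVIEWED = "partially-reviewed"
    NOT_YET_REVIEWED = "not-yet-reviewed"


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record for one integration surface."""

    commands_exercised: bool = False     # exact CLI/skill commands were run live
    representative_output: bool = False  # real output captured from that run
    failure_handling: bool = False       # failure modes observed at the boundary
    code_or_fixture_tests: bool = False  # unit or fixture coverage only


def classify(ev: Evidence) -> Status:
    """Map evidence to a status, staying conservative as the audit requires."""
    # `good` needs all three kinds of workflow-boundary evidence.
    if ev.commands_exercised and ev.representative_output and ev.failure_handling:
        return Status.GOOD
    # Assumption: any evidence short of that counts as partially reviewed.
    if any(getattr(ev, f.name) for f in fields(ev)):
        return Status.PARTIALLY_REVIEWED
    return Status.NOT_YET_REVIEWED


print(classify(Evidence(code_or_fixture_tests=True)).value)  # partially-reviewed
```

Under this reading, the qt-web-extractor row stays at `partially-reviewed` because only fixture-level evidence is recorded for it.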

---

## 1. Quality Matrix

| Integration / Surface | Category | Status | Verification Path / Test Evidence | Observed Result / Config & Version Boundaries |
| :--- | :--- | :--- | :--- | :--- |
| **qt-web-extractor (HTTP & MCP)** | Web / Agent | **partially-reviewed** | `extract_web`, `_clean_table_code_fences`, `tests/test_webtools_source.py`, fixture pair under `tests/fixtures/` | Sanitizer regression is covered for malformed table-cell code fences and adjacent standalone code blocks. Live daemon canary evidence is still required before this surface is promoted to `good`. Boundaries: `webextract.transport` (HTTP/MCP), `webextract.base_url`, `webextract.mcp_url`, `webextract.api_key`. |
| **GUILessBingSearch** | Web / Agent | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **MinerU Local API** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **MinerU Cloud CLI** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Paper2Any MCP Sidecar** | Parsing/MCP | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Docling Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **PyMuPDF Fallback** | Parsing | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **arXiv Search (Atom API)** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **arXiv PDF Download** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **OpenAlex Explore** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Crossref / Semantic Scholar** | Discovery | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Zotero SQLite Import** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Zotero Web API** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **EndNote / RIS** | Import/Export | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **USPTO ODP / PPubs** | Patents | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **OpenAI-compatible Chat API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Anthropic Messages API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Google Gemini API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Zhipu API** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **vLLM / Ollama Local** | LLM Backend | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Sentence-transformers Embeddings** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **FAISS Vector / BERTopic** | Vector/Embed | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **MarkItDown Office Ingest** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Office PPTX / DOCX Libraries** | Office/Output | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Mermaid / DOT Rendering** | Diagram | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Scientific Toolref (Quantum ESPRESSO, etc.)** | Toolref | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **AmberTools / PyMOL** | Scientific | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **rsync / SSH Backup** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. |
| **Setup Diagnostics** | System | **not-yet-reviewed** | N/A | Excluded from current triage phase. |

---

## 2. Current Reviewed Surface

### 2.1 qt-web-extractor (HTTP & MCP)
* **CLI/Skill Entrypoint**:
  * CLI: `scholaraio webextract <url>` (implemented in `cmd_webextract` inside [web.py](../../scholaraio/interfaces/cli/web.py))
  * Skill: `.claude/skills/webextract`
* **Provider/Service Implementation Path**:
  * [webtools.py:extract_web](../../scholaraio/providers/webtools.py)
* **Setup Diagnostics**:
  * A diagnostic path exists through `scholaraio setup check`, which calls `_optional_webtool_detail` inside [setup.py](../../scholaraio/services/setup.py) and in turn runs `check_webextract_service` to verify that the HTTP/MCP endpoint responds. This PR does not include live daemon evidence from that path.
* **Output Quality & Validation**:
  * Output is GFM Markdown. `_clean_table_code_fences` sanitizes malformed block-level code fences inside Wikipedia/infobox table cells, which otherwise break table rendering.
  * Verified via unit and fixture coverage: [wikipedia_infobox_bad.md](../../tests/fixtures/wikipedia_infobox_bad.md), [wikipedia_infobox_clean.md](../../tests/fixtures/wikipedia_infobox_clean.md), and regression tests for standalone fenced code blocks near table rows or pipe-prefixed lines.
* **Fallback Behavior**:
  * Configured via `webextract.transport` (HTTP or MCP). When the transport is HTTP, a failed connection triggers a hint pointing the user to MCP or to the setup checks.
* **Failure Handling**:
  * Unreachable HTTP endpoints raise `WebExtractServiceUnavailableError`; the CLI prints a clean user-facing hint and exits with code `1`.
  * API/server errors raise `WebExtractError`, so the user sees warnings or errors instead of a generic crash.
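
The table-cell sanitization described under Output Quality can be illustrated with a deliberately simplified sketch. It uses a single regular expression rather than the line-by-line state machine in this PR, covers only the basic case of one fenced block that opens on a table row and closes before the cell's trailing pipe, and `collapse_cell_fences` is a hypothetical name, not a ScholarAIO function.

```python
import re

# Built programmatically so this example can itself sit inside a Markdown fence.
FENCE = "`" * 3

# A row prefix ending in a pipe, an opening fence with optional language tag,
# the block body, and a closing fence followed by the cell's closing pipe.
_CELL_FENCE = re.compile(
    r"^(?P<prefix>\|[^\n]*\|[ \t]*)"
    + FENCE
    + r"[\w-]*\n(?P<body>.*?)\n"
    + FENCE
    + r"[ \t]*\|[ \t]*$",
    re.MULTILINE | re.DOTALL,
)


def collapse_cell_fences(markdown: str) -> str:
    """Rewrite fenced blocks inside table cells as inline code spans."""

    def _inline(match: re.Match) -> str:
        # Join the block's lines with single spaces so the row fits on one line.
        body = " ".join(match.group("body").split())
        return f"{match.group('prefix')}`{body}` |"

    return _CELL_FENCE.sub(_inline, markdown)


bad = "| Born | " + FENCE + "\n1902-08-28\n" + FENCE + " |\n"
print(collapse_cell_fences(bad))
```

Because the opening fence must share a line with the row prefix, a standalone fenced block that merely follows a table row is left untouched in this sketch, which is the same property the regression tests in this PR check for the real helper.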

## 3. Not-Yet-Reviewed Inventory

Rows marked `not-yet-reviewed` in the matrix are intentionally inventory-only. Promoting any of them to `partially-reviewed` or `good` should happen in a focused follow-up that includes:

- exact CLI command or skill workflow exercised;
- relevant config/version boundaries;
- representative success output;
- failure-mode behavior;
- targeted tests or reproducible smoke evidence.
183 changes: 160 additions & 23 deletions scholaraio/providers/webtools.py
@@ -572,6 +572,137 @@ def _extract_web_mcp(url: str, *, cfg: Config | None, timeout: float) -> dict:
}


def _clean_single_row(row_text: str) -> str:
    cells = row_text.split("|")
    cleaned_cells = []

    for i, cell in enumerate(cells):
        if i == 0 and not cell.strip():
            cleaned_cells.append(cell)
            continue
        if i == len(cells) - 1 and not cell.strip() and row_text.endswith("|"):
            cleaned_cells.append(cell)
            continue

        if "```" in cell:
            fence_count = cell.count("```")
            cell_to_clean = cell + "\n```" if fence_count % 2 != 0 else cell
            parts = cell_to_clean.split("```")
            cleaned_parts = []
            for j, part in enumerate(parts):
                if j % 2 == 0:
                    cleaned_parts.append(part.replace("\n", " "))
                else:
                    block = part
                    if block.startswith("\n"):
                        block = block[1:]
                    else:
                        block_lines = block.split("\n", 1)
                        if len(block_lines) > 1:
                            first_line = block_lines[0].strip()
                            if re.match(r"^[a-zA-Z0-9_-]+$", first_line):
                                block = block_lines[1]
                    block_clean = block.replace("\n", " ").strip()
                    if block_clean:
                        cleaned_parts.append(f"`{block_clean}`")
                    else:
                        cleaned_parts.append("")
            cleaned_cell = "".join(cleaned_parts)
            cleaned_cell = " " + cleaned_cell.strip() + " "
            cleaned_cells.append(cleaned_cell)
        else:
            cleaned_cells.append(cell.replace("\n", " "))

    res = "|".join(cleaned_cells)
    if not res.endswith("|"):
        res += "|"
    return res


def _clean_table_code_fences(text: str) -> str:
    """Sanitize Markdown table cells that contain block-level code blocks/fences.

    Transforms:
        | Col | ```\nval\n``` |
    Into:
        | Col | `val` |
    """
    if not text:
        return ""

    lines = text.splitlines()
    cleaned_lines: list[str] = []
    current_row_lines: list[str] = []
    in_multiline_row = False
    in_code_block = False

    def flush_current_row():
        nonlocal in_multiline_row, current_row_lines, in_code_block
        if current_row_lines:
            row_text = "\n".join(current_row_lines)
            cleaned_row = _clean_single_row(row_text)
            cleaned_lines.append(cleaned_row)
            current_row_lines = []
        in_multiline_row = False
        in_code_block = False

    for line in lines:
        stripped = line.strip()

        if in_multiline_row:
            num_fences = stripped.count("```")
            if stripped.startswith("|") and (stripped.count("|") >= 2 or "```" in stripped):
                flush_current_row()
                # fall through to process as a new row start below
            else:
                if num_fences % 2 != 0:
                    in_code_block = not in_code_block

                if not in_code_block:
                    if stripped.endswith("|"):
                        current_row_lines.append(line)
                        flush_current_row()
                        continue
                    elif not stripped:
                        flush_current_row()
                        cleaned_lines.append(line)
                        continue
                    elif stripped.startswith("```"):
                        flush_current_row()
                        # fall through to process as normal

                if in_multiline_row:
                    current_row_lines.append(line)
                    continue

        if stripped.startswith("|") and (stripped.count("|") >= 2 or "```" in stripped):
            if "```" in stripped:
                num_fences = stripped.count("```")
                in_code = num_fences % 2 != 0
                if not in_code and stripped.endswith("|"):
                    cleaned_lines.append(_clean_single_row(line))
                else:
                    in_multiline_row = True
                    in_code_block = in_code
                    current_row_lines = [line]
            else:
                if stripped.endswith("|"):
                    cleaned_lines.append(line)
                else:
                    in_multiline_row = True
                    in_code_block = False
                    current_row_lines = [line]
        else:
            cleaned_lines.append(line)

    flush_current_row()

    result = "\n".join(cleaned_lines)
    if text.endswith("\n") and not result.endswith("\n"):
        result += "\n"
    return result


def extract_web(
url: str,
*,
@@ -600,33 +731,39 @@ def extract_web(
     """
     transport = _get_webextract_transport(cfg)
     if transport == "mcp":
-        return _extract_web_mcp(url, cfg=cfg, timeout=timeout)
-    if transport != "http":
-        raise WebExtractError(f"未知 webextract transport: {transport}")
+        res = _extract_web_mcp(url, cfg=cfg, timeout=timeout)
+    else:
+        if transport != "http":
+            raise WebExtractError(f"未知 webextract transport: {transport}")
+
+        base_url = _get_webextract_base_url(cfg)
+        if not check_webextract_service(cfg, timeout=3.0):
+            raise WebExtractServiceUnavailableError(
+                f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行"
+            )

-    base_url = _get_webextract_base_url(cfg)
-    if not check_webextract_service(cfg, timeout=3.0):
-        raise WebExtractServiceUnavailableError(
-            f"提取服务未启动或不可达: {base_url}\n请确保 qt-web-extractor 服务已运行"
+        body: dict[str, object] = {"url": url}
+        if pdf is not None:
+            body["pdf"] = pdf
+        if include_html:
+            body["include_html"] = include_html
+
+        api_key = _get_webextract_api_key(cfg) or ""
+        req = Request(
+            f"{base_url}/extract",
+            data=json.dumps(body).encode("utf-8"),
+            headers=_headers(api_key),
+            method="POST",
         )
+        try:
+            res = _load_json_response(req, timeout=int(timeout), error_prefix="提取失败")
+        except RuntimeError as e:
+            raise WebExtractError(str(e)) from e

-    body: dict[str, object] = {"url": url}
-    if pdf is not None:
-        body["pdf"] = pdf
-    if include_html:
-        body["include_html"] = include_html
+    if isinstance(res, dict) and "text" in res and res["text"]:
+        res["text"] = _clean_table_code_fences(res["text"])

-    api_key = _get_webextract_api_key(cfg) or ""
-    req = Request(
-        f"{base_url}/extract",
-        data=json.dumps(body).encode("utf-8"),
-        headers=_headers(api_key),
-        method="POST",
-    )
-    try:
-        return _load_json_response(req, timeout=int(timeout), error_prefix="提取失败")
-    except RuntimeError as e:
-        raise WebExtractError(str(e)) from e
+    return res


def extract_and_display(
10 changes: 10 additions & 0 deletions tests/fixtures/wikipedia_infobox_bad.md
@@ -0,0 +1,10 @@
| 性别 | 男 |
| 出生 | ```
1902年8月28日
``` |
| 逝世 | ```
1993年11月24日
``` |
| 国籍 | ```
中华人民共和国
``` |
4 changes: 4 additions & 0 deletions tests/fixtures/wikipedia_infobox_clean.md
@@ -0,0 +1,4 @@
| 性别 | 男 |
| 出生 | `1902年8月28日` |
| 逝世 | `1993年11月24日` |
| 国籍 | `中华人民共和国` |
63 changes: 63 additions & 0 deletions tests/test_webtools_source.py
@@ -3,6 +3,7 @@
from __future__ import annotations

import json
import pathlib

import pytest

@@ -781,3 +782,65 @@ def fake_urlopen(req, timeout=0):
        assert result["title"] == "Page"
        captured = capsys.readouterr()
        assert "markdown body" in captured.out

    def test_clean_table_code_fences_with_fixtures(self):
        from scholaraio.providers.webtools import _clean_table_code_fences

        fixtures_dir = pathlib.Path(__file__).parent / "fixtures"
        bad_path = fixtures_dir / "wikipedia_infobox_bad.md"
        clean_path = fixtures_dir / "wikipedia_infobox_clean.md"

        assert bad_path.exists()
        assert clean_path.exists()

        bad_text = bad_path.read_text(encoding="utf-8")
        expected_clean_text = clean_path.read_text(encoding="utf-8")

        cleaned_text = _clean_table_code_fences(bad_text)
        assert cleaned_text.strip() == expected_clean_text.strip()

    def test_clean_table_code_fences_ignores_normal_structures(self):
        from scholaraio.providers.webtools import _clean_table_code_fences

        # Test normal code block outside table should not be changed
        normal_code = "Here is a code snippet:\n```python\ndef test():\n return True\n```\nAnd here is normal text."
        assert _clean_table_code_fences(normal_code) == normal_code

        # Test normal table with inline code should not be changed
        normal_table = "| Column 1 | Column 2 |\n| --- | --- |\n| `inline code` | value |\n"
        assert _clean_table_code_fences(normal_table) == normal_table

        # Test standalone code block between tables should not be changed
        standalone_between_tables = (
            "| A | B |\n"
            "| --- | --- |\n"
            "| one | two |\n\n"
            "```python\n"
            "print(1)\n"
            "```\n\n"
            "| C | D |\n"
            "| --- | --- |\n"
            "| three | four |\n"
        )
        assert _clean_table_code_fences(standalone_between_tables) == standalone_between_tables

        adjacent_standalone_code = (
            "| A | B |\n| one | two |\n```python\nprint(1)\n```\n| next paragraph starts with pipe |\n"
        )
        assert _clean_table_code_fences(adjacent_standalone_code) == adjacent_standalone_code

    def test_extract_web_applies_cleanup_http(self, monkeypatch):
        # Verify that HTTP path runs the clean helper
        def fake_urlopen(req, timeout=0):
            return _FakeResponse({"title": "Page", "text": "| 性别 |\n| 出生 | ```\n1902\n``` |"})

        def fake_check_service(cfg, timeout=3.0):
            return True

        monkeypatch.setattr("scholaraio.providers.webtools.urlopen", fake_urlopen)
        monkeypatch.setattr("scholaraio.providers.webtools.check_webextract_service", fake_check_service)

        from scholaraio.providers.webtools import extract_web

        res = extract_web("https://example.com")
        assert res["text"] == "| 性别 |\n| 出生 | `1902` |"