This tool converts Office files (.doc, .docx, .pptx) to Markdown optimized for RAG, maintaining structural fidelity and applying safe linguistic normalization.
Main script:
Convert-OfficeToRAG.ps1
- Windows with Microsoft Word and PowerPoint installed (COM automation).
- PowerShell 7 (pwsh) installed.
- Read/write permissions on the source folders.
Quick verification:
pwsh -NoProfile -Command '$PSVersionTable.PSVersion'
Edit the $Config block at the beginning of Convert-OfficeToRAG.ps1:
- SourceFolders: source folders to process.
- FileExtensions: allowed extensions.
- OcrDictionary: conservative OCR dictionary.
- ResidualOcrRegex: residual validation regex.
- LogPath, QaLogPath, SummaryPath: log output paths.
- ForceReprocess: forces reprocessing of files even if the .md output already exists.
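The effect of ForceReprocess can be pictured with a minimal sketch. The helper name `Test-NeedsConversion` is hypothetical, and the sketch assumes the Markdown output sits next to the source file with the same base name; the real script may place it elsewhere.

```powershell
# Hypothetical helper (not part of Convert-OfficeToRAG.ps1): decide whether a
# source file still needs conversion.
function Test-NeedsConversion {
    param(
        [Parameter(Mandatory)][string]$SourcePath,
        [bool]$ForceReprocess = $false
    )
    # Assumption: the .md output lives beside the source with the same base name.
    $markdownPath = [System.IO.Path]::ChangeExtension($SourcePath, '.md')
    return ($ForceReprocess -or -not (Test-Path -LiteralPath $markdownPath))
}
```

With ForceReprocess set to `$true` the function always answers `$true`; otherwise it only does so when no .md file is found.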
Portable behavior:
- If a path is relative, the script resolves it from the folder where Convert-OfficeToRAG.ps1 lives.
- By default, logs (rag_converter_log.txt, rag_converter_qa_log.txt, rag_converter_summary.txt) are written to outputs/logs inside the script folder.
- SourceFolders defaults to ..\input relative to the script folder.
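A minimal sketch of this kind of script-relative resolution follows. `Resolve-ToolPath` is a hypothetical name used for illustration; the script's own implementation is not shown in this README and may differ.

```powershell
# Hypothetical helper: resolve a configured path against a base folder.
function Resolve-ToolPath {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$BaseFolder
    )
    # Absolute paths are returned untouched.
    if ([System.IO.Path]::IsPathRooted($Path)) { return $Path }
    # GetFullPath collapses segments such as "..".
    return [System.IO.Path]::GetFullPath((Join-Path $BaseFolder $Path))
}

# Inside a script file, $PSScriptRoot is the folder containing that file;
# the fallback covers interactive use, where it is empty.
$base = if ($PSScriptRoot) { $PSScriptRoot } else { (Get-Location).Path }
Resolve-ToolPath -Path 'outputs/logs' -BaseFolder $base
```

Because the base is the script folder and not the current working directory, moving the tool folder does not break relative paths.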
Key environment variables:
- RAG_SOURCE_FOLDERS: accepts multiple paths separated by ; or ,.
- RAG_SOURCE_FILES: accepts one or more specific files separated by ; or ,.
- RAG_FORCE_REPROCESS: true/false to reprocess even if the .md output exists.
- RAG_FAIL_FAST: true/false to abort on the first error or continue.
- RAG_ENABLE_PREFLIGHT: true/false to enable/disable the API preflight check.
- RAG_OPENROUTER_MODEL: vision model to use.
Direct script parameters (high priority, recommended for automation):
- -SourceFoldersOverride <string[]>
- -SourceFilesOverride <string[]>
- -ForceReprocessOverride <bool>
- -FailFastOverride <bool>
- -EnablePreflightOverride <bool>
- -OpenRouterModelOverride <string>
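The list format and the parameter priority can be sketched as below. Both function names are hypothetical. The order "parameter, then environment variable, then $Config default" is an assumption: this README only states that parameters have high priority.

```powershell
# Hypothetical helper: split a RAG_SOURCE_FOLDERS / RAG_SOURCE_FILES value on ; or ,
function Split-RagList {
    param([string]$Value)
    if ([string]::IsNullOrWhiteSpace($Value)) { return @() }
    return @($Value -split '[;,]' | ForEach-Object { $_.Trim() } | Where-Object { $_ })
}

# Hypothetical helper: pick a boolean setting by priority
# (assumed order: explicit parameter > environment variable > default).
function Resolve-RagFlag {
    param($Override, [string]$EnvValue, [bool]$Default)
    if ($null -ne $Override) { return [bool]$Override }
    if (-not [string]::IsNullOrWhiteSpace($EnvValue)) { return ($EnvValue.Trim() -eq 'true') }
    return $Default
}

Split-RagList 'D:\A; D:\B,D:\C'                                        # three entries
Resolve-RagFlag -Override $null -EnvValue $env:RAG_FAIL_FAST -Default $false
```

Passing an explicit `-Override` value wins even when the environment variable says otherwise, which is what makes parameters convenient for automation.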
Always run with pwsh to maintain UTF-8 stability. Path-agnostic command:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
pwsh -NoProfile -File (Join-Path $toolDir "Convert-OfficeToRAG.ps1")
Expected console output:
NORM_OK or NORM_WITH_ERRORS.
Run with strict status validation:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
$script = Join-Path $toolDir 'Convert-OfficeToRAG.ps1'
$summary = Join-Path $toolDir 'outputs\logs\rag_converter_summary.txt'
# Run the converter in a clean pwsh process and stop on a non-zero exit code.
pwsh -NoProfile -File $script
if ($LASTEXITCODE -ne 0) { throw 'Converter execution failed' }
# Read the STATUS= line from the summary and require NORM_OK.
$status = (Get-Content $summary | Select-String '^STATUS=').Line
if ($status -ne 'STATUS=NORM_OK') { throw "Invalid status: $status" }
Write-Host 'OK => STATUS=NORM_OK' -ForegroundColor Green
Quick summary audit:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
Get-Content -Path (Join-Path $toolDir "outputs\logs\rag_converter_summary.txt")
One-line status:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
(Get-Content (Join-Path $toolDir "outputs\logs\rag_converter_summary.txt") | Select-String '^STATUS=').Line
QA incidents:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
$qa = Join-Path $toolDir "outputs\logs\rag_converter_qa_log.txt"
if ((Test-Path $qa) -and ((Get-Item $qa).Length -gt 0)) { Get-Content $qa } else { "No QA incidents" }
The script is designed with a robust and modular approach:
The script is no longer coupled to a specific topic (such as football). It uses configuration variables to inject context dynamically into the AI model prompt:
- $Config.DomainContext: Defines the environment (e.g. "high-performance sports educational environment").
- $Config.DomainNoiseFilter: Keywords for the model to ignore (e.g. "clothing colors, landscapes, weather").
- $Config.DomainTechnicalTerms: Terminological precision instructions (e.g. avoid replacing specialized terms with ambiguous synonyms).
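As an illustration of how these three values could be injected into a prompt, here is a minimal sketch. The function name `New-VisionPrompt` and the prompt wording are assumptions made for the example; the actual prompt text used by the script is not reproduced in this README.

```powershell
# Example values taken from the descriptions above.
$Config = @{
    DomainContext        = 'high-performance sports educational environment'
    DomainNoiseFilter    = 'clothing colors, landscapes, weather'
    DomainTechnicalTerms = 'avoid replacing specialized terms with ambiguous synonyms'
}

# Hypothetical helper: build the domain-specific part of the prompt.
function New-VisionPrompt {
    param([Parameter(Mandatory)][hashtable]$Config)
    # Illustrative wording only; the real prompt may be phrased differently.
    @(
        "You are analyzing an image from a $($Config.DomainContext)."
        "Ignore: $($Config.DomainNoiseFilter)."
        "Terminology: $($Config.DomainTechnicalTerms)."
    ) -join "`n"
}

New-VisionPrompt -Config $Config
```

Switching the tool to another subject then only requires editing the three configuration values, not the prompt-building code.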
Image analysis is performed via the OpenRouter API. The new prompt requires a strict Markdown output format that includes:
- Literal OCR: Exact transcription of text in slides.
- Spatial Technical Analysis: Interpretation of diagrams and arrows.
- Pedagogical Value: Extraction of the core concept.
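A strict output format makes it possible to check each model response for completeness. The sketch below assumes that every section arrives as a Markdown heading whose text matches the names above; the exact headings the prompt asks for are not documented here, and `Get-MissingSection` is a hypothetical name.

```powershell
# Hypothetical helper: list the required sections absent from a model response.
function Get-MissingSection {
    param([string]$Markdown)
    $required = 'Literal OCR', 'Spatial Technical Analysis', 'Pedagogical Value'
    # Assumption: each section is a heading line of any level, e.g. "## Literal OCR".
    $required | Where-Object {
        $Markdown -notmatch "(?m)^#+\s*$([regex]::Escape($_))\s*$"
    }
}

$response = "## Literal OCR`nPressing zones`n## Spatial Technical Analysis`nArrows show movement"
Get-MissingSection -Markdown $response   # Pedagogical Value
```

A non-empty result is the kind of condition that could be reported as an incomplete image analysis.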
To avoid "flying blind" during massive runs:
- Console Output (Verbose): Shows real-time progress ([1/10] Processing..., [Image 3/5] Requesting analysis..., Generating final Markdown).
- COM Telemetry: Measures and displays Word.Open and Word.SaveAs(HTML) timings to detect bottlenecks.
- Log Files:
  - rag_converter_log.txt: Records events with INFO and ERROR (including StackTraces).
  - rag_converter_qa_log.txt: Records validation errors (e.g. incomplete image analysis).
  - rag_converter_summary.txt: Final execution summary.
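Timing a step in this way can be sketched with a .NET Stopwatch. `Measure-Step` is a hypothetical wrapper, and the output format is illustrative and not necessarily what the script prints.

```powershell
# Hypothetical wrapper: run a step, print its duration, and pass its result through.
function Measure-Step {
    param(
        [Parameter(Mandatory)][string]$Name,
        [Parameter(Mandatory)][scriptblock]$Action
    )
    $timer = [System.Diagnostics.Stopwatch]::StartNew()
    try { & $Action }
    finally {
        $timer.Stop()
        # Write-Host goes to the information stream, so it does not pollute the result.
        Write-Host ("[{0}] {1:N0} ms" -f $Name, $timer.ElapsedMilliseconds)
    }
}

# Stand-in for a COM call such as opening a document in Word.
$result = Measure-Step -Name 'Word.Open' -Action { Start-Sleep -Milliseconds 50; 'opened' }
```

The `finally` block ensures the timing is printed even when the wrapped step throws, which is when the measurement tends to matter most.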
- Supports profiles like default and staging to switch models quickly without touching code.
- Includes a preflight check that verifies API connectivity and multimodal support before starting, with automatic fallback from vision to text if the model does not support images.
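The fallback decision can be sketched independently of the API. In this sketch the two probes are passed in as script blocks; they are hypothetical placeholders for whatever requests the script actually sends, and `Get-AnalysisMode` is not a function from the tool.

```powershell
# Hypothetical helper: choose the analysis mode from two preflight probes.
function Get-AnalysisMode {
    param(
        [Parameter(Mandatory)][scriptblock]$TestConnectivity,   # returns $true if the API answers
        [Parameter(Mandatory)][scriptblock]$TestVisionSupport   # returns $true if the model accepts images
    )
    if (-not (& $TestConnectivity)) { throw 'Preflight failed: API is not reachable.' }
    if (& $TestVisionSupport) { return 'vision' }
    # Fallback: the model is reachable but cannot process images.
    return 'text'
}

# Stub probes for illustration: reachable API, model without image support.
Get-AnalysisMode -TestConnectivity { $true } -TestVisionSupport { $false }   # text
```

Running this check once before a batch avoids discovering a connectivity or capability problem halfway through a long run.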
- If launching with powershell.exe shows broken characters in regex/accents, use pwsh.
- If COM fails, verify the Word/PowerPoint installation and that there is an active user session.
- If status is NORM_WITH_ERRORS, check rag_converter_qa_log.txt first.
- If you move the RAG_Converter_Tool folder, the script still works; just review SourceFolders if the sources ended up in another location.
- If a Word extraction runs slow, check the console for Word.Open / Word.SaveAs(HTML) timings to locate the bottleneck.
- If there are locked temporary files (~$*.docx), close them in Office before running the batch.
Full execution (configured folders):
$toolDir = "D:\Ruta\RAG_Converter_Tool"
$env:OPENROUTER_API_KEY="TU_API_KEY"; $env:RAG_OPENROUTER_MODEL="google/gemini-3.1-flash-lite-preview"; $env:RAG_FAIL_FAST="false"; $env:RAG_ENABLE_PREFLIGHT="false"; & (Join-Path $toolDir "Convert-OfficeToRAG.ps1")
Forced full reprocessing:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
$env:OPENROUTER_API_KEY="TU_API_KEY"; $env:RAG_OPENROUTER_MODEL="google/gemini-3.1-flash-lite-preview"; $env:RAG_FORCE_REPROCESS="true"; $env:RAG_FAIL_FAST="false"; $env:RAG_ENABLE_PREFLIGHT="false"; & (Join-Path $toolDir "Convert-OfficeToRAG.ps1")
Single file execution:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
$env:OPENROUTER_API_KEY="TU_API_KEY"; $env:RAG_OPENROUTER_MODEL="google/gemini-3.1-flash-lite-preview"; $env:RAG_SOURCE_FILES="D:\Ruta\Input\Documento.docx"; $env:RAG_FORCE_REPROCESS="true"; $env:RAG_FAIL_FAST="false"; $env:RAG_ENABLE_PREFLIGHT="false"; & (Join-Path $toolDir "Convert-OfficeToRAG.ps1")
Single file execution (via parameters, recommended):
$toolDir = "D:\Ruta\RAG_Converter_Tool"
$env:OPENROUTER_API_KEY="TU_API_KEY"; & (Join-Path $toolDir "Convert-OfficeToRAG.ps1") -SourceFilesOverride "D:\Ruta\Input\Documento.docx" -ForceReprocessOverride $true -FailFastOverride $false -EnablePreflightOverride $false -OpenRouterModelOverride "google/gemini-3.1-flash-lite-preview"
Full execution even if .md already exists (via parameters):
$toolDir = "D:\Ruta\RAG_Converter_Tool"
$env:OPENROUTER_API_KEY="TU_API_KEY"; & (Join-Path $toolDir "Convert-OfficeToRAG.ps1") -ForceReprocessOverride $true -FailFastOverride $false -EnablePreflightOverride $false -OpenRouterModelOverride "google/gemini-3.1-flash-lite-preview"
- Create .env by copying the template:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
Copy-Item (Join-Path $toolDir ".env.example") (Join-Path $toolDir ".env") -Force
- Edit .env and set your OPENROUTER_API_KEY.
- Load short aliases in the current session:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
. (Join-Path $toolDir "Enable-RagAlias.ps1")
- Use short commands:
rag
rag -Target "D:\Ruta\Input"
rag -Target "D:\Ruta\Input\Documento.docx"
rag -Target "D:\Ruta\Input\Documento.docx" -Reprocess
rr -Target "D:\Ruta\Input\Documento.docx" -Reprocess
- Optional: persist aliases in your PowerShell profile:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
. (Join-Path $toolDir "Enable-RagAlias.ps1") -Persist
- Multi-client scalability with dedicated .env files:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
Copy-Item (Join-Path $toolDir ".env.example") (Join-Path $toolDir ".env.acme.dev") -Force
Copy-Item (Join-Path $toolDir ".env.example") (Join-Path $toolDir ".env.acme.prod") -Force
General note:
- Create one .env file per client and environment.
- The .env parser supports comments on lines starting with #.
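For readers unfamiliar with the format, a .env parser with comment support can be sketched as follows. This is not the tool's own parser: `Read-EnvFile` is a hypothetical name, and the handling of blank lines and surrounding double quotes is an assumption.

```powershell
# Hypothetical helper: read KEY=VALUE pairs from a .env file.
function Read-EnvFile {
    param([Parameter(Mandatory)][string]$Path)
    $values = [ordered]@{}
    foreach ($line in Get-Content -LiteralPath $Path) {
        $trimmed = $line.Trim()
        # Skip blank lines and comment lines starting with #.
        if ($trimmed -eq '' -or $trimmed.StartsWith('#')) { continue }
        # Split on the first = only, so values may themselves contain =.
        $name, $value = $trimmed -split '=', 2
        if ($null -eq $value) { continue }   # not a KEY=VALUE line
        # Assumption: surrounding double quotes are optional and stripped.
        $values[$name.Trim()] = $value.Trim().Trim('"')
    }
    return $values
}
```

The returned dictionary could then be copied into the process environment, for example with `Set-Item "Env:$key" $values[$key]` for each key.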
Usage by client without typing long commands:
rag -EnvFile ".env.acme.dev"
rag -EnvFile ".env.acme.prod" -Target "D:\Ruta\Input"
rag -EnvFile ".env.acme.prod" -Target "D:\Ruta\Input\Documento.docx" -Reprocess
Quick checklist to onboard a new client without touching code:
- Create the client environment file:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
Copy-Item (Join-Path $toolDir ".env.example") (Join-Path $toolDir ".env.<cliente>.<entorno>") -Force
- Edit .env.<cliente>.<entorno>:
  - OPENROUTER_API_KEY: client key.
  - RAG_OPENROUTER_MODEL: model agreed for that client.
  - RAG_FAIL_FAST, RAG_ENABLE_PREFLIGHT, RAG_FORCE_REPROCESS: operational policy.
  - Comments are allowed on lines starting with #.
- Run a minimal test on a file:
rag -EnvFile ".env.<cliente>.<entorno>" -Target "D:\Ruta\Documento.docx"
- Validate the result:
  - Console shows NORM_OK.
  - Review rag_converter_summary.txt and rag_converter_qa_log.txt.
  - In this version, artifacts are in outputs/logs.
- Daily client operation:
rag -EnvFile ".env.<cliente>.<entorno>"
- Full reprocessing when needed:
rag -EnvFile ".env.<cliente>.<entorno>" -Reprocess
To scale without friction, use this convention:
- .env.<cliente>.<entorno>
- <cliente>: stable identifier (no spaces), e.g. acme, clinicax, lexcorp.
- <entorno>: dev, staging, or prod.
Examples:
- .env.acme.dev
- .env.acme.prod
- .env.lexcorp.staging
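If you want to enforce the convention before a run, a small check such as the one below may help. It is a sketch and not part of the tool: `Test-EnvFileName` is a hypothetical name, and the allowed character set for the client identifier (lowercase letters, digits, hyphen, underscore) is an assumption.

```powershell
# Hypothetical helper: check a file name against .env.<cliente>.<entorno>.
function Test-EnvFileName {
    param([Parameter(Mandatory)][string]$Name)
    # <cliente>: assumed charset a-z, 0-9, _ and - (no spaces).
    # <entorno>: dev, staging or prod.
    return ($Name -cmatch '^\.env\.[a-z0-9_-]+\.(dev|staging|prod)$')
}

Test-EnvFileName '.env.acme.dev'          # True
Test-EnvFileName '.env.acme corp.prod'    # False (space in the client identifier)
```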
Recommended workflow:
- Create a variant per environment from the client base:
$toolDir = "D:\Ruta\RAG_Converter_Tool"
Copy-Item (Join-Path $toolDir ".env.example") (Join-Path $toolDir ".env.acme.dev") -Force
Copy-Item (Join-Path $toolDir ".env.example") (Join-Path $toolDir ".env.acme.prod") -Force
- Execute per environment:
rag -EnvFile ".env.acme.dev"
rag -EnvFile ".env.acme.prod" -Target "D:\Ruta\Input" -Reprocess
Automatic executive report generation from rag_converter_summary.txt:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
& (Join-Path $toolDir "Gen-Report.ps1") -Cliente "Cliente Demo" -Firmante "Nombre Apellido" -Modo comercial
Optionally define the output:
$toolDir = "D:\Ruta\Donde\Está\RAG_Converter_Tool"
& (Join-Path $toolDir "Gen-Report.ps1") -Cliente "Cliente Demo" -Modo tecnico -OutputPath (Join-Path $toolDir "outputs\reports\Informe_RAG_Auditoria.md")
Notes:
- The report uses real metrics from the summary/log; if a data point does not exist, it is shown as N/D.
- It avoids inventing KPIs (e.g. OCR percentages or speedups) if they are not measured in the evidence files.
- Available modes: -Modo comercial (executive storytelling) and -Modo tecnico (forensic audit).
- By default, logs are generated in outputs/logs and reports in outputs/reports (portable paths, not hardcoded).
- The report incorporates DHI (Data Health Index) with a 0-100 scale and grades (WORLD CLASS, ENTERPRISE READY, ACCEPTABLE, NEEDS IMPROVEMENT).
- The DHI is calculated with 4 weighted pillars: Integrity (30), Semantics (40), OCR Normalization (20), Citation (10).
- Case without images (VISION_ITEMS=0): no penalty; it is marked as Texto puro in the semantics pillar.
- The DHI calculation uses summary + qa log; you can override QA with -QaPath.
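The weighted sum behind a 0-100 index of this kind can be sketched as follows. The weights come from the note above; everything else is an assumption: `Get-DhiScore` is a hypothetical name, each pillar is represented by a ratio between 0 and 1, and how Gen-Report.ps1 derives those ratios from the summary and QA log is not documented here. The grade thresholds are not stated either, so the sketch stops at the numeric score.

```powershell
# Hypothetical helper: combine per-pillar ratios (0..1) into a 0-100 score.
function Get-DhiScore {
    param([Parameter(Mandatory)][hashtable]$PillarRatios)
    # Weights as listed in the notes; they add up to 100.
    $weights = [ordered]@{ Integrity = 30; Semantics = 40; OcrNormalization = 20; Citation = 10 }
    $score = 0.0
    foreach ($pillar in $weights.Keys) {
        # Clamp to [0, 1]; a pillar missing from the input counts as 0.
        $ratio = [math]::Min(1.0, [math]::Max(0.0, [double]$PillarRatios[$pillar]))
        $score += $weights[$pillar] * $ratio
    }
    return [math]::Round($score, 1)
}

# 30*1.0 + 40*0.9 + 20*0.5 + 10*1.0 = 86
Get-DhiScore -PillarRatios @{ Integrity = 1.0; Semantics = 0.9; OcrNormalization = 0.5; Citation = 1.0 }
```

Under this model, the no-penalty rule for documents without images would correspond to a Semantics ratio of 1.0 when VISION_ITEMS=0.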
Shortcut with aliases (after loading Enable-RagAlias.ps1):
rag-report -Modo comercial -Cliente "Cliente Demo" -Firmante "Nombre Apellido"
rag-report -Modo tecnico -Cliente "Cliente Demo" -OutputPath "D:\Ruta\RAG_Converter_Tool\outputs\reports\Informe_RAG_Auditoria.md"