No chunks are getting generated from PDF file #189

Description

@090Gamer

With the MCP setup, the LLM gets access to Morphik and performs the ingest with the following tool calls:

list-allowed-directories
search-files
ingest-file-from-path
retrieve-chunks
retrieve-chunks

Then it gets stuck on retrieve-chunks and makes no progress.

Log file:

2025-06-05 16:19:39,992 - core.workers.ingestion_worker - INFO - File download took 0.00s for 2.80MB
2025-06-05 16:19:41,054 - core.workers.ingestion_worker - INFO - Document retrieval took 1.01s
2025-06-05 16:19:41,078 - core.workers.ingestion_worker - INFO - Initial document update took 0.02s
2025-06-05 16:19:41,085 - core.workers.ingestion_worker - WARNING - No text chunks extracted after parsing. Will attempt to continue and rely on image-based chunks if available.
2025-06-05 16:19:41,086 - core.workers.ingestion_worker - INFO - Text chunking took 0.00s to create 0 chunks
2025-06-05 16:19:41,086 - core.workers.ingestion_worker - ERROR - Error processing ingestion job for file abc.pdf: No content chunks (text or image) could be extracted from the document
2025-06-05 16:19:41,115 - core.workers.ingestion_worker - INFO - Updated document ab3e6c97-df0a-412d-aa4d-57470b4a4593 status to failed
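
To make the failure easier to spot in a longer log, here is a minimal sketch that pulls out only the WARNING and ERROR lines. It assumes the `timestamp - logger - LEVEL - message` layout seen in the log above; the names `LOG_LINE` and `find_problems` are made up for illustration and are not part of Morphik.

```python
import re

# Assumed line layout, taken from the log above:
# "<date> <time>,<ms> - <logger> - <LEVEL> - <message>"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<level>[A-Z]+) - (?P<msg>.*)$"
)


def find_problems(log_text):
    """Return (level, message) pairs for every WARNING or ERROR line."""
    problems = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") in ("WARNING", "ERROR"):
            problems.append((match.group("level"), match.group("msg")))
    return problems


# Three lines copied from the log above.
sample = """\
2025-06-05 16:19:41,085 - core.workers.ingestion_worker - WARNING - No text chunks extracted after parsing. Will attempt to continue and rely on image-based chunks if available.
2025-06-05 16:19:41,086 - core.workers.ingestion_worker - INFO - Text chunking took 0.00s to create 0 chunks
2025-06-05 16:19:41,086 - core.workers.ingestion_worker - ERROR - Error processing ingestion job for file abc.pdf: No content chunks (text or image) could be extracted from the document
"""

for level, msg in find_problems(sample):
    print(level, msg)
```

On the sample lines this surfaces the "No text chunks extracted" warning and the "No content chunks" error, which is where the ingestion job gives up.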

Terminal output:

morphik-1 | INFO: 172.16.9.1:41444 - "POST /retrieve/chunks HTTP/1.1" 200 OK
morphik-1 | INFO: 172.16.9.1:33314 - "POST /ingest/file HTTP/1.1" 200 OK
worker-1 | 16:19:39: 0.15s → e756ce20119547928c41c036e9bc72d6:process_ingestion_job(auth_dict={'entity_type': 'developer', 'entity_id': 'dev_user', 'app_id': None,…)
worker-1 | 16:19:41: 1.13s ← e756ce20119547928c41c036e9bc72d6:process_ingestion_job ● {'status': 'failed', 'filename': 'abc.pdf', 'error': 'No content chunks (text…
morphik-1 | INFO: 172.16.9.1:33314 - "POST /retrieve/chunks HTTP/1.1" 200 OK
postgres-1 | 2025-06-05 16:21:00.990 UTC [57] LOG: checkpoint starting: time
postgres-1 | 2025-06-05 16:21:03.563 UTC [57] LOG: checkpoint complete: wrote 26 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.506 s, sync=0.058 s, total=2.573 s; sync files=19, longest=0.008 s, average=0.004 s; distance=23 kB, estimate=1451 kB
morphik-1 | INFO: 172.16.9.1:33314 - "POST /retrieve/chunks HTTP/1.1" 200 OK
morphik-1 | INFO: 172.16.9.1:33314 - "POST /retrieve/docs HTTP/1.1" 200 OK
worker-1 | 16:21:58: recording health: Jun-05 16:21:58 j_complete=3 j_failed=0 j_retried=0 j_ongoing=0 queued=0
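
In the terminal output above, the worker reports the job as failed at 16:19:41, yet `POST /retrieve/chunks` keeps being called afterwards. A possible workaround on the client side is to wait for a terminal ingestion status before retrieving. The sketch below is only an illustration under assumptions: `get_status` is a hypothetical callable supplied by the caller (not a Morphik or MCP API), and apart from `failed`, which appears in the log, the status names are guesses.

```python
import time

# "failed" is taken from the log above; "completed" is an assumed name
# for the success state.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_ingestion(get_status, document_id, timeout_s=60.0,
                       interval_s=1.0, sleep=time.sleep):
    """Poll get_status(document_id) until it returns a terminal status.

    get_status is a hypothetical callable provided by the caller.
    Raises TimeoutError if no terminal status is seen within timeout_s.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status(document_id)
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"document {document_id} still '{status}' after {timeout_s}s"
            )
        sleep(interval_s)


# Usage with a fake status source standing in for a real lookup.
fake_responses = iter(["processing", "failed"])
final_status = wait_for_ingestion(
    lambda _doc_id: next(fake_responses),
    "ab3e6c97-df0a-412d-aa4d-57470b4a4593",
    sleep=lambda _seconds: None,  # skip real sleeping in this demo
)
print(final_status)
```

With a check like this, a `failed` status could be reported back to the LLM right away instead of repeating retrieve-chunks against a document that has no chunks.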

I am using OpenAI embeddings and an OpenAI model, but there has been zero usage of the model so far. Only the embeddings are getting triggered, and at zero cost.
