feat: refactor indexing pipelines for some connectors#996

Merged
MODSetter merged 32 commits into MODSetter:dev from AnishSarkar22:refactor/indexing-pipelines
Mar 27, 2026

@AnishSarkar22 AnishSarkar22 commented Mar 27, 2026

Collaborator

Description

  • Refactored the indexing pipelines to bound concurrency with semaphores. The Google Drive, Gmail, Google Calendar, Notion, Linear, Jira, and Confluence connectors now index documents in parallel, which makes indexing faster overall.
  • These connectors now use the unified indexing pipeline.
  • Fixed connector dialog showing stale indexing config view on reopen by resetting state in handleStartIndexing and handleSkipIndexing.
  • Updated the file-skipping logic in the Google Drive indexer so that documents that failed to index are no longer left with a READY status; they are now correctly marked as FAILED.
  • Added descendant checking for folder filtering in Google Drive changes.
  • Support .docx and .xlsx file types by detecting them and forwarding to the ETL service.
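The descendant check for folder filtering can be pictured as walking up a file's parent chain until the selected folder is found. This is only a minimal sketch: `is_descendant_of`, the `parent_of` mapping, and the single-parent assumption are all hypothetical and not taken from the PR's code.

```python
from typing import Mapping, Optional


def is_descendant_of(
    file_id: str,
    target_folder_id: str,
    parent_of: Mapping[str, Optional[str]],
) -> bool:
    """Return True if file_id sits anywhere below target_folder_id."""
    seen = set()  # guards against cycles in malformed parent data
    current = parent_of.get(file_id)
    while current is not None and current not in seen:
        if current == target_folder_id:
            return True
        seen.add(current)
        current = parent_of.get(current)
    return False


# Hypothetical hierarchy: root -> projects -> reports -> q1.pdf
parents = {
    "q1.pdf": "reports",
    "reports": "projects",
    "projects": "root",
    "notes.txt": "root",
    "root": None,
}
```

With this mapping, a change to `q1.pdf` would pass a filter on the `projects` folder even though its direct parent is `reports`, while `notes.txt` would not.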

Motivation and Context

FIX #

Screenshots

API Changes

  • This PR includes API changes

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring
  • Documentation
  • Dependency/Build system
  • Breaking change
  • Other (specify):

Testing Performed

  • Tested locally
  • Manual/QA verification

Checklist

  • Follows project coding standards and conventions
  • Documentation updated as needed
  • Dependencies updated as needed
  • No lint/build errors or new warnings
  • All relevant tests are passing

High-level PR Summary

This PR refactors the connector indexing pipeline to use a unified parallel indexing architecture across the Google Drive, Gmail, Google Calendar, Confluence, Jira, Linear, and Notion connectors. The core change introduces IndexingPipelineService with parallel document processing (index_batch_parallel) that uses bounded concurrency and an isolated database session for each document. The Drive indexer now streams large files directly to disk and parallelizes the download, ETL, and indexing phases. Legacy Composio connector documents are automatically migrated to native types using the NATIVE_TO_LEGACY_DOCTYPE mappings. The frontend switches from REST API polling to Zero real-time sync for document type counts, eliminating unnecessary cache invalidations. Embedding operations are protected with a reentrant lock to prevent tokenizer thread-safety issues. Google Calendar event updates now properly handle all-day events via the _build_time_body helper.
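For context on the all-day handling: the Google Calendar API expects `{"date": "YYYY-MM-DD"}` for all-day events and `{"dateTime": ..., "timeZone": ...}` for timed ones. A rough sketch of what such a helper might look like follows; `is_date_only` and `build_time_body` are stand-ins written for illustration and are not the PR's `_is_date_only` / `_build_time_body` implementations.

```python
import re
from typing import Dict, Optional

# Matches a bare calendar date such as 2026-03-27 (no time component).
_DATE_ONLY = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def is_date_only(value: str) -> bool:
    """True when the value carries a date but no time of day."""
    return bool(_DATE_ONLY.match(value))


def build_time_body(value: str, time_zone: Optional[str] = None) -> Dict[str, str]:
    """Build the start/end body for an event update request."""
    if is_date_only(value):
        # All-day events use the "date" field only.
        return {"date": value}
    body = {"dateTime": value}
    if time_zone:
        body["timeZone"] = time_zone
    return body
```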

⏱️ Estimated Review Time: 3+ hours

💡 Review Order Suggestion
Order File Path
1 surfsense_backend/app/indexing_pipeline/document_hashing.py
2 surfsense_backend/app/db/__init__.py
3 surfsense_backend/app/indexing_pipeline/connector_document.py
4 surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
5 surfsense_backend/app/connectors/google_drive/content_extractor.py
6 surfsense_backend/app/connectors/google_drive/client.py
7 surfsense_backend/app/connectors/google_drive/__init__.py
8 surfsense_backend/app/tasks/connector_indexers/google_drive_indexer.py
9 surfsense_backend/app/tasks/connector_indexers/confluence_indexer.py
10 surfsense_backend/app/tasks/connector_indexers/jira_indexer.py
11 surfsense_backend/app/tasks/connector_indexers/linear_indexer.py
12 surfsense_backend/app/tasks/connector_indexers/notion_indexer.py
13 surfsense_backend/app/tasks/connector_indexers/google_calendar_indexer.py
14 surfsense_backend/app/tasks/connector_indexers/google_gmail_indexer.py
15 surfsense_backend/app/routes/search_source_connectors_routes.py
16 surfsense_backend/app/utils/document_converters.py
17 surfsense_backend/app/agents/new_chat/tools/google_calendar/update_event.py
18 surfsense_backend/app/services/google_calendar/kb_sync_service.py
19 surfsense_backend/tests/integration/indexing_pipeline/test_calendar_pipeline.py
20 surfsense_backend/tests/integration/indexing_pipeline/test_drive_pipeline.py
21 surfsense_backend/tests/integration/indexing_pipeline/test_gmail_pipeline.py
22 surfsense_backend/tests/integration/indexing_pipeline/test_index_batch.py
23 surfsense_backend/tests/unit/indexing_pipeline/test_document_hashing.py
24 surfsense_backend/tests/unit/indexing_pipeline/test_index_batch.py
25 surfsense_backend/tests/unit/indexing_pipeline/test_index_batch_parallel.py
26 surfsense_backend/tests/unit/indexing_pipeline/test_migrate_legacy_docs.py
27 surfsense_backend/tests/unit/connector_indexers/test_confluence_parallel.py
28 surfsense_backend/tests/unit/connector_indexers/test_google_drive_parallel.py
29 surfsense_backend/tests/unit/connector_indexers/test_jira_parallel.py
30 surfsense_backend/tests/unit/connector_indexers/test_linear_parallel.py
31 surfsense_backend/tests/unit/connector_indexers/test_notion_parallel.py
32 surfsense_web/hooks/use-zero-document-type-counts.ts
33 surfsense_web/components/onboarding-tour.tsx
34 surfsense_web/components/assistant-ui/connector-popup.tsx
35 surfsense_web/atoms/documents/document-mutation.atoms.ts
36 surfsense_web/app/dashboard/[search_space_id]/documents/(manage)/components/DocumentsTableShell.tsx
37 surfsense_web/components/tool-ui/google-calendar/update-event.tsx
38 surfsense_web/components/hitl-edit-panel/hitl-edit-panel.tsx
39 surfsense_web/contracts/enums/toolIcons.tsx
⚠️ Inconsistent Changes Detected
| File Path | Warning |
| --- | --- |
| `surfsense_web/contracts/enums/toolIcons.tsx` | Icon change from Sparkles to ImageIcon for generate_image tool appears unrelated to indexing pipeline refactor |
| `surfsense_web/components/hitl-edit-panel/hitl-edit-panel.tsx` | CSS changes to hide calendar picker indicator are unrelated to backend indexing changes |
| `surfsense_web/app/dashboard/[search_space_id]/documents/(manage)/components/DocumentsTableShell.tsx` | Bulk delete button styling and positioning changes are unrelated to indexing pipeline architecture |
| `surfsense_web/components/tool-ui/google-calendar/update-event.tsx` | Calendar event update UI logic changes appear separate from core indexing refactor |

…document migration

- Added `download_and_extract_content` function to extract content from Google Drive files as markdown.
- Updated Google Drive indexer to utilize the new content extraction method.
- Implemented document migration logic to update legacy Composio document types to their native Google types.
- Introduced identifier hashing for stable document identification.
- Improved file pre-filtering to handle unchanged and rename-only files efficiently.
- Introduced integration tests for Calendar, Drive, and Gmail indexers to ensure proper document creation and migration.
- Added tests for batch indexing functionality to validate the processing of multiple documents.
- Implemented tests for legacy document migration to verify updates to document types and hashes.
- Enhanced test coverage for the IndexingPipelineService to ensure robust functionality across various document types.
- Modified the `_should_skip_file` function to prevent skipping of documents with a FAILED status, ensuring they are reprocessed even if their content remains unchanged.
- Added a new integration test to verify that FAILED documents are not skipped during the indexing process.
- Introduced helper functions `_is_date_only` and `_build_time_body` to streamline the construction of event start and end times for all-day and timed events.
- Refactored the `create_update_calendar_event_tool` to utilize the new helper functions, improving code readability and maintainability.
- Updated the Google Calendar sync service to ensure proper handling of calendar IDs with a default fallback to "primary".
- Modified the ApprovalCard component to simplify the construction of event update arguments, enhancing clarity and reducing redundancy.
- Added `index_batch_parallel` method to enable concurrent indexing of documents with bounded concurrency, improving performance and efficiency.
- Refactored existing indexing logic to utilize `asyncio.to_thread` for non-blocking execution of embedding and chunking functions.
- Introduced unit tests to validate the functionality of the new parallel indexing method, ensuring robustness and error handling during document processing.
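As a rough illustration of the bounded-concurrency idea behind `index_batch_parallel`, the sketch below combines an `asyncio.Semaphore` with `asyncio.to_thread`. Everything in it (`index_batch_parallel_sketch`, `fake_embed`, `index_one`, the `max_concurrency` parameter, and the `(indexed, failed)` return shape) is an assumption made for the example, not the service's actual signature.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple


async def index_batch_parallel_sketch(
    docs: List[str],
    index_one: Callable[[str], Awaitable[None]],
    max_concurrency: int = 4,
) -> Tuple[int, int]:
    """Index docs concurrently; return (indexed_count, failed_count)."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(doc: str) -> bool:
        async with semaphore:  # at most max_concurrency docs in flight
            try:
                await index_one(doc)
                return True
            except Exception:
                # One bad document must not abort the rest of the batch.
                return False

    results = await asyncio.gather(*(worker(d) for d in docs))
    indexed = sum(results)
    return indexed, len(results) - indexed


def fake_embed(doc: str) -> int:
    """Stand-in for blocking chunking/embedding work."""
    if doc == "bad":
        raise ValueError("cannot embed")
    return len(doc)


async def index_one(doc: str) -> None:
    # Run the blocking call in a worker thread so the event loop stays free.
    await asyncio.to_thread(fake_embed, doc)


def run_demo() -> Tuple[int, int]:
    return asyncio.run(
        index_batch_parallel_sketch(["a", "bad", "c"], index_one, max_concurrency=2)
    )
```

Catching the exception inside the worker keeps a single failure local to its document. The real pipeline additionally gives each document its own database session, which this sketch leaves out.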
…ctors

- Refactored Google Calendar and Gmail indexers to utilize the new `index_batch_parallel` method for concurrent document indexing, enhancing performance.
- Updated the indexing logic to replace serial processing with parallel execution, allowing for improved efficiency in handling multiple documents.
- Adjusted logging and error handling to accommodate the new parallel processing approach, ensuring robust operation during indexing.
- Enhanced unit tests to validate the functionality of the parallel indexing method and its integration with existing workflows.
- Added performance logging to the `index_batch_parallel` method, capturing metrics for document indexing duration and concurrency.
- Introduced timing measurements for both the overall indexing process and the parallel document gathering phase, improving observability of the indexing workflow.
- Updated logging statements to provide detailed insights into the number of documents processed, indexed, and failed during the indexing operation.
…e indexer

- Added `_download_files_parallel` function to enable concurrent downloading of files from Google Drive, improving efficiency in document processing.
- Introduced `_download_and_index` function to handle the parallel downloading and indexing phases, streamlining the overall workflow.
- Updated `_index_full_scan` and `_index_with_delta_sync` methods to utilize the new parallel downloading functionality, enhancing performance.
- Added unit tests to validate the new parallel downloading and indexing logic, ensuring robustness and error handling during document processing.
- Introduced an asyncio lock to the GoogleDriveClient to serialize access to the service instance across concurrent tasks.
- Refactored the get_service method to utilize the lock, preventing concurrent attempts to create the service and improving stability under concurrent indexing.
- Introduced `index_google_drive_selected_files` function to enable indexing of multiple user-selected files in parallel, improving efficiency.
- Refactored existing indexing logic to handle batch processing, including error handling for individual file failures.
- Added unit tests for the new batch indexing functionality, ensuring robustness and proper error collection during the indexing process.
- Introduced `download_file_to_disk` method to stream files directly to disk in chunks, reducing memory usage during downloads.
- Updated `download_and_extract_content` function to utilize the new streaming download method for binary files, enhancing efficiency in handling large files.
- Improved error handling for download operations, providing clearer feedback on failures.
- Refactored Linear and Notion indexers to utilize the shared IndexingPipelineService for improved document deduplication, summarization, chunking, and embedding with bounded parallel indexing.
- Updated the `_build_connector_doc` function in both indexers to create ConnectorDocument instances with enhanced metadata and fallback summaries.
- Modified the `index_linear_issues` and `index_notion_pages` functions to return a tuple of (indexed_count, skipped_count, warning_or_error_message) for better error handling and reporting.
- Added unit tests for both indexers to validate the new parallel processing logic and ensure correct document creation and indexing behavior.
- Added a reentrant lock to ensure thread-safe access to the tokenizer and embedding model, preventing runtime errors during concurrent operations.
- Updated the `truncate_for_embedding` and `embed_text` functions to utilize the lock, ensuring safe execution in multi-threaded environments.
- Enhanced the `embed_texts` function to maintain thread safety while processing multiple texts for embedding.
- Removed the `documentTypeCountsAtom` and its associated logic from the document query atoms.
- Introduced `useZeroDocumentTypeCounts` hook to provide real-time document type counts, enhancing responsiveness as documents are indexed.
- Updated components to utilize the new hook for fetching document type counts, ensuring instant updates in the UI.
- Refactored Confluence and Jira indexers to utilize the shared IndexingPipelineService for improved document processing.
- Updated the `_build_connector_doc` function in both indexers to create ConnectorDocument instances with enhanced metadata and fallback summaries.
- Modified the `index_confluence_pages` and `index_jira_issues` functions to return a tuple of (indexed_count, skipped_count, warning_or_error_message) for better error handling and reporting.
- Added unit tests for both indexers to validate the new parallel processing logic and ensure correct document creation and indexing behavior.
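One plausible reason for the embedding lock being reentrant rather than a plain lock is that one locked function may call another locked function on the same thread. The sketch below shows that pattern with `threading.RLock`. The function names mirror those mentioned above, but the bodies are toy placeholders, and the assumption that `embed_text` calls `truncate_for_embedding` while holding the lock is mine.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

_embedding_lock = threading.RLock()
MAX_TOKENS = 4  # hypothetical limit, chosen only for the demo


def truncate_for_embedding(text: str) -> str:
    with _embedding_lock:
        tokens = text.split()  # placeholder for a real tokenizer
        return " ".join(tokens[:MAX_TOKENS])


def embed_text(text: str) -> List[float]:
    with _embedding_lock:
        # Re-acquires the lock on the same thread; a plain Lock would deadlock here.
        truncated = truncate_for_embedding(text)
        return [float(len(word)) for word in truncated.split()]  # fake vector


def embed_texts(texts: List[str]) -> List[List[float]]:
    # Worker threads contend for the lock, so tokenizer access is serialized.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(embed_text, texts))
```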

vercel Bot commented Mar 27, 2026

@AnishSarkar22 is attempting to deploy a commit to the "Rohan Verma's projects" team on Vercel.

A member of the Team first needs to authorize it.

@AnishSarkar22 AnishSarkar22 changed the title from "feat: refactoring indexing pipelines for some connectors" to "feat: refactor indexing pipelines for some connectors" Mar 27, 2026

@recurseml recurseml Bot left a comment

Review by RecurseML

🔍 Review performed on c7ace83..0bc1c76

✨ No bugs found, your code is sparkling clean

✅ Files analyzed, no issues (41)

surfsense_backend/app/agents/new_chat/tools/google_calendar/update_event.py
surfsense_backend/app/connectors/google_drive/__init__.py
surfsense_backend/app/connectors/google_drive/client.py
surfsense_backend/app/connectors/google_drive/content_extractor.py
surfsense_backend/app/indexing_pipeline/document_hashing.py
surfsense_backend/app/indexing_pipeline/indexing_pipeline_service.py
surfsense_backend/app/routes/search_source_connectors_routes.py
surfsense_backend/app/services/google_calendar/kb_sync_service.py
surfsense_backend/app/tasks/connector_indexers/confluence_indexer.py
surfsense_backend/app/tasks/connector_indexers/google_calendar_indexer.py
surfsense_backend/app/tasks/connector_indexers/google_drive_indexer.py
surfsense_backend/app/tasks/connector_indexers/google_gmail_indexer.py
surfsense_backend/app/tasks/connector_indexers/jira_indexer.py
surfsense_backend/app/tasks/connector_indexers/linear_indexer.py
surfsense_backend/app/tasks/connector_indexers/notion_indexer.py
surfsense_backend/app/utils/document_converters.py
surfsense_backend/tests/integration/indexing_pipeline/test_calendar_pipeline.py
surfsense_backend/tests/integration/indexing_pipeline/test_drive_pipeline.py
surfsense_backend/tests/integration/indexing_pipeline/test_gmail_pipeline.py
surfsense_backend/tests/integration/indexing_pipeline/test_index_batch.py
surfsense_backend/tests/integration/indexing_pipeline/test_migrate_legacy_docs.py
surfsense_backend/tests/unit/connector_indexers/conftest.py
surfsense_backend/tests/unit/connector_indexers/test_confluence_parallel.py
surfsense_backend/tests/unit/connector_indexers/test_google_drive_parallel.py
surfsense_backend/tests/unit/connector_indexers/test_jira_parallel.py
surfsense_backend/tests/unit/connector_indexers/test_linear_parallel.py
surfsense_backend/tests/unit/connector_indexers/test_notion_parallel.py
surfsense_backend/tests/unit/indexing_pipeline/test_document_hashing.py
surfsense_backend/tests/unit/indexing_pipeline/test_index_batch.py
surfsense_backend/tests/unit/indexing_pipeline/test_index_batch_parallel.py
surfsense_backend/tests/unit/indexing_pipeline/test_migrate_legacy_docs.py
surfsense_web/app/dashboard/[search_space_id]/documents/(manage)/components/DocumentsTableShell.tsx
surfsense_web/atoms/documents/document-mutation.atoms.ts
surfsense_web/atoms/documents/document-query.atoms.ts
surfsense_web/components/assistant-ui/connector-popup.tsx
surfsense_web/components/hitl-edit-panel/hitl-edit-panel.tsx
surfsense_web/components/onboarding-tour.tsx
surfsense_web/components/tool-ui/google-calendar/update-event.tsx
surfsense_web/contracts/enums/toolIcons.tsx
surfsense_web/hooks/use-zero-document-type-counts.ts
surfsense_web/lib/query-client/cache-keys.ts

⏭️ Files skipped (1)
surfsense_backend/tests/unit/connector_indexers/__init__.py

…t methods

- Implemented per-thread HTTP transport for concurrent downloads to ensure thread safety.
- Refactored `download_file` and `download_file_to_disk` methods to utilize blocking calls on separate threads, improving performance during file operations.
- Added logging to track the start and end of download and export processes, providing better visibility into execution time.
- Updated unit tests to verify parallel execution of download and export operations, ensuring efficiency in handling multiple requests.
…fe operations

- Added logging to track the start and end of file download and export processes, improving visibility into execution time.
- Implemented per-thread HTTP transport for concurrent downloads and exports, ensuring thread safety.
- Refactored download and export methods to utilize resolved credentials, enhancing functionality.
- Updated unit tests to validate the new threading and logging features, ensuring robust parallel execution.
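The per-thread transport idea can be sketched with `threading.local`: each worker thread lazily creates its own client object and never shares it. `FakeTransport`, `get_transport`, `download`, and `download_all` are hypothetical names used only for this illustration; the real client would hold an HTTP object built from the resolved credentials.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

_thread_state = threading.local()


class FakeTransport:
    """Placeholder for an HTTP client object that is not thread-safe."""

    def __init__(self) -> None:
        self.owner = threading.get_ident()  # thread that created this transport

    def fetch(self, file_id: str) -> Tuple[str, int]:
        # A real transport would perform the HTTP request here.
        return file_id, self.owner


def get_transport() -> FakeTransport:
    """Create the transport lazily, once per thread."""
    transport = getattr(_thread_state, "transport", None)
    if transport is None:
        transport = FakeTransport()
        _thread_state.transport = transport
    return transport


def download(file_id: str) -> Tuple[str, int, int]:
    _, owner = get_transport().fetch(file_id)
    return file_id, owner, threading.get_ident()


def download_all(file_ids: List[str]) -> List[Tuple[str, int, int]]:
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(download, file_ids))
```

In every result the transport's owner matches the thread that used it, which is the property that makes concurrent downloads safe without a shared lock.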
…indexer

- Modified the `_should_skip_file` function to skip previously failed documents during processing, improving error handling.
- Updated the corresponding test to reflect the new behavior, ensuring that failed documents are correctly identified and skipped during automatic sync.
…ction

- Introduced a new utility for parsing .xlsx files into markdown format, enhancing the ability to process Excel documents natively.
- Updated the Google Drive content extractor to utilize the new Excel parsing functionality, allowing for better handling of spreadsheet files.
- Enhanced file type detection and export logic to support various document formats, improving overall content extraction accuracy.
- Added unit tests to ensure the correctness of the new Excel parsing feature and its integration with existing content extraction workflows.
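To make the .xlsx-to-markdown step concrete, the sketch below converts already-extracted rows into a markdown table. Reading the workbook itself is left out, and `rows_to_markdown`, along with its treatment of the first row as the header, is an assumption for illustration rather than the utility added in this PR.

```python
from typing import Sequence


def rows_to_markdown(rows: Sequence[Sequence[object]]) -> str:
    """Render rows as a markdown table, using the first row as the header."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row: Sequence[object]) -> str:
        # Empty cells become blank; pipes are escaped so they do not split columns.
        cells = ["" if c is None else str(c).replace("|", "\\|") for c in row]
        cells += [""] * (width - len(cells))  # pad short rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in body)
    return "\n".join(lines)
```

For example, `rows_to_markdown([["Name", "Qty"], ["Apple", 3]])` yields a two-column table with a separator row, which a downstream chunker can treat like any other markdown.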
@AnishSarkar22 AnishSarkar22 marked this pull request as ready for review March 27, 2026 17:06
… and UI components

- Implemented a new export endpoint in the backend to support exporting documents in various formats (PDF, DOCX, HTML, LaTeX, EPUB, ODT, plain text).
- Enhanced DocumentNode and FolderTreeView components to include export options in context and dropdown menus.
- Created shared ExportMenuItems component for consistent export options across the application.
- Integrated loading indicators for export actions to improve user experience.
@MODSetter MODSetter merged commit 30034d6 into MODSetter:dev Mar 27, 2026
5 of 9 checks passed

2 participants