Commit c1c9f3e
Rewrite the parquet input adapter manager (#704)
* Arrow row-by-row processing: ColumnDispatcher, RecordBatchRowProcessor
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Remove old C++ reader classes and file wrappers
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Rewrite ParquetInputAdapterManager for RecordBatch input
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Python adapter: RecordBatch stream factories and C Stream Interface
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Tests for parquet input adapter rewrite
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Qualify arrow:: as ::arrow:: in writer headers to avoid namespace ambiguity
The introduction of namespace csp::adapters::arrow (for the new
ColumnDispatcher/RecordBatchRowProcessor classes) creates ambiguity when
writer-side headers use unqualified arrow:: inside namespace
csp::adapters::parquet. The compiler finds the sibling csp::adapters::arrow
namespace before the global ::arrow namespace.
Also forward-declares ColumnDispatcher and RecordBatchRowProcessor in
ParquetInputAdapterManager.h (moving full includes to .cpp) and adds
direct includes for csp/core/Exception.h and arrow/table.h that were
previously provided transitively through the now-deleted reader headers.
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
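The lookup behaviour described in this entry can be reproduced in isolation. The sketch below is not taken from the CSP sources: the global `arrow` namespace is a stand-in for the real Arrow library, and `origin()`, `unqualified()`, and `qualified()` are hypothetical helpers that exist only to show which namespace the name resolves to.

```cpp
#include <cassert>
#include <string>

// Stand-in for the Apache Arrow library's top-level namespace.
namespace arrow {
inline std::string origin() { return "::arrow"; }
}

// Stand-in for the new sibling namespace introduced by this commit.
namespace csp::adapters::arrow {
inline std::string origin() { return "csp::adapters::arrow"; }
}

namespace csp::adapters::parquet {
// Unqualified lookup of `arrow` walks outward from csp::adapters::parquet,
// finds csp::adapters::arrow in the enclosing csp::adapters scope, and stops.
inline std::string unqualified() { return arrow::origin(); }

// A leading :: starts the lookup at the global namespace instead.
inline std::string qualified() { return ::arrow::origin(); }
}
```

In real headers the symptom is usually a compile error rather than a silently different call, because names such as library types do not exist in the sibling namespace. Writing `::arrow::` removes the dependence on which namespaces happen to be visible in enclosing scopes.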
* Fix Arrow 21 compatibility: use out-parameter FileReader::Make
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Add comprehensive test coverage for parquet input adapter
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Remove dead m_rbSources member from DictBasketReaderRecord, and format
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Optimize hot path: InlineReader for zero-overhead value extraction
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Add comprehensive tests for all Arrow types and edge cases
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Address review: remove dead code, hoist loop invariant, deduplicate string lambdas
- Remove unused m_basketSymbolColumn member
- Remove dead properties.get line before CSP_THROW
- Hoist phase variable out of loop
- Use generic lambda for string/binary extraction
- Leave timeUnitMultiplier inline because constexpr fails with CSP_THROW under this C++20 build
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
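As a rough illustration of the `timeUnitMultiplier` item above, here is a minimal sketch of a plain inline (non-constexpr) helper with a throwing fallback. Everything in it is an assumption for illustration: the `TimeUnit` enum, the nanosecond target resolution, and the use of `std::invalid_argument` in place of `CSP_THROW` are not taken from the CSP sources.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for an Arrow-style time-unit enum.
enum class TimeUnit { Second, Milli, Micro, Nano };

// Returns how many nanoseconds one tick of `unit` represents (assumed target
// resolution). Declared inline rather than constexpr, mirroring the commit's
// choice; the throw stands in for CSP_THROW.
inline std::int64_t timeUnitMultiplier(TimeUnit unit) {
    switch (unit) {
        case TimeUnit::Second: return 1'000'000'000;
        case TimeUnit::Milli:  return 1'000'000;
        case TimeUnit::Micro:  return 1'000;
        case TimeUnit::Nano:   return 1;
    }
    throw std::invalid_argument("unsupported time unit");
}
```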
* Address review: private SourceEntry, const fields, deduplicate batch logic, hoist columns
- Make SourceEntry private in RecordBatchRowProcessor
- Make m_arrowTypeId and m_columnName const
- Replace duplicated first-batch loop with fetchNextBatch call
- Hoist columns() out of rebindSource loop
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Address review: static lambda, remove tz member, remove redundant override
- Make viewToString lambda static in createColumnDispatcher
- Remove m_defaultTimezone member, validate tz inline and discard
- Remove redundant doReadNextValue override from LambdaReader (base class handles it)
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* perf: add prefetch and parallel column decode for parquet reading
Add PrefetchingRecordBatchReader that decodes the next RecordBatch on a
background thread while CSP processes the current batch. Also enable
Arrow's use_threads and pre_buffer for parallel column decoding and IO
range caching.
The PrefetchingRecordBatchReader co-owns the FileReader (via shared_ptr)
to guarantee the FileReader outlives the background prefetch thread,
even when CSP stops mid-file.
Benchmarks show ~15% average speedup, up to 1.5x on filtered reads and
wide structs, with no regressions.
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
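The prefetch-with-co-ownership idea described in this entry can be sketched with the standard library alone. This is not the `PrefetchingRecordBatchReader` from the commit: `Batch`, `CountingSource`, and `PrefetchingReader` are hypothetical names, and `std::async` stands in for whatever threading the real class used.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

using Batch = std::vector<int>;  // stand-in for a RecordBatch

// Hypothetical synchronous source yielding `n` single-value batches.
class CountingSource {
public:
    explicit CountingSource(int n) : m_remaining(n) {}
    std::optional<Batch> next() {
        if (m_remaining == 0) return std::nullopt;
        --m_remaining;
        return Batch{m_next++};
    }
private:
    int m_remaining;
    int m_next = 0;
};

// Decodes the next batch on a background task while the caller processes
// the current one. Only one task touches the source at any time.
template <typename Source>
class PrefetchingReader {
public:
    explicit PrefetchingReader(std::shared_ptr<Source> source)
        : m_source(std::move(source)) { schedule(); }

    std::optional<Batch> next() {
        if (!m_pending.valid()) return std::nullopt;  // already exhausted
        std::optional<Batch> current = m_pending.get();
        if (current) schedule();  // start decoding the following batch
        return current;
    }
private:
    void schedule() {
        // The task holds its own shared_ptr copy, so the source stays alive
        // until the task finishes, even if the caller stops early.
        m_pending = std::async(std::launch::async,
                               [src = m_source] { return src->next(); });
    }
    std::shared_ptr<Source> m_source;
    std::future<std::optional<Batch>> m_pending;
};
```

Capturing the `shared_ptr` by value in the task is what plays the role of co-owning the `FileReader`: dropping every external handle does not invalidate the object the background work is reading from.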
* Address comments
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Address review: fix bugs, remove dead code, clean up test scaffolding
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Clean up tests
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
* Replace PrefetchingRecordBatchReader with async generator API
Redesign the batch-reading interface to use Arrow's async generator
(GetRecordBatchGenerator) natively rather than wrapping it in a
synchronous RecordBatchReader subclass.
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
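To show the general shape of a pull-based async generator, as opposed to a synchronous reader wrapper, here is a minimal sketch using `std::function` and `std::future`. It does not use Arrow's own generator or future types; `BatchGenerator`, `makeCountingGenerator`, and `sumAll` are hypothetical, and the convention that a null pointer marks end of stream is an assumption of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <memory>
#include <vector>

using Batch = std::vector<int>;  // stand-in for a RecordBatch
using BatchPtr = std::shared_ptr<Batch>;
// Each call asks for one more batch and returns a future for it.
using BatchGenerator = std::function<std::future<BatchPtr>()>;

// Hypothetical generator producing `count` single-value batches, then nullptr.
inline BatchGenerator makeCountingGenerator(int count) {
    auto next = std::make_shared<int>(0);
    return [next, count]() {
        int i = (*next)++;
        return std::async(std::launch::async, [i, count]() -> BatchPtr {
            if (i >= count) return nullptr;  // end-of-stream marker
            return std::make_shared<Batch>(Batch{i});
        });
    };
}

// Drains the generator while keeping one request in flight: the next batch
// is requested before the current one is processed.
inline int sumAll(const BatchGenerator& gen) {
    int total = 0;
    std::future<BatchPtr> pending = gen();
    while (BatchPtr batch = pending.get()) {
        pending = gen();
        for (int v : *batch) total += v;
    }
    return total;
}
```

With this shape the consumer decides how far ahead to read simply by choosing when to call the generator again, so no dedicated reader subclass is needed to get the overlap.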
* Fix allow_missing_files not honored in split-columns native path
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
---------
Signed-off-by: Arham Chopra <arham.chopra@cubistsystematic.com>
1 parent 3f98441 commit c1c9f3e
37 files changed
Lines changed: 5823 additions & 3573 deletions
File tree
- cpp
  - cmake/modules
  - csp
    - adapters
      - arrow
      - parquet
    - python/adapters
- csp
  - adapters
  - tests/adapters