Add compression context cache#2440
Conversation
…sion Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
This will allow a couple of the compression types to avoid constantly creating a context object internally every chunk when the pipeline is re-used (which it is most times). Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Adds some helper routines to avoid code duplication as well in enabling this Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
|
Hi Kimball, this is the branch that needs some profiling against the dev tip right? |
|
|
|
I run part of my research project's pipeline on this branch and the tip of main, results below, which is somewhat different from Pierre's. MaterialsEnvironment
EXR Builds
Test Data60 images retrieved from the Netflix Open Content "SolLevance" (20) and "Sparks" (20) shows, as well as programmatically generated images (20). As a starting point, all images are RGBA-half-4K(4096x2160) uncompressed scanline EXR, so all of them are roughly 70,778,880 bytes (70 Mbs). The programmatically generated images (20) include a few native 4K generations from openEXR's example code https://github.com/AcademySoftwareFoundation/openexr/blob/main/src/examples/drawImage.cpp . I have patched those codes so they render natively to 4K. I also collected a few other mundan edge cases, such as all 0, all 1, checkerboard, etc. A few special frames from the movies, such as text-only credit roll, are categorized as "programmatically generated". A Stanford bunny render is also included, because i cannot afford not to.
Then, the 60 starting images are converted to two more sets with different data layouts -- 64x64 tiled, and 512x512 tiled. All uncompressed, these tiled images are still roughly 70Mbs. In total, we have 180 EXR images with roughly the same size (metadata makes their sizes slightly differ), while having different data layout and contents. MethodologyBaseline TimingBefore we run The rationale of the Baseline Timing is to check the required raw IO time to read or write the amount of data corresponding to 4K-RGBA-half pixels in the test environment. Subsequently, all timing of read and write tasks are reported as time factor, that is, the timing output of the Usually the time factor would be greater than 1, implying using a compression is slower than simply IO the pixels from the disk; however, there are cases when the factor is less than 1, notably with the "all 1" or "all 0" synthetic images, wherein the compressor just repeat a number 4K times in the RAM. I was hoping this metrology setup can make the test result more environment agnostic, compared to listing the absolute timing numbers from ResultsHere is the table of timing factor (averaged across all data layouts) of my Smaller timing factor means better performance, and are marked in green. Bigger timing factor means slower, and are marked in red.
Interestingly the patch made write faster while read a tad slower. Here is the plot of time factor of reading and writing tasks, grouped by compression type:
Here are the breakout plots for different compression types in my run, groups by data layouts. As expected the small tile case (tile 64x64) introduce a performance hit. What is interesting to see is that htj2k have a lot of data points on both top and bottom being considered "outliers" that would be outside of the standard "Interquartile range from Q1 to Q3" statistical setup, while zip only have outliers on the bottom -- which are probably those all 0 and all 1s. I don't have an interpretation for this, but it isn't related about this patch either.
|
|
Credits to the Google Gemini bot for a lot of testing/visualization scaffolding code. |
|
@lji-ilm Which platform(s) did you run the test on? For the HTJ2K compressor, I was expecting gains primarily on Windows, where the cost of pre-allocating memory is high and, hence, tearing up/down the compressor once for each chunk has a significant. |
It's on linux (Alma) so i wasn't surprised that it was somewhat different from yours. I probably should say this on the first post, will edit that in. Kimball mentioned he also expect a slightly performance hit on read but we did not expect to see the gain on writing performance. |
| #include "internal_decompress.h" | ||
| #include "internal_structs.h" | ||
| #include "internal_xdr.h" | ||
| #include "internal_xdr.h" |
There was a problem hiding this comment.
| #include "internal_xdr.h" |
duplicate include of internal_xdr.h
| } | ||
| #endif | ||
|
|
||
| #endif /* OPENEXR_CORE_DECOMPRESS_H */ |
There was a problem hiding this comment.
| #endif /* OPENEXR_CORE_DECOMPRESS_H */ | |
| #endif /* OPENEXR_CORE_LEGACY_STRUCTS_H */ |
|
|
||
| void | ||
| testDeepZIPSCompression (const std::string& tempdir) | ||
| {} |
There was a problem hiding this comment.
Claude suggested this additioal test to verify that the context pointer is the same on subsequent chunks, and verify it's freed at the end.
| {} | |
| //////////////////////////////////////// | |
| static void | |
| writeZIPFile (const std::string& filename) | |
| { | |
| exr_context_t f; | |
| int partidx; | |
| exr_context_initializer_t cinit = EXR_DEFAULT_CONTEXT_INITIALIZER; | |
| exr_attr_box2i_t dataW = {0, 0, IMG_WIDTH - 1, IMG_HEIGHT - 1}; | |
| EXRCORE_TEST_RVAL (exr_start_write ( | |
| &f, filename.c_str (), EXR_WRITE_FILE_DIRECTLY, &cinit)); | |
| EXRCORE_TEST_RVAL ( | |
| exr_add_part (f, "scan", EXR_STORAGE_SCANLINE, &partidx)); | |
| EXRCORE_TEST_RVAL (exr_initialize_required_attr_simple ( | |
| f, partidx, IMG_WIDTH, IMG_HEIGHT, EXR_COMPRESSION_ZIP)); | |
| EXRCORE_TEST_RVAL (exr_set_data_window (f, partidx, &dataW)); | |
| EXRCORE_TEST_RVAL (exr_add_channel ( | |
| f, partidx, "H", EXR_PIXEL_HALF, EXR_PERCEPTUALLY_LOGARITHMIC, 1, 1)); | |
| EXRCORE_TEST_RVAL (exr_write_header (f)); | |
| std::vector<uint16_t> row (IMG_WIDTH, 0x3c00u); /* half 1.0 */ | |
| exr_encode_pipeline_t encoder = EXR_ENCODE_PIPELINE_INITIALIZER; | |
| exr_chunk_info_t cinfo; | |
| int32_t scansperchunk; | |
| EXRCORE_TEST_RVAL (exr_get_scanlines_per_chunk (f, 0, &scansperchunk)); | |
| bool first = true; | |
| for (int y = 0; y < IMG_HEIGHT; y += scansperchunk) | |
| { | |
| EXRCORE_TEST_RVAL (exr_write_scanline_chunk_info (f, 0, y, &cinfo)); | |
| if (first) | |
| { | |
| EXRCORE_TEST_RVAL (exr_encoding_initialize (f, 0, &cinfo, &encoder)); | |
| } | |
| else | |
| { | |
| EXRCORE_TEST_RVAL (exr_encoding_update (f, 0, &cinfo, &encoder)); | |
| } | |
| encoder.channels[0].encode_from_ptr = (const uint8_t*) row.data (); | |
| encoder.channels[0].user_pixel_stride = 2; | |
| encoder.channels[0].user_line_stride = 0; | |
| if (first) | |
| { | |
| EXRCORE_TEST_RVAL ( | |
| exr_encoding_choose_default_routines (f, 0, &encoder)); | |
| } | |
| EXRCORE_TEST_RVAL (exr_encoding_run (f, 0, &encoder)); | |
| first = false; | |
| } | |
| EXRCORE_TEST_RVAL (exr_encoding_destroy (f, &encoder)); | |
| EXRCORE_TEST_RVAL (exr_finish (&f)); | |
| } | |
| void | |
| testCompressionContextCache (const std::string& tempdir) | |
| { | |
| std::string filename = tempdir + "compression_context_cache.exr"; | |
| writeZIPFile (filename); | |
| exr_context_t f; | |
| exr_context_initializer_t cinit = EXR_DEFAULT_CONTEXT_INITIALIZER; | |
| EXRCORE_TEST_RVAL (exr_start_read (&f, filename.c_str (), &cinit)); | |
| exr_decode_pipeline_t decoder = EXR_DECODE_PIPELINE_INITIALIZER; | |
| exr_chunk_info_t cinfo; | |
| int32_t scansperchunk; | |
| int32_t height; | |
| EXRCORE_TEST_RVAL (exr_get_scanlines_per_chunk (f, 0, &scansperchunk)); | |
| exr_attr_box2i_t dw; | |
| EXRCORE_TEST_RVAL (exr_get_data_window (f, 0, &dw)); | |
| height = dw.max.y - dw.min.y + 1; | |
| /* Need at least 2 chunks to verify reuse */ | |
| EXRCORE_TEST (height > scansperchunk); | |
| void* first_context = NULL; | |
| bool first = true; | |
| int chunks_decoded = 0; | |
| for (int y = dw.min.y; y <= dw.max.y; y += scansperchunk) | |
| { | |
| EXRCORE_TEST_RVAL (exr_read_scanline_chunk_info (f, 0, y, &cinfo)); | |
| if (first) | |
| { | |
| EXRCORE_TEST_RVAL (exr_decoding_initialize (f, 0, &cinfo, &decoder)); | |
| } | |
| else | |
| { | |
| EXRCORE_TEST_RVAL (exr_decoding_update (f, 0, &cinfo, &decoder)); | |
| } | |
| /* skip output — just decompress */ | |
| for (int c = 0; c < decoder.channel_count; ++c) | |
| { | |
| decoder.channels[c].decode_to_ptr = NULL; | |
| decoder.channels[c].user_pixel_stride = 0; | |
| decoder.channels[c].user_line_stride = 0; | |
| } | |
| if (first) | |
| { | |
| EXRCORE_TEST_RVAL ( | |
| exr_decoding_choose_default_routines (f, 0, &decoder)); | |
| } | |
| EXRCORE_TEST_RVAL (exr_decoding_run (f, 0, &decoder)); | |
| if (first) | |
| { | |
| /* context must be populated after the first chunk */ | |
| EXRCORE_TEST (decoder.compression_context != NULL); | |
| EXRCORE_TEST (decoder.free_compression_context != NULL); | |
| first_context = decoder.compression_context; | |
| } | |
| else | |
| { | |
| /* same pointer every subsequent chunk — context is reused, not recreated */ | |
| EXRCORE_TEST (decoder.compression_context == first_context); | |
| } | |
| first = false; | |
| ++chunks_decoded; | |
| } | |
| EXRCORE_TEST (chunks_decoded >= 2); | |
| /* destroy must clear the context */ | |
| EXRCORE_TEST_RVAL (exr_decoding_destroy (f, &decoder)); | |
| EXRCORE_TEST (decoder.compression_context == NULL); | |
| EXRCORE_TEST (decoder.free_compression_context == NULL); | |
| EXRCORE_TEST_RVAL (exr_finish (&f)); | |
| remove (filename.c_str ()); | |
| } | |
cary-ilm
left a comment
There was a problem hiding this comment.
A couple typos and a suggested test from Claude.







This adds a cache for a (de)compression context to the core decode pipeline struct.
This is a backwards compatible change (i.e. older versions of code compiled against previous headers will still work), with the exception that it does add the requirement that people actually use the provided initializer when defining exr_decode_pipeline_t things. This has been relaxed to assume that if it's not the new size, it must be the old size, which should work, although in theory if it's uninitialized, it could have a bad value (but should also trigger an uninitialized read if people check their sanitizers, which is good).
This should supercede #2128 with a backwards compatible mechanism