Skip to content

Add compression context cache#2440

Open
kdt3rd wants to merge 7 commits into
AcademySoftwareFoundation:mainfrom
kdt3rd:add_compression_ctxt_cache
Open

Add compression context cache#2440
kdt3rd wants to merge 7 commits into
AcademySoftwareFoundation:mainfrom
kdt3rd:add_compression_ctxt_cache

Conversation

@kdt3rd

@kdt3rd kdt3rd commented May 28, 2026

Copy link
Copy Markdown
Contributor

This adds a cache for a (de)compression context to the core decode pipeline struct.

This is a backwards compatible change (i.e. older versions of code compiled against previous headers will still work), with the exception that it does add the requirement that people actually use the provided initializer when defining exr_decode_pipeline_t things. This has been relaxed to assume that if it's not the new size, it must be the old size, which should work, although in theory if it's uninitialized, it could have a bad value (but should also trigger an uninitialized read if people check their sanitizers, which is good).

This should supercede #2128 with a backwards compatible mechanism

kdt3rd added 6 commits May 28, 2026 21:56
…sion

Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
This will allow a couple of the compression types to avoid constantly
creating a context object internally every chunk when the pipeline is
re-used (which it is most times).

Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Adds some helper routines to avoid code duplication as well in enabling
this

Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
@kdt3rd kdt3rd added this to the Next Minor Release milestone May 28, 2026
Signed-off-by: Kimball Thurston <kdt3rd@gmail.com>
@lji-ilm

lji-ilm commented May 29, 2026

Copy link
Copy Markdown
Contributor

Hi Kimball, this is the branch that needs some profiling against the dev tip right?
Looking at the code change, probably both zip and htj2k would need a test..?

@kdt3rd kdt3rd added the v3.5 label May 29, 2026
@palemieux

Copy link
Copy Markdown
Collaborator

.\exrmetrics.exe -z htj2k32 ACES_OT_VWG_SampleFrames\ACES_OT_VWG_SampleFrames.0041.exr --passes 20 is 40% faster on Windows compared to main because of the reduced memory allocation overhead.

@lji-ilm

lji-ilm commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

I run part of my research project's pipeline on this branch and the tip of main, results below, which is somewhat different from Pierre's.
Maybe this is a bit too glorified than needed for this particular patch, but I was also thinking about asking people for feedbacks for the ongoing project, so I'll use this opportunity to show a mini pipeline run with a few key ideas, and feel free to comment on the methodology too, in addition to the result.

Materials

Environment

  • Alma 9
  • Cmake 3.26
  • gcc 11.5

EXR Builds

  • EXR built on main tip, commit [f19c773], in the results identified as main.
  • EXR built on Kimball's patch, commit kdt3rd@6d85a45 , in the results identified as patch.

Test Data

60 images retrieved from the Netflix Open Content "SolLevance" (20) and "Sparks" (20) shows, as well as programmatically generated images (20). As a starting point, all images are RGBA-half-4K(4096x2160) uncompressed scanline EXR, so all of them are roughly 70,778,880 bytes (70 Mbs).

The programmatically generated images (20) include a few native 4K generations from openEXR's example code https://github.com/AcademySoftwareFoundation/openexr/blob/main/src/examples/drawImage.cpp . I have patched those codes so they render natively to 4K. I also collected a few other mundan edge cases, such as all 0, all 1, checkerboard, etc. A few special frames from the movies, such as text-only credit roll, are categorized as "programmatically generated". A Stanford bunny render is also included, because i cannot afford not to.

Screenshot 2026-06-02 at 11 11 54 PM

Then, the 60 starting images are converted to two more sets with different data layouts -- 64x64 tiled, and 512x512 tiled. All uncompressed, these tiled images are still roughly 70Mbs. In total, we have 180 EXR images with roughly the same size (metadata makes their sizes slightly differ), while having different data layout and contents.

Methodology

Baseline Timing

Before we run exrmetrics, a small C++ program like the following was run on a few of these 70Mb images, and the timing collected is averaged into a Baseline Timing number:

// collect baseline read timing
std::ifstream file(filePath, std::ios::binary);
auto startTime = std::chrono::steady_clock::now();
file.read(buffer.data(), fileSize));
auto endTime = std::chrono::steady_clock::now();

// collect baseline write timing
std::ofstream file(filePath, std::ios::binary);
auto startTime = std::chrono::steady_clock::now();
file.write(buffer.data(), fileSize));
auto endTime = std::chrono::steady_clock::now();

The rationale of the Baseline Timing is to check the required raw IO time to read or write the amount of data corresponding to 4K-RGBA-half pixels in the test environment. Subsequently, all timing of read and write tasks are reported as time factor, that is, the timing output of the exrmetrics run divided by this baseline timing (read or write, accordingly).

Usually the time factor would be greater than 1, implying using a compression is slower than simply IO the pixels from the disk; however, there are cases when the factor is less than 1, notably with the "all 1" or "all 0" synthetic images, wherein the compressor just repeat a number 4K times in the RAM. I was hoping this metrology setup can make the test result more environment agnostic, compared to listing the absolute timing numbers from exrmetrics.

Results

Here is the table of timing factor (averaged across all data layouts) of my exrmetrics runs on the 180 images, all tests are single threaded. I didn't get to analyzing multi-threading in this mini project.

Smaller timing factor means better performance, and are marked in green. Bigger timing factor means slower, and are marked in red.

image

Interestingly the patch made write faster while read a tad slower.

Here is the plot of time factor of reading and writing tasks, grouped by compression type:

tf_compression_read tf_compression_write

Here are the breakout plots for different compression types in my run, groups by data layouts. As expected the small tile case (tile 64x64) introduce a performance hit. What is interesting to see is that htj2k have a lot of data points on both top and bottom being considered "outliers" that would be outside of the standard "Interquartile range from Q1 to Q3" statistical setup, while zip only have outliers on the bottom -- which are probably those all 0 and all 1s. I don't have an interpretation for this, but it isn't related about this patch either.

breakout_htj2k breakout_zip breakout_zips

@lji-ilm

lji-ilm commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Credits to the Google Gemini bot for a lot of testing/visualization scaffolding code.
source data dump:
kimball_benchmark2.csv

@palemieux

Copy link
Copy Markdown
Collaborator

@lji-ilm Which platform(s) did you run the test on?

For the HTJ2K compressor, I was expecting gains primarily on Windows, where the cost of pre-allocating memory is high and, hence, tearing up/down the compressor once for each chunk has a significant.

@lji-ilm

lji-ilm commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

@lji-ilm Which platform(s) did you run the test on?

For the HTJ2K compressor, I was expecting gains primarily on Windows, where the cost of pre-allocating memory is high and, hence, tearing up/down the compressor once for each chunk has a significant.

It's on linux (Alma) so i wasn't surprised that it was somewhat different from yours. I probably should say this on the first post, will edit that in. Kimball mentioned he also expect a slightly performance hit on read but we did not expect to see the gain on writing performance.

#include "internal_decompress.h"
#include "internal_structs.h"
#include "internal_xdr.h"
#include "internal_xdr.h"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "internal_xdr.h"

duplicate include of internal_xdr.h

}
#endif

#endif /* OPENEXR_CORE_DECOMPRESS_H */

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#endif /* OPENEXR_CORE_DECOMPRESS_H */
#endif /* OPENEXR_CORE_LEGACY_STRUCTS_H */


void
testDeepZIPSCompression (const std::string& tempdir)
{}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude suggested this additioal test to verify that the context pointer is the same on subsequent chunks, and verify it's freed at the end.

Suggested change
{}
////////////////////////////////////////
static void
writeZIPFile (const std::string& filename)
{
exr_context_t f;
int partidx;
exr_context_initializer_t cinit = EXR_DEFAULT_CONTEXT_INITIALIZER;
exr_attr_box2i_t dataW = {0, 0, IMG_WIDTH - 1, IMG_HEIGHT - 1};
EXRCORE_TEST_RVAL (exr_start_write (
&f, filename.c_str (), EXR_WRITE_FILE_DIRECTLY, &cinit));
EXRCORE_TEST_RVAL (
exr_add_part (f, "scan", EXR_STORAGE_SCANLINE, &partidx));
EXRCORE_TEST_RVAL (exr_initialize_required_attr_simple (
f, partidx, IMG_WIDTH, IMG_HEIGHT, EXR_COMPRESSION_ZIP));
EXRCORE_TEST_RVAL (exr_set_data_window (f, partidx, &dataW));
EXRCORE_TEST_RVAL (exr_add_channel (
f, partidx, "H", EXR_PIXEL_HALF, EXR_PERCEPTUALLY_LOGARITHMIC, 1, 1));
EXRCORE_TEST_RVAL (exr_write_header (f));
std::vector<uint16_t> row (IMG_WIDTH, 0x3c00u); /* half 1.0 */
exr_encode_pipeline_t encoder = EXR_ENCODE_PIPELINE_INITIALIZER;
exr_chunk_info_t cinfo;
int32_t scansperchunk;
EXRCORE_TEST_RVAL (exr_get_scanlines_per_chunk (f, 0, &scansperchunk));
bool first = true;
for (int y = 0; y < IMG_HEIGHT; y += scansperchunk)
{
EXRCORE_TEST_RVAL (exr_write_scanline_chunk_info (f, 0, y, &cinfo));
if (first)
{
EXRCORE_TEST_RVAL (exr_encoding_initialize (f, 0, &cinfo, &encoder));
}
else
{
EXRCORE_TEST_RVAL (exr_encoding_update (f, 0, &cinfo, &encoder));
}
encoder.channels[0].encode_from_ptr = (const uint8_t*) row.data ();
encoder.channels[0].user_pixel_stride = 2;
encoder.channels[0].user_line_stride = 0;
if (first)
{
EXRCORE_TEST_RVAL (
exr_encoding_choose_default_routines (f, 0, &encoder));
}
EXRCORE_TEST_RVAL (exr_encoding_run (f, 0, &encoder));
first = false;
}
EXRCORE_TEST_RVAL (exr_encoding_destroy (f, &encoder));
EXRCORE_TEST_RVAL (exr_finish (&f));
}
void
testCompressionContextCache (const std::string& tempdir)
{
std::string filename = tempdir + "compression_context_cache.exr";
writeZIPFile (filename);
exr_context_t f;
exr_context_initializer_t cinit = EXR_DEFAULT_CONTEXT_INITIALIZER;
EXRCORE_TEST_RVAL (exr_start_read (&f, filename.c_str (), &cinit));
exr_decode_pipeline_t decoder = EXR_DECODE_PIPELINE_INITIALIZER;
exr_chunk_info_t cinfo;
int32_t scansperchunk;
int32_t height;
EXRCORE_TEST_RVAL (exr_get_scanlines_per_chunk (f, 0, &scansperchunk));
exr_attr_box2i_t dw;
EXRCORE_TEST_RVAL (exr_get_data_window (f, 0, &dw));
height = dw.max.y - dw.min.y + 1;
/* Need at least 2 chunks to verify reuse */
EXRCORE_TEST (height > scansperchunk);
void* first_context = NULL;
bool first = true;
int chunks_decoded = 0;
for (int y = dw.min.y; y <= dw.max.y; y += scansperchunk)
{
EXRCORE_TEST_RVAL (exr_read_scanline_chunk_info (f, 0, y, &cinfo));
if (first)
{
EXRCORE_TEST_RVAL (exr_decoding_initialize (f, 0, &cinfo, &decoder));
}
else
{
EXRCORE_TEST_RVAL (exr_decoding_update (f, 0, &cinfo, &decoder));
}
/* skip output — just decompress */
for (int c = 0; c < decoder.channel_count; ++c)
{
decoder.channels[c].decode_to_ptr = NULL;
decoder.channels[c].user_pixel_stride = 0;
decoder.channels[c].user_line_stride = 0;
}
if (first)
{
EXRCORE_TEST_RVAL (
exr_decoding_choose_default_routines (f, 0, &decoder));
}
EXRCORE_TEST_RVAL (exr_decoding_run (f, 0, &decoder));
if (first)
{
/* context must be populated after the first chunk */
EXRCORE_TEST (decoder.compression_context != NULL);
EXRCORE_TEST (decoder.free_compression_context != NULL);
first_context = decoder.compression_context;
}
else
{
/* same pointer every subsequent chunk — context is reused, not recreated */
EXRCORE_TEST (decoder.compression_context == first_context);
}
first = false;
++chunks_decoded;
}
EXRCORE_TEST (chunks_decoded >= 2);
/* destroy must clear the context */
EXRCORE_TEST_RVAL (exr_decoding_destroy (f, &decoder));
EXRCORE_TEST (decoder.compression_context == NULL);
EXRCORE_TEST (decoder.free_compression_context == NULL);
EXRCORE_TEST_RVAL (exr_finish (&f));
remove (filename.c_str ());
}

@cary-ilm cary-ilm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple typos and a suggested test from Claude.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants