
feat: support runtime length for vectorized functions with compile-time max bound #1777

Open
AraxTheCoder wants to merge 1 commit into data61:master from AraxTheCoder:master


@AraxTheCoder

Description: Runtime Vectorization Length Support

This pull request introduces the ability to execute vectorized instructions and functions based on a runtime-determined length.

While the maximum possible length must still be defined at compile-time to facilitate instruction generation, this change allows for significantly more efficient execution when the actual number of required operations is only known at runtime.

High-Level API

The functionality is exposed through an extension of the vectorize decorator, which now supports the optional named argument active_length.
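To make the calling convention concrete, here is a minimal plain-Python model of a decorator that accepts an optional active_length keyword. It is only a sketch: `vectorize_model` and `less_than_model` are hypothetical names, the lists stand in for secret-shared vectors, and the bounds check is an assumption rather than something this PR is known to do. It does not reflect how the actual vectorize decorator emits instructions.

```python
import functools


def vectorize_model(func):
    """Toy stand-in for the decorator; NOT the MP-SPDZ implementation."""

    @functools.wraps(func)
    def wrapper(*vectors, active_length=None):
        max_length = len(vectors[0])  # plays the role of the compile-time bound
        # Without active_length the full length is processed, mirroring the
        # unchanged default behaviour described in this PR.
        n = max_length if active_length is None else active_length
        # Assumption of this sketch: out-of-range lengths are rejected.
        if not 0 <= n <= max_length:
            raise ValueError("active_length must lie in [0, max_length]")
        # Only the first n elements of every operand are touched.
        return func(*(v[:n] for v in vectors))

    return wrapper


@vectorize_model
def less_than_model(a, b):
    # Element-wise comparison on plain Python lists.
    return [x < y for x, y in zip(a, b)]


a = list(range(10))
b = list(reversed(range(10)))

full = less_than_model(a, b)                      # processes all 10 elements
partial = less_than_model(a, b, active_length=3)  # processes only the first 3
```

In this model the partial result is simply a prefix of the full one; the point of the feature is that the remaining elements are never computed.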

Comparison & Motivation

To illustrate the necessity of this change, consider the performance and resource trade-offs of different approaches when processing a fraction of a large array at runtime:

from Compiler.library import *

max_length = 10000
values = Array.create_from([sint(x) for x in range(max_length)])
runtime_length = cint(300)

start_timer(3)
# compare over the whole length and then...
res = values.get_vector().less_than(values.get_vector())
# ...pretend that only the first runtime_length values are used afterwards, e.g. via @for_range
stop_timer(3)

start_timer(4)
# optimal performance (as if the runtime length were known at compile time)
res = values.get_vector(0, 300).less_than(values.get_vector(0, 300))
stop_timer(4)

@vectorize
def runtime_less_than(a, b):
    return a.less_than(b)

start_timer(5)
# proposed solution: vectorized instructions whose length is a true runtime value
res = runtime_less_than(values.get_vector(), values.get_vector(), active_length=runtime_length)
stop_timer(5)

Current Framework State (Timer 3):
To handle a dynamic length today, a program must process all max_length elements. This adds no compile-time overhead, but it incurs significant data-volume and computation overhead, because the operations on unused elements are still executed.

Manual Mapping (Pre-compilation):
Another workaround is to "pre-compile" the vectorized instruction or function for every possible length and dispatch to the matching variant at runtime. This avoids the data-volume overhead, but the compile-time overhead scales linearly with the maximum possible length.
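The manual mapping can be pictured with a plain-Python sketch. All names here (`make_fixed_length_less_than`, `precompile_all`, `table`) are hypothetical, and building a closure stands in for compiling one fixed-length vectorized instruction; the sketch only illustrates why the up-front work grows with the maximum length.

```python
def make_fixed_length_less_than(n):
    # Stand-in for compiling one vectorized comparison of fixed length n.
    def fixed(a, b):
        return [x < y for x, y in zip(a[:n], b[:n])]

    return fixed


def precompile_all(max_length):
    # One specialised variant per possible length: the table, and with it the
    # "compile-time" work, grows linearly with max_length.
    return {n: make_fixed_length_less_than(n) for n in range(1, max_length + 1)}


max_length = 100
table = precompile_all(max_length)

a = list(range(max_length))
b = [50] * max_length

runtime_length = 30                  # only known when the program runs
res = table[runtime_length](a, b)    # dispatch to the matching variant
```

With max_length = 10000 as in the test program, this approach would need 10000 variants, which is the overhead the active_length parameter is meant to avoid.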

The Proposed Solution (Timer 5):
Passing the active_length parameter matches the data volume and round count of the static baseline (Timer 4), with a running time about 16% higher in the measurement below. Unused elements are no longer processed, and compilation time does not increase.

Time3 = 0.00544764 seconds (0.383952 MB, 9 rounds)
Time4 = 0.00115694 seconds (0.011596 MB, 9 rounds)
Time5 = 0.00134388 seconds (0.011596 MB, 9 rounds)
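The ratios behind this comparison can be recomputed from the timer output above. The short script below uses only the reported numbers; the dictionary layout and the `ratio` helper are illustrative.

```python
# (seconds, megabytes) as reported by the three timers above
timers = {
    3: (0.00544764, 0.383952),  # full max_length
    4: (0.00115694, 0.011596),  # static baseline, length fixed at 300
    5: (0.00134388, 0.011596),  # proposed active_length
}

SECONDS, MEGABYTES = 0, 1


def ratio(numerator, denominator, field):
    # Ratio of one reported quantity between two timers.
    return timers[numerator][field] / timers[denominator][field]


data_saving = ratio(3, 5, MEGABYTES)      # full length vs. active_length
speedup = ratio(3, 5, SECONDS)            # full length vs. active_length
overhead_vs_static = ratio(5, 4, SECONDS)  # active_length vs. static baseline

print(f"data volume reduced by a factor of {data_saving:.1f}")
print(f"running time reduced by a factor of {speedup:.2f}")
print(f"running time relative to static baseline: {overhead_vs_static:.2f}x")
```

The data-volume factor of roughly 33 is close to 10000 / 300, the ratio of max_length to the runtime length used in the test program.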

Preprocessed Data & Batching

A notable characteristic of this approach is that the exact amount of preprocessed data required is not known at compile-time. To keep the number of communication rounds as low as in the static baseline, the batch size must be adjusted. For the test program above, a batch size of 1,000,000 was used to verify optimal performance; users of this feature will need to tune this value for their own workloads.

I tried to keep the changes to the Instruction and Processor files to a minimum.
Crucially, the standard behavior of the vectorize decorator should remain unchanged when active_length is not passed.

@CLAassistant

CLAassistant commented Apr 17, 2026

CLA assistant check
All committers have signed the CLA.

