Added implementation to calculate Gram and Row Gram matrices #7378
TwentyPast4 wants to merge 2 commits into isl-org:main from
Conversation
Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.
Hi @TwentyPast4, thanks for adding this interesting new operation. Where is this used in Open3D, or a 3D workflow? How much of a performance gain does it provide to that workflow?
The current use case of this operation I could find is solving least-squares problems, for example fitting a conic to a point cloud. This is useful when working with point clouds that contain objects of a known shape, and you want to measure properties of those shapes. The performance difference is the same for both Gram and row Gram: on my hardware it is a 6-7x speedup on CPU and a 2x speedup on GPU (3.60 vs 22.25 seconds on CPU, and 14.01 vs 24.37 seconds on GPU).

```cpp
TEST_P(LinalgPermuteDevices, GramPerf) {
    core::Device device = GetParam();
    // Gram test.
    core::Tensor A = core::Tensor::Init<float>({{1, 2, 3}, {4, 5, 6}}, device);
    auto start = std::chrono::steady_clock::now();
    core::Tensor B;
    for (int i = 0; i < 1000000; ++i) {
        B = A.Gram();
    }
    auto after_gram = std::chrono::steady_clock::now();
    for (int i = 0; i < 1000000; ++i) {
        B = A.T().Matmul(A);
    }
    auto finish = std::chrono::steady_clock::now();
    double elapsed_gram = std::chrono::duration_cast<std::chrono::microseconds>(
                                  after_gram - start).count() * 1e-6;
    double elapsed_matmul = std::chrono::duration_cast<std::chrono::microseconds>(
                                    finish - after_gram).count() * 1e-6;
    EXPECT_LT(elapsed_gram, elapsed_matmul);
}
```

It should be noted that compiler optimizations may skew this kind of benchmark, but it's probably ballpark-accurate.
Type
Motivation and Context
Gram and row Gram matrix computations (i.e. A.T @ A and A @ A.T, respectively) are relatively common in linear algebra workflows (e.g. least squares, linear-independence checks, ML kernels, ...).
If you execute A.T().Matmul(A), at least one of the two operands will not be contiguous, so Matmul must first make a contiguous copy, which can be a noticeable performance loss.
Gram() and RowGram() are implemented with a single matrix, with the transposition handled inside the gemm functions. This means that if A is contiguous, no copy is performed, which is not true for A.T().Matmul(A).
Checklist:
- I have run `python util/check_style.py --apply` to apply the Open3D code style to my code.
- The documentation has been updated accordingly.
- I have included test results (e.g. screenshots or numbers) here.
Description
Added Gram() and RowGram() functions to Tensor. These are intended for tensors of dimension <= 2, similar to how T() is implemented.