-
Notifications
You must be signed in to change notification settings - Fork 654
Add Google Highway SIMD acceleration for ImageBufAlgo operations #4986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: hwy-prototype
Are you sure you want to change the base?
Conversation
Integrated Google Highway (hwy) as a required dependency and updated build scripts accordingly. Added a new SIMD-accelerated resample_hwy implementation in imagebufalgo_xform.cpp, which is used for resampling when both source and destination have local pixels and the image is not deep. The scalar fallback remains for other cases.
Introduces Highway (hwy) SIMD-accelerated implementations for add, sub, mul, and pow operations in imagebufalgo, using fast pointer-based code paths when localpixels are available. Also updates resample_hwy to support both float and double types, improving performance and type safety for SIMD image processing.
Deleted all CI, analysis, documentation, release, and related workflow YAML files from .github/workflows. This disables all automated GitHub Actions for the repository.
Introduces a new header, imagebufalgo_hwy_pvt.h, encapsulating SIMD type traits and vectorized load/store utilities using Highway. Refactors add, sub, and mul implementations in imagebufalgo_addsub.cpp and imagebufalgo_muldiv.cpp to use these utilities, improving code clarity and enabling more robust SIMD handling for various pixel types.
Refactors SIMD kernel runners and type promotion/demotion utilities in imagebufalgo_hwy_pvt.h for better extensibility and correctness, including support for int16_t and improved documentation. Updates all relevant imagebufalgo implementations to use reinterpret_cast and static_cast for type safety, and enhances pow_impl_hwy to use SIMD for scalar exponents. Also links hwy_contrib in CMake and replaces direct Highway includes with the new header where appropriate.
|
It's gonna take me a few days to do a thorough review -- I'm under the weather this week and mostly sacked out in bed with some strong meds, so brain isn't what it should be. But I like the direction of this. Stylistically, it's not very different from the small number of places that I've "hard coded" our simd.h approaches to certain math IBA ops, but mostly I only did it for a small number of common cases, for example over_impl_rgbafloat special case of IBA::over() when all images are 4 channel float, in imagebufalgo_composite.cpp. But you're handling many more cases, and in a more principled way using hwy. This PR seems to have completely removed the entire directory of GHA workflows. |
Benchmarking scripts for OIIO resample operations in Windows (PowerShell, BAT) and Linux (Bash), along with a C++ benchmark for HWY arithmetic operations. Includes Visual Studio project files for building and organizing the HWY benchmark. Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
|
Sure, no rush. I also will experiment with some IBA functions, that might be a good for SIMD use |
Introduces Highway SIMD fast paths for rangecompress, rangeexpand, premult, and unpremult operations in imagebufalgo_pixelmath.cpp, with new helpers for interleaved channel load/store and SIMD kernels for range compression/expansion in imagebufalgo_hwy_pvt.h. These changes accelerate per-pixel math for images with local memory, especially for RGBA and luma-based workflows, while preserving scalar fallbacks for non-local buffers.
Introduces an 'enable_hwy' global variable to control usage of Google Highway SIMD optimizations, defaulting to enabled. Updates pixel math and transformation functions to check 'enable_hwy' before using Highway paths. Attribute API extended to allow runtime control of Highway optimizations.
Expanded the benchmark_hwy_simple.cpp to provide comprehensive benchmarking of Highway SIMD vs scalar implementations for various OIIO ImageBufAlgo operations. Added support for multiple data types, command-line options, detailed timing, result validation, and output image saving. The new version benchmarks add, sub, mul, pow, rangecompress, rangeexpand, premult, unpremult, and resample operations, printing results in a formatted table and highlighting SIMD speedup.
Subtract 0.5 from source coordinates before interpolation in both vertical and horizontal resampling loops to correctly follow the pixel-center convention.
Relocated the hwy_test benchmark from hwy_tests/hwy to src/hwy_test, including renaming and updating CMake integration. Removed old batch, shell, and project files related to the previous test location. This streamlines the test structure and integrates hwy_test with the main build system. Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
|
I have added OIIO::attribute("enable_hwy", 0) #- disable highway SIMD code path
OIIO::attribute("enable_hwy", 1) #- enable highway SIMD code pathand added hwy_test as a tool (sorry) that run 10x times current and 10x times SIMD versions of add/mul/sub/pow range expand/compress and resample. take into account that native double support is not exist in oiio. data processed in float and cast back to double. |
|
Introduces Highway SIMD implementations for div, min, max, absdiff, clamp, and mad operations in ImageBufAlgo. Adds a generic ternary SIMD helper, and updates the test suite to benchmark the new SIMD code paths. Scalar fallbacks are preserved for non-contiguous or unsupported cases. Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
a benchmark for the clamp operation in hwy_test.cpp, including result saving.
Adds native integer SIMD implementations for scale-invariant ImageBufAlgo operations (add, sub, min, max, absdiff) using Google Highway, bypassing float conversion for matching integer types. Updates documentation to describe SIMD optimizations and environment variable control. Refactors kernel runners and test code to support new paths, yielding 6-12x speedup for integer images. Signed-off-by: Vlad (Kuzmin) Erium <libalias@gmail.com>
Introduces Highway SIMD implementations for ImageBufAlgo::invert and contrast_remap (linear stretch only, no sigmoid), improving performance for these pixel math operations. The code dispatches to SIMD when possible and falls back to scalar code for complex or unsupported cases.
|
I would recommend that you stop adding new functionality for now, and let's iterate on the review and whether we like the exact method and form of what you have. If we decide we want to make changes, the more IBA functions you have rewritten, the harder it will be to make the changes in all places. |
|
Can you please restore the workflows directory that you deleted? As it stands, this PR is not running any CI because of that change. |
Updated SIMD image buffer algorithms to consistently normalize integer pixel data to the 0-1 range for all relevant types (uint8, int8, uint16, int16, uint32) during vectorized load, store, and math operations. This ensures correct results for pixel math, clamping, pow, premult, and unpremult operations across all supported data types, and fixes edge cases where normalization or denormalization was missing in scalar fallback code paths.
This reverts commit 0ea3dff.
Refactors SIMD load and store routines to consistently normalize signed integer types (int8, int16, int32) to approximately [-1.0, 1.0] range, and updates denormalization to match. This ensures symmetric mapping and clamping for negative values, improving accuracy and consistency in image operations.
| checked_find_package (hwy REQUIRED) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer this be an optional dependency,
| checked_find_package (hwy REQUIRED) | |
| checked_find_package (hwy | |
| DEFINITIONS USE_HWY) | |
And then #ifdef USE_HWY in the cpp files to conditionally compile the parts that need hwy.
If we add a hard dependency, it precludes backporting any part of it to 3.1, since we don't change minimum dependencies in a released branch.
|
CI is failing because (perhaps among other reasons) we never have hwy installed. I think it would help to have a build_hwy.cmake auto-builder and maybe a build_hwy.bash to use for CI installation. |
|
I think the overall scheme here is good, but I'm concerned about the massive amount of extra code here that has to be maintained in parallel with the reference scalar implementation. (And also, I would expect it to greatly increase the compilation time.) As an example, contrast with add_impl_scalar is about 10 lines of code (for each of the image+image and image+const flavors) and really easy to understand, whereas the hwy implementations are another 200 lines (including image+image having to split into the float and integer cases). This pattern appears to be needed for all IBA functions we wish to implement hwy versions of. I don't think this is insurmountable at all. I just think we're going to need to work to make the situation easier to manage, and some suggestions come to mind. Not sure if any of what I'm about to say is practical, but maybe we can find a way. First, I note that each of these functions has a LOT of boilerplate to set up. But I feel like maybe most of that is repeated over and over? For example, it seems to me that the hwy implementations of add() and sub() are almost completely identical, except that at some point one has a So, is it possible that in imagebufalgo_hwy_pvt.h (where you have thankfully already abstracted out the Load/Store/Promote/Demote boilerplate), there is a way to create a single template each for unary, binary, and ternary pixel-by-pixel-independent style operations, where it is passed a functor/lambda with just the bare operation (Add, Sub, etc.)? So that maybe the operation_hwy_impl() for each IBA function is very tiny and mostly just supplying the right kernel operation? It's a little bit (conceptually) like what we recently did with perpixel_op(), where it's possible to write the simple pixel-by-pixel operations in just a couple lines of new code each. I'm obviously not experienced with hwy. Do you think this is practical? It's almost a little confusing that you have done so much already. I feel like it would have been easier if you stopped after a minimal prototype, like only doing IBA::add() and IBA::sub(), and we looked at that and said "what's the absolute minimum text size, least repetition, and simplest way to express each of these that doesn't sacrifice too much performance over the super optimal but maximally verbose version?" and iterated on that to see how simple we can get it, THEN replicate that methodology over as many other IBA functions as needed. |
|
Also, I see from your benchmarks (which may not reflect the latest code), that the performance improvement is not uniformly good. The uint32 stands out as often being slower, and never being a big enough improvement over the scalar code to be worth the template expansion of the extra cases, I bet. But do we need it at all? I've never seen an image file where somebody used uint32 pixel values -- except for channels that represent Object IDs or other ID tags, not color values, so those channels would not be meaningful to add or do most other image arithmetic for. So maybe uint32 should be removed from the Hwy templates and just let that case always fall back to the scalar code? There are other rare cases. Almost all image files are uint8, uint16, half, or float pixel values. The signed integers and 32 bit integers are rarely encountered. And many cases of mixing types are rare as well (for example, if somebody is encountering half at all, it's probably because they have a pipeline where everything is OpenEXR, and will almost never directly do binary image operations on one half image and one uint16 image, say. So maybe we can cut down drastically on compile times and library size bloat if we only expand hwy implementations for the small number of common combinations, let the other combinations fall back to the scalar code that handles all types, and we can always expand the set if we realize that there is a widely used combination that we've missed. Also, may as well fall back to shaders for combinations that we think are common, but we can't seem to make the hwy version be significantly faster. (Again, if we later figure out how to make a missed combination fast, we can add it at any time without any compatibility change.) |
|
I also do not happy with uint32 and float performance. Probably it not scaling well because I mostly testing on AVX2 with 8x32 that is only two RGB or RGBA vectors. I need to do a AVX512 benchmarks. |
It's hard to tell. In my experience, sometimes AVX2 is about as fast as you can get. Many of the Intel chips with clock down if you use the 512 bit instructions, and that can slow down subsequent scalar, 128/256-bit simd, or integer code until it raises the clock again. In other words, 512 bit is often only a win if you can do a large number of 512-wide instructions in a row. It can be really hit or miss if you are mixing different width SIMD, or if uou can't keep enough SIMD lanes full. I'm really not sure why the float results aren't better. It's important to understand. Is it somehow already auto-vectorizing the float case well even for the scalar case? Or is there some reason why the hwy implementation for that is not especially effective? Or is something simple like 'add' limited more by memory bandwidth saturation than by floating point throughput? It's important to keep in mind that for oiiotool (a very important use case), by default it expands most images to float while in memory for maximum precision, so the float case is by are the most common one for anybody using oiiotool.
Well, so this is kind of what I was getting at. I feel like this is showing enough promise that I believe we can count on adding some sort of hwy support, as long as it's an optional dependency, but I think some of the internals have some kinks to work out -- which IBA functions to use hwy for, which type combinations, how to express it. But those can be iterated over time. If it were me, this is the suggestion for how I would proceed, step by step:
|
|
@lgritz, I want to run code restructuring first in this PR, and after that will create a new PR for main with limited resample + add/sub/mul/mad/div |
Yes, definitely. I'm just trying to lay out a path that, ultimately, will make it easiest to accept your changes. What I'm afraid of is that you'll write so much code before we have refined the methodology, that you will experience some "sunk cost fallacy" where it will seem like too much work to change it all to a better methodology. To use another analogy, do you know the adage about how if you throw a frog in boiling water, it will jump right out, but if you put the frog into room temperature water that slowly heats to a boil, he will never notice until it's too late? I'm the frog! I'm saying that a series of small PRs -- each of which is obvious and totally defensible as the right approach, each of which is easy to read/understand/test, each of which is simple to steer to be better if I or other onlookers have suggestions -- will get accepted quickly in series, whereas the same set of changes all batched together in total into one huge PR (in a branch that my have diverged quite a bit from main if they take a long time) may be very difficult to review and decide to merge. |
|
As an aside, I would love your review of #4993 and then after I merge it, for you to rebase on top of it (or merge into your work) so that you are benchmarking versus my improved IBA::resample() fixes. In general, I think that ideally, we would like to aim for the hwy implementations to result in a 2-10x improvement versus scalar. That would be awesome, and is about as good as one can reasonably expect from turning scalar code into fairly good SIMD code for AVX2 or AVX512.
|
Introduces header-only helpers for local pixel access in imagebufalgo_hwy_pvt.h, reducing repetitive pointer and stride calculations. Updates add, sub, mul, div, mad, and invert Highway implementations to use these helpers for cleaner, safer, and more maintainable code. Adds developer documentation for Highway kernels and updates tests to include additional resample benchmarks.
perf(IBA): Add Google Highway SIMD fast paths for core operations
Implement SIMD acceleration using Google Highway for ImageBufAlgo add, sub,
mul, pow, and resample operations. Provides 2-8x performance improvement for
contiguous pixel layouts with portable vectorization across x86 and ARM.
New file imagebufalgo_hwy_pvt.h provides reusable SIMD infrastructure with
automatic type promotion/demotion (uint8/uint16/int16/uint32/half/float/double),
generic operation kernels, and smart fallback to scalar code for strided layouts.
Operations use runtime vector width detection (ScalableTag), FMA instructions
where applicable, and handle partial vectors correctly. Code follows OIIO style
with modern C++ casts and comprehensive documentation.
Requires: Google Highway library (MIT license, header-only)
Modified: imagebufalgo_{addsub,muldiv,pixelmath,xform}.cpp + new hwy_pvt.h
Checklist:
behavior.
testsuite.
PR, by pushing the changes to my fork and seeing that the automated CI
passed there. (Exceptions: If most tests pass and you can't figure out why
the remaining ones fail, it's ok to submit the PR and ask for help. Or if
any failures seem entirely unrelated to your change; sometimes things break
on the GitHub runners.)
fixed any problems reported by the clang-format CI test.
corresponding Python bindings. If altering ImageBufAlgo functions, I also
exposed the new functionality as oiiotool options.