-
Notifications
You must be signed in to change notification settings - Fork 654
Add Google Highway SIMD acceleration for ImageBufAlgo operations #4986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
ssh4net
wants to merge
21
commits into
AcademySoftwareFoundation:hwy-prototype
Choose a base branch
from
ssh4net:hwy
base: hwy-prototype
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+4,831
−130
Open
Changes from all commits
Commits
Show all changes
21 commits
Select commit
Hold shift + click to select a range
c8d2dbd
Add Google Highway SIMD resampling to ImageBufAlgo
ssh4net f13876b
Add Highway-based SIMD fast paths for pixel math ops
ssh4net 0ea3dff
Remove all GitHub Actions workflow files
ssh4net b1b8be0
Refactor SIMD math to use imagebufalgo_hwy_pvt.h
ssh4net 2cac746
Improve SIMD support and type handling in imagebufalgo
ssh4net 2c0f517
Add HWY arithmetic and resample benchmark scripts
ssh4net 9c5808b
Add SIMD-optimized range and premult/unpremult pixel math
ssh4net 20960da
Add Highway SIMD toggle and update pixel math logic
ssh4net f0552f7
Revamp Highway SIMD benchmark with detailed tests
ssh4net 7229160
Apply pixel-center convention in resample_hwy interpolation
ssh4net ed0c40b
Move and integrate hwy_test into src directory
ssh4net d6809ff
Add SIMD (Highway) acceleration for pixel math ops
ssh4net d6fbbc7
Add clamp benchmark and format code in mad_impl_hwy
ssh4net 45d6beb
Optimize ImageBufAlgo integer ops with native SIMD
ssh4net b87f054
Add Highway SIMD for invert and contrast_remap ops
ssh4net 1640275
Normalize integer image data to 0-1 in SIMD math
ssh4net 53b0294
Revert "Remove all GitHub Actions workflow files"
ssh4net f2fe675
Improve SIMD normalization for signed integer image types
ssh4net 699196b
Update hwy_test.cpp
ssh4net 87421c4
Update imagebufalgo_xform.cpp
ssh4net 37089ae
Refactor ImageBufAlgo SIMD helpers and usage
ssh4net File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,264 @@ | ||
| ImageBufAlgo Highway (hwy) Implementation Guide | ||
| ============================================== | ||
|
|
||
| This document explains how OpenImageIO uses Google Highway (hwy) to accelerate | ||
| selected `ImageBufAlgo` operations, and how to add or modify kernels in a way | ||
| that preserves OIIO semantics while keeping the code maintainable. | ||
|
|
||
| This is a developer-facing document about the implementation structure in | ||
| `src/libOpenImageIO/`. It does not describe the public API behavior of the | ||
| algorithms. | ||
|
|
||
|
|
||
| Goals and non-goals | ||
| ------------------- | ||
|
|
||
| Goals: | ||
| - Make the hwy-backed code paths easy to read and easy to extend. | ||
| - Centralize repetitive boilerplate (type conversion, tails, ROI pointer math). | ||
| - Preserve OIIO's numeric semantics (normalized integer model). | ||
| - Keep scalar fallbacks as the source of truth for tricky layout cases. | ||
|
|
||
| Non-goals: | ||
| - Explain Highway itself. Refer to the upstream Highway documentation. | ||
| - Guarantee that every ImageBufAlgo op has a hwy implementation. | ||
|
|
||
|
|
||
| Where the code lives | ||
| -------------------- | ||
|
|
||
| Core helpers: | ||
| - `src/libOpenImageIO/imagebufalgo_hwy_pvt.h` | ||
|
|
||
| Typical hwy call sites: | ||
| - `src/libOpenImageIO/imagebufalgo_addsub.cpp` | ||
| - `src/libOpenImageIO/imagebufalgo_muldiv.cpp` | ||
| - `src/libOpenImageIO/imagebufalgo_mad.cpp` | ||
| - `src/libOpenImageIO/imagebufalgo_pixelmath.cpp` | ||
| - `src/libOpenImageIO/imagebufalgo_xform.cpp` (some ops are hwy-accelerated) | ||
|
|
||
|
|
||
| Enabling and gating the hwy path | ||
| ------------------------------- | ||
|
|
||
| The hwy path is only used when: | ||
| - Highway usage is enabled at runtime (`OIIO::pvt::enable_hwy`). | ||
| - The relevant `ImageBuf` objects have local pixel storage (`localpixels()` is | ||
| non-null), meaning the data is in process memory rather than accessed through | ||
| an `ImageCache` tile abstraction. | ||
| - The operation can be safely expressed as contiguous streams of pixels/channels | ||
| for the hot path, or the code falls back to a scalar implementation for | ||
| strided/non-contiguous layouts. | ||
|
|
||
| The common gating pattern looks like: | ||
| - In a typed `*_impl` dispatcher: check `OIIO::pvt::enable_hwy` and `localpixels` | ||
| and then call a `*_impl_hwy` function; otherwise call `*_impl_scalar`. | ||
|
|
||
| Important: the hwy path is an optimization. Correctness must not depend on hwy. | ||
|
|
||
|
|
||
| OIIO numeric semantics: why we promote to float | ||
| ---------------------------------------------- | ||
|
|
||
| OIIO treats integer image pixels as normalized values: | ||
| - Unsigned integers represent [0, 1]. | ||
| - Signed integers represent approximately [-1, 1] with clamping for INT_MIN. | ||
|
|
||
| Therefore, most pixel math must be performed in float (or double) space, even | ||
| when the stored data is integer. This is why the hwy layer uses the | ||
| "LoadPromote/Operate/DemoteStore" pattern. | ||
|
|
||
| For additional discussion (and pitfalls of saturating integer arithmetic), see: | ||
| - `HIGHWAY_SATURATING_ANALYSIS.md` | ||
|
|
||
|
|
||
| The core pattern: LoadPromote -> RunHwy* -> DemoteStore | ||
| ------------------------------------------------------- | ||
|
|
||
| The helper header `imagebufalgo_hwy_pvt.h` defines the reusable building blocks: | ||
|
|
||
| 1) Computation type selection | ||
| - `SimdMathType<T>` selects `float` for most types, and `double` only when | ||
| the destination type is `double`. | ||
|
|
||
| Rationale: | ||
| - Float math is significantly faster on many targets. | ||
| - For OIIO, integer images are normalized to [0,1] (or ~[-1,1]), so float | ||
| precision is sufficient for typical image processing workloads. | ||
|
|
||
| 2) Load and promote (with normalization) | ||
| - `LoadPromote(d, ptr)` and `LoadPromoteN(d, ptr, count)` load values and | ||
| normalize integer ranges into the computation space. | ||
|
|
||
| Rationale: | ||
| - Consolidates all normalization and conversion logic in one place. | ||
| - Prevents subtle drift where each operation re-implements integer scaling. | ||
| - Ensures tail handling ("N" variants) is correct and consistent. | ||
|
|
||
| 3) Demote and store (with denormalization/clamp/round) | ||
| - `DemoteStore(d, ptr, v)` and `DemoteStoreN(d, ptr, v, count)` reverse the | ||
| normalization and store results in the destination pixel type. | ||
|
|
||
| Rationale: | ||
| - Centralizes rounding and clamping behavior for all destination types. | ||
| - Ensures output matches OIIO scalar semantics. | ||
|
|
||
| 4) Generic kernel runners (streaming arrays) | ||
| - `RunHwyUnaryCmd`, `RunHwyCmd` (binary), `RunHwyTernaryCmd` | ||
| - These are the primary entry points for most hwy kernels. | ||
|
|
||
| Rationale: | ||
| - Encapsulates lane iteration and tail processing once. | ||
| - The call sites only provide the per-lane math lambda, not the boilerplate. | ||
|
|
||
|
|
||
| Native integer runners: when they are valid | ||
| ------------------------------------------- | ||
|
|
||
| Some operations are "scale-invariant" under OIIO's normalized integer model. | ||
| For example, for unsigned integer add: | ||
| - `(a/max + b/max)` in float space, then clamped to [0,1], then scaled by max | ||
| matches saturated integer add `SaturatedAdd(a, b)` for the same bit depth. | ||
|
|
||
| For those cases, `imagebufalgo_hwy_pvt.h` provides: | ||
| - `RunHwyUnaryNativeInt<T>` | ||
| - `RunHwyBinaryNativeInt<T>` | ||
|
|
||
| These should only be used when all of the following are true: | ||
| - The operation is known to be scale-invariant under the normalization model. | ||
| - Input and output types are the same integral type. | ||
| - The operation does not depend on mixed types or float-range behavior. | ||
|
|
||
| Rationale: | ||
| - Avoids promotion/demotion overhead and can be materially faster. | ||
| - Must be opt-in and explicit, because many operations are NOT compatible with | ||
| raw integer arithmetic (e.g. multiplication, division, pow). | ||
|
|
||
|
|
||
| Local pixel pointer helpers: reducing boilerplate safely | ||
| ------------------------------------------------------- | ||
|
|
||
| Most hwy call sites need repeated pointer and stride computations: | ||
| - Pixel size in bytes. | ||
| - Scanline size in bytes. | ||
| - Base pointer to local pixels. | ||
| - Per-row pointer for a given ROI and scanline. | ||
| - Per-pixel pointer for non-contiguous fallbacks. | ||
|
|
||
| To centralize that, `imagebufalgo_hwy_pvt.h` defines: | ||
| - `HwyPixels(ImageBuf&)` and `HwyPixels(const ImageBuf&)` | ||
| returning a small view (`HwyLocalPixelsView`) with: | ||
| - base pointer (`std::byte*` / `const std::byte*`) | ||
| - `pixel_bytes`, `scanline_bytes` | ||
| - `xbegin`, `ybegin`, `nchannels` | ||
| - `RoiNChannels(roi)` for `roi.chend - roi.chbegin` | ||
| - `ChannelsContiguous<T>(view, nchannels)`: | ||
| true only when the pixel stride exactly equals `nchannels * sizeof(T)` | ||
| - `PixelBase(view, x, y)`, `ChannelPtr<T>(view, x, y, ch)` | ||
| - `RoiRowPtr<T>(view, y, roi)` for the start of the ROI row at `roi.xbegin` and | ||
| `roi.chbegin`. | ||
|
|
||
| Rationale: | ||
| - Avoids duplicating fragile byte-offset math across many ops. | ||
| - Makes it visually obvious what the code is doing: "get row pointer" vs | ||
| "compute offset by hand." | ||
| - Makes non-contiguous fallback paths less error-prone by reusing the same | ||
| pointer computations. | ||
|
|
||
| Important: these helpers are only valid for `ImageBuf` instances with local | ||
| pixels (`localpixels()` non-null). The call sites must check that before using | ||
| them. | ||
|
|
||
|
|
||
| Contiguous fast path vs non-contiguous fallback | ||
| ----------------------------------------------- | ||
|
|
||
| Most operations implement two paths: | ||
|
|
||
| 1) Contiguous fast path: | ||
| - Used when pixels are tightly packed for the ROI's channel range. | ||
| - The operation is executed as a 1D stream of length: | ||
| `roi.width() * (roi.chend - roi.chbegin)` | ||
| - Uses `RunHwy*Cmd` (or native-int runner) and benefits from: | ||
| - fewer branches | ||
| - fewer pointer computations | ||
| - auto tail handling | ||
|
|
||
| 2) Non-contiguous fallback: | ||
| - Used when pixels have padding, unusual strides, or channel subsets that do | ||
| not form a dense stream. | ||
| - Typically loops pixel-by-pixel and channel-by-channel. | ||
| - May still use the `ChannelPtr` helpers to compute correct addresses. | ||
|
|
||
| Rationale: | ||
| - The contiguous path is where SIMD delivers large gains. | ||
| - Trying to SIMD-optimize arbitrary strided layouts often increases complexity | ||
| and risk for marginal benefit. Keeping a scalar fallback preserves | ||
| correctness and maintainability. | ||
|
|
||
|
|
||
| How to add a new hwy kernel | ||
| --------------------------- | ||
|
|
||
| Step 1: Choose the kernel shape | ||
| - Unary: `R = f(A)` -> use `RunHwyUnaryCmd` | ||
| - Binary: `R = f(A, B)` -> use `RunHwyCmd` | ||
| - Ternary: `R = f(A, B, C)` -> use `RunHwyTernaryCmd` | ||
|
|
||
| Step 2: Decide if a native-int fast path is valid | ||
| - Only for scale-invariant ops and same-type integral inputs/outputs. | ||
| - Use `RunHwyUnaryNativeInt` / `RunHwyBinaryNativeInt` when safe. | ||
| - Otherwise, always use the promote/demote runners. | ||
|
|
||
| Step 3: Implement the hwy body with a contig check | ||
| Typical structure inside `*_impl_hwy`: | ||
| - Acquire views once: | ||
| - `auto Rv = HwyPixels(R);` | ||
| - `auto Av = HwyPixels(A);` etc. | ||
| - In the parallel callback: | ||
| - compute `nchannels = RoiNChannels(roi)` | ||
| - compute `contig = ChannelsContiguous<...>(...)` for each image | ||
| - for each scanline y: | ||
| - `Rtype* r_row = RoiRowPtr<Rtype>(Rv, y, roi);` | ||
| - `const Atype* a_row = RoiRowPtr<Atype>(Av, y, roi);` etc. | ||
| - if contig: call `RunHwy*` with `n = roi.width() * nchannels` | ||
| - else: fall back per pixel, per channel | ||
|
|
||
| Step 4: Keep the scalar path as the reference | ||
| - The scalar implementation should remain correct for all layouts and types. | ||
| - The hwy path should match scalar results for supported cases. | ||
|
|
||
|
|
||
| Design rationale summary | ||
| ------------------------ | ||
|
|
||
| This design intentionally separates concerns: | ||
| - Type conversion and normalization are centralized (`LoadPromote`, | ||
| `DemoteStore`). | ||
| - SIMD lane iteration and tail handling are centralized (`RunHwy*` runners). | ||
| - Image address computations are centralized (`HwyPixels`, `RoiRowPtr`, | ||
| `ChannelPtr`). | ||
| - Operation-specific code is reduced to short lambdas expressing the math. | ||
|
|
||
| This makes the hwy layer: | ||
| - Easier to maintain: fewer places to fix bugs when semantics change. | ||
| - Easier to extend: adding an op mostly means writing the math lambda and the | ||
| dispatch glue. | ||
| - Safer: correctness for unusual layouts remains in scalar fallbacks. | ||
|
|
||
|
|
||
| Notes on `half` | ||
| --------------- | ||
|
|
||
| The hwy conversion helpers handle `half` by converting through | ||
| `hwy::float16_t`. This currently assumes the underlying `half` representation | ||
| is compatible with how Highway loads/stores 16-bit floats. | ||
|
|
||
| If this assumption is revisited in the future, it should be changed as a | ||
| separate, explicit correctness/performance project. | ||
|
|
||
|
|
||
| <!-- SPDX-License-Identifier: CC-BY-4.0 --> | ||
| <!-- Copyright Contributors to the OpenImageIO Project. --> | ||
|
|
||
|
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer this be an optional dependency,
And then
#ifdef USE_HWYin the cpp files to conditionally compile the parts that need hwy.If we add a hard dependency, it precludes backporting any part of it to 3.1, since we don't change minimum dependencies in a released branch.