
@brandon-b-miller (Contributor) commented Oct 24, 2025

Part of #471

  • Adds a DeprecatedNDArrayAPIWarning emitted from all user-facing functions for moving data around (cuda.to_device, driver.host_to_device, device_to_host, as_cuda_array, is_cuda_array, etc.); a sketch of the warning pattern follows this list
  • Separates the existing, now-deprecated APIs into internal non-warning versions and external warning versions
  • Adds a deprecation warning to the DeviceNDArray ctor
  • Adds DeviceNDArray._create_nowarn
  • Removes as many usages of the deprecated APIs as possible from the test suite in favor of cupy arrays
  • Catches warnings in tests of the currently exposed, now-deprecated APIs
  • Where absolutely necessary, tests call the internal non-warning versions of the deprecated APIs
  • Reworks tests to avoid these APIs as much as possible
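
For context, a minimal sketch of the warning pattern described above, matching the decorator snippet discussed later in this thread. The FutureWarning base class and the _deprecated_api helper name are illustrative assumptions, not necessarily what the PR uses:

    import functools
    import warnings

    class DeprecatedNDArrayAPIWarning(FutureWarning):
        """Emitted by deprecated numba-cuda NDArray APIs."""

    def _deprecated_api(func):
        # Wrap a public API so each call emits the deprecation warning,
        # then defer to the original implementation.
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} api is deprecated. "
                "Please prefer cupy for array functions",
                DeprecatedNDArrayAPIWarning,
            )
            return func(*args, **kwargs)
        return wrapper

Tests can then assert the behavior with pytest.warns(DeprecatedNDArrayAPIWarning) or suppress it with warnings.catch_warnings().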

@copy-pr-bot (bot) commented Oct 24, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@functools.wraps(func)
def wrapper(*args, **kwargs):
    warnings.warn(
        f"{func.__name__} api is deprecated. Please prefer cupy for array functions",
        DeprecatedNDArrayAPIWarning,
    )
Contributor
cupy arrays are much slower than DeviceNDArray because they require creating an external (i.e., non-numba-cuda-created) stream, so I'm not sure a recommendation for that is what we should do right now.

I was thinking that we could keep the top-level APIs (device_array etc.) and replace their internals with StridedMemoryView or something similar, so that folks can construct arrays as cheaply as possible.
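
For reference, a rough sketch of what that could look like on top of cuda.core's StridedMemoryView. The view_device_buffer helper is hypothetical, and I'm assuming the cuda.core.experimental.utils import path and the stream_ptr=-1 "no synchronization" convention:

    from cuda.core.experimental.utils import StridedMemoryView

    def view_device_buffer(obj, stream_ptr=-1):
        # Zero-copy view over anything exposing __cuda_array_interface__
        # or DLPack; stream_ptr=-1 requests no stream synchronization.
        view = StridedMemoryView(obj, stream_ptr=stream_ptr)
        # view.ptr, view.shape, view.strides, view.dtype, and
        # view.device_id would be enough to back a cheap
        # DeviceNDArray-like container.
        return view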

Contributor

Here's the current state of the art:

[image attachment]

Contributor Author

I concur that a lightweight device-array-like container should exist; I'm just not sure that numba-cuda should necessarily be the library providing it publicly. I think we should nudge users away from using numba-cuda for that kind of thing, like moving data from host to device. That said, I'm open to suggestions on what we should recommend.
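
For what it's worth, the CuPy equivalents of the deprecated helpers are short (assuming cupy is installed; this is the kind of recommendation under discussion, not a settled one):

    import numpy as np
    import cupy as cp

    host = np.arange(1024, dtype=np.float32)

    d_arr = cp.asarray(host)   # host -> device, roughly cuda.to_device(host)
    back = cp.asnumpy(d_arr)   # device -> host, roughly d_arr.copy_to_host()

    # Copies are issued on the current stream, so an explicit stream looks like:
    stream = cp.cuda.Stream()
    with stream:
        d_arr2 = cp.asarray(host)
    stream.synchronize()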

@gmarkall added the '2 - In Progress' (currently a work in progress) label Oct 24, 2025
@rparolin added this to the next milestone Oct 24, 2025
@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

2 files reviewed, 2 comments


Comment on lines +148 to 154

 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
     shape, strides, dtype = prepare_shape_strides_dtype(
         shape, strides, dtype, order
     )
-    return devicearray.DeviceNDArray(
+    return devicearray.DeviceNDArray._create_nowarn(
         shape=shape, strides=strides, dtype=dtype, stream=stream
     )

[P0] The internal _device_array function is missing the @require_context decorator. This is inconsistent with the same function in _api.py (line 143) which has the decorator.

Since DeviceNDArray._create_nowarn requires an active CUDA context (it calls devices.get_context().memalloc() when allocating memory), calling this function without an active context will cause a runtime error. This function is used internally by the public device_array function which has the decorator via the outer function, but _device_array itself can be called directly from other modules (e.g., from cuda.kernels.reduction, cuda.vectorizers, cuda.random, etc.).

Suggested change

+@require_context
 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
     shape, strides, dtype = prepare_shape_strides_dtype(
         shape, strides, dtype, order
     )
     return devicearray.DeviceNDArray._create_nowarn(
         shape=shape, strides=strides, dtype=dtype, stream=stream
     )

Comment on lines +262 to +264

cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
    :1
].copy_to_device(partials[:1], stream=stream)

[P2] Unnecessary complexity: converting res through __cuda_array_interface__ when it's already sliceable.

The res parameter is a device array that already supports slicing operations. The old code simply used res[:1] which worked correctly. Converting it through _from_cuda_array_interface(res.__cuda_array_interface__) adds unnecessary overhead and complexity without any functional benefit.

Since res implements the CUDA array interface, it can be sliced directly. The [:1] operation will work on any object that implements __getitem__ properly, including DeviceNDArray and other CUDA-compatible arrays.

Suggested change

-cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
-    :1
-].copy_to_device(partials[:1], stream=stream)
+res[:1].copy_to_device(partials[:1], stream=stream)


@greptile-apps (bot) left a comment

2 files reviewed, 2 comments


def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):

[P0] Missing @require_context decorator. This function calls DeviceNDArray._create_nowarn() which internally calls devices.get_context().memalloc() (line 123 in devicearray.py), requiring an active CUDA context. Without this decorator, the function will fail with a context error when called without an active context. The equivalent function in _api.py (line 143) correctly has this decorator.

Comment on lines +262 to +264

cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
    :1
].copy_to_device(partials[:1], stream=stream)

[P2] Unnecessary complexity. res is already a sliceable device array (CuPy array as shown in tests), so res[:1] works directly. Converting through __cuda_array_interface__ and _from_cuda_array_interface() adds overhead without benefit. The original code res[:1].copy_to_device(partials[:1], stream=stream) was simpler and equivalent.


@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

75 files reviewed, 1 comment


Comment on lines +148 to 154

 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
     shape, strides, dtype = prepare_shape_strides_dtype(
         shape, strides, dtype, order
     )
-    return devicearray.DeviceNDArray(
+    return devicearray.DeviceNDArray._create_nowarn(
         shape=shape, strides=strides, dtype=dtype, stream=stream
     )

logic: Missing @require_context decorator. The function calls DeviceNDArray._create_nowarn() which requires an active CUDA context (allocates memory via devices.get_context().memalloc()). The equivalent function in _api.py (line 143) has this decorator.

@brandon-b-miller (Contributor Author)

/ok to test

@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

79 files reviewed, 2 comments


def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):

syntax: Missing @require_context decorator. This function calls DeviceNDArray._create_nowarn() which allocates memory via devices.get_context().memalloc() (devicearray.py:123), requiring an active CUDA context. The equivalent function in _api.py (line 143) correctly has this decorator.

Suggested change

+@require_context
 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):

Comment on lines +262 to +264

cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
    :1
].copy_to_device(partials[:1], stream=stream)

style: Unnecessary complexity. res is already a sliceable device array (supports __cuda_array_interface__), so converting through _from_cuda_array_interface() adds overhead without benefit. Direct slicing works:

Suggested change

-cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
-    :1
-].copy_to_device(partials[:1], stream=stream)
+res[:1].copy_to_device(partials[:1], stream=stream)


@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

2 files reviewed, 2 comments


def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):

Missing @require_context decorator. This function calls DeviceNDArray._create_nowarn() which allocates memory via devices.get_context().memalloc(), requiring an active CUDA context. The equivalent function in _api.py (line 143) has this decorator.

Suggested change

+@require_context
 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):

Comment on lines +262 to +264

cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
    :1
].copy_to_device(partials[:1], stream=stream)

Unnecessary complexity. res is already a sliceable device array (CuPy array in tests), so direct slicing works:

res[:1].copy_to_device(partials[:1], stream=stream)

Converting through __cuda_array_interface__ and _from_cuda_array_interface() adds overhead without benefit.

@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

1 file reviewed, 1 comment


Comment on lines +148 to 154

 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
     shape, strides, dtype = prepare_shape_strides_dtype(
         shape, strides, dtype, order
     )
-    return devicearray.DeviceNDArray(
+    return devicearray.DeviceNDArray._create_nowarn(
         shape=shape, strides=strides, dtype=dtype, stream=stream
     )

Missing @require_context decorator. This function calls DeviceNDArray._create_nowarn() which requires an active CUDA context (allocates memory via devices.get_context().memalloc() at devicearray.py:126). The equivalent function in _api.py:143 correctly has this decorator.

Suggested change

+@require_context
 def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):
     shape, strides, dtype = prepare_shape_strides_dtype(
         shape, strides, dtype, order
     )
     return devicearray.DeviceNDArray._create_nowarn(
         shape=shape, strides=strides, dtype=dtype, stream=stream
     )

@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

No files reviewed, no comments


@brandon-b-miller (Contributor Author)

/ok to test

@greptile-apps (bot) left a comment

2 files reviewed, 2 comments


def _device_array(shape, dtype=np.float64, strides=None, order="C", stream=0):

Missing @require_context decorator. This function calls DeviceNDArray._create_nowarn() which allocates memory via devices.get_context().memalloc() (devicearray.py:126), requiring an active CUDA context. The equivalent function in _api.py:143 has this decorator.

Comment on lines +262 to +264

cuda._api._from_cuda_array_interface(res.__cuda_array_interface__)[
    :1
].copy_to_device(partials[:1], stream=stream)

Unnecessary complexity. res already supports slicing (it implements __cuda_array_interface__). Direct slicing works and is simpler: res[:1].copy_to_device(partials[:1], stream=stream)

