nvmeof: treat "connecting" state as valid in path detection #5974
Conversation
When checking if a path to a gateway already exists, treat both "live" and "connecting" states as valid connections that should not be re-attempted.

The "connecting" state indicates the NVMe kernel is actively trying to establish or re-establish a connection, which occurs in scenarios like:
- Initial connection establishment
- Gateway temporarily unavailable and kernel retrying
- Subsystem deleted and recreated on the gateway

The kernel's ctrl_loss_tmo mechanism will continue retry attempts for up to 30 minutes (set by the -l param in the nvme connect command). Attempting nvme connect while a path is in "connecting" state results in "already connected" errors and can cause volume attachment failures during create/delete cycles.

By treating "connecting" as a valid state, we allow the kernel's retry logic to handle reconnection automatically without interference.

Signed-off-by: gadi-didi <gadi.didi@ibm.com>
Force-pushed from ad52f60 to 4dc16ff
```diff
 			path.Address.Trsvcid == gatewayPort &&
-			path.State == "live" {
+			(path.State == "live" ||
+				path.State == "connecting") {
```
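To make the review context concrete, here is a minimal, self-contained sketch of a path-existence check of this shape. The type, function name, and fields are illustrative assumptions, not the actual ceph-csi code:

```go
// Illustrative sketch only: decide whether a usable path to the gateway
// already exists, so "nvme connect" is not re-attempted. Type and field
// names are assumptions, not the real ceph-csi implementation.
package nvmeofutil

type nvmePath struct {
	Traddr  string // gateway address
	Trsvcid string // gateway port (service ID)
	State   string // "live", "connecting", ...
}

func gatewayPathExists(paths []nvmePath, gatewayAddr, gatewayPort string) bool {
	for _, p := range paths {
		if p.Traddr == gatewayAddr &&
			p.Trsvcid == gatewayPort &&
			// "connecting" means the kernel is already retrying this path;
			// rerunning "nvme connect" would fail with "already connected".
			(p.State == "live" || p.State == "connecting") {
			return true
		}
	}
	return false
}
```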
Do we need to retry if it's in the connecting state until it goes to live?
No, we don't need to retry. Here's why:
The kernel is already doing the retrying for us. When a path is in "connecting" state, the NVMe kernel driver is actively attempting to establish the connection and will keep retrying automatically for up to 30 minutes (based on the ctrl_loss_tmo we set with -l 1800 in the nvme connect command).
From the CSI driver's perspective, our job is just to ensure a connection attempt has been initiated before the next step: mounting.
Once we see a path in either "live" or "connecting" state for our subsystem, we know the kernel has it under control.
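For reference, a rough sketch of how a connect invocation with that 30-minute loss timeout could be assembled; the exact set of flags the driver passes is an assumption here, not taken from the PR:

```go
// Illustrative sketch only: build an "nvme connect" invocation with a
// 30-minute ctrl_loss_tmo (-l 1800). Which flags the driver really uses
// is an assumption for illustration.
package nvmeofutil

import "os/exec"

func nvmeConnectCmd(transport, traddr, trsvcid, nqn string) *exec.Cmd {
	return exec.Command("nvme", "connect",
		"-t", transport, // e.g. "tcp"
		"-a", traddr, // gateway address
		"-s", trsvcid, // gateway port
		"-n", nqn, // subsystem NQN
		"-l", "1800", // ctrl_loss_tmo: kernel keeps retrying for 30 minutes
	)
}
```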
What happens in practice:
This function is called from the nodeserver in NodeStageVolume(), right after ControllerPublishVolume() has added the host (if it needed to be added) via a gRPC AddHost call to the NVMe-oF gateway.
If the path is "live" --> great, the connection is working.
If the path is "connecting" --> there is a path, but it is not yet connected (probably because the last NVMe-oF namespace was removed, so ControllerUnpublishVolume() removed the host and the subsystem). The kernel is retrying, and once ControllerPublishVolume() adds the host again the target becomes available; we simply reach this check before the transition completes. The path will automatically transition to "live" and the namespace will appear.
Another example:
- There was a connection with a live path for some NVMe-oF namespace. Then the admin deleted that pod and the PVC, so the host and the subsystem were both deleted.
- 30 minutes passed.
- The admin wants to create a new PVC and then a new pod.
- nvme subsys-list will not see "connecting" or "live", so nvme connect will rerun.
Actually, I could rerun connect anyway, without this check. If the path is "live" or "connecting" I would get an "already connected" error, so I could parse the error code and stderr, detect this case, move on, and not raise an error.
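A rough sketch of that alternative, for illustration only; the "already connected" stderr text is an assumption about nvme-cli's wording, which is exactly the fragility discussed further down:

```go
// Illustrative sketch only of the alternative approach: always run
// "nvme connect" and tolerate the "already connected" failure. The exact
// stderr text is a version-dependent assumption about nvme-cli.
package nvmeofutil

import (
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func connectIgnoringAlreadyConnected(args ...string) error {
	cmd := exec.Command("nvme", append([]string{"connect"}, args...)...)
	var stderr bytes.Buffer
	cmd.Stderr = &stderr
	if err := cmd.Run(); err != nil {
		// Treat "already connected" as success: the kernel already owns the
		// connection and will keep retrying until ctrl_loss_tmo expires.
		if strings.Contains(strings.ToLower(stderr.String()), "already connected") {
			return nil
		}
		return fmt.Errorf("nvme connect failed: %w: %s", err, stderr.String())
	}
	return nil
}
```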
@Madhu-1 what do you think?
A couple of questions
- If it's in the connecting state, will the application be able to read/write? I assume no. What is the use of mounting such a PVC if it cannot be used?
- Will there be any problem during NodeUnstage?
> Actually, I could rerun connect anyway, without this check. If the path is "live" or "connecting" I would get an "already connected" error, so I could parse the error code and stderr, detect this case, move on, and not raise an error.

If the nvme subsys-list is not reliable, we should rerun the connect and discard known errors.
> If it's in the connecting state, will the application be able to read/write? I assume no. What is the use of mounting such a PVC if it cannot be used?

The "connecting" state we're checking happens right after the ControllerPublishVolume stage and before the mount stage.
Here's the timeline:
- CreateVolume - the volume is created on the gateway.
- ControllerPublishVolume - the host is added to the subsystem on the gateway.
- NodeStageVolume - this is where we check for existing connections. At this point we call nvme list-subsys to see what's already connected. If we see "connecting", it means a previous operation already initiated the connection (some previous PVC that's already deleted) and the kernel continues establishing the connection in the background.
- Before we actually mount, we wait for the namespace device to appear using GetNamespaceDeviceByUUID(), which retries until the device is accessible.
- Only when the device exists and is accessible do we proceed to format and mount it.
So, the key point is: We don't mount a PVC that's stuck in "connecting" state. We just avoid running nvme connect again (which would fail with "already connected"). We still wait for the device to become available before mounting.
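To illustrate that "wait for the device" step, here is a hedged sketch of a polling helper; the name echoes GetNamespaceDeviceByUUID() mentioned above, but the signature and the sysfs layout it assumes are illustrative, not the driver's real code:

```go
// Illustrative sketch only: poll until a block device exposing the given
// namespace UUID appears, or the context expires. The sysfs layout assumed
// here (/sys/block/nvme*/uuid) and the helper's signature are assumptions.
package nvmeofutil

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"time"
)

func waitForNamespaceDevice(ctx context.Context, uuid string) (string, error) {
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()
	for {
		// Look for a namespace whose reported UUID matches the one we expect.
		matches, _ := filepath.Glob("/sys/block/nvme*/uuid")
		for _, m := range matches {
			data, err := os.ReadFile(m)
			if err == nil && strings.TrimSpace(string(data)) == uuid {
				// /sys/block/<dev>/uuid -> /dev/<dev>
				return "/dev/" + filepath.Base(filepath.Dir(m)), nil
			}
		}
		select {
		case <-ctx.Done():
			return "", fmt.Errorf("device with uuid %s did not appear: %w", uuid, ctx.Err())
		case <-ticker.C:
			// keep polling until the kernel finishes connecting
		}
	}
}
```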
> If the nvme subsys-list is not reliable, we should rerun the connect and discard known errors
I don't think it is unreliable.
In some runs, nvme subsys-list already shows "live"; I think it is a matter of timing. Parsing the stderr is certainly a fitting (and easier) solution.
But I would avoid parsing the stderr string, in case the next nvme-cli version returns a different string.
I'm happy with this approach for now. If there is a need to always try to connect, we can do that at a later point in time too.
@Mergifyio queue
✅ The pull request has been merged automatically at 8df46e3
Merge Queue Status: ✅ The pull request has been merged at 4dc16ff. This pull request spent 35 minutes 10 seconds in the queue, including 34 minutes 52 seconds running CI.
Related issues
Fixes: #5964
Checklist:
- Commit titles and messages follow the guidelines in the developer guide.
- Reviewed the developer guide on Submitting a Pull Request.
- Pending release notes updated with breaking and/or notable changes for the next major release.
Show available bot commands
These commands are normally not required, but in case of issues, leave any of
the following bot commands in an otherwise empty comment in this PR:
/retest ci/centos/<job-name>: retest the <job-name> after an unrelated failure (please report the failure too!)