Add host failure test to verify VM resiliency and SR stability #313
base: master
Conversation
If the test fails while the SR is in a failed state, could this leave the pool in a bad state?
If the test fails and the SR goes bad, I don't see an easy way to recover from it. Scenario-based troubleshooting could be required before cleaning up. In principle, though, XOSTOR should tolerate a single host failure, and this test should catch the cases where it doesn't.
We should probably use nested tests for this, so that if the pool goes bad we can just wipe it and start with a new, clean one.
or the SR used for physical tests should be a throw-away one too?
There are block devices involved (LVM, DRBD, tapdisk which gets blocked on IO), and an improper teardown needs either careful inspection and recovery, or a harsh wipe of everything plus reboots. A host failure is still tolerated better than a disk/LVM failure.
An XOSTOR SR is hardly a throw-away.
Well, if we use those block devices for nothing other than this test, we can easily blank them to restart from scratch. We do that for all local SRs, and in a way a Linstor SR is "local to the pool". If our test pool is the sole user, it looks throw-away to me.
Yes, this happens (manually) when the test needs a clean start. If that's acceptable, we can add it into prepare_test or similar so that a manual script is not required.
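For illustration, a minimal sketch of what such a prepare step could look like, assuming a `Host.ssh()` helper that takes the command as a list and a configuration entry listing the disks dedicated to the XOSTOR SR (both the helper signature and `LINSTOR_DISKS` are assumptions, not the project's actual API):

```python
# Hypothetical cleanup step: blank the disks dedicated to the XOSTOR SR so the
# pool can be rebuilt from scratch. LINSTOR_DISKS and the Host.ssh(list) call
# convention are assumptions; adapt to the real fixtures and configuration.
LINSTOR_DISKS = ["/dev/sdb"]  # assumed: disks used only by the test SR

def wipe_linstor_disks(hosts):
    for host in hosts:
        for disk in LINSTOR_DISKS:
            # wipefs -a clears LVM/DRBD signatures so the device looks blank again
            host.ssh(["wipefs", "-a", disk])
```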
@ydirson do we need a new fixture to handle VM reboots (during the host's failed and recovered states)? https://github.com/xcp-ng/xcp-ng-tests/pull/313/files#diff-e40824d600ab1c5614cf60bf13e30d8bea1634a03c0df205b9cb1a15239a8505R162-R164
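For what it's worth, a rough sketch of what such a fixture could look like, reusing the `vm_on_linstor_sr` fixture name from the diff; `vm.is_running()` is an assumption about the VM helper API, and `vm.shutdown(verify=True)` mirrors the call already used in the test body:

```python
import pytest

# Hypothetical fixture: hand the VM to the test and make sure it is shut down
# again on teardown, even if the test bailed out while a host was down.
# vm.is_running() is assumed to exist; shutdown(verify=True) mirrors the test body.
@pytest.fixture
def vm_surviving_host_failure(vm_on_linstor_sr):
    vm = vm_on_linstor_sr
    yield vm
    if vm.is_running():
        vm.shutdown(verify=True)
```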
```python
hosts.remove(sr.pool.master)
# Evacuate the node to be deleted
try:
    random_host = random.choice(hosts) # TBD: Choose Linstor Diskfull node
```
Suggested change:
```diff
-    random_host = random.choice(hosts) # TBD: Choose Linstor Diskfull node
+    random_host = random.choice(hosts) # TBD: Choose Linstor Diskful node
```
done
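As an aside, one possible way to eventually resolve the `TBD: Choose Linstor Diskful node` above could be to keep only the hosts that carry the LVM volume group backing the LINSTOR storage pool. The VG name and the `ssh_with_result()` helper (assumed to return an object with a `returncode`) are assumptions about the environment, not verified against the framework:

```python
import random

# Hypothetical helper: pick a diskful LINSTOR node by checking which hosts
# carry the LVM volume group backing the storage pool. The VG name is an
# assumption and depends on how the XOSTOR SR was created.
def choose_diskful_host(hosts, vg_name="linstor_group"):
    diskful = [
        h for h in hosts
        if h.ssh_with_result(["vgs", vg_name]).returncode == 0
    ]
    assert diskful, "no diskful LINSTOR node found among the candidate hosts"
    return random.choice(diskful)
```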
```python
    Fail non master host from the same pool Linstor SR.
    Ensure that VM is able to boot and shutdown on all hosts.
    """
    import random
```
random is a common module; perhaps we should put it in the global imports, in case a future function in this module also uses it?
done
Ping @rushikeshjadhav. What's the status after the various comments made above?
Will rebase and update the PR.
Hello @rushikeshjadhav, any update about the review and the rebase mentioned above?
… host.

- Chooses a host within a LINSTOR SR pool and simulates a crash using sysrq-trigger.
- Verifies VM boot and shutdown on all remaining hosts during the outage, and confirms recovery of the failed host for VM placement post-reboot.
- Ensures SR scan consistency post-recovery.

Signed-off-by: Rushikesh Jadhav <[email protected]>
73e2c06 to ebb1e7d
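For context, the crash injection described in the commit message could look roughly like the sketch below. The `Host.ssh(list)` calling convention is an assumption, and the command detaches on the remote side so the SSH call returns before the kernel dies:

```python
# Hypothetical crash injection matching the commit message: ask the chosen
# host to crash its kernel via sysrq. The (sleep; echo) subshell is sent to
# the background with output redirected so the SSH session can close cleanly
# before the crash actually happens.
def crash_host(host):
    host.ssh([
        "sh", "-c",
        "(sleep 1; echo c > /proc/sysrq-trigger) >/dev/null 2>&1 &",
    ])
```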
It's ready.
| """ | ||
| sr = linstor_sr | ||
| vm = vm_on_linstor_sr | ||
| # Ensure that its a single host pool and not multi host pool |
This comment, as I understand it, says the opposite of what is checked on the line below.
IMO, for simple code like this one, a comment doesn't enhance the clarity and might become out of sync with the code. I would just remove it.
```python
    vm.wait_for_os_booted()
    vm.shutdown(verify=True)

    # Wait for random_host to come online
```
The random host is rebooting while we test that the hosts can run a VM, but I don't see anything to ensure that the random host was still unusable at that point.
Could you add an `assert random_host.is_ssh_up()` or something like that here?
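Something like the following minimal sketch (note the negation, since the point is to prove the host is still down at that stage); `is_ssh_up()` is taken from the suggestion above and assumed to exist on the Host object:

```python
# Assumed check: the crashed host must still be unreachable while the other
# hosts are exercised, otherwise the test no longer covers the outage window.
assert not random_host.is_ssh_up(), \
    "random_host came back too early; the outage is no longer being tested"
```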
We are not checking VM resiliency only in the "host down" scenario; the intention was to let the other hosts take over (via HA or anything) and not be contingent on the "random host" state. Does it work?
I'm not sure I understand. Is this not a test to validate the availability of the Linstor SR when one of the Linstor hosts is down?
Added `test_linstor_sr_fail_host` to simulate a crash of a non-master host.