If your vSAN Witness is partitioned, the vSAN Witness appliance cannot communicate with one or more of the hosts in the vSAN cluster. This can happen for a number of reasons, such as network connectivity issues, misconfiguration of the hosts, or hardware failures. In this blog post, we will walk through the steps you can take to fix the issue.
Step 1: Verify network connectivity
The first step in troubleshooting a partitioned vSAN Witness appliance is to verify network connectivity. Make sure that the vSAN Witness appliance can communicate with all hosts in the vSAN cluster using the management network. Then check the network configuration of each host and correct any misconfiguration you find.
To validate vSAN network connectivity, see the blog post “Verify network connectivity between Data nodes and Witness”.
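As a rough sketch of this check, the helper below pings a list of addresses from one VMkernel interface in the ESXi shell. The function name check_vsan_peers, the PING_CMD override, the interface vmk1, and the 192.0.2.x addresses are illustrative assumptions, not values from any real environment; the -I and -c options of vmkping are standard.

```shell
# Hypothetical helper: ping each address from a chosen VMkernel interface.
# PING_CMD defaults to ESXi's vmkping; override it to dry-run on another system.
check_vsan_peers() {
    vmk="$1"; shift            # first argument: VMkernel interface, e.g. vmk1 (assumed)
    failed=0
    for ip in "$@"; do         # remaining arguments: addresses to test
        if ${PING_CMD:-vmkping} -I "$vmk" -c 3 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    return "$failed"           # non-zero if any address did not answer
}
```

For example, run check_vsan_peers vmk1 192.0.2.10 192.0.2.11 on the witness with the data node addresses, then repeat from each data node with the witness address. Substitute the interface that actually carries vSAN or witness traffic in your setup.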
Step 2: Verify vSAN configuration
Once you have verified that the network connectivity is not the issue, the next step is to verify the vSAN configuration. Ensure that the vSAN cluster is configured correctly and that all hosts are configured with the correct vSAN network settings. Check that the vSAN cluster is healthy and that there are no issues with any of the components.
Skyline Health is an easy way to validate the health status of configuration parameters other than Objects.
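From the command line, a partition usually shows up as a lower member count than expected in the output of esxcli vsan cluster get. The sketch below assumes that output contains a line labelled "Sub-Cluster Member Count", which is the label used on recent ESXi releases as far as we know; verify it on your build. The function names and the expected count are hypothetical.

```shell
# Reads `esxcli vsan cluster get` output on stdin and prints the member count.
# Assumes a line such as "Sub-Cluster Member Count: 3" is present.
member_count() {
    awk -F': *' '/Sub-Cluster Member Count/ { gsub(/[^0-9]/, "", $2); print $2 }'
}

# Compares the reported member count with the expected one (first argument).
check_membership() {
    expected="$1"
    actual=$(member_count)
    if [ "$actual" = "$expected" ]; then
        echo "membership OK ($actual members)"
    else
        echo "possible partition: expected $expected members, found ${actual:-none}"
        return 1
    fi
}
```

On a two-node cluster with one witness you would run esxcli vsan cluster get | check_membership 3 on each node. A node that reports fewer members than the others is on the wrong side of the partition.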
Step 3: Check logs
If the network and vSAN configuration are both correct, the issue may be with the vSAN Witness appliance itself. Check the vSAN Witness appliance logs for any errors or warnings that may indicate the cause of the partition. To access the vSAN Witness appliance logs, connect to the appliance using SSH and run the following command:
# grep -E "LeaderRemoveNodeFromMembership|Lost contact" /var/run/log/vmkernel.log
A LeaderRemoveNodeFromMembership event shows that the cluster lost access to the witness node: the heartbeat was lost, and hence the witness node was removed from the cluster.
If the events also report that no path was found to the witness IP, the issue is network related. It may not be a complete disconnect; slowness that delays UDP packets to port 12321 can have the same effect. This needs to be investigated from the in-house network side.
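To get a quick feel for how often these events occur, a small helper can count each pattern separately. This is only a sketch: scan_witness_events is a made-up name, and the default path is simply the log file used in the command above.

```shell
# Hypothetical helper: count witness-related events in a vmkernel log.
# Argument: path to the log (defaults to the live vmkernel.log).
scan_witness_events() {
    log="${1:-/var/run/log/vmkernel.log}"
    for pat in "LeaderRemoveNodeFromMembership" "Lost contact"; do
        # grep -c prints 0 and exits non-zero when nothing matches, hence || true
        n=$(grep -c -- "$pat" "$log" || true)
        echo "$pat: $n"
    done
}
```

Running scan_witness_events with no argument scans the current log. A steadily growing count points to an intermittent network problem rather than a single outage.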
Step 4: Restart the vSAN Witness appliance
If you cannot find the cause of the partition from the logs, you can try restarting the vSAN Witness appliance. This will refresh the network and vSAN configurations and may resolve the issue. To restart the vSAN Witness appliance, connect to the appliance using SSH and run the following command:
Step 5: Re-add the witness node to cluster
If restarting the appliance does not fix the issue, you may need to remove the witness node and add it back to the cluster. Follow these steps in order:
- Put the witness node in maintenance mode
- Remove the fault domain configuration
- Remove the disk groups from the witness node
- Erase the vSAN partitions from the disks
- Re-create the fault domain configuration and select the same witness for it
Step 6: Contact VMware Support
If none of the above steps resolves the issue, you may need to contact VMware Support for further assistance. Provide them with the logs and any other relevant information to help them diagnose the issue.
In conclusion, a partitioned vSAN Witness appliance can be caused by a number of factors, including network connectivity issues, vSAN misconfiguration, or issues with the vSAN Witness appliance itself. By following the above steps, you can troubleshoot the issue and resolve it in a timely manner.