Datastore out of space, but no alerts generated by vCenter

Running out of datastore space can lead to system crashes and data loss, which can be catastrophic for your business. Yet it is not uncommon for datastore space to become critically low without vCenter generating any alert. In this blog post, we will explore the possible reasons why vCenter may not generate alerts for low datastore space and the steps you can take to mitigate this risk.

Possible Reasons Why vCenter is Not Generating Alerts

  1. Alert Thresholds Not Set: One of the most common reasons why vCenter may not generate alerts for low datastore space is that the alert thresholds are not set correctly. Alert thresholds tell vCenter when to notify administrators, for example when datastore usage rises above a certain percentage. If these thresholds are not set, or the usage threshold is set too high, then alerts will not be generated in time.
  2. Notification Settings Not Configured: Another reason why vCenter may not be generating alerts is that the notification settings are not configured correctly. Notifications can be sent via email or SNMP, and if these settings are not configured, then alerts will not be sent.
  3. Alerting Service is Disabled: In some cases, the alerting service may be disabled in vCenter. If this is the case, then no alerts will be generated even if the alert thresholds are set correctly.
  4. Datastore Metrics Not Being Monitored: Lastly, vCenter may not be generating alerts if the metrics for the datastore are not being monitored. If the datastore is not being monitored, then vCenter will not be able to detect when the space is running low and therefore will not generate any alerts.

Steps to Mitigate Risk of Running Out of Datastore Space

  1. Set Alert Thresholds: The first step in mitigating the risk of running out of datastore space is to set alert thresholds in vCenter. This can be done by navigating to the datastore in vCenter, clicking on the “Configure” tab, and then selecting “Alarm Definitions”. From there, you can set the thresholds for various metrics, such as free space percentage, and configure the notification settings.
  2. Test Alerting System: Once the alert thresholds have been set, it is important to test the alerting system to ensure that it is working correctly. This can be done by intentionally filling a test datastore past the threshold, or by temporarily lowering the threshold, and verifying that an alert is generated.
  3. Monitor Datastore Metrics: It is important to monitor the metrics for the datastore regularly to ensure that there are no issues. This can be done by setting up regular monitoring and reporting in vCenter or by using third-party monitoring tools.
  4. Increase Datastore Capacity: If you find that your datastore is frequently running out of space, it may be time to consider increasing the capacity of the datastore. This can be done by adding more disks to the datastore or by adding additional datastores to the cluster.
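
The threshold check described in the steps above can be sketched in a few lines of Python. This is a minimal illustration, not vCenter's own logic: the `Datastore` record, the datastore names, and the sample numbers are hypothetical, the 75% warning level is the figure used in this post, and the 85% critical level is an assumption you should replace with the values from your own alarm definitions.

```python
from dataclasses import dataclass

# Assumed thresholds: 75% is the figure used in this post; the 85% critical
# level is an illustrative assumption. Match both to your alarm definitions.
WARNING_PCT = 75.0
CRITICAL_PCT = 85.0


@dataclass
class Datastore:
    """Hypothetical record; in practice the numbers come from your inventory."""
    name: str
    capacity_gb: float
    free_gb: float


def used_pct(ds: Datastore) -> float:
    """Return the percentage of datastore capacity currently in use."""
    if ds.capacity_gb <= 0:
        raise ValueError(f"{ds.name}: capacity must be positive")
    return (ds.capacity_gb - ds.free_gb) / ds.capacity_gb * 100.0


def classify(ds: Datastore) -> str:
    """Map usage onto 'ok', 'warning', or 'critical'."""
    pct = used_pct(ds)
    if pct >= CRITICAL_PCT:
        return "critical"
    if pct >= WARNING_PCT:
        return "warning"
    return "ok"


def report(datastores):
    """Return {name: state} for every datastore that is not 'ok'."""
    return {ds.name: classify(ds) for ds in datastores if classify(ds) != "ok"}


if __name__ == "__main__":
    # Made-up sample inventory for illustration only.
    inventory = [
        Datastore("ds-prod-01", capacity_gb=2048, free_gb=1200),
        Datastore("ds-prod-02", capacity_gb=1024, free_gb=200),
        Datastore("ds-test-01", capacity_gb=512, free_gb=40),
    ]
    for name, state in report(inventory).items():
        print(f"{name}: {state}")
```

Note that a check like this only sees the capacity the host sees. It cannot detect a thin-provisioned LUN that is full on the array side, which is the case described later in this post.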



The issue I faced while supporting a customer:

Problem Statement:

Datastore out of space, but no alerts generated by vCenter

Analysis summary:

The LUN ran out of space, which caused the VMs to go down. No event was logged for the datastore running out of available space. The underlying LUN is thin provisioned. The datastore was created with the size the LUN advertised, and space was allotted on that basis. As a result, from the host's point of view the datastore wasn't full. On the storage side, however, the space backing the LUN was entirely utilised.

For example:

The actual storage space is 100 GB, but the thin-provisioned LUN was created with a size of 1 TB. When the datastore is created, it sees a size of 1 TB, not 100 GB. So even when 99.99 GB of the physical storage is used, the host still sees plenty of free space on the datastore, i.e. 900+ GB. Remember that vCenter only sends the alert when datastore usage reaches 75%, and 100 GB is nowhere near 75% of 1 TB.

Hence, neither the host nor vCenter raises an alert. The vmkernel logs, however, contain several events indicating storage space problems.
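
The arithmetic behind this example can be worked through as a short Python sketch. The figures are the ones from the example; treating 1 TB as 1024 GB and the variable names are my own assumptions for illustration.

```python
ALARM_THRESHOLD_PCT = 75.0  # usage level at which this post says vCenter alerts


def used_pct(used_gb: float, size_gb: float) -> float:
    """Percentage of size_gb consumed by used_gb."""
    return used_gb / size_gb * 100.0


# Figures from the example; 1 TB is taken as 1024 GB (an assumption).
lun_advertised_gb = 1024.0   # size the host and the datastore see
array_physical_gb = 100.0    # space that actually backs the thin LUN
written_gb = 99.99           # data written so far

host_view = used_pct(written_gb, lun_advertised_gb)
array_view = used_pct(written_gb, array_physical_gb)
alarm_fires = host_view >= ALARM_THRESHOLD_PCT

print(f"Datastore usage seen by the host: {host_view:.1f}%")
print(f"Usage of the physical space on the array: {array_view:.1f}%")
print(f"Datastore usage alarm fires: {alarm_fires}")
```

The host-side figure stays at roughly 10% while the array-side figure is effectively 100%, so a usage alarm evaluated against the datastore size never triggers.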

Next Action:

Check with the storage team whether the space backing the LUN needs to be increased on the array.

Detailed analysis:

2021-01-23T17:41:41.497Z cpu33:66009)<7>fnic : 2 :: Abort Cmd called Cmd=0x0x4395d8159400 CmdSn=0x6927c6042 FCID 0x40320, LUN 0x0 TAG a2 Op=0x89 flags 3

2021-01-23T17:41:41.497Z cpu33:66009)<6>fnic : 2 :: CBD Opcode: 89 Abort issued time: 8000 msec

2021-01-23T17:41:41.497Z cpu9:286594249)<6>fnic : 2 :: icmnd_cmpl abts pending hdr status = FCPIO_ABORTED tag = 0xa2 sc = 0x0x4395ca362240scsi_status = 0 residual = 0

2021-01-23T17:41:41.497Z cpu9:286594249)<7>fnic : 2 :: abts cmpl recd. id 162 status FCPIO_SUCCESS

2021-01-23T17:41:41.497Z cpu33:66009)<7>fnic : 2 :: Returning from abort cmd type 2 SUCCESS

### The fnic driver started aborting the I/Os at 2021-01-23T17:41:41.497Z (UTC)

2021-01-23T17:41:41.497Z cpu15:378449387)ScsiDeviceIO: 3015: Cmd(0x4395d8159400) 0x89, CmdSN 0x13c418a8 from world 65599 to dev "naa.60001440000000107058b8e339184a4b" failed H:0x8 D:0x0 P:0x0 Invalid sense data: 0x7 0x43 0x0.

### The device where the commands were getting aborted was "naa.60001440000000107058b8e339184a4b"

### Host Status [0x8] RESET: this status is returned when the HBA driver has aborted the I/O.

### Sense Key [0x7] DATA PROTECT <<< Indicates the space issue

### Additional Sense Data 43/00 MESSAGE ERROR

### The above messages are observed when the LUN is out of space from the storage side.
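
The manual decoding above can be repeated for many log lines with a small Python sketch. It is an illustration under stated assumptions: the regular expression is written against the `ScsiDeviceIO` line shown in this post and may need adjusting for other line formats, the sense key table is a subset of the names defined by the SCSI standard, and the host status table contains only the value discussed here. The names `decode`, `PATTERN`, and `SAMPLE` are hypothetical.

```python
import re

# Sense key names from the SCSI standard; only a subset is listed here.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
    0xB: "ABORTED COMMAND",
}

# Host status names; only the value discussed in this post is included.
HOST_STATUS = {0x8: "RESET"}

# Pattern written against the ScsiDeviceIO line quoted in this post.
PATTERN = re.compile(
    r'to dev "(?P<device>[^"]+)" failed '
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+) '
    r'(?P<validity>\w+) sense data: '
    r'(?P<key>0x[0-9a-fA-F]+) (?P<asc>0x[0-9a-fA-F]+) (?P<ascq>0x[0-9a-fA-F]+)'
)


def decode(line):
    """Return the decoded fields of a ScsiDeviceIO failure line, or None."""
    m = PATTERN.search(line)
    if m is None:
        return None
    host = int(m["host"], 16)
    key = int(m["key"], 16)
    return {
        "device": m["device"],
        "host_status": HOST_STATUS.get(host, f"unknown ({host:#x})"),
        "sense_validity": m["validity"],
        "sense_key": SENSE_KEYS.get(key, f"unknown ({key:#x})"),
        "asc": int(m["asc"], 16),
        "ascq": int(m["ascq"], 16),
    }


# The ScsiDeviceIO line from the analysis above.
SAMPLE = (
    '2021-01-23T17:41:41.497Z cpu15:378449387)ScsiDeviceIO: 3015: '
    'Cmd(0x4395d8159400) 0x89, CmdSN 0x13c418a8 from world 65599 to dev '
    '"naa.60001440000000107058b8e339184a4b" failed H:0x8 D:0x0 P:0x0 '
    'Invalid sense data: 0x7 0x43 0x0.'
)

if __name__ == "__main__":
    for field, value in decode(SAMPLE).items():
        print(f"{field}: {value}")
```

The sketch also reports whether the log flags the sense data as valid, so lines can be filtered on that field as well as on the sense key.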

Conclusion

Running out of space on your datastore can be a serious issue for any organization. While vCenter is a powerful tool for managing your virtual infrastructure, it is not infallible. By understanding the possible reasons why vCenter may not generate alerts for low datastore space and taking steps to mitigate the risk of running out of space, you can ensure that your virtual infrastructure remains stable and secure.
