Foglight for Storage Management Shared 4.6

Analyzing Storage Issues

If the view for a datastore or RDM disk extent shows the Attention icon, the troubleshooting algorithm has discovered evidence of a performance problem related to storage. The problem may or may not be in the SAN Storage environment. Review the details to determine the cause of the performance issue.

Each datastore/RDM view has three summary panels (from left to right):

• VM I/O to Datastore/RDM (first panel)
• Latency for Disk Extents (middle panel)
• Diagnosis (last panel)

A virtual machine may be connected to multiple datastores and RDM disk extents, each of which may report varying degrees of problems. When a virtual machine has more than one datastore/RDM view, start by scanning the timeline bars in the VM I/O to Datastore/RDM panel to identify a datastore/RDM with consistently slow I/O performance or significant changes from typical performance.

The following workflow describes one way to identify a latency problem in the collected SAN Storage environment. While the details in your investigation may differ, the general workflow should be similar to this one.

To analyze storage issues:

In a view showing the Attention icon, scan the VM I/O to Datastore/RDM summary (first panel). Look for timeline bars that are mostly a color other than green, such as yellow, orange, or pink. Green represents acceptable activity.

In this example, the VM Latency vs Threshold timeline is orange, which means the virtual machine is consistently exceeding the default latency thresholds that were specified for the analysis. We should focus our investigation here.

The VM Latency vs Typical timeline is green, which means that the latency is typical for the time period. In other words, this behavior has been going on for some time. The typical values are statistical values determined by IntelliProfile from activity within the last 30 days.
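The two timelines answer different questions, which is why one can be orange while the other is green. The following Python sketch illustrates that idea only. The function name, the use of an average, and the typical range are assumptions made for illustration, not the product's actual algorithm; the 25 ms value is one of the default thresholds cited in this workflow.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TimelineReading:
    exceeds_threshold: bool  # what a "vs Threshold" bar would react to
    within_typical: bool     # what a "vs Typical" bar would react to


def read_timelines(latencies_ms, threshold_ms, typical_low_ms, typical_high_ms):
    """Compare average latency with a fixed threshold and with a typical range.

    Hypothetical helper: the real product may aggregate samples differently.
    """
    avg = mean(latencies_ms)
    return TimelineReading(
        exceeds_threshold=avg > threshold_ms,
        within_typical=typical_low_ms <= avg <= typical_high_ms,
    )


# Hypothetical samples: latency is high, but it has been high all month.
reading = read_timelines(
    latencies_ms=[195.0, 205.0, 198.0, 202.0],
    threshold_ms=25.0,       # default threshold cited in this workflow
    typical_low_ms=150.0,    # assumed typical range for illustration
    typical_high_ms=250.0,
)
print(reading)
```

Here the latency exceeds the threshold and also falls inside the typical range, which matches the orange and green combination described above: a long-standing problem rather than a sudden change.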

Now look at the Latency for Disk Extents summary (middle panel) to identify the disk extents that are contributing to the problem.


	NOTE: When a datastore is connected to a NASVolume, this panel is empty.

In this case, there is only one disk extent attached to the datastore. Its timeline is orange, which means the disk extent is exceeding latency thresholds. The number in brackets indicates how much I/O the virtual machine was performing to the disk extent while the VM was experiencing latency. The larger the number, the more I/O was occurring. When this number is zero, no I/O occurred while the VM was experiencing latency.


	TIP: In your own investigations, you may have more than one disk extent, in which case you may be able to see that one disk extent is slow while the others are normal. Or you may have multiple virtual machines sharing the same disk extent. In this case, the Diagnosis panel may display a message to let you know that the latency may not reflect this virtual machine alone.

Next, review the notes in the Diagnosis summary (last panel).

In this case, one of the notes describes a correlation between the virtual machine latency and the disk extent latency. The other note points toward a problem in the SAN Storage or the network. Below the notes is a button that begins an analysis of the SAN Storage.
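To get a feel for what a correlation between the two latency series means, you can compute one yourself from exported samples. The sketch below uses the Pearson coefficient. This is an assumption chosen for illustration, since this guide does not state which correlation method the troubleshooting algorithm uses, and the sample values are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical latency samples in ms, taken at the same timestamps.
vm_latency = [190.0, 210.0, 205.0, 185.0, 200.0]
extent_latency = [170.0, 190.0, 185.0, 165.0, 180.0]

r = pearson(vm_latency, extent_latency)
print(f"correlation: {r:.2f}")
```

A value close to 1 means the two series rise and fall together, which is consistent with the disk extent contributing to the virtual machine latency. It does not prove the cause on its own.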


	NOTE: If the troubleshooting algorithm determines that the issue is likely in the SAN Storage environment, and if the SAN Storage is provided by an element that the Foglight for Storage Management system is monitoring, the Analyze SAN Storage button appears in the Diagnosis panel.

Before analyzing the SAN Storage, you may want to quantify the performance issue by reviewing the metric values that underlie the timeline bars in the summary panels.

In the VM I/O to Datastore/RDM summary, click the Chart icon.

The charts show the values of the metrics over the time period. Some charts also contain a baseline range, which shows a statistical range of values encountered over the last 30 days. Spikes outside of the normal range represent a significant change in behavior that may warrant further investigation.
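As a rough illustration of how a baseline range separates normal values from spikes, the sketch below derives a range from historical samples and flags anything outside it. The mean plus or minus two standard deviations rule is an assumption for illustration only; this guide does not describe how IntelliProfile computes its baseline, and all sample values are hypothetical.

```python
from statistics import mean, stdev


def baseline_range(history, width=2.0):
    """Return (low, high) as mean +/- width sample standard deviations."""
    centre = mean(history)
    spread = stdev(history)
    # Latency cannot be negative, so clamp the lower bound at zero.
    return max(0.0, centre - width * spread), centre + width * spread


def spikes(samples, low, high):
    """Return (index, value) pairs for samples outside the baseline range."""
    return [(i, v) for i, v in enumerate(samples) if not low <= v <= high]


# Hypothetical history (ms) standing in for the last 30 days of samples.
history = [18.0, 22.0, 20.0, 19.0, 21.0, 20.0, 23.0, 17.0]
low, high = baseline_range(history)

# Hypothetical samples from the time period under investigation.
current = [20.0, 21.0, 60.0, 19.0]
print(spikes(current, low, high))
```

Only the third sample falls outside the derived range, so it is the one that would stand out as a spike on the chart.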

In this case, the top chart shows that the VM I/O latency is hovering around 200 ms, well above the default 25 ms and 35 ms latency thresholds.

Close the window.

In the Latency for Disk Extents summary, click the Chart icon.

The chart shows the disk extent latency values hovering around 180 ms. The latency here may reflect the activity of multiple VMs doing I/O or the performance within the ESX itself. It is clear that a significant delay is occurring in the disk extent.

Close the window.

In the Diagnosis panel, click Analyze SAN Storage.

The Storage-Side Analysis window breaks down performance by the Extent, the LUN to which it belongs, and the pool to which the LUN belongs.

This view shows that the disk extent is slow for all virtual machines and for the selected virtual machine. The disk extent to LUN I/O performance also exceeds latency thresholds, though its current performance is within the baseline range, which shows a statistical range of typical values encountered over the last 30 days. This indicates the high latency has been occurring for some time. If you want to see the metric values, click the Chart icon.

The next step in the investigation depends on the state of the pool.

• If the pool timeline bars are green, the investigation is complete.
• If the pool timeline bars are a color other than green, you can analyze the changes within the pool and the load on the pool. Continue the investigation by following the workflow in Analyzing the Pool.

In this example, the pool’s Avg Queue Depth timeline bar is yellow. The note beside the bar suggests that this may be a cause of slow I/O to the LUN. The pool warrants further investigation.

Analyzing the Pool

When pool timeline bars show abnormal average queue depth or ops rate, analyze the changes within the pool and the load on the pool.

• Perform Pool Change Analysis. The Pool Change analyzer identifies the LUNs primarily responsible for increased I/O. It compares LUN activity in the problem time range with LUN activity during the same time range in the past. Changes are reported in terms of average operations rate and change amount.
• Perform Pool Load Analysis. The Pool Load analyzer identifies the busiest LUNs and ranks them based on their activity during the same time range over the last 30 days (not the current time frame). Activity is measured in operations per second.
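The difference between the two analyzers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the product's implementation: the function names, the dictionary layout, and the LUN names and sample values are all hypothetical.

```python
def change_analysis(problem_ops, past_ops):
    """Rank LUNs by increase in average ops rate versus the past time range."""
    rows = []
    for lun, samples in problem_ops.items():
        if not samples:
            continue  # no activity recorded in the problem time range
        now = sum(samples) / len(samples)
        past_samples = past_ops.get(lun, [])
        past = sum(past_samples) / len(past_samples) if past_samples else 0.0
        rows.append((lun, now, now - past))
    # Largest increase first: these LUNs are most responsible for added I/O.
    return sorted(rows, key=lambda row: row[2], reverse=True)


def load_analysis(history_ops):
    """Rank LUNs by average ops rate over the historical time range."""
    averages = {lun: sum(s) / len(s) for lun, s in history_ops.items() if s}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical ops/s samples per LUN for the problem range and the past range.
problem = {"lun-a": [900.0, 1100.0], "lun-b": [300.0, 300.0], "lun-c": [50.0, 70.0]}
past = {"lun-a": [400.0, 600.0], "lun-b": [280.0, 320.0], "lun-c": [55.0, 65.0]}

for lun, avg_now, delta in change_analysis(problem, past):
    print(f"{lun}: avg {avg_now:.0f} ops/s, change {delta:+.0f}")

print(load_analysis(past))
```

In this sketch the change analysis surfaces the LUN whose activity grew the most, while the load analysis ranks LUNs by how busy they have been historically, regardless of any recent change.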

To analyze the pool:

From the Storage-Side Analysis window or the Pool Explorer window, click Perform Pool Change Analysis.