Foglight for Storage Management Shared 4.8


	NOTE: When a component is selected, its detail views display the component’s Status followed by its State. Status is determined by Foglight for Storage Management as described above. State refers to the physical state of a component as reported by the vendor; if the vendor does not provide the physical state, the state is unknown. A component’s physical state may affect its status only when an enabled rule triggers alarms based on state. Consult with your Foglight Administrator if you want to enable or create rules that perform this check.

A storage device often has large numbers (thousands) of child components. With a few exceptions, alarms on child components do not change the status of the parent device. For example, a failed disk may have a Fatal status, but because arrays are designed to cope with a failed disk, the parent device continues to display a Normal status. The parent device status may be changed by child components in the following circumstances:

•

Controller problems typically affect the performance of the storage array or filer, so the parent device inherits the alarm status of the controller. Pool alarms reflect capacity issues and performance problems that can affect many users of the storage array or filer, so the parent device inherits the alarm status of the pool or aggregate.

•

When a significant percentage of child components have problems (for example, many disks are failing), the problems may be indicative of a systemic problem. In this case, the parent device status changes when the number of affected child components reaches a threshold defined in rules.
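
The roll-up behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the product's rule implementation: the component type names, the 25% threshold, and the Warning status assigned when the threshold is reached are all assumptions chosen for the example. The actual values are defined in rules.

```python
from enum import IntEnum


class Status(IntEnum):
    NORMAL = 0
    WARNING = 1
    CRITICAL = 2
    FATAL = 3


# Assumption: child types whose alarm status the parent inherits directly.
INHERITED_TYPES = {"controller", "pool"}


def parent_status(children, threshold_fraction=0.25,
                  threshold_status=Status.WARNING):
    """Return the parent device status for a list of (type, Status) children.

    threshold_fraction and threshold_status are hypothetical stand-ins for
    values that the real rules would define.
    """
    status = Status.NORMAL

    # Controllers and pools pass their alarm status up to the parent.
    for child_type, child_status in children:
        if child_type in INHERITED_TYPES:
            status = max(status, child_status)

    # Other children affect the parent only when enough of them have problems.
    affected = sum(1 for _, s in children if s > Status.NORMAL)
    if children and affected / len(children) >= threshold_fraction:
        status = max(status, threshold_status)

    return status
```

With these assumed values, one failed disk out of 100 leaves the parent at Normal, while a single Critical controller raises the parent to Critical.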

For information about changing default rules and alarm settings, see Managing Foglight for Storage Management Rules.

Reviewing the Status of All Devices

Use the Monitoring tab in the Storage Environment dashboard to gain a high-level understanding of the status of the devices in your environment, organized by device type. For a general description of the dashboard, see Introducing the Storage Environment Dashboard.

To monitor the storage environment:

On the navigation panel, under Dashboards, click Storage & SAN > Storage Environment.

Click the Monitoring tab.

The time range and selected tile are the same as the last time the dashboard was opened.

Optional—Change the Time Range. For an initial review, use the default time range.

Scan the lower part of the tiles.

If any of the tiles show that there are devices with the Fatal, Critical, or Warning status, do one of the following:

•

To view all levels of alarms, click the top part of a tile that has the highest number of devices with the most severe errors. Go to the next section, Assessing Storage Alarms.

•

To view alarms at one severity level only, such as all Critical alarms, click the Critical status count. Go to the next section, Assessing Storage Alarms.

If all tiles report that devices are in the Normal status, it means that the devices are operating within acceptable parameters. The next step is to review the devices to see if any of their child components show an alarm status. See one of the following topics:

•

Monitoring Fabrics

•

Monitoring Storage Arrays

•

Monitoring Filers

Assessing Storage Alarms

When a storage device or component enters an unacceptable state (as defined in a rule), the rule that monitors the entity triggers an alarm and sets the status of the resource. Examine the alarm messages starting with the most severe alarms.

This walkthrough assumes that you are looking at alarms of all severity levels in an Alarm Summary view. In many places in the software you can restrict your assessment to resources with the same alarm severity. This can be useful when you want to prioritize your alarm assessment, such as focusing on all storage arrays with Critical alarms first, and then on the Warning alarms.
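
The prioritization described above amounts to ordering alarms by severity and, optionally, filtering to a single level. A minimal sketch follows, assuming alarms exported as dictionaries with a hypothetical "severity" key; this is not a Foglight API.

```python
# Lower number = more severe; matches the order in which alarms are reviewed.
SEVERITY_ORDER = {"Fatal": 0, "Critical": 1, "Warning": 2}


def triage(alarms, only=None):
    """Return alarms sorted most severe first.

    alarms: list of dicts with a "severity" key (hypothetical structure).
    only:   optional severity name to restrict the list to one level.
    """
    selected = [a for a in alarms if only is None or a["severity"] == only]
    # sorted() is stable, so alarms of equal severity keep their input order.
    return sorted(selected, key=lambda a: SEVERITY_ORDER[a["severity"]])
```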

To assess alarms:

In the Alarm Summary view, review the alarm messages to understand the issues. If the list contains multiple levels of alarm severity, start with the highest severity alarm.


	TIP: If you see alarms on devices or components that you think are operating within acceptable parameters, consider creating new rules to better suit your environment. For more information, see Managing Foglight for Storage Management Rules.

If you need more details to understand the issue, click the alarm message.

An Alarm window displays more information about the alarm and troubleshooting tips.


	TIP: To bypass the Alarm window and go straight to the Choose Diagnostic Focus Time window (described in the next step), click an instance name instead of the message.

From the Troubleshooting pane, click the Diagnose button.

Choose the time period to use as your diagnostic time range:

•

Explore at Storage Alarm time. Select this option when you want to view diagrams and other details of the affected component at the time the alarm occurred. Shows data for the time period leading up to and including the alarm time. For example, given an alarm time of 10:32 AM and the default four-hour time range, the diagnostic time range is set to 6:32 AM – 10:32 AM.

•

Explore at Default Diagnostic time. Select this option when you want to determine if the situation causing the alarm persisted or if it resolved on its own. Shows data before and after the alarm, with the alarm time positioned three quarters of the way into the time range. For example, given an alarm time of 10:32 AM and the default four-hour time range, the diagnostic time range is set to three hours before the alarm and one hour after the alarm, that is, 7:32 AM – 11:32 AM. If the current time falls within the range (for example, it is currently 11:05 AM), the range ends at the current time, that is, 7:32 AM – 11:05 AM.

A component dashboard opens with its time range set to the selected diagnostic time range.
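
The arithmetic behind the two diagnostic time ranges can be reproduced with a short Python sketch. The function names are hypothetical and the code only mirrors the examples given in the options above; it is not how the product computes the range internally.

```python
from datetime import datetime, timedelta


def storage_alarm_range(alarm_time, span=timedelta(hours=4)):
    """Range that ends at the alarm time (Explore at Storage Alarm time)."""
    return alarm_time - span, alarm_time


def default_diagnostic_range(alarm_time, now, span=timedelta(hours=4)):
    """Range with the alarm three quarters of the way in
    (Explore at Default Diagnostic time), cut off at the current time."""
    start = alarm_time - span * 0.75
    end = min(alarm_time + span * 0.25, now)
    return start, end


alarm = datetime(2024, 1, 15, 10, 32)  # example date; only the time matters
print(storage_alarm_range(alarm))
print(default_diagnostic_range(alarm, now=datetime(2024, 1, 15, 11, 5)))
```

For an alarm at 10:32 AM, the first call yields 6:32 AM to 10:32 AM, and the second yields 7:32 AM to 11:05 AM because the current time falls inside the range.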

Review the component dashboard to better understand the data that led to the alarm. If you navigate to other dashboards, the diagnostic time range remains the same.

When you complete your investigation, in the breadcrumbs, click Storage Environment to return to the Choose Diagnostic Focus Time window. If desired, choose the other diagnostic time range.

When you are finished, close the Choose Diagnostic Focus Time window, and in the Alarm window click one of the following options:

•

Acknowledge. Continues to display the alarm, but it is marked as acknowledged until the alarm is triggered again. For example, for Warnings, an appropriate action may be to acknowledge the alarm and ignore it.

•

Acknowledge Until Normal. Continues to display the alarm, but it is marked as acknowledged until the affected component returns to the Normal status. This is useful when a component has failed and you want to know when it is replaced.

•

Clear. Deletes the alarm. Choose this option when the situation is resolved.

Close the Alarm window.


	TIP: When you close the window, the time range returns to the time range in use before your alarm analysis. If it does not, in the Time Range either click the Frozen Time Range icon to return to real time or click the arrow to expand the zonar and set the range. For more information, see “Working in a Current or a Diagnostic Time Range” in the online help.

Take action to resolve the issue in your storage infrastructure, either by yourself or by notifying the appropriate person.

Monitoring Fabrics

Foglight for Storage Management provides insight into both physical and virtual fabrics available with Brocade and Cisco Fibre Channel (FC) switches. A physical fabric is a group of interconnected FC switches. The definition of a virtual fabric differs depending on the vendor:

•

Brocade switches enable customers to group ports on physical switches into logical switches. Logical switches and physical switches can then be interconnected into virtual fabrics. Brocade creates logical ISL (LISL) ports to interconnect logical switches. No metrics are available for LISL ports.

•

Cisco switches enable customers to create virtual storage area networks (VSANs) partitioned from a physical fabric. A VSAN is a logical group of ports, where the ports are located on one or more of the interconnected FC switches that form the physical fabric.

Fabrics are displayed in the Fabrics quick view of the Storage Environment dashboard. When you expand a fabric branch to view its components, the list of components varies depending on the type of fabric.

This walkthrough introduces the quick views for fabrics and their components.

To monitor fabrics, switches, and VSANs:

On the Storage Environment dashboard, ensure the Monitoring tab is selected.

Click the Fabrics tile to open the Fabrics quick view.


	TIP: You can also open this quick view from the navigation panel. For more information, see Introducing the Storage Explorer.

To identify the busiest fabrics in your environment, in the Fabrics list, click Summary.

The Fabrics Summary (All Fabrics) panel opens. The Fabrics view identifies the top three fabrics with the highest average values for Data Rate, Link Error Rate, and Non-Link Error Rate, respectively. The FC Switches view identifies the top three switches in terms of the same metrics. The charts plot the metric values over the time period, while the tables show the average and current values for each component.
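
To make the "top three by average" ranking concrete, the following Python sketch ranks components by the mean of a metric sampled over the time range. The input structure (a dictionary of sample lists keyed by component name) is an assumption for illustration; the dashboard performs this ranking for you.

```python
from statistics import mean


def top_three(samples):
    """Return the names of the three components with the highest average.

    samples: dict mapping a fabric or switch name to the list of metric
             values (for example, Data Rate) collected over the time range.
    """
    # Skip components with no samples so mean() is never given an empty list.
    averages = {name: mean(values) for name, values in samples.items() if values}
    return sorted(averages, key=averages.get, reverse=True)[:3]
```

The same function applies to each metric in turn (Data Rate, Link Error Rate, Non-Link Error Rate), which is why a component can appear in one top-three table but not in another.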

To investigate one of the top three fabrics or switches:

•

To explore a top fabric, click its name in a table. See Exploring a Fabric.

•

To explore a top switch, click its name in a table. See Exploring a Switch.

•

To return to this quick view, in the breadcrumbs, click Storage Environment.

To monitor the performance of a fabric, in the Fabrics list, click a fabric name.

In the Fabric Summary (Selected Fabric), the Related Inventory view contains alarm summaries for the selected fabric as well as its switches, ISL ports, N ports, and VSANs (Cisco fabrics only). The Resource Utilization charts display the following metrics for ISL ports (left) and N ports (right) used by the fabric:

•

Avg Utilization Distribution. For each type of port, displays aggregated values for Rcvd Utilization and Xmit Utilization grouped by percentage of usage. Most of your port utilization should be in the lower percentages. When there are ports performing at high utilization rates, you may want to investigate port performance further.

•

Data Rate. For each type of port, plots aggregated values for Data Receive Rate and Data Send Rate over the time period and displays the Baseline.

•

Error Rate. For each type of port, plots aggregated values for Link Error Rate and Non-Link Error Rate over the time period.
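
A utilization distribution of this kind is essentially a histogram of port utilization percentages. The sketch below counts ports per 10% bucket; the bucket width and the input format are assumptions for illustration, and the chart's actual groupings may differ.

```python
def utilization_distribution(utilizations, bucket_width=10):
    """Count ports per utilization bucket.

    utilizations: iterable of average utilization percentages (0 to 100),
                  one value per port.
    Returns a list where index 0 covers 0-10%, index 1 covers 10-20%, etc.
    """
    buckets = [0] * (100 // bucket_width)
    for value in utilizations:
        # A port at exactly 100% is counted in the last bucket.
        index = min(int(value // bucket_width), len(buckets) - 1)
        buckets[index] += 1
    return buckets
```

A healthy result has most of its counts near the start of the list. Counts in the last few buckets point to ports worth investigating further.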

To continue investigating the selected fabric:

•

To explore details about the fabric, its switches, and its ports, click View in Explorer. See Exploring a Fabric.

•

To explore an FC switch in the selected fabric, in the Related Inventory view click FC Switches or an alarm icon, and select a switch. See Exploring a Switch.

•

To investigate a port used in the selected fabric, in the Related Inventory view click either ISL Ports or N Ports or an alarm icon, and select a port. See Investigating an FC Switch Port.

•

Cisco fabrics only: To investigate a VSAN used in the selected fabric, in the Related Inventory view click VSANs or an alarm icon, and select a VSAN. See Exploring a Cisco VSAN.

•

To return to the quick view, in the breadcrumbs, click Storage Environment.

To monitor the performance of a switch (physical or logical), in the Fabrics list, expand a fabric and click a switch.

In the FC Switch Summary, the Related Inventory view contains alarm summaries for the selected switch, the fabric it belongs to, and its ISL ports and N ports. The charts display the following metrics for ISL ports and N ports used by the switch:

•

Ports Average Utilization Distribution. For each type of port, displays aggregated values for Rcvd Utilization and Xmit Utilization grouped by percentage of usage. Most of your port utilization should be in the lower percentages. When there are ports on the switch performing at high utilization rates, you may want to investigate port performance further.

•

Rcv Rate. For each type of port, plots aggregated values for Data Receive Rate over the time period and displays the Baseline.

•

Xmit Rate. For each type of port, plots aggregated values for Data Send Rate over the time period and displays the Baseline.

•

Error Rate. For each type of port, plots aggregated values for Link Error Rate and Non-Link Error Rate over the time period.