Foglight for Application Operations 5.9.8.5


	TIP: You can add and remove columns to customize the information that appears in this table by clicking the customizer icon to the right of the Search box.

To triage an application:

Review the Service Level Compliance and User columns of the table to locate the tier of the service that is experiencing performance degradation.

In this case, the User tier icon indicates an issue for users.

The service level compliance icon indicates that the service has not yet violated the service level compliance policy in Foglight. However, if the issue persists, this icon may change to indicate a violation.

Review the Transactions tab to determine the impact on users.

Figure 36. End User Transactions tab

End user tiles are grouped by transaction: MD1Patient (top row), MD1Physician (middle row), and MD1Admin (bottom row). They are also ordered by severity (that is, the users with the worst health status or most alarms appear at the top).

Here you see that one transaction is having an issue, with both the real user tile and synthetic tiles reflecting an issue when accessing MD1Patient. The users accessing MD1Physician and MD1Admin are not experiencing any issues (their status is green across the board), which indicates that the issue is not affecting the whole application, only the MD1Patient transaction.

The Trace Analysis health status icon on the MD1Patient Real User tile indicates that there is a problem somewhere in the trace analysis metrics.

Foglight has also generated critical-level alarms for all four MD1Patient transactions. Since all synthetic transactions are affected, this is not a location-specific issue. Synthetic transactions can be used as a benchmark for healthy performance. In this case, the fact that the synthetics are affected indicates that this is not a problem caused by a one-time error from a single user; it is a recurring or ongoing problem affecting all users.
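The reasoning about real user and synthetic tiles can be sketched in a few lines of Python. This is only an illustration of the decision logic, not part of Foglight; the function name `classify_scope` and the location names are hypothetical.

```python
def classify_scope(real_users_ok, synthetic_ok_by_location):
    """Describe how widely a transaction problem reaches.

    real_users_ok: True if the real user tile is healthy.
    synthetic_ok_by_location: maps a location name to True (healthy) or False.
    """
    failing = sorted(loc for loc, ok in synthetic_ok_by_location.items() if not ok)
    if not failing:
        # The synthetic benchmark is clean, so any problem is limited to real users.
        return "healthy" if real_users_ok else "real users only"
    if len(failing) == len(synthetic_ok_by_location):
        # Every location fails: an ongoing problem that is not tied to one location.
        return "all locations"
    return "location-specific: " + ", ".join(failing)


# Hypothetical location names for the synthetic MD1Patient tiles.
scope = classify_scope(False, {"site-a": False, "site-b": False, "site-c": False})
print(scope)
```

When every synthetic location fails together with the real user tile, as in this scenario, the sketch reports that the problem affects all locations.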

Click the title bar of the affected Real Users tile to drill down for more information.

The Real User Performance detail view opens.

Figure 37. Real User Performance detail view

This detail view captures several key metrics in chart format. Consider the following:

Table 3. Detail view key metrics
Metric	Key Insight
Response Time	Beginning at 09:35, users are experiencing a longer response time.
Page Requests	At 09:35, the number of page requests is slightly lower than earlier peak loads. This indicates that the problem is not due to a spike in user activity.
SLA/OLA Attainment	Also at 09:35, the Service Level Agreement (SLA) and Operating Level Agreement (OLA) are still being met, but the percentage is decreasing.
Front End / Back End	The time the transaction spent in the back end (that is, in the supporting architecture) has increased, and the problem appears to have started shortly before users were affected (that is, before 09:35).
Trace Analysis	The overall number of hits has decreased, as shown in the MD1Patient Hits chart. The number of users experiencing delay of more than seven seconds in their transaction execution has increased, as shown in the MD1Patient Hits Over 7 Seconds chart.


	NOTE: Click the title bar of the MD1Patient Hits Over 7 Seconds chart to drill down into the individual sessions recorded by FxV. The FxV details allow you to see the Hit URLs and error messages, and to drill down into individual sessions, where you can step through a session to locate the error point. In this case, drilling down to the sessions shows that this is a content error issue. Some users are seeing error messages in their web browsers. This increases the priority of this issue.

This looks like a serious problem that is impacting many users, and the issue seems to be originating in the back end.
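The comparison of the Response Time and Page Requests charts follows a simple rule: a slowdown that is not accompanied by more page requests is unlikely to be caused by user load. The following minimal sketch illustrates that rule. The function name, the 10% tolerance, and the sample values are assumptions for illustration; they are not read from Foglight.

```python
def describe_slowdown(resp_before, resp_after, req_before, req_after, tolerance=0.10):
    """Compare two periods and say whether a slowdown coincides with higher load.

    tolerance is a hypothetical noise margin: changes below it are ignored.
    """
    slower = resp_after > resp_before * (1 + tolerance)
    busier = req_after > req_before * (1 + tolerance)
    if not slower:
        return "no slowdown"
    if busier:
        return "slowdown with increased load"
    return "slowdown without increased load"


# Hypothetical values: response time in seconds, page requests per interval.
result = describe_slowdown(resp_before=1.2, resp_after=4.8, req_before=950, req_after=900)
print(result)
```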

Close this view by clicking the ‘x’ in the upper right corner.

Next, investigate the application topology to determine where in the supporting architecture the problem originates.

Click the Dependencies tab to view the application topology.


	TIP: Always work from left to right when triaging dependencies.

Figure 38. Dependencies tab

Overall, the infrastructure is in good health — all of the hosts have a green check mark, indicating that their status is normal. However, several of the tiers that contain platform and code components appear to have issues that must be investigated.

The web tier has no issues. Moving to the right, you see that there is a warning on the Application (MedRecApp) tier.


	TIP: Hover the mouse pointer over a host icon to open a popup view of the application components.

Click the application tier title bar to drill down for more details.

The MedRecApp Details view opens. The banner section of the Summary tab displays the service level compliance for the tier. The lower portion of the view contains a tile for each application component, grouped by host.

Figure 39. MedRecApp Details view


	TIP: For more information about the metrics displayed on views and tiles, see APM Tile and View Reference.

Click the title bar of the application component tile to open the Application Details view.

Figure 40. Application Details view

On the Request Types tab, you see that POST/medrec/patient/viewPatient.action is the source of the warning. Select this request.


	TIP: Use the Search box to filter the list of requests. Type medrec/patient to include only those requests with that string in the name.
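As a rough illustration of the filtering described in this tip, the sketch below keeps only the request names that contain the typed string. It assumes a plain, case-sensitive substring match; the matching rules that the Search box actually applies are not described here.

```python
def filter_requests(request_names, text):
    """Keep only the request names that contain the given text (assumed substring match)."""
    return [name for name in request_names if text in name]


# The two request names are taken from this guide.
requests = [
    "POST/medrec/patient/viewPatient.action",
    "POST/physician-web/physician/createRecord.action",
]
matches = filter_requests(requests, "medrec/patient")
print(matches)
```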

The response time for this request has increased and remains high. The Bottleneck Tier Name column lists the database tier as the source of the bottleneck.


	NOTE: There are additional columns that may be of interest that are hidden by default. To show them, click the customizer icon to the right of the Search box and select the columns you want to display.

The Execution Time chart at the bottom of the Application Details view (below the list of request types) indicates that the most time is spent in the database tier. It also shows that the time spent in the database tier began to increase at 09:30, which corresponds to the increase in the time spent in the back end that appeared in the Real User Performance view in Step 3.

This evidence all points to a problem in the database tier, which is where you should look next.
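The idea behind the Bottleneck Tier Name column and the Execution Time chart can be sketched as picking the tier with the largest share of the request's execution time. The function name and the timing values below are hypothetical; Foglight's own calculation may differ.

```python
def bottleneck_tier(time_by_tier_ms):
    """Return the tier with the most execution time and its share of the total."""
    total = sum(time_by_tier_ms.values())
    if total <= 0:
        raise ValueError("no execution time recorded")
    tier = max(time_by_tier_ms, key=time_by_tier_ms.get)
    return tier, time_by_tier_ms[tier] / total


# Hypothetical breakdown of one request's execution time, in milliseconds.
tier, share = bottleneck_tier({"web": 200, "application": 800, "database": 3000})
print(tier, share)
```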

Click the ‘x’ in the upper-right corner to close the Application Details view. Close the MedRecApp Details view as well.

On the Dependencies tab, click the title bar of the MedRecDB database tier to drill down for more details.

The details view for the database tier opens.

Figure 41. The details view for the database tier

There is a critical alarm on the application component tile for the Oracle® database.

The Oracle tile displays the top four session bottlenecks as color-coded bars. The larger the main color bar, the higher the percentage of resources spent on that metric. The thin blue lines at the top of each bar indicate the range for the normal state. The thin green/yellow/orange bar below the main color bar indicates the metric’s trend compared against itself.

In this case, the database has Lock wait issues.
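To make the "top four bottlenecks by percentage" idea concrete, the following sketch ranks resource categories by their share of the total. Only "Lock wait" comes from this scenario; the other category names, the values, and the function name are hypothetical, and the tile's real calculation is not documented here.

```python
def top_bottlenecks(time_by_category, n=4):
    """Return the n largest categories as (name, percent of total) pairs."""
    total = sum(time_by_category.values())
    if total <= 0:
        raise ValueError("no activity recorded")
    ranked = sorted(time_by_category.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(100 * value / total, 1)) for name, value in ranked[:n]]


# Hypothetical session activity, in arbitrary time units.
activity = {"Lock wait": 60, "CPU": 20, "I/O": 10, "Network": 6, "Other": 4}
bars = top_bottlenecks(activity)
print(bars)
```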

Now that you have determined the most probable source, you can engage the database administrator to evaluate the problem.

In this scenario, the database and application administrators should have been aware of the problems before the Application Performance Manager became involved. If the administrators were using Foglight, they would have received alarms generated and sent by Foglight.

Application layer

Figure 42. Service Operations Console dashboard


	TIP: You can add and remove columns to customize the information that appears in this table by clicking the customizer icon to the right of the Search box.

To triage an application:

Review the Service Level Compliance and User columns of the table to locate the service that is experiencing performance degradation.

In this case, the User tier icon indicates an issue for users. The Service Level Compliance icon is green, indicating that the service level compliance policy in Foglight is being met.

Review the Transactions tab to determine the impact on users.

Figure 43. End User Transactions tab


	TIP: You can change the display options to roll-up single synthetic transaction locations into multi-location tiles, as shown above. For more information, see Configuring display options.

Here you see that Real Users are experiencing an issue when accessing MD1Physician, but the problem is only visible in the Trace Analysis. Synthetic transactions are not experiencing this problem.

The Trace Analysis health status icon on the MD1Physician Real User tile indicates that there is a problem originating in a traced request.

The OLA (Operating Level Agreement) and SLA (Service Level Agreement) values have degraded slightly. This suggests there are some outliers that are violating the compliance policy in Foglight.

Click the title bar of the affected Real User tile to drill down for more information.

The Real User Performance view opens.

Figure 44. Real User Performance view

This detail view captures several key metrics in chart format. Consider the following:

Table 4. Detail view key metrics
Metric	Key Insight
Response Time	Starting at 06:30, the response time has increased slightly, but the increase is constant.
Page Requests	The number of page requests increased slightly before the response time increased, and has remained high. This indicates an increased load, not just a single spike in user activity.
SLA/OLA Attainment	The Service Level Agreement (SLA) and Operating Level Agreement (OLA) are still being met. The majority of users are not experiencing any problems.
Front End / Back End	The time spent in both the front end and back end (infrastructure) has increased, with the larger portion of the time being spent in the back end.
Trace Analysis	The MD1Physician Hits Over 7 Seconds chart shows the number of users experiencing a delay of more than seven seconds in their transaction execution has increased and remains high. This indicates that a handful of outlier transactions are severely impacting these users, but other users remain unaffected.


	TIP: Click the MD1Physician Hits Over 7 Seconds chart to drill down to the user sessions in FxV.

The evidence from the metrics indicates that a small number of users are consistently experiencing problems.
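The outlier pattern described here (most hits are fast, a few exceed seven seconds) can be illustrated with a short calculation. This is a sketch under assumptions: the hit durations are invented sample data and the function name is hypothetical. Only the seven-second threshold comes from the chart title.

```python
from statistics import median


def outlier_summary(durations_s, threshold_s=7.0):
    """Summarize how many hits exceed the threshold and what a typical hit looks like."""
    if not durations_s:
        raise ValueError("no hits recorded")
    slow = [d for d in durations_s if d > threshold_s]
    return {
        "median_s": median(durations_s),
        "slow_hits": len(slow),
        "slow_share": len(slow) / len(durations_s),
    }


# Hypothetical hit durations in seconds: four normal hits and one outlier.
summary = outlier_summary([1.0, 1.0, 1.0, 1.0, 9.0])
print(summary)
```

A low median combined with a nonzero count of slow hits matches the situation in this scenario: the typical user is unaffected, while a small group is consistently delayed.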

Close this view by clicking the ‘x’ in the upper right corner.

Next, investigate the application topology to determine where in the supporting architecture the problem originates.

Click the Dependencies tab to view the application topology.


	TIP: Always work from left to right when triaging dependencies.

Figure 45. Dependencies tab

Overall, the infrastructure is in good health — all of the hosts have a green check mark, indicating that their status is normal. However, the tiers that contain platform and code components appear to have issues that must be investigated.

Click the MedRecApp application tier title bar to drill down for more details.


	TIP: If one of the hosts were experiencing an issue, you could click the host icon to drill down into a Host detail view instead of looking at the tier view.

The MedRecApp Details view opens.

Figure 46. MedRecApp Details view

The banner section of the Summary tab displays the service level compliance for the tier. The lower portion of the view contains a tile for each application component, grouped by host.

Here you can see that both nodes in the WebLogic® Server cluster (MedRec1Server1 and MedRec1Server2) are experiencing a problem with requests. The hosts are unaffected.

Click the title bar of the MedRec1Server2 application component tile to open the Application Details view.

Figure 47. Application Details view

On the Request Types tab, you see that POST/physician-web/physician/createRecord.action is the source of the warning. Select this request.


	TIP: Use the Search box to filter the list of requests. Type physician to include only those requests with that string in the name.

The response time for this request has increased and remains high. The Bottleneck Tier Name column lists the MedRec1Cluster tier as the source of the bottleneck.

The Execution Time chart below the list of request types also indicates that the most time is spent in the MedRec1Cluster of the application tier.

This evidence points to a problem in the Java™ code, but requires further investigation.

Now that you have determined the probable source of the issue that is affecting users, you can engage the domain expert to investigate further. The domain expert can use the Custom Applications (Java) dashboards to examine the request and traces in greater detail.

Virtual machine

Figure 48. Service Operations Console dashboard


	TIP: You can add and remove columns to customize the information that appears in this table by clicking the customizer icon to the right of the Search box.

To triage an application:

Review the Service Level Compliance, User, and App columns of the table.

In this case, the User tier icon indicates an issue for users. The Service Level Compliance icon is green, indicating that the service level compliance policy in Foglight is still being met.

The App tier icon indicates there are issues somewhere in the application tier as well.

Review the Transactions tab to determine the impact on users.

Figure 49. End User Transactions tab

Here you can see that problems are occurring for both real users and synthetic transactions.

Real users for two of the three transaction types (MD1Patient and MD1Physician) are affected. The response time is increasing, and the OLA and SLA for these users are not being fully met. The Trace Analysis health icon indicates a problem somewhere in the traced requests.

You can also see that the issue has begun to affect synthetic transactions. There are spikes present in the response time charts, where there had been previously steady lines. This points to outlier sessions experiencing issues.

Click the title bar of the MD1Physician Real User tile to drill down for more information.

The Real User Performance view opens.

Figure 50. Real User Performance view

The following trends are visible:

• Response time spiked sharply, then levelled out, but is increasing again.

• The SLA is still being attained, but the OLA is no longer being fully met.

• There have been sharp changes in the page requests (user activity).

• The longest amount of time is being spent in the back end.

• The number of requests taking over seven seconds (MD1Physician Hits Over 7 Seconds) is increasing.

Close this view by clicking the ‘x’ in the upper right corner.

Next, investigate the application topology to determine where in the supporting architecture the problem originates.

Click the Dependencies tab to view the application topology.

Figure 51. Dependencies tab

One of the hosts in the application tier is in a fatal condition. The rest of the infrastructure is in good health.

Click the title bar of the application tier to drill down for more information.

Figure 52. MedRecApp Details view

The first host in the tier (alvscjw143) and the application server on that host (MedRec1Server1) are both healthy.

The other host (alvscjw145) is in a fatal condition. The CPU Ready percentage on this host is very high, at 55.1%, which indicates a CPU starvation issue. This issue now appears to be impacting the ability of the MedRec1Server2 application server to process its requests effectively.
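The comparison of the two hosts can be sketched as a simple threshold check on CPU Ready. The 10% threshold and the value for alvscjw143 are assumptions made for this illustration; only the 55.1% value for alvscjw145 comes from this scenario, and Foglight's actual alarm thresholds are not stated here.

```python
def starved_hosts(cpu_ready_pct_by_host, threshold_pct=10.0):
    """Return the hosts whose CPU Ready percentage meets or exceeds the threshold."""
    return sorted(
        host for host, pct in cpu_ready_pct_by_host.items() if pct >= threshold_pct
    )


# alvscjw145's value is from this scenario; alvscjw143's value is hypothetical.
flagged = starved_hosts({"alvscjw143": 1.5, "alvscjw145": 55.1})
print(flagged)
```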


	TIP: You could click the title bar of the MedRec1Server2 tile to drill down to the requests. In some cases, the requests can also provide more evidence to support your conclusions about the source of the issue.

Click the title bar of the host tile with the fatal alarm (alvscjw145) to drill down for details.

Figure 53. alvscjw145 Details view

This detail view provides a summary of the host status with graphed data as well as the numerical values for metrics that are also present on the host tile.

Here you can see the steep increase and high level of CPU Ready, and the sawtooth patterns of unstable Memory and CPU use.

Click the ESX tab to investigate the state of the host server and the virtualization layer.

Figure 54. ESX tab

Here you can see the top resource usage on the ESX® host. The CPU Load is at 100% Used, and has been for some time.

The Network I/O and Disk I/O both look healthy, but the Memory usage is also quite high at 87% of total.

The table at the bottom of the view summarizes the most used resources and the VM that is consuming them.

Click the (more...) link for the Top CPU Utilization to compare the CPU usage of the virtual machines on this server.

Figure 55. Top CPU Utilization Details view

The bar chart makes it easy to see the VM (qscracsvrlc) that is currently consuming the most CPU resources. It has also been hogging the CPU resources over the entire period.
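The two observations made from the bar chart (which VM consumes the most CPU now, and which has consumed the most over the period) can be sketched as follows. The sample values, the names `vm-b` and `vm-c`, and the function name are hypothetical; only qscracsvrlc is taken from this scenario.

```python
def top_cpu_consumers(samples_by_vm):
    """Return (VM with the highest latest sample, VM with the highest average).

    samples_by_vm maps a VM name to a non-empty list of CPU utilization
    samples in percent, ordered from oldest to newest.
    """
    latest = max(samples_by_vm, key=lambda vm: samples_by_vm[vm][-1])
    average = max(
        samples_by_vm, key=lambda vm: sum(samples_by_vm[vm]) / len(samples_by_vm[vm])
    )
    return latest, average


# Hypothetical CPU utilization samples for three VMs on the same ESX host.
samples = {"qscracsvrlc": [70, 75, 80], "vm-b": [10, 12, 9], "vm-c": [5, 6, 8]}
current_top, period_top = top_cpu_consumers(samples)
print(current_top, period_top)
```

When both results name the same VM, as they do for this sample data, the VM is a sustained consumer rather than a momentary spike.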

Now that you have determined the most likely source of the problem, you can engage the VM administrator to further investigate and resolve the situation.