Monitoring Solutions - Azure Native
Monitoring Introduction
Context & Problem
Terminology
Used definition: A monitoring solution helps the monitoring consumer achieve a satisfactory level of control of a defined service. (Link to source)
This definition already includes the following:
- Defined service: The resources you want to monitor, aka monitored resources. The resources to be monitored can be split into infrastructure and the applications on top.
- Level of control: The bandwidth in which your defined service operates normally, aka the baseline.
- Measuring: A measurement is a single act that quantifies an attribute of a part, equipment, service or process (CPU load, available memory etc.). The measured data is emitted by the monitored resources and is aka telemetry.
- Monitoring consumer: The user trying to keep the service within its baseline boundaries.
To simplify operations for the consumer, a single control plane, aka the monitoring plane, is usually preferred. The relevant content depends on the perspective of the consumer, such as performance, costs, compliance and health. Performance in this pattern includes the following as described here:
- Health monitoring: The purpose of health monitoring is to generate a snapshot of the current health of the system so that you can verify that all components of the system are functioning as expected.
- Error monitoring: Bugs and errors need to be detected by monitoring. Supporting information must be provided that allows the monitoring consumer to analyze the root cause.
- Availability monitoring: A truly healthy system requires that the components and subsystems that compose the system are available. Availability monitoring is closely related to health monitoring. But whereas health monitoring provides an immediate view of the current health of the system, availability monitoring is concerned with tracking the availability of the system and its components to generate statistics about the uptime of the system.
- Performance monitoring: As the system is placed under more and more stress (by increasing the volume of users), the size of the datasets that these users access grows and the failure of one or more components becomes more likely. Frequently, component failure is preceded by a decrease in performance. If you are able to detect such a decrease, you can take proactive steps to remedy the situation.
- SLA monitoring: SLA monitoring is closely related to performance monitoring. But whereas performance monitoring is concerned with ensuring that the system functions optimally, SLA monitoring is governed by a contractual obligation that defines what optimally actually means. You can calculate the percentage availability of a service over a period of time by using the following formula:
  %Availability = ((Total Time – Total Downtime) / Total Time) * 100
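The SLA formula above can be sketched as a small Python helper (an illustrative sketch; the function and variable names are assumptions, not part of any monitoring product):

```python
def availability_percent(total_time_hours: float, total_downtime_hours: float) -> float:
    """%Availability = ((Total Time - Total Downtime) / Total Time) * 100."""
    if total_time_hours <= 0:
        raise ValueError("total time must be positive")
    return (total_time_hours - total_downtime_hours) / total_time_hours * 100

# A 30-day month (720 h) with roughly 36 minutes (0.6 h) of downtime:
print(round(availability_percent(720, 0.6), 3))  # 99.917
```

Note that the contract must also define what counts as downtime, e.g. whether planned maintenance windows are excluded from Total Downtime.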
Providing a control plane requires a monitoring pipeline that should be implemented as a feedback loop. The pipeline transforms raw telemetry into meaningful information that the monitoring consumer can use to determine the state of the system. The loop ensures that lessons learnt are the starting point for further improvements on the defined service side, e.g. by adaptive scaling depending on monitored traffic. The entire monitoring must be compliant and provide integration features. The conceptual stages of the pipeline are as follows:
- Data Sources/ Instrumentation (Monitored resources): concerned with identifying the sources from which the telemetry needs to be captured, determining which data to capture and how to capture it.
- Collection/ Storage (Monitoring plane)
- Analysis/ Diagnosis (Monitoring plane): generates meaningful information that a monitoring consumer can use to determine the state of the system.
- Visualization/ Alerting (Monitoring plane): decisions about possible actions to take; the results are then fed back into the instrumentation and collection stages.
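One iteration of the four stages and their feedback loop can be sketched as plain functions (a minimal illustration only; all names and the CPU baseline are assumptions, not an Azure API):

```python
def collect(resources):
    # Data sources/instrumentation: each monitored resource emits raw telemetry.
    return [{"resource": r["name"], "cpu": r["cpu"]} for r in resources]

def analyze(telemetry, baseline=80):
    # Analysis/diagnosis: keep only samples outside the baseline bandwidth.
    return [t for t in telemetry if t["cpu"] > baseline]

def alert(findings):
    # Visualization/alerting: decide on actions for out-of-baseline samples.
    return [f"scale out {f['resource']}" for f in findings]

# One loop iteration; the resulting actions would feed back into the defined service.
resources = [{"name": "web-1", "cpu": 95}, {"name": "web-2", "cpu": 40}]
actions = alert(analyze(collect(resources)))
print(actions)  # ['scale out web-1']
```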
The picture below summarizes the aspects:
Standard Problems
The list below describes the standard problems that apply independently of the monitoring consumer's perspective. Solutions with concrete technology are described in the subsequent chapters. Per monitoring pipeline stage the following standard problems are known:
- Data Sources/ Instrumentation (Monitored Resources)
  This also includes the possibility of preprocessing to reduce or enrich the telemetry sent to the monitoring consumer. Telemetry itself might be of different structure and convey different information.
- Collection/ Storage (Monitoring Plane)
  The drop location of the telemetry needs to be determined, such as inside the monitoring plane or externally.
  Monitoring can result in a large amount of data, and storing such granular data is costly. Therefore an archiving mechanism is required to keep costs from exploding. Once archived, the ingested telemetry should be removed.
- Analysis/ Diagnosis (Monitoring Plane)
  Includes standard problems like:
  - Filtering
  - Aggregation
  - Correlation
  - Reformatting
  - Comparison against Key Performance Indicators (=KPIs). KPIs have no weight in software development unless they are paired with your business goals. You do not need a handful of KPI metrics for your software team; all you need is the right KPI to help you improve your product or process. KPIs should be SMART (S = Specific; M = Measurable; A = Assignable; R = Realistic; T = Time Bound). Examples: code quality KPIs such as maintainability index, complexity metrics, depth of inheritance, class coupling, lines of code; testing quality KPIs such as test effort and test coverage; availability KPIs such as mean time between failures and mean time to recovery/ repair as described here.
- Visualization/ Alerting (Monitoring Plane)
  Includes standard problems like:
  - Visualization for the monitoring consumer
  - Alerts: programmatic actions that free the monitoring consumer from manual intervention. An alert states the trigger and the action to be executed. One challenging aspect is to minimize the number of alerts and to detect patterns behind multiple alerts. Inferring a suitable threshold can be challenging, especially if the threshold is not static.
  - Reports
  - Ad-hoc queries
  - Exploration
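The "trigger plus action" shape of an alert can be modeled in a few lines (illustrative names only; the static CPU threshold stands in for the harder, non-static case mentioned above):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Alert:
    name: str
    trigger: Callable[[float], bool]  # condition on a measured value
    action: Callable[[], str]         # programmatic action, no manual intervention

cpu_alert = Alert(
    name="high-cpu",
    trigger=lambda cpu: cpu > 90,     # static threshold; dynamic ones need learning
    action=lambda: "notify on-call team",
)

measured = 97
if cpu_alert.trigger(measured):
    print(cpu_alert.action())  # notify on-call team
```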
- Improving Feedback Loop (Plane/ Resources)
  Cases where the monitored resources operated outside their baseline should be the starting point for improvements. This might mean better tuning of alerts and interventions, or changed system requirements.
  Integration and compliance affect the entire pipeline. Telemetry might have to be collected from other systems to achieve a single monitoring plane, and alerts/ notifications might have to be forwarded to other systems. Of course, monitoring must comply with the enterprise guidelines.
The following patterns are not discussed here:
- Provisioning of the monitoring plane and the monitored resources
For solutions with a concrete technology see the specific guides on platform and concrete service level.
Monitoring Platforms
Azure
Overview
This chapter lists major features/ concrete services for monitoring on the Azure platform. This architecture pattern builds on the general problem description for monitoring. The picture below summarizes major services and concepts that are discussed in detail in the next chapter.
Monitoring Pipeline
Major features per stage of the monitoring pipeline are as follows:
- Data Sources/ Instrumentation
  Telemetry in Azure is split into logs and metrics. Logs contain non-structured text entries, whereas a metric is a value measured at a certain time. Dimensions are additional characteristics of the measured metric.
  The major logs/ metrics fall into one of the following categories: (1) activity logs, (2) resource logs (formerly diagnostic logs) and (3) Azure Active Directory (=AAD) related logs. Activity logs track actions on Azure Resource Manager level such as creation, update or deletion of Azure resources. Resource logs track operations within a resource, such as reading secrets from a key vault after it has been created.
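The distinction between a metric, its timestamp and its dimensions can be illustrated with a small data structure (hypothetical names for illustration, not the Azure SDK):

```python
from dataclasses import dataclass, field
import datetime

@dataclass
class Metric:
    name: str                     # e.g. "Percentage CPU"
    value: float                  # the value measured at a certain time
    timestamp: datetime.datetime
    dimensions: dict = field(default_factory=dict)  # additional characteristics

m = Metric("Percentage CPU", 73.5,
           datetime.datetime(2021, 8, 24, 12, 0),
           dimensions={"instance": "vm-01", "region": "westeurope"})
print(m.dimensions["instance"])  # vm-01
```

A log entry, by contrast, would carry a free-text message instead of a single numeric value.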
- Monitoring Plane
  The services used for processing depend on the perspective. A major stop for unified end-to-end monitoring is Azure Monitor. It unifies the formerly separate services Application Insights and Log Analytics as features. Application Insights focuses on application monitoring, whereas Log Analytics started as part of the Operations Management Suite targeting infrastructure monitoring. Both come with their own repository for storing the telemetry. In the future a Log Analytics workspace will be the central place for collecting data from the infrastructure and application perspectives.
  Telemetry can either be (1) forwarded (=pushed) to the monitoring plane or (2) pulled from the monitoring plane. Pushing can be necessary if the telemetry is not available in Azure Monitor out of the box or pulling from the monitored resources is not possible. Monitored resources have to be instrumented to forward telemetry for later processing within the monitoring plane. App Insights requires linking via instrumentation keys. Log Analytics workspaces require diagnostic settings; the possible targets are only a Log Analytics workspace, an event hub or Azure blob storage. The telemetry that can be forwarded is predefined, and fine-granular selection of metrics/ logs is not always possible. Pulling reads telemetry such as metrics directly from the monitored resource. Logs cannot be read directly and require pushing. Compared to pushing, pulling is also faster.
  Both features cover the health and performance perspectives. Cost management is covered by Azure Cost Management. The major services for monitoring compliance are Azure Security Center and Azure Sentinel (which has a larger enterprise scope than Azure Security Center, with SIEM and SOAR capabilities).
  Azure Monitor provides various options for visualization, but other services are possible as well. Dashboard-like features provide a single pane of control across a number of resources. Kusto is the major language for analyzing logs and metrics, e.g. as part of a root cause analysis. Additional features of App Insights/ Log Analytics complement the language.
  Alert thresholds can be dynamic, and actions can be grouped in action groups for reuse across multiple alerts. Dynamic thresholds continuously learn the data of the metric series and try to model it using a set of algorithms and methods as described here. Alerts can be grouped dynamically to reduce noise and filtered/ scoped to reduce false alarms.
  Various options for archiving exist in Azure, such as Logic Apps. A cheap archive is usually Azure blob storage. Policies can be used to automatically delete archived blobs. Removal of ingested telemetry is configurable by setting the retention period accordingly in Log Analytics/ App Insights.
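The idea behind such a dynamic threshold can be approximated with a rolling mean plus a tolerance band (a simplified sketch under the assumption of roughly stationary data; Azure's actual algorithms are more elaborate and not reproduced here):

```python
import statistics

def dynamic_threshold(history, sensitivity=2.0):
    """Upper alert bound learned from recent samples: mean + k * stdev."""
    return statistics.mean(history) + sensitivity * statistics.stdev(history)

# Recent CPU samples hovering around 50 percent:
history = [50, 52, 49, 51, 50, 53, 48]
bound = dynamic_threshold(history)

# A spike to 95 lies far above the learned bound and would trigger an alert.
print(95 > bound)  # True
```

Lowering the sensitivity factor tightens the band and produces more alerts; raising it reduces noise at the risk of missing real incidents.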
- Improving Feedback Loop (Plane/ Resources)
  The platform allows tracking end-user behavior and engagement. Impact Analysis helps to prioritize which areas to focus on to improve the most important KPIs as described here. Autoscaling is provided by Azure Monitor and other Azure services directly.
  Azure Monitor can integrate with and forward telemetry from various sources. Some services like Azure Security Center forward telemetry to Azure Monitor. IT service management tools such as ServiceNow or System Center Service Manager can integrate with the Azure monitoring tools. Azure provides its standard compliance mechanisms also for monitoring, ensuring authentication/ authorization (via Azure Active Directory) and compliance for data at rest and in transit.
Monitoring Solutions - Azure Native
Application Monitoring
Overview
The solution is to use Azure Monitor with App Insights, Log Analytics and the following platform features regarding the monitored resources. The focus of this chapter is to introduce the relevant features. Recommendations for a concrete setup are given in the next chapter.
The relevant Azure monitor features are as follows:
- Collection/ Storage (Monitoring Plane)
  Telemetry can either be stored internally inside the monitoring plane by App Insights/ Log Analytics or externally.
  Telemetry can be pulled from the monitoring plane; this is limited to metrics but faster than pushing. Pushing can be necessary if the telemetry is not available in Azure Monitor out of the box or pulling from the monitored resources is not possible. The major mechanisms to push telemetry to the monitoring plane are:
  - Diagnostic settings
  - App Insights instrumentation/ linking: The linked App Insights instance must be specified for the monitored resource. Some Azure services such as Azure App Service come with a built-in App Insights integration. Other services, such as API Management, only provide diagnostic settings instead.
  - Manual forwarding: e.g. by a scheduled process using the APIs provided by Azure Monitor for App Insights and Log Analytics. A lightweight Azure service for polling the services to be monitored is Azure Automation, which allows hosting and running scripts. App Insights/ Log Analytics also provide APIs to forward data manually. However, these APIs have some constraints:
    - The timestamp cannot be set freely (both)
    - Deleting something is not possible (both)
    - A saved query cannot be updated (App Insights only)
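As an illustration of manual forwarding, pushing a custom record into a Log Analytics workspace via the HTTP Data Collector API requires signing each request with the workspace key. The sketch below builds such a signature; the workspace id and key are placeholders, and the actual HTTP POST is only indicated in a comment:

```python
import base64
import datetime
import hashlib
import hmac

# Placeholder credentials -- replace with your workspace id and primary key.
WORKSPACE_ID = "00000000-0000-0000-0000-000000000000"
SHARED_KEY = base64.b64encode(b"dummy-key").decode()

def build_signature(workspace_id, shared_key, date, content_length):
    # String-to-sign as defined by the HTTP Data Collector API.
    string_to_sign = (f"POST\n{content_length}\napplication/json\n"
                      f"x-ms-date:{date}\n/api/logs")
    digest = hmac.new(base64.b64decode(shared_key),
                      string_to_sign.encode("utf-8"),
                      hashlib.sha256).digest()
    return f"SharedKey {workspace_id}:{base64.b64encode(digest).decode()}"

body = '[{"Computer": "vm-01", "CpuPercent": 95}]'
rfc1123_date = datetime.datetime.utcnow().strftime("%a, %d %b %Y %H:%M:%S GMT")
auth = build_signature(WORKSPACE_ID, SHARED_KEY, rfc1123_date, len(body))

# An HTTP POST to
#   https://{WORKSPACE_ID}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01
# with the headers Authorization (auth), Log-Type, x-ms-date and this body would
# ingest the record as a custom log table; note the timestamp constraint above.
print(auth.startswith(f"SharedKey {WORKSPACE_ID}:"))  # True
```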
- Analysis/ Diagnosis (Monitoring Plane)
  Azure Monitor comes with no built-in support for KPIs such as code quality, test coverage or availability/ maintenance. However, standard KPIs such as mean time between failures (=MTBF) can be programmed with Kusto queries. Azure Application Insights can send web requests to your application at regular intervals from points around the world and alert you if your application is not responding, or if it responds too slowly, as described here.
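As an illustration of such a KPI, MTBF is simply the mean gap between consecutive failure timestamps; a Kusto query would aggregate a failures table the same way. The Python sketch below uses hypothetical failure data:

```python
from datetime import datetime

# Hypothetical failure events extracted from the logs.
failures = [datetime(2021, 8, 1, 3, 0),
            datetime(2021, 8, 11, 3, 0),
            datetime(2021, 8, 26, 3, 0)]

def mtbf_hours(failure_times):
    """Mean time between failures, in hours."""
    gaps = [(b - a).total_seconds() / 3600
            for a, b in zip(failure_times, failure_times[1:])]
    return sum(gaps) / len(gaps)

print(mtbf_hours(failures))  # 300.0 (gaps of 10 and 15 days)
```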
  Kusto queries across multiple Application Insights instances or Log Analytics workspaces are possible. The instance must then be referenced with an additional identifier (App Insights: app('<identifier>'); Log Analytics: workspace('<identifier>')) as shown in the samples below. Various options for identifiers exist, such as name and GUID, as described here:

  // Cross-resource App Insights example
  union app('mmsportal-prod').requests,
        app('AI-Prototype/Fabrikam/fabrikamprod').requests,
        requests
  | summarize count() by bin(timestamp, 1h)

  // Cross-resource Log Analytics example
  union Update,
        workspace("b438b4f6-912a-46d5-9cb1-b44069212ab4").Update
  | where TimeGenerated >= ago(1h)
  | where UpdateState == "Needed"
  | summarize dcount(Computer) by Classification
  Azure Data Explorer is a service for large-scale analysis of telemetry. "Large" refers to a large amount of data or a high frequency of time series data as described here.
- Visualization/ Alerting (Monitoring Plane)
  Natively, Azure Monitor provides (1) Azure dashboards and (2) Azure workbooks as dashboarding options.
  Alerts come with the following features:
  - Trigger: Results from Kusto queries can be used as triggers.
  - Action groups: The same action (=action group) can be assigned to different triggers.
  - Smart groups (preview as of 24.08.2021): Group alerts that are triggered simultaneously by using artificial intelligence as described here.
  - Action rules (preview as of 24.08.2021): Allow suppressing (e.g. due to maintenance), scoping and filtering alerts as described here.
  - Reporting: Existing reports for SLA/ outages by using predefined Azure Monitor workbooks from the gallery as described here.
  Application Insights comes with the following tools for exploration and root cause analysis:
  - Application Map ⇒ shows application dependencies on other services such as backend APIs or databases
  - Smart Detection ⇒ warns you about anomalies in performance or utilization patterns
  - Usage Analysis ⇒ shows which features of your application are most frequently used
  - Release annotations ⇒ visual indicators in your Application Insights charts of new builds and other events, making it possible to correlate changes in application performance to code releases
  - Cross-component transaction diagnostics ⇒ the unified diagnostics experience automatically correlates server-side telemetry from across all your Application Insights monitored components into a single view. It does not matter if you have multiple resources with separate instrumentation keys. Application Insights detects the underlying relationship and allows you to easily diagnose the application component, dependency, or exception that caused a transaction slowdown or failure as described here.
  - Snapshot Debugger ⇒ collects a snapshot of a live application in case of an exception, to analyze it at a later stage
  - Correlation ⇒ special fields are provided to convey global identifiers appearing in every request as described here
  Azure Monitor also has extensive integration features. This includes:
  - Integrating telemetry from other Azure services (e.g. Azure Security Center also forwards to Azure Monitor)
  - Integrating external data sources (e.g. blobs by using the Kusto externaldata operator)
  - Integrating third-party tools such as Prometheus for Azure Kubernetes Service
  - Exposing telemetry as data sources for external third-party tools (e.g. Log Analytics workspaces for Grafana) as described here
The following picture summarizes Azure services/ features that might be relevant:
Variations
A detailed configuration is not possible because the setup depends on the resources to be monitored and their capabilities. Therefore only guidelines are given to infer the right setup:
- Collection/ Storage (Monitoring Plane)
  Two main decisions must be made: (1) storage of the telemetry and (2) push versus pull.
  The number of App Insights instances/ Log Analytics workspaces needs to be determined per environment. Production should be kept separate, already for compliance/ resilience reasons. Dev/ test environments are more of a question mark. Subsuming the dev/ test environments into a single monitoring plane is beneficial for the monitoring consumer, who then has to check only a single place. That also means you need an additional mechanism for inferring the environment for later drill-down or root cause analysis; additional custom attributes are recommended if possible. Separate App Insights/ Log Analytics instances per environment require a further instance for a consolidated dev/ test view.
  Microsoft recommends a single App Insights resource in the following cases as described here:
  - For application components that are deployed together, usually developed by a single team and managed by the same set of DevOps/ITOps users.
  - If it makes sense to aggregate Key Performance Indicators (KPIs), such as response durations or failure rates in dashboards etc., across all of them by default (you can choose to segment by role name in the Metrics Explorer experience).
  - If there is no need to manage Azure role-based access control (Azure RBAC) differently between the application components.
  - If you don't need metrics alert criteria that are different between the components.
  - If you do not need to manage continuous exports differently between the components.
  - If you do not need to manage billing/quotas differently between the components.
  - If it is okay for an API key to have the same access to data from all components, and 10 API keys are sufficient for the needs across all of them.
  - If it is okay to have the same smart detection and work item integration settings across all roles.
  Storing telemetry within the monitoring plane is easy to set up if the Azure service supports diagnostic settings or comes with App Insights integration. App Insights instrumentation allows extensive customization such as preprocessing; Log Analytics allows less customization out of the box. Log Analytics can target cheap Azure blob storage, which can be accessed with Kusto and would also eliminate the need for archiving. However, a shared access signature is required in this case, which has to be renewed. Updating a saved query is only possible for a Log Analytics workspace. Due to the simpler setup, storing the telemetry inside the monitoring plane is the recommended option.
  Pull via Metrics Explorer is only possible for metrics, not logs. Pushing via a custom script makes sense if:
  - API restrictions on the monitoring plane are not a problem, e.g. not being able to set the timestamp according to the original occurrence.
  - UI-driven actions that are not pushed automatically need to be tracked.
  - The service targets a Log Analytics workspace, but filtering/ aggregation beyond the built-in capabilities is needed before ingestion into the workspace.
  The table below compares the various options:

  |                                 | Diagnostic Settings | App Insights Logging | Push via resource API | Metrics Explorer |
  | Possible per resource           | (X)                 | (X)                  | X                     | (X)              |
  | Telemetry customization         | Limited             | High                 | Limited-High          | Limited          |
  | Custom logging in executed code |                     | X                    |                       |                  |
  | Telemetry always captured       | X                   | (X)                  | X                     | X                |
  | Latency                         | Medium              | Medium               | Medium                | Low              |
  | Direction                       | Push                | Push                 | Push                  | Pull             |
  Comments:
  - Option "Push via resource API" ⇒ a scheduled script that periodically reads telemetry and pushes it to the monitoring plane using the REST API
  - "Telemetry always captured" ⇒ some resources allow multiple ways to run something, e.g. via UI or programmatically. If the telemetry is always captured, the way does not matter.
- Visualization/ Alerting (Monitoring Plane)
  Various options exist inside Azure and in external tools such as Grafana. If you are using Grafana you have to (1) find a hosting option and (2) connect Grafana with Azure.
  The basic options are either (1) using Grafana Cloud or (2) hosting Grafana in Azure. The hosting options within Azure can be further divided into configurations where a single VM with Grafana preinstalled is enough, and more sophisticated high-availability configurations with additional redundancy on node/ VM level. Hosting options with additional redundancy include:
  For connecting Grafana with a data source in Azure the options below exist. Grafana cannot directly connect to Azure services; therefore it is required to collect Azure telemetry in Azure places such as Azure Monitor/ Data Explorer first:
  Grafana and Azure Monitor both provide visualization and alerts. The following recommendations are intended to help you choose:
  - Service management tool integration (ITSM): Both can be integrated. See here how to integrate Grafana events into the ITSM tool ServiceNow. For Azure Monitor, connectors exist that depend on the ServiceNow version and are partially in preview.
  - Azure Monitor telemetry is also available via other means (portal etc.)
  - Cloud agnostic: Using Grafana opens a cloud-agnostic way and could also be used for other clouds
  - Including critical features such as alerting in Grafana might require Grafana hosting with additional redundancy
  - Grafana can be integrated with Azure AD or LDAP for authentication as described here
  Grafana is moving to the new version 8.0, which is in public preview (as of 4.11.2021). Machine learning mechanisms, e.g. for dynamic thresholds, are only in place for Grafana Cloud users, which is possible in Azure. A final decision depends on the priorities, e.g. cloud agnosticism/ drill-down vs. dynamic thresholds. The table below summarizes the dashboarding/ visualization options, including the previously introduced Azure dashboards:
  |                                | Workbooks (Azure) | Dashboards (Azure) | Power BI (Azure) | Grafana (third party) |
  | Auto refresh in 5 min interval |                   | X                  | X                | X                     |
  | Full screen                    |                   | X                  | X                | X                     |
  | Tabs                           | X                 |                    |                  |                       |
  | Fixed parameter lists          | X                 |                    |                  | X                     |
  | Drill down                     | X                 |                    |                  | X                     |
  | Additional hosting required    |                   |                    |                  | X                     |
  | Terraform support              | X                 | X                  |                  | X                     |
  Regarding components for logs/ metrics:
  - Metrics: Pull (Metrics Explorer) or push (Kusto query targeting the data source) possible
  - Logs: Push to the monitoring plane only
When to use
This solution assumes that your application monitoring plane is in Azure and that your monitored resources are located in Azure.
Infrastructure Monitoring
Overview
The solution is to use Azure Monitor with Log Analytics and the following platform features regarding the monitored resources. The focus of this chapter is to introduce the available features. Recommendations for a concrete setup are given in the next chapter.
The relevant Azure monitor features are as follows:
- Data Sources/ Instrumentation
  A major source for infrastructure is the health information provided by the platform. The following health information is relevant:
  - Service Health information, which also includes planned downtime of the Azure platform and problems on service type level, such as VMs
  - Resource Health, which includes health information for the service instances you created
  On resource level, resource utilization is relevant. This includes:
  - Hitting capacity limits regarding CPU/ memory
  - Idle resources
  Availability information differs per service; it is usually exposed via metrics.
- Collection/ Storage (Monitoring Plane)
  Telemetry can either be stored internally inside the monitoring plane or externally.
  Telemetry can be pulled from the monitoring plane; this is limited to metrics but faster than pushing. Pushing can be necessary if the telemetry is not available in Azure Monitor out of the box or pulling from the monitored resources is not possible. Pushing can be done as follows:
  - Resource diagnostics: useful to push resource-specific telemetry.
  - Health diagnostics: Resource Health tracks the health of your resources for specific known issues. With diagnostic settings configured on subscription level you can send that data to a Log Analytics workspace. You will need to send the ResourceHealth/ ServiceHealth categories.
- Analysis/ Diagnosis (Monitoring Plane)
  Health-relevant KPIs can be determined via Kusto as shown in the example below:

  AzureActivity
  // Filter only on resource health data in the activity log
  | where CategoryValue == 'ResourceHealth'
  // Drop any resource health data where the health issue was resolved; we are interested only in unhealthy data
  | where ActivityStatusValue <> "Resolved"
  // Column Properties has nested columns which we parse as JSON
  | extend p = parse_json(Properties)
  // The parsed Properties column is now a dynamic in column p.
  // We take the top-level properties of column p and place them in their own columns that start with prefix Properties_
  | evaluate bag_unpack(p, 'Properties_')
  // We do the same for the newly created column Properties_eventProperties
  | extend ep = parse_json(Properties_eventProperties)
  | evaluate bag_unpack(ep, 'EventProperties_')
  // We list the unique values for column EventProperties_cause
  | distinct EventProperties_cause
Availability of resource utilization specific KPIs depends on the monitored resources.
Kusto queries across multiple application insights or log analytic workspaces are possible (See app monitoring for details).
  Log Analytics comes with the following tools for exploration and root cause analysis:
  - Table-based access allows you to define different permissions per log table. This is done using custom roles where you define the tables as part of the resource type as described here.
  - Additional management solutions: They have to be installed per workspace. An example is the ITSM Connector, used to automatically create incidents or work items in tools such as System Center Service Manager or ServiceNow when alerts are created within Log Analytics.
  - Log Analytics agent management: The agent collects telemetry from Windows and Linux virtual machines in any cloud, on-premises machines, and those monitored by System Center Operations Manager, and sends the collected data to your Log Analytics workspace in Azure Monitor. The Log Analytics agent also supports insights and other services in Azure Monitor such as VM insights, Azure Security Center, and Azure Automation as described here.
  - Service Map automatically discovers application components on Windows and Linux systems and maps the communication between services. Service Map shows connections between servers, processes, inbound and outbound connection latency, and ports across any TCP-connected architecture, with no configuration required other than the installation of an agent as described here.
Visualization/ Alerting (Monitoring Plane)
See Application monitoring features for alerts and visualization.
The following picture summarizes potential Azure services/ features that might be potentially relevant:
Variations
See application monitoring.
When to use
This solution assumes that your infrastructure monitoring plane is in Azure and that your monitored resources are located in Azure.