Consumers and users of applications expect near-100% availability and reliability so they can work, transact, and collaborate. Application performance monitoring gets plenty of attention, but what about the underlying systems and components supporting the app, in particular the infrastructure it runs on? If any piece of this stack fails, the user experience suffers and, in turn, so does your business.
It's critical to have full-stack observability, including the infrastructure your application runs on. That means granular visibility into, and proactive notifications on, the performance and resource utilization of the hosts and processes running your production applications. Real-time alerts equip your team to make the necessary changes before an end user experiences an issue.
Monitoring host and process metrics is key
Sumo Logic's new Host and Process Metrics capability gives you one place to measure and manage compute, memory, storage, and network resource utilization for hosts and the processes that run on them, so you can:
Ensure the infrastructure is functioning reliably
Provide the best possible end-user experience
Quickly troubleshoot issues to reduce MTTI and MTTR
Optimize infrastructure and maintenance costs
How does it work?
The new Host and Process Metrics app collects host and process metrics from Windows and Linux hosts, whether physical or virtual, running across hybrid environments.
The app uses Telegraf to collect metrics from your hosts. Telegraf is an open-source data collection agent that uses built-in input plugins to fetch metrics from hosts and software applications. The app relies on several input plugins to collect CPU, memory, disk, network, and process metrics, and the Sumo Logic output plugin sends the collected metrics to Sumo Logic.
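To make the idea of a host metric concrete, here is a minimal sketch of how an agent-style collector on Linux can derive CPU utilization from two readings of the aggregate `cpu` line in `/proc/stat`. It is a conceptual illustration only and is not taken from Telegraf or the Sumo Logic app; the function names `parse_cpu_line` and `cpu_utilization_percent` are hypothetical.

```python
def parse_cpu_line(line: str) -> tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    ticks = [int(value) for value in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)  # idle + iowait
    # guest and guest_nice are already counted in user and nice, so stop at steal.
    total = sum(ticks[:8])
    return idle, total


def cpu_utilization_percent(before: str, after: str) -> float:
    """Percentage of non-idle CPU time between two samples of the 'cpu' line."""
    idle_before, total_before = parse_cpu_line(before)
    idle_after, total_after = parse_cpu_line(after)
    total_delta = total_after - total_before
    if total_delta <= 0:
        return 0.0  # no time elapsed, or the counters were reset
    busy_delta = total_delta - (idle_after - idle_before)
    return 100.0 * busy_delta / total_delta


# Two illustrative samples (made-up numbers), taken some seconds apart.
sample_before = "cpu 100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu 200 0 100 1600 100 0 0 0 0 0"
utilization = cpu_utilization_percent(sample_before, sample_after)
```

The counters in `/proc/stat` are cumulative, which is why utilization is computed from the difference between two samples rather than from a single reading.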
For more information on how collection works, see:
Using the Sumo Logic app
Once data collection has been set up, the next step is to analyze it with dashboards and set up alerts to get notified when critical conditions occur.
Alerts for host and process metrics
Pre-packaged alerts notify you proactively when critical conditions occur on your hosts. These alerts are built on Sumo Logic Monitors, which let you define robust, configurable alerting policies for conditions in your application infrastructure that could adversely affect your production applications and customer experience.
Monitors for host and process metrics include preset thresholds for high CPU/memory/filesystem/swap utilization, network errors, unusual network throughput, page faults, and open file descriptors. For a complete list, see Host and Process Metrics Alerts.
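As a rough illustration of threshold-based alerting, the sketch below classifies the latest metric values against warning and critical limits. It is not the Monitors API, and the metric names and threshold values are hypothetical placeholders rather than the app's actual presets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str
    warning: float
    critical: float


# Hypothetical limits for illustration only; not the app's preset values.
THRESHOLDS = [
    Threshold("cpu_utilization_percent", warning=80.0, critical=95.0),
    Threshold("memory_utilization_percent", warning=85.0, critical=95.0),
    Threshold("swap_utilization_percent", warning=50.0, critical=80.0),
]


def evaluate(samples: dict[str, float], thresholds: list[Threshold]) -> dict[str, str]:
    """Classify each monitored metric as normal, warning, critical, or missing_data."""
    statuses = {}
    for threshold in thresholds:
        value = samples.get(threshold.metric)
        if value is None:
            statuses[threshold.metric] = "missing_data"
        elif value >= threshold.critical:
            statuses[threshold.metric] = "critical"
        elif value >= threshold.warning:
            statuses[threshold.metric] = "warning"
        else:
            statuses[threshold.metric] = "normal"
    return statuses


# Made-up latest values for one host; swap is deliberately absent.
latest = {"cpu_utilization_percent": 97.2, "memory_utilization_percent": 86.0}
statuses = evaluate(latest, THRESHOLDS)
```

Treating absent data as its own state is a deliberate choice in this sketch: a host that stops reporting is often as important to flag as one that is overloaded.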
While running applications in production, it's critical to monitor all your hosts across various dimensions. The Host Metrics - Overview dashboard provides an at-a-glance view of key metrics such as CPU, memory, disk, network, and TCP connections across all your hosts. You can use this dashboard to quickly identify hosts with high CPU, disk, or memory utilization and to spot anomalies over time.
For more detailed investigations, we have dashboards for analyzing disk, memory, network, and TCP connections.
You can drill down from the Host Metrics - Overview dashboard to any of the detailed dashboards by using the honeycombs or line charts in the panels.
You can also filter each of the host dashboards by the individual hosts you want to monitor.
Once you've established the overall resource utilization on a host, it's essential to understand what processes are causing spikes.
To do so, the Process Metrics - Overview dashboard shows the top processes by open file descriptors, CPU usage, memory usage, disk read/write operations, and thread count. You can also use this dashboard to identify the longest-running processes and the users that have spawned the most processes.
You can drill down from this dashboard to the Process Metrics - Details dashboard by using the honeycombs or line charts in all the panels.
The Process Metrics - Details dashboard can give you a detailed view of key process-related metrics such as CPU and memory utilization, disk read/write throughput, and major/minor page faults. You can also use this dashboard to:
Determine the number of open file descriptors across processes; if a process exceeds its maximum file descriptor limit, your applications will get errors such as IOException
Identify anomalies in CPU usage, memory usage, major/minor page faults, and reads/writes over time
Troubleshoot memory leaks using the resident set memory trend chart
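To illustrate the file descriptor check in the list above, here is a minimal sketch that compares a process's open descriptor count with its soft limit. The helper names and the 80% cutoff are assumptions made for this example; the live reading at the end relies on Python's `resource` module and `/proc/self/fd`, so it only works on Linux.

```python
import os


def fd_utilization_percent(open_fds: int, soft_limit: int) -> float:
    """Share of the file descriptor limit currently in use, as a percentage."""
    if soft_limit <= 0:
        raise ValueError("soft_limit must be a positive number")
    return 100.0 * open_fds / soft_limit


def is_near_fd_limit(open_fds: int, soft_limit: int, cutoff_percent: float = 80.0) -> bool:
    """True when descriptor usage has reached the (hypothetical) cutoff."""
    return fd_utilization_percent(open_fds, soft_limit) >= cutoff_percent


def current_process_fd_usage() -> tuple[int, int]:
    """Return (open_fds, soft_limit) for this process. Linux only."""
    import resource  # Unix-only module, imported here so the helpers above stay portable

    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft_limit


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    open_fds, soft_limit = current_process_fd_usage()
    if soft_limit > 0:  # skip when the limit is reported as unlimited
        print(f"{open_fds}/{soft_limit} descriptors in use "
              f"({fd_utilization_percent(open_fds, soft_limit):.1f}%)")
```

Alerting on a percentage of the limit rather than an absolute count keeps the same rule usable across hosts configured with different limits.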
For a complete list of Host and Process Metrics dashboards, see Host and Process Metrics Dashboards.
This new Host and Process Metrics app can also be used in conjunction with other Sumo Logic Apps:
The Linux app allows you to view information about events, logins, and the security status of your Linux hosts. The app consists of predefined searches and dashboards that provide visibility into your environment for real-time or historical analysis.
The Windows app provides insight into the operations of your Windows hosts and consists of predefined searches and dashboards that provide visibility into security status, system activity, OS updates, user activity, and application installation activity.
In conclusion, the new Sumo Logic app for host and process metrics helps you comprehensively monitor your critical application infrastructure, identify key service level indicators (SLIs), and reduce MTTI and MTTR, which in turn helps you achieve your service level objectives (SLOs).
Get started now!
To get started, check out the following documentation for the new Host and Process Metrics App.
If you don't yet have a Sumo Logic account, you can sign up for a free trial today.
For more great DevOps/Observability and security-focused reads, check out the Sumo Logic blog.