Definition of Monitor
- A monitor observes the activity of a system.
Monitoring Concepts
- A system contains physical and logical resources with state:
- CPU, RAM, disk, virtual memory, processes, threads, …
- State changes as events:
- CPU/memory instructions, I/O operations, application requests…
- A trace is a log of events:
- each entry in the log typically contains a timestamp and other details of the event (a minimal sketch follows this list).
- The domain is the set of activities (events) observed:
- E.g., resource consumption (CPU, RAM, disk, network), I/O operations, database requests.
- The detail of the information being captured (monitored) varies:
- According to the input rate:
- how many events can the monitor observe per unit of time?
- According to the resolution:
- at what granularities (ms, µs, ns, etc.) can the monitor observe events?
- A monitor imposes overhead, changing the observed activity:
- E.g., the monitor itself consumes system resources (e.g., CPU, RAM, I/O), which may interfere with the system being monitored and distort the observations.
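To make these concepts concrete, here is a minimal sketch of a trace as an ordered log of timestamped events; the event domains and fields are illustrative, not part of any particular monitor.

```python
# A trace as an append-only log of events; each entry carries a timestamp
# plus event-specific details. All names and values are illustrative.
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    timestamp: float                       # when the event was observed (seconds since epoch)
    domain: str                            # e.g., "cpu", "io", "db"
    details: dict[str, Any] = field(default_factory=dict)

trace: list[Event] = []                    # the trace: an ordered log of events

def record_event(domain: str, **details: Any) -> None:
    """Append one observation to the trace (running this code is itself overhead)."""
    trace.append(Event(timestamp=time.time(), domain=domain, details=details))

record_event("cpu", usage_percent=73.5)
record_event("io", operation="read", bytes=4096, device="sda")
```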
Monitor Classification
- What triggers the observation/collection of events?
- Event-driven: observation is triggered every time an event of interest occurs;
- Sampling: observation is triggered only for a subset of events (every 100 events, …);
- This trades accuracy (event-driven observes as many events as possible) for performance (sampling introduces less monitor overhead); see the sketch after this list.
- When are the observations available for users and/or systems?
- On-line: collected events are sent straight away to the user and/or system (e.g., to a file or a database);
- Batch: events are grouped and sent in a batch to the user and/or system;
- This trades real-time observation (on-line) for better performance (batch).
- How is the monitor implemented?
- Hardware monitors are typically more accurate and have better resolution;
- Software-based monitors are usually less expensive, as well as more flexible and broader in the metrics and events they can observe.
- How is the monitor designed distribution-wise?
- Centralized: the monitor service is deployed on a single node (e.g., server);
- Distributed: the service is spread, and even replicated, across several nodes.
- What is the scope of the monitoring, distribution-wise?
- Single node: events collected for a single node;
- Distributed: events collected for a cluster of nodes;
- This may be orthogonal to the distributed or centralized design of the monitor service. Do not confuse the design of the monitor with its distribution scope!
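The trade-offs above can be sketched in a few lines: a hypothetical monitor whose constructor parameters select event-driven vs. sampling observation and on-line vs. batch delivery.

```python
# Hypothetical monitor illustrating two classification axes:
# - sample_every: 1 = event-driven, N > 1 = sampling (observe every Nth event)
# - batch_size:   1 = on-line delivery, N > 1 = batch delivery
from typing import Callable

class SamplingMonitor:
    def __init__(self, sample_every: int, deliver: Callable[[list[dict]], None],
                 batch_size: int = 1):
        self.sample_every = sample_every
        self.batch_size = batch_size
        self.deliver = deliver             # destination: user, file, database, ...
        self._seen = 0
        self._buffer: list[dict] = []

    def on_event(self, event: dict) -> None:
        self._seen += 1
        if self._seen % self.sample_every != 0:
            return                         # sampled out: this event is not observed
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.deliver(self._buffer)     # ship the batch (or the single event)
            self._buffer = []

# Event-driven + on-line: every event is delivered immediately.
online = SamplingMonitor(sample_every=1, deliver=print, batch_size=1)
# Sampling + batch: every 100th event is observed, delivered 10 at a time.
batched = SamplingMonitor(sample_every=100, deliver=print, batch_size=10)
```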
Monitor Architecture
- Observation of the raw events in systems (e.g., resource usage, I/O requests, …);
- Collection and normalization of data (e.g., normalization of time units);
- Analysis of normalized data (e.g., filtering, querying, summarizing);
- Presentation of the analysis results (e.g., dashboards, reports, alarms).
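The four layers above form a pipeline; the rough sketch below, with hypothetical functions and values, shows how data might flow through it.

```python
# Observation -> collection/normalization -> analysis -> presentation.
import time

def observe() -> dict:
    # Observation layer: capture a raw event (values are illustrative).
    return {"ts": time.time(), "metric": "cpu_usage", "value": 0.42}

def collect(raw: dict) -> dict:
    # Collection layer: normalize units/field names (e.g., seconds -> milliseconds).
    return {"timestamp_ms": int(raw["ts"] * 1000), "metric": raw["metric"], "value": raw["value"]}

def analyze(events: list[dict]) -> dict:
    # Analysis layer: summarize the normalized data (here, a simple average).
    values = [e["value"] for e in events]
    return {"metric": events[0]["metric"], "avg": sum(values) / len(values)}

def present(summary: dict) -> None:
    # Presentation layer: report the result (a dashboard, report, or alarm in practice).
    print(f"{summary['metric']}: avg={summary['avg']:.2f}")

present(analyze([collect(observe()) for _ in range(5)]))
```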
Observation Layer
- Events of interest can be observed through different techniques;
- Passive observation or spying: no need to instrument the resources and/or applications being monitored. The monitor passively observes the system;
- Instrumentation: the hardware, operating system, and/or applications are instrumented to expose relevant information (see the sketch after this list);
- Probing: the monitor interacts with (probes) the system/application being observed to collect metrics.
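As a small illustration of instrumentation, the sketch below wraps an application function so that every call emits a latency event; all names are hypothetical.

```python
# Instrumentation: the application code is modified (here, wrapped by a decorator)
# so that each call is observed by the monitor.
import functools
import time

def instrumented(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            print({"event": "call", "name": func.__name__, "latency_ms": elapsed_ms})
    return wrapper

@instrumented
def handle_request(payload: str) -> str:
    time.sleep(0.01)          # stand-in for real application work
    return payload.upper()

handle_request("hello")
```

A probe, by contrast, would issue its own synthetic requests (e.g., a periodic health check) rather than piggybacking on real application calls.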
Collection Layer
- Observed events can be collected and normalized through two approaches (sketched after this list):
- Push-based: events are sent by the observation layer to the collection one;
- Pull-based: events are pulled by the collection layer from the observation one.
- Besides collecting data, the layer is used to normalize it (e.g., normalize time intervals, aggregate data, …);
- In some cases, the layer may provide persistence (temporary storage) for collected events.
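A compact sketch of the two collection approaches, with hypothetical classes and fields:

```python
# Push-based: the observation layer calls Collector.push() whenever it has an event.
# Pull-based: the collector calls ObservationSource.drain() to fetch buffered events.

class Collector:
    def __init__(self):
        self._events: list[dict] = []

    def push(self, event: dict) -> None:                  # push model
        self._events.append(self._normalize(event))

    def pull(self, source: "ObservationSource") -> None:  # pull model
        for event in source.drain():
            self._events.append(self._normalize(event))

    @staticmethod
    def _normalize(event: dict) -> dict:
        # e.g., make sure every event carries a timestamp in milliseconds
        normalized = dict(event)
        if "ts_s" in normalized:
            normalized["ts_ms"] = int(normalized.pop("ts_s") * 1000)
        return normalized

class ObservationSource:
    """Observation layer keeping a buffer the collector can drain."""
    def __init__(self):
        self._buffer: list[dict] = []
    def record(self, event: dict) -> None:
        self._buffer.append(event)
    def drain(self) -> list[dict]:
        events, self._buffer = self._buffer, []
        return events

collector = Collector()
collector.push({"ts_s": 12.5, "metric": "cpu", "value": 0.7})   # pushed by the observer
source = ObservationSource()
source.record({"ts_s": 13.0, "metric": "ram", "value": 0.4})
collector.pull(source)                                          # pulled by the collector
```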
Analysis Layer
- The analysis layer has two main goals:
- Efficient and reliable data storage and indexing:
- if a comprehensive set of events is collected, the analysis component may have to persist and index large volumes of data.
- Efficient data processing:
- stored and indexed data must be queryable, and sometimes analyzed in streams (e.g., for time-series analysis);
- The choice of technology for this layer depends on the type of data being analyzed and the analysis requirements (e.g., time-series database, document-oriented database, …). A toy in-memory example follows.
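The toy example below indexes events by metric name so that time-range queries and simple summaries stay cheap; a real deployment would rely on a time-series or document-oriented database instead of in-memory structures.

```python
from bisect import insort
from collections import defaultdict

class TimeSeriesStore:
    def __init__(self):
        # metric name -> samples kept sorted by timestamp (the "index")
        self._series: dict[str, list[tuple[float, float]]] = defaultdict(list)

    def insert(self, metric: str, timestamp: float, value: float) -> None:
        insort(self._series[metric], (timestamp, value))

    def query(self, metric: str, start: float, end: float) -> list[tuple[float, float]]:
        return [(t, v) for t, v in self._series[metric] if start <= t <= end]

    def average(self, metric: str, start: float, end: float) -> float:
        samples = self.query(metric, start, end)
        return sum(v for _, v in samples) / len(samples) if samples else float("nan")

store = TimeSeriesStore()
store.insert("cpu_usage", 100.0, 0.30)
store.insert("cpu_usage", 160.0, 0.90)
print(store.average("cpu_usage", 0.0, 200.0))   # 0.6
```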
Presentation Layer
- The presentation layer can be used for visualizing:
- Performance metrics of applications and/or systems (e.g., request throughput and latency, CPU usage, RAM consumption, …);
- The configuration of cluster resources and the applications deployed on them;
- Errors in cluster resources and/or applications.
- Users access this layer through different visual representations (e.g., dashboards, reports, …).
Elastic Stack
- Beats: lightweight data shippers (Observation and Collection):
- Purpose: to observe, collect and send data to Logstash or ElasticSearch for further processing and indexing.
- Logstash: a data ingestion tool (Collection):
- Purpose: to collect, transform, and send data from the various sources to ElasticSearch.
- ElasticSearch: a distributed, JSON-based search and analytics engine (Analysis):
- Purpose: to index, search, and analyze data.
- Kibana: a data visualization and exploration tool (Presentation):
- Purpose: to visualize and manage data from ElasticSearch.
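As a minimal end-to-end illustration, the sketch below indexes one event directly via ElasticSearch's REST API; it assumes an instance reachable at localhost:9200 with security disabled (recent versions enable authentication by default), and the index name and fields are illustrative. In a full deployment, a Beat (e.g., Metricbeat) would ship such events automatically, Logstash would transform them, and Kibana would visualize the resulting indices.

```python
import json
import urllib.request

event = {"@timestamp": "2024-01-01T12:00:00Z", "metric": "cpu_usage", "value": 0.73}

request = urllib.request.Request(
    url="http://localhost:9200/metrics/_doc",        # index one document into "metrics"
    data=json.dumps(event).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))                # indexing acknowledgement
```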
Monitoring and Actuation
- Monitoring can be combined with actuation to automatically apply actions to the system;
- The actions to apply are specified as policies (e.g., when a server’s CPU usage is above a given threshold, migrate some VMs to another server);
- The orchestrator is responsible for defining the best strategy for applying policies (e.g., choosing which VM(s) to migrate and to which server(s));
- The actuator is responsible for applying the strategy defined by the orchestrator (e.g., migrating the selected VM(s) to another server).
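A toy sketch of this loop, with hypothetical thresholds, server data, and actions (a real actuator would call the hypervisor or cloud API):

```python
CPU_THRESHOLD = 0.85   # policy: act when a server's CPU usage exceeds 85%

def policy_triggered(server: dict) -> bool:
    return server["cpu_usage"] > CPU_THRESHOLD

def orchestrate(server: dict, cluster: list[dict]) -> tuple[str, str]:
    """Strategy: move the heaviest VM to the least loaded other server."""
    vm = max(server["vms"], key=lambda v: v["cpu_share"])
    target = min((s for s in cluster if s is not server), key=lambda s: s["cpu_usage"])
    return vm["name"], target["name"]

def actuate(vm_name: str, target_name: str) -> None:
    print(f"migrating {vm_name} -> {target_name}")    # stand-in for a real migration call

cluster = [
    {"name": "srv-1", "cpu_usage": 0.92, "vms": [{"name": "vm-a", "cpu_share": 0.5},
                                                 {"name": "vm-b", "cpu_share": 0.3}]},
    {"name": "srv-2", "cpu_usage": 0.40, "vms": []},
]
for server in cluster:
    if policy_triggered(server):                  # monitoring detects the condition
        actuate(*orchestrate(server, cluster))    # orchestrator decides, actuator acts
```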