Definition of Monitor
- A monitor observes the activity of a system.
 

Monitoring Concepts
- A system contains physical and logical resources with state:
  - CPU, RAM, disk, virtual memory, processes, threads, …
- State changes as events:
  - CPU/memory instructions, I/O operations, application requests, …
- A trace is a log of events:
  - each entry in the log typically contains a timestamp and other details of the event (see the sketch after this list).
- The domain is the set of activities (events) observed:
  - E.g., resource consumption (CPU, RAM, disk, network), I/O operations, database requests.
- The detail of the information being captured (monitored) varies:
  - According to the input rate: how many events can the monitor observe per unit of time?
  - According to the resolution: at what granularity (ms, µs, ns, etc.) can the monitor observe events?
- A monitor imposes overhead, changing the observed activity:
  - E.g., the monitor itself consumes system resources (CPU, RAM, I/O), which may interfere with the system being monitored and with the observations.
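
To make the notion of a trace concrete, here is a minimal sketch in Python (the file name and field names are hypothetical) of a monitor appending one entry per observed event, each with a timestamp and event details:

```python
import json
import time

def record_event(trace_file, event_type, details):
    """Append one trace entry: a timestamp plus the details of the observed event."""
    entry = {
        "timestamp": time.time(),  # when the event was observed
        "event": event_type,       # e.g., "io_read", "db_request"
        "details": details,        # event-specific fields
    }
    trace_file.write(json.dumps(entry) + "\n")

# Usage: append two example entries to a newline-delimited JSON trace.
with open("trace.log", "a") as f:
    record_event(f, "io_read", {"device": "sda", "bytes": 4096})
    record_event(f, "db_request", {"query": "SELECT 1", "latency_ms": 3.2})
```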
 
 
Monitor Classification
- What triggers the observation/collection of events?
  - Event-driven: observation is triggered every time an event of interest occurs;
  - Sampling: observation is triggered only for a subset of events (every 100 events, …);
  - One trades accuracy (event-driven observes as many events as possible) for performance (sampling introduces less monitoring overhead); see the sketch after this list.
- When are the observations available to users and/or systems?
  - On-line: collected events are sent straight away to the user and/or system (e.g., to a file or a database);
  - Batch: events are grouped and sent in a batch to the user and/or system;
  - One trades real-time observation (on-line) for better performance (batch).
- How is the monitor implemented?
  - Hardware monitors are typically more accurate and have better resolution;
  - Software monitors are usually less expensive, as well as more flexible in terms of the metrics and events one can observe.
- How is the monitor designed distribution-wise?
  - Centralized: the monitor service is deployed on a single node (e.g., a server);
  - Distributed: the service is spread, and possibly even replicated, across several nodes.
- What is the scope of monitoring distribution-wise?
  - Single node: events are collected for a single node;
  - Distributed: events are collected for a cluster of nodes;
  - This scope may be orthogonal to the centralized or distributed design of the monitor service. Do not confuse the design of the monitor with its distribution scope!
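
As a rough illustration of the accuracy/overhead trade-off between event-driven monitoring and sampling, the sketch below (Python; the class and parameter names are made up for illustration) observes every event in event-driven mode but only every N-th event when sampling:

```python
class Monitor:
    """Minimal sketch: event-driven observation vs. sampling every N-th event."""

    def __init__(self, sample_every=None):
        # sample_every=None -> event-driven (observe every event);
        # sample_every=N    -> sampling (observe only every N-th event).
        self.sample_every = sample_every
        self.seen = 0
        self.observed = []

    def on_event(self, event):
        self.seen += 1
        if self.sample_every is None or self.seen % self.sample_every == 0:
            self.observed.append(event)  # the costly part: recording the observation

# Event-driven: all 1000 events are observed (accurate, more overhead).
event_driven = Monitor()
# Sampling: only 10 of the 1000 events are observed (less accurate, less overhead).
sampling = Monitor(sample_every=100)
for i in range(1000):
    event_driven.on_event(i)
    sampling.on_event(i)
print(len(event_driven.observed), len(sampling.observed))  # prints: 1000 10
```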
 
 
Monitor Architecture
- A monitor is typically organized in four layers (sketched after this list):
  - Observation of the raw events in systems (e.g., resource usage, I/O requests, …);
  - Collection and normalization of data (e.g., normalization of time units);
  - Analysis of the normalized data (e.g., filtering, querying, summarizing);
  - Presentation of the analysis results (e.g., dashboards, reports, alarms).
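
As a rough end-to-end sketch of how the four layers fit together (Python; all function names and the fake CPU values are illustrative, not taken from any particular tool):

```python
import statistics
import time

def observe():
    """Observation: produce raw events (here, fake CPU samples in percent)."""
    return [{"t": time.time() + i, "cpu": 10 + (i % 5) * 7} for i in range(10)]

def collect(raw_events):
    """Collection: normalize raw events (e.g., round timestamps to whole seconds)."""
    return [{"t": int(e["t"]), "cpu": float(e["cpu"])} for e in raw_events]

def analyze(events):
    """Analysis: summarize the normalized data."""
    values = [e["cpu"] for e in events]
    return {"avg_cpu": statistics.mean(values), "max_cpu": max(values)}

def present(summary):
    """Presentation: report the analysis results (here, a plain-text 'dashboard')."""
    print(f"avg CPU: {summary['avg_cpu']:.1f}%  max CPU: {summary['max_cpu']:.1f}%")

present(analyze(collect(observe())))
```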
 

Observation Layer
- Events of interest can be observed through different techniques:
  - Passive observation or spying: no need to instrument the resources and/or applications being monitored; the monitor passively observes the system;
  - Instrumentation: the hardware, operating system, and/or applications are instrumented to expose relevant information;
  - Probing: the monitor interacts with (probes) the system/application being observed to collect metrics (see the sketch after this list).
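
As an example of probing (in contrast to passive observation or instrumentation), the sketch below actively issues a request to a service and records the response status and latency; the health-check URL is hypothetical:

```python
import time
import urllib.request

def probe(url, timeout=2.0):
    """Actively probe a service and return (HTTP status, latency in ms)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        status = response.status
    latency_ms = (time.perf_counter() - start) * 1000
    return status, latency_ms

# Usage (replace the hypothetical endpoint with a real health-check URL):
status, latency_ms = probe("http://localhost:8080/health")
print(f"status={status} latency={latency_ms:.1f} ms")
```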
 
Collection Layer
- Observed events can be collected and normalized through two approaches (see the sketch after this list):
  - Push-based: events are sent by the observation layer to the collection one;
  - Pull-based: events are pulled by the collection layer from the observation one.
- Besides collecting data, this layer is used to normalize it (e.g., normalize time intervals, aggregate data, …);
- In some cases, the layer may provide persistence (temporary storage) for collected events.
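
A minimal sketch (Python; the class and field names are invented for illustration) contrasting push-based and pull-based collection, with a small normalization step (milliseconds to seconds):

```python
class Collector:
    """Collection layer sketch: accepts pushed events or pulls them from observers."""

    def __init__(self):
        self.events = []

    # Push-based: the observation layer calls this to hand over events.
    def push(self, event):
        self.events.append(self.normalize(event))

    # Pull-based: the collection layer asks each observer for its pending events.
    def pull(self, observers):
        for observer in observers:
            for event in observer.drain():
                self.events.append(self.normalize(event))

    @staticmethod
    def normalize(event):
        # e.g., normalize time units: milliseconds -> seconds
        normalized = dict(event)
        normalized["duration_s"] = normalized.pop("duration_ms") / 1000.0
        return normalized

class Observer:
    """Minimal observer that buffers events until they are pulled."""
    def __init__(self, pending):
        self.pending = list(pending)
    def drain(self):
        pending, self.pending = self.pending, []
        return pending

collector = Collector()
collector.push({"name": "req-1", "duration_ms": 120})                # push-based
collector.pull([Observer([{"name": "req-2", "duration_ms": 80}])])   # pull-based
print(collector.events)
```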
 

Analysis Layer
- The analysis layer has two main goals:
  - Efficient and reliable data storage and indexing: if a comprehensive set of events is collected, the analysis component may have to persist and index large volumes of data;
  - Efficient data processing: stored and indexed data must be queryable, and sometimes analyzed as streams (e.g., for time-series analysis); see the sketch after this list.
- The choice of technology for this layer depends on the type of data being analyzed and the analysis requirements (e.g., a time-series database, a document-oriented database, …).
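
To illustrate the kind of processing this layer performs, here is a small sketch (Python standard library only; the event fields and values are made up) that filters stored events and summarizes them into one-minute time-series buckets:

```python
from collections import defaultdict
from statistics import mean

# Normalized events as they might arrive from the collection layer
# (timestamps in seconds, latencies in milliseconds).
events = [
    {"t": 1700000005, "op": "db_request", "latency_ms": 12.0},
    {"t": 1700000042, "op": "db_request", "latency_ms": 30.0},
    {"t": 1700000075, "op": "db_request", "latency_ms": 18.0},
]

# Query/summarize: average latency per one-minute bucket (a simple time-series view).
buckets = defaultdict(list)
for e in events:
    if e["op"] == "db_request":              # filtering
        buckets[e["t"] // 60 * 60].append(e["latency_ms"])

for bucket_start, latencies in sorted(buckets.items()):
    print(bucket_start, f"avg latency: {mean(latencies):.1f} ms")
```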
 
Presentation Layer
- The presentation layer can be used for visualizing:
  - Performance metrics of applications and/or systems (e.g., request throughput and latency, CPU usage, RAM consumption, …);
  - The configuration of cluster resources and the applications deployed on them;
  - Errors in cluster resources and/or applications.
- Users can interact with this layer through different visual representations (e.g., dashboards, reports, …).
 

Elastic Stack
- Beats: lightweight data shippers (Observation and Collection):
  - Purpose: to observe, collect, and send data to Logstash or ElasticSearch for further processing and indexing.
- Logstash: a data ingestion tool (Collection):
  - Purpose: to collect, transform, and send data from various sources to ElasticSearch.
- ElasticSearch: a distributed, JSON-based search and analytics engine (Analysis):
  - Purpose: to index, search, and analyze data (see the sketch after this list).
- Kibana: a data visualization and exploration tool (Presentation):
  - Purpose: to visualize and manage data from ElasticSearch.
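
As a small illustration of the Analysis role, the sketch below uses the official ElasticSearch Python client (an 8.x-style API is assumed) to index a monitoring event and query it back; the host address, index name, and fields are examples, not part of the stack itself:

```python
from elasticsearch import Elasticsearch

# Connect to a local cluster (address and index name are assumptions for the example).
es = Elasticsearch("http://localhost:9200")

# Index (store) a monitoring event so it becomes searchable.
es.index(
    index="monitoring-events",
    document={"timestamp": "2024-01-01T12:00:00Z", "host": "server-1", "cpu_percent": 93.5},
    refresh="wait_for",  # make the document immediately visible to the search below
)

# Search for events matching a condition (CPU usage of at least 90%).
resp = es.search(index="monitoring-events", query={"range": {"cpu_percent": {"gte": 90}}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```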
 
 
Monitoring and Actuation
- Monitoring can be combined with actuation to automatically apply actions to the system (see the sketch after this list);
- The actions to apply are specified as policies (e.g., when a server's CPU usage is above a given threshold, migrate some VMs to another server);
- The orchestrator is responsible for defining the best strategy for applying policies (e.g., choosing which VM(s) to migrate and to which server(s));
- The actuator is responsible for applying the strategy defined by the orchestrator (e.g., migrating the VM to another server).
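
To make the policy / orchestrator / actuator split concrete, here is a minimal sketch (Python; all data structures and names are invented, and no real virtualization API is called):

```python
# Policy: when a server's CPU usage is above a threshold, migrate a VM away.
POLICY = {"metric": "cpu_percent", "threshold": 90.0, "action": "migrate_vm"}

def orchestrate(server, policy, other_servers):
    """Orchestrator: decide the strategy, i.e., which VM to move and where."""
    if server["metrics"][policy["metric"]] <= policy["threshold"]:
        return None  # policy not violated, nothing to do
    vm = max(server["vms"], key=lambda v: v["cpu_percent"])                  # heaviest VM
    target = min(other_servers, key=lambda s: s["metrics"]["cpu_percent"])   # least-loaded server
    return {"action": policy["action"], "vm": vm["name"], "target": target["name"]}

def actuate(strategy):
    """Actuator: apply the strategy (here just printed; a real one would call a management API)."""
    if strategy:
        print(f"migrating {strategy['vm']} to {strategy['target']}")

# Usage with made-up monitoring data for three servers.
overloaded = {"name": "s1", "metrics": {"cpu_percent": 95.0},
              "vms": [{"name": "vm-a", "cpu_percent": 60.0}, {"name": "vm-b", "cpu_percent": 30.0}]}
others = [{"name": "s2", "metrics": {"cpu_percent": 40.0}},
          {"name": "s3", "metrics": {"cpu_percent": 20.0}}]
actuate(orchestrate(overloaded, POLICY, others))  # -> migrating vm-a to s3
```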
 
