Software Systems
- 
Software systems can assume different forms (e.g., operating system, file system, database, web server, …); 
- 
Systems perform useful work for users by receiving requests, handling these, and producing responses; 
- 
Systems use physical and logical resources with limited capacity: - Physical: CPU, memory, disk, network, …
- Logical: locks, caches, …
 

System Benchmarking
Why is it useful?
- 
How can one build efficient and reliable systems? - follow good design and implementation guidelines;
- review and validade the code and implementation;
- test (i.e., benchmark) implementations!
 
- 
How can one know the system is using logical and physical resources efficiently? - Benchmark the system!
 
- 
How can one know that the environment where the system is running has enough resources? - Benchmark the system in that environment.
 
Ecosystem
- A benchmarking ecosystem consists of:
- The workload (group of requests) being issued to the System Under Test (SUT);
- The environment where the SUT is running;
- The metrics collected to measure the performance, efficiency, and/or reliability of the SUT.
 

Workload
- 
Testing systems in production may not be viable, or the best option: - imagine only knowing if your SUT works when deployed and serving users…
 
- 
One can use traces of requests from a real production setup instead: - Advantage: requests extracted from real workloads;
- Disadvantage: Hard to get (sometimes) and scale (i.e., how can one scale a trace with 100 requests to millions of requests without losing realism?).
 
- 
Another option is to use synthetic workloads: - Use a subset of synthetically generated requests (e.g., set of database queries, file system operations, web requests, …);
- Generate parameters to mimic different behaviors (e.g., request type, size, parallelism):
- following a given distribution (e.g., sequential, uniform, Poisson, …).
 
 
Environment
Hardware and Software
- 
Knowing and setting up the right environment to run experiments is very important! - testing your SUT in an unrealistic environment may lead to wrong conclusions;
- it is also important for others to be able to reproduce your results.
 
- 
One must be able to characterize the experimental environment; 
- 
Hardware: - what CPU, RAM, GPU, and disk models are being used?
- what hardware configurations are being used? (e.g., number of CPUs, amount of RAM, …).
 
- 
Software: - what operating system is being used? And the kernel version?
- what about the libraries and corresponding versions?
- and finally, what are the versions of the different SUT components?
 
Metrics
- The SUT will be serving multiple user requests over time;
- The responses varies:
- some requests are served correctly or incorrectly;
- other requests may not be served (i.e., the SUT rejects a request because it is too busy).
 

Performance
Metric: Response Time (RT)
- 
RT: The time interval between the user’s request and the system’s response: - sometimes also referred to as latency.
 
- 
As the load in the system increases, the RT of requests tends also to increase. 

Metric: Throughput
- Throughput: rate at which user requests are serviced by the system (i.e., operations served per unit of time);
- As the load in the system increases, the throughput of requests tends to increase until a certain point, reaching the system’s nominal capacity:
- when the system becomes saturated, the throughput starts decreasing.
 

Response Time vs Throughput
- A naive view (RT = 1 / Throughput) is false!!!
- this is only true when the system is busy 100% of the time executing exactly 1 request!
 

Phases
- In the figure below, one can observe 3 distinct phases of the system:
- Idle: request are immediately handled as the system has spare capacity;
- Near Capacity: requests are handled after a brief wait (throughput and RT increasing);
- Overload: resources become saturated (throughput decreases while RT increases).
 

Tradeoff
- 
Ideally, one would optimize a system to increase throughput and decrease RT but this is really hard to achieve; 
- 
Most optimizations trade one metric for the other (e.g., batching requests usually improve throughput at the cost of RT). 

More Metrics
- 
Utilization of resources (e.g., CPU, RAM, network, disk, energy, …); 
- 
Efficiency of the system (e.g., the ratio between throughput and utilization, between throughput and energy consumption, …); 
- 
Reliability of a system (e.g., number of errors, failure, …); 
- 
Availability of a system (e.g., uptime, downtime). 
Analysis of our Metrics
- 
When running experiments, one must collect several samples for metrics of interest! - repeat your experiments several times to identify abnormal behavior (outliers);
- remember that many workloads may not be deterministic!
 
- 
After collecting the samples, these must be analyzed and, ideally, summarized. 
Some Conclusions
- 
Use mean, mode, median, or high percentiles: - along with confidence intervals (CI).
 
- 
Combine mean and std. dev. to get the **coefficient of variation (C.O.V): - Std. dev. / mean (usually expressed as %).
 
- 
Use visual representations to observer and better understand your samples! - Time-series plots, histograms, ECDFs, …
 
Common Mistakes
- 
No goals or biased ones: - define the evaluation questions you want to answer with your experiments!
 
- 
Unsystematic approach - Reproducibility is key! - set and document your workloads, experimental environment, and configurations!
 
- 
Unrepresentative workloads and metrics: - choose realistic workloads and experiments that help answer your questions;
- avoid being biased (i.e., don’t be afraid to show the strenghts and drawbacks of the SUT).
 
- 
Wrong analysis and presentation of results: - Inspect samples in time for stability:
- consider external phenomenons.
 
- Inspect ECDFs for distribution;
- Then summarize…
 
- Inspect samples in time for stability: