Software Systems

  • Software systems can assume different forms (e.g., operating system, file system, database, web server, …);

  • Systems perform useful work for users by receiving requests, handling them, and producing responses;

  • Systems use physical and logical resources with limited capacity:

    • Physical: CPU, memory, disk, network, …
    • Logical: locks, caches, …

System Benchmarking

Why is it useful?

  • How can one build efficient and reliable systems?

    • follow good design and implementation guidelines;
    • review and validate the code and implementation;
    • test (i.e., benchmark) implementations!
  • How can one know the system is using logical and physical resources efficiently?

    • Benchmark the system!
  • How can one know that the environment where the system is running has enough resources?

    • Benchmark the system in that environment.

Ecosystem

  • A benchmarking ecosystem consists of:
    • The workload (group of requests) being issued to the System Under Test (SUT);
    • The environment where the SUT is running;
    • The metrics collected to measure the performance, efficiency, and/or reliability of the SUT.

Workload

  • Testing systems in production may not be viable, nor even the best option:

    • imagine only knowing if your SUT works when deployed and serving users…
  • One can use traces of requests from a real production setup instead:

    • Advantage: requests extracted from real workloads;
    • Disadvantage: Hard to get (sometimes) and scale (i.e., how can one scale a trace with 100 requests to millions of requests without losing realism?).
  • Another option is to use synthetic workloads:

    • Use a subset of synthetically generated requests (e.g., set of database queries, file system operations, web requests, …);
    • Generate parameters to mimic different behaviors (e.g., request type, size, parallelism):
      • following a given distribution (e.g., sequential, uniform, Poisson, …); see the generator sketch below.
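
A minimal sketch of a synthetic workload generator, assuming Poisson arrivals; the request types, sizes, and parameter names here are illustrative, not taken from any specific tool:

```python
import random

def synthetic_workload(n_requests, mean_rate=100.0, seed=42):
    """Yield (arrival_time, op, size) tuples for a synthetic workload.

    Inter-arrival times are drawn from an exponential distribution,
    i.e., arrivals form a Poisson process at `mean_rate` requests/s.
    """
    rng = random.Random(seed)  # fixed seed keeps the workload reproducible
    t = 0.0
    for _ in range(n_requests):
        t += rng.expovariate(mean_rate)     # exponential inter-arrival gap
        op = rng.choice(["read", "write"])  # hypothetical request type
        size = rng.randint(1, 64) * 1024    # hypothetical payload size (bytes)
        yield (t, op, size)
```

Replaying the generator with the same seed reproduces the exact same request stream, which is convenient when repeating experiments.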

Environment

Hardware and Software

  • Knowing and setting up the right environment to run experiments is very important!

    • testing your SUT in an unrealistic environment may lead to wrong conclusions;
    • it is also important for others to be able to reproduce your results.
  • One must be able to characterize the experimental environment (see the sketch after this list);

  • Hardware:

    • what CPU, RAM, GPU, and disk models are being used?
    • what hardware configurations are being used? (e.g., number of CPUs, amount of RAM, …).
  • Software:

    • what operating system is being used? And the kernel version?
    • what about the libraries and corresponding versions?
    • and finally, what are the versions of the different SUT components?
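
As a rough illustration, part of this characterization can be captured with the standard library (hardware models such as the exact CPU or disk still have to be recorded by hand):

```python
import os
import platform
import sys

# Record the software side of the experimental environment so results
# can later be tied to a concrete, reproducible setup.
environment = {
    "os": platform.system(),           # e.g., Linux
    "kernel": platform.release(),      # kernel version
    "arch": platform.machine(),        # e.g., x86_64
    "python": sys.version.split()[0],  # interpreter version
    "cpu_count": os.cpu_count(),       # number of logical CPUs
}
print(environment)
```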

Metrics

  • The SUT will be serving multiple user requests over time;
  • The responses vary:
    • some requests are served correctly or incorrectly;
    • other requests may not be served (i.e., the SUT rejects a request because it is too busy).

Performance

Metric: Response Time (RT)

  • RT: The time interval between the user’s request and the system’s response:

    • sometimes also referred to as latency.
  • As the load in the system increases, the RT of requests also tends to increase (see the timing sketch below).
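
A minimal timing sketch; `send_request` is a hypothetical callable that issues one request to the SUT (e.g., an HTTP request or a database query):

```python
import time

def timed_request(send_request):
    """Issue one request and return (response, response time in seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    response = send_request()    # hypothetical call into the SUT
    return response, time.perf_counter() - start
```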

Metric: Throughput

  • Throughput: the rate at which user requests are serviced by the system (i.e., operations served per unit of time); see the sketch below;
  • As the load in the system increases, the throughput of requests tends to increase until a certain point, reaching the system’s nominal capacity:
    • when the system becomes saturated, the throughput starts decreasing.
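
Throughput can be sampled in the same spirit; this sketch models a single closed-loop client that issues requests back-to-back for a fixed window (again assuming a hypothetical `send_request`):

```python
import time

def measure_throughput(send_request, duration_s=10.0):
    """Issue requests for `duration_s` seconds; return ops per second."""
    completed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        send_request()  # hypothetical call into the SUT
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed
```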

Response Time vs Throughput

  • A naive view (RT = 1 / Throughput) is false!
    • this is only true when the system is busy 100% of the time executing exactly one request (see the note below)!
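  • In steady state, Little’s Law gives the general relation: average number of requests in the system N = Throughput × RT;
    • e.g., N = 10 concurrent requests served at 100 requests/s gives a mean RT of 10 / 100 = 0.1 s; RT = 1 / Throughput holds only in the special case N = 1.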

Phases

  • As the load increases, one can observe 3 distinct phases of the system:
    • Idle: requests are immediately handled, as the system has spare capacity;
    • Near Capacity: requests are handled after a brief wait (throughput and RT increasing);
    • Overload: resources become saturated (throughput decreases while RT increases).

Tradeoff

  • Ideally, one would optimize a system to increase throughput and decrease RT, but this is really hard to achieve;

  • Most optimizations trade one metric for the other (e.g., batching requests usually improves throughput at the cost of RT; see the sketch below).
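
As a rough illustration of that tradeoff, here is a batching sketch; `send_batch` is a hypothetical SUT call that serves a whole batch in one round trip:

```python
def batched_client(requests, send_batch, batch_size=32):
    """Group requests into batches before sending them.

    Fewer round trips amortize per-request overhead (higher throughput),
    but a request admitted early into a batch must wait for the batch
    to fill before it is sent (higher RT).
    """
    batch = []
    for request in requests:
        batch.append(request)
        if len(batch) == batch_size:
            send_batch(batch)  # hypothetical: one round trip per batch
            batch = []
    if batch:
        send_batch(batch)      # flush the final, partial batch
```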

More Metrics

  • Utilization of resources (e.g., CPU, RAM, network, disk, energy, …);

  • Efficiency of the system (e.g., the ratio between throughput and utilization, between throughput and energy consumption, …);

  • Reliability of a system (e.g., number of errors, failures, …);

  • Availability of a system (e.g., uptime, downtime).
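    • e.g., availability = uptime / (uptime + downtime); about 5 minutes of downtime per year corresponds to roughly 99.999% availability (“five nines”).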

Analysis of our Metrics

  • When running experiments, one must collect several samples for metrics of interest!

    • repeat your experiments several times to identify abnormal behavior (outliers);
    • remember that many workloads may not be deterministic!
  • After collecting the samples, they must be analyzed and, ideally, summarized.

Some Conclusions

  • Use mean, mode, median, or high percentiles:

    • along with confidence intervals (CI).
  • Combine mean and std. dev. to get the coefficient of variation (C.O.V.):

    • Std. dev. / mean (usually expressed as %).
  • Use visual representations to observe and better understand your samples!

    • Time-series plots, histograms, ECDFs, … (see the summarization sketch below).
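
A minimal summarization sketch using only the standard library; it assumes enough samples for `statistics.quantiles` and uses a normal approximation for the 95% confidence interval:

```python
import math
import statistics

def summarize(samples):
    """Summarize samples: central tendency, tail, spread, and a
    normal-approximation 95% confidence interval for the mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "p99": statistics.quantiles(samples, n=100)[98],  # 99th percentile
        "cov_%": stdev / mean * 100,                      # coefficient of variation
        "ci95": 1.96 * stdev / math.sqrt(n),              # half-width: mean +/- ci95
    }

def ecdf(samples):
    """Empirical CDF: sorted samples paired with cumulative fractions."""
    xs = sorted(samples)
    return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]
```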

Common Mistakes

  • No goals or biased ones:

    • define the evaluation questions you want to answer with your experiments!
  • Unsystematic approach - Reproducibility is key!

    • set and document your workloads, experimental environment, and configurations!
  • Unrepresentative workloads and metrics:

    • choose realistic workloads and experiments that help answer your questions;
    • avoid being biased (i.e., don’t be afraid to show the strengths and drawbacks of the SUT).
  • Wrong analysis and presentation of results:

    • Inspect samples over time for stability:
      • consider external phenomena.
    • Inspect ECDFs for distribution;
    • Then summarize…