Software Systems

  • Software systems can assume different forms (e.g., operating system, file system, database, web server, …);

  • Systems perform useful work for users by receiving requests, handling them, and producing responses;

  • Systems use physical and logical resources with limited capacity:

    • Physical: CPU, memory, disk, network, …
    • Logical: locks, caches, …

System Benchmarking

Why is it useful?

  • How can one build efficient and reliable systems?

    • follow good design and implementation guidelines;
    • review and validate the code and implementation;
    • test (i.e., benchmark) implementations!
  • How can one know the system is using logical and physical resources efficiently?

    • Benchmark the system!
  • How can one know that the environment where the system is running has enough resources?

    • Benchmark the system in that environment.

Ecosystem

  • A benchmarking ecosystem consists of:
    • The workload (group of requests) being issued to the System Under Test (SUT);
    • The environment where the SUT is running;
    • The metrics collected to measure the performance, efficiency, and/or reliability of the SUT.

Workload

  • Testing systems in production may not be viable, nor even the best option:

    • imagine only knowing if your SUT works when deployed and serving users…
  • One can use traces of requests from a real production setup instead:

    • Advantage: requests extracted from real workloads;
    • Disadvantage: Hard to get (sometimes) and scale (i.e., how can one scale a trace with 100 requests to millions of requests without losing realism?).
  • Another option is to use synthetic workloads:

    • Use a subset of synthetically generated requests (e.g., set of database queries, file system operations, web requests, …);
    • Generate parameters to mimic different behaviors (e.g., request type, size, parallelism):
      • following a given distribution (e.g., sequential, uniform, Poisson, …); see the generator sketch below.
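
A minimal sketch of a synthetic workload generator, assuming Poisson arrivals; the request types, sizes, and parameter names here are illustrative, not taken from any specific tool:

```python
import random

def synthetic_workload(n_requests, mean_rate=100.0, seed=42):
    """Yield (arrival_time, op, size) tuples for a synthetic workload.

    Inter-arrival times are drawn from an exponential distribution,
    i.e., arrivals form a Poisson process at `mean_rate` requests/s.
    """
    rng = random.Random(seed)  # fixed seed keeps the workload reproducible
    t = 0.0
    for _ in range(n_requests):
        t += rng.expovariate(mean_rate)     # exponential inter-arrival gap
        op = rng.choice(["read", "write"])  # hypothetical request type
        size = rng.randint(1, 64) * 1024    # hypothetical payload size (bytes)
        yield (t, op, size)
```

Replaying the generator with the same seed reproduces the exact same request stream, which is convenient when repeating experiments.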

Environment

Hardware and Software

  • Knowing and setting up the right environment to run experiments is very important!

    • testing your SUT in an unrealistic environment may lead to wrong conclusions;
    • it is also important for others to be able to reproduce your results.
  • One must be able to characterize the experimental environment (see the sketch after this list);

  • Hardware:

    • what CPU, RAM, GPU, and disk models are being used?
    • what hardware configurations are being used? (e.g., number of CPUs, amount of RAM, …).
  • Software:

    • what operating system is being used? And the kernel version?
    • what about the libraries and corresponding versions?
    • and finally, what are the versions of the different SUT components?
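
As a rough illustration, part of this characterization can be captured with the standard library (hardware models such as the exact CPU or disk still have to be recorded by hand):

```python
import os
import platform
import sys

# Record the software side of the experimental environment so results
# can later be tied to a concrete, reproducible setup.
environment = {
    "os": platform.system(),           # e.g., Linux
    "kernel": platform.release(),      # kernel version
    "arch": platform.machine(),        # e.g., x86_64
    "python": sys.version.split()[0],  # interpreter version
    "cpu_count": os.cpu_count(),       # number of logical CPUs
}
print(environment)
```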

Metrics

  • The SUT will be serving multiple user requests over time;
  • The responses vary:
    • some requests are served correctly or incorrectly;
    • other requests may not be served (i.e., the SUT rejects a request because it is too busy).

Performance

Metric: Response Time (RT)

  • RT: The time interval between the user’s request and the system’s response:

    • sometimes also referred to as latency.
  • As the load in the system increases, the RT of requests also tends to increase (see the timing sketch below).
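
A minimal timing sketch; `send_request` is a hypothetical callable that issues one request to the SUT (e.g., an HTTP request or a database query):

```python
import time

def timed_request(send_request):
    """Issue one request and return (response, response time in seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    response = send_request()    # hypothetical call into the SUT
    return response, time.perf_counter() - start
```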

Metric: Throughput

  • Throughput: the rate at which user requests are serviced by the system (i.e., operations served per unit of time); see the sketch below;
  • As the load in the system increases, the throughput of requests tends to increase until a certain point, reaching the system’s nominal capacity:
    • when the system becomes saturated, the throughput starts decreasing.
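
Throughput can be sampled in the same spirit; this sketch models a single closed-loop client that issues requests back-to-back for a fixed window (again assuming a hypothetical `send_request`):

```python
import time

def measure_throughput(send_request, duration_s=10.0):
    """Issue requests for `duration_s` seconds; return ops per second."""
    completed = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration_s:
        send_request()  # hypothetical call into the SUT
        completed += 1
    elapsed = time.perf_counter() - start
    return completed / elapsed
```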

Response Time vs Throughput

  • A naive view (RT = 1 / Throughput) is false!
    • this is only true when the system is busy 100% of the time executing exactly one request (see the note below)!
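  • In steady state, Little’s Law gives the general relation: average number of requests in the system N = Throughput × RT;
    • e.g., N = 10 concurrent requests served at 100 requests/s gives a mean RT of 10 / 100 = 0.1 s; RT = 1 / Throughput holds only in the special case N = 1.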

Phases

  • As the load increases, one can observe 3 distinct phases of the system:
    • Idle: requests are immediately handled, as the system has spare capacity;
    • Near Capacity: requests are handled after a brief wait (throughput and RT increasing);
    • Overload: resources become saturated (throughput decreases while RT increases).

Tradeoff

  • Ideally, one would optimize a system to increase throughput and decrease RT, but this is really hard to achieve;

  • Most optimizations trade one metric for the other (e.g., batching requests usually improves throughput at the cost of RT; see the sketch below).
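
As a rough illustration of that tradeoff, here is a batching sketch; `send_batch` is a hypothetical SUT call that serves a whole batch in one round trip:

```python
def batched_client(requests, send_batch, batch_size=32):
    """Group requests into batches before sending them.

    Fewer round trips amortize per-request overhead (higher throughput),
    but a request admitted early into a batch must wait for the batch
    to fill before it is sent (higher RT).
    """
    batch = []
    for request in requests:
        batch.append(request)
        if len(batch) == batch_size:
            send_batch(batch)  # hypothetical: one round trip per batch
            batch = []
    if batch:
        send_batch(batch)      # flush the final, partial batch
```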

More Metrics

  • Utilization of resources (e.g., CPU, RAM, network, disk, energy, …);

  • Efficiency of the system (e.g., the ratio between throughput and utilization, between throughput and energy consumption, …);

  • Reliability of a system (e.g., number of errors, failures, …);

  • Availability of a system (e.g., uptime, downtime).
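    • e.g., availability = uptime / (uptime + downtime); about 5 minutes of downtime per year corresponds to roughly 99.999% availability (“five nines”).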

Analysis of our Metrics

  • When running experiments, one must collect several samples for metrics of interest!

    • repeat your experiments several times to identify abnormal behavior (outliers);
    • remember that many workloads may not be deterministic!
  • After collecting the samples, they must be analyzed and, ideally, summarized.

Some Conclusions

  • Use mean, mode, median, or high percentiles:

    • along with confidence intervals (CI).
  • Combine mean and std. dev. to get the coefficient of variation (C.O.V.):

    • Std. dev. / mean (usually expressed as %).
  • Use visual representations to observe and better understand your samples!

    • Time-series plots, histograms, ECDFs, … (see the summarization sketch below).
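
A minimal summarization sketch using only the standard library; it assumes enough samples for `statistics.quantiles` and uses a normal approximation for the 95% confidence interval:

```python
import math
import statistics

def summarize(samples):
    """Summarize samples: central tendency, tail, spread, and a
    normal-approximation 95% confidence interval for the mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "p99": statistics.quantiles(samples, n=100)[98],  # 99th percentile
        "cov_%": stdev / mean * 100,                      # coefficient of variation
        "ci95": 1.96 * stdev / math.sqrt(n),              # half-width: mean +/- ci95
    }

def ecdf(samples):
    """Empirical CDF: sorted samples paired with cumulative fractions."""
    xs = sorted(samples)
    return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]
```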

Common Mistakes

  • No goals or biased ones:

    • define the evaluation questions you want to answer with your experiments!
  • Unsystematic approach - Reproducibility is key!

    • set and document your workloads, experimental environment, and configurations!
  • Unrepresentative workloads and metrics:

    • choose realistic workloads and experiments that help answer your questions;
    • avoid being biased (i.e., don’t be afraid to show the strengths and drawbacks of the SUT).
  • Wrong analysis and presentation of results:

    • Inspect samples over time for stability:
      • consider external phenomena.
    • Inspect ECDFs for distribution;
    • Then summarize…