Systems Performance - Enterprise and the Cloud


Chapter 1

  • “Full stack” *

    The term full stack is sometimes used to describe only the application environment, including databases, applications, and web servers. When speaking of systems performance, however, we use full stack to mean the entire software stack from the application down to metal (the hardware), including system libraries, the kernel, and the hardware itself. Systems performance studies the full stack.

  • Workload analysis vs resource analysis, approaching the problem from either end

  • Bottlenecks can also be complex and related in unexpected ways; fixing one may simply move the bottleneck elsewhere in the system

  • Observability vs. experimental tools

  • Profiling vs Tracing

    • Profiling: typically done by sampling, i.e. taking a subset (a sample) of measurements to paint a coarse picture of the target
    • Tracing: event-based recording, where event data is captured and saved for later analysis or consumed on-the-fly for custom summaries and other actions
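A minimal sketch of the difference, assuming a made-up timeline of function calls (the names, durations, and the 5 ms sampling interval are all hypothetical). Tracing keeps one record per event; profiling only notes what is running at each sample point, so it is cheaper but coarser.

```python
from collections import Counter

# Hypothetical timeline: (function_name, duration_ms), executed back to back.
TIMELINE = [("parse", 3), ("query", 12), ("render", 5), ("query", 10)]

def trace(timeline):
    """Tracing: record every event with its start time and duration."""
    events, now = [], 0
    for name, duration in timeline:
        events.append({"start_ms": now, "name": name, "duration_ms": duration})
        now += duration
    return events

def profile(timeline, interval_ms):
    """Profiling by sampling: note which function is running every interval_ms."""
    # Expand the timeline into one entry per millisecond to keep the sketch simple.
    running = [name for name, duration in timeline for _ in range(duration)]
    return Counter(running[t] for t in range(0, len(running), interval_ms))

events = trace(TIMELINE)                      # one record per call
samples = profile(TIMELINE, interval_ms=5)    # one record per sample point
```

Here the trace holds all four calls with exact timings, while the profile only says that "query" was on-CPU in most samples.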
  • The Linux kernel has hundreds of “tracepoints” (static instrumentation) but also has things like kprobes/BPF (dynamic instrumentation)

  • USDT (user statically defined tracing) is a technology (what is this though, a library?) that allows user-space code to set up tracepoints

  • Methodologies

    • Checklists > randomly rooting around
    • 60-second checklist
  • Case studies

    • Slow database queries
      • High disk IO, no corresponding CPU spike

      • No disk errors *

        He checks disk error counters from /sys; they are zero.

        • What counters?
      • Database process is blocking on slow IO (via offcputime)

      • iostat numbers look like increased disk load

      • No increased load from the database

      • Theories:

        • fs fragmentation because the disk is close to full (no)
        • inode exhaustion?
        • page cache hit rate is lower because something else is using up memory (yes, verified via cachestat)
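A rough sketch of why a modest drop in page cache hit rate can show up as a large jump in disk load. It assumes hit and miss counts of the kind a tool such as cachestat reports; the numbers below are made up for illustration and are not from the case study.

```python
def hit_ratio(hits, misses):
    """Fraction of page cache lookups served from memory."""
    total = hits + misses
    return hits / total if total else 0.0

# Hypothetical counts for the same workload before and after memory pressure.
before = {"hits": 9_900, "misses": 100}
after = {"hits": 9_000, "misses": 1_000}

ratio_before = hit_ratio(**before)
ratio_after = hit_ratio(**after)

# Every miss becomes a disk read, so compare misses rather than ratios.
disk_read_growth = after["misses"] / before["misses"]
```

With these numbers the hit ratio only falls from 99% to 90%, but the reads reaching the disk grow tenfold, with no change in load from the database itself.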

Chapter 2: Methodologies

  • “Latency” on its own is too vague: *

    For example, the load time for a website may be composed of three different times measured from different locations: DNS latency, TCP connection latency, and then TCP data transfer time. DNS latency refers to the entire DNS operation. TCP connection latency refers to the initialization only (TCP handshake). At a higher level, all of these, including the TCP data transfer time, may be treated as latency of something else. For example, the time from when the user clicks a website link to when the resulting page is fully loaded may be termed latency, which includes the time for the browser to fetch a web page over a network and render it. Since the single word “latency” can be ambiguous, it is best to include qualifying terms to explain what it measures: request latency, TCP connection latency, etc.

  • By tuning at the application level, you may be able to eliminate or reduce database queries and improve performance by a large factor (e.g., 20x). Tuning down to the storage device level may eliminate or improve storage I/O, but a tax has already been paid in executing higher-level OS stack code, so this may improve resulting application performance only by percentages (e.g., 20%).
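A back-of-the-envelope sketch of the 20x vs 20% contrast. This is just Amdahl's-law-style arithmetic, not something the book spells out here, and the 100 ms request with its time breakdown is a made-up assumption.

```python
def speedup(total_ms, part_ms, part_factor):
    """Overall speedup when only part_ms of total_ms becomes part_factor times faster."""
    new_total = (total_ms - part_ms) + part_ms / part_factor
    return total_ms / new_total

# Hypothetical 100 ms request.
# Application-level fix: the queries that account for 95 ms are eliminated entirely.
app_level = speedup(total_ms=100, part_ms=95, part_factor=float("inf"))

# Storage-level fix: the 20 ms spent in the device itself becomes 10x faster,
# but the other 80 ms (the "tax" paid higher up the stack) is untouched.
storage_level = speedup(total_ms=100, part_ms=20, part_factor=10)
```

Under these assumptions the application-level change gives 20x, while the much faster device gives only about 1.2x (roughly 20%).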

  • Performance degradation as load increases

    • Examples of issues that may cause “fast degradation”: system is out of memory and pages are evicted to disk/swap, saturated (rotational) disks
    • Examples of issues that may cause “slow degradation”: CPU load
  • Utilization: % of time the resource is in use (or % of stated capacity in use)

  • Saturation: degree to which a resource is unable to accept more work (at 100% utilization)

  • Workload analysis vs. resource analysis

  • Methodologies

    • Scientific method: hypothesize, test, validate/discard hypothesis based on data

    • USE method

      • For every resource, check utilization, saturation, and errors
      • Utilization is time-based (% of time the resource is busy) or capacity-based (% of the rated capacity that’s being used), depending on the type of resource
      • Examples of resources to check
      • Suited to resources that degrade at high utilization/saturation (unlike caches, where perf improves as utilization goes up)
      • Some combinations are harder to check
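A minimal sketch of a USE pass over a set of resources. The sample values, the `ResourceSample` type, and the 70% utilization threshold are all assumptions for illustration; in practice the numbers would come from tools like vmstat or iostat.

```python
from dataclasses import dataclass

@dataclass
class ResourceSample:
    name: str
    utilization_pct: float  # time- or capacity-based, depending on the resource
    saturation: float       # e.g. queue length or number of waiting tasks
    errors: int             # error events seen during the interval

def use_check(samples, util_threshold_pct=70.0):
    """Return (resource, finding) pairs that deserve a closer look."""
    findings = []
    for s in samples:
        if s.errors > 0:
            findings.append((s.name, "errors"))
        if s.saturation > 0:
            findings.append((s.name, "saturation"))
        if s.utilization_pct >= util_threshold_pct:
            findings.append((s.name, "utilization"))
    return findings

# Hypothetical readings for three resources over one interval.
samples = [
    ResourceSample("cpu", utilization_pct=35.0, saturation=0, errors=0),
    ResourceSample("disk", utilization_pct=92.0, saturation=4, errors=0),
    ResourceSample("nic", utilization_pct=10.0, saturation=0, errors=2),
]
findings = use_check(samples)
```

The point of the loop is completeness: every resource gets all three questions asked, so a quiet NIC with errors is not overlooked just because the disk looks busier.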
    • RED method

      • Specific to microservices
      • For each service, check the request rate, errors, and duration
      • Use the USE method for machine health and the RED method for user health
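A small sketch of computing the three RED numbers for one service over a time window. The request records, their field names, and the choice to count HTTP 5xx responses as errors are assumptions for illustration.

```python
from statistics import mean

def red_summary(requests, window_s):
    """Request rate, error count, and mean duration for one service."""
    durations = [r["duration_ms"] for r in requests]
    return {
        "rate_per_s": len(requests) / window_s,
        # Assumption: server-side failures are HTTP 5xx responses.
        "errors": sum(1 for r in requests if r["status"] >= 500),
        "mean_duration_ms": mean(durations) if durations else 0.0,
    }

# Hypothetical requests observed in a 2-second window.
requests = [
    {"status": 200, "duration_ms": 12.0},
    {"status": 200, "duration_ms": 18.0},
    {"status": 503, "duration_ms": 90.0},
    {"status": 200, "duration_ms": 20.0},
]
summary = red_summary(requests, window_s=2)
```

A mean hides outliers like the 90 ms request above, so a real dashboard would likely track percentiles as well.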

    • Workload characterization

      • Examine the load being applied, not the system itself
    • Drill-down analysis

      • Three stages: monitoring, to catch an initial sign that something is wrong (Grafana, etc.); identification, eliminating possibilities and narrowing down to a single system (Grafana/etc. but also things like vmstat/htop on single machines); and analysis, to find the root cause (tracing/debugging/eBPF/etc.)
    • Latency analysis

      • Starting with a broad latency issue, narrow/eliminate until a single slow system is revealed
  • Stopped at pg58

Chapter 16: Case Study