thoroughness and dedication, belief in the value of preparation and documentation, and an awareness of what could go wrong, coupled with a strong desire to prevent it.

  • On-call playbook
  • Can we roll back changes safely when outages occur?
  • Version control DB
  • Service redundancy
  • Record the failures, calculate reliability etc.
  • Canarying
  • Service error budget
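
As a rough illustration of "record the failures, calculate reliability" and of an error budget, here is a minimal sketch. It assumes a request-based availability target; the function name `error_budget` and the numbers are made up for the example.

```python
def error_budget(slo: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize availability and remaining error budget for a request-based SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of requests that may fail without breaching the SLO.
    allowed_failures = (1.0 - slo) * total_requests
    return {
        "availability": 1.0 - failed_requests / total_requests,
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
    }


# Hypothetical month: one million requests, 400 failures, 99.9% target.
budget = error_budget(slo=0.999, total_requests=1_000_000, failed_requests=400)
print(budget)
```

A negative `remaining` value would mean the budget is spent, which is the usual signal to slow down pushes.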

Why is tracking recent pushes (versions) important?

Tickets, email alerts, and pages.

Push: Any change to a service’s running software or its configuration.

We avoid “magic” systems that try to learn thresholds or automatically detect causality.

Your monitoring system should address two questions: what’s broken, and why?

Maximum signal and minimum noise.

The Four Golden Signals (of monitoring)

  • Latency: it is important to track error latency, as opposed to just filtering out errors
  • Traffic
    • Web service: HTTP requests per second
    • Audio streaming system: network I/O rate or concurrent sessions
    • Key-value storage: transactions and retrievals per second
  • Errors
    • The rate of requests that fail
  • Saturation
    • Emphasizing the resources that are most constrained
    • Saturation is also concerned with predictions of impending saturation

The simplest way to differentiate between a slow average and a very slow “tail” of requests is to collect request counts bucketed by latencies. It can be tempting to combine monitoring with other aspects of inspecting complex systems, such as detailed system profiling, single-process debugging, tracking details about exceptions or crashes, load testing, log collection and analysis, or traffic inspection.
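
A minimal sketch of bucketing request counts by latency. The bucket boundaries and the sample latencies are assumptions chosen for illustration, not values from the book.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in milliseconds; a final open-ended
# bucket catches everything slower than the last bound.
BUCKET_BOUNDS_MS = [10, 30, 100, 300, 1000]


def bucket_latencies(latencies_ms):
    """Count requests per bucket; bucket i holds latencies <= BUCKET_BOUNDS_MS[i]."""
    counts = [0] * (len(BUCKET_BOUNDS_MS) + 1)
    for latency in latencies_ms:
        counts[bisect_left(BUCKET_BOUNDS_MS, latency)] += 1
    return counts


# Mostly fast requests plus two very slow ones.
sample = [5, 8, 12, 9, 25, 7, 2500, 6, 950]
counts = bucket_latencies(sample)
print(counts)
```

For this sample the mean is roughly 391 ms even though seven of the nine requests finish within 30 ms. The bucket counts make the slow tail visible where the average hides it.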

A fundamental philosophy on pages and pagers:

  • Every time the pager goes off, I should be able to react with a sense of urgency. It can’t happen too frequently.
  • Every page should be actionable.
  • Every page response should require intelligence, not a robotic response.
  • Pages should be about a novel problem or an event that hasn’t been seen before.

“A tension between short term and long term availability”.

Doing automation thoughtlessly can create as many problems as it solves.

The value of automation is not just scale: it also brings consistency, a platform, faster repairs, faster action, and time savings.

In fact, instead of having a system that has to have external glue logic, it would be even better to have a system that needs no glue logic at all.

“Turnup automation and the core system”.

Question: How to manage the system configuration change?

  • Were all of the service’s dependencies available and correctly configured?
  • Were all configurations and packages consistent with other deployments?
  • Could the team confirm that every configuration exception was desired?

Configuration auto-fix.

Automation processes can vary in three aspects:

  • Competence, i.e., their accuracy
  • Latency, how quickly all steps are executed when initiated
  • Relevance, or proportion of real-world process covered by automation

Organizational incentives:

  • A team whose primary task is to speed up the current turn-up has no incentive to reduce the technical debt of the service-owning team running the service in production later
  • A team not running automation has no incentive to build systems that are easy to automate
  • A PM whose schedule is not affected by low-quality automation will always prioritize new features over simplicity and automation

The most functional tools are usually written by those who use them.

Our evolution of turn-up automation followed a path:

  • Operator-triggered manual action (no automation)
  • Operator-written, system-specific automation
  • Externally maintained generic automation
  • Internally maintained, system-specific automation
  • Autonomous systems that need no human intervention

Talking about “Borg”: we achieved this goal by bringing ideas related to data distribution, APIs, hub-and-spoke architectures, and classic distributed system software development to bear upon the domain of infra management.

“Self-repairing” “Self-introspection”.

Idempotent operations.
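
A small sketch of what idempotence means for an automation step: applying it once or many times leaves the same result. The helper `ensure_line` and the config contents are hypothetical.

```python
def ensure_line(config_lines: list[str], line: str) -> list[str]:
    """Return the config with `line` present; repeated calls change nothing further."""
    if line in config_lines:
        return list(config_lines)
    return [*config_lines, line]


once = ensure_line(["max_conns=100"], "timeout=30")
twice = ensure_line(once, "timeout=30")
print(once == twice)
```

A non-idempotent version would append the line on every run, so a retried or repeated turn-up step would keep changing the system.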

associate a binary to a record of how it was built

Configuration Management

Release Engineering

  • Self-Service model
  • High velocity
  • Some teams perform hourly builds and then select the version to actually deploy to production from the resulting pool of builds
  • Some adopted a “Push on Green” release model

“Gated operations”

Uniquely identified artifact.
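
One way to picture a uniquely identified artifact tied to a record of how it was built is to key the record on a content hash of the binary. This is only a sketch; the field names, the revision string, and the build command are illustrative assumptions, not the release system described in the book.

```python
import hashlib
import json


def build_record(binary: bytes, source_revision: str, build_command: str) -> dict:
    """Pair a content-derived artifact ID with the inputs that produced the binary."""
    return {
        # Identical bytes always give the same ID; any change gives a new one.
        "artifact_id": hashlib.sha256(binary).hexdigest(),
        "source_revision": source_revision,
        "build_command": build_command,
    }


# Placeholder bytes standing in for a real binary.
record = build_record(b"placeholder binary contents", "rev-1234", "make release")
print(json.dumps(record, indent=2))
```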

Example: Unpacking the Causes of a Symptom

  • A Spanner cluster has high latency and RPCs to its servers are timing out
  • Why? CPU time is used up
  • Where? Evaluating a regular expression against paths to log files
  • Solution: Rewrite the regexp, no backtracking

You may annotate a graph showing the system’s error rates with the start and end times of a deployment of a new version.

  • An ideal test should have mutually exclusive alternatives, so that it can rule one group of hypotheses in and rule another set out
  • “Is the problem getting worse on its own, or because of the logging?”

Latency (logarithmic heat map)

Making Troubleshooting Easier

  • Building observability — with both white-box metrics and structured logs — into each component from the ground up
  • Designing systems with well-understood and observable interfaces between components

SRE should retain highly reliable, low overhead backup systems

“In response, the on-call engineers disabled all team automation in order to prevent further damage”

There is no greater test than reality. Ask yourself some big, open-ended questions: What if the building power fails…? What if the network equipment racks are standing in two feet of water…? DO YOU HAVE A PLAN? COULD THE PERSON SITTING NEXT TO YOU DO THE SAME?

A postmortem is a written record of an incident, its impact, the actions taken to mitigate or resolve it, the root cause(s), and the follow-up actions to prevent the incident from recurring.

“If you haven’t tried it, assume it’s broken”

“Writing your configuration files in an interpreted language is risky, as this approach is fraught with latent failures that are hard to definitively address.”

Capacity Planning

Never-ending cycle: assumptions change, deployments slip, and budgets are cut, resulting in revision upon revision of The Plan.

Brittle By Nature

  • Service outage
  • Customer demand
  • New resource came late
  • Performance goal changed

“Intent-Based” Capacity Planning = bin packing, autogenerated allocation schema.
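
To make the bin-packing idea concrete, here is a sketch using the first-fit decreasing heuristic. It is a simple stand-in, not the solver a real capacity planner would use; the service names, demands, and machine capacity are invented.

```python
def first_fit_decreasing(demands: dict[str, int], machine_capacity: int) -> list[dict[str, int]]:
    """Place each service on the first machine with room, largest demand first."""
    machines: list[dict[str, int]] = []
    for service, demand in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if demand > machine_capacity:
            raise ValueError(f"{service} does not fit on a single machine")
        for machine in machines:
            if sum(machine.values()) + demand <= machine_capacity:
                machine[service] = demand
                break
        else:
            # No existing machine has room, so allocate a new one.
            machines.append({service: demand})
    return machines


# Hypothetical demands in arbitrary resource units.
plan = first_fit_decreasing(
    {"frontend": 6, "cache": 4, "batch": 5, "logs": 3, "auth": 2},
    machine_capacity=10,
)
print(plan)
```

The operator states the intent (what each service needs) and the allocation is generated, so a changed demand means rerunning the packing instead of editing a plan by hand.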

Traffic load balancing is how we decide which of the many, many machines in our datacenter will serve a particular request.

Some requests might be directed to a datacenter that is slightly farther away in order to keep caches warm.

  • Layer 1: DNS Load Balancing
  • Layer 2: Virtual IP address Load Balancing
  • Consistent hashing
  • Packet encapsulation: a network load balancer puts the forwarded packet into another IP packet with GRE, and uses a backend’s address as the destination, which requires a larger MTU within the datacenter
  • Layer 3: Subsetting
  • Layer 4: Policing
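
The consistent hashing item above can be sketched as a hash ring with virtual nodes. This is a generic textbook version, not the load balancer's actual implementation; the class name `HashRing`, the backend names, and the replica count are assumptions.

```python
import bisect
import hashlib


def _point(key: str) -> int:
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, backends, replicas: int = 100):
        if not backends:
            raise ValueError("at least one backend is required")
        # Each backend owns `replicas` points so load spreads more evenly.
        self._ring = sorted(
            (_point(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the backend owning the first ring point after the key's hash."""
        idx = bisect.bisect(self._points, _point(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["be-1", "be-2", "be-3"])
smaller = HashRing(["be-1", "be-2"])  # same ring with be-3 removed
keys = [f"conn-{i}" for i in range(500)]
moved = sum(ring.lookup(k) != smaller.lookup(k) for k in keys)
print(f"{moved} of {len(keys)} keys moved after removing be-3")
```

Only keys that were on the removed backend change owner; connections mapped to the surviving backends stay where they were.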


  • Degradation of response

QpS makes a bad metric.

The criticality of a request is orthogonal to its latency requirements and thus to the underlying network QoS used.

Process health checking and service health checking are two conceptually distinct operations.

The key breakthrough of Big Data is the widespread application of “embarrassingly parallel” algorithms to cut a large workload into chunks small enough to fit onto individual machines.

Because Spanner does not make for a high-throughput filesystem

Protecting against a failure at layer X requires storing data on diverse components at that layer. Media isolation protects against media flaws: a bug or attack in a disk device driver is unlikely to affect tape drives.

  • First layer: Soft Deletion
  • Second Layer: Backups and Their Related Recovery Methods
  • Third layer: Early Detection
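
The first layer, soft deletion, can be sketched as a store that moves deleted data to a trash area and destroys it only after a retention window. The class, its method names, and the 60-second window are hypothetical; timestamps are passed in explicitly to keep the example deterministic.

```python
class SoftDeleteStore:
    """Toy key-value store where deletes stay reversible for a retention window."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self.live = {}
        self.trash = {}  # key -> (value, deleted_at)

    def put(self, key, value):
        self.live[key] = value

    def delete(self, key, now: float):
        # Move the value to the trash instead of destroying it.
        self.trash[key] = (self.live.pop(key), now)

    def undelete(self, key):
        value, _deleted_at = self.trash.pop(key)
        self.live[key] = value

    def purge_expired(self, now: float):
        # Only data older than the retention window is destroyed for good.
        expired = [
            key for key, (_, deleted_at) in self.trash.items()
            if now - deleted_at >= self.retention_seconds
        ]
        for key in expired:
            del self.trash[key]
        return expired


store = SoftDeleteStore(retention_seconds=60)
store.put("doc", "v1")
store.delete("doc", now=0)
store.undelete("doc")  # a mistaken delete is cheap to reverse
print(store.live)
```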

Data validation pipeline.

  • Continuously test the recovery process as part of your normal ops
  • Set up alerts that fire when a recovery process fails to provide a heartbeat indication of its success
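
A sketch of the heartbeat check behind such an alert: the alert fires on the absence of a recent success, not on an explicit failure report. The function name and the one-day threshold are assumptions.

```python
from typing import Optional


def recovery_alert_needed(last_success: Optional[float], now: float,
                          max_age_seconds: float) -> bool:
    """Fire when the recovery test has never succeeded or its last success is stale."""
    if last_success is None:
        return True
    return now - last_success > max_age_seconds


DAY = 24 * 3600
# Timestamps are plain seconds; a real check would read them from monitoring.
print(recovery_alert_needed(last_success=None, now=10 * DAY, max_age_seconds=DAY))
print(recovery_alert_needed(last_success=9.5 * DAY, now=10 * DAY, max_age_seconds=DAY))
```

Alerting on a missing heartbeat also catches the case where the recovery job silently stopped running.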

Nor do traditional companies have the opportunity to design a detailed launch process, because they don’t accumulate enough experience performing launches to generate a robust and mature process.

When designing a new system, it is useful to have SRE on the spot.

Check TCP port:

nc -vz targetServer portNum
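
When `nc` is not available, a similar check can be done from Python with the standard `socket` module. This is a minimal sketch; the function name `tcp_port_open` and the default timeout are assumptions, and like `nc -vz` it only tests whether a TCP connection can be established.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example call, with targetServer and portNum filled in by the reader:
# tcp_port_open("targetServer", portNum)
```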