Software engineering

  • "Push on Green"
    • All changes must be accompanied by tests that ensure the correct execution of the code both under expected and unexpected conditions.
    • Rollback: runtime flags specified on the command line => Feature flag
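
A minimal sketch of the rollback idea above, assuming a hypothetical `--enable-new-checkout` flag and two made-up code paths. The point is only that the new behavior is selected at start-up from the command line, so rolling back means restarting without the flag rather than shipping a new build.

```python
import argparse


def parse_flags(argv):
    """Parse runtime feature flags from the command line (flag name is hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--enable-new-checkout",
        action="store_true",
        help="serve the new code path; omit the flag to roll back",
    )
    return parser.parse_args(argv)


def legacy_checkout(cart):
    # Known-good path that stays in the binary as the rollback target.
    return {"total": sum(cart), "path": "legacy"}


def new_checkout(cart):
    # New path, only reachable when the flag is set.
    return {"total": sum(cart), "path": "new"}


def checkout(cart, flags):
    # The flag, not the deployed binary, decides which path runs.
    if flags.enable_new_checkout:
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Calling `checkout([1, 2], parse_flags([]))` takes the legacy path; passing `["--enable-new-checkout"]` switches to the new one.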


The release process originates from the need to separate development from production. It works like a pipeline: only changes that are important and have finished review are cherry-picked onto the version-controlled stable branch and then released.

Packaging is a sub-problem. We should prefer source distribution for script code, both for ease of setup and for ease of debugging. Packaging with dpkg is a good option since it provides many features out of the box. The package itself should also be checksummed with sha1sum.
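
The checksum step can be sketched as follows. This computes the same digest that `sha1sum` prints, using Python's standard `hashlib`; the function names and the idea of comparing against a published digest are assumptions for illustration, not part of any existing tooling here.

```python
import hashlib


def file_sha1(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_package(path, expected_sha1):
    """Check a package file against the digest published alongside it."""
    return file_sha1(path) == expected_sha1.strip().lower()
```

A SHA-1 digest is enough to catch accidental corruption; if deliberate tampering is a concern, a stronger hash such as SHA-256 (`hashlib.sha256`) would be the safer choice.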

Chaos Engineering Chaos Practice in TiDB


Our Gerrit system has a large pool of pending change sets. That is not bad in itself, but some features have been proposed and never merged. So what is the conflict here? How do conservatism and velocity interact?


Check out the “Recently closed” list: since we are pushing actively, the velocity looks high. However, high velocity leads to production bugs. More bugs distract us from producing high-quality code, since almost all core engineers are on call.

What is a healthy velocity?


In this type, every change stays blocked until it is itself blocking other changes. Only user-visible changes are not blocked by default. The problem, however, is how we understand “user-visible”. We always talk about this concept in terms of short-term profit: an imminent feature or event, a bug fix, etc. We usually exclude long-term process optimization from it, but this is apparently wrong. Consider that Google develops many internal tools, like Borg, which are not really user-visible, especially in their formative stage. However, Borg made Google’s infrastructure better and led to long-term customer satisfaction.

On the other hand, can’t our smart engineers see this point about long-term profit? What are they afraid of? What are they losing sight of?


CI Value

  • Reduce risks
  • Reduce repetitive manual processes
  • Generate deployable software at any time and any place
  • Enable better project visibility
  • Establish greater confidence in the software product from the development team
  • Reduce duplicate code
  • Assess code coverage

Continuous Deployment

  • Run all tests
  • Possess capability to roll back release
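
The two requirements above can be sketched together: install the candidate, run the tests, and reinstall the previous version if they fail. The `install` and `run_all_tests` callables are placeholders for whatever your deployment tooling provides; nothing here refers to a real deployment API.

```python
def release(candidate, previous, install, run_all_tests):
    """Deploy `candidate`; fall back to `previous` if the tests do not pass.

    `install(version)` and `run_all_tests()` are hypothetical callables
    supplied by the caller. Returns the version left running.
    """
    install(candidate)
    try:
        healthy = run_all_tests()
    except Exception:
        # A crashing test run counts as a failed release.
        healthy = False
    if not healthy:
        install(previous)  # roll back to the last known-good version
        return previous
    return candidate
```

For example, `release("v2", "v1", install, lambda: False)` calls `install("v2")`, then `install("v1")`, and returns `"v1"`.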

In fact, most software developed by large teams spends a significant proportion of its development time in an unusable state.

The goal of continuous integration is that the software is in a working state all the time.

  • Automated tests free testers to concentrate on exploratory testing and higher-value activities instead of boring repetitive tasks.

Michael Feathers, in his book Working Effectively with Legacy Code, provocatively defined legacy systems as systems that do not have automated tests.

Selenium Release it!

The shortest feedback loops are created through sets of automated tests that are run upon every change to the system.


The aim of the deployment pipeline is threefold. First, it makes every part of the process of building, deploying, testing, and releasing software visible to everybody involved, aiding collaboration. Second, it improves feedback so that problems are identified, and so resolved, as early in the process as possible. Finally, it enables teams to deploy and release any version of their software to any environment at will through a fully automated process.

  1. Over time, deployments should tend towards being fully automated.

Antipattern: Deploying to a Production-like Environment Only after Development Is Complete

“Releasing into staging is the first time that operations people interact with the new release.” CD is about testing your deployment: whether your product is ready.

  1. Antipattern: Manual Configuration Management of Production Environments
  • Frequent. If releases are frequent, the delta between releases will be small. This significantly reduces the risk associated with releasing and makes it much easier to roll back. Frequent releases also lead to faster feedback; indeed, they require it. Much of this book concentrates on getting feedback on changes to your application and its associated configuration (including its environment, deployment process, and data) as quickly as possible.

A working software application can be usefully decomposed into four components: executable code, configuration, host environment, and data. If any of them changes, it can lead to a change in the behavior of the application. Therefore we need to keep all four of these components under control and ensure that a change in any one of them is verified.
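
One way to make "a change in any one of them is verified" concrete is to fingerprint all four components together and treat any new fingerprint as unverified. This is a sketch under assumptions: the function names and the inputs (version strings and plain dictionaries) are invented for the example.

```python
import hashlib
import json


def release_fingerprint(code_version, config, environment, data_version):
    """Hash the four components that determine application behavior."""
    payload = json.dumps(
        {
            "code": code_version,        # e.g. a commit id
            "config": config,            # application configuration
            "environment": environment,  # host environment description
            "data": data_version,        # e.g. a schema or dataset version
        },
        sort_keys=True,  # stable key order so equal inputs hash equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_verification(fingerprint, verified_fingerprints):
    """True if this exact combination has not been through the pipeline yet."""
    return fingerprint not in verified_fingerprints
```

Changing only a configuration value, with code, environment and data untouched, yields a different fingerprint and therefore triggers verification again.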

  1. We want to free people to do the interesting work and leave repetition to machines. One of the key principles of the deployment pipeline is that it is a pull system—it allows testers, operations or support personnel to self-service the version of the application they want into the environment of their choice.

Only in exceptional circumstances should you use shared environments for development. Enabling developers to run a smoke test against a working system on a developer machine prior to each check-in can make a huge difference to the quality of your application. In fact, one sign of a good application architecture is that it allows the application to be run without much trouble on a development machine.

Rigorous build discipline was essential, to the extent that we had a dedicated build master who not only maintained the build but also sometimes policed it, ensuring that whoever broke the build was working to fix it. If not, the build engineer would revert their check-in. We have both worked in this role. However, we consider it a failure if we get to the point where only those specialists can maintain the CI system. The expertise of specialists is not to be undervalued, but their goal should be to establish good structures, patterns, and use of technology, and to transfer their knowledge to the delivery team. Once these ground rules are established, their specialist expertise should only be needed for significant structural shifts, not regular day-to-day build maintenance.

If you are from a test-driven design background, you are perhaps wondering why these aren’t the same as our unit tests. The difference is that acceptance tests are business-facing, not developer-facing. They test whole stories at a time against a running version of the application in a production-like environment.

To paraphrase, performance is a measure of the time taken to process a single transaction, and can be measured either in isolation or under load. Throughput is the number of transactions a system can process in a given timespan. It is always limited by some bottleneck in the system. The maximum throughput a system can sustain, for a given workload, while maintaining an acceptable response time for each individual request, is its capacity.
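
The definitions of throughput and capacity can be turned into a small calculation. The load-test figures in the usage note are made up for illustration, and the helper names are not from any library.

```python
def throughput(transactions, seconds):
    """Transactions processed per second over a given timespan."""
    return transactions / seconds


def capacity(measurements, max_response_time):
    """Highest sustained throughput whose response time is still acceptable.

    `measurements` is an iterable of (throughput_tps, response_time_s) pairs,
    one per load level tried in a load test.
    """
    acceptable = [tps for tps, latency in measurements if latency <= max_response_time]
    # No acceptable load level means the capacity is effectively zero.
    return max(acceptable, default=0)
```

With measurements `[(100, 0.05), (200, 0.08), (300, 0.40)]` and an acceptable response time of 0.1 s, the capacity is 200 transactions per second: the system can push 300, but only by violating the response-time limit.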

Cleaner code

In priority order, simple code:

  • Runs all the tests;
  • Contains no duplication;
  • Expresses all the design ideas that are in the system;
  • Minimizes the number of entities such as classes, methods, functions, and the like.
  1. As we’ll see later on, even if the container is a List, it’s probably better not to encode the container type into the name.

It is not sufficient to add number series or noise words, even though the compiler is satisfied. If names must be different, then they should also mean something different.

My personal preference is that single-letter names can ONLY be used as local variables inside short methods. The length of a name should correspond to the size of its scope.

Class Names: Classes and objects should have noun or noun phrase names like Customer, WikiPage, Account, and AddressParser. Avoid words like Manager, Processor, Data, or Info in the name of a class. A class name should not be a verb.

Pick One Word per Concept

Functions should be small
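
A short sketch of how the naming rules above might look in practice. The `Account` domain and every function name are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Account:
    # A noun, rather than "AccountManager", "AccountData" or "AccountInfo".
    owner: str
    balance_cents: int


def total_balance_cents(accounts):
    # "accounts", not "account_list": the name survives a change of container type.
    # The single-letter name "a" is acceptable because its scope is one expression.
    return sum(a.balance_cents for a in accounts)


def find_overdrawn_accounts(accounts):
    # A name that says what is different, instead of "accounts2".
    return [a for a in accounts if a.balance_cents < 0]


def find_accounts_by_owner(accounts, owner):
    # One word per concept: "find" again, not "fetch" or "retrieve" for the same idea.
    return [a for a in accounts if a.owner == owner]
```

Each function also stays small: it does one thing and fits in a couple of lines.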

Move fast & break things?

  1. It is super hard to reinstate unit/integration test requirements once a component has gotten used to being without them
  2. Most things that require a lot of human labor can be automated, if this is designed in as early as possible
    1. Storyline: can we automate user interactions with our whole stack?
    2. Tool: can we use a reasonable programming language? How do functional and non-functional changes evolve together over time?
    3. Support: a good manager should understand the reality and support the team in achieving the common goal
      1. Reality: some people have severe illusions about reality. Imagining that something might or might not help doesn’t mean that it is so; you have to test it scientifically.
      2. Goal: the goal of anyone on the team should not be to satisfy the boss or just to earn some money. The goal should be grand and the vision should be clearly communicated.
      3. What does support mean? Spot the chances to let your developers finish their part of our common endeavor in an efficient and faithful way. Don’t nurture a blame culture. Empower your devs to make their own conscious choices.


  1. review first: code review takes priority over development
  2. small diff: keep each diff as small as possible
  3. testing: take testing seriously

[1] [2]

System Prevalence in production

Our code follows the system prevalence idiom. All of our state resides in memory, and all write requests are asynchronously journaled to disk before the response is sent back to the client. So we actually have only two threads: one crafts the response to each request, and one journals the write requests and sends back the responses. The two threads communicate through a locked queue. Note that we respond to the client in the exact order in which requests come in, whether the request is a read or a write. We haven't implemented snapshots yet, since they are mostly a bonus and our replay speed is still satisfactory for the current size of the request log. This greatly simplifies our code structure.

In detail, we chose protobuf as our message format; it has a rich set of types and a very efficient wire encoding. For replaying requests journaled on disk, we found that using mmap is a great optimization over plain read.

The system state is reproducible because its inner core is de facto a state machine, receiving requests/events from outside and responding according to a set of well-defined transition rules. No other state or I/O is involved except for the RAM. This makes the system very easy to formalize and reason about. But we still need a thin outer layer in front of the incoming requests, which decorates them with timestamps etc. if necessary.

TODO


  • What is the distribution of the different requests? Also considering their size? Their R/W type?
  • What is a proper model of the underlying operating system? What is the latency of each stage? How fast can replaying be?

Other interesting bits

  • Rewriting and garbage collection: we usually assume that the amount of information that might be useful to users at any moment is bounded. For example, orders placed a year ago are probably no longer useful to users, so it is no longer necessary to keep them in RAM. We can therefore reduce the memory needed by rewriting the request log to clean up the garbage. This also bounds the time needed for replaying. (Currently, replaying is a common action in Ops: every time you upgrade the system, restart the node, kill the process, etc., you need to replay from scratch, since all content in memory is lost. This line of thinking will no longer be needed once we implement snapshots.)
  • Very high reproducibility: the state machine is very easy to debug, since it is single-threaded and transactionally journaled. Any state, once externally visible, can be precisely recovered by going back to the request log and replaying it until a certain event happens. Besides debugging, high reproducibility also enables regression testing.
  • API compatibility issues: we use protobuf files to define our API's syntax, and regression tests to define our API's semantics (incomplete). We use a three-level versioning scheme (MAJOR.MINOR.PATCH) for such a system. We try to make sure that:
    • for a PATCH level update, the API syntax is the same, and the semantics might improve in a small way;
    • for a MINOR level update, the API syntax is only appended to, and the semantics are backwards-compatible (old regression tests pass without change, but you may update them to introduce new input and output fields). Put another way, a MINOR update is for introducing new features, and only for that purpose;
    • for a MAJOR level update, the API syntax might change in any direction (deleting any field is already enough to break all client code). The regression tests are bound to fail, since the input might have unknown fields; overall, though, the regression tests are simply updated to reflect a different commitment, not necessarily improving any old thing or adding new features.

Some random notes

  1. reduce complexity.
  2. ensure that the core of the system is correct
    • Architectural considerations: the inner core is a state-transaction machine, and the outer core adds more features layer by layer. The outer layer is less stateful, so faster iteration and hot patching are possible
    • Develop from the simplest possible description: very functional, local invariant
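
The system prevalence design described above (in-memory state, journaled writes, replay on restart) can be sketched in a few lines. This is a deliberately simplified illustration, not the production code: it uses JSON lines instead of protobuf, a single thread instead of two threads with a locked queue, plain file reads instead of mmap, and journals synchronously. The `Ledger` domain and all names are invented.

```python
import json

WRITE_OPS = {"deposit"}  # only requests that change state need journaling


class Ledger:
    """Inner core: a deterministic state machine with no I/O of its own."""

    def __init__(self):
        self.balances = {}

    def apply(self, request):
        account = request.get("account")
        if request["op"] == "deposit":
            self.balances[account] = self.balances.get(account, 0) + request["amount"]
            return {"ok": True, "balance": self.balances[account]}
        if request["op"] == "balance":
            return {"ok": True, "balance": self.balances.get(account, 0)}
        return {"ok": False, "error": "unknown op"}


class PrevalentSystem:
    """Outer layer: journals writes to disk and rebuilds state by replaying them."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.state = Ledger()
        self._replay()
        self._journal = open(journal_path, "a", encoding="utf-8")

    def _replay(self):
        # After a restart all RAM state is gone; re-applying the journal in
        # order reproduces it exactly because Ledger.apply is deterministic.
        try:
            with open(self.journal_path, encoding="utf-8") as f:
                for line in f:
                    self.state.apply(json.loads(line))
        except FileNotFoundError:
            pass  # first start: nothing to replay

    def handle(self, request):
        if request["op"] in WRITE_OPS:
            # Journal before applying, so an acknowledged write is always in the log.
            # flush() hands the line to the OS; it does not force it onto the disk.
            self._journal.write(json.dumps(request) + "\n")
            self._journal.flush()
        return self.state.apply(request)

    def close(self):
        self._journal.close()
```

Creating a `PrevalentSystem`, sending a few `deposit` requests, closing it and opening a new instance on the same journal path yields the same balances, which is the reproducibility property the notes rely on for debugging and regression tests.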