I have been one of the OpenTelemetry Collector maintainers for a while now, and one of the things I pursue in my work is making it robust. Robust in this case means resilient to external and internal failures, able to recover from problems and operate continuously without human intervention, and easy to diagnose and understand when such intervention is necessary.
Software that runs continuously without any direct interaction with humans (as opposed to software that has a UI or server software which serves direct user requests) is a special case. This category of software can be considered a type of Batch Application, but I prefer to use the term “Autonomous Software” instead. Achieving robustness in autonomous software requires approaches and methods that may be somewhat different from generic robustness recommendations.
Below I am going to talk about a few techniques to increase robustness in particular scenarios. This post is a compilation of patterns and practices I have been continuously developing and converging on, personally or with my teams, over the last few years.
Let’s consider a typical situation with interactive software. The software receives input from the user, performs some processing and returns the output to the user. If an error occurs during processing, it is typically reported to the user, who may also be given a choice to retry the operation. So there is normally a User Request->Process->Response->User Retry sequence that includes a step where the user decides whether to retry.
With autonomous software there is no user to whom we can report the error directly and offer the choice to retry or not. The software normally needs to make this decision and act on it itself. Let’s see how the software can handle this.
The simplest approach is to always immediately retry the failed operation and keep doing so until the operation succeeds. It is easy to see that in some cases such automatic and immediate retrying can be a bad idea. The operation may involve costly internal processing or require a request to an external system. Continuous and immediate retrying may consume significant local resources indefinitely or bombard an external system with requests.
A somewhat better approach is to space the retries apart (wait between retries) or place a limit on the number of times a particular operation can be retried. A more sophisticated variation of this approach is to perform retries with exponential backoff. The nice property of exponential backoff is that if the operation can succeed by retrying quickly it will do so, and if not, the rate of retries drops so as not to overwhelm the system (whether external or internal).
For my Go programs I normally use the cenkalti/backoff library, which makes it very easy to implement the backoff logic and customize it if needed. In the simplest case, exponential backoff for whatever operation we are performing takes just a few lines of code, and it can be further customized if needed (check the docs). Chances are that whatever other language you use also has a similar library, and if not, it is not difficult to implement one and reuse it where needed.
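To give an idea of what a hand-rolled version could look like, here is a minimal sketch that uses only the Go standard library. It is not the cenkalti/backoff API; the names `backoffDelay` and `retry`, as well as the base delay, cap and attempt limit, are assumptions made for the illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// backoffDelay returns the wait before the next retry after the given
// 0-based attempt: base, 2*base, 4*base, ... capped at max.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

// retry calls op until it succeeds or maxAttempts is reached, waiting
// between attempts. sleep is a parameter so the waiting can be replaced.
func retry(op func() error, maxAttempts int, base, max time.Duration, sleep func(time.Duration)) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		if attempt < maxAttempts-1 {
			sleep(backoffDelay(attempt, base, max))
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Hypothetical operation that fails twice and then succeeds.
	op := func() error {
		calls++
		if calls < 3 {
			return errors.New("temporary failure")
		}
		return nil
	}
	err := retry(op, 5, 10*time.Millisecond, time.Second, time.Sleep)
	fmt.Println("calls:", calls, "err:", err)
}
```

The cap on the delay and the limit on attempts are both design choices: without them the wait can grow without bound, or the program can retry forever.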
It is important to keep in mind that there are cases where no retries are needed at all. It may be that the failure is fatal and no amount of retries is going to help. Be aware of these situations and avoid automatic retries for these cases.
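One way to make this distinction explicit is to mark errors that retrying cannot fix, so that the retry loop can stop early. The following is a sketch under that assumption; `permanentError`, `permanent` and `shouldRetry` are hypothetical names, and the two example errors are made up for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// permanentError marks a failure that no amount of retrying will fix.
type permanentError struct{ err error }

func (e *permanentError) Error() string { return e.err.Error() }
func (e *permanentError) Unwrap() error { return e.err }

// permanent wraps err so that shouldRetry reports false for it.
func permanent(err error) error { return &permanentError{err: err} }

// shouldRetry reports whether a failed operation is worth retrying.
func shouldRetry(err error) bool {
	if err == nil {
		return false
	}
	var p *permanentError
	return !errors.As(err, &p)
}

// Hypothetical example errors.
var (
	errBadCredentials = permanent(errors.New("invalid credentials"))
	errConnReset      = errors.New("connection reset")
)

func main() {
	fmt.Println(shouldRetry(errConnReset))      // transient: retry
	fmt.Println(shouldRetry(errBadCredentials)) // permanent: do not retry
	// The marker survives wrapping with %w.
	fmt.Println(shouldRetry(fmt.Errorf("send batch: %w", errBadCredentials)))
}
```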
When using any sort of automatic retries you need to assess the effect such retries will have locally on your program and globally on external systems (if retrying involves any external systems). For example, a common mistake when designing connections in autonomous client/server systems is implementing automatic re-connection logic on the client side with deterministic waits between re-connection attempts. A naive approach that uses exponential backoff between re-connection attempts seems to be a good way to ensure that re-connecting clients wait between attempts and so avoid overwhelming the server. This naive approach fails to account for the case when the server goes down for a while and all clients begin the backoff re-connection strategy at the same time. Because the wait intervals are deterministic, all clients then perform their re-connection attempts at almost precisely the same moments. When the server eventually comes back up, the clients initiate their next re-connection attempt at virtually the same time, which overloads the server with connections and subsequent requests that hit it almost simultaneously.
The right approach in this case is to introduce jitter into the waiting period between retries. This ensures that clients that re-connect (or retry some other operation) do so at different moments, so the server’s load is spread over time.
Well-written exponential backoff libraries typically have built-in support for jitter. For the Go library that I use, randomized jitter is the default behavior, which can be customized if necessary.
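If you implement the waiting yourself, jitter can be added with a few lines. This is a minimal sketch and not how any particular library does it; the function name `withJitter` and the factor of 0.5 are assumptions for the illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// withJitter spreads delay uniformly over the range
// [delay*(1-factor), delay*(1+factor)). factor is expected to be in [0, 1].
// rnd must return a number in [0, 1), for example rand.Float64.
func withJitter(delay time.Duration, factor float64, rnd func() float64) time.Duration {
	if factor <= 0 {
		return delay
	}
	delta := factor * float64(delay)
	low := float64(delay) - delta
	return time.Duration(low + rnd()*2*delta)
}

func main() {
	// Five clients that would otherwise all wait exactly one second
	// now wait somewhere between 0.5s and 1.5s.
	for client := 1; client <= 5; client++ {
		fmt.Println("client", client, "waits", withJitter(time.Second, 0.5, rand.Float64))
	}
}
```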
Using autonomous retries on errors during application startup may not be very desirable. During startup it is better to verify the initial inputs (such as the configuration) and fail fast if the input is invalid. This will bring the problem to the attention of a human, since humans are more likely to notice problems when the process is just starting than problems that arise some time (potentially a long time) after process startup. Monitoring systems are likely to automatically flag processes that exit with failure during startup, making it easier to notice the problem. It is best to log/show a reasonable error message to explain the problem and exit. It may even be acceptable to crash the process during startup if there is no good way to exit cleanly, but do your best to log the error and exit with a non-zero process exit code.
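As a sketch of this fail-fast startup, the program below validates its configuration before doing anything else and exits with a non-zero code if validation fails. The `config` type, its fields and the endpoint value are hypothetical; a real program would load them from a file or flags.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// config is a hypothetical startup configuration.
type config struct {
	Endpoint  string
	QueueSize int
}

// validate reports the first problem found in the configuration.
func (c config) validate() error {
	if c.Endpoint == "" {
		return errors.New("endpoint must not be empty")
	}
	if c.QueueSize <= 0 {
		return fmt.Errorf("queue size must be positive, got %d", c.QueueSize)
	}
	return nil
}

func main() {
	// Hypothetical values, hard-coded only to keep the example short.
	cfg := config{Endpoint: "collector.example:4317", QueueSize: 1000}

	if err := cfg.validate(); err != nil {
		// Explain the problem and exit with a non-zero code so that
		// humans and monitoring systems notice the failed startup.
		fmt.Fprintln(os.Stderr, "invalid configuration:", err)
		os.Exit(1)
	}
	fmt.Println("configuration OK, starting")
}
```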
One of the ways applications handle failures is to crash and let themselves be restarted. This is a simple (and often the best) behavior for certain situations such as out-of-memory conditions, when attempting to do anything else is likely to result in further failures.
However, the crash-only approach should be used carefully for applications that receive their inputs from the output of other autonomous software. It is rare that crashing is the right approach for this type of application. As an example of how this can go bad, think of the following case. Let’s assume that we have two applications: one produces some data and the other receives the data and processes it. Now imagine that the producer application uses automatic retries and the receiver application uses a crash as its recovery mechanism when a fatal problem is encountered during request processing. What will happen if the receiver application crashes during request processing? Most likely, after the restart the application is going to receive the same request again because the producer application will retry it. This is an easy way to end up with an infinite loop of crashing and restarting.
The important takeaway from this example is that we should avoid crashing our program on bad input data. Extra care must be taken to process invalid input data without crashing and to respond to such input in an expected manner (e.g. if our program serves HTTP requests then we likely need to return an HTTP 400 Bad Request response to the requestor).
Generally, employing a crash as recovery from fatal errors should be considered carefully in each individual case. If there is a good chance that after the restart our program will end up in the same state as before the crash, then crashing is likely not a good idea: it will just result in a repeating series of crashes. This is especially important for programs that persist their processing state (e.g. use a persistent input queue for pending data to be processed, where the queue survives restarts).
The crash-only approach can be thought of as a way to implement fail-fast software, which is often a recommended way to write robust software. However, as we can see, for autonomous software it conflicts with the robustness principle, which recommends that we “be liberal in what you accept from others”. Typically the robustness principle should prevail when autonomous software processes its input data. As always, carefully assess your particular situation before applying any generalized software engineering principles.
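For the HTTP case, a receiver that rejects bad input instead of crashing could be sketched as follows. The `record` type, its JSON fields, the `/records` path and the 1 MiB body limit are all assumptions made for the illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// record is a hypothetical data item sent by the producer.
type record struct {
	Name  string  `json:"name"`
	Value float64 `json:"value"`
}

var errBadInput = errors.New("bad input")

// decodeRecord parses and validates one record. Invalid data is reported
// as an error wrapping errBadInput; it never panics.
func decodeRecord(data []byte) (record, error) {
	var r record
	if err := json.Unmarshal(data, &r); err != nil {
		return record{}, fmt.Errorf("%w: %v", errBadInput, err)
	}
	if r.Name == "" {
		return record{}, fmt.Errorf("%w: name is required", errBadInput)
	}
	return r, nil
}

// handle answers invalid requests with 400 so that the producer can tell
// that retrying the same data is pointless.
func handle(w http.ResponseWriter, req *http.Request) {
	body, err := io.ReadAll(http.MaxBytesReader(w, req.Body, 1<<20))
	if err != nil {
		http.Error(w, "cannot read request body", http.StatusBadRequest)
		return
	}
	if _, err := decodeRecord(body); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	// Exercise the handler in-process with one valid and one truncated body.
	for _, body := range []string{`{"name":"cpu","value":0.5}`, `{"name":`} {
		rec := httptest.NewRecorder()
		handle(rec, httptest.NewRequest(http.MethodPost, "/records", strings.NewReader(body)))
		fmt.Println("status:", rec.Code)
	}
}
```

This only helps if the producer treats a 400 response as a non-retryable failure; otherwise the retry loop simply moves from crashes to rejected requests.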
It is useful to log details of your application startup and shutdown, including successful outcomes (but don’t overdo it, keep the number of success messages to a readable minimum). Success messages help to understand the context of failures that occur elsewhere, after a certain piece of code has executed successfully and logged a success message.
However, use logging carefully for events that can happen frequently, to avoid flooding the logs. Avoid logging every received or processed data item, since this can amount to a very large number of log entries if your application is designed to process data items at a high rate. For such high-frequency events, instead of logging, consider adding an internal metric and incrementing it when the event happens. Metrics can be exposed and made available using monitoring tools. If you have to log high-frequency events, use the “debug” level, which can be suppressed during normal operation. Be aware of the costs of logging imposed by your logging library and log collection system.
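A small sketch of counting instead of logging, using the standard `expvar` package as one possible way to hold the metric. The metric name `items_processed`, the `debugEnabled` switch and `processItem` are hypothetical.

```go
package main

import (
	"expvar"
	"log"
)

// itemsProcessed counts processed items. Variables registered with expvar
// are served as JSON at /debug/vars when the program runs an HTTP server
// on the default mux.
var itemsProcessed = expvar.NewInt("items_processed")

// debugEnabled stands in for the "debug" level of a real logging library.
var debugEnabled = false

// processItem handles one data item. The per-item event is counted, and
// logged only when debug output is enabled.
func processItem(item string) {
	// ... real processing would happen here ...
	itemsProcessed.Add(1)
	if debugEnabled {
		log.Printf("debug: processed item %q", item)
	}
}

func main() {
	for _, item := range []string{"a", "b", "c"} {
		processItem(item)
	}
	// One summary line instead of one line per item.
	log.Printf("processed %d items", itemsProcessed.Value())
}
```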
Make log messages human readable and include data that is needed for easier understanding of what happened and in what context.
Another important factor for autonomous software is the usage of CPU, RAM and other local resources.
Limit the resources that the code uses. Do not write code that consumes resources in an uncontrolled manner. For example, if you have a queue that can contain unprocessed messages, always limit the size of the queue unless you have other ways to guarantee that the queue will be consumed faster than items are added to it.
Performance test the code both for normal use-cases under acceptable load and for abnormal use-cases when the load exceeds the acceptable level many times over. Ensure that your code performs predictably under abnormal use. For example, if the code needs to process received data and cannot keep up with the receiving rate, it is not acceptable to keep allocating more memory for received data until the application runs out of memory. Instead have protections for these situations, e.g. when hitting resource limits drop the data and record the fact that it was dropped in a metric that is exposed to users.
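A bounded queue that drops data when full and counts the drops can be sketched with a buffered channel. The `boundedQueue` type and its methods are hypothetical names, and `atomic.Int64` assumes Go 1.19 or newer.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// boundedQueue holds at most a fixed number of pending items.
type boundedQueue struct {
	items   chan string
	dropped atomic.Int64 // number of items rejected because the queue was full
}

func newBoundedQueue(capacity int) *boundedQueue {
	return &boundedQueue{items: make(chan string, capacity)}
}

// offer enqueues item without blocking. If the queue is full the item is
// dropped, the drop is counted, and false is returned.
func (q *boundedQueue) offer(item string) bool {
	select {
	case q.items <- item:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

// droppedCount returns the value to expose as a metric.
func (q *boundedQueue) droppedCount() int64 { return q.dropped.Load() }

func main() {
	q := newBoundedQueue(2)
	for _, item := range []string{"a", "b", "c", "d"} {
		q.offer(item)
	}
	fmt.Println("queued:", len(q.items), "dropped:", q.droppedCount())
}
```

Memory use stays bounded by the queue capacity no matter how fast data arrives, and the drop counter makes the overload visible instead of silent.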
Your application will often exhibit different performance depending on the enforced resource limits. Where it matters, you may want to give the end user the ability to configure the limits so that they can fine-tune your application to their use case and to the available resources.
If your application is expected to work in environments that automatically restart processes (e.g. during deployment of new versions), make sure it is able to shut down gracefully. This typically means reacting to SIGTERM, finishing processing, draining any data that has accumulated in memory, and then exiting the process. This will ensure graceful operation in environments like Kubernetes and will help to avoid data loss.
Make sure to limit the duration of your shutdown operation so that the process does not get stuck and end up forcibly killed by the controlling environment.
These recommendations are by no means exhaustive. A lot has been written elsewhere about how to create robust software. Most of that generic advice is also applicable to autonomous software. The goal of this post was to talk about the situations where the generic practices may not be applicable. I hope you find the post useful.
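Both points can be sketched together: wait for SIGTERM, then drain what is in memory under a deadline. The `drainWithTimeout` helper, the in-memory channel and the 5 second limit are assumptions for the illustration; the limit should be shorter than the grace period your environment allows before it kills the process.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// drainWithTimeout passes every item still buffered in pending to flush.
// It returns nil once the buffer is empty, or the context error if the
// timeout expires first. flush itself should not block for long.
func drainWithTimeout(pending <-chan string, timeout time.Duration, flush func(string)) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	for {
		if err := ctx.Err(); err != nil {
			return err
		}
		select {
		case item := <-pending:
			flush(item)
		default:
			return nil // nothing left in memory
		}
	}
}

func main() {
	// ctx is cancelled when the process receives SIGTERM or Ctrl-C.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
	defer stop()

	// Hypothetical in-memory data that has not been sent yet.
	pending := make(chan string, 16)
	pending <- "item-1"
	pending <- "item-2"

	select {
	case <-ctx.Done(): // shutdown requested
	case <-time.After(200 * time.Millisecond): // demo only, so the example terminates
	}

	err := drainWithTimeout(pending, 5*time.Second, func(item string) {
		fmt.Println("flushed", item)
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "shutdown incomplete:", err)
		os.Exit(1)
	}
}
```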