In his talk at OSMC 2023, Lucas Copi, Kubernetes Expert at IBM Cloud, tells us about his team’s journey to observability in their modern cloud environment based on Red Hat OpenShift.
First of all, let’s look at the differences between observability and monitoring.
- Monitoring means tracking what happens on your infrastructure. It helps you detect issues as they occur and take action to counter them.
- Observability, on the other hand, involves collecting data. Analyzing that data gives you insight into the system’s overall state.
Because Lucas and his team at IBM Cloud faced issues with their old infrastructure, a big monolith, they decided to split it into many smaller parts, which you could call microservices. They added a large number of tests, including about 50k regression cases, and refactored many parts of their infrastructure’s code to make it easier to unit test. All of that taught them one lesson: testing in pre-production environments is not always enough.
Not testing in prod is like not practicing with the full orchestra because your solo sounded fine at home.
Usually, even the best pre-prod environment is much smaller than the actual prod environment and therefore not suitable for certain tests. That said, testing in production does not mean testing only in production.
Another lesson they learned: it’s not always possible to fix issues in your environment, because you often don’t have enough metrics and logs. There are four golden pillars for every operation: latency, throughput, errors and saturation. Some existing solutions are great at adding observability to the interactions between services, including Grafana, OpenTelemetry, Istio and Honeycomb. But none of them could satisfy all the needs of Lucas’s team. As a solution, they built a custom tool in Go, called “The Observability context”. Basically, it provides consistency throughout execution flows and across the observability pillars. They use the new tool to measure code performance.
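To make the four pillars a bit more concrete, here is a minimal Go sketch of how latency, throughput, errors and saturation could be recorded per operation, with a flow ID carried through a `context.Context`. This is not the tool from the talk: `Recorder`, `OpStats`, `Track`, `WithFlowID` and the `provision` operation are all hypothetical names made up for illustration, and only the standard library is used.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// OpStats holds the four signals for one named operation.
type OpStats struct {
	Calls        int           // throughput: number of completed calls
	Errors       int           // errors: calls that returned a non-nil error
	TotalLatency time.Duration // latency: summed wall-clock time of all calls
	MaxInFlight  int           // saturation: peak number of concurrent calls
}

// Recorder collects OpStats per operation name (hypothetical type).
type Recorder struct {
	mu       sync.Mutex
	stats    map[string]*OpStats
	inFlight map[string]int
}

func NewRecorder() *Recorder {
	return &Recorder{stats: map[string]*OpStats{}, inFlight: map[string]int{}}
}

type flowKey struct{}

// WithFlowID tags a context so every step of one execution flow shares an ID.
func WithFlowID(ctx context.Context, id string) context.Context {
	return context.WithValue(ctx, flowKey{}, id)
}

// FlowID returns the ID stored by WithFlowID, or "" if none is set.
func FlowID(ctx context.Context) string {
	id, _ := ctx.Value(flowKey{}).(string)
	return id
}

// Track runs fn and records latency, throughput, errors and saturation for op.
func (r *Recorder) Track(ctx context.Context, op string, fn func(context.Context) error) error {
	r.mu.Lock()
	s, ok := r.stats[op]
	if !ok {
		s = &OpStats{}
		r.stats[op] = s
	}
	r.inFlight[op]++
	if r.inFlight[op] > s.MaxInFlight {
		s.MaxInFlight = r.inFlight[op]
	}
	r.mu.Unlock()

	start := time.Now()
	err := fn(ctx)
	elapsed := time.Since(start)

	r.mu.Lock()
	defer r.mu.Unlock()
	r.inFlight[op]--
	s.Calls++
	s.TotalLatency += elapsed
	if err != nil {
		s.Errors++
	}
	return err
}

// Snapshot returns a copy of the stats for op (zero value if op is unknown).
func (r *Recorder) Snapshot(op string) OpStats {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.stats[op]; ok {
		return *s
	}
	return OpStats{}
}

// runDemo tracks one successful and one failing call of a made-up operation.
func runDemo() *Recorder {
	rec := NewRecorder()
	ctx := WithFlowID(context.Background(), "flow-1")
	_ = rec.Track(ctx, "provision", func(ctx context.Context) error {
		fmt.Println("running provision in", FlowID(ctx))
		return nil
	})
	_ = rec.Track(ctx, "provision", func(context.Context) error {
		return errors.New("quota exceeded")
	})
	return rec
}

func main() {
	s := runDemo().Snapshot("provision")
	fmt.Printf("calls=%d errors=%d maxInFlight=%d\n", s.Calls, s.Errors, s.MaxInFlight)
}
```

Wrapping each operation in a single `Track` call means all four numbers come from the same place and share the same flow ID, which is one plausible reading of “consistency throughout execution flows”. A real setup would more likely export these values to a metrics backend than keep them in memory.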
Observability changed their mindset. Now it’s not only about features and “Does everything run?”, but more about “How well is it working?”. Introducing observability actually decreased the number of problems customers face. This shift not only overcomes testing limitations but also minimizes customer-facing issues. Observability emerges as a key catalyst for continuous improvement and reliability in modern cloud environments.