Ever wonder why your engineers don’t necessarily like being on call? There can be many reasons for this, and one of them could be a poorly configured monitoring system. In his talk at the Open Source Monitoring Conference (OSMC), Daniel Uhlmann shared the stages he and his team at T-Systems Multimedia Solutions went through to get from inadequate monitoring to a solution that provides real value not only for the customer but also for the team.
OSMC is always a great chance to get a bigger picture of the full scope of Open Source monitoring. Some speakers talked about topics I had never heard of, which was very interesting for me. Others talked about things I am already using or in touch with. Another new experience was talking to people from other companies. It’s interesting how many different opinions are out there and how much fun it can be to talk to someone who sees things differently. I am in the second year of my apprenticeship at Icinga, and this was my first OSMC and my first conference in general. I participated as the cameraman and really appreciated it. After the conference, I have the honor of summing up the talk by Daniel Uhlmann. Here is what I’ve learned:
Daniel Uhlmann has a passion for Linux and Open Source, and he talked about how his team improved their monitoring so that everyone likes being on call. The team at T-Systems Multimedia Solutions maintains several customer services and applications. Their monitoring is highly distributed across various services and environments, which means they have to switch context often and adapt very quickly.
At the start of his talk, Daniel Uhlmann shared some quotes from within their own company, from before they enhanced their monitoring:
“Why should I take the on-call duty? I thought someone else would do this for us.”
“If you haven’t debugged the live database system at 3:00 in the morning, you’re not a real developer.”
“I sacrificed so much sleep and lost my mental health being on-call. But this is okay because it is for my/our product.”
This was not acceptable to them, said Daniel, so they thought about what they could learn from those quotes. As a result, they identified a lot of toxic patterns around being on call. Among them were lack of sleep, the impact on personal lives, flappy alerts and missing training, all of which can feel disrespectful to the person who has to be on call. That prompted T-Systems Multimedia Solutions to change things.
How Did They Enhance Their Monitoring?
They analysed which notifications they received and which of those might not be necessary. They then set up a team meeting to figure out which checks are truly business critical. After that, they implemented two “hotlines” to separate 24/7 calls from business-hours calls, which resulted in fewer calls during the night. Another lesson was that you should delete every check that doesn’t provide meaningful information. Set the bar high for waking people up at 2 AM: not every check is business critical. Another good way to prevent people from hating on-call duty is detailed monitoring.
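To make the two-hotline idea a bit more concrete, here is a minimal sketch in Python of how such routing could look. It is only an illustration under my own assumptions: the names `Alert`, `route`, the hotline labels and the 9-to-17 weekday business hours are all hypothetical. This is neither how T-Systems Multimedia Solutions implemented it nor an Icinga API.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed business hours (hypothetical): Monday to Friday, 09:00 to 17:00.
BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)


@dataclass
class Alert:
    check: str               # name of the check that fired
    business_critical: bool  # decided by the team, not by the tool


def is_business_hours(now: datetime) -> bool:
    """Return True on weekdays between BUSINESS_START and BUSINESS_END."""
    return now.weekday() < 5 and BUSINESS_START <= now.time() < BUSINESS_END


def route(alert: Alert, now: datetime) -> str:
    """Pick a (hypothetical) destination for an alert.

    Only business-critical checks may wake someone up at night.
    Everything else waits for the business-hours hotline.
    """
    if alert.business_critical:
        return "hotline-24x7"
    if is_business_hours(now):
        return "hotline-business-hours"
    return "queue-until-business-hours"


if __name__ == "__main__":
    night = datetime(2023, 1, 10, 2, 0)  # a Tuesday at 2 AM
    print(route(Alert("shop-checkout", True), night))
    print(route(Alert("disk-usage-dev", False), night))
```

The important part is not the code but the flag: deciding which checks count as business critical is exactly the team discussion described above.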
What About a Real Outage?
Regarding real outages, Daniel talked about thoughts that everybody who has been on call knows and has probably had: Will I be able to fix the problem now? What happens if I don’t manage to fix it? What will my colleagues or the customers think then? He says that having those thoughts is not your fault. They are completely normal. If something was missed, it was missed as a team, and the team should work on solving it afterwards.
Different Approaches to On-Call
T-Systems Multimedia Solutions took two different approaches to solve their problem. One was a test designed by an experienced colleague, covering the usual monitoring issues, to help new colleagues get well prepared for most of the problems that can occur. The other approach was a mentor who is responsible for new colleagues and helps them during their first few weeks. I think Daniel Uhlmann gave some good ideas on how to make on-call duty more attractive and less demanding for the people who have to do it.
If you want to see the whole talk, you can check out our YouTube channel or visit the OSMC Archives, where all talks and slides are available. I can also recommend taking a look at the photos from the conference. I liked the talks and think each and every topic was interesting and offered a new perspective. I’m looking forward to the next OSMC in 2023!