Seite wählen

Graphite vs. IOPS – experience exchange

von Ronny Biering | Jan 17, 2018 | Linux, Technology, Monitoring & Observability, Graphite & Grafana

Last year, Blerim told us how to benchmark a Graphite and now we want to share some experience on how to handle the emerging IOPS with its pro and contra. It is important to know that Graphite can store its metrics in 3 different ways (CACHE_WRITE_STRATEGY). Each one of them influences another system resource like CPU, RAM or IO. Let us start with an overview about each option and its function.

max

The cache will keep almost all the metrics in memory and only writes the one with the most datapoints. Sounds nice and helps a lot to reduce the general and random io. But this option should never be used without the WHISPER_AUTOFLUSH flag, because most of your metrics are only available in memory. Otherwise you have a high risk of losing your metrics in cases of unclean shutdown or system breaks.
The disadvantage here is that you need a strong CPU, because the cache must sort all the metrics. The required CPU usage increases with the amount of cached metrics and it is important to keep an eye on enough free capacity. Otherwise it will slow down the processing time for new metrics and dramatically reduce the rendering performance of the graphite-web app.

naive

This is the counterpart to the max option and writes the metrics randomly to the disk. It can be used if you need to save CPU power or have fast storage like solid state disk, but be aware that it generates a large amount of IOPS !

sorted

With this option the cache will sort all metrics according to the number of datapoints and write the list to the disk. It works similar to max, but writes all metrics, so the cache will not get to big. This helps to keep the CPU usage low while getting the benefit of caching the metrics in RAM.
All the mentioned options can be controlled with the MAX_UPDATES_PER_SECOND, but each one will be affected in its own way.

Summary

At the end, we made use of the sorted option and spread the workload to multiple cache instances. With this we reduced the amount of different metrics each cache has to process (consistent-hashing relay) and find the best solution in the mix of MAX_UPDATES_PER_SECOND per instance to the related IOPS and CPU/RAM usage. It may sound really low, but currently we are running 15 updates per second for each instance and could increase the in-memory-cache with a low CPU impact. So we have enough resources for fast rendering within graphite-web.
I hope this post can help you in understanding how the cache works and generates the IOPS and system requirements.

0 Kommentare

Einen Kommentar abschicken Antworten abbrechen

Mehr Beiträge zum Thema Linux | Technology | Monitoring & Observability | Graphite & Grafana

Kritisch: Fehler in Elasticsearch mit JDK22 kann einen sofortigen Stop des Dienstes bewirken

von Daniel Neuberger | Apr 8, 2024

Update Seit gestern Abend steht das Release 8.13.2 mit dem BugFix zur Verfügung. Kritischer Fehler Der Elasticsearch Dienst kann ohne Vorankündigung stoppen. Diese liegt an einem Fehler mit JDK 22. In der Regel setzt man Elasticsearch mit der "Bundled" Version ein....

NETWAYS GitHub Update März 2024

von Markus Opolka | Apr 4, 2024

This entry is part 14 of 14 in the series NETWAYS GitHub Update

End of Life von CentOS Linux 7 – Was bedeutet das für mich?

von Dirk Götz | Mrz 25, 2024

Der ein oder andere Admin wird sich vermutlich schon lange den 30. Juni 2024 im Kalender vorgemerkt haben, denn dann ist für CentOS Linux 7 das "End of Life" erreicht. Aber auch Benutzer von Red Hat Enterprise Linux 7 sollten sich Gedanken machen, denn auch dieses...

NETWAYS GitHub Update Februar 2024

von Markus Opolka | Mrz 6, 2024

This entry is part 13 of 14 in the series NETWAYS GitHub Update

Trainings

Web Services

Events

Series