Back in about 2010, I was working on a highly custom supercomputer research project with about 2.2 million cores called Cyclops64. It was a joint project between the University of Delaware, IBM and the NSA. It was interesting in a lot of respects, but its most unusual feature was its networking. Each system was directly connected to 6 other systems by high-speed interconnects forming a cube, and only a few systems on one of the cube faces were connected to a “normal” ethernet network. The cube was 24x24x24 and each system had 160 cores (total: 2,211,840 cores).
As I thought about it, I realized that the way conventional monitoring systems like Nagios or Icinga work would be absolutely horrible in this environment. Imagine that the monitoring system was in one corner of the cube. Then each “Are You OK?” message bound for the opposite corner would have to pass through 24 systems up, then 24 systems to the right, then 24 systems forward. The return message would have to follow the reverse path, meaning that each round trip would touch 144 systems. Of course, the network in the vicinity of the central system would be totally destroyed by this traffic if it happened at any reasonable rate.
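To put rough numbers on that picture, here is a small sketch of the arithmetic. It assumes a plain 24x24x24 grid with no wraparound links, a monitor sitting at corner (0, 0, 0), and messages that travel one axis at a time; those routing details are assumptions for illustration, not a description of the real Cyclops64 interconnect. It counts link hops rather than systems along each edge, so the worst case comes out a little under the rounded 144 in the text.

```python
from itertools import product

SIDE = 24               # systems per cube edge (from the text)
CORES_PER_SYSTEM = 160  # cores per system (from the text)


def hops_from_corner(node):
    """Link hops from the corner (0, 0, 0) to node, moving one axis at a time."""
    return sum(node)


# Every (x, y, z) position in the cube.
nodes = list(product(range(SIDE), repeat=3))

total_systems = len(nodes)
total_cores = total_systems * CORES_PER_SYSTEM

one_way = [hops_from_corner(n) for n in nodes]

# Farthest system: the opposite corner.
worst_round_trip = 2 * max(one_way)

# One polling sweep: a request out and a reply back for every system.
sweep_link_traversals = 2 * sum(one_way)

print("systems:", total_systems)
print("cores:", total_cores)
print("worst-case round trip (hops):", worst_round_trip)
print("link traversals per sweep:", sweep_link_traversals)
```

Under these assumptions a single sweep of all 13,824 systems costs close to a million link traversals, and every one of those messages starts or ends on the three links attached to the monitor's corner.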
After thinking about it, I realized that it would be much better if each system monitored its six neighbors instead, delegating these “Are You OK?” queries to the machines being monitored. This totally changed the complexity of the task and the network load: no network hot spots, and the central system had nothing to do most of the time. An amazing difference! I also realized that more conventional systems have these same problems of network congestion and ever-growing centralized workload.
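The neighbor-monitoring idea can be sketched in a few lines. This is a toy simulation, not the Assimilation Project's actual protocol: it assumes wraparound links so that every system really does have exactly six neighbors, and the function names `neighbors` and `monitoring_round` are made up for the example.

```python
from itertools import product

SIDE = 24  # systems per cube edge (from the text)

# The six axis-aligned directions a system can reach in one hop.
DIRECTIONS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def neighbors(node, side=SIDE):
    """Six direct neighbors of node, assuming wraparound (torus) links."""
    x, y, z = node
    return [((x + dx) % side, (y + dy) % side, (z + dz) % side)
            for dx, dy, dz in DIRECTIONS]


def monitoring_round(down, side=SIDE):
    """Run one round in which every live system checks its six neighbors.

    Returns (reports, checks): reports maps each failed system to the set of
    neighbors that noticed it, and checks is the number of one-hop queries sent.
    """
    reports = {}
    checks = 0
    for node in product(range(side), repeat=3):
        if node in down:
            continue  # a dead system checks nobody
        for nb in neighbors(node, side):
            checks += 1
            if nb in down:
                reports.setdefault(nb, set()).add(node)
    return reports, checks


healthy_reports, healthy_checks = monitoring_round(set())
failed_reports, failed_checks = monitoring_round({(3, 4, 5)})
print("reports when healthy:", len(healthy_reports))
print("reporters for (3, 4, 5):", len(failed_reports[(3, 4, 5)]))
```

Every query in this sketch travels exactly one link, so no link carries more than a couple of messages per round regardless of cube size, and the central system only hears anything when a neighbor has a failure to report.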
So, I adapted this idea to “normal” systems, and the Assimilation Project was born, providing greater scalability than any of its predecessors while being incredibly simple. To keep the same topology-awareness that the original idea had, I realized that I needed to discover network connectivity. Once I’d done that, I realized that I could easily discover what services each system was running, which would eliminate manual configuration. Further, I could discover dependencies, which are essential to tracing problems to their root causes. Then I realized that this was all really a huge graph, so I adopted the Neo4j graph database. Lots of other valuable capabilities quickly became obvious results of having comprehensive and scalable discovery.
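To illustrate why discovered dependencies help with root causes, here is a minimal sketch using a plain Python dictionary as the graph. It does not use Neo4j or any Assimilation Project code; the service names and the helpers `transitive_deps` and `root_causes` are hypothetical, chosen only to show the idea that a failure is a likely root cause when nothing it depends on has also failed.

```python
def transitive_deps(depends_on, item):
    """Everything item depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(item, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def root_causes(depends_on, failed):
    """Failed items with no failed item anywhere beneath them in the graph."""
    failed = set(failed)
    return {item for item in failed
            if not (transitive_deps(depends_on, item) & failed)}


# Hypothetical discovered dependencies: each key depends on the listed items.
depends_on = {
    "web": ["app"],
    "app": ["db", "cache"],
    "db": ["disk"],
    "cache": [],
}

likely_roots = root_causes(depends_on, {"web", "app", "db"})
print(likely_roots)
```

Here three things are reported as broken, but only the database has no failed dependency underneath it, so that is where to look first. A graph database makes the same kind of traversal a query rather than hand-written code.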
About this time, I realized that discovery was the real value. Monitoring, security compliance, network management and many other incredibly valuable applications fall out naturally once you can easily know basically everything about everything and put it into a graph-based configuration management database (CMDB).
So, it goes without saying that I’m excited to return to the OSMC in Nürnberg and talk about all the exciting new things we’re doing in the Assimilation Project. After all, the last time I was there, I had a great time and my talk got some pretty cool tweets.
*The Cyclops64 has special monitoring hardware making this unnecessary.
The Author: Alan Robertson
Alan Robertson is an open source project leader and speaker on security, availability, discovery and monitoring. He founded Linux-HA (Pacemaker) and the Assimilation Project which maintains a scalable configuration management database driving monitoring and security.