Monitoring Theory

One ring to rule them all, one ring to find them, one ring to bring them all, and in the darkness bind them.

J.R.R. Tolkien, The Lord of the Rings

Note

The entire Monitoring section of this manual is intended exclusively for people who build their own CloudCIX instances using the open-source software. If you are using a public cloud, you do not need the information in this section of the document: your cloud provider will be monitoring and maintaining the infrastructure for you.

PAT is a Cloud to monitor other Clouds. Although PAT is referred to as a Pod Flavor, it is really a blend of COP and Region, configured in a particular way and running Projects that actually do the provisioning and telemetry.

Note on Nomenclature: PAT is an acronym standing for “Provisioning and Telemetry”. This acronym applies to two separate things.

  • PAT is a Web GUI App that implements key monitoring and provisioning functionality in CloudCIX.

  • The PAT, using the definite article, refers to the entire hardware infrastructure and associated software of the Pod implementing Provisioning and Telemetry.

CloudCIX monitoring is provided by the following seven tools.

  1. PAT is a GUI used to manage the PAT’s adoption of COPs and Regions.

  2. Icarus reports on Region utilisation. Its objective is to make sure that SLAs within Regions are maintained and that customer requirements remain resourced.

  3. InfluxDB and Grafana monitor Region hardware utilisation.

  4. Elastic Stack & Jaeger are used to trace software run times to identify bottlenecks in CloudCIX API calls in COPs.

  5. Sentry catches 500 errors (Internal Server Error) generated by CloudCIX code.

  6. LibreNMS monitors SNMP messages. Its prime task is to graph traffic data.

  7. Jumphost allows access to hosts within a Pod.

PAT

PAT is a Django-based application that:

  • Facilitates adoption of COPs and Regions.

  • Facilitates the adoption (or retirement) of hosts by Regions.

  • Allows engineers to support COPs, Regions and Hosts with Rocky tools.

Icarus

Icarus is a Go application that reports on resource utilisation in Regions. The Icarus project comprises four parts, each implemented as a Docker container:

  • Icarus resides in the PAT and exposes a web-based GUI.

  • Daedalus resides in the PAT and implements the API consumed by the GUI.

  • Majora resides in the PAT and fetches the required data from Pit.

  • Pit is an agent running in every COP that collects data from the CloudCIX API.

InfluxDB & Grafana

InfluxDB is a time-series database that acts as the repository for collected data.

Grafana is a graphical user interface for creating dashboards. Data is sent to InfluxDB from the following sources.

  • Each application that exposes an API

  • Robot

InfluxDB and Grafana are located in the PAT.

Elastic Stack & Jaeger

Elastic Stack comprises Elasticsearch, Logstash and Kibana. Elastic Stack is the log monitoring system of CloudCIX and is located in the PAT.

Jaeger performs distributed tracing and is used to identify latency bottlenecks in the CloudCIX API. The PAT runs Jaeger Collector as a Docker container, and each COP and Region runs Jaeger Agent as a Docker container.
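As an illustration of how trace data points to a bottleneck, the sketch below computes each span's "self time" (its duration minus the time spent in its direct children) and picks the largest. This is a generic analysis, not Jaeger's own code or API: the `Span` class, its field names and the sample trace are hypothetical, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None for the root span of the trace
    operation: str
    duration_us: int  # wall-clock duration in microseconds


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Map span_id to the time spent in that span, excluding direct children."""
    child_total: Dict[str, int] = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0) + span.duration_us
            )
    return {s.span_id: s.duration_us - child_total.get(s.span_id, 0) for s in spans}


def bottleneck(spans: List[Span]) -> Span:
    """Return the span with the largest self time."""
    own = self_times(spans)
    return max(spans, key=lambda s: own[s.span_id])


# Hypothetical trace of a single API call.
trace = [
    Span("a", None, "GET /example/", 100_000),
    Span("b", "a", "check permissions", 10_000),
    Span("c", "a", "database query", 70_000),
    Span("d", "c", "serialise rows", 20_000),
]
slowest = bottleneck(trace)
print(slowest.operation, self_times(trace)[slowest.span_id])  # database query 50000
```

The root span is the longest overall, but most of its time belongs to its children; ranking by self time is what singles out the database query.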

Sentry

Sentry is an open-source application monitoring system (https://open.sentry.io/) used to handle HTTP 500 (Internal Server Error) responses in CloudCIX applications. These are critical (and hopefully rare) failures of the CloudCIX software, and Sentry gathers the information necessary to debug and fix them.
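The kind of information involved can be sketched with the standard library alone. The example below is not Sentry's API: `with_error_capture`, `captured_reports` and `broken_view` are hypothetical names, and the in-memory list merely stands in for an error-tracking backend. It shows an unhandled exception being turned into a 500 response while the details needed for debugging are recorded.

```python
import traceback
from typing import Callable, Dict, List, Tuple

# Stand-in for an error-tracking backend (hypothetical; not Sentry itself).
captured_reports: List[Dict[str, str]] = []


def with_error_capture(handler: Callable[[str], str]) -> Callable[[str], Tuple[int, str]]:
    """Wrap a request handler so unhandled exceptions become recorded 500s."""

    def wrapped(path: str) -> Tuple[int, str]:
        try:
            return 200, handler(path)
        except Exception as exc:
            # Record what an engineer needs in order to reproduce and fix the fault.
            captured_reports.append(
                {
                    "path": path,
                    "type": type(exc).__name__,
                    "message": str(exc),
                    "traceback": traceback.format_exc(),
                }
            )
            return 500, "Internal Server Error"

    return wrapped


def broken_view(path: str) -> str:
    """A deliberately faulty handler used to trigger the capture path."""
    return str(1 // 0)


status, body = with_error_capture(broken_view)("/example/")
print(status, captured_reports[0]["type"])  # 500 ZeroDivisionError
```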

LibreNMS

LibreNMS is an open-source network monitoring system that collects SNMP traffic data from network devices (such as switch fabric) in a Pod. LibreNMS is used to identify network congestion issues.

LibreNMS is an optional monitoring tool that can be installed in any Pod. It is considered standalone, as it has no automation or integration functionality in CloudCIX. Switch fabric must be manually configured for SNMP to communicate with LibreNMS in a Pod.
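Traffic graphs of this kind are derived from SNMP interface octet counters (for example `ifInOctets`, a 32-bit counter, or `ifHCInOctets`, a 64-bit counter, both from the standard IF-MIB). The sketch below shows the generic arithmetic for turning two successive counter samples into a bit rate, including counter wrap-around. It is not LibreNMS code; the function name and sample values are hypothetical, and it assumes the counter wraps at most once between polls.

```python
def traffic_rate_bps(
    prev_octets: int,
    curr_octets: int,
    interval_s: float,
    counter_bits: int = 64,
) -> float:
    """Average bits per second between two samples of an SNMP octet counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    # The modulo handles a counter that wrapped past its maximum between polls.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


# Two polls taken 300 seconds apart on a 64-bit counter.
steady = traffic_rate_bps(1_000_000, 4_750_000, 300)

# A 32-bit counter that wrapped around between polls.
wrapped = traffic_rate_bps(2**32 - 1000, 500, 300, counter_bits=32)

print(steady, wrapped)  # 100000.0 40.0
```

On fast links a 32-bit counter can wrap more than once within a polling interval, which is why the 64-bit counters are generally preferred where devices support them.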

Jumphost

The jumphost is an instance of Ubuntu Desktop running in a container. It is intended to be used to connect to other devices within its Pod.

  • Terminal/SSH and VirtManager are used to manage KVM hosts

  • Remmina is used to manage Hyper-V hosts

  • Terminal/SSH and Firefox are used to manage Ceph hosts and Switch Infrastructure