One ring to rule them all, one ring to find them, one ring to bring them all, and in the darkness bind them.
J.R.R. Tolkien, Lord of the Rings
Note
The entire Monitoring section of this manual is intended exclusively for people who build their own CloudCIX instances using the open-source software. If you are using a public cloud, you do not need the information in this section of the document. Your cloud provider will be monitoring and maintaining the infrastructure for you.
PAT is a Cloud to monitor other Clouds. Although PAT is referred to as a Pod Flavor, it is really a blend of COP and Region, configured in a particular way and running Projects that actually do the provisioning and telemetry.
Note on Nomenclature: PAT is an acronym standing for “Provisioning and Telemetry”. This acronym applies to two separate things.
PAT is a Web GUI App that implements key monitoring and provisioning functionality in CloudCIX.
The PAT, using the definite article, refers to the entire hardware infrastructure and associated software of the Pod implementing Provisioning and Telemetry.
CloudCIX monitoring is built from the following tools.
PAT is a GUI used to manage the PAT’s adoption of COPs and Regions.
InfluxDB and Grafana monitor Region hardware utilisation.
Elastic Stack & Jaeger are used to trace software run times to identify bottlenecks in CloudCIX API calls in COPs.
Sentry catches 500 errors (Internal Server Error) generated by CloudCIX code.
LibreNMS monitors SNMP messages. Its prime task is to graph traffic data.
Jumphost allows access to hosts within a Pod.
PAT is a Django-based application that:
Facilitates adoption of COPs and Regions.
Facilitates the adoption (or retirement) of hosts by Regions.
Allows engineers to support COPs, Regions and Hosts with Rocky tools.
InfluxDB is a time series database that acts as the repository for the collected data.
Grafana is a graphical user interface for creating dashboards. Data is sent to InfluxDB from the following sources:
Each application that exposes an API
Robot
InfluxDB and Grafana are located in PAT.
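To illustrate what a data point sent to InfluxDB can look like, the following minimal sketch formats a single measurement in InfluxDB line protocol (measurement, optional tags, fields, timestamp). The function name `to_line_protocol`, the measurement name `api_latency`, and the tag and field names are hypothetical and are not taken from the CloudCIX code; the actual schema used by Robot and the API applications may differ.

```python
def _escape(value: str, chars: str) -> str:
    """Backslash-escape every character of `chars` found in `value`."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value) -> str:
    """Render a field value using line protocol type conventions."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted; backslashes and quotes are escaped.
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement[,tag=value...] field=value[,...] timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    # Tags are written in sorted key order.
    for key in sorted(tags):
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        _escape(key, ",= ") + "=" + _format_field(value)
        for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"


# Hypothetical example point: latency of one API call in one Region.
line = to_line_protocol(
    "api_latency",
    {"region": "region1", "app": "iaas"},
    {"duration_ms": 12.5, "status": 200},
    1700000000000000000,
)
print(line)
```

Tags are indexed by InfluxDB and fields are not, so values that dashboards filter or group by (such as a Region name) are usually better placed in tags, while the measured numbers go in fields.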
Elastic Stack comprises Elasticsearch, Logstash and Kibana. Elastic Stack is the log monitoring system of CloudCIX. Elastic Stack is located in PAT.
Jaeger performs distributed tracing and is used to identify latency bottlenecks in the CloudCIX API. The PAT runs Jaeger Collector as a Docker container, and each COP and Region runs Jaeger Agent as a Docker container.
Sentry is an open source application monitoring system (https://open.sentry.io/) and it is used to handle “HTTP Server 500” errors in CloudCIX applications. These are critical (and hopefully rare) failures of the CloudCIX software and Sentry gathers information necessary to debug and fix such errors.
LibreNMS is an open source network monitoring system that collects SNMP traffic data from network devices (such as switch fabric) in a Pod. LibreNMS is used to identify network congestion issues.
LibreNMS is an optional monitoring tool that can be installed in any pod. It is considered standalone as it has no automation or integration functionality in CloudCIX. Switch fabric must be manually configured for SNMP to communicate with LibreNMS in a Pod.
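Traffic graphs of this kind are derived from SNMP interface octet counters, which only ever increase and wrap around at their maximum value. The sketch below shows the general arithmetic for turning two successive counter readings into a bit rate. It is an illustration under stated assumptions, not a description of how LibreNMS implements it internally; the function name `counter_rate_bps` and the sample values are hypothetical.

```python
def counter_rate_bps(prev_octets: int, curr_octets: int,
                     interval_s: float, counter_bits: int = 64) -> float:
    """Return the average rate in bits per second between two counter samples.

    prev_octets / curr_octets: successive readings of an octet counter
    interval_s: seconds elapsed between the two readings
    counter_bits: 64 for high-capacity counters, 32 for the older ones
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << counter_bits
    # The modulo handles a single wrap-around of the counter between samples.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / interval_s


# Two readings taken 300 seconds apart on a 64-bit counter.
print(counter_rate_bps(1_000, 376_000, 300))

# A 32-bit counter that wrapped between two readings taken 60 seconds apart.
print(counter_rate_bps(2**32 - 100, 200, 60, counter_bits=32))
```

On fast links a 32-bit counter can wrap more than once within a polling interval, which this calculation cannot detect; that is one reason 64-bit counters are preferred where the switch fabric supports them.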
The jumphost is an instance of Ubuntu Desktop running in a container and it is intended to be used to connect to other devices within its Pod.
Terminal/SSH and VirtManager are used to manage KVM hosts
Remmina is used to manage Hyper-V hosts
Terminal/SSH and Firefox are used to manage Ceph hosts and Switch Infrastructure