Monitoring Procedures

This chapter explains how to monitor and diagnose issues with CloudCIX infrastructure from the PAT.

Note

Day-to-day management of each Cloud is performed by an Administrative User at the COP level using the Compute Admin App.

The PAT is responsible for monitoring and diagnosing issues with one or more Clouds. It is intended for use by trained computer and network specialists.

The CloudCIX provisioning and telemetry (PAT) system provides a set of tools that run in Docker containers on the PAT appliance. These tools are divided into three categories, one for each of the three levels of support engineer.

1) Level 1 support engineers monitor the COPs and Regions that they have adopted. They can confirm the health of a system. If they detect a problem, they pass it to a Level 2 or Level 3 support engineer for analysis.

2) Level 2 support engineers are primarily interested in monitoring and understanding the network traffic of a CloudCIX environment. They are capable of diagnosing network congestion issues or infrastructure over-provisioning.

3) Level 3 support engineers have the training, tools and authorisation to log in to appliances, PodNets, hosts and storage appliances in order to identify and solve complex problems.

PAT uses the out of band (OOB) network, via site to site VPN tunnels, to reach all COPs and Regions that it has adopted.

Level 1 Tools

The level 1 monitoring tools are deployed in a single web application called PAT. The PAT web application is reachable at IPv4 address 10.0.0.6.

  1. Rocky is an application in PAT for managing and monitoring Clouds. Rocky contains multiple tools that can be used for support.
    1. Alarmer is a tool that identifies hardware alarms on network devices.

    2. Tester is a tool that performs detailed analysis of customer or service infrastructure.

    3. Checker is a tool that identifies connectivity issues in Pods.

    4. Learner is a tool that learns the switchport state of all network devices.

    5. Fetcher is a tool that fetches the configuration of all network devices and pushes it to a Git repository.

  2. Grafana is a dashboard system that displays detailed information about the infrastructure and database of a CloudCIX. It can be reached at grafana.cloudcix.com from the OOB network or via the VPN tunnel terminated on 91.103.3.33.
    1. Heartbeat is a dashboard that displays the heartbeat of a robot in a region.

    2. PingPlot is a dashboard that detects IP addresses that are not responding to a Ping.

  3. LibreNMS is an SNMP network monitoring system. It is deployed in each PAT and is used to monitor traffic through every port of the PodNet Services Gateway and, optionally, on the switch fabric within the region. Because project traffic is separated by VLAN ID, it is possible to monitor northbound and southbound traffic for each project.
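The per-project view that VLAN separation makes possible can be illustrated with a minimal Python sketch. It is not LibreNMS code or output: the sample tuples, the direction labels and the function name `traffic_by_vlan` are assumptions made purely for illustration. It simply sums byte counts per VLAN ID and direction.

```python
from collections import defaultdict

# Hypothetical samples (vlan_id, direction, byte_count); not real LibreNMS data.
samples = [
    (1001, "north", 1200),
    (1001, "south", 800),
    (1002, "north", 300),
    (1001, "north", 500),
]


def traffic_by_vlan(samples):
    """Sum byte counts per VLAN ID, split by traffic direction."""
    totals = defaultdict(lambda: {"north": 0, "south": 0})
    for vlan_id, direction, byte_count in samples:
        totals[vlan_id][direction] += byte_count
    return dict(totals)


totals = traffic_by_vlan(samples)
for vlan_id, counters in sorted(totals.items()):
    print(vlan_id, counters)
```

Because each project maps to its own VLAN ID, grouping counters by that ID is enough to separate one project's traffic from another's.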

Level 2 Tools

  1. Kibana is a logging system that records Robot’s activities. When Robot puts a cloud object into an unresourced state, an error message stating the reason is logged in Kibana. These error messages can be viewed using the ‘Monitor Regions - ERRORS’ filter.
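The kind of triage that an error filter supports can be sketched in Python. This is only an illustration under stated assumptions: the record layout, the field names `level`, `region` and `message`, and the placeholder messages are all hypothetical and do not reflect the actual Robot log schema or the query behind the Kibana filter.

```python
from collections import Counter

# Hypothetical log records; field names and messages are placeholders only.
records = [
    {"level": "ERROR", "region": "region-a", "message": "placeholder reason 1"},
    {"level": "INFO", "region": "region-a", "message": "placeholder info"},
    {"level": "ERROR", "region": "region-b", "message": "placeholder reason 2"},
    {"level": "ERROR", "region": "region-a", "message": "placeholder reason 1"},
]


def errors_by_region(records):
    """Count error-level records per region, ignoring all other levels."""
    return Counter(r["region"] for r in records if r["level"] == "ERROR")


counts = errors_by_region(records)
for region, count in counts.most_common():
    print(region, count)
```

Counting errors per region in this way gives a quick indication of which Region to investigate first.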

Level 3 Tools

A jumphost at 10.0.0.5 in the PAT contains a number of tools for debugging technical issues.

  • The jumphost runs Ubuntu Desktop 20.04 with a VNC server, so it can be reached via SSH for CLI access and via a VNC client for GUI access. From the jumphost, the Level 3 engineer can reach every services gateway, appliance or host over the OOB network.

  • Once the jumphost is reached, you can ‘jump’ onto the rest of the infrastructure using SSH for CLI access and Remmina for access to Windows Hyper-V hosts.
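As an illustration of the ‘jump’ step, the sketch below builds an OpenSSH command line that uses the standard `-J` (ProxyJump) option to pass through the jumphost in a single step. This is one possible approach, not necessarily how the PAT is operated in practice. The user name `engineer` and the target address `192.0.2.10` are placeholders, and the helper name `ssh_via_jumphost` is hypothetical; only the jumphost address 10.0.0.5 comes from this chapter.

```python
import shlex

JUMPHOST = "10.0.0.5"  # jumphost address given in this chapter


def ssh_via_jumphost(user, target, jump_user=None):
    """Build an OpenSSH argv that reaches `target` through the jumphost.

    Uses the OpenSSH `-J` (ProxyJump) option. The user names and target
    passed in are supplied by the caller; nothing is executed here.
    """
    jump_user = jump_user or user
    return ["ssh", "-J", f"{jump_user}@{JUMPHOST}", f"{user}@{target}"]


# Placeholder user and target address, for illustration only.
cmd = ssh_via_jumphost("engineer", "192.0.2.10")
print(shlex.join(cmd))
```

The function only assembles the argument list; it could be passed to `subprocess.run` once the real account names and target addresses are known.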

  1. GitLab contains * of running configuration