This chapter explains how to monitor and diagnose issues with CloudCIX infrastructure from the PAT.
Note
Day-to-day management of each Cloud is performed by an Administrative User at the COP level using the Compute Admin App.
The PAT has responsibility for monitoring and diagnosing issues with one or more Clouds. PAT is intended for use by trained computer and network specialists.
The CloudCIX provisioning and telemetry (PAT) system provides a set of tools that run in Docker containers on the PAT appliance. The tools are divided into three categories, one for each of three levels of support engineer.
1) Level 1 support engineers monitor the COPs and Regions that they have adopted. They can confirm the health of a system. If they diagnose a problem, they pass the problem for analysis to a level 2 or level 3 support engineer.
2) Level 2 support engineers are primarily interested in monitoring and understanding the network traffic of a CloudCIX environment. They are capable of diagnosing network congestion issues or infrastructure over-provisioning.
3) Level 3 support engineers have the training, tools and authorisation to log in to appliances, PodNets, hosts and storage appliances in order to identify and solve complex problems.
PAT uses the out-of-band (OOB) network, via site-to-site VPN tunnels, to reach all COPs and Regions that it has adopted.
The level 1 monitoring tools are deployed in a single web application called PAT. The PAT web application is reachable at IPv4 address 10.0.0.6.
Alarmer is a tool that identifies hardware alarms on network devices.
Tester is a tool that performs detailed analysis of customer or service infrastructure.
Checker is a tool that identifies connectivity issues in Pods.
Learner is a tool that learns the switchport state of all network devices.
Fetcher is a tool that fetches the configuration of all network devices and pushes it to a Git repository.
Heartbeat is a dashboard that displays the heartbeat of a robot in a region.
PingPlot is a dashboard that detects IP addresses that are not responding to a Ping.
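The kind of check that PingPlot reports on can be sketched in a few lines of Python. This is only an illustration and not PingPlot's actual implementation: the function names `ping_once` and `unresponsive` are hypothetical, and the probe assumes a Linux host whose `ping` utility accepts the `-c` (count) and `-W` (reply timeout in seconds) options.

```python
import subprocess
from typing import Callable, Iterable, List


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request using the system ping utility.

    Assumption: Linux (iputils) flags, -c for count and -W for the
    reply timeout in seconds. Returns True if a reply was received.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def unresponsive(addresses: Iterable[str],
                 probe: Callable[[str], bool] = ping_once) -> List[str]:
    """Return, in input order, the addresses for which the probe got no reply."""
    return [address for address in addresses if not probe(address)]
```

The probe is passed in as a parameter so that the filtering logic can be exercised without sending real traffic. For example, `unresponsive(["10.0.0.5", "10.0.0.6"])` would ping the jumphost and the PAT web application in turn and return whichever of them did not answer.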
LibreNMS is an SNMP network monitoring system. It is deployed in each PAT and is used to monitor traffic through every port of the PodNet Services Gateway and, optionally, on the switch fabric within the region. Because project traffic is separated by VLAN ID, it is possible to monitor northbound and southbound traffic for each project.
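As background on how SNMP traffic figures are obtained in general, a port's throughput is derived from the difference between two samples of its octet counter. The sketch below shows that arithmetic for the standard 64-bit IF-MIB counters (`ifHCInOctets` and `ifHCOutOctets`). It is not taken from LibreNMS itself; the function names and the sample values are assumptions chosen for illustration.

```python
COUNTER64_MODULUS = 2 ** 64  # ifHCInOctets / ifHCOutOctets are 64-bit counters


def octet_delta(previous: int, current: int,
                modulus: int = COUNTER64_MODULUS) -> int:
    """Octets transferred between two samples, allowing for one counter wrap."""
    return (current - previous) % modulus


def bits_per_second(previous: int, current: int, interval_s: float) -> float:
    """Average throughput over the polling interval, in bits per second."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return octet_delta(previous, current) * 8 / interval_s
```

With two samples taken 300 seconds apart that differ by 37,500,000 octets, `bits_per_second` gives 1,000,000 bits per second. Applying the same calculation to the counters of a per-VLAN interface would give a per-project figure, assuming the device exposes such interfaces over SNMP.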
Kibana is a logging system that records Robot’s activities. When Robot puts a cloud object into an unresourced state, an error message stating the reason is logged in Kibana. These error messages can be viewed using the ‘Monitor Regions - ERRORS’ filter.
A jumphost at 10.0.0.5 in the PAT contains a number of tools for debugging technical issues.
The jumphost runs Ubuntu Desktop 20.04 with a VNC Server, so it can be reached via SSH for CLI access and via a VNC Client for GUI access. From the jumphost the Level 3 engineer can reach every services gateway, appliance or host over the OOB network.
Once the jumphost is reached, you can ‘jump’ onto the rest of the infrastructure using SSH for CLI access and Remmina for access to Windows Hyper-V hosts.
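Before opening an SSH or VNC session, it can be useful to confirm that the jumphost is accepting connections. The following is a minimal sketch of such a check. It assumes SSH listens on its default port 22 and VNC on the default port 5900 (display :0); neither port is stated in this chapter, and the function names are hypothetical.

```python
import socket

JUMPHOST = "10.0.0.5"  # jumphost address given in this chapter
SSH_PORT = 22          # assumption: SSH on its default port
VNC_PORT = 5900        # assumption: VNC display :0 on its default port


def tcp_port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def jumphost_status(host: str = JUMPHOST) -> dict:
    """Report whether the assumed SSH and VNC ports accept connections."""
    return {
        "ssh": tcp_port_open(host, SSH_PORT),
        "vnc": tcp_port_open(host, VNC_PORT),
    }
```

Calling `jumphost_status()` from a machine with a route to the PAT returns a dictionary with one boolean per service. A `False` value only shows that the TCP connection could not be established; it does not distinguish between a stopped service, a firewall rule and a routing problem.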
GitLab contains * of running configuration