Ceph Cluster

Intro to Ceph

Whether you want to provide Ceph Object Storage and/or Ceph Block Device services to Cloud Platforms, deploy a Ceph Filesystem or use Ceph for another purpose, all Ceph Storage Cluster deployments begin with setting up each Ceph Node, your network, and the Ceph Storage Cluster. A Ceph Storage Cluster requires at least one Ceph Monitor, Ceph Manager, and Ceph OSD (Object Storage Daemon). The Ceph Metadata Server is also required when running Ceph Filesystem clients.

  • Monitors: A Ceph Monitor (ceph-mon) maintains maps of the cluster state, including the monitor map, manager map, the OSD map, and the CRUSH map. These maps are critical cluster state required for Ceph daemons to coordinate with each other. Monitors are also responsible for managing authentication between daemons and clients. At least three monitors are normally required for redundancy and high availability.

  • Managers: A Ceph Manager daemon (ceph-mgr) is responsible for keeping track of runtime metrics and the current state of the Ceph cluster, including storage utilization, current performance metrics, and system load. The Ceph Manager daemons also host python-based plugins to manage and expose Ceph cluster information, including a web-based dashboard and REST API. At least two managers are normally required for high availability.

  • Ceph OSDs: A Ceph OSD (object storage daemon, ceph-osd) stores data, handles data replication, recovery, rebalancing, and provides some monitoring information to Ceph Monitors and Managers by checking other Ceph OSD Daemons for a heartbeat. At least 3 Ceph OSDs are normally required for redundancy and high availability.

  • MDSs: A Ceph Metadata Server (MDS, ceph-mds) stores metadata on behalf of the Ceph Filesystem (i.e., Ceph Block Devices and Ceph Object Storage do not use MDS). Ceph Metadata Servers allow POSIX file system users to execute basic commands (like ls, find, etc.) without placing an enormous burden on the Ceph Storage Cluster.

Ceph stores data as objects within logical storage pools. Using the CRUSH algorithm, Ceph calculates which placement group should contain the object, and further calculates which Ceph OSD Daemon should store the placement group. The CRUSH algorithm enables the Ceph Storage Cluster to scale, rebalance, and recover dynamically.
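
Once a cluster and a pool exist, you can see this mapping in action by asking Ceph where it would place a given object (the pool and object names below are only examples):

ceph osd map datapool test-object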

Integrating Ceph with CloudCIX

The principal requirements for deploying Ceph in a CloudCIX region are as follows.

From the Ceph installation guide, the following system requirements must be met before deployment (a quick verification sketch follows this list):

  • Python 3

  • Podman or Docker for running containers

  • Time synchronization (such as chrony or NTP)

  • LVM2 for provisioning storage devices

  • To allow easy configuration of large numbers of Ceph hosts, Ansible is used.

  • Once the hosts are configured, Ceph is installed using cephadm, which is installed on the first Mon node.

  • The Public Network of Ceph is the Management Network of CloudCIX. In the rest of this documentation, the name Public Network will be used.

  • The Cluster Network of Ceph is the Private Network of CloudCIX. In the rest of this documentation, the name Cluster Network will be used.

  • Both networks should be IPv6.
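
A quick way to verify these prerequisites on a host before deployment (a sketch; the commands assume the package names used on Ubuntu 20.04):

python3 --version
docker --version || podman --version
chronyc tracking
lvm version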

Ceph Cluster Hardware

CloudCIX Ceph is built using Ubuntu 20.04 as the underlying operating system. The Ceph user documentation lists the minimum and recommended hardware requirements for Monitors and Nodes: https://docs.ceph.com/en/latest/start/hardware-recommendations/.

Networking IP Address Assignments

The cluster uses two networks, with addresses and VLANs assigned as follows:

  • Internal Cluster network: fc00::<rack>:<unit>/64, VLAN ID 2 (ports are access). This address range is replicated in each Region’s Ceph Cluster.

  • Public network (connected to the CloudCIX Management Network): <prefix>::60:<rack>:<unit>/64, VLAN ID 3.

  • <prefix> represents the first 48 bits of the IPv6 subnet assigned to the Pod.

  • <rack> is a decimal (not hexadecimal) representation of the rack id in the Pod.

  • <unit> represents the U location of the host in the rack (the bottom U for a host occupying multiple Us).
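
For example, a host installed in rack 1 at unit 2 would receive the Public address <prefix>::60:1:2 and, following the same pattern, the Cluster address fc00::1:2; this matches the Ansible inventory used later in this document.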

DNS Records

DNS records for Ceph Monitors and Hosts are named as outlined here.

Deploy the Ceph cluster

We will deploy the cluster in the following configuration:

  • 3 Monitors, each with one disk: /dev/sda for the system.

  • 6 OSD nodes, each with three disks: /dev/sda for the system, /dev/sdb and /dev/sdc for storage.

  • Each host has 2 NICs, one for the Public network and one for the Cluster network.

  1. SSH to the first node (cephmon001001) and become root.

  2. Prepare a hosts inventory file for Ansible:

    [ceph_mon]
    cephmon001001 ansible_host=<prefix>::60:1:1
    cephmon002001 ansible_host=<prefix>::60:2:1
    cephmon003001 ansible_host=<prefix>::60:3:1
    [ceph_osd]
    ceph001002 ansible_host=<prefix>::60:1:2
    ceph001003 ansible_host=<prefix>::60:1:3
    ceph001004 ansible_host=<prefix>::60:1:4
    ceph002002 ansible_host=<prefix>::60:2:2
    ceph002003 ansible_host=<prefix>::60:2:3
    ceph002004 ansible_host=<prefix>::60:2:4
    
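Before running any playbooks, you can confirm that Ansible can reach every host in the inventory (a sketch; it assumes the same administrator account used in the next steps):

    ansible -i hosts all -m ping --user administrator --ask-pass
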
  3. Save the following playbook to a file named prepare-ceph-nodes.yml:

    ---
    - name: Prepare ceph nodes
      hosts: all
      become: yes
      become_method: sudo
      #vars:
      #  ceph_admin_user: cephadmin
      tasks:
        - name: Set timezone
          timezone:
            name: UTC
          tags: timezone
        - name: Set hostname
          hostname:
            name: "{{inventory_hostname_short}}"
          tags: update_host
        - name: Update system
          apt:
            name: "*"
            state: latest
            update_cache: yes
        - name: Install common packages
          apt:
            name: [vim,git,bash-completion,wget,curl,chrony]
            state: present
            update_cache: yes
        - name: Install Docker
          shell: |
            curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
            echo "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable" > /etc/apt/sources.list.d/docker-ce.list
            apt update
            apt install -qq -y docker-ce docker-ce-cli containerd.io
        - name: Reboot server after update and configs
          reboot:
    
  4. Run the playbook:

    ansible-playbook -i hosts prepare-ceph-nodes.yml --user administrator --ask-pass --ask-become-pass
    
  5. Install cephadm. The cephadm command can

    • bootstrap a new cluster

    • launch a containerized shell with a working Ceph CLI

    • aid in debugging containerized Ceph daemons

      apt install cephadm
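
    You can confirm that the cephadm tool is available before bootstrapping:

      cephadm version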
      
  6. Prepare a minimal networking configuration in a ceph.conf file in your home directory:

    [global]
    cluster network = fc00::/64
    public network = <prefix>::60/64
    
  7. The first step in creating a new Ceph cluster is running the cephadm bootstrap command on the Ceph cluster’s first host. The act of running the cephadm bootstrap command on the Ceph cluster’s first host creates the Ceph cluster’s first “monitor daemon”, and that monitor daemon needs an IP address. You must pass the IP address of the Ceph cluster’s first host to the ceph bootstrap command, so you’ll need to know the IP address of that host.

    cephadm bootstrap --mon-ip <host1-ip> --allow-fqdn-hostname
    
    
    ceph -s
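
    If you want the networking settings prepared in step 6 to be applied at bootstrap time, cephadm accepts an initial configuration file. A sketch of that alternative bootstrap invocation, assuming ceph.conf is in the current directory:

    cephadm bootstrap --mon-ip <host1-ip> --allow-fqdn-hostname --config ./ceph.conf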
    
  8. Cephadm does not require any Ceph packages to be installed on the host. However, it is recommended to enable easy access to the ceph command.

You can install the ceph-common package, which contains all of the ceph commands, including ceph, rbd, mount.ceph (for mounting CephFS file systems), etc.:

cephadm install ceph-common

Adding additional hosts to the cluster

To add each new host to the cluster, perform two steps:

  • Install the cluster’s public SSH key in the new host’s cephadmin user’s authorized_keys file:

Use the Ansible playbook create_cephadmin.yml:

---
- name: Prepare ceph nodes
  hosts: all
  become: yes
  become_method: sudo
  tasks:
    - name: Create user cephadmin
      user:
        name: cephadmin
        # The user module expects a crypted password hash, not plain text
        password: "{{ 'long_ceph_admin_password' | password_hash('sha512') }}"
        generate_ssh_key: no
        state: present
    - name: sudo without password for cephadmin user
      copy:
        content: 'cephadmin ALL=(ALL:ALL) NOPASSWD:ALL'
        dest: /etc/sudoers.d/cephadmin
        mode: '0440'
        owner: root
        group: root
    - name: Set authorized key taken from file to cephadmin user
      authorized_key:
        user: cephadmin
        state: present
        key: "{{ lookup('file', '/etc/ceph/ceph.pub') }}"
  • Tell Ceph that the new node is part of the cluster:

    ceph orch host add <hostname> <host-ip>

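For example, using the inventory above, the second monitor host would be added with (substitute the Pod’s <prefix>):

ceph orch host add cephmon002001 <prefix>::60:2:1

By default cephadm connects to other hosts over SSH as root. If the cluster should instead connect as the cephadmin user created above, cephadm can be told which user to use:

ceph cephadm set-user cephadmin
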
Adding storage

A storage device is considered available if all of the following conditions are met:

  • The device must have no partitions.

  • The device must not have any LVM state.

  • The device must not be mounted.

  • The device must not contain a file system.

  • The device must not contain a Ceph BlueStore OSD.

  • The device must be larger than 5 GB.

Ceph will not provision an OSD on a device that is not available.

To add storage to the cluster, you can tell Ceph to consume any available and unused device:

ceph orch apply osd --all-available-devices

After running the above command:

  • If you add new disks to the cluster, they will automatically be used to create new OSDs.

  • If you remove an OSD and clean the LVM physical volume, a new OSD will be created automatically.

If you want to avoid this behavior (disable automatic creation of OSDs on available devices), use the --unmanaged parameter:

ceph orch apply osd --all-available-devices --unmanaged=true

Create an OSD from a specific device on a specific host:

ceph orch daemon add osd <host>:<device-path>
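
For example, to create an OSD on the first data disk of one of the OSD nodes from the layout above:

ceph orch daemon add osd ceph001002:/dev/sdb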

The --dry-run flag causes the orchestrator to present a preview of what will happen without actually creating the OSDs.

For example:

ceph orch apply osd --all-available-devices --dry-run

Listing storage devices

In order to deploy an OSD, there must be a storage device that is available on which the OSD will be deployed.

Run this command to display an inventory of storage devices on all cluster hosts:

ceph orch device ls

The output will look similar to this:

Hostname  Path      Type  Serial  Size   Health   Ident  Fault  Available
host2     /dev/sdb  hdd           85.8G  Unknown  N/A    N/A    Yes
host3     /dev/sdb  hdd           85.8G  Unknown  N/A    N/A    Yes
host1     /dev/sdb  hdd           85.8G  Unknown  N/A    N/A    Yes

Create a pool

Pools are logical partitions for storing objects. When you first deploy a cluster without creating a pool, Ceph uses the default pools for storing data.

By default, Ceph makes 3 replicas of RADOS objects. Ensure you choose a realistic number of placement groups: Ceph recommends approximately 100 placement groups per OSD, and always use the nearest power of 2.

root@host1:~# ceph osd lspools
1 device_health_metrics
root@host1:~# ceph osd pool create datapool 128 128
pool 'datapool' created
root@host1:~# ceph osd lspools
1 device_health_metrics
2 datapool


root@host1:~# ceph osd pool ls detail
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 22 flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
pool 2 'datapool' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 39 flags hashpspool stripe_width 0

root@host1:~# ceph osd pool get datapool all

On the admin node, use the rbd tool to initialize the pool for use by RBD:

rbd pool init datapool

Create rbd volume and map to a block device on the host

The rbd command enables you to create, list, introspect and remove block device images. You can also use it to clone images, create snapshots, rollback an image to a snapshot, view a snapshot, etc.

rbd create --size 512000 datapool/rbdvol1
rbd feature disable datapool/rbdvol1 object-map fast-diff deep-flatten
rbd map datapool/rbdvol1
rbd showmapped
lsblk
ls -la /dev/rbd/datapool/rbdvol1
rbd status datapool/rbdvol1
rbd info datapool/rbdvol1

Create filesystem and mount rbd volume

You can use standard Linux commands to create a filesystem on the volume and mount it for whatever purpose you need.
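
A minimal sketch, assuming the rbdvol1 volume mapped above, an ext4 filesystem, and an example mount point:

mkfs.ext4 /dev/rbd/datapool/rbdvol1
mkdir -p /mnt/rbdvol1
mount /dev/rbd/datapool/rbdvol1 /mnt/rbdvol1
df -h /mnt/rbdvol1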

Finish

Script for removing the cluster and cleaning all hosts (if something went wrong):

#!/bin/bash

display_usage() {
  echo "The ceph cluster fsid must be provided"
  echo -e "\nUsage: $0 <fsid> \n"
  }

if [ -z "$1" ]
then
  display_usage
  exit 1
fi
fsid=$1

#Get information about hosts in the cluster
bootstrap=$(hostname)
hosts=$(cephadm shell --fsid $fsid -c /etc/ceph/ceph.conf -k /etc/ceph/ceph.client.admin.keyring ceph orch host ls --format yaml | grep hostname |  cut -d " " -f2)

#Clean Bootstrap node
echo "Purge cluster in $bootstrap:"
cephadm rm-cluster --fsid $fsid --force
rm -rf /etc/ceph/*
rm -rf /var/log/ceph/*
rm -rf /var/lib/ceph/$fsid

# Clean the rest of hosts
for host in $hosts
do
  if [ $host != $bootstrap ]
    then
      echo "Purge cluster in $host:"
      cephadm_in_host=$(ssh -o StrictHostKeyChecking=no $host ls /var/lib/ceph/$fsid/cephadm*)
      ssh -o StrictHostKeyChecking=no $host python3 $cephadm_in_host rm-cluster --fsid $fsid --force
      # Remove ceph target
      ssh -o StrictHostKeyChecking=no $host systemctl stop ceph.target
      ssh -o StrictHostKeyChecking=no $host systemctl disable ceph.target
      ssh -o StrictHostKeyChecking=no $host rm /etc/systemd/system/ceph.target
      ssh -o StrictHostKeyChecking=no $host systemctl daemon-reload
      ssh -o StrictHostKeyChecking=no $host systemctl reset-failed
      # Remove ceph logs
      ssh -o StrictHostKeyChecking=no $host rm -rf /var/log/ceph/*
      # Remove config files on the remote host
      ssh -o StrictHostKeyChecking=no $host rm -rf /var/lib/ceph/$fsid
    fi
done
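
Run the script as root on the bootstrap node and pass it the cluster fsid, which is printed during bootstrap and can also be retrieved with ceph fsid while the cluster is still reachable. For example, assuming the script was saved as purge-ceph.sh (an illustrative name):

./purge-ceph.sh <fsid>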