8. Monitoring

There are multiple ways in which KubeDR’s operations can be monitored. They include:

  • Prometheus metrics

  • “Status” section of individual resources

  • Kubernetes events

The following sections will elaborate on each of these mechanisms.

8.1. Prometheus Metrics

KubeDR exposes several metrics that can be scraped by Prometheus and visualized using Grafana. Most of the metrics deal with the internal implementation, but the following ones provide particularly useful information to the user; together they correspond to the widely used RED (Rate, Errors, Duration) pattern.

kubedr_backup_size_bytes (Gauge)

Size of the backup in bytes.

kubedr_num_backups (Counter)

Total number of backups.

kubedr_num_successful_backups (Counter)

Total number of successful backups.

kubedr_num_failed_backups (Counter)

Total number of failed backups.

kubedr_backup_duration_seconds (Histogram)

Time (seconds) taken for the backup.

This metric is a histogram with the following buckets:

15s, 30s, 1m, 5m, 10m, 15m, 30m, 1h, ...., 10h

All the metrics will have a label called policyName set to the name of the MetadataBackupPolicy resource.

Note

More details on how exactly Prometheus can be configured to scrape KubeDR’s metrics will be provided soon. If you are interested, please check out issue 26.
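Once the metrics are being scraped, they can be used in standard Prometheus alerting and recording rules. The sketch below is illustrative only: the group name, alert name, threshold, and recorded series name are placeholders, while the metric and label names come from the list above.

# Illustrative Prometheus rule file (names and thresholds are examples, not shipped with KubeDR).
groups:
  - name: kubedr.rules
    rules:
      # Fire an alert if any backups failed during the last hour for a policy.
      - alert: KubeDRBackupFailed
        expr: increase(kubedr_num_failed_backups[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "KubeDR backup failed for policy {{ $labels.policyName }}"

      # Record the 95th-percentile backup duration per policy over the last day.
      - record: kubedr:backup_duration_seconds:p95_1d
        expr: histogram_quantile(0.95, sum by (policyName, le) (rate(kubedr_backup_duration_seconds_bucket[1d])))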

8.2. Status of Resources

All Kubernetes resources have two sections - spec and status.

spec describes the intent of the user, and the cluster constantly drives towards matching it. status, on the other hand, is set by cluster components and typically contains useful information about the current state of the resource.

KubeDR uses the status field to record the results of backups and other operations. The following sections describe the status details for each resource.
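The status of any KubeDR resource can be inspected with kubectl. A minimal sketch (the resource name local-minio and the kubedr-system namespace are taken from the examples in this document; adjust them for your setup):

# Show the full resource, including its status section:
$ kubectl -n kubedr-system get backuplocation local-minio -o yaml

# Or extract a single status field using jsonpath:
$ kubectl -n kubedr-system get backuplocation local-minio -o jsonpath='{.status.initStatus}'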

8.2.1. BackupLocation

This resource defines a backup target (an S3 bucket). When it is created, KubeDR initializes a backup repo in the given bucket, and the status field of the resource indicates the success or failure of that operation.

Here is an example of an error condition:

status:
  initErrorMessage: |+
    Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket50 failed: repository master key and config already initialized

  initStatus: Failed
  initTime: Thu Jan 30 16:02:53 2020

When initialization succeeds:

status:
  initErrorMessage: ""
  initStatus: Completed
  initTime: Thu Jan 30 16:05:56 2020

8.2.2. MetadataBackupPolicy

This resource defines the backup policy and its status field indicates details about the most recent backup.

An example:

status:
  backupErrorMessage: ""
  backupStatus: Completed
  backupTime: Thu Jan 30 16:04:05 2020
  dataAdded: 1573023
  filesChanged: 1
  filesNew: 0
  mbrName: mbr-4c1223d6
  snapshotId: b0f347ef
  totalBytesProcessed: 15736864
  totalDurationSecs: "0.318463127"

Apart from the backup statistics, the status also contains the name of the MetadataBackupRecord resource (mbrName), which is required to perform restores.
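For scripting, the record name can be pulled directly from the status. A minimal sketch, assuming the policy is named test-backup and lives in the kubedr-system namespace as in the event samples later in this section:

$ kubectl -n kubedr-system get metadatabackuppolicy test-backup -o jsonpath='{.status.mbrName}'
mbr-4c1223d6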

8.2.3. MetadataRestore

This resource defines a restore and its status field indicates success or failure of the operation.

Success:

restoreErrorMessage: ""
restoreStatus: Completed

Error:

status:
  restoreErrorMessage: Error in creating restore pod
  restoreStatus: Failed

8.3. Events

KubeDR generates events for certain operations so that administrators can monitor them. The following sections provide more details about each such event. Note that all events are generated in the kubedr-system namespace.
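Events can be watched as they happen, or filtered by reason using a standard kubectl field selector. For example (the reason names used below appear in the samples that follow):

# Watch for new KubeDR events as they are generated:
$ kubectl -n kubedr-system get events --watch

# List only failed-backup events:
$ kubectl -n kubedr-system get events --field-selector reason=BackupFailed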

8.3.1. Backup repo initialization

When a BackupLocation resource is created for the first time, a backup repo is initialized at the given S3 bucket. An event is generated at the end of the initialization process.

Here is an example of the event generated after successful initialization:

$ kubectl -n kubedr-system get event

...
25s  Normal  InitSucceeded    backuplocation/local-minio   Repo at s3:http://10.106.189.174:9000/testbucket62 is successfully initialized

In case of error:

$ kubectl -n kubedr-system get event

...
5s   Error  InitFailed        backuplocation/local-minio   Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket62 failed: repository master key and config already initialized

8.3.2. Backup

After every backup, an event is generated indicating success or failure; in the case of failure, the event contains the relevant error message. Here are a couple of sample events.

Success:

Normal  BackupSucceeded  metadatabackuppolicy/test-backup  Backup completed, snapshot ID: 34abbf1b

Error:

Error  BackupFailed  metadatabackuppolicy/test-backup  subprocess.CalledProcessError:
    Command '['restic', '--json', '-r', 's3:http://10.106.189.174:9000/testbucket63',
        '--verbose', 'backup', '/data']' returned non-zero exit status 1.
        (Fatal: unable to open config file: Stat: The access key ID you provided does not exist
        in our records. Is there a repository at the following location?
        s3:http://10.106.189.174:9000/testbucket63)
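To see only the events for a particular policy, the involved object can also be used as a field selector. A sketch using the policy name test-backup from the samples above:

$ kubectl -n kubedr-system get events --field-selector involvedObject.kind=MetadataBackupPolicy,involvedObject.name=test-backup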

8.3.3. Restore

After every restore, an event is generated indicating success or failure; in the case of failure, the event contains the relevant error message. Here are a couple of sample events.

Success:

Normal  RestoreSucceeded metadatarestore/mrtest  Restore from snapshot 5bbc8b1a completed

Error:

Error RestoreFailed  metadatarestore/mrtest subprocess.CalledProcessError:
    Command '['restic', '-r', 's3:http://10.106.189.175:9000/testbucket110',
    '--verbose', 'restore', '--target', '/restore', '5bbc8b1a']' returned non-zero exit
    status 1. (Fatal: unable to open config file: Stat:
    Get http://10.106.189.175:9000/testbucket110/?location=:
    dial tcp 10.106.189.175:9000: i/o timeout
    Is there a repository at the following location?
    s3:http://10.106.189.175:9000/testbucket110)