There are multiple ways in which KubeDR’s operations can be monitored. They include:
- Prometheus metrics
- “Status” section of individual resources
- Events
The following sections will elaborate on each of these mechanisms.
8.1. Prometheus Metrics¶
KubeDR exposes several metrics that can be scraped with Prometheus and visualized using Grafana. Most of the metrics deal with internal implementation details, but the following ones provide very useful information to the user. Together, they roughly correspond to the widely known RED metrics (Rate, Errors, Duration).
- kubedr_backup_size_bytes (Gauge)
Size of the backup in bytes.
- kubedr_num_backups (Counter)
Total number of backups.
- kubedr_num_successful_backups (Counter)
Total number of successful backups.
- kubedr_num_failed_backups (Counter)
Total number of failed backups.
- kubedr_backup_duration_seconds (Histogram)
Time (seconds) taken for the backup.
This metric is a histogram with the following buckets:
15s, 30s, 1m, 5m, 10m, 15m, 30m, 1h, ...., 10h
All the metrics will have a label called policyName, set to the name of the backup policy (MetadataBackupPolicy) that triggered the backup.
More details on how exactly Prometheus can be configured to scrape KubeDR’s metrics will be provided soon. If you are interested, please check out issue 26.
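Until that documentation is available, here is a minimal, illustrative sketch of a Prometheus rule file built only from the metric names listed above. It assumes the metrics are already being scraped; the alert name, recording-rule name, and time windows are made up for this example:

groups:
- name: kubedr
  rules:
  # Fire if any backup failed in the last hour, per backup policy.
  - alert: KubeDRBackupFailed
    expr: increase(kubedr_num_failed_backups[1h]) > 0
    labels:
      severity: warning
    annotations:
      summary: "KubeDR backup failed for policy {{ $labels.policyName }}"
  # 90th-percentile backup duration per policy, using the standard
  # _bucket series that Prometheus histograms expose.
  - record: kubedr:backup_duration_seconds:p90
    expr: histogram_quantile(0.9, sum by (le, policyName) (rate(kubedr_backup_duration_seconds_bucket[1d])))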
8.2. Status of Resources¶
All Kubernetes resources have two sections: spec and status.
spec describes the intent of the user, and the cluster constantly drives towards matching it. status, on the other hand, is set by cluster components and typically contains useful information about the current state of the resource.
KubeDR makes use of the status field to set the results of backup and other operations. The following sections describe the status details for each resource.
8.2.1. BackupLocation¶
This resource defines a backup target (which is an S3 bucket). When it is created, KubeDR initializes a backup repo at the given bucket, and the status field of the resource indicates success or failure of that operation.
Here is an example of an error condition:
status:
  initErrorMessage: |+
    Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket50 failed:
    repository master key and config already initialized
  initStatus: Failed
  initTime: Thu Jan 30 16:02:53 2020
When initialization succeeds:
status:
  initErrorMessage: ""
  initStatus: Completed
  initTime: Thu Jan 30 16:05:56 2020
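Scripts or operators watching for the initialization result can read this field directly with kubectl. The resource name local-minio and the kubedr-system namespace below are only examples (they match the event samples later in this section); adjust them to your setup:

$ kubectl -n kubedr-system get backuplocation local-minio -o jsonpath='{.status.initStatus}'
Completed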
8.2.2. MetadataBackupPolicy¶
This resource defines the backup policy, and its status field provides details about the most recent backup.
status:
  backupErrorMessage: ""
  backupStatus: Completed
  backupTime: Thu Jan 30 16:04:05 2020
  dataAdded: 1573023
  filesChanged: 1
  filesNew: 0
  mbrName: mbr-4c1223d6
  snapshotId: b0f347ef
  totalBytesProcessed: 15736864
  totalDurationSecs: "0.318463127"
Apart from the stats regarding the backup, the status also contains the name of the MetadataBackupRecord resource (mbrName), which is required to restore the backup.
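For example, the record name can be read out of the policy’s status with a jsonpath query; the policy name test-backup below is taken from the event samples later in this section and is only illustrative:

$ kubectl -n kubedr-system get metadatabackuppolicy test-backup -o jsonpath='{.status.mbrName}'
mbr-4c1223d6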
8.2.3. MetadataRestore¶
This resource defines a restore, and its status field indicates success or failure of the operation.
When the restore succeeds:

restoreErrorMessage: ""
restoreStatus: Completed

In case of an error:

restoreErrorMessage: Error in creating restore pod
restoreStatus: Failed
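A script that needs to block until a restore finishes can poll this field or, assuming a kubectl version that supports jsonpath-based wait conditions (v1.23 or later), wait on it directly. The resource name mrtest below is taken from the event samples in the next section:

$ kubectl -n kubedr-system wait metadatarestore/mrtest \
    --for=jsonpath='{.status.restoreStatus}'=Completed --timeout=10m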
8.3. Events¶
KubeDR generates events after certain operations, and these can be monitored by admins. The following sections provide more details about each such event. Note that events are generated in the namespace kubedr-system.
8.3.1. Backup repo initialization¶
When a BackupLocation resource is created for the first time, a backup repo is initialized at the given S3 bucket. An event is generated at the end of this initialization.
Here is an example of the event generated after successful initialization:
$ kubectl -n kubedr-system get event

...
25s   Normal   InitSucceeded   backuplocation/local-minio   Repo at s3:http://10.106.189.174:9000/testbucket62 is successfully initialized
In case of error:
$ kubectl -n kubedr-system get event

...
5s   Error   InitFailed   backuplocation/local-minio   Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket62 failed: repository master key and config already initialized
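Since these are regular Kubernetes events, the usual field selectors work as well. For example, to watch only failed initializations (the reason value InitFailed comes from the sample above):

$ kubectl -n kubedr-system get events --field-selector reason=InitFailed --watch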
8.3.2. Backup¶
After every backup, an event is generated containing details about success or failure; in the case of failure, the event contains the relevant error message. Here are a couple of sample events.
Normal BackupSucceeded metadatabackuppolicy/test-backup Backup completed, snapshot ID: 34abbf1b
Error BackupFailed metadatabackuppolicy/test-backup subprocess.CalledProcessError: Command '['restic', '--json', '-r', 's3:http://10.106.189.174:9000/testbucket63', '--verbose', 'backup', '/data']' returned non-zero exit status 1. (Fatal: unable to open config file: Stat: The access key ID you provided does not exist in our records. Is there a repository at the following location? s3:http://10.106.189.174:9000/testbucket63)
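The same events are also visible in the output of kubectl describe on the policy object, which can be handier when troubleshooting a single policy; the name test-backup is the one used in the samples above:

$ kubectl -n kubedr-system describe metadatabackuppolicy test-backup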
8.3.3. Restore¶
After every restore, an event is generated containing details about success or failure; in the case of failure, the event contains the relevant error message. Here are a couple of sample events.
Normal RestoreSucceeded metadatarestore/mrtest Restore from snapshot 5bbc8b1a completed
Error RestoreFailed metadatarestore/mrtest subprocess.CalledProcessError: Command '['restic', '-r', 's3:http://10.106.189.175:9000/testbucket110', '--verbose', 'restore', '--target', '/restore', '5bbc8b1a']' returned non-zero exit status 1. (Fatal: unable to open config file: Stat: Get http://10.106.189.175:9000/testbucket110/?location=: dial tcp 10.106.189.175:9000: i/o timeout Is there a repository at the following location? s3:http://10.106.189.175:9000/testbucket110)