8. Monitoring¶
There are multiple ways in which KubeDR’s operations can be monitored. They include:
Prometheus metrics
“Status” section of individual resources
Kubernetes events
The following sections will elaborate on each of these mechanisms.
8.1. Prometheus Metrics¶
KubeDR exposes several metrics that can be scraped by Prometheus and visualized using Grafana. Most of the metrics deal with the internal implementation, but the following ones provide very useful information to the user. They correspond to the widely known RED metrics (Rate, Errors, Duration).
- kubedr_backup_size_bytes (Gauge)
Size of the backup in bytes.
- kubedr_num_backups (Counter)
Total number of backups.
- kubedr_num_successful_backups (Counter)
Total number of successful backups.
- kubedr_num_failed_backups (Counter)
Total number of failed backups.
- kubedr_backup_duration_seconds (Histogram)
Time (seconds) taken for the backup.
This metric is a histogram with the following buckets:
15s, 30s, 1m, 5m, 10m, 15m, 30m, 1h, ..., 10h
All the metrics will have a label called policyName set to the name of the MetadataBackupPolicy resource.
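As an illustration of how these metrics might be consumed, here is a minimal Prometheus alerting rule that fires when a backup fails. This is only a sketch: it assumes Prometheus is already scraping KubeDR's metrics, and the group name, alert name, and threshold are placeholders to adapt.

groups:
- name: kubedr.rules
  rules:
  - alert: KubeDRBackupFailed
    # Fires if the failure counter increased during the last hour.
    expr: increase(kubedr_num_failed_backups[1h]) > 0
    labels:
      severity: warning
    annotations:
      summary: "Backup failed for policy {{ $labels.policyName }}"

Similarly, a query such as histogram_quantile(0.9, sum by (le, policyName) (rate(kubedr_backup_duration_seconds_bucket[1d]))) would approximate the 90th percentile backup duration per policy in a Grafana panel.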
Note
More details on how exactly Prometheus can be configured to scrape KubeDR’s metrics will be provided soon. If you are interested, please check out issue 26.
8.2. Status of Resources¶
All Kubernetes resources have two sections - spec and status.
spec describes the intent of the user, and the cluster constantly drives towards matching it. status, on the other hand, is set by cluster components and typically contains useful information about the current state of the resource.
KubeDR uses the status field to record the results of backups and other operations. The following sections describe the status details for each resource.
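The status of any of these resources can be inspected with kubectl. A quick sketch (assuming the resources live in the kubedr-system namespace, as in the event examples later in this section, and using the illustrative resource name local-minio from those examples):

$ kubectl -n kubedr-system get backuplocation local-minio -o yaml
# Or extract just the status section:
$ kubectl -n kubedr-system get backuplocation local-minio -o jsonpath='{.status}'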
8.2.1. BackupLocation¶
This resource defines a backup target (an S3 bucket). When it is created, KubeDR initializes a backup repo at the given bucket, and the status field of the resource indicates the success or failure of that operation.
Here is an example of an error condition:
status:
  initErrorMessage: |+
    Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket50 failed: repository master key and config already initialized
  initStatus: Failed
  initTime: Thu Jan 30 16:02:53 2020
When initialization succeeds:
status:
  initErrorMessage: ""
  initStatus: Completed
  initTime: Thu Jan 30 16:05:56 2020
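If you are scripting around BackupLocation creation, the initStatus field can be used to gate further steps. A minimal sketch (the resource name local-minio is taken from the event examples later in this section and may differ in your setup):

$ until [ "$(kubectl -n kubedr-system get backuplocation local-minio \
      -o jsonpath='{.status.initStatus}')" = "Completed" ]; do sleep 5; done

Note that this simple loop would spin forever if initialization ends in Failed, so a real script should check for that value as well.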
8.2.2. MetadataBackupPolicy¶
This resource defines the backup policy and its status field indicates details about the most recent backup.
An example:
status:
  backupErrorMessage: ""
  backupStatus: Completed
  backupTime: Thu Jan 30 16:04:05 2020
  dataAdded: 1573023
  filesChanged: 1
  filesNew: 0
  mbrName: mbr-4c1223d6
  snapshotId: b0f347ef
  totalBytesProcessed: 15736864
  totalDurationSecs: "0.318463127"
Apart from the stats regarding the backup, the status also contains the name of the MetadataBackupRecord resource that is required to perform restores.
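The record name can be extracted directly, for example (a sketch, assuming the policy is named test-backup as in the backup event example later in this section):

$ kubectl -n kubedr-system get metadatabackuppolicy test-backup \
      -o jsonpath='{.status.mbrName}'
# Prints the MetadataBackupRecord name, e.g. mbr-4c1223d6 in the status above.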
8.2.3. MetadataRestore¶
This resource defines a restore and its status field indicates success or failure of the operation.
Success:
restoreErrorMessage: ""
restoreStatus: Completed
Error:
restoreErrorMessage: Error in creating restore pod
restoreStatus: Failed
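Since restores run asynchronously, a script can poll these fields. A minimal sketch, assuming a MetadataRestore named mrtest (the name used in the restore event examples below) created in the kubedr-system namespace:

# Wait until restoreStatus is set (either Completed or Failed), then report it.
$ until [ -n "$(kubectl -n kubedr-system get metadatarestore mrtest \
      -o jsonpath='{.status.restoreStatus}')" ]; do sleep 5; done
$ kubectl -n kubedr-system get metadatarestore mrtest \
      -o jsonpath='{.status.restoreStatus}{"\n"}{.status.restoreErrorMessage}{"\n"}'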
8.3. Events¶
KubeDR generates Kubernetes events for certain operations so that admins can monitor them. The following sections provide more details about each such event. Note that events are generated in the namespace kubedr-system.
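Events can be watched or filtered with kubectl. For example (the reason values match the event examples in the following sections):

$ kubectl -n kubedr-system get events -w
# Filter for backup failures only:
$ kubectl -n kubedr-system get events --field-selector reason=BackupFailed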
8.3.1. Backup repo initialization¶
When a BackupLocation resource is created for the first time, a backup repo is initialized at the given S3 bucket. An event is generated at the end of the initialization.
Here is an example of the event generated after successful initialization:
$ kubectl -n kubedr-system get event
...
25s Normal InitSucceeded backuplocation/local-minio Repo at s3:http://10.106.189.174:9000/testbucket62 is successfully initialized
In case of error:
$ kubectl -n kubedr-system get event
...
5s Error InitFailed backuplocation/local-minio Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket62 failed: repository master key and config already initialized
8.3.2. Backup¶
After every backup, an event is generated indicating success or failure; in the case of the latter, the event contains the relevant error message. Here are a couple of sample events.
Success:
Normal BackupSucceeded metadatabackuppolicy/test-backup Backup completed, snapshot ID: 34abbf1b
Error:
Error BackupFailed metadatabackuppolicy/test-backup subprocess.CalledProcessError:
Command '['restic', '--json', '-r', 's3:http://10.106.189.174:9000/testbucket63',
'--verbose', 'backup', '/data']' returned non-zero exit status 1.
(Fatal: unable to open config file: Stat: The access key ID you provided does not exist
in our records. Is there a repository at the following location?
s3:http://10.106.189.174:9000/testbucket63)
8.3.3. Restore¶
After every restore, an event is generated indicating success or failure; in the case of the latter, the event contains the relevant error message. Here are a couple of sample events.
Success:
Normal RestoreSucceeded metadatarestore/mrtest Restore from snapshot 5bbc8b1a completed
Error:
Error RestoreFailed metadatarestore/mrtest subprocess.CalledProcessError:
Command '['restic', '-r', 's3:http://10.106.189.175:9000/testbucket110',
'--verbose', 'restore', '--target', '/restore', '5bbc8b1a']' returned non-zero exit
status 1. (Fatal: unable to open config file: Stat:
Get http://10.106.189.175:9000/testbucket110/?location=:
dial tcp 10.106.189.175:9000: i/o timeout
Is there a repository at the following location?
s3:http://10.106.189.175:9000/testbucket110)