============
 Monitoring
============

KubeDR's operations can be monitored in several ways:

- Prometheus metrics

- The "Status" section of individual resources

- Kubernetes events

The following sections describe each of these mechanisms.

Prometheus Metrics
==================

*KubeDR* exposes several metrics that can be scraped with
`Prometheus`_ and visualized using `Grafana`_. Most of them concern
the internal implementation, but the following ones are directly
useful to users. Metrics of this kind are widely known as
`RED`_ (rate, errors, duration) metrics.

kubedr_backup_size_bytes (Gauge)
    Size of the backup in bytes.

kubedr_num_backups (Counter)
    Total number of backups.

kubedr_num_successful_backups (Counter)
    Total number of successful backups.

kubedr_num_failed_backups (Counter)
    Total number of failed backups.

kubedr_backup_duration_seconds (Histogram)
    Time (seconds) taken for the backup.

    This metric is a histogram with the following buckets::

        15s, 30s, 1m, 5m, 10m, 15m, 30m, 1h, ...., 10h

All the metrics will have a label called ``policyName`` set to the
name of the ``MetadataBackupPolicy`` resource.
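
Because every sample carries the ``policyName`` label, the counters
can be combined per policy. The following is a minimal sketch, not
part of KubeDR, that parses a few lines in the Prometheus text
exposition format and computes the share of failed backups for each
policy. The sample values, the policy name ``test-backup``, and the
helper names ``parse_samples`` and ``failure_ratio`` are assumptions
made for illustration. The parser handles only simple labelled
samples and ignores timestamps and histogram buckets.

.. code-block:: python

    import re
    from collections import defaultdict

    # Hypothetical scrape output: values and policy name are made up.
    SAMPLE = '''\
    kubedr_num_backups{policyName="test-backup"} 10
    kubedr_num_successful_backups{policyName="test-backup"} 9
    kubedr_num_failed_backups{policyName="test-backup"} 1
    '''

    LINE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
    LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


    def parse_samples(text):
        """Return {policyName: {metric_name: value}} for labelled samples."""
        per_policy = defaultdict(dict)
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines and "# HELP" / "# TYPE" comment lines.
            if not line or line.startswith("#"):
                continue
            match = LINE_RE.match(line)
            if not match:
                continue
            labels = dict(LABEL_RE.findall(match.group("labels")))
            policy = labels.get("policyName")
            if policy is None:
                continue
            per_policy[policy][match.group("name")] = float(match.group("value"))
        return dict(per_policy)


    def failure_ratio(metrics):
        """Failed backups divided by total backups (0.0 when there are none)."""
        total = metrics.get("kubedr_num_backups", 0.0)
        failed = metrics.get("kubedr_num_failed_backups", 0.0)
        return failed / total if total else 0.0

Calling ``parse_samples(SAMPLE)`` yields one entry per policy, and
``failure_ratio`` can then be applied to each entry. In a real setup
the same ratio would more likely be computed inside Prometheus itself.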

.. note::

   More details on how exactly Prometheus can be configured to scrape
   KubeDR's metrics will be provided soon. If you are interested,
   please check out `issue 26`_.

Status of Resources
===================

Kubernetes resources typically have two sections: *spec* and *status*.

*spec* describes the intent of the user, and the cluster constantly
drives towards matching it. *status*, on the other hand, is set by
cluster components and typically contains useful information about
the current state of the resource.

*KubeDR* makes use of the *status* field to set the results of backup
and other operations. The following sections describe the *status*
details for each resource.

BackupLocation
--------------

This resource defines a backup target (an S3 bucket). When it is
created, *KubeDR* initializes a backup repo in the given bucket. The
*status* field of the resource indicates the success or failure of
this operation.

Here is an example of an error condition::

    status:
      initErrorMessage: |+
        Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket50 failed: repository master key and config already initialized

      initStatus: Failed
      initTime: Thu Jan 30 16:02:53 2020

When initialization succeeds::

    status:
      initErrorMessage: ""
      initStatus: Completed
      initTime: Thu Jan 30 16:05:56 2020
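
A script can read these fields to decide whether a backup location is
usable. The following is a minimal sketch, assuming that ``kubectl``
is on the ``PATH`` and that the resource lives in the
``kubedr-system`` namespace. The helper names ``fetch_status`` and
``describe_init`` are hypothetical, and treating a missing
``initStatus`` as "not recorded yet" is also an assumption.

.. code-block:: python

    import json
    import subprocess


    def fetch_status(kind, name, namespace="kubedr-system"):
        """Run kubectl and return the resource's status dict (empty if unset)."""
        result = subprocess.run(
            ["kubectl", "-n", namespace, "get", kind, name, "-o", "json"],
            check=True, capture_output=True, text=True,
        )
        return json.loads(result.stdout).get("status", {})


    def describe_init(status):
        """Turn a BackupLocation status dict into a one-line summary."""
        state = status.get("initStatus")
        if state == "Completed":
            return "initialized at " + status.get("initTime", "unknown time")
        if state == "Failed":
            # The error message may end with blank lines, so strip it.
            return "init failed: " + status.get("initErrorMessage", "").strip()
        return "no init result recorded yet"

For example, ``describe_init(fetch_status("backuplocation",
"local-minio"))`` would summarize the resource named ``local-minio``.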

MetadataBackupPolicy
--------------------

This resource defines the backup policy and its *status* field
indicates details about the most recent backup.

An example::

    status:
      backupErrorMessage: ""
      backupStatus: Completed
      backupTime: Thu Jan 30 16:04:05 2020
      dataAdded: 1573023
      filesChanged: 1
      filesNew: 0
      mbrName: mbr-4c1223d6
      snapshotId: b0f347ef
      totalBytesProcessed: 15736864
      totalDurationSecs: "0.318463127"

Apart from the stats regarding the backup, the status also contains
the name of the ``MetadataBackupRecord`` resource that is required to
perform restores.
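
The numeric fields lend themselves to simple derived figures. The
following is a minimal sketch that condenses such a status into one
line, assuming the status has already been loaded into a Python
dictionary (for instance from ``kubectl ... -o json`` output). The
helper name ``summarize_backup`` and the derived throughput figure
are illustrative only and are not something KubeDR reports itself.

.. code-block:: python

    def summarize_backup(status):
        """Return a one-line summary of a MetadataBackupPolicy status dict."""
        if status.get("backupStatus") != "Completed":
            return "backup not completed: " + status.get("backupErrorMessage", "")
        # totalDurationSecs is a quoted string in the example, so convert it.
        seconds = float(status["totalDurationSecs"])
        processed = int(status["totalBytesProcessed"])
        rate = processed / seconds if seconds > 0 else 0.0
        return "snapshot {} via {}: {} bytes in {:.3f}s ({:.1f} MB/s)".format(
            status.get("snapshotId"), status.get("mbrName"),
            processed, seconds, rate / 1e6,
        )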

MetadataRestore
---------------

This resource defines a restore and its *status* field indicates
success or failure of the operation.

Success::

    restoreErrorMessage: ""
    restoreStatus: Completed

Error::

    restoreErrorMessage: Error in creating restore pod
    restoreStatus: Failed

Events
======

After some operations, *KubeDR* generates events that admins can
monitor. The following sections provide more details about each such
event. Note that events are generated in the *kubedr-system*
namespace.

Backup repo initialization
--------------------------

When a ``BackupLocation`` resource is created for the first time, a
backup repo is initialized at the given S3 bucket. An event is
generated at the end of the init process.

Here is an example of the event generated after successful
initialization::

    $ kubectl -n kubedr-system get event

    ...
    25s  Normal  InitSucceeded    backuplocation/local-minio   Repo at s3:http://10.106.189.174:9000/testbucket62 is successfully initialized

In case of error::

    $ kubectl -n kubedr-system get event

    ...
    5s   Error  InitFailed        backuplocation/local-minio   Fatal: create key in repository at s3:http://10.106.189.174:9000/testbucket62 failed: repository master key and config already initialized

.. _Backup events:


Backup
------

After every backup, an event is generated containing details about
success or failure. In the latter case, the event contains the
relevant error message. Here are a couple of sample events.

Success::

    Normal  BackupSucceeded  metadatabackuppolicy/test-backup  Backup completed, snapshot ID: 34abbf1b

Error::

    Error  BackupFailed  metadatabackuppolicy/test-backup  subprocess.CalledProcessError: 
        Command '['restic', '--json', '-r', 's3:http://10.106.189.174:9000/testbucket63', 
            '--verbose', 'backup', '/data']' returned non-zero exit status 1. 
            (Fatal: unable to open config file: Stat: The access key ID you provided does not exist 
            in our records. Is there a repository at the following location?
            s3:http://10.106.189.174:9000/testbucket63

Restore
-------

After every restore, an event is generated containing details about
success or failure. In the latter case, the event contains the
relevant error message. Here are a couple of sample events.

Success::

    Normal  RestoreSucceeded metadatarestore/mrtest  Restore from snapshot 5bbc8b1a completed

Error::

    Error RestoreFailed  metadatarestore/mrtest subprocess.CalledProcessError: 
        Command '['restic', '-r', 's3:http://10.106.189.175:9000/testbucket110', 
        '--verbose', 'restore', '--target', '/restore', '5bbc8b1a']' returned non-zero exit 
        status 1. (Fatal: unable to open config file: Stat: 
        Get http://10.106.189.175:9000/testbucket110/?location=: 
        dial tcp 10.106.189.175:9000: i/o timeout
        Is there a repository at the following location?
        s3:http://10.106.189.175:9000/testbucket110)
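
The failure reasons shown in the examples (``InitFailed``,
``BackupFailed`` and ``RestoreFailed``) can be picked out of the event
stream programmatically. The following is a minimal sketch that
filters the parsed output of ``kubectl -n kubedr-system get event -o
json``. It relies on the standard Kubernetes event fields ``reason``,
``message`` and ``involvedObject``; the helper name ``failed_events``
is hypothetical, and the set of reasons is limited to the ones that
appear in this document.

.. code-block:: python

    # Reasons taken from the sample events in this document.
    FAILURE_REASONS = {"InitFailed", "BackupFailed", "RestoreFailed"}


    def failed_events(event_list):
        """Yield (reason, "kind/name", message) for each failure event.

        event_list is the dict obtained by parsing the JSON output of
        "kubectl get event -o json", which holds events under "items".
        """
        for item in event_list.get("items", []):
            if item.get("reason") not in FAILURE_REASONS:
                continue
            obj = item.get("involvedObject", {})
            target = "{}/{}".format(obj.get("kind", "").lower(), obj.get("name", ""))
            yield item["reason"], target, item.get("message", "")

A caller would typically load the JSON with ``json.load`` and then
iterate over ``failed_events(...)`` to raise alerts or write a log.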

.. _Prometheus: https://prometheus.io
.. _Grafana: https://grafana.com
.. _RED: https://www.scalyr.com/blog/red-and-monitoring-three-key-metrics-and-why-they-matter/
.. _issue 26: https://github.com/catalogicsoftware/kubedr/issues/26