diff --git a/SUMMARY.md b/SUMMARY.md
index 7c1e41d13..fdf3600b4 100644
--- a/SUMMARY.md
+++ b/SUMMARY.md
@@ -27,7 +27,10 @@
  * [HTTPS](deployment/digital-ocean/https/README.md)
* [Human Connection](deployment/human-connection/README.md)
* [Volumes](deployment/volumes/README.md)
-  * [Neo4J DB Backup](deployment/backup.md)
+  * [Neo4J Offline-Backups](deployment/volumes/neo4j-offline-backup/README.md)
+  * [Volume Snapshots](deployment/volumes/volume-snapshots/README.md)
+  * [Reclaim Policy](deployment/volumes/reclaim-policy/README.md)
+  * [Velero](deployment/volumes/velero/README.md)
* [Legacy Migration](deployment/legacy-migration/README.md)
* [Feature Specification](cypress/features.md)
* [Code of conduct](CODE_OF_CONDUCT.md)
diff --git a/deployment/human-connection/deployment-backend.yaml b/deployment/human-connection/deployment-backend.yaml
index a873b7bb2..51f0eb43c 100644
--- a/deployment/human-connection/deployment-backend.yaml
+++ b/deployment/human-connection/deployment-backend.yaml
@@ -17,6 +17,8 @@
        human-connection.org/selector: deployment-human-connection-backend
  template:
    metadata:
+      annotations:
+        backup.velero.io/backup-volumes: uploads
      labels:
        human-connection.org/commit: COMMIT
        human-connection.org/selector: deployment-human-connection-backend
diff --git a/deployment/human-connection/deployment-neo4j.yaml b/deployment/human-connection/deployment-neo4j.yaml
index 4a715da76..3c4887194 100644
--- a/deployment/human-connection/deployment-neo4j.yaml
+++ b/deployment/human-connection/deployment-neo4j.yaml
@@ -15,6 +15,8 @@
        human-connection.org/selector: deployment-human-connection-neo4j
  template:
    metadata:
+      annotations:
+        backup.velero.io/backup-volumes: neo4j-data
      labels:
        human-connection.org/selector: deployment-human-connection-neo4j
      name: nitro-neo4j
diff --git a/deployment/volumes/README.md b/deployment/volumes/README.md
index b838794d5..2d08a34cb 100644
--- a/deployment/volumes/README.md
+++ b/deployment/volumes/README.md
@@ -7,7 +7,13 @@
At the moment, the application needs two persistent volumes:

As a matter of precaution, the persistent volume claims that set up these
volumes live in a separate folder. You don't want to accidentally lose all
the data in
-your database by running `kubectl delete -f human-connection/`, do you?
+your database by running
+
+```sh
+kubectl delete -f human-connection/
+```
+
+or do you?
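+
+For instance, before running anything destructive, you can list the claims
+that would be affected. A minimal sanity check (assuming the
+`human-connection` namespace used throughout these docs):
+
+```sh
+# list the persistent volume claims that hold the application data
+kubectl --namespace=human-connection get pvc
+```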

## Create Persistent Volume Claims

@@ -19,24 +25,12 @@
persistentvolumeclaim/neo4j-data-claim created
persistentvolumeclaim/uploads-claim created
```

-## Change Reclaim Policy
+## Backup and Restore

-We recommend to change the `ReclaimPolicy`, so if you delete the persistent
-volume claims, the associated volumes will be released, not deleted:
-
-```sh
-$ kubectl --namespace=human-connection get pv
-
-NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS       REASON   AGE
-pvc-bd02a715-66d0-11e9-be52-ba9c337f4551   1Gi        RWO            Delete           Bound    human-connection/neo4j-data-claim   do-block-storage            4m24s
-pvc-bd208086-66d0-11e9-be52-ba9c337f4551   2Gi        RWO            Delete           Bound    human-connection/uploads-claim      do-block-storage            4m12s
-```
-
-Get the volume id from above, then change `ReclaimPolicy` with:
-```sh
-kubectl patch pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
-
-# in the above example
-kubectl patch pv pvc-bd02a715-66d0-11e9-be52-ba9c337f4551 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
-kubectl patch pv pvc-bd208086-66d0-11e9-be52-ba9c337f4551 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
-```
+We tested a couple of options for disaster recovery in Kubernetes. First,
+there is the [offline backup strategy](./neo4j-offline-backup/README.md) of the
+Community Edition of Neo4J, which you can also run on a local installation.
+Kubernetes also offers so-called [volume snapshots](./volume-snapshots/README.md).
+Changing the [reclaim policy](./reclaim-policy/README.md) of your persistent
+volumes might be an additional safety measure. Finally, there is also a
+Kubernetes-specific disaster recovery tool called [Velero](./velero/README.md).
diff --git a/deployment/backup.md b/deployment/volumes/neo4j-offline-backup/README.md
similarity index 97%
rename from deployment/backup.md
rename to deployment/volumes/neo4j-offline-backup/README.md
index 5d6d61866..3638ebc89 100644
--- a/deployment/backup.md
+++ b/deployment/volumes/neo4j-offline-backup/README.md
@@ -23,7 +23,10 @@
So, all we have to do is edit the Kubernetes deployment of our Neo4J database
and set a custom `command` every time we have to carry out tasks like backup,
restore, seed etc.

-{% hint style="info" %} TODO: implement maintenance mode {% endhint %}
+{% hint style="info" %}
+TODO: implement maintenance mode
+{% endhint %}
+
First bring the application into maintenance mode to ensure there are no
database connections left and nobody can access the application.
diff --git a/deployment/volumes/reclaim-policy/README.md b/deployment/volumes/reclaim-policy/README.md
new file mode 100644
index 000000000..00c91c319
--- /dev/null
+++ b/deployment/volumes/reclaim-policy/README.md
@@ -0,0 +1,30 @@
+# Change Reclaim Policy
+
+We recommend changing the `ReclaimPolicy` so that if you delete the persistent
+volume claims, the associated volumes are released, not deleted.
+
+This procedure is optional and an additional safety measure. It might prevent
+you from losing data if you accidentally delete the namespace and the
+persistent volumes along with it.
+
+```sh
+$ kubectl --namespace=human-connection get pv
+
+NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                               STORAGECLASS       REASON   AGE
+pvc-bd02a715-66d0-11e9-be52-ba9c337f4551   1Gi        RWO            Delete           Bound    human-connection/neo4j-data-claim   do-block-storage            4m24s
+pvc-bd208086-66d0-11e9-be52-ba9c337f4551   2Gi        RWO            Delete           Bound    human-connection/uploads-claim      do-block-storage            4m12s
+```
+
+Get the volume id from above, then change the `ReclaimPolicy` with:
+
+```sh
+kubectl patch pv <persistent-volume-id> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
+
+# in the above example
+kubectl patch pv pvc-bd02a715-66d0-11e9-be52-ba9c337f4551 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
+kubectl patch pv pvc-bd208086-66d0-11e9-be52-ba9c337f4551 -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
+```
+
+Given that you changed the reclaim policy as described above, you should be
+able to create a persistent volume claim based on the content of a volume
+snapshot. See the general Kubernetes documentation
+[here](https://kubernetes.io/blog/2018/10/09/introducing-volume-snapshot-alpha-for-kubernetes/)
+and our specific documentation for snapshots [here](../volume-snapshots/README.md).
diff --git a/deployment/volumes/velero/README.md b/deployment/volumes/velero/README.md
new file mode 100644
index 000000000..e469ad117
--- /dev/null
+++ b/deployment/volumes/velero/README.md
@@ -0,0 +1,112 @@
+# Velero
+
+{% hint style="danger" %}
+I tried Velero and it did not work reliably all the time. Sometimes the
+Kubernetes cluster crashes during recovery or data is not fully recovered.
+
+Feel free to test it out and update this documentation once you feel that it's
+working reliably. It is very likely that Digital Ocean had some bugs when I
+tried out the steps below.
+{% endhint %}
+
+We use [Velero](https://github.com/heptio/velero) for on-premise backups. We
+tested version `v0.11.0`; you can find its documentation
+[here](https://heptio.github.io/velero/v0.11.0/).
+
+Our Kubernetes configuration adds some annotations to pods. The annotations
+mark the important persistent volumes that need to be backed up. Velero will
+pick them up and store the volumes in the same cluster, but in another
+namespace called `velero`.
+
+## Prerequisites
+
+You have to install the `velero` binary on your computer, which ships in a
+tarball with each release. We use `v0.11.0`, so visit the
+[release](https://github.com/heptio/velero/releases/tag/v0.11.0) page, then
+download and extract e.g.
+[velero-v0.11.0-linux-amd64.tar.gz](https://github.com/heptio/velero/releases/download/v0.11.0/velero-v0.11.0-linux-amd64.tar.gz).
+
+## Set Up the Velero Namespace
+
+Follow the [getting started](https://heptio.github.io/velero/v0.11.0/get-started)
+instructions to set up the Velero namespace. We use
+[Minio](https://docs.min.io/docs/deploy-minio-on-kubernetes) and
+[restic](https://github.com/restic/restic), so check out Velero's instructions
+on how to set up [restic](https://heptio.github.io/velero/v0.11.0/restic):
+
+```sh
+# run from the extracted folder of the tarball
+$ kubectl apply -f config/common/00-prereqs.yaml
+$ kubectl apply -f config/minio/
+```
+
+Once completed, you should see the namespace in your Kubernetes dashboard.
+
+## Manually Create an On-Premise Backup
+
+When you create your deployments for Human Connection, the required annotations
+should already be in place.
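+
+To verify that the annotation made it onto the running pods, a quick check
+could look like this (a hedged example, assuming the deployments above are
+applied):
+
+```sh
+# each pod should list its volumes under backup.velero.io/backup-volumes
+kubectl --namespace=human-connection describe pods | grep backup-volumes
+```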
+
+So when you create a backup of the namespace `human-connection`:
+
+```sh
+$ velero backup create hc-backup --include-namespaces=human-connection
+```
+
+That should back up your persistent volumes, too. When you enter:
+
+```sh
+$ velero backup describe hc-backup --details
+```
+
+You should see the persistent volumes at the end of the log:
+
+```
+....
+
+Restic Backups:
+  Completed:
+    human-connection/nitro-backend-5b6dd96d6b-q77n6: uploads
+    human-connection/nitro-neo4j-686d768598-z2vhh: neo4j-data
+```
+
+## Simulate a Disaster
+
+Feel free to try out whether you lose any data when you simulate a disaster
+and try to restore the namespace from the backup:
+
+```sh
+$ kubectl delete namespace human-connection
+```
+
+Wait until the wrongdoing has completed, then:
+
+```sh
+$ velero restore create --from-backup hc-backup
+```
+
+Now, I keep my fingers crossed that everything comes back again. If not, I feel
+very sorry for you.
+
+## Schedule a Regular Backup
+
+Check out the [docs](https://heptio.github.io/velero/v0.11.0/get-started). You
+can create a regular schedule e.g. with:
+
+```sh
+$ velero schedule create hc-weekly-backup --schedule="@weekly" --include-namespaces=human-connection
+```
+
+Inspect the created backups:
+
+```sh
+$ velero schedule get
+NAME               STATUS    CREATED                          SCHEDULE   BACKUP TTL   LAST BACKUP   SELECTOR
+hc-weekly-backup   Enabled   2019-05-08 17:51:31 +0200 CEST   @weekly    720h0m0s     6s ago
+
+$ velero backup get
+NAME                              STATUS      CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
+hc-weekly-backup-20190508155132   Completed   2019-05-08 17:51:32 +0200 CEST   29d       default
+
+$ velero backup describe hc-weekly-backup-20190508155132 --details
+# see if the persistent volumes are backed up
+```
diff --git a/deployment/volumes/volume-snapshots/README.md b/deployment/volumes/volume-snapshots/README.md
new file mode 100644
index 000000000..cc66ae4ae
--- /dev/null
+++ b/deployment/volumes/volume-snapshots/README.md
@@ -0,0 +1,50 @@
+# Kubernetes Volume Snapshots
+
+It is possible to back up persistent volumes through volume snapshots. This is
+especially handy if you don't want to stop the database to create an [offline
+backup](../neo4j-offline-backup/README.md), thus incurring downtime.
+
+Kubernetes announced this feature in a [blog post](https://kubernetes.io/blog/2018/10/09/introducing-volume-snapshot-alpha-for-kubernetes/).
+Please make yourself familiar with it before you continue.
+
+## Create a Volume Snapshot
+
+This folder contains an example of how to create a volume snapshot, e.g. for
+the persistent volume claim `neo4j-data-claim`:
+
+```sh
+# in folder deployment/volumes/volume-snapshots/
+kubectl apply -f snapshot.yaml
+```
+
+If you are on Digital Ocean, the volume snapshot should show up in the Web UI:
+
+![Digital Ocean Web UI showing a volume snapshot](./digital-ocean-volume-snapshots.png)
+
+## Provision a Volume Based on a Snapshot
+
+Edit your persistent volume claim configuration and add a `dataSource` pointing
+to your volume snapshot. [The blog post](https://kubernetes.io/blog/2018/10/09/introducing-volume-snapshot-alpha-for-kubernetes/)
+has an example in the section "Provision a new volume from a snapshot with
+Kubernetes".
+
+There is also an example in this folder of what the configuration could look
+like.
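+
+The essential addition is the `dataSource` stanza; this excerpt is taken from
+the `neo4j-data.yaml` in this folder:
+
+```yaml
+spec:
+  dataSource:
+    # reference the volume snapshot created above by name
+    name: neo4j-data-snapshot
+    kind: VolumeSnapshot
+    apiGroup: snapshot.storage.k8s.io
+```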
+
+If you apply the configuration, a new persistent volume claim will be
+provisioned with the data from the volume snapshot:
+
+```sh
+# in folder deployment/volumes/volume-snapshots/
+kubectl apply -f neo4j-data.yaml
+```
+
+## Data Consistency Warning
+
+Note that volume snapshots do not guarantee data consistency. Quote from the
+[blog post](https://kubernetes.io/blog/2018/10/09/introducing-volume-snapshot-alpha-for-kubernetes/):
+
+> Please note that the alpha release of Kubernetes Snapshot does not provide
+> any consistency guarantees. You have to prepare your application (pause
+> application, freeze filesystem etc.) before taking the snapshot for data
+> consistency.
+
+In the case of Neo4J this probably means that the Enterprise Edition is
+required, which supports [online backups](https://neo4j.com/docs/operations-manual/current/backup/).
diff --git a/deployment/volumes/volume-snapshots/digital-ocean-volume-snapshots.png b/deployment/volumes/volume-snapshots/digital-ocean-volume-snapshots.png
new file mode 100644
index 000000000..cb6599616
Binary files /dev/null and b/deployment/volumes/volume-snapshots/digital-ocean-volume-snapshots.png differ
diff --git a/deployment/volumes/volume-snapshots/neo4j-data.yaml b/deployment/volumes/volume-snapshots/neo4j-data.yaml
new file mode 100644
index 000000000..7de9e19dc
--- /dev/null
+++ b/deployment/volumes/volume-snapshots/neo4j-data.yaml
@@ -0,0 +1,18 @@
+---
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: neo4j-data-claim
+  namespace: human-connection
+  labels:
+    app: human-connection
+spec:
+  # provision the new volume from the snapshot instead of empty block storage
+  dataSource:
+    name: neo4j-data-snapshot
+    kind: VolumeSnapshot
+    apiGroup: snapshot.storage.k8s.io
+  accessModes:
+    - ReadWriteOnce
+  resources:
+    requests:
+      storage: 1Gi
diff --git a/deployment/volumes/volume-snapshots/snapshot.yaml b/deployment/volumes/volume-snapshots/snapshot.yaml
new file mode 100644
index 000000000..3c3487e14
--- /dev/null
+++ b/deployment/volumes/volume-snapshots/snapshot.yaml
@@ -0,0 +1,10 @@
+---
+apiVersion: snapshot.storage.k8s.io/v1alpha1
+kind: VolumeSnapshot
+metadata:
+  name: neo4j-data-snapshot
+  namespace: human-connection
+spec:
+  source:
+    # take a snapshot of the volume bound to this persistent volume claim
+    name: neo4j-data-claim
+    kind: PersistentVolumeClaim
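+# After applying this manifest, you can check the snapshot status with e.g.
+#   kubectl --namespace=human-connection get volumesnapshot neo4j-data-snapshot
+# (a hedged example; it assumes the VolumeSnapshot CRDs are installed)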