VMware Disaster Recovery Best Practices

October 30, 2018

Disaster recovery is a set of measures for recovering the components of an infrastructure after a failure has occurred. It also aims to minimize the negative effects of a disaster and to ensure business continuity. To prepare for possible disasters, companies usually compose a disaster recovery plan, which should be part of a business continuity plan.

Virtual machines are among the components at risk in the event of a disaster, which is why you should prepare by developing a disaster recovery plan. This blog post explores best practices for disaster recovery (DR) of a VMware virtual environment.

Compose a Disaster Recovery Plan

A disaster recovery plan is a structured document that describes the disaster recovery process as a set of actions to be performed by designated persons when a disaster occurs. The document also defines the criteria that must be met for the plan to be launched. Disasters can have both natural and man-made causes, so a DR plan should include separate recovery scenarios for different disaster types and unplanned incidents. For example, a DR plan can describe what to do in the event of a ransomware attack, a power outage, a hardware failure, an earthquake, or a typhoon. A DR plan can also be divided into sections: for example, the first section could explain network recovery, the second could focus on data center recovery, and the third could explain VM recovery.

Prepare Your Recovery Site

A disaster recovery site is a location that a company can use to recover infrastructure and workloads when the primary site used for production fails. Disaster recovery sites can be hot, warm, or cold.

A hot site is a fully functional DR site that is equipped with configured ESXi servers, storage, VM replicas, and user data. If the primary site fails after a disaster, a hot site is ready to be used immediately. Deploying a hot site is costly, but it provides the fastest possible recovery.
A warm site contains some equipment, such as network equipment, gateway servers, ESXi hosts, and storage, but may not contain VMs and user data. In this case, VMs must be recovered from backups, and user data may need to be copied, too. Additional equipment and software can be installed during the disaster recovery process. A warm site is therefore a compromise that comes at a moderate cost and provides an acceptable recovery time.
A cold site is a DR site that has only basic infrastructure. When disaster strikes, servers must be configured, storage must be deployed, VMs must be recovered, and user data may need to be extracted from backups. Recovering VMs and workloads at this type of DR site requires more effort and takes a long time, but a cold site is the least expensive of the three site types.

Have Backups and Replicas Created Automatically

VM backups and replicas are the most important components of disaster recovery in a VMware vSphere virtual environment. A backup is a copy of VM data that is stored in a safe place. Backed-up data can be compressed, and restoring it takes time. A VM replica is an identical copy of the source VM that resides on an ESXi host, is ready to start when needed, and is used during failover. Avoid relying on manual backups: if a backup is skipped or performed too rarely, important changes can be lost when disaster strikes. Use host-level VM data protection software that can create VM backups and VM replicas automatically on a schedule.
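
As a rough illustration of why a schedule matters, the following minimal sketch flags VMs whose most recent backup is older than the allowed interval. The VM names, timestamps, and the 24-hour interval are hypothetical; a real check would read recovery point times from your data protection software.

```python
from datetime import datetime, timedelta

def overdue_vms(last_backup, now, max_age):
    """Return the sorted names of VMs whose latest backup is older than max_age."""
    return sorted(
        name for name, taken in last_backup.items() if now - taken > max_age
    )

# Hypothetical inventory: VM name -> time of the latest successful backup.
now = datetime(2018, 10, 30, 12, 0)
last_backup = {
    "web-01": datetime(2018, 10, 30, 2, 0),  # 10 hours old
    "db-01": datetime(2018, 10, 28, 2, 0),   # 58 hours old
}

print(overdue_vms(last_backup, now, timedelta(hours=24)))  # ['db-01']
```

A report like this is only a safety net; the schedule itself should still be enforced by the backup software.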

Use VMware Clustering Features

VMware provides clustering features such as the Distributed Resource Scheduler (DRS) cluster, the High Availability (HA) cluster, and Fault Tolerance (FT), which is available for VMs in an HA cluster. An HA cluster helps you minimize VM downtime, while Fault Tolerance allows you to avoid VM downtime in the event of a hardware failure. Be aware that clustering features are not a substitute for backup and replication; rather, High Availability with Fault Tolerance and backup with replication complement each other. HA and FT cannot protect data against corruption, the deletion of files inside VMs, unsuccessful software updates, or other software failures.

Use Appropriate VM Recovery Order

Virtual machines should be recovered in the appropriate order. Imagine that you have multiple VMs running applications that depend on each other. The classic example is a VM with an Active Directory domain controller, a VM with a database server, and a VM with a web server. These VMs must be started in the following order:

The VM with the domain controller is started first.
The VM with the database server is started once the VM with the domain controller is running, because the database server uses the domain controller for user authentication.
The VM with the web server is started once the VM with the database server is running, because the web server needs the database to operate properly in this case.

If you have a VM with an MS Exchange mail server, that VM must start after the VM with the domain controller, because MS Exchange is integrated with Active Directory for user authentication.
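
The start order described above can be derived from the dependencies themselves. The sketch below is one possible way to do this with a topological sort from the Python standard library (Python 3.9 or later). The VM names are hypothetical, and the dependency map simply mirrors the example in the text.

```python
from graphlib import TopologicalSorter

# Hypothetical VM names. Each VM maps to the set of VMs that must be
# running before it can start.
dependencies = {
    "domain-controller": set(),
    "database": {"domain-controller"},
    "web": {"database"},
    "exchange": {"domain-controller"},
}

def recovery_order(deps):
    """Return VM names ordered so that every VM follows its dependencies.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())

order = recovery_order(dependencies)
print(order)  # the domain controller always comes first
```

VMs that do not depend on each other, such as the database and Exchange VMs here, may appear in either relative order, which also shows which VMs could be started in parallel.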

Use Appropriate VM Network Configuration

A production site and a disaster recovery site may use different networks for VM connections. The virtual network adapters of VMs are connected to ports of virtual switches (vSwitches), and port groups represent different networks with their own network names and addresses. If you recover a VM to a DR site while the VM is still configured for the production network (which differs from the network used for VMs at the DR site), the VM cannot establish a network connection. Therefore, don’t forget to change the network settings of VMs when you recover them at the DR site.
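
One way to avoid forgetting this step is to keep an explicit mapping from production port groups to DR port groups and to apply it during recovery. The following is a minimal sketch under that assumption; the port group names and adapter labels are hypothetical, and the code does not call any VMware API.

```python
# Hypothetical mapping: production port group -> DR port group.
NETWORK_MAP = {
    "Prod-VM-Network": "DR-VM-Network",
    "Prod-DB-Network": "DR-DB-Network",
}

def remap_networks(vm_adapters, network_map):
    """Return a copy of the adapter settings pointing at DR port groups.

    vm_adapters maps an adapter label to its production port group.
    Raises ValueError if any port group has no DR counterpart, so that a
    missing mapping is noticed before the VM is powered on.
    """
    unmapped = sorted(
        {pg for pg in vm_adapters.values() if pg not in network_map}
    )
    if unmapped:
        raise ValueError(f"No DR network defined for: {', '.join(unmapped)}")
    return {adapter: network_map[pg] for adapter, pg in vm_adapters.items()}

db_vm = {"nic0": "Prod-VM-Network", "nic1": "Prod-DB-Network"}
print(remap_networks(db_vm, NETWORK_MAP))
# {'nic0': 'DR-VM-Network', 'nic1': 'DR-DB-Network'}
```

Failing loudly on an unmapped network is a deliberate choice here: a recovered VM with no connectivity is harder to diagnose during an outage than an error raised beforehand.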

Prepare Your VM Storage

The storage used at a DR site must have enough free space to store the recovered VMs. This is the first and most critical requirement. The storage must also provide sufficient performance; otherwise, the business-critical services running on the VMs may lag. If network-based storage such as NAS (Network Attached Storage) or SAN (Storage Area Network) is used, the network must be fast enough to carry the storage traffic. The storage network at a DR site must be a dedicated network that is separated from other networks.
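
The capacity requirement can be checked with simple arithmetic. The sketch below sums the sizes of the VMs to be recovered, adds some headroom, and reports any shortfall. The VM sizes and the 20 percent headroom are assumptions chosen for illustration, not recommended values.

```python
import math

def storage_shortfall_gb(vm_sizes_gb, free_gb, headroom_percent=20):
    """Return how many GB are missing at the DR site (0 if space suffices).

    headroom_percent is an assumed safety margin on top of the VM sizes,
    for example to leave room for snapshots and growth.
    """
    total = sum(vm_sizes_gb.values())
    required = math.ceil(total * (100 + headroom_percent) / 100)
    return max(0, required - free_gb)

# Hypothetical VM sizes in GB.
vm_sizes_gb = {"dc-01": 100, "db-01": 500, "web-01": 200}

print(storage_shortfall_gb(vm_sizes_gb, free_gb=900))   # 60
print(storage_shortfall_gb(vm_sizes_gb, free_gb=1000))  # 0
```

Here the VMs total 800 GB, so 960 GB is required with headroom; a datastore with 900 GB free is 60 GB short.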

Test Your Recovery Plan Regularly

A disaster recovery plan may look good on paper but prove useless in an actual disaster if it has not been tested in advance. Thus, be sure to test your DR plan on a regular basis. Testing allows you to check whether the DR plan is workable and whether the recovery time objective (RTO) and recovery point objective (RPO) can be met. Testing also reveals weaknesses in the DR plan, so you can make adjustments to fix them.
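
To make the RTO and RPO check concrete, here is a minimal sketch that evaluates the timings recorded during one DR test. The timestamps and targets are made up for illustration: downtime is measured from the simulated failure to the moment services are restored, and the data loss window is the time between the last recovery point and the failure.

```python
from datetime import datetime, timedelta

def evaluate_dr_test(failure, last_recovery_point, services_restored, rto, rpo):
    """Compare one DR test run against the RTO and RPO targets."""
    downtime = services_restored - failure
    data_loss_window = failure - last_recovery_point
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }

# Hypothetical test run: 2 h 15 min of downtime, 30 min of lost changes.
result = evaluate_dr_test(
    failure=datetime(2018, 10, 30, 9, 0),
    last_recovery_point=datetime(2018, 10, 30, 8, 30),
    services_restored=datetime(2018, 10, 30, 11, 15),
    rto=timedelta(hours=2),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this example the RPO is met but the RTO is missed by 15 minutes, which is exactly the kind of gap a test is meant to reveal before a real disaster does.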

Regular testing also confirms that your vSphere virtual environment can still be recovered as it evolves. Infrastructure changes over time, and after such changes a DR plan that previously worked may no longer meet the requirements. For example, VMs may be added, IP addresses may be changed, or applications may be migrated from one VM to another. Regular testing shows which parts of the plan must be updated after infrastructure changes, which keeps the DR plan effective.

Find the Right Site Recovery Solution

Once you have composed the DR plan, find the site recovery solution that best meets your needs. For VMware vSphere, the solution should support host-level VM backup and replication, fast restore from backup, failover to a VM replica, full VM recovery, and individual object recovery. Choose a solution whose functionality also allows regular DR plan testing and updates.

NAKIVO Backup & Replication for VMware Disaster Recovery

NAKIVO Backup & Replication is a fast, reliable, and affordable VM data protection solution that can protect your VMware VMs. Among many other things, the product can perform host-level VM backup and replication, individual object recovery, instant VM recovery, and failover to a VM replica. No agents need to be installed on VMs, because VMware vStorage APIs for Data Protection are used. Moreover, NAKIVO Backup & Replication includes a new Site Recovery functionality, with which you can perform disaster recovery of entire sites, including but not limited to VMware VMs.

Site Recovery Overview

Site Recovery is a powerful feature that helps you recover your VMs from one site to another in the event of a disaster. This feature can also be used for planned VM migration between sites. You can build automated recovery workflows and run them for planned or emergency failover, as well as for testing purposes.

Site Recovery Features

Site Recovery allows you to automate and orchestrate a VM disaster recovery process. The feature includes a set of actions and conditions that you can combine into a site recovery workflow (job) according to your disaster recovery plan. These actions are:

Failover VMs. You can fail over to a VM replica (the VM replica must be created before performing the failover action).
Failback VMs. You can transfer workloads back from a VM replica stored at a DR site to a source VM stored at a production site.
Start VMs. You can start one or multiple VMs.
Stop VMs. You can stop one or multiple VMs.
Run jobs. You can run jobs (backup, replication, Flash VM Boot, etc.) created in your NAKIVO Backup & Replication instance.
Stop jobs. You can stop running jobs.
Run script. You can run a script on a machine with the instance of NAKIVO Backup & Replication, on a remote Windows machine, a remote Linux machine, a VMware VM, a Hyper-V VM, or an EC2 instance.
Attach repository. You can attach a backup repository.
Detach repository. You can detach the already attached backup repository.
Send emails. You can send an email after the appropriate action, for example, if VM failover was completed successfully.
Wait. You can wait for a defined time before proceeding to the next action.
Check condition. You can check the following conditions before proceeding to the next action: if a resource exists, if a resource is running, and if IP/hostname is reachable.

You can flexibly combine the listed actions to create different site recovery jobs for different use cases and scenarios. Click the Run Job button, and all actions are started automatically in the defined order. Site recovery jobs can be run manually in production or test mode, but when you configure site recovery jobs to run automatically as scheduled tasks, they run in test mode.
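
To illustrate the general idea of running actions and condition checks in a defined order, here is a small generic sketch. It is not NAKIVO's API or implementation; the `Step` class, the `run_workflow` function, and the stand-in actions are all hypothetical and only model a workflow that stops at the first failed step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # returns True on success, False on failure

def run_workflow(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order and stop at the first one that fails."""
    results = []
    for step in steps:
        try:
            ok = bool(step.action())
        except Exception:
            ok = False  # treat an unexpected error as a failed step
        results.append((step.name, "ok" if ok else "failed"))
        if not ok:
            break
    return results

# Stand-ins for real checks and actions.
state = {"replica_exists": True, "started": []}

def start_vm(name: str) -> bool:
    state["started"].append(name)
    return True

steps = [
    Step("Check condition: replica exists", lambda: state["replica_exists"]),
    Step("Fail over domain controller", lambda: start_vm("dc-replica")),
    Step("Fail over database server", lambda: start_vm("db-replica")),
]

for name, status in run_workflow(steps):
    print(f"{name}: {status}")
```

Stopping at the first failure keeps later actions, such as starting a database VM, from running when an earlier prerequisite was not satisfied.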

Site Recovery Benefits

Site Recovery is a powerful, convenient, and intuitive feature. It can simplify disaster recovery for VMware vSphere virtual environments and allows you to spend less effort and money on business continuity.

To summarize the benefits of Site Recovery:

It helps you implement your complex site recovery plans in the framework of your disaster recovery strategy.
It automates a disaster recovery process.
It reduces the time spent on disaster recovery. (As a result, you have less downtime, fewer service interruptions, and lower costs.)
Site recovery jobs can be tested automatically to detect whether your site recovery plan is up to date, as well as whether RPO and RTO can be met.
Site Recovery is not a standalone feature; it is built into a powerful and universal VM data protection solution, where it can be managed from a single pane of glass.
It has an affordable pricing policy. You don’t need to buy a separate license for using Site Recovery if you already have a license for the appropriate NAKIVO Backup & Replication edition.

Conclusion

The disaster recovery of a VMware vSphere virtual environment is an important process in ensuring business continuity. VMware disaster recovery best practices include the creation of a disaster recovery plan, as well as the automatic creation of VM replicas that are required for VM failover. Using VM backup and replication in addition to vSphere clustering features is recommended. Define your VM recovery order, prepare your disaster recovery site (including the network and storage components), make sure to test your disaster recovery plan regularly, and use a suitable data protection solution that supports host-level VM backup, replication, and recovery.

NAKIVO Backup & Replication is a universal VM data protection solution with support for VMware virtual machines. Site Recovery is a powerful new feature that has been included in NAKIVO Backup & Replication since version 8.0. Site Recovery allows you to implement your disaster recovery plan by creating automated site recovery jobs. This useful feature helps you orchestrate and automate the disaster recovery process, recover VM data quickly, and ensure a high level of data protection. Download NAKIVO Backup & Replication with Site Recovery and try the product in your VMware vSphere environment.