Monday, February 27, 2017

Exchange Server–Disaster Recovery – Reference Architectures

I am planning to cover all disaster recovery scenarios in this article, including Office 365 architectures.

Let's cover some basics and go back a few years so this article is easier to follow. Exchange Server has evolved into a mature mail system over the years, and the replication technology that ships with it has become stable and reliable. If you configure it correctly you can rely on it, and there is no need to invest in third-party replication technologies to attain a better RPO (Recovery Point Objective) and RTO (Recovery Time Objective). Cloud-based feature sets and hybrid technologies make it richer still. Every feature added to the product has now been tested in Office 365, which hosts millions of mailboxes in the cloud. In the past, x86 systems and slow connectivity made architectures more stretched and complex. Modern connectivity and cheap storage and hardware make an Exchange deployment more affordable and bring us back to a single server architecture. And of course x64 systems don't limit us on performance.

Hybrid configurations are more effective. For a large user base they move the IT overhead to the cloud; for a small user base, spending on the hardware itself is more affordable. Disaster recovery is a must for every enterprise, and proper planning is always required for business continuity.

Let's start with the basics and a few recommendations for your Exchange Server disaster recovery scenarios –

First, turn on Datacenter Activation Coordination (DAC) mode, which prevents split-brain syndrome at the application level. The only case where DAC mode should not be used is when you rely on third-party replication software.
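
As a minimal sketch, DAC mode can be checked and enabled from the Exchange Management Shell. The DAG name DAG1 is a placeholder I am assuming for illustration; substitute your own.

```powershell
# Assumption: the DAG is named DAG1 (hypothetical name, not from this article).
# Check the current setting first.
Get-DatabaseAvailabilityGroup -Identity DAG1 | Format-List Name,DatacenterActivationMode

# Turn DAC mode on for the DAG.
Set-DatabaseAvailabilityGroup -Identity DAG1 -DatacenterActivationMode DagOnly
```

Running the Get- command again afterwards should show DatacenterActivationMode as DagOnly.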

Basics –

  • Odd number of nodes in a cluster: the witness does not have a vote.
  • Even number of nodes in a cluster: the witness has a vote.
  • From Windows Server 2012 R2 – Dynamic Witness (an extension of Dynamic Quorum, which arrived in Windows Server 2012) – the cluster decides whether to use the witness vote based on the number of votes available in the cluster.

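For Dummies – The Windows cluster voting system decides whether the cluster stays up or goes down. The cluster stays up only while a majority of the votes is available, that is n/2 rounded down, plus 1, out of n votes. With 2 votes, 2/2 + 1 = 2, which means a minimum of 2 votes has to be up for the cluster to remain active. With 3 votes (three nodes, or two nodes plus a witness), 3/2 = 1.5, rounded down to 1, plus 1 = 2, which again means a minimum of 2 votes has to be up for the cluster to remain active.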
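
The arithmetic above can be sketched as a small PowerShell function. Get-QuorumMajority is a hypothetical helper written for this article, not a built-in cmdlet, and it only models the majority rule, not the cluster's dynamic vote adjustments.

```powershell
# Hypothetical helper (not a built-in cmdlet): minimum number of votes
# that must be online for a cluster with $TotalVotes votes to keep quorum.
function Get-QuorumMajority {
    param(
        [Parameter(Mandatory)]
        [int]$TotalVotes
    )
    # Majority = floor(n / 2) + 1
    return ([int][math]::Floor($TotalVotes / 2)) + 1
}

Get-QuorumMajority -TotalVotes 2   # 2
Get-QuorumMajority -TotalVotes 3   # 2 (two nodes plus a witness, or three nodes)
Get-QuorumMajority -TotalVotes 4   # 3
Get-QuorumMajority -TotalVotes 5   # 3
```

Note that 2 votes and 3 votes both need 2 votes online, which is why a witness is added to an even number of nodes: it lets the cluster survive one failure.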

There is a common misconception that in a two-node cluster, if one node and the witness go offline, the last node can dynamically adjust the quorum and keep the databases online. Oh wait. Dynamic Quorum doesn't support simultaneous failures. For example, a two-node cluster cannot sustain one node and the witness going down at the same time.

How to see the votes for a two-node cluster using a witness –

Note: If you are using a cluster without a cluster administrative access point (an IP-less Database Availability Group (DAG)), you cannot use the GUI to manage or troubleshoot the cluster. For now you can use only PowerShell.

Get-ClusterNode | FT Name,DynamicWeight,State -AutoSize
(Get-Cluster).WitnessDynamicWeight
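
As a hedged follow-up to the two commands above, the sketch below also shows the configured vote next to the dynamic one and confirms whether dynamic quorum is enabled. It assumes the FailoverClusters PowerShell module is available on the DAG member where you run it.

```powershell
# 1 means dynamic quorum is enabled for the cluster, 0 means disabled.
(Get-Cluster).DynamicQuorum

# NodeWeight is the configured vote; DynamicWeight is the vote the
# cluster is currently counting for that node.
Get-ClusterNode | Format-Table Name,NodeWeight,DynamicWeight,State -AutoSize

# Shows which quorum resource (witness) the cluster is using.
Get-ClusterQuorum
```

A DynamicWeight of 0 on a node with a NodeWeight of 1 means the cluster has temporarily removed that node's vote.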

Placing the witness server in a third site was not recommended in Exchange 2010, but that is no longer the case. In Exchange 2013 and Exchange 2016 it is recommended to place the witness in a third site to achieve an active/active datacenter design.

Next is an active/active design for a large enterprise using virtual infrastructure and load-balancing appliances. Exchange 2013/2016 supports a unified URL, such as mail.domain.com, across site 1 and site 2. For anti-spam appliances, it is always recommended to use DNS round-robin for high availability, with MX records that share the same DNS preference. Load balancers with reverse-proxy capabilities can replace your existing TMG if you have one.
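
One quick way to confirm that the MX records really share the same preference is to query them. This is only a sketch: domain.com is the placeholder domain used in this article, and Resolve-DnsName needs the DnsClient module that ships with Windows 8 / Windows Server 2012 and later.

```powershell
# Placeholder domain from this article; replace with your own mail domain.
Resolve-DnsName -Name domain.com -Type MX |
    Sort-Object Preference |
    Format-Table Name,NameExchange,Preference -AutoSize
```

For round-robin across both sites, every MX row should show the same Preference value.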

Suitability – An enterprise should go active/active only when the type of business requires that level of availability and the link between the sites is affordable and reliable. Having an equal number of active users in each site makes this design more suitable.

Failure Scenarios –

  • When Site 1 fails – Site 2 takes over, as it can still reach the witness.
  • When Site 2 fails – Site 1 takes over, as it can still reach the witness.
  • When Site 1 and the witness fail – Site 2 can be restored using the datacenter switchover procedure, with the Stop-DatabaseAvailabilityGroup, Restore-DatabaseAvailabilityGroup and Start-DatabaseAvailabilityGroup PowerShell cmdlets.
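
The third scenario above can be sketched roughly as follows, assuming DAC mode is enabled. Every name here (DAG1, Site1, Site2, FS2 and the witness directory) is a hypothetical placeholder, and the exact order and options depend on your environment, so treat this as an outline rather than a runbook.

```powershell
# All names are hypothetical placeholders: DAG1, Site1, Site2, FS2.

# 1. Mark the failed site's DAG members as stopped. -ConfigurationOnly updates
#    Active Directory only, because the servers in Site1 cannot be reached.
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Site1 -ConfigurationOnly

# 2. On each surviving DAG member in Site2, stop the Cluster service.
Stop-Service -Name ClusSvc

# 3. Activate the surviving site and point the DAG at an alternate witness.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Site2 `
    -AlternateWitnessServer FS2 -AlternateWitnessDirectory 'C:\DAGFileShareWitnesses\DAG1'

# 4. Later, once Site1 is healthy again, bring its members back into the DAG.
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Site1
```

Rehearsing these steps in a lab is what gives you a realistic RTO figure.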

 

  • LTM (Local Traffic Manager) – Hardware load balancer.
  • With BIG-IP DNS, users are directed to the nearest data center based on geo location or policies that will provide the best application experience.

 

Active – Active Site

Next is an active/passive design for a mid-size enterprise using virtual infrastructure and load-balancing appliances. In this scenario the second site is used only at the time of a disaster. Exchange 2013/2016 supports a unified URL across sites, such as mail.domain.com. If your anti-spam appliance licenses are CAL-based rather than server-based, you can build another set of anti-spam appliances, keep them powered off, and turn them on when the disaster occurs (changing the MX records manually). In most cases people don't build anti-spam appliances in the disaster recovery site, because it is only a temporary solution to keep the messaging system active, the DNS change is manual, and activating the secondary datacenter already consists of a series of steps.

Document this manual procedure. The document is used for disaster recovery planning and for compliance audits, and it effectively defines your planned RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

Note: I placed the load balancer in the diagram below because, if most users are in the primary site, you can always have two nodes in the primary site and one node in the secondary site. With three nodes there is no need for a witness vote.

Suitability – Going active/passive is the more common choice for an enterprise. It fits when most active users are in a single site or services are hosted from a single datacenter.

Failure Scenarios –

  • When Site 1 fails – Site 2 waits for an administrator to manually activate the datacenter, using the Stop-DatabaseAvailabilityGroup and Restore-DatabaseAvailabilityGroup PowerShell cmdlets and making the public DNS changes manually.
  • When Site 2 fails – Site 1 remains active, as the primary datacenter has the majority of the nodes.
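
Before activating Site 2 manually, it helps to know how far behind its passive copies were, since that is your real RPO. A minimal sketch, assuming a hypothetical Site 2 mailbox server named MBX3:

```powershell
# MBX3 is a hypothetical server name for the DAG member in Site 2.
# CopyQueueLength  = log files not yet copied from the active copy.
# ReplayQueueLength = log files copied but not yet replayed into the database.
Get-MailboxDatabaseCopyStatus -Server MBX3 |
    Sort-Object Name |
    Format-Table Name,Status,CopyQueueLength,ReplayQueueLength,LastInspectedLogTime -AutoSize
```

Recording these values during DR drills gives the compliance documentation a measured number instead of an estimate.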

 

  • LTM (Local Traffic Manager) – Hardware load balancer. The hardware load balancer is used only when you have two nodes in the primary datacenter.

Active – Passive Site

Next is a single-site architecture for a small enterprise using virtual infrastructure. In this scenario the second site only holds a manual offsite backup, for those who cannot afford a disaster recovery site. At the time of a disaster the administrator restores the system and databases using backup software. For anti-spam appliances, it is always best to stay with DNS round-robin for high availability, with the same DNS preference. If you don't want to invest in on-premises anti-spam servers, you can use a cloud-based anti-spam service, which comes at a subscription cost.

Document this manual procedure. The document is used for disaster recovery planning and for compliance audits, and it effectively defines your planned RPO (Recovery Point Objective) and RTO (Recovery Time Objective).

Suitability – For a small enterprise. Most customers of this type have already migrated to Office 365 to save operational and management costs.

Failure Scenarios –

  • When Site 1 fails – The site waits for an administrator to rebuild the backup software and manually recover from the offsite backup.

Active Site

This is a cloud-based design for a large enterprise with single sign-on, using virtual infrastructure.

Suitability – For a large enterprise that has adopted the cloud and manages its users from an on-premises hybrid Exchange server using single sign-on.

ADFS (Active Directory Federation Services) also plays a major role in enterprises that don't want to sync their passwords to the cloud. But if ADFS is down, users cannot log in to the cloud, because authentication happens on-premises.

Failure Scenarios –

  • When Site 1 fails – The cloud has all the data; only ADFS fails over across the site. (Hybrid and Azure AD Connect have to be rebuilt/restored.)
  • When Site 2 fails – No impact.
  • When the cloud fails – Wait for the cloud provider to resolve it, as cloud providers offer a standard SLA.

For disaster recovery of Active Directory, we can always stretch Active Directory to a cloud-based site such as Azure.

  • LTM (Local Traffic Manager) – Hardware load balancer.
  • With BIG-IP DNS, users are directed to the nearest data center based on geo location or policies that will provide the best application experience.

 

Cloud Based - Active – Active Site

This is a cloud-based design for a mid-size enterprise with password synchronization, using virtual infrastructure.

Suitability – For a small or mid-size enterprise that has adopted the cloud and manages its users from an on-premises server.

There is a common misconception that single sign-on cannot be achieved without ADFS (Active Directory Federation Services). Note: ADFS does give us various advantages when identity has to be shared across organizations.

Failure Scenarios –

  • When Site 1 fails – The cloud has all the data. (Azure AD Connect has to be rebuilt.)
  • When the cloud fails – Wait for the cloud provider to resolve it, as cloud providers offer a standard SLA.

For disaster recovery of Active Directory, we can always stretch Active Directory to a cloud-based site such as Azure.

 

 

CloudBased Site

 

Hope this article was informative. I am looking forward to adding more architectures to it.

The article is open for feedback.

About Satheshwaran Manoharan

Satheshwaran Manoharan is a Microsoft Exchange Server MVP and publisher of CareExchange.in. He has been supporting, deploying and designing Microsoft Exchange for some years and has extensive experience with Microsoft technologies.
