From the field: The Case of the Domain Controller that would not function after an Azure Site Recovery test failover

Last week, I was on the road with Darryl van der Peijl, one of my colleagues and a Hyper-V Most Valuable Professional (MVP).

We visited a customer that had some trouble with one of their Domain Controllers in combination with Azure Site Recovery (ASR): After a test failover, the Domain Controller would not function properly.

 

The situation

The customer has an Active Directory Domain Services environment consisting of a root domain and nine child domains, one for each of the countries they do business in.

To facilitate a test environment, they implemented Azure Site Recovery (ASR) to perform test failovers of the servers crucial to their testing needs. The test environment cannot have an active connection to the production environment. You’ve guessed it: They need Domain Controllers from their root domain and each of the child domains in their test environment, and they need them to be representative of the production network and fairly up to date.

The root domain and most of the child domains are fairly old domains that were initially set up using Windows 2000 Server. I suspected this because of the (empty) root domain and got that verified. The root domain and each of the child domains host more than one Domain Controller. One Domain Controller (in Germany) is running Windows Server 2003.

 

The issue

When the customer performs a test failover to produce the Domain Controller for the test environment, most of the time the Domain Controller would not function properly. Some Domain Controllers would start without problems, some would have problems some of the time and others would have problems all the time.

When a Domain Controller would not function properly, the SYSVOL and NETLOGON shares would not surface and the Domain Controller would have a lot of errors in its Event Viewer logs.
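As a rough illustration of this symptom check, here is a minimal Python sketch that looks for the two shares in the text output of the Windows `net share` command. The function name and the sample outputs are made up for this example; they are not taken from the customer's environment.

```python
# Hypothetical helper: report which Domain Controller shares are missing
# from the text that `net share` prints on a Windows server.
REQUIRED_SHARES = {"SYSVOL", "NETLOGON"}


def missing_dc_shares(net_share_output: str) -> set:
    """Return the required shares that do not appear as a share name."""
    present = set()
    for line in net_share_output.splitlines():
        tokens = line.split()
        if tokens:
            # The share name is the first column of each share line.
            present.add(tokens[0].upper())
    return REQUIRED_SHARES - present


# Illustrative sample outputs (assumed, not captured from a real server).
HEALTHY_SAMPLE = r"""
Share name   Resource                        Remark
-------------------------------------------------------------------------
C$           C:\                             Default share
NETLOGON     C:\Windows\SYSVOL\sysvol\example.local\SCRIPTS
SYSVOL       C:\Windows\SYSVOL\sysvol        Logon server share
"""

BROKEN_SAMPLE = r"""
Share name   Resource                        Remark
-------------------------------------------------------------------------
C$           C:\                             Default share
IPC$                                         Remote IPC
"""

if __name__ == "__main__":
    print(sorted(missing_dc_shares(BROKEN_SAMPLE)))
```

On a failed-over Domain Controller you would feed the real `net share` output into `missing_dc_shares`; an empty result means both shares surfaced.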

 

Our troubleshooting

From an Active Directory point of view, Azure Site Recovery (ASR) is a host-based backup and restore solution. Recalling the information in my Whitepaper on Host-Based Backups and Restores of Domain Controllers from four years ago, these types of backups and restores would need to be Active Directory-aware.

Going through the logs on both the production and the failed-over Domain Controllers did not yield the Event-ID 1917 (indicating a successful host-based backup) and Event-ID 1109 (indicating a successful restore) events I would expect from an Active Directory-aware backup and restore.

The backups are not Active Directory-aware.
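The event log check described above could be sketched along these lines in Python. The helper names are hypothetical, and so is the assumption that Event-ID 1917 is looked up on the production Domain Controller and Event-ID 1109 on the failed-over one. The command builder only assembles a `wevtutil qe` query against the Directory Service log; it does not run it.

```python
# Event IDs as named in the text above.
BACKUP_EVENT_ID = 1917   # successful host-based backup
RESTORE_EVENT_ID = 1109  # successful restore


def wevtutil_query(log_name="Directory Service", count=50):
    """Build (but do not run) a wevtutil command that lists both events."""
    xpath = (
        f"*[System[(EventID={BACKUP_EVENT_ID} "
        f"or EventID={RESTORE_EVENT_ID})]]"
    )
    # /rd:true returns the newest events first, /c limits the count.
    return ["wevtutil", "qe", log_name, f"/q:{xpath}",
            "/f:text", f"/c:{count}", "/rd:true"]


def looks_ad_aware(production_event_ids, failover_event_ids):
    """True when the backup event was seen on the production Domain
    Controller and the restore event on the failed-over one (assumed split)."""
    return (BACKUP_EVENT_ID in set(production_event_ids)
            and RESTORE_EVENT_ID in set(failover_event_ids))


if __name__ == "__main__":
    print(" ".join(wevtutil_query()))
    # In this case neither event was present on either machine.
    print(looks_ad_aware([], []))
```

On a Windows machine the list from `wevtutil_query()` can be passed to `subprocess.run`; the event IDs found in its output are then the input for `looks_ad_aware`.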

We noticed the value for Frequency of application-consistent snapshots was set to Never.


We configured application-consistent snapshots for the entire VMM Cloud by defining values for Frequency of application-consistent snapshots and Retain recovery points for (hours), so that in a true Disaster Recovery scenario (not this test-failover scenario) the customer would have the opportunity to restore a consistent backup.

Note:
You cannot configure these settings when a test failover is in progress.

However, Darryl quickly mentioned that for Azure Site Recovery (ASR) test-failover it is not possible to select the snapshot to return to…

This is not the solution.
(although it helps)

 

The cause

We had a call with the Azure Site Recovery (ASR) team, and one of the Program Managers (PMs) made an interesting statement that is in line with the messaging from the Product Team during events and other presentations:

We make block-based snapshots. We do not care what is inside the virtual machine, because it simply works.

I’ve heard quotes like this before, and what they actually mean is that services like Azure Site Recovery (ASR) rely on the robustness of the application inside the virtual machine for everything to work.

This customer's Active Directory environment, however, lacks one big robustness feature because of its origins: it still uses the NT File Replication Service (NTFRS) to replicate the System Volume (SYSVOL).

The newer Distributed File System Replication (DFS-R) is more robust and more scalable than FRS, and it has better replication performance. DFS-R for SYSVOL has been available since the Windows Server 2008 Domain Functional Level (DFL) and adds a lot of robustness to SYSVOL replication.

This Active Directory environment is still using FRS for SYSVOL replication.

I asked about the test setup of the Azure Site Recovery (ASR) team and they acknowledged testing Active Directory Domain Controller failovers with Windows Server 2008, but using freshly installed Windows Server 2008-based Domain Controllers, not migrated Domain Controllers or Active Directory domains originally running Windows 2000 Server or Windows Server 2003…

 

The solution

We replaced the Windows Server 2003-based Domain Controller in Germany with a Windows Server 2012 R2-based Domain Controller and raised the Domain Functional Levels (DFLs) throughout the Active Directory Forest to Windows Server 2008.

We then performed the FRS-to-DFSR migration for SYSVOL replication in all the domains throughout the Active Directory Forest.
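For readers who have not done this migration before, the sketch below lists the `dfsrmig` commands that step a single domain through the global migration states (0 Start, 1 Prepared, 2 Redirected, 3 Eliminated). It is a simplified outline and not the exact procedure we followed at this customer: the Python function name is hypothetical, and in practice you wait until `dfsrmig /getmigrationstate` reports that all Domain Controllers have reached a state before moving to the next one.

```python
# Global migration states used by the dfsrmig tool.
MIGRATION_STATES = {0: "Start", 1: "Prepared", 2: "Redirected", 3: "Eliminated"}


def migration_plan(current_state=0):
    """Return the dfsrmig commands to run, in order, for one domain.

    Hypothetical helper: it only produces the command strings; it does
    not execute anything or verify replication health.
    """
    if current_state not in MIGRATION_STATES:
        raise ValueError(f"unknown migration state: {current_state}")
    steps = []
    for target in range(current_state + 1, 4):
        # Move the domain to the next global state...
        steps.append(f"dfsrmig /setglobalstate {target}")
        # ...then check that every Domain Controller has caught up.
        steps.append("dfsrmig /getmigrationstate")
    return steps


if __name__ == "__main__":
    for command in migration_plan(0):
        print(command)
```

State 3 (Eliminated) cannot be rolled back to FRS, so it is worth pausing before the last `dfsrmig /setglobalstate` step. The plan has to be repeated for each domain in the forest.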

Then, after extensive testing by the customer, the Domain Controllers would function without problems after Azure Site Recovery (ASR) test-failovers.

 

Concluding

When you want to use Azure Site Recovery (ASR) with Domain Controllers, make sure you are making Active Directory-aware (application-consistent) snapshots.

When running older Active Directory environments (dating back to Windows 2000 Server and/or Windows Server 2003-based Domain Controllers and Functional Levels), make sure you’ve performed your FRS-to-DFSR migration for SYSVOL replication.

Related blogposts

Migrate FRS to DFSR
NTFRS Depricated with Windows Server 2012
Transitioning your Windows Server 2003 Domain Controllers to Windows Server 2012
SYSVOL FRS to DFS-R Migration Guide available

Further reading

SYSVOL Replication Migration Guide: FRS to DFS Replication
Streamlined Migration of FRS to DFSR SYSVOL
SYSVOL migration from FRS to DFSR – Whitepaper Released
DFS Operations Guide: Migrating from FRS to DFS Replication
Microsoft Azure Site Recovery: Your DR Site in Microsoft Azure
Protect Active Directory and DNS with Azure Site Recovery
The ins and outs of the Windows File Replication Service

2 Responses to From the field: The Case of the Domain Controller that would not function after an Azure Site Recovery test failover

  1.  

    Sander, great post. You saved the day!
    I’ve been struggling to get our DR site online, and our DCs were having the same symptoms as what you described here. I migrated our DCs to DFS-R, and now everything works great. THANKS!!

  2.  

    So in 2015, Server 2012 had been out for 3 years, and a business wanted to shove their legacy domain into Azure? Wow.

    Great argument for telling businesses that they can’t get the shiny toys without fixing the fundamentals.
