From the Field: the Case of the non-replicating Domain Controllers on Cisco UCS Blades

This morning I was asked to troubleshoot a situation where, in an environment with ten Active Directory Domain Controllers, two Domain Controllers would no longer replicate with each other. They replicated with the other Domain Controllers just fine, just not with each other.

I troubleshot the problem, and I am sharing my experience in case you find yourself in a similar situation.

 

The situation

This networking environment encompasses several sites. In Active Directory Domain Services, each of these locations is represented by its own Active Directory site. Each site has several Domain Controllers that take care of authentication.

Some Domain Controllers in this environment are physical, some are hosted on VMware vSphere and others are hosted on top of Hyper-V. The environment was set up fairly recently with Cisco UCS and NetApp hardware, offering synchronous replication between two datacenter sites.

During the design phase, an architect mentioned that the environment needed physical Domain Controllers, so in each datacenter site a blade from the Cisco UCS implementation was set up as a Domain Controller. Besides having oodles of RAM (160GB), these two Windows Server 2012 R2-based Domain Controllers have two 20Gb/s Network Interface Cards each.

 

The problem

After a maintenance weekend, in which all hardware drivers were updated on all blades, the two physical Domain Controllers stopped replicating with each other. All other Domain Controllers replicated without problems, even with the two Domain Controllers that could no longer replicate with each other.

From a graphical point of view, this resembled the following situation:

Figure: Graphical overview of the non-replicating Domain Controllers

 

My troubleshooting approach

I used the free Active Directory Replication Status tool to take inventory of the situation and to confirm that it was as described. The tool also let me quickly retrieve the replication status after each action, to see whether it made a difference. In its Errors only view, the tool reported Replication error 1256: The remote system is not available and Replication error 1722: The RPC server is unavailable.

I then used version 2 of the PortQry Command Line Port Scanner to query the TCP and UDP ports in use by Active Directory Domain Services. When I got to the tests of TCP 389 and UDP 389, I noticed some odd behavior: UDP 389 returned LDAP information without problems, but TCP 389 timed out.

Of the ten possible causes of error 1722, the network-related ones clearly applied here.
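For readers without PortQry at hand, the TCP half of that port check can be approximated in a few lines of Python. This is only a minimal sketch, not a replacement for PortQry: it attempts a plain TCP connection and does not send an LDAP query, so it cannot reproduce the UDP 389 test. The function name `check_tcp_port` and the 3-second default timeout are my own assumptions; the first three return values simply mirror the states PortQry reports.

```python
import socket


def check_tcp_port(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and classify the outcome.

    Hypothetical helper for illustration; the labels mirror the
    LISTENING / NOT LISTENING / FILTERED states that PortQry reports.
    """
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "LISTENING"
    except socket.timeout:
        # No answer within the timeout: packets were dropped somewhere.
        return "FILTERED"
    except ConnectionRefusedError:
        # The host actively refused: reachable, but nothing on this port.
        return "NOT LISTENING"
    except OSError as exc:
        # Name resolution failures, unreachable networks, and so on.
        return f"ERROR: {exc}"
```

Calling `check_tcp_port("dc02.example.com", 389)` (a made-up host name) from one Domain Controller toward the other, and then in the opposite direction, would be expected to return FILTERED in a situation like the one described here, where the TCP 389 test timed out.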

   

The solution

I checked the networking settings, and found the two 20Gb/s Network Interface Cards (NICs) were teamed using the built-in NIC Teaming feature in Windows Server 2012 R2.

Digging into the NIC Teaming settings, I found the NIC Team on both Domain Controllers was configured with the Dynamic load balancing mode:

Figure: NIC Teaming settings in Windows Server 2012 R2

The Dynamic setting on the NIC Team enabled the two Domain Controllers to balance both the inbound and outbound traffic on the two Network Interface Cards (NICs).

I changed the NIC Team configuration on both Domain Controllers from Dynamic to Address Hash. This setting enables affinity, allowing traffic towards a given host to flow more steadily over one of the two Network Interface Cards (NICs).
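To illustrate what affinity means here, the idea behind hash-based distribution can be sketched in Python. This is a conceptual sketch under my own assumptions, not the algorithm Windows Server uses: CRC32 is just a stand-in hash, and the function name `pick_nic`, the NIC labels and the addresses are hypothetical. The point it shows is that the same combination of addresses and ports always lands on the same team member, while different flows can still be spread over both.

```python
import zlib
from typing import Sequence


def pick_nic(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
             nics: Sequence[str]) -> str:
    """Map one flow to one team member using a deterministic hash.

    Illustration only: CRC32 is a stand-in and is not claimed to be
    the hash that Windows NIC Teaming applies.
    """
    flow_key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode("ascii")
    # Same flow key, same hash, same NIC: this is the affinity.
    return nics[zlib.crc32(flow_key) % len(nics)]


if __name__ == "__main__":
    team = ["NIC1", "NIC2"]  # hypothetical team members
    # An LDAP connection between two example addresses stays on one NIC.
    print(pick_nic("192.0.2.10", 49152, "192.0.2.20", 389, team))
    print(pick_nic("192.0.2.10", 49152, "192.0.2.20", 389, team))
```

Both calls print the same NIC, because nothing in the flow key changed between them.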

With this setting, the two Domain Controllers could communicate with each other again. Both TCP 389 and UDP 389 returned LDAP information, and the Active Directory Replication Status tool soon reported no more replication errors.

 

Concluding

Sometimes, errors in Active Directory replication do not come from Active Directory itself, but are caused by a problem at one of the OSI layers on the systems hosting it.

In other news: troubleshooting Domain Controllers equipped with 160GB of RAM, in an environment with roughly 1000 accounts, is pretty funny, even on a Monday morning.

Tools I used

Active Directory Replication Status tool 
PortQry Command Line Port Scanner 

Related KnowledgeBase Articles

2102154 Troubleshooting AD Replication error 1722: The RPC server is unavailable
2200187 Troubleshooting Active Directory operations that fail with error 1256: The remote system is not available
