Troubleshooting Active Directory Domain Services is fun.
Today, I cover a more esoteric Active Directory troubleshooting case about an overloaded Domain Controller holding the Primary Domain Controller Emulator role.
The cause has nothing to do with Active Directory, of course, but I was called in because the machine affected was a Domain Controller.
About the Primary Domain Controller Emulator
Back in the days of Windows NT 4 Server, Microsoft offered redundancy for the server-based Security Accounts Manager (SAM) through Primary Domain Controllers (PDCs) and Backup Domain Controllers (BDCs). BDCs offered read-only access to the SAM database in Windows NT 4 Server.
This was before Active Directory and Windows 2000 Server, where the multi-master model and the concept of Flexible Single Master Operations (FSMO) roles were introduced. To maintain the concept of the Primary Domain Controller, a domain-wide PDC Emulator role was introduced. This role can be transferred between Domain Controllers at will, or seized if the current role holder fails.
The Domain Controller holding the Primary Domain Controller (PDC) emulator Flexible Single Master Operations (FSMO) role performs these additional tasks when compared to all the other Domain Controllers in the Active Directory domain:
- Password changes performed by other Domain Controllers in the Active Directory domain are replicated preferentially to the PDC emulator.
- If a logon authentication fails at a given Domain Controller in an Active Directory domain due to a bad password, the Domain Controller will forward the authentication request to the PDC emulator to validate the request against the most current password. If the PDC reports an invalid password to the Domain Controller, the Domain Controller will send back a bad password failure message to the user.
- Account lockout is processed on the PDC emulator.
- The Domain Controller with the PDC emulator FSMO role, by default, functions as the authoritative source of time in the Active Directory domain.
- The Domain Controller with the PDC emulator FSMO role fulfills the role of the PDC in the NetLogon Remote Protocol methods. Therefore, the Domain Controller with the PDC emulator FSMO role must support and perform all PDC-specific functionality that the protocol specifies. Every other Domain Controller must not perform this functionality.
It’s safe to say, under normal circumstances, the Domain Controller holding the Primary Domain Controller (PDC) emulator Flexible Single Master Operations (FSMO) role is the busiest Domain Controller of all.
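To see which Domain Controller currently holds the PDC emulator role, something like the following sketch can be used. It assumes the ActiveDirectory PowerShell module (part of RSAT) is installed and that it runs on a domain-joined machine; neither is stated in this post.

```powershell
# Assumption: the ActiveDirectory module (RSAT) is available on this machine.
Import-Module ActiveDirectory

# The domain object exposes the current PDC emulator holder as a property.
(Get-ADDomain).PDCEmulator

# List every Domain Controller together with the FSMO roles it holds, if any.
Get-ADDomainController -Filter * |
    Select-Object HostName, Site, OperationMasterRoles
```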
The situation
The customer has fifteen Domain Controllers, all part of one Active Directory domain in one Active Directory forest. All Domain Controllers are virtual machines, hosted on VMware vSphere.
The issue
The Domain Controller holding the Primary Domain Controller (PDC) emulator Flexible Single Master Operations (FSMO) role often peaked at 100% CPU, while other Domain Controllers didn’t.
The admins were notified of these utilization peaks, added another virtual processor to the virtual Domain Controller and rebooted it, but the machine kept feeling sluggish and the admins kept receiving high CPU notifications.
My troubleshooting
This is to be expected.
I fully expected the PDC Emulator to be burdened more than other Domain Controllers, because of the extra tasks this Domain Controller has to perform.
We can fix it with DNS Priority.
The normal method of coping with this issue is by using DNS Priority.
This way, an AD admin can artificially raise the priority value of the DNS SRV records registered by the Domain Controller holding the Primary Domain Controller emulator (PDCe) Flexible Single Master Operations (FSMO) role, so that this Domain Controller is unlikely to receive authentication requests unless no other Domain Controllers are available. Clients try the SRV records with the lowest priority value first. By default, the priority is set to 0. Setting it much higher, say to 100 or 200, significantly reduces the chances that the PDC Emulator will get authentication requests.
Note:
Some legacy applications may be written to specifically contact the PDC for the domain, and might not be impacted by DNS Priority.
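To see which priority and weight each Domain Controller currently advertises, the LDAP SRV records can be queried from DNS. This is a minimal sketch; the domain name corp.example.com is a placeholder and not taken from the customer environment. Records with the lowest priority value are tried first, and among records of equal priority a higher weight receives a proportionally larger share of requests.

```powershell
# Placeholder: replace with the DNS name of your own Active Directory domain.
$domain = "corp.example.com"

# Domain Controllers register their LDAP SRV records under _ldap._tcp.dc._msdcs.<domain>.
# The Section filter drops the additional A/AAAA records returned with the answer.
Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.$domain" -Type SRV |
    Where-Object { $_.Section -eq 'Answer' } |
    Sort-Object Priority, Weight |
    Select-Object NameTarget, Priority, Weight, Port
```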
But wait, there’s more…
When I dug a bit deeper with the Windows Task Manager, I noticed that only CPU0 was being overutilized. The other virtual processor was just doing its thing, but was more or less idling.
Could this be a TCP setting?
I ran a little Windows PowerShell one-liner to get the network interface card properties:
Get-NetAdapterRss -Name "Internal"
I received an error, indicating that no MSFT_NetAdapterRssSettingData exists for the network interface card. Apparently, Receive Side Scaling (RSS) is off…
That’s strange…
Why would Receive Side Scaling (RSS) be off? No wonder CPU0 is overloaded; the network interface card settings tell it to only use this CPU and not the others.
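A slightly broader check than the one-liner above might look like the following sketch. It lists the RSS state of every adapter that exposes RSS settings, plus the global RSS setting of the TCP/IP stack. The selected columns are my own choice of what is useful to look at.

```powershell
# List RSS state for every adapter that exposes RSS settings.
# Adapters without RSS settings raise a non-terminating error, suppressed here,
# so an adapter that is missing from the output is itself a finding.
Get-NetAdapterRss -Name "*" -ErrorAction SilentlyContinue |
    Select-Object Name, Enabled, NumberOfReceiveQueues, MaxProcessors

# Global RSS setting of the Windows TCP/IP stack.
Get-NetOffloadGlobalSetting | Select-Object ReceiveSideScaling
```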
The cause
As it turns out, the Windows Receive Side Scaling (RSS) feature is not functional on virtual machines running VMware Tools versions 9.10.0 up to 10.1.5. This issue, apparently, has been plaguing VMware vSphere-based virtual Domain Controllers for quite some time and VMware has been working on the issue since March 23, 2017…
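When there are many virtual machines to check, the version range can be encoded in a small helper. This is only a sketch: Test-VMwareToolsRssAffected is a hypothetical function name, it treats both ends of the range (9.10.0 and 10.1.5) as inclusive, which is my reading of "up to", and it expects you to supply the VMware Tools version string yourself.

```powershell
# Hypothetical helper: returns $true when a VMware Tools version falls in the
# range this post describes as affected (9.10.0 up to 10.1.5, assumed inclusive).
function Test-VMwareToolsRssAffected {
    param(
        [Parameter(Mandatory = $true)]
        [version]$ToolsVersion
    )

    # Compare on major.minor.build only, so a trailing revision number
    # (for example 10.1.5.1234) does not push the version out of the range.
    $build = [Math]::Max($ToolsVersion.Build, 0)
    $normalized = [version]("{0}.{1}.{2}" -f $ToolsVersion.Major, $ToolsVersion.Minor, $build)

    $first = [version]'9.10.0'
    $last  = [version]'10.1.5'

    return ($normalized -ge $first) -and ($normalized -le $last)
}

# Example calls with hand-typed version strings.
Test-VMwareToolsRssAffected -ToolsVersion '10.0.9'
Test-VMwareToolsRssAffected -ToolsVersion '10.2.5'
```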
However, my friends at Veeam were aware of the issue. Probably because backups push far more data over the network than normal Active Directory operations do, they have encountered this issue more often than I have.
Their advice is to upgrade VMware Tools to version 10.2.5 or later, to gain VMXNET3 driver version 1.7.3.8. One caveat they found is that this driver version enables the RSS and Receive Throttle settings by default, but only for new VMware Tools installations on new virtual machines. If you upgrade an existing VMware Tools installation, these settings remain as they were.
The settings they advise are to enable Receive Side Scaling (RSS) and to set the Receive Throttle to 30.
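On an upgraded virtual machine, the driver version and both settings can be inspected and changed through the adapter's advanced properties. The sketch below reuses the adapter name "Internal" from earlier in this post. The display names "Receive Side Scaling" and "Receive Throttle" are an assumption about how the VMXNET3 driver labels these properties, so check them against the output of the second command before running the last two.

```powershell
# Confirm which driver version the adapter is actually running.
Get-NetAdapter -Name "Internal" |
    Select-Object Name, InterfaceDescription, DriverVersion

# List the advanced properties the driver exposes, with their current values.
Get-NetAdapterAdvancedProperty -Name "Internal" |
    Select-Object DisplayName, DisplayValue, RegistryKeyword, RegistryValue

# Assumption: the driver labels the properties exactly like this.
# Each call restarts the adapter, so expect a brief network interruption.
Set-NetAdapterAdvancedProperty -Name "Internal" -DisplayName "Receive Side Scaling" -DisplayValue "Enabled"
Set-NetAdapterAdvancedProperty -Name "Internal" -DisplayName "Receive Throttle" -DisplayValue "30"
```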
The solution
As a solution, we performed a couple of actions, mostly in a maintenance window:
We performed a test restore of the latest backup of the Domain Controller, so we were certain we could restore the Domain Controller even in the case of completely borked networking settings.
We upgraded VMware Tools to version 10.2.5 on the Domain Controller holding the PDCe FSMO role and rebooted the server.
We enabled Receive Side Scaling (RSS) on all capable network interface cards (NICs) on the virtual Domain Controller, using the following Windows PowerShell one-liner:
Enable-NetAdapterRss -Name "*"
We then restarted the Domain Controller for a second time.
We repeated the above five steps for all Domain Controllers throughout the Active Directory domain.
Then, on the Primary Domain Controller Emulator, we changed the following two registry values:
- HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\LdapSrvWeight
We changed the DWORD value to 50. (default value is 100)
- HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\LdapSrvPriority
We changed the DWORD value to 100. (default value is 0)
We then rebooted the Primary Domain Controller Emulator.
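For reference, the two registry changes can also be scripted. This is a sketch of one way to do it, not necessarily how we made the change at the customer. Run it in an elevated Windows PowerShell session on the PDC Emulator only.

```powershell
$path = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'

# Values from this post: weight 50 (default 100) and priority 100 (default 0).
Set-ItemProperty -Path $path -Name 'LdapSrvWeight'   -Value 50  -Type DWord
Set-ItemProperty -Path $path -Name 'LdapSrvPriority' -Value 100 -Type DWord

# Read the values back to confirm what is now configured.
Get-ItemProperty -Path $path | Select-Object LdapSrvWeight, LdapSrvPriority
```

After the reboot, the SRV records in DNS for this Domain Controller should show the new priority and weight.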
Hat tip
Hat tip to Anton Gostev from Veeam for pointing me in the right direction in his weekly Veeam Community Forums Digest.