Troubleshooting stories from the field are the best. That’s why I like writing them down. Although they might sometimes read as plain schadenfreude, I feel there are lessons in them for anyone willing to look closely and listen carefully.
Last month, I worked on an issue where all four Domain Controllers of an organization crashed at random.
The situation
The customer has an Active Directory Domain Services environment, consisting of one Active Directory domain. The implementation has four Domain Controllers in total; two virtual Domain Controllers and two physical Domain Controllers. All Domain Controllers run Windows Server 2012 R2 and are up to date in terms of Windows Updates.
The issue
All Domain Controllers would fail at random. Each time, Event ID 1000 was logged right before the reboot, stating that lsass.exe had crashed. The interesting portion of the XML of the event is shown below:
<System>
<Provider Name="Application Error" />
<EventID Qualifiers="0">1000</EventID>
<Level>2</Level>
<Task>100</Task>
<Keywords>0x80000000000000</Keywords>
<Channel>Application</Channel>
<Security />
</System>
<EventData>
<Data>lsass.exe</Data>
<Data>6.3.9600.17415</Data>
<Data>545042fe</Data>
<Data>ntdll.dll</Data>
<Data>6.3.9600.19678</Data>
<Data>5e82c88a</Data>
<Data>c0000374</Data>
<Data>00000000000f1ce0</Data>
<Data>24c</Data>
<Data>01d668f2cb1bf7c2</Data>
<Data>C:\Windows\system32\lsass.exe</Data>
<Data>C:\Windows\SYSTEM32\ntdll.dll</Data>
<Data>b69ce08a-d5f0-11ea-814b-0050568d1dbd</Data>
<Data />
<Data />
</EventData>
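The positional Data values in this event are easier to read once they have names. The following is a minimal Python sketch of how that could be done. The field labels are my own and assume the usual order of values in an Application Error (Event ID 1000) record. The two timestamp conversions assume that the third value is a PE link timestamp (seconds since 1970) and that the tenth value is a Windows FILETIME (100-nanosecond ticks since 1601).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

# Labels are assumptions: they follow the usual order of the Data values
# in an Application Error (Event ID 1000) record.
FIELDS = [
    "app_name", "app_version", "app_timestamp",
    "module_name", "module_version", "module_timestamp",
    "exception_code", "fault_offset", "process_id",
    "app_start_time", "app_path", "module_path", "report_id",
]

# The EventData element exactly as shown in the event above.
EVENT_DATA = r"""
<EventData>
<Data>lsass.exe</Data>
<Data>6.3.9600.17415</Data>
<Data>545042fe</Data>
<Data>ntdll.dll</Data>
<Data>6.3.9600.19678</Data>
<Data>5e82c88a</Data>
<Data>c0000374</Data>
<Data>00000000000f1ce0</Data>
<Data>24c</Data>
<Data>01d668f2cb1bf7c2</Data>
<Data>C:\Windows\system32\lsass.exe</Data>
<Data>C:\Windows\SYSTEM32\ntdll.dll</Data>
<Data>b69ce08a-d5f0-11ea-814b-0050568d1dbd</Data>
<Data />
<Data />
</EventData>
"""


def parse_event_data(xml_text):
    """Map the positional Data values to the labels in FIELDS."""
    root = ET.fromstring(xml_text.strip())
    values = [(element.text or "") for element in root.iter("Data")]
    return dict(zip(FIELDS, values))


def link_timestamp_to_utc(hex_value):
    """Interpret a hex value as seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(int(hex_value, 16), tz=timezone.utc)


def filetime_to_utc(hex_value):
    """Interpret a hex value as 100 ns ticks since 1601-01-01 UTC."""
    ticks = int(hex_value, 16)
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=ticks // 10)


event = parse_event_data(EVENT_DATA)
for name, value in event.items():
    print(f"{name:17} {value}")
print("process id (decimal):", int(event["process_id"], 16))
print("lsass.exe link time: ", link_timestamp_to_utc(event["app_timestamp"]))
print("lsass.exe started at:", filetime_to_utc(event["app_start_time"]))
```

Read this way, the record says that lsass.exe (process ID 24c, which is 588 in decimal) failed inside ntdll.dll with exception code c0000374 at offset f1ce0.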
Memory dumps and minidumps were not available on any of the Domain Controllers, even though the pagefile settings were at their defaults, pagefile.sys existed on the system drive and the system drive had ample free space to write dumps.
Our troubleshooting
The Local Security Authority Subsystem Service (LSASS) is responsible for enforcing the security policy on the system. It verifies users signing in to a Windows or Windows Server system, handles password changes, and creates access tokens. It also writes to the Windows Security log. Forcible termination of lsass.exe results in a restart of the Domain Controller. The restarts are therefore the recovery process, not the problem itself.
When a program, application or service crashes, Windows Server records data. Even though memory dumps were not available, there should still be Windows Error Reporting (WER) reports. We checked the following location:
C:\ProgramData\Microsoft\Windows\WER\ReportQueue
Here we found the error reports. Unfortunately, almost all of the *.hdmp files were corrupted. We got lucky on one of the Domain Controllers with a memory.hdmp file that was readable with WinDBG.
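Collecting the candidate dump files from several Domain Controllers by hand gets tedious. A small Python sketch like the one below could list them. The function name is my own, and including *.mdmp next to *.hdmp is an assumption on my part. It only finds files; whether a dump is actually readable still has to be checked by opening it in WinDBG.

```python
from pathlib import Path

# The WER queue location mentioned above.
REPORT_QUEUE = Path(r"C:\ProgramData\Microsoft\Windows\WER\ReportQueue")

# Assumption: heap dumps (*.hdmp) and minidumps (*.mdmp) are both of interest.
DUMP_SUFFIXES = {".hdmp", ".mdmp"}


def find_dumps(root):
    """Return (path, size_in_bytes) for each dump file below root, largest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    dumps = [
        (path, path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in DUMP_SUFFIXES
    ]
    return sorted(dumps, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for path, size in find_dumps(REPORT_QUEUE):
        print(f"{size:>14,d}  {path}")
```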
We analyzed the file and found the following information:
ExceptionAddress: 00007ffb24141ce0 (ntdll!RtlReportCriticalFailure+0x000000000000008c)
ExceptionCode: c0000374
ExceptionFlags: 00000001
NumberParameters: 1
Parameter[0]: 00007ffb2417ed40
PROCESS_NAME: lsass.exe
ERROR_CODE: (NTSTATUS) 0xc0000374 - A heap has been corrupted.
EXCEPTION_CODE_STR: c0000374
EXCEPTION_PARAMETER1: 00007ffb2417ed40
ADDITIONAL_DEBUG_TEXT: Followup set based on attribute [Heap_Error_Type] from Frame:[0] on thread:[PSEUDO_THREAD] ; Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD]
FAULTING_THREAD: ffffffff
STACK_TEXT:
00000000`00000000 00000000`00000000 PwdFilt!unknown_function+0x0
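The dump and the event record can be cross-checked with a bit of arithmetic. The sketch below assumes that both describe the same ntdll.dll build. In that case, subtracting the fault offset from the event from the exception address in the dump should give the address where ntdll.dll was loaded. A result that is aligned on a 64 KB boundary is a plausibility check, not proof.

```python
# Values copied from the event record and the WinDBG output above.
exception_address = 0x00007FFB24141CE0  # ExceptionAddress from the dump
fault_offset = 0x00000000000F1CE0       # fault offset from Event ID 1000
exception_code = 0xC0000374             # ExceptionCode from both sources

# Small lookup table; only the status code seen in this case is listed.
NTSTATUS_NAMES = {0xC0000374: "STATUS_HEAP_CORRUPTION"}

# Absolute address minus offset within the module gives the load address.
module_base = exception_address - fault_offset

print(f"ntdll.dll load address: {module_base:#018x}")
print(f"64 KB aligned: {module_base % 0x10000 == 0}")
print(f"exception: {NTSTATUS_NAMES.get(exception_code, 'unknown')}")
```

Here the subtraction yields 0x00007ffb24050000, which is 64 KB aligned, so the two records are consistent with each other.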
PwdFilt.dll is a password filter library. It allows organizations to dictate their own password complexity rules for password changes and password resets, provided the same file is present on all Domain Controllers. It can also be used to siphon clear-text passwords at the time of change.
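LSASS loads password filters that are registered in the Notification Packages value under HKLM\SYSTEM\CurrentControlSet\Control\Lsa, where each entry names a DLL without its extension. The Python sketch below shows one way to list those entries and flag the ones that are not expected. The function names and the EXPECTED set are assumptions for illustration; adjust the set to what a clean installation in your environment contains. Reading the registry only works on Windows.

```python
LSA_KEY = r"SYSTEM\CurrentControlSet\Control\Lsa"
VALUE_NAME = "Notification Packages"

# Assumption: packages commonly present on a default Windows Server install.
EXPECTED = {"scecli", "rassfm"}


def read_notification_packages():
    """Read the REG_MULTI_SZ value from the local registry (Windows only)."""
    import winreg  # imported here so the rest of the sketch runs anywhere

    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LSA_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return list(value)


def third_party_packages(packages, expected=EXPECTED):
    """Return the registered packages that are not in the expected set."""
    expected_lower = {name.lower() for name in expected}
    return [p for p in packages if p and p.lower() not in expected_lower]


if __name__ == "__main__":
    import sys

    if sys.platform == "win32":
        registered = read_notification_packages()
        print("Registered:", registered)
        print("Not in the expected set:", third_party_packages(registered))
```

Any name that shows up in the second list deserves a closer look, because that DLL runs inside lsass.exe on every password change.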
When we looked at C:\Windows\System32\PwdFilt.dll, we noticed that the file didn’t have a Microsoft signature, but a signature by Authasas B.V. This originally Dutch organization was bought by Micro Focus International in 2015 and subsequently merged with the NetIQ solutions.
The timestamp on the file refers to May 29th, 2015. A quick call to the NetIQ representative confirmed that the versions running on the Domain Controllers are the latest available versions of Authasas, as the solution has been end of life for a couple of years. The DLL is used to send the passwords to the Authasas solution for single sign-on from several specific endpoints, where people sign in with a smart card and then get signed in to a Remote Desktop Services (RDS) host with their Active Directory credentials.
Several projects are in motion at the organization to eliminate the systems that require the Authasas functionality. All other endpoints already use the newer and still supported NetIQ solutions.
The cause
At this point, we concluded that the PwdFilt.dll file from Authasas crashed lsass.exe on a Domain Controller, leading to the subsequent reboot of that Domain Controller.
When the password change was later attempted against another Domain Controller, that Domain Controller crashed too.
The solution
We opted to isolate the Authasas functionality:
- We placed the systems that were dependent on the Authasas functionality in a newly created Active Directory site by defining subnets that contain these systems.
- For this site, we created two new Domain Controllers.
- We updated the new Domain Controllers with the latest Windows Updates and made a copy of the PwdFilt.dll file before installing Authasas.
- We installed Authasas on the two new Domain Controllers.
- We disabled the Bridge all site links setting.
- We enabled replication notifications on the site links.
- Then, we removed the Authasas solution from the previous four Domain Controllers and overwrote the PwdFilt.dll.
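The isolation above relies on the way Active Directory maps clients to sites: a client belongs to the site whose subnet contains its IP address, and when several subnets match, the most specific one wins. The following Python sketch illustrates that matching rule only. The subnets, site names and function name are hypothetical and do not come from the customer environment.

```python
import ipaddress

# Hypothetical subnet objects and site names, for illustration only.
SUBNET_TO_SITE = {
    "10.0.0.0/16": "Default-First-Site-Name",
    "10.0.50.0/24": "Authasas-Isolated",
}


def site_for(ip, subnet_to_site=SUBNET_TO_SITE):
    """Return the site of the most specific subnet containing ip, or None."""
    address = ipaddress.ip_address(ip)
    best = None  # (prefix length, site name) of the best match so far
    for cidr, site in subnet_to_site.items():
        network = ipaddress.ip_network(cidr)
        if address in network and (best is None or network.prefixlen > best[0]):
            best = (network.prefixlen, site)
    return best[1] if best else None


print(site_for("10.0.50.23"))   # endpoint in the isolated subnet
print(site_for("10.0.12.7"))    # any other endpoint
print(site_for("192.168.1.1"))  # no subnet defined for this address
```

With these example values, only endpoints in 10.0.50.0/24 land in the isolated site and therefore prefer the two new Domain Controllers.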
All six Domain Controllers have run fine since.