Troubleshooting stories from the field are the best. That’s why I like writing them down. Although they might sometimes read as plain schadenfreude, I feel there are lessons in them for anyone willing to look closely and listen carefully.
Last month, I experienced an issue with an AD FS farm, consisting of two load-balanced Web Application Proxy servers and two load-balanced AD FS servers. This AD FS farm seemed to collapse completely every now and then. The organization was so frustrated with the instability of the AD FS farm, and with the inability of the COVID-stricken IT team to fix it, that people even started blaming AD FS for the old bread in the canteen.
The customer has an Active Directory Domain Services environment and a connected AD FS farm, consisting of two load-balanced Web Application Proxy servers and two load-balanced AD FS servers. A KEMP Virtual Loadmaster VLM-200 operates as the load balancer for AD FS and several other servers, including on-premises SharePoint Server and Exchange Server implementations.
Like many other organizations, they had deployed a lot of laptops in recent months to help employees work from home.
The AD FS farm would randomly fail. When it failed, connecting to the AD FS farm would result in browser errors like ‘Service unavailable’ and ‘Connection refused’.
Federated applications connected and published through the AD FS farm would show browser error messages, too, when people tried to sign in.
We started by troubleshooting the AD FS farm itself. We checked:
- Windows Updates
- Time differences between the Web Application Proxy servers and the AD FS servers
- Name resolution and HOSTS file configurations
- Protocol hardening on the servers
We found some misconfigurations. We changed the A record in the internal DNS zone so that it no longer pointed to the VIP on the KEMP LoadMaster for the Web Application Proxies, but to the VIP for the AD FS servers themselves. We also hardened the secure channel (Schannel) protocols and cipher suites, to rule out any protocol mismatches.
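A check like the A record one can be repeated from any machine with a few lines of Python. The following is a minimal sketch, not the procedure we used on site: the function names are made up for illustration, and the host name and addresses you pass in are whatever applies to your own environment. Note that `socket.getaddrinfo` uses the operating system's resolver, so on most systems it also picks up HOSTS file entries, which is exactly what you want when hunting for stale overrides.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted list of unique IPv4 addresses a host name resolves to."""
    infos = resolver(hostname, 443, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) tuple.
    return sorted({info[4][0] for info in infos})


def check_record(hostname, expected, resolver=socket.getaddrinfo):
    """Compare what a host name resolves to with the addresses you expect."""
    actual = resolve_ipv4(hostname, resolver)
    return {
        "hostname": hostname,
        "actual": actual,
        "matches": actual == sorted(set(expected)),
    }
```

Calling `check_record` with the federation service name and the VIP you expect it to resolve to, once on an internal machine and once on each Web Application Proxy server, quickly shows which machines are talking to which VIP. The `resolver` parameter only exists so the function can be exercised without touching real DNS.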
What stood out during our troubleshooting was a particular event on the Web Application Proxy servers: EventID 224.
Events with EventID 224 would randomly appear in the AD FS\Admin log of the Web Application Proxy servers, indicating that the AD FS servers could not be reached. From that point on, the Web Application Proxy would retry the connection every minute. When a retry suddenly succeeded, EventID 245 would show and the Web Application Proxy would work fine for an hour with the information it had received. After that hour it would try again, and either succeed (EventID 245) or fail (EventID 224).
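To see how long the proxies were actually cut off, it helps to turn that stream of events into outage windows. The sketch below assumes the events have already been exported from the AD FS\Admin log into `(timestamp, event_id)` pairs by whatever means you prefer; the function name and the input shape are assumptions for illustration, only the two EventIDs come from the logs described above.

```python
from datetime import datetime, timedelta

FAILURE_ID = 224  # proxy could not reach the AD FS servers
SUCCESS_ID = 245  # proxy reached the AD FS servers again


def outage_windows(events):
    """Group (timestamp, event_id) pairs into (start, end) outage windows.

    A window opens at the first 224 after a healthy period and closes at
    the next 245. A window that is still open at the end of the data gets
    an end of None.
    """
    windows = []
    start = None
    for timestamp, event_id in sorted(events):
        if event_id == FAILURE_ID and start is None:
            start = timestamp
        elif event_id == SUCCESS_ID and start is not None:
            windows.append((start, timestamp))
            start = None
    if start is not None:
        windows.append((start, None))
    return windows
```

Repeated 224 events inside one window are the per-minute retries, so the length of a window is a reasonable estimate of how long sign-ins through that proxy were failing.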
We altered the HOSTS files on the Web Application Proxy servers to point directly to the IPv4 addresses of the AD FS servers, bypassing the load balancer for this communication. As a result, the events no longer showed up in the event logs of the Web Application Proxy servers. Federation worked fine for people using devices connected to the VPN, but it still failed randomly for people working from home without a VPN.
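We determined that the load balancer was the cause of the AD FS farm's instability: the traffic that bypassed it had become stable, while the external traffic that still passed through it kept failing.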
The LoadMaster VLM-200 model is limited to 200 Mbps of throughput and 200 SSL transactions per second (TPS). This particular model was discontinued by KEMP in February 2020.
We upgraded the KEMP LoadMaster VLM-200 to a KEMP LoadMaster VLM-500. The AD FS farm remained stable within the new limits this virtual appliance has to offer (500 Mbps of throughput and 500 TPS).
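If you want to sanity-check a TPS limit like this against your own traffic, a rough estimate is enough. The sketch below is an illustration under stated assumptions, not how the LoadMaster itself counts: it assumes you have a list of TLS handshake timestamps in epoch seconds from some log or capture, and it uses fixed one-second buckets rather than a sliding window. The function names are hypothetical.

```python
from collections import Counter


def peak_per_second(timestamps):
    """Return (second, count) for the busiest one-second bucket, or None if empty."""
    buckets = Counter(int(ts) for ts in timestamps)
    if not buckets:
        return None
    return max(buckets.items(), key=lambda item: item[1])


def exceeds_limit(timestamps, limit_tps=200):
    """True if any one-second bucket holds more transactions than limit_tps."""
    peak = peak_per_second(timestamps)
    return peak is not None and peak[1] > limit_tps
```

Running `exceeds_limit` with `limit_tps=200` and again with `limit_tps=500` gives a quick feel for how much headroom a bigger model would buy. Keep in mind that the load balancer in this story served SharePoint Server and Exchange Server too, so the handshakes of all published services count towards the same limit.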
AD FS problems are not always Microsoft problems. In this case, it helped that we knew our KEMP technologies, too.