So I was contacted by a panicking client. It seemed that all of his Outlook clients could not connect to Office 365 anymore. That meant investigating what was wrong. Upon inquiring he admitted that he changed some DNS settings the day before. But only the SPF record. That didn’t explain connection issues with their Outlook clients. Naturally the first thing I checked where there public DNS settings. There did not seem to be anything out of the ordinary there, apart from a tiny mistake in the SPF record they created. That did not present any insight in what might was causing this problem.
A couple of days earlier we did renew the Exchange certificate on the on premise Hybrid server. But since the auto-discover DNS records where pointing at Office 365 this should not be a problem.
So I turned to my trusty connection testing toolset provided by our friends at Microsoft: https://testconnectivity.microsoft.com/
There on the Office 365 tab I ran the Outlook connectivity test. The following picture is a screenshot of a part of the test outcome. Funny thing that HTTP 503 error for the Office 365 auto-discover service.
A little web research suggested to recreate the federation link with Office 365. I felt that would be a little bit of exaggeration. What else could be responsible for an Office 365 service being unavailable? Needless to say, I tested another tenant. That one seemed to have no problem what so ever. Then it hit me. A quick question to their administrator if anything had changed at the Federation servers of domain controllers confirmed this. Yes, updates where installed, zero reboots given. Great, go reboot those machines….
5 minutes later I got a very happy and relieved sysadmin on the phone confirming that everything was working again. He also informed me they where able to log into on premise servers again. He forgot to mention that fact in an earlier conversation…..*sigh*
If you cannot log into your federated Office 365 environment, check your Domain Controllers and Federation servers. Something might be out of order there.
When I was at a client the other day I encountered the following:
As you can see the Exchange environment in itself already contains a single point of failure. Namely the Exchange-01 server who solemnly functions as a client access and transport hub. The two database servers however are both made high available through the use of the failover-cluster feature introduced in Windows server 2008. This in itself is a good idea. Beside the fact that this way you can create redundancy within your database hosts this also allows you to use multiple redundant databases on both servers in a database availability group. You can even reboot one in the middle of production. For instance to update some compromised certificates. The production reboot should notify clients to restart their outlook, but hey, your exchange is safe and up to date again.
It is a bad idea to install this failover cluster on a failover VMware cluster. The problem arises when an actual failover needs to take place. In a perfect world (where you wouldn’t even need failover since your servers would never break) failover would happen automatically if one server for whatever reason stops functioning. In the case of my client something very interesting happens.
First the database is going to be transferred to the other exchange server. All is well. At the same time, VMware steps in and fails over the Exchange server to another host or whatever it is that VMware does to keep guests alive and restores the system to it’s previous state. So the server that went down is restored with database connection while the windows failover-cluster transferred database access to the other exchange database server. With both servers wanting to access the database neither will be able to and that’s when your exchange database failover-cluster fails. This usually results in a lot of people calling the helpdesk to ask why they can’t access their mail.
This is not due to the fact that either VMware or Failover-clustering is a poor feature, this is because someone implemented a solution without proper testing.
So if you want to make Exchange redundant, only use one method and not two or more stacked methods or it will come around to byte you like an attack dog.