Active Directory as a network service has (at least IMO) one great advantage (some problems can be pointed out as well): it is relatively simple to build it as a fault-tolerant service. With proper design and maintenance it takes some effort to break AD as a service. It provides:
- multiple directory replicas with multi-master replication
- DC location mechanisms which clients can use to find another DC when a single machine fails
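To make the DC location part a bit more concrete: clients find domain controllers through DNS SRV records registered under `_ldap._tcp.dc._msdcs.<domain>`. Below is a minimal sketch of that idea, not the real DC locator. The `SrvRecord` type, the `example.local` domain and the host names are made up for illustration, and the ordering is simplified (real SRV selection picks randomly by weight within one priority level).

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    """One DNS SRV answer (hypothetical container, fields as in RFC 2782)."""
    priority: int
    weight: int
    port: int
    target: str


def dc_srv_name(domain: str) -> str:
    """Build the DNS name under which the DCs of an AD domain register."""
    return f"_ldap._tcp.dc._msdcs.{domain.strip('.').lower()}"


def failover_order(records):
    """Order DC host names: lowest priority first, then highest weight.

    Simplification: a deterministic sort instead of the weighted random
    choice a real resolver client would make within one priority level.
    """
    ordered = sorted(records, key=lambda r: (r.priority, -r.weight, r.target))
    return [r.target.rstrip(".") for r in ordered]


# Made-up answers for a two-site domain; dc3 is only a fallback.
records = [
    SrvRecord(0, 100, 389, "dc2.example.local."),
    SrvRecord(0, 100, 389, "dc1.example.local."),
    SrvRecord(10, 100, 389, "dc3.example.local."),
]
print(dc_srv_name("example.local"))
print(failover_order(records))
```

The point is simply that with more than one DC registered, a client always has another name to try.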
I started to think about it after at least a few posts on different forums (yes, I still waste my time on helping others 🙂 ) where people asked how to proceed with AD disaster recovery in case of DC failure. What they had in common was that they were planning for DC failure in an environment with ONLY ONE DC. Different approaches were taken, mostly involving some virtualization solution, but the simplest solution, adding an additional DC, was often omitted. Why?
So let’s do a simple exercise and think about what the simplest recovery procedure looks like when we have only a single DC(1). The big day comes …
- Our one and only DC fails and we start to experience problems and outages in our network
- If this isn’t a hardware failure, or we have similar hardware, we restore a backup or install the OS from scratch
- If we don’t have spare hardware, we waste some time finding some and then install or restore the OS
- We restore our directory from backup, go through all the necessary procedures, and after 2-3 hours we are back in business.
During those hours:
- Our users are experiencing problems accessing network resources
- If our mail system is integrated with the directory we might be cut off from mail
- If our internet access is based on AD authorization (proxy), even reading internet newspapers is not an option:
  - Minesweeper is still a solution 🙂
Of course these points do not take into account that:
- we have to start dealing with the failure right away because it is affecting our business
- <put some name or title here> is standing over us and demanding that we bring the business back on-line
- we are assuming that we are perfectly calm and that panic is not clouding our actions :).
In the best case, a single DC failure causes a few hours of outage for the entire organization. If this organization had an additional DC, what would change in this scenario? When one DC fails:
- Probably nobody will notice, as the other DC(s) in the network should take care of handling client requests
  - Developers: please don’t hard-code DC addresses or names in apps.
- The responsible administrator has time to finish his coffee and sandwich and read the DR procedures (if there are any, of course) to decide which procedure should be applied in this particular case.
- The selected DR procedure is applied in the environment and everything gets back to normal operations.
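For the developers mentioned above, here is a rough sketch of what "don’t hard-code a DC" can mean in practice: the app receives a list of candidate DCs (for example from DNS) and uses the first one that answers. The function names and the `example.local` hosts are hypothetical, and the TCP check against port 389 (LDAP) is only a crude reachability test, not a health check of the directory service.

```python
import socket


def tcp_probe(host, port=389, timeout=2.0):
    """Return True if a TCP connection to the LDAP port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_available(candidates, is_alive=tcp_probe):
    """Return the first candidate DC for which is_alive(host) is true.

    `candidates` is whatever list of DC host names the app discovered;
    nothing here depends on one particular machine being up.
    """
    for host in candidates:
        if is_alive(host):
            return host
    raise RuntimeError("no domain controller reachable")


# Simulated outage: dc1 is down, so the app quietly moves on to dc2.
down = {"dc1.example.local"}


def fake_probe(host):
    return host not in down


print(first_available(["dc1.example.local", "dc2.example.local"], fake_probe))
```

With a single DC in the list there is nothing to fall back to, which is exactly the problem described in this post.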
The main difference here is that we don’t have to react to something which disrupts our business; we are only dealing with the failure of a single infrastructure element. An additional advantage, of course, is that <put some name or title here> is not standing behind us all the time (however, our DR procedure should include a step to inform him).
So .. these things are obvious; however, what I see, especially from people in small and medium organizations, is that the simplest approach is often abandoned and some fancy, complex solutions incorporating virtualization, snapshots etc. are considered instead. With all the clustering for SQL data, load balancing for web apps etc., the crucial element which is the directory service is often treated lightly.
And this might be all but …
… yeah .. VIRTUALIZATION. It is the common buzzword of the current time for all IT guys. I often see virtualization abused as some kind of golden solution to every problem. Of course we can use virtualization for DCs; however, I don’t see it as a perfect strategy for DR:
- DC recovery from snapshots is not supported, not recommended, and if you want to use it you have to know how to deal with it.
- With a single DC, even a virtualized one, we will still experience outages; what we might achieve is a shorter recovery time.
So virtualization … yes … but not for all DCs (still keep some DCs for each domain on a metal box), and do not treat virtualization as the main disaster recovery strategy, especially if you want to rely on snapshots or some similar technology.
What do you think about it?
(1) Of course I’m not pretending to describe the entire scenario, and this isn’t the only scenario which should be covered in our DR plan for DS. I just used this very simplified scenario description as an example.