Working with High Performance Computing resources (scale up vs. scale out in the Microsoft World)

This week I got the chance to work with some x86 High Performance Computing hardware at the Donald Smits Center for Information Technology (CIT) at the University of Groningen.

Their hardware typically contains more than 64 logical processors and terabytes of RAM; 10 Gb/s interconnects and plenty of Solid State Disks (SSDs) are not uncommon in their daily routines either. I had the chance to work with systems like Dell’s PowerEdge R910 (G11) and HP’s ProLiant DL980, which were labeled as spare hardware. At a price tag of € 40,000 each, this is no cheap pile of iron.

Installing Windows Server on these systems is easy, since Windows Server Datacenter has supported hardware with up to 256 logical processors and 2 TB of RAM for almost three years now. (Windows Server 2008 R2 RTM’d on July 22, 2009)

High Performance Computing hardware is useful in ‘number crunching’ and ‘database’ scenarios. In the first scenario, a multithreaded application runs on the hardware to analyze data; commonly this data amounts to petabytes per week or month. In the second scenario, the hardware is used as a database solution.

A third viable scenario would be a highly-efficient x86 virtualization solution. You can build this on top of Windows Server 2012 Datacenter Edition with Hyper-V. With Hyper-V guests now capable of addressing 32 logical processors and 1 TB of RAM each, running a couple of these VMs can easily justify the purchase of typical High Performance Computing (HPC) hardware.
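To put those guest maximums in perspective, here is a back-of-the-envelope sketch. The guest limits (32 logical processors, 1 TB RAM) and the DL980’s 160 logical processors come from this post; the 2 TB host RAM figure is an assumption for a fully loaded box:

```python
# Back-of-the-envelope: how many maximum-size Windows Server 2012
# Hyper-V guests fit on one HP DL980?
# Host figures: 160 logical processors (from this post); 2 TB RAM is
# an assumed fully-loaded configuration, not a quoted spec.

HOST_LPS, HOST_RAM_TB = 160, 2          # assumed DL980 configuration
GUEST_LPS, GUEST_RAM_TB = 32, 1         # Hyper-V guest maximums in 2012

by_cpu = HOST_LPS // GUEST_LPS          # 5 guests if CPU were the limit
by_ram = HOST_RAM_TB // GUEST_RAM_TB    # 2 guests if RAM were the limit

print(min(by_cpu, by_ram))              # RAM is the bottleneck: 2 guests
```

So even a fully loaded DL980 runs out of RAM long before it runs out of logical processors when the guests are sized at their maximums.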

Still, I would advise against using HPC hardware as virtualization hosts.

First, let me address the number-crunching use case: for most of these cases Microsoft offers Windows High Performance Computing (HPC) Server. In this version of Windows Server, a large number of standard servers (“compute nodes”) are combined, with the help of a supervisor server (“head node”), into a high-performance cluster (with or without additional “broker nodes” and job schedulers). The method used is scale-out. If you have serious High Performance Computing (HPC) needs, you can use lots of HP DL980s as compute nodes.

Now, for virtualization. Although a single HP DL980 server is more than capable of running the VMs for a typical office automation setup, this is not ideal for the following reasons:

  1. When the server malfunctions, all of its VMs malfunction or go offline with it. This can be solved by adding a second server plus shared storage and transforming the two servers into a Windows Failover Cluster. To avoid a “Dutch Cluster”, though, the second node has to remain passive, so the VMs can fail over or replicate to it. That is a terrifying waste of resources. With more, smaller nodes, a smaller share of resources is needed for redundancy, and thus a smaller share is wasted. Also, more DIMMs in use simply increases the chance of one of them failing: the Dell R910 we worked with recently encountered memory problems and ran more safely in memory-redundancy mode.
  2. You might assume a bigger piece of iron scales like a smaller piece of iron, but it doesn’t. A server like the HP DL980 uses the same type of RAM (DDR3) as its little brothers; it simply takes a lot more DIMMs (128 DIMMs, to be exact). While eating up 128 GB of DDR3 RAM takes a little over 8 seconds, eating up 1 TB may take up to a minute. Booting a serious piece of iron also takes significantly longer than booting a simple one: the HP DL980 with its 160 processors takes 18 minutes without memory checks and 32 minutes with these checks on.
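The second point can be sanity-checked with the numbers quoted above: 128 GB in roughly 8 seconds implies an effective rate of about 16 GB/s, and at that same rate 1 TB indeed takes about a minute:

```python
# Scaling the memory-initialization times quoted above.
# 128 GB in ~8 s implies an effective rate of ~16 GB/s; the same rate
# applied to 1 TB (1024 GB) lands at ~64 s, i.e. "up to a minute".

rate_gb_s = 128 / 8                 # effective throughput: 16.0 GB/s
for ram_gb in (128, 1024):          # 128 GB vs. 1 TB
    print(ram_gb, "GB ->", ram_gb / rate_gb_s, "s")
```

The point is not the exact figures but the linear scaling: eight times the RAM means eight times the wait, on every boot and every memory check.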

Both reasons illustrate why you are better off using off-the-shelf, lower-cost servers: the first bullet shows the problems scale-up has with redundancy and the increased chance of hardware failure; the second shows the problems scale-up has with scaling itself.
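The redundancy argument from the first bullet can be quantified. With N active nodes and one passive spare, the share of purchased hardware sitting idle for failover is 1/(N+1):

```python
# Share of purchased capacity held back as failover headroom in an
# N-active + 1-passive cluster: spares / (active + spares).

def standby_waste(active_nodes: int, spares: int = 1) -> float:
    """Fraction of total nodes that sit idle as failover capacity."""
    return spares / (active_nodes + spares)

print(standby_waste(1))     # two big boxes, one passive: 0.5 (50% idle)
print(standby_waste(10))    # ten small nodes + one spare: ~0.091 (9% idle)
```

Two big boxes waste half the investment on standby capacity; ten small nodes plus one spare waste under ten percent.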

It’s economically more sound to scale out than to scale up.
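A rough price-per-processor comparison makes the same point. The € 40,000 DL980 figure and its 160 logical processors come from this post; the commodity-server price and core count below are placeholder assumptions, not quotes:

```python
# Illustrative only: the DL980 figures are from this post; the
# commodity-server price and logical-processor count are placeholder
# assumptions for a typical 2-socket box of the same era.

dl980_price, dl980_lps = 40_000, 160            # from this post
commodity_price, commodity_lps = 5_000, 32      # assumed 2-socket server

print(dl980_price / dl980_lps)                  # € 250.0 per logical processor
print(commodity_price / commodity_lps)          # € 156.25 per logical processor
```

Under these (admittedly rough) assumptions, every logical processor in the big box costs noticeably more than in the small one, before the redundancy overhead from the first bullet is even counted.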

In a lot of scenarios, Windows Server will not be running on more than 64 logical processors. Showing only 64 processors in the new Task Manager in Windows Server 2012 is one outcome of this way of thinking. The tagline "the power of many, the simplicity of one" for the improved Windows Server 2012 Server Manager also falls into place for me.
