Intro
I've spent the last couple of months helping to improve the performance and scalability of a large web-based system. Initially, the application could barely handle more than a handful of concurrent users, far short of the launch target of several million users per day. This will probably be the first in a series of posts in which I'd like to talk about some of the more interesting challenges we've faced.
A bit of context first: the application was written in C#. The main technologies involved were WCF, MSMQ and SQL Server. Roughly speaking, the app consisted of a presentation tier (IIS/MVC), a business logic tier (WCF) and a data tier (SQL Server). Application logic was exposed through a large number of WCF service endpoints. Each service endpoint, in turn, exposed a similarly large number of fairly fine-grained service operations. Essentially, three groups of clients consumed these services: the presentation tier, other WCF services inside the business logic tier and a number of mobile clients (which I won't talk about here).
Intra-Tier Communication and Horizontal Scaling
Inside the business logic tier, there was quite a lot of communication going on between the individual WCF services. Initially, many of these calls were routed through the WCF stack. The rationale behind this design decision was that, if needed at a later stage, some services could be moved onto separate machines and run there individually.
It seemed unlikely, however, that this would ever happen. The practical consequence of the design was that services communicated with each other via network calls even though they were running inside the same process. And even if the WCF service layer were partitioned and distributed onto separate machines, what would happen if that still didn't give us the desired performance? Imagine you've got three WCF services: s1, s2 and s3. Assuming that s3 is the most hardware-hungry one, we could deploy s1 and s2 together on one machine and s3 separately on a dedicated machine. What happens, though, if the hardware on which s3 is running is still insufficient? At this point, we would have to scale the service out horizontally by adding a load balancer and more machines, each running a copy of s3. So, if we probably need to scale out horizontally anyway at some stage, what's the point of adding the overhead and complexity of network calls between services that can run in the same address space? To quantify this point, we measured how many WCF service calls we could make in a given period (using the net.tcp binding) and compared this against making direct in-memory calls to the same service instance. The unsurprising result: throughput for the latter was about 350 times higher. Consequently, we went through the codebase and replaced WCF service calls with normal method invocations wherever possible, as the sketch below illustrates.
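Here's a minimal sketch of what that change looked like. The contract, the endpoint address and all the names are invented for illustration; they are not from the actual codebase.

```csharp
using System;
using System.ServiceModel;

// Hypothetical service contract, invented for illustration.
[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    decimal GetOrderTotal(int orderId);
}

public class OrderService : IOrderService
{
    public decimal GetOrderTotal(int orderId) { /* ... */ return 0m; }
}

public static class OrderTotals
{
    // Before: every call goes through the full WCF stack over net.tcp,
    // even though caller and service live in the same process.
    public static decimal ViaWcf(int orderId)
    {
        var factory = new ChannelFactory<IOrderService>(
            new NetTcpBinding(),
            new EndpointAddress("net.tcp://localhost:9000/OrderService"));
        IOrderService proxy = factory.CreateChannel();
        try
        {
            return proxy.GetOrderTotal(orderId);
        }
        finally
        {
            ((IClientChannel)proxy).Close();
            factory.Close();
        }
    }

    // After: a plain method invocation against the same implementation,
    // running in the same address space.
    private static readonly IOrderService Local = new OrderService();

    public static decimal InProcess(int orderId)
    {
        return Local.GetOrderTotal(orderId);
    }
}
```

In practice the local instance wouldn't be constructed inline like this; it would be resolved through whatever composition mechanism the application already uses. The point is simply that the call site no longer pays for serialisation and the network.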
In a recent email conversation, my colleague Martin Fowler drew an interesting analogy: "It's interesting that there continues to be this desire to distribute different functionality onto different nodes in the name of scalability when often the better route is to put all nodes in the same process and cluster the resulting app. This was exactly the wrong thought that distributed objects suffered from."
Inter-Tier Communication and Horizontal Scaling
The presentation tier was physically separated from the WCF services running in the business logic tier. Consequently, communication between the two tiers had to happen over the network.
Let's go back in time a little. Back in 1994, Waldo et al. wrote their excellent and seminal paper "A Note on Distributed Computing". In it, they argue that there are fundamental differences between local, in-process calls and remote calls in terms of latency, concurrency, partial failure scenarios and so on. RPC systems have long tried to abstract these differences away and make developers believe that there's no difference between calling an object in the same memory space and executing a procedure on a remote machine.
Don't get me wrong, I think WCF is actually a pretty cool platform, but unfortunately it also encourages people to continue building RPC apps in cases where other solutions might be more suitable. In fact, it makes it horribly easy to take a bunch of classes and expose them as remote objects. Calling them services doesn't mean that your application has magically become service-oriented. Nor does it change the fact that these so-called services are still remote objects, with all the flaws that Waldo et al. described.
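As a sketch of just how little it takes (again with invented names, not code from the real system), two attributes and a ServiceHost are enough to turn an ordinary repository-style class into a set of remote objects, with one network round trip per getter:

```csharp
using System;
using System.ServiceModel;

// Hypothetical fine-grained contract: each getter is now a network round trip.
[ServiceContract]
public interface ICustomerService
{
    [OperationContract] string GetName(int customerId);
    [OperationContract] string GetEmailAddress(int customerId);
    [OperationContract] decimal GetAccountBalance(int customerId);
}

public class CustomerService : ICustomerService
{
    public string GetName(int customerId) { /* ... */ return "n/a"; }
    public string GetEmailAddress(int customerId) { /* ... */ return "n/a"; }
    public decimal GetAccountBalance(int customerId) { /* ... */ return 0m; }
}

public static class CustomerServiceHosting
{
    public static void Main()
    {
        using (var host = new ServiceHost(typeof(CustomerService)))
        {
            host.AddServiceEndpoint(typeof(ICustomerService),
                new NetTcpBinding(),
                "net.tcp://localhost:9001/CustomerService");
            host.Open();
            Console.WriteLine("Service running. Press Enter to stop.");
            Console.ReadLine();
        }
    }
}
```

Nothing about this design has become service-oriented; it's the same fine-grained object, now reachable over the wire.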
Indeed, we had to fix a lot of code where developers happily looped over hundreds or thousands of items in order to retrieve some data, unaware that each iteration was making a network call. It goes without saying that this had a significant impact on the performance of the system. After removing all unnecessary calls, we measured the overhead incurred by network communication again. On average, more than 30% of our total service execution time was still network overhead (serialisation, the WCF stack, TCP, network latency and so on).
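The fix for these loops was almost always the same: replace the per-item operation with a coarse-grained one that carries the whole batch in a single round trip. A sketch of the pattern, with hypothetical contract and names:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.ServiceModel;

// Hypothetical contract illustrating the chatty pattern and its batched replacement.
[ServiceContract]
public interface IProductService
{
    // Fine-grained: one network round trip per product.
    [OperationContract]
    decimal GetPrice(int productId);

    // Coarse-grained: one round trip for the whole batch.
    [OperationContract]
    IDictionary<int, decimal> GetPrices(IList<int> productIds);
}

public class BasketPricing
{
    private readonly IProductService products;

    public BasketPricing(IProductService products)
    {
        this.products = products;
    }

    // Before: looks like an innocent loop, but each iteration pays for
    // serialisation, the WCF stack, TCP and network latency.
    public decimal TotalChatty(IList<int> productIds)
    {
        return productIds.Sum(id => products.GetPrice(id));
    }

    // After: a single service call; the per-item work happens server-side.
    public decimal TotalBatched(IList<int> productIds)
    {
        return products.GetPrices(productIds).Values.Sum();
    }
}
```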
Looking at the chatty WCF service interfaces and their tight coupling to the code in the presentation tier, it occurred to me that we were not really building a distributed application; we were distributing an essentially monolithic application. Unfortunately, we didn't manage to change this one. I'm convinced, though, that the better way is to deploy two such tightly coupled tiers together in the same process and then, again, scale out horizontally.