Keeping the Messenger service running—on a massive scale

If you read Jeff Kunins' post a few weeks ago about the growth of Windows Live Messenger in the last decade, you already know that Messenger's growth has been pretty phenomenal. With such growth comes passionate feedback from customers, and a unique set of challenges around supporting those customers with a reliable service. My name is Russ Arun, and I'm a Group Program Manager for Windows Live, focusing on the server side of Messenger. In building the Messenger service we've always focused on these core principles:


  • Scale - making it easy to support more customers using a set pattern
  • Reliability - making the system redundant where needed
  • Efficiency - delivering the best service for the least cost

In this post, I’d like to focus on some history, basic architecture, and insights into how we deploy Messenger.

Back in the day

The Messenger service in its early days was built on a variant of Unix. Each time we had to upgrade the Messenger servers, we had to bring the service down, install the upgrade, and then bring the service back up. Of course, we would do this at around midnight Pacific Standard Time. As Messenger became a global service, this became untenable, as it was right in the middle of the day for our customers in Europe and Asia. We also had our share of issues, and we learned as we went along.

I remember one upgrade—this was 6 or 7 years ago—when bringing the service back up caused issues for our customers for much longer than we would have liked. As you can imagine, this was not a good day for the team or for our customers. But we learned from our mistakes. At this critical juncture in the evolution of Messenger, we added a new core principle to our earlier list: what we call “no cloud down.”

No cloud down

No cloud down basically means that the "cloud" servers (where information about your IM connections is stored) are never all down at the same time, so your service is never interrupted. To help us achieve this goal, we first moved all Messenger activity to Windows-based servers. We worked to keep cascading failures from affecting the system as a whole by making various parts of the service redundant. And as with the Hotmail backend architecture, we made it easy to build more capacity by using “clusters” of servers that can be deployed in a single data center or across multiple data centers to service all the traffic. We also made the Messenger client more resilient to network-related issues.

Three types of servers

To get an idea of how Messenger is put together, it is simplest to start from the original architecture. We basically had three server types:


  • Connection servers
  • Presence servers
  • Switchboards

When you sign in to Messenger, your Messenger client talks to the connection server, which holds the session state. After a successful sign-in, a “subscription” is established on the presence server, which keeps track of your online presence and fans this information out to all of your online Messenger contacts. While you and your friend are chatting, the switchboard provides the meeting point where your messages are exchanged. Although Messenger has evolved over time, this basic architecture remains the same.
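
To make the division of labor concrete, here is a minimal sketch (in Python, with hypothetical names—this is not Messenger's actual code) of the three roles: the connection server holding session state, the presence server fanning out status changes to subscribers, and the switchboard relaying messages between two chatting users.

    class ConnectionServer:
        """Holds session state for each signed-in client."""
        def __init__(self):
            self.sessions = {}                      # user_id -> session state

        def sign_in(self, user_id):
            self.sessions[user_id] = {"online": True}


    class PresenceServer:
        """Tracks presence and fans out changes to a user's online contacts."""
        def __init__(self):
            self.subscriptions = {}                 # user_id -> contacts watching that user

        def subscribe(self, user_id, contacts):
            self.subscriptions[user_id] = set(contacts)

        def publish(self, user_id, status):
            # Fan the new status out to everyone subscribed to this user.
            for contact in self.subscriptions.get(user_id, ()):
                print(f"notify {contact}: {user_id} is now {status}")


    class Switchboard:
        """Meeting point where two chatting users exchange messages."""
        def relay(self, sender, recipient, message):
            print(f"{sender} -> {recipient}: {message}")


    # Sign-in, presence fan-out, then a chat message.
    conn, presence, sb = ConnectionServer(), PresenceServer(), Switchboard()
    conn.sign_in("alice")
    presence.subscribe("alice", ["bob", "carol"])
    presence.publish("alice", "online")
    sb.relay("alice", "bob", "hi!")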

The number of connection servers is scaled to the number of users simultaneously logged into the system (its scale is network-connection bound). Approximately 13% of our customers are connected to the service at any one time, but that number has been going up rapidly because of the growth in always-on mobile connections, and the addition of various web sites that now integrate Messenger (for example, Hotmail).
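
As a rough back-of-envelope illustration, here is what that sizing question looks like, using the 300 million customers mentioned later in this post and the roughly 13% concurrency figure above. The per-server connection capacity is purely an assumed example value, not a real Messenger number.

    TOTAL_USERS = 300_000_000           # customers on the service
    CONCURRENCY = 0.13                  # fraction connected at any one time
    CONNS_PER_SERVER = 100_000          # assumed capacity of one connection server

    concurrent_users = TOTAL_USERS * CONCURRENCY
    servers_needed = -(-int(concurrent_users) // CONNS_PER_SERVER)   # ceiling division

    print(f"~{concurrent_users:,.0f} concurrent connections")
    print(f"~{servers_needed} connection servers at {CONNS_PER_SERVER:,} connections each")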

Presence servers are scaled to the amount of data we have about user presence (in what we call the “presence document”) and the number of Messenger contacts (also known as subscriptions) to whom we need to communicate any change of state. The presence server is mainly memory bound; it scales with the complexity of information in the presence document multiplied by the number of users hosted on each presence server.
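
To see what "memory bound" means in practice, here is a small illustrative calculation. All of the numbers below (document size, per-subscription overhead, contacts per user, users per server) are assumptions chosen for the example, not real Messenger figures.

    PRESENCE_DOC_BYTES = 4 * 1024       # assumed size of one presence document
    BYTES_PER_SUBSCRIPTION = 64         # assumed bookkeeping per subscription entry
    AVG_CONTACTS = 100                  # assumed average contacts (subscriptions) per user
    USERS_PER_SERVER = 500_000          # assumed users hosted on one presence server

    per_user = PRESENCE_DOC_BYTES + AVG_CONTACTS * BYTES_PER_SUBSCRIPTION
    total_gib = per_user * USERS_PER_SERVER / 2**30

    print(f"~{per_user / 1024:.1f} KiB per user, ~{total_gib:.1f} GiB of presence state per server")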

The scale of the switchboard is bound by the number of users simultaneously having IM conversations. This simplicity in design lets us scale up reasonably well. For example, if we see an increase in the average number of Messenger contacts that users have, we scale up by increasing the number of presence servers, or increasing the memory in each presence server. If a presence server goes down, in this simple design, we can recreate the subscriptions from all the connected Messenger clients.
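
The recovery step in that last sentence can be sketched very simply: because each client already knows its own contact list, the subscriptions lost with a failed presence server can be rebuilt by having the still-connected clients re-register with a replacement server. The classes and method names below are hypothetical, purely to illustrate the idea.

    class PresenceServer:
        def __init__(self):
            self.subscriptions = {}     # user_id -> contacts watching that user

        def subscribe(self, user_id, contacts):
            self.subscriptions[user_id] = set(contacts)


    class Client:
        def __init__(self, user_id, contacts):
            self.user_id = user_id
            self.contacts = contacts    # the client keeps its own contact list

        def resubscribe(self, server):
            # Recreate this user's subscriptions from client-side state.
            server.subscribe(self.user_id, self.contacts)


    clients = [Client("alice", ["bob"]), Client("bob", ["alice", "carol"])]
    replacement = PresenceServer()      # stands in for the failed server
    for c in clients:
        c.resubscribe(replacement)
    print(replacement.subscriptions)    # the lost subscription state is rebuilt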

To decrease the latency of bringing up new presence servers, we've optimized in various ways, but the core architecture based on these three server types remains the same. We've also done various optimizations to reduce the load on our service where possible. For example, file transfers (when you send a file to someone else in Messenger) happen peer-to-peer (P2P): the file goes directly from one computer to the other without ever being stored on our servers. As we've added other features and expanded Messenger to span multiple data centers, we've kept true to the simplicity of the original design and retained our ability to scale by each server type.

Working together

We've also worked hard to make the client and server work well together, while taking into account the realities of the Internet. For example, when a Messenger client logs into our identity system, it gets a ticket with a timeout, or “lease,” which expires at a specific time. If the Windows Live ID identity system has any service-related issues, or the network connection from the client to the service drops, the ticket allows the client to keep accessing the Messenger service without the user having to sign in again, as long as the lease has not expired. This allows the client to provide our customers with the necessary service, while preserving the authentication guarantee from our Windows Live ID identity system.
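
The lease idea can be sketched as a simple client-side check: reuse the cached ticket until its expiry time passes, and only then go back to the identity system. The names and the sign-in callback below are hypothetical illustrations, not the real Windows Live ID API.

    import time

    class CachedTicket:
        def __init__(self, token, lease_seconds):
            self.token = token
            self.expires_at = time.time() + lease_seconds

        def is_valid(self):
            return time.time() < self.expires_at

    def get_ticket(cached, sign_in):
        """Reuse the cached ticket while its lease holds; otherwise sign in again."""
        if cached is not None and cached.is_valid():
            return cached               # no round trip to the identity service needed
        return sign_in()                # lease expired: a full sign-in is required

    # A ticket with a one-hour lease rides out transient identity-service or network issues.
    ticket = CachedTicket(token="opaque-ticket-bytes", lease_seconds=3600)
    ticket = get_ticket(ticket, sign_in=lambda: CachedTicket("fresh-ticket", 3600))
    print(ticket.token)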

Lately, our Messenger service for the web is being adopted more than ever before, both by our own Windows Live Hotmail service and by various non-Microsoft websites. For ease of integration with non-Microsoft services, developers can use the Messenger Web Toolkit, which comes with an SDK (Software Development Kit) that allows any web service to add a Messenger web bar to any relevant page. This adds value for the services that adopt it by allowing their customers to chat with Windows Live customers without leaving their sites.

Messenger also provides federation (interoperability) with enterprise networks using Microsoft Office Communications Server (OCS). We make the connection between Windows Live and the selected enterprise network using the SIP protocol, the same method we use to connect Windows Live Messenger with Yahoo! Messenger. These federation services bring networks of massive scale together. Messenger servers can also talk to multiple endpoints: desktop, web, and mobile. Many higher-end mobile phones have Windows Live Messenger mobile clients available. It is also possible to communicate from Messenger to a mobile phone through SMS (text messaging). Someone signed in to Messenger on their desktop can add a phone number to a Messenger contact (in selected countries). The desktop Messenger user can then send IMs to their friend's mobile device, and the friend can reply from their phone via SMS, with each person using the method that works best for them. To enable this, Messenger is integrated with the SMS network.
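
A simplified way to picture this interoperability is as a routing decision made per contact: phone-number contacts go out through an SMS gateway, federated (OCS or Yahoo!) contacts go out over SIP, and everyone else stays on the switchboard. The gateway names and contact fields below are illustrative assumptions, not Messenger's actual routing code.

    def route_message(contact, message):
        """Pick a delivery path based on what we know about the contact."""
        if contact.get("phone_number"):
            return ("sms_gateway", contact["phone_number"], message)
        if contact.get("federated_domain"):
            # Enterprise (OCS) and Yahoo! federation both travel over SIP.
            return ("sip_gateway", contact["federated_domain"], message)
        return ("switchboard", contact["messenger_id"], message)

    print(route_message({"messenger_id": "bob@example.com"}, "hi"))
    print(route_message({"federated_domain": "contoso.com"}, "hello"))
    print(route_message({"phone_number": "+15555550123"}, "on my way"))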

Careful planning for updates

Given all of the interoperability requirements, you can imagine the carefully choreographed operation it takes to update the Messenger desktop client. Each update has to interoperate with multiple versions of Messenger, in multiple languages, and in multiple countries. The bandwidth needed for such an update is a sizable portion of the overall bandwidth consumed on an average day by all of Microsoft. This of course means that the Messenger servers have to support multiple protocol dialects and multiple versions of a particular dialect at a time. Similarly, upgrading any Messenger server type requires that the upgraded servers continue to work across the rest of the server types and honor all the supported protocols.
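
One way to picture the multi-dialect requirement is as a version negotiation: the server keeps honoring a window of older dialect versions, and each client and server settle on the newest version they have in common. The version numbers below are illustrative, not the real protocol's.

    SERVER_SUPPORTED = {15, 16, 17}     # dialect versions this server still honors

    def negotiate(client_supported):
        """Return the newest dialect version both sides support, or None if none."""
        common = SERVER_SUPPORTED & set(client_supported)
        return max(common) if common else None

    print(negotiate([14, 15]))          # an older client lands on 15
    print(negotiate([16, 17]))          # a newer client lands on 17
    print(negotiate([12, 13]))          # too old: no common dialect -> None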

The current Messenger system, which supports more than 300 million customers, runs on a few thousand servers spread across multiple data centers worldwide. Apart from the three server types mentioned above, we also have servers that automatically manage the system, reducing human intervention to a minimum. Because Messenger is mostly stateless, automated management is easier than it is for other, "stateful" services. The same quality also keeps down the cost of running the system.

I hope that I've been able to convey a sense of the architecture of the Messenger service, how we are able to scale it by server type, and some of the nuances of running the service.

Russ Arun
Windows Live Messenger

Russ Arun was recently honored for Outstanding Technical Leadership by members of the Microsoft Technical Community Network (TCN). The award recognizes Russ’s role in spearheading dramatic improvements in the Windows Live Communications Platform.

