404 Skype Not Found

Centralised architectures can always cause trouble. Not that this is a point in distributed systems' favour, necessarily; look what just happened to Skype, which has suffered a whole day's outage.

We at Telco 2.0, as you may know, are actually a group intellect, structured rather like the brain of a large cephalopod. Rather than one single brain, there is a node for each tentacle, the whole being interconnected by the highest-bandwidth nerve fibres known in nature. Unlike the squid, the Telco 2.0 team uses Skype quite heavily in order to maintain coherence among its multiple cerebellums (cerebella?), so we may be forgiven for feeling a little sporky. We've been reduced to using Google Talk for much of the day.

Telco 2.0 in its natural habitat

So all day, access to Skype has been to all intents and purposes impossible, starting around 1000 hours GMT. The pathology takes the following form: on start-up, the Skype client successfully registers on the network (often with considerable delay), but rapidly logs off again and struggles to reconnect. During the brief intervals of successful operation, the number of logged-in users is very low: between 100,000 and 320,000 according to our own observations.

What was up? Surely the nature of a peer-to-peer network means that there is no single point of failure? Well, everyone speculated, so why not us too?

Skype's architecture is supposed to eliminate single points of failure

Skype is one of the most decentralised of decentralised systems. Much is secret about its workings, but it is well known that some fraction of end-users act as "supernodes", each of which carries part of a distributed directory of Skype names and their current IP addresses. These also act as proxies for users behind firewalls who can't connect directly. The problem of finding a supernode out there is solved by hard-coding the IP addresses of seven "super-supernodes" into the Skype client. As all supernodes know the locations of all other supernodes, once the client has contacted one of the seven, it can be handed off to a topologically handy node.
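For the curious, the bootstrap step described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Skype's actual protocol, which is not public. The names `Supernode`, `BOOTSTRAP_NODES`, `bootstrap` and `make_fake_network` are all our own inventions, the addresses are placeholders from the documentation ranges, and the "distance" figure is a stand-in for whatever notion of topological handiness the real network uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass(frozen=True)
class Supernode:
    address: str
    distance: int  # hypothetical measure of how "topologically handy" the node is


# Assumption: seven hard-coded bootstrap addresses, as the text describes.
# These are placeholder addresses (192.0.2.0/24), not real Skype nodes.
BOOTSTRAP_NODES: List[str] = [f"192.0.2.{i}" for i in range(1, 8)]

# A query function takes a bootstrap address and returns the supernodes it
# knows about, or None if that bootstrap node cannot be reached.
Query = Callable[[str], Optional[List[Supernode]]]


def bootstrap(bootstrap_nodes: List[str], query: Query) -> Optional[Supernode]:
    """Try each hard-coded node in turn; hand off to the nearest known supernode."""
    for address in bootstrap_nodes:
        known = query(address)
        if known:  # reachable, and it returned at least one supernode
            return min(known, key=lambda node: node.distance)
    return None  # none of the hard-coded nodes answered


def make_fake_network(reachable: Set[str], supernodes: List[Supernode]) -> Query:
    """Build an in-memory stand-in for the network, so the sketch runs offline."""
    def query(address: str) -> Optional[List[Supernode]]:
        return supernodes if address in reachable else None
    return query


if __name__ == "__main__":
    nodes = [Supernode("198.51.100.7", 12), Supernode("198.51.100.9", 3)]
    network = make_fake_network({"192.0.2.5"}, nodes)
    print(bootstrap(BOOTSTRAP_NODES, network))
```

Note what happens in this toy version when none of the seven hard-coded addresses answers: `bootstrap` returns `None` and the client has nowhere to go. Whether the real client has a further fallback, we can't say.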

Skype user names are issued by a central server. This generates various cryptographic keys used to authenticate users to each other and to supernodes, as well as to encrypt bearer traffic. Everything is always encrypted. When a Skype client starts up, it tries to contact a super-supernode and presents its credentials. What happens then is not entirely clear; it is suggested that the supernode then carries out some sort of logon process with a central server. As the login details go first to the supernode, and this has the crypto necessary to authenticate the user, one wonders why this would be so.

So what have we learned?

"Don't be religious about any particular technology" sounds like a good lesson. IMS may be horribly over-centralised, but Skype may just have some similar pathologies. And the king of centralised telco engineering, the PSTN, is still the world standard for reliability. As long as you solve the user's problem, nobody will care what technology you use. Until it breaks...

Update: As many people suspected at the time, it was the Windows patch whatdunnit.