For the truly curious or geeky among our beloved spacefans, registered users and subscribers, here's the full story of the service outage that happened on Sunday 3/14 and part of Monday. I write this not because I expect that many will read it in full, but because I believe that being transparent about these events is important to create trust between HOS and users of our web service. You'll also get a good idea of what it takes to set up and operate a streaming service.
We did a normal update on the site on Friday afternoon to add new show #904, a Celtic-themed program titled "The Turning Tide." It went very smoothly, with the new banner, the Flickr image gallery, the blog post and the Promo Podcast RSS feed update all coming online within a few minutes of the optimal time (6pm Pacific).
After 9 years of running a live web service we no longer take anything for granted, so I streamed the entire show and confirmed that everything was working as expected. Happy St. Paddy's day, spacefans!
SATURDAY 3.13
...came and went with no problems at all, but since Saturday night/Sunday morning was the change to Daylight Time, we double-checked the system shortly after midnight to make sure our three sets of servers were all on the same time page, so to speak. Everything was chill, and we even indulged ourselves in a bit of telly before heading off for a good night's catch-up sleep.
SUNDAY MORNING 3.14
...the first email arrived from a subscriber around 9:45 AM, letting us know that our player was responding slowly. We couldn't replicate it; all seemed well from here. A couple of hours later, more messages arrived saying that the Flash player was stuck on "Trying connection on port: 1935."
We could replicate that. If you waited long enough, the player would load eventually, but sometimes it could take five or ten minutes. Then you had to wait for the music and the "metadata" — the text information that identifies what's playing. The rest of the site was behaving perfectly. That indicated a problem with our Flash Media Server, not our web server.
We rallied the troops: our webmaster and our near-resident programmer were pulled out of Sunday morning leisure onto active duty. We called emergency tech support for our Content Delivery Network (CDN) and put in a "trouble ticket."
The trouble ticket alerts the technical personnel who actually run the physical computers and their associated server software. It triggers email and/or IMs to everyone who can contribute to a fix. Within a half hour, several people were working on analyzing what was causing the slowdown, including the owner and alpha geek of our CDN.
A Content Delivery Network provides bandwidth and streaming services or "hosting" of servers for companies like us. Our CDN maintains our Flash Media Servers and the associated database servers that provide all the detailed information that makes the site run.

Our web site and many others depend on a combination of several different kinds of servers to function. A group of three is common: a web server, a database server, and for media sites like ours, a content delivery or streaming server like our Flash Media Server (FMS). If a third party provides e-commerce or media security services, yet another server is linked to the group.
Each of these servers resides in a large, secure Data Center operated by an enterprise level hosting company serving multiple Internet service providers (ISPs) and CDNs. To avoid problems local to any one data center, each major function has a back-up or "failover" server, often physically thousands of miles away. Distance hardly matters when huge Internet fiber optic lines connect major data centers at the speed of light.
In our case the web server is in Texas, the database and content servers are deployed in pairs in Chicago and Northern Virginia, and our e-commerce provider's secure data center is in Los Angeles. The matched pairs of content and database servers serve as backups and emergency spares for one another, and also can share the work should our main streaming server ever have more load than it can handle. (So far this has never happened — we have lots of server headroom at HOS.) Last but not least, when you sign in to HOS.com, our e-commerce provider's server has to identify you and the rights associated with your account. All these servers work together in tenths or hundredths of a second to get the job done.
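For the geeky among you, here's a very rough sketch in Python of what that sign-in-and-play chain looks like. Everything here is invented for illustration (function names, addresses, the lot); it shows the shape of the conversation between the servers, not our actual code.

```python
# Hypothetical stubs standing in for the real remote services.
def ecommerce_check(token):
    """E-commerce server (Los Angeles): who are you, and what
    rights does your account carry?"""
    return {"active": token == "valid-token"}

def database_lookup(show_id):
    """Database server (Chicago): the detailed info about a show."""
    return {"title": "The Turning Tide", "show": show_id}

def fms_start_stream(show_id):
    """Flash Media Server (Chicago): hand back a stream address."""
    return f"rtmp://streams.example.net/hos/show{show_id}"

def handle_play_request(token, show_id):
    """Each call below normally crosses the country to a different
    data center, yet the whole chain finishes in a fraction of a second."""
    account = ecommerce_check(token)
    if not account["active"]:
        raise PermissionError("subscription required")
    return {
        "now_playing": database_lookup(show_id),
        "stream": fms_start_stream(show_id),
    }

print(handle_play_request("valid-token", 904))
```

Three machines, three cities, one button press. When every link in the chain answers quickly, you never notice it exists.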
Back at HOS, a couple of hours passed and things were not clearing up. Since the site was not operational and none of our efforts revealed the source of the problem or a cure for the slowdown, we put up a "technical trouble" screen on the home page so users arriving did not have to discover the situation on their own and waste more time writing up problems we already knew about.
That would take some of the heat off our support email. It also would give our CDN time to redirect traffic to our backup set of streaming and database servers. However, when the CDN tested our alternate servers, the same problem appeared. It looked like the problem had been replicated to the spares along with the working data — definitely not the result we had prepared for. Whatever and wherever this issue was, it was nothing we'd ever seen before.
Our webmaster, our programmer, and our CDN's head tech spent all afternoon and evening testing and checking all possible systems to try to figure out what went wrong and where. The tests were as frustrating as the site's behavior: the CDN could find nothing in their diagnostic systems that even indicated there was a problem, let alone a cause for the one we had!
By 9PM Pacific, our Flash programmer started to run another, more sophisticated round of diagnostics to verify that nothing was wrong with our Flash server. It had been running perfectly for several months, and nothing had changed since the recent player update, which went completely without incident. After two hours of probing, he found a very obscure anomaly buried several levels down in the operation tree.
Here at last was the problem. When our Flash server needs a piece of information from our live online database, it sends a request to an associated database server on another machine in the same data center. As a security feature, the database server checks to make sure the request is coming from either our streaming server or our web server and not from, say, a Russian hacker.
It does this by a process known as "reverse DNS lookup." Essentially, it checks the numerical Internet address of the requesting machine against the Internet domain name directory to make sure they agree. If they do, it processes the request and returns the data to the Flash server. If the address of the request doesn't check out, the database server ignores it. The whole process normally takes a small fraction of a second.
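In Python, the kind of check our database server was performing looks roughly like this. The trusted host names are made-up examples, and this is a sketch of the technique, not our actual code:

```python
import socket

# A sketch of a reverse-DNS check like the one our database server
# performs. The trusted names below are made-up examples.
TRUSTED_HOSTS = {"fms.example-cdn.net", "www.example.com"}

def request_is_trusted(ip_address):
    """Ask DNS what name is registered for this numerical address,
    and see whether it belongs to one of our own servers."""
    try:
        hostname, _aliases, _addresses = socket.gethostbyaddr(ip_address)
    except OSError:
        # No reverse record found (or DNS misbehaving): untrusted.
        return False
    return hostname in TRUSTED_HOSTS

# 127.0.0.1 usually resolves back to "localhost", which isn't on
# the list, so a request from it would be refused.
print(request_is_trusted("127.0.0.1"))
```

The hidden catch, as we were about to learn, is that `socket.gethostbyaddr` only works if the DNS servers answer at all.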
Using special testing software, our Flash programmer saw that database requests were not being answered promptly: they were stacking up in an incomplete state, answered late or not at all. Without the data it needed, the Flash player could not finish loading. He wrote a report and posted it on an internal bulletin board that sends email to everyone involved. At that point we all realized we had a problem that was not going to be solved that night and would require assistance from the main data center staff in the morning.
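The failure mode is worth spelling out: each stalled lookup holds its place in line, so new requests pile up behind it. A common defensive pattern (not what our server was doing at the time, which is exactly the point) is to put a hard timeout on the lookup so it fails fast instead of waiting forever. A hypothetical sketch:

```python
import concurrent.futures
import socket

# Run the reverse lookup on a worker thread so we can abandon it.
pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def reverse_lookup(ip):
    return socket.gethostbyaddr(ip)[0]

def lookup_with_timeout(ip, seconds=2.0):
    """Answer within `seconds`, or give up and report failure,
    rather than letting requests stack up behind a dead DNS server."""
    future = pool.submit(reverse_lookup, ip)
    try:
        return future.result(timeout=seconds)
    except (concurrent.futures.TimeoutError, OSError):
        return None  # treat as unanswered; the caller can refuse the request
```

With a guard like this, a dead name server costs each request two seconds instead of five or ten minutes of queueing.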
Now it was nearly 1 A.M. Monday morning. Because our FMS testing showed that 20% of requests were still being fulfilled, we decided to take down the trouble page and allow those who could get through to use the site. We put up a note warning about the problem and apologizing for the delays and slow behavior.
MONDAY 3.15: THE HIDDEN DEPENDENCY
By noon on Monday, our CDN was able to switch over to our alternate streaming and database servers successfully. It turned out they had been running properly all along. The CDN's original cutover failed because it switched only half the pair: the alternate media server was still making inquiries to the impaired database server, so the results were the same. Our Flash programmer realized this and reminded the CDN techs of an esoteric requirement of the Flash Media Server that they had failed to satisfy.
Once they fixed this, we were back on the air with normal site operation using our backup servers. We put up another notice on the home page, and after 27 hours, started to breathe again.
Around the same time our CDN techs figured out the cause of the problem: the Domain Name Servers (DNS) in the Chicago data center had been malfunctioning on Sunday.
DNS servers are deep Internet plumbing, maintained by primary service providers around the world. They are several levels removed from anything we normally see, interact with or control and are generally assumed to work 24x7, which is why we didn't suspect them at first.
We still don't know why or how it happened, but we immediately took steps to eliminate the "hidden dependency." We solved the security issue that reverse DNS lookups protect against in another way, and removed the reverse DNS routine from the database server. This effectively prevents this particular problem from recurring.
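For the curious, one DNS-free way to get the same protection is to compare the requester's numerical address directly against a fixed list of our own addresses. This is only an illustrative sketch (the addresses are standard documentation examples, not ours), and not necessarily the exact fix our programmer used:

```python
import ipaddress

# Illustrative only: these are documentation example ranges,
# not our real server addresses.
TRUSTED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # streaming servers
    ipaddress.ip_network("198.51.100.0/24"),  # web server
]

def request_is_trusted(ip):
    """No DNS involved: just compare the numerical address against
    a fixed allowlist, which can't hang when name servers fail."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in TRUSTED_NETWORKS)

print(request_is_trusted("203.0.113.7"))  # True
print(request_is_trusted("192.0.2.1"))    # False
```

The trade-off: an allowlist has to be updated by hand if a server moves, but it depends on nothing outside our own machines.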
To improve the switch from normal to backup server operation, we're putting in a new "push-button" automatic script, so a partial cutover cannot happen again.
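The idea behind the script is simple: treat a media server and its database server as a single unit, so there is no way to switch one without the other. A hypothetical sketch, with all server names invented:

```python
# Each site is a matched pair; a cutover moves BOTH members at once.
# All names here are invented for illustration.
PAIRS = {
    "primary": {"fms": "fms1.chi.example.net", "db": "db1.chi.example.net"},
    "backup":  {"fms": "fms2.iad.example.net", "db": "db2.iad.example.net"},
}

def cutover(target):
    """Return the full pair to activate. There is deliberately no
    code path that points the media server at one site and the
    database at another, so a half-cutover cannot happen."""
    if target not in PAIRS:
        raise ValueError(f"unknown target {target!r}")
    return PAIRS[target]

print(cutover("backup"))
```

A real script would then push both addresses into the routing or load-balancer configuration in one step; the sketch only shows the all-or-nothing pairing that prevents Sunday's half-switch.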
We're still trying to get an answer about when and why the Chicago DNS servers failed to perform in their usual reliable manner. Since their operation is controlled by a large company two levels above us with whom we do not have a direct relationship, we may never be able to get the full story. If we do, we'll update this post with the details.
WHAT HAVE WE LEARNED?
After 9 years of operation as a web service, we have learned many things about what it takes to make a web site fast, reliable and trouble free. Some lessons, like this occurrence, we've learned the hard way. There always seems to be something you don't know about, don't even know you don't know about, or can't control. Needless to say, this is deeply frustrating to all of us at HOS, and deeply inconvenient to all of you, who just want to listen to good music, not listen to geeky reasons why the site isn't working. Our sincere thanks to all the dedicated engineers who make this stuff work and sacrificed a goodly chunk of their weekend to figure this out.
I wish we could say this is the last time it will happen, but in all honesty we can't. What I can say is that the odds of it happening are getting longer all the time.
:: Stephen Hill