For the truly curious or geeky among our beloved spacefans, registered users and subscribers, here's the full story of the service outage that happened on Sunday 3/14 and part of Monday. I write this not because I expect that many will read it in full, but because I believe that being transparent about these events is important to create trust between HOS and users of our web service. You'll also get a good idea of what it takes to set up and operate a streaming service.
:: SH
FRIDAY 3.12
We did a normal update on the site on Friday afternoon to add new show #904, a Celtic-themed program titled "The Turning Tide." It went very smoothly, with the new banner, the Flickr image gallery, the blog post and the Promo Podcast RSS feed update all coming online within a few minutes of the optimal time (6pm Pacific).
After 9 years of running a live web service we no longer take anything for granted, so I streamed the entire show and confirmed that everything was working as expected. Happy St. Paddy's day, spacefans!
SATURDAY 3.13
...came and went with no problems at all, but since Saturday night/Sunday morning was the change to Daylight Time, we double checked the system shortly after midnight to make sure our three sets of servers were all on the same time page, so to speak. Everything was chill, and we even indulged ourselves in a bit of telly before heading off for a good night's catch-up sleep.
SUNDAY MORNING 3.14
....the first email arrived from a subscriber around 9:45 AM, letting us know that our player was responding slowly. We couldn't replicate it; all seemed well from here. A couple of hours later, more messages arrived saying that the Flash player was stuck on "Trying connection on port: 1935."
We could replicate that. If you waited long enough, the player would load eventually, but sometimes it could take five or ten minutes. Then you had to wait for the music and the "metadata" — the text information that identifies what's playing. The rest of the site was behaving perfectly. That indicated a problem with our Flash Media Server, not our web server.
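(For the geeks in the audience: a first-line check for a symptom like this can be as simple as the sketch below, which just asks whether anything is answering on the RTMP port the Flash player tries, and how long it takes. It's illustrative only; the hostname is a placeholder, not our real server.)

```python
# A minimal sketch of a first-line check: is anything answering on the
# RTMP port (1935) the Flash player tries, and how long does it take?
# The hostname below is a placeholder, not our real streaming server.
import socket
import time

STREAM_HOST = "fms.example.com"  # placeholder hostname
RTMP_PORT = 1935                 # standard RTMP port

start = time.monotonic()
try:
    with socket.create_connection((STREAM_HOST, RTMP_PORT), timeout=15.0):
        print(f"Port {RTMP_PORT} answered in {time.monotonic() - start:.2f}s")
except OSError as exc:
    print(f"No answer on port {RTMP_PORT} after {time.monotonic() - start:.2f}s: {exc}")
```

As the rest of this story shows, our port did answer eventually; the stall turned out to be behind the streaming server, not in front of it.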
We rallied the troops: our webmaster and our near-resident programmer were pulled out of Sunday morning leisure onto active duty. We called emergency tech support for our Content Delivery Network (CDN) and put in a "trouble ticket."
The trouble ticket alerts the technical personnel who actually run the physical computers and their associated server software. It triggers email and/or IMs to everyone who can contribute to a fix. Within a half hour, several people were working on analyzing what was causing the slowdown, including the owner and alpha geek of our CDN.
A Content Delivery Network provides bandwidth and streaming services or "hosting" of servers for companies like us. Our CDN maintains our Flash Media Servers and the associated database servers that provide all the detailed information that makes the site run.
Our web site and many others depend on a combination of several different kinds of servers to function. A group of three is common — a web server, a database server, and for media sites like ours, a content delivery or streaming server like our Flash Media Server (FMS). If a third party provides e-commerce or media security services, yet another server is linked to the group.

Each of these servers resides in a large, secure Data Center operated by an enterprise-level hosting company serving multiple Internet service providers (ISPs) and CDNs. To avoid problems local to any one data center, each major function has a back-up or "failover" server, often physically thousands of miles away. Distance hardly matters when huge Internet fiber optic lines connect major data centers at the speed of light.
In our case the web server is in Texas, the database and content servers are deployed in pairs in Chicago and Northern Virginia, and our e-commerce provider's secure data center is in Los Angeles. The matched pairs of content and database servers serve as backups and emergency spares for one another, and also can share the work should our main streaming server ever have more load than it can handle. (So far this has never happened — we have lots of server headroom at HOS.) Last but not least, when you sign in to HOS.com, our e-commerce provider's server has to identify you and the rights associated with your account. All these servers work together in tenths or hundredths of a second to get the job done.
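To make that division of labor concrete, here's a toy health check over a server group laid out like ours, one probe per machine. Every hostname below is an invented placeholder, and the port numbers are typical defaults rather than our actual configuration:

```python
# A toy health check across a server group laid out like ours.
# All hostnames are invented placeholders; the ports are common
# defaults (the database port in particular is an assumption).
import socket

SERVERS = {
    "web (Texas)":          ("web.example.com",     80),
    "streaming (Chicago)":  ("fms-chi.example.com", 1935),
    "database (Chicago)":   ("db-chi.example.com",  3306),
    "streaming (Virginia)": ("fms-va.example.com",  1935),
    "database (Virginia)":  ("db-va.example.com",   3306),
    "e-commerce (L.A.)":    ("shop.example.com",    443),
}

for role, (host, port) in SERVERS.items():
    try:
        with socket.create_connection((host, port), timeout=5.0):
            print(f"OK    {role}  {host}:{port}")
    except OSError as exc:
        print(f"FAIL  {role}  {host}:{port}  ({exc})")
```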
SUNDAY AFTERNOON
Back at HOS, a couple of hours passed and things were not clearing up. Since the site was not operational and none of our efforts revealed the source of the problem or a cure for the slowdown, we put up a "technical trouble" screen on the home page so users arriving did not have to discover the situation on their own and waste more time writing up problems we already knew about.
That would take some of the heat off our support email. It also would give our CDN time to redirect traffic to our backup set of streaming and database servers. However, when the CDN tested our alternate servers, the same problem appeared. It looked like the problem had been replicated to the spares along with the working data — definitely not the result we had prepared for. Whatever and wherever this issue was, it was nothing we'd ever seen before.
Our webmaster, our programmer, and our CDN's head tech spent all afternoon and evening testing and checking all possible systems to try to figure out what went wrong and where. The tests were as frustrating as the site's behavior: the CDN could find nothing in their diagnostic systems that even indicated there was a problem, let alone a cause for the one we had!
SUNDAY EVENING
By 9 PM Pacific, our Flash programmer had started another, more sophisticated round of diagnostics to verify that nothing was wrong with our Flash server. It had been running perfectly for several months, and nothing had changed since the recent player update, which went completely without incident. After two hours of probing, he found a very obscure anomaly buried several levels down in the operation tree.
Here at last was the problem. When our Flash server needs a piece of information from our live online database, it sends a request to an associated database server on another machine in the same data center. As a security feature, the database server checks to make sure the request is coming from either our streaming server or our web server and not from, say, a Russian hacker.
It does this by a process known as "reverse DNS lookup." Essentially, it checks the numerical Internet address of the requesting machine against the Internet domain name directory to make sure they agree. If they do, it processes the request and returns the data to the Flash server. If the address of the request doesn't check out, the database server ignores it. The whole process normally takes a small fraction of a second.
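In code, a forward-confirmed reverse DNS check looks roughly like this. This is a sketch of the general technique, not our database server's actual routine, and the trusted hostnames are placeholders:

```python
# A rough sketch of forward-confirmed reverse DNS, the general technique
# described above (not our database server's actual routine). The
# trusted hostnames are placeholders.
import socket

TRUSTED_NAMES = {"fms.example.com", "web.example.com"}  # placeholder hosts

def request_is_trusted(client_ip):
    """The IP must reverse-resolve (PTR) to a known hostname, and that
    hostname must forward-resolve (A) back to the same IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)    # reverse lookup
        if hostname not in TRUSTED_NAMES:
            return False
        addresses = socket.gethostbyname_ex(hostname)[2]    # forward lookup
        return client_ip in addresses
    except OSError:
        return False
```

The catch is that this only takes a fraction of a second as long as the DNS servers answer promptly; when they go quiet, the lookup doesn't fail fast, it just waits.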
Using special testing software, our Flash programmer saw that database requests were not being answered promptly, but were stacking up in an incomplete state: each request was either answered late or not at all. Without the data it needed, the Flash player could not finish loading. He wrote a report and posted it to an internal bulletin board that sends email to everyone involved. At that point we all realized we had a problem that was not going to be solved that night and would require assistance from the main data center staff in the morning.
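A toy version of that diagnostic is easy to imagine (our programmer's actual tooling was more sophisticated): run each lookup against a deadline and flag the ones that finish late or not at all.

```python
# A toy version of the diagnostic: run each reverse lookup against a
# deadline and flag the ones that finish late or not at all. The system
# resolver call can't be interrupted directly, so we time it from a
# worker thread. The addresses are documentation placeholders.
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

DEADLINE = 2.0  # seconds; anything slower counts as stuck

def probe_lookup(client_ip, pool):
    start = time.monotonic()
    future = pool.submit(socket.gethostbyaddr, client_ip)
    try:
        future.result(timeout=DEADLINE)
        verdict = "completed"
    except FutureTimeout:
        verdict = "INCOMPLETE (still waiting at deadline)"
    except OSError as exc:
        verdict = f"failed ({exc})"
    print(f"{client_ip}: {verdict} after {time.monotonic() - start:.2f}s")

# Note: a truly stuck lookup holds the pool open on exit until the
# operating system's resolver finally gives up on its own.
with ThreadPoolExecutor(max_workers=4) as pool:
    for ip in ("203.0.113.10", "203.0.113.20"):  # placeholder addresses
        probe_lookup(ip, pool)
```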
Now it was nearly 1 A.M. Monday morning. Because our FMS testing showed that 20% of requests were still being fulfilled, we decided to take down the trouble page and allow those who could get through to use the site. We put up a note warning about the problem and apologizing for the delays and slow behavior.
MONDAY 3.15: THE HIDDEN DEPENDENCY
By noon on Monday, our CDN was able to switch over to our alternate streaming and database servers successfully. It turned out they had been running properly all along. The CDN's original cutover failed because it switched only half the pair: the alternate media server was still making inquiries to the impaired database server, so the results were the same. Our Flash programmer realized this and reminded the CDN techs of an esoteric requirement of the Flash Media Server that they had failed to satisfy.
Once they fixed this, we were back on the air with normal site operation using our backup servers. We put up another notice on the home page, and after 27 hours, started to breathe again.
Around the same time our CDN techs figured out the cause of the problem: the Domain Name Servers (DNS) in the Chicago data center were malfunctioning on Sunday.
DNS servers are deep Internet plumbing, maintained by primary service providers around the world. They are several levels removed from anything we normally see, interact with or control and are generally assumed to work 24x7, which is why we didn't suspect them at first.
We still don't know why or how it happened, but we immediately took steps to eliminate the "hidden dependency." We solved the security issue that reverse DNS lookups guard against in another way, and removed the reverse DNS routine on the database server. This effectively prevents this particular problem from recurring.
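For the curious, one common way to solve it (a sketch with placeholder addresses; the post above doesn't spell out our exact mechanism) is to stop asking DNS who the caller is and instead compare the caller's address against a fixed list of our own machines. Nothing external to consult, so nothing external can hang:

```python
# A sketch of the replacement principle: a static allowlist of our own
# servers' addresses, checked locally with no DNS lookup at all.
# The addresses are documentation placeholders, not our real ones.
import ipaddress

TRUSTED_SOURCES = (
    ipaddress.ip_network("203.0.113.10/32"),  # placeholder: streaming server
    ipaddress.ip_network("203.0.113.20/32"),  # placeholder: web server
)

def request_is_trusted(client_ip):
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in TRUSTED_SOURCES)

print(request_is_trusted("203.0.113.10"))  # True:  known server
print(request_is_trusted("198.51.100.7"))  # False: anyone else
```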
To improve the switch from normal to backup server operation, we're putting in a new "push-button" automatic script, so a partial cutover cannot happen again.
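In outline, such a script enforces one rule: the streaming server and its database server move together or not at all. A schematic sketch follows; all names are invented for illustration, and the real script lives with our CDN.

```python
# A schematic of the push-button cutover rule: the streaming server and
# its database server are switched together, or not at all. All names
# are invented for illustration; the real script lives with our CDN.

SERVER_PAIRS = {
    "chicago":  {"media": "fms-chi.example.com", "database": "db-chi.example.com"},
    "virginia": {"media": "fms-va.example.com",  "database": "db-va.example.com"},
}

def health_check(host):
    """Placeholder for a real probe of the host."""
    return True

def point_traffic(role, host):
    """Placeholder for a real DNS or load-balancer update."""
    print(f"  routing {role} traffic to {host}")

def cut_over(to_site):
    standby = SERVER_PAIRS[to_site]
    # Verify BOTH halves of the standby pair before touching anything...
    for role, host in standby.items():
        if not health_check(host):
            raise RuntimeError(f"standby {role} {host} failed its check; aborting")
    # ...then repoint traffic, always as a pair, never one half alone.
    for role, host in standby.items():
        point_traffic(role, host)
    print(f"cutover to {to_site} complete")

cut_over("virginia")
```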
We're still trying to get an answer about when and why the Chicago DNS servers failed to perform in their usual reliable manner. Since their operation is controlled by a large company two levels above us with whom we do not have a direct relationship, we may never be able to get the full story. If we do, we'll update this post with the details.
WHAT HAVE WE LEARNED?
After 9 years of operation as a web service, we have learned many things about what it takes to make a web site fast, reliable and trouble-free. Some lessons, like this occurrence, we've learned the hard way. There always seems to be something you don't know about, don't even know you don't know about, or can't control. Needless to say, this is deeply frustrating to all of us at HOS, and deeply inconvenient to all of you, who just want to listen to good music, not listen to geeky reasons why the site isn't working. Our sincere thanks to all the dedicated engineers who make this stuff work and sacrificed a goodly chunk of their weekend to figure this out.
I wish we could say this is the last time it will happen, but in all honesty we can't. What I can say is that the odds of it happening are getting longer all the time.
:: Stephen Hill
I've got to say, as a software tester and test manager of 20 years' standing, this was the most lucid explanation of a service fault I've ever seen. Many thanks!
Posted by: Chris Hansen | 17 March 2010 at 03:37 AM
Thank you for taking the time to 'splain this to us.
I must tell you I was not worried when I saw your message on the page. I knew then, as I know now, that you were working on the problem, and that the problem would be solved.
While many of your team were frustrated, confused, irritated and befuddled by your situation, I was sending good vibes your way. Hope some of them reached some of you.
Continued Safe Journeys, and Peace.
Posted by: Michael Cary | 17 March 2010 at 05:14 AM
love you guys here in Milwaukee...Slainte! BUT no log in function today AT ALL!
X O
Chrisanne
Posted by: chrisanne | 17 March 2010 at 08:21 AM
To Stephen:
Thank you for the explanation and for taking the time to explain it.
It is truly refreshing to have that much transparency.
Thanks all for a successful Bug Hunt
Posted by: J Sadler | 17 March 2010 at 09:31 AM
Reply to Chrisanne: All's well here with SIGN IN and everything else. LEAVE HOS.com and empty your browser cache and delete cookies, and all will be well. Please contact [email protected] for personal assistance. I think I found your account and sent you instructions.
Posted by: Leyla Hill | [email protected] | 17 March 2010 at 11:13 AM
I've been a big fan of the Hearts of Space series from quite a ways back, and as I've become more and more familiar with the website and the online services you now offer, my appreciation has only grown. This incident is a case in point. Thanks for keeping your listeners informed. Since you took the trouble of actually detailing the whole incident in the blog, I took the trouble of reading it. Keep up the great work! We understand that stuff happens sometimes ...
Posted by: Frank Florian | 17 March 2010 at 01:52 PM
How can we be upset? We're HOS fans, the most relaxed and well adjusted people on Earth.
No worries and I enjoyed the program once it got through inner space to my mind space.
John Paul
Posted by: John Paul | 17 March 2010 at 06:26 PM
Stephen & Crew:
Thank you for the explanation. No need for transparency, you do more than enough to earn a glitch 'er two. Sorry I did not experience it first hand since I am a bit of a slacker listener, dropping in when I need my HOS fix. BTW, I am just geeky enough to love the story. Weird, I know, but Hey, you've got all kinds of listeners. Sorry.
BTW, anybody ever notice that Stephen sounds like my 50's fav, Rod Serling? No? Well, it's just me.
Till the next time,
Jim
Posted by: jim | 17 March 2010 at 06:32 PM
It was my fault. For the first time since subscribing nearly a year ago, I didn't tune into the Sunday show. My personal activity level exceeded my available time. Had I connected, it would have worked. My humble apology.
Okay, now for the seriousness. Thank you Stephen and Crew. As Chris Hansen said, it was lucid, and all of us appreciate that. Funny thing is I see things like this often. The company for whom I work has two main sites, and runs proxy servers at each site. While not the same as having two separate DNS suites, there are times when one facility's proxy is down and the other simply screams. When that happens, I switch to the other, which is hundreds of miles away and I'm again on my virtual way.
Thank you very much for making the weekly show available for the week. I needed it, and I'm sure others did too.
Best regards,
Graham
Posted by: Graham | 17 March 2010 at 08:02 PM
Thank you for the explanation!! I love reading why and you explain it so well!! One of the best shows on, IMHO.
It continues to amaze me, no matter how great the desire to be otherwise, we are inter-dependent not independent! Be it on people directly, or people through other systems!! This case is just one more example of our need for each other and things we don't even know!!
Kudos to your team, your commitment, and your fans!
Posted by: Jonathan Schrag | 18 March 2010 at 05:24 AM
SH:
You da man.
Get this:
All this, plus geek understanding of reverse DNS lookup, and ability to describe it clearly for non-geeks.
Bet he rides a bicycle, too.
Thank you, SH.
--d
Posted by: David | 18 March 2010 at 08:26 AM
Last Sunday was very quiet around this place while I packed for a week-long trip. I got back in town Thursday and was quite happy to have my beloved HOS back on line.
Thank you for taking the time to explain the intricacies of internet streaming.
Petra
Posted by: Petra Lynn Hofmann | 20 March 2010 at 07:07 AM
Hey, being in the IT field I know things like this happen from time to time, but it was very good of you to take the time to let all your fans know what caused the issue. Sounds like you have a great tech team there.
Posted by: Tony Scott | 20 March 2010 at 08:29 AM
What a remarkable explanation. I think I almost understood all of it. No worries though - I knew you would be back as soon as possible, and I played CD's of music I have found through your program. I love HOS and have for many years. This may be posted somewhere and I just haven't seen it, but who does the beautiful banner graphics every week? They always fit the program so well.
Live long and prosper, space guys and gals.
Jane
Posted by: Jane | 20 March 2010 at 08:15 PM
A very compelling and concise explanation of/for an event over which you had little or no control. The 'geeks' among us appreciate it!
Thank you.
Reggie
Posted by: Reggie White | 20 March 2010 at 09:34 PM
Hello Stephen and all you guys at HOS,
Thanks for the long detailed explanation of all of your problems last Sunday! And for the hard work your techs had to do on their free day!
We live in The Netherlands now, 9 hours ahead in time, and noticed the slow buffering of your site, but after a couple of tries we got the show, our time at 2 in the afternoon. Thank you again!!
Best regards,
Jennifer Hopkins
Posted by: Jennifer Hopkins | 21 March 2010 at 09:54 AM
I tuned in to your beautiful Celtic-themed show, "The Turning Tide," at about 8:30 a.m. Pacific time -- with my clocks turned ahead -- and I was able to listen to the entire show perfectly. So the glitch in Chicago must have happened after 10:30 CDT. For those who did not have the opportunity to listen to it, please run this show again sometime.
Isn't it amazing that we can send signals from Texas to Chicago to Los Angeles to Virginia, all in splits of a second?! And that we can enjoy space music every Sunday (mostly)?! I appreciate reminders from the universe not to take things for granted. :) The incredible team of producers for HOS did a remarkable job; thank you one and all!
Gigi in Springfield, Oregon
Posted by: Gigi | 21 March 2010 at 12:46 PM
Cannot get on to play my free Sunday #905. Have been trying since 10:00pm PST.
HELP!
Posted by: KAT. MCCULLOCH | 21 March 2010 at 11:58 PM
All's well on this end, Kat, or you would have seen a notice on the site. It's best to write to [email protected] for personal assistance--fastest way to get answers next to looking in our Help! section (link next to SIGN IN). The info you need is there: empty cache, delete cookies, refresh our site. Click-by-click instructions if you need 'em.
Posted by: Leyla Hill | [email protected] | 22 March 2010 at 12:45 PM
Graham: we're watching you;-)
Jonathan: couldn't agree more strongly with your comment. It's true of people, and it's true in the natural world.
Others: thank you for the understanding and spacious comments!
:: SH
Posted by: Stephen Hill | 22 March 2010 at 05:35 PM