Netflix Boosts Resiliency Against Outages with New Cloud Approach

By Tara Seals December 05, 2013

Netflix continues to post subscriber gains and original content milestones, and its growth shows no sign of slowing. The online streaming giant is maturing its infrastructure accordingly, announcing that it now balances traffic across its Amazon Web Services cloud environment to add resiliency to its distribution architecture.

In a blog post, Netflix engineers Ruslan Meshenberg, Naresh Gopalani and Luke Kosewski explained that the company now runs simultaneously in AWS’s US East-1 (in Virginia) and US West-2 (in Oregon) regions, balancing user traffic across them.

Last Christmas Eve, Netflix streaming was hit by problems in the AWS Elastic Load Balancer (ELB) service, which routes network traffic to the Netflix services that support streaming. The result was a partial Netflix streaming outage that started at around 12:30 p.m. Pacific Time on December 24 and grew in scope later that afternoon. The outage primarily affected playback on TV-connected devices in the Americas.

Netflix uses hundreds of ELBs. Each one supports a distinct service or a different version of a service and provides a network address that a Web browser or streaming device calls. Netflix streaming has been implemented on over a thousand different streaming devices over the last few years, and groups of similar devices tend to depend on specific ELBs. Requests from devices are passed by the ELB to the individual servers that run the many parts of the Netflix application. Out of hundreds of ELBs in use by Netflix, a handful in the Christmas Eve incident failed, losing their ability to pass requests to the servers behind them.

In June, the company announced the Isthmus project, a fail-over approach to achieve resiliency against a region-wide ELB outage. Under normal operation, traffic would flow through both regions. If one of the regions experienced ELB issues, Netflix would route all the traffic via DNS through the other region.

Now, the company has embarked on the next step, which is a full multi-regional Active-Active solution, where streams run in multiple regions simultaneously. In a normal state of operation, users are geo-DNS routed to the closest AWS region, which splits traffic roughly 50/50. In the event of any significant region-wide outage, the tools are now in place to override geo-DNS and direct all user traffic to a healthy region.
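
The routing behavior described above can be illustrated with a minimal Python sketch. It is not Netflix's implementation: the location labels, the `NEAREST_REGION` table and the `route` function are hypothetical names chosen for illustration, and real geo-DNS works at the DNS layer rather than in application code. The sketch only shows the decision logic: send each user to the closest region, and fall back to a healthy region when the closest one is marked as failed.

```python
from typing import FrozenSet, Optional

# The two AWS regions named in the article.
REGIONS = ("us-east-1", "us-west-2")

# Hypothetical lookup table: which region is closest to a given user location.
NEAREST_REGION = {
    "us-east-coast": "us-east-1",
    "us-west-coast": "us-west-2",
}


def route(user_location: str,
          unhealthy: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Pick a region for a user, overriding geo routing during an outage."""
    # Normal operation: route to the closest region (default to the first one
    # if the location is unknown).
    preferred = NEAREST_REGION.get(user_location, REGIONS[0])
    if preferred not in unhealthy:
        return preferred
    # Override: send the user to any region that is still healthy.
    for region in REGIONS:
        if region not in unhealthy:
            return region
    # No healthy region is left.
    return None
```

With no regions marked unhealthy, `route("us-west-coast")` returns `"us-west-2"`; with `unhealthy=frozenset({"us-west-2"})`, the same call returns `"us-east-1"`, which corresponds to shifting all traffic to the surviving region.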

“At Netflix, our internal availability goals are 99.99 percent - which does not leave much time for our services to be down,” explained the engineers. “So in addition to deploying our services across multiple instances and availability zones, we decided to deploy them across multiple AWS regions as well.”
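
To put the 99.99 percent figure in concrete terms, the short Python calculation below converts an availability target into a yearly downtime budget. The function name and the 365-day year are assumptions made for this illustration, not something taken from the Netflix post.

```python
# Minutes in a 365-day year (leap years ignored for simplicity).
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes per year a service may be down at a given target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# A 99.99 percent target allows roughly 52.6 minutes of downtime per year.
print(round(downtime_budget_minutes(99.99), 1))
```

At that target, a single outage lasting most of an afternoon would use up several years' worth of downtime budget.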

Complete regional infrastructure outage is extremely unlikely, but “our pace of change sometimes breaks critical services in a region, and we wanted to make Netflix resilient to any of the underlying dependencies,” the authors added. “In doing so, we’re leveraging the principles of isolation and redundancy: a failure of any kind in one region should not affect services running in another, a networking partitioning event should not affect quality of service in either region.”



TechZone360 Contributor
