What is site reliability engineering and why do you need it?

3 performance engineering use cases

Dec. 6, 2019 | By Brian Tagami


Whether it's an award show, a major retail sale or a streaming event, every year or iteration gets bigger in both scale of production and entertainment. Viewership grows, and audiences stream from more kinds of devices. The result is heavy demand for bandwidth and connectivity, along with more secured e-commerce transactions from connected consumers responding to real-time ads and promotions. As every industry deepens its relationships with its most loyal customers, its evolving customer experiences, technology and operations need to be supported by a site performance strategy that meets and exceeds consumer expectations. A DevOps solution is traditionally the first thing that comes to mind, but site reliability engineering is equally important and plays a key role in managing reliability and application scaling.

What is site reliability engineering?

Site reliability engineering (SRE) bridges the gap between development and operations, combining software and systems engineering to build large-scale and highly protected systems. It uses automation and orchestration capabilities to scale security and performance, ensuring sites are reliable and efficient.

Site reliability engineering is critical to many cloud, DevOps and automation initiatives, and includes tactics like:

Test coverage
Load balance testing
CI/CD best practices
Modernizing legacy systems
Executing integrations
Platform configuration management
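
Of the tactics above, load testing is the easiest to illustrate in a few lines. The Python sketch below is a minimal example under stated assumptions, not a production tool: the names `run_load_test`, `timed_call`, `percentile` and `fake_checkout` are hypothetical, and the handler only simulates a request with a short sleep. It calls the handler from several threads at once and reports the error rate and latency percentiles, the kind of numbers a team might review before a high-traffic event.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(handler):
    """Call handler once and return (succeeded, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        handler()
        succeeded = True
    except Exception:
        succeeded = False
    return succeeded, time.perf_counter() - start


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int, 1-100)."""
    rank = max(1, (len(sorted_values) * pct + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def run_load_test(handler, total_requests=200, concurrency=20):
    """Call handler total_requests times across concurrency threads and summarize."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(lambda _: timed_call(handler), range(total_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    failures = sum(1 for succeeded, _ in results if not succeeded)
    return {
        "requests": total_requests,
        "error_rate": failures / total_requests,
        "p50_seconds": percentile(latencies, 50),
        "p95_seconds": percentile(latencies, 95),
    }


if __name__ == "__main__":
    def fake_checkout():
        # Hypothetical stand-in for a real request to the system under test.
        time.sleep(0.01)

    print(run_load_test(fake_checkout, total_requests=50, concurrency=10))
```

In practice the handler would issue a real request against a test environment, and the thresholds for an acceptable error rate or 95th-percentile latency would come from the team's own targets.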

Site reliability engineering use cases by industry

Media and entertainment
Consider this: In 2019, the Oscars had an audience of 29.6 million viewers, Netflix recorded more than 158 million subscribers, the Super Bowl had 98.2 million viewers and the Women's World Cup reached a record 1.12 billion viewers. Traditional forms of media and entertainment are forced to transform as consumers can reach, stream, interact with and recommend virtually anything on any platform and any device. How do you prepare for viral events or simultaneous surges? What happens when a stream cuts out or the broadcast glitches? The best case is a lost opportunity; the worst case is irrecoverable brand damage, since a Twitter storm of customer complaints can even affect brand perception among people who aren't active users. Site reliability engineering and site resilience testing help guard against that outcome. Through a collaborative effort around architecture integrations and stress testing, site reliability engineers monitor site performance to keep sites protected and operating smoothly, without interruptions that could have significant brand and financial impacts.

Retail
The holiday season has become synonymous with shopping and major Black Friday sales. But instead of braving the crowds at brick-and-mortar stores, more customers prefer to shop online. In fact, Black Friday 2019 broke records with $7.4 billion in online sales. And it doesn't stop there: many retailers offer discounts from as early as a month before Black Friday through Cyber Monday, meaning an explosive flood of website visitors at the end of each year. Incorporating site reliability engineering into systems architecture checks and technical operations support processes, much like a holiday health check, helps prevent site crashes and outages so customers receive the best experience. It also bolsters risk management frameworks, adding a layer of security for customer identity and private information.

Consumer products and interactive media
The Walt Disney Company is undoubtedly the biggest player in diversified entertainment, and its parks and resorts segment is unmatched. Beyond the resorts themselves, Disney's consumer products operations, a suite of technologies that includes automation, personalization, IoT, wearables, mobile applications and mobile ordering systems, have evolved to provide a customer experience that spans the physical and the digital. These innovations, combined with heavy daily traffic, require reliability engineering to identify issues, develop roadmaps to triage risks, and manage tooling and automated features.

An engineering approach to operations

As software systems grow exponentially, success will require more than just adopting DevOps or incorporating a security lens. Especially in cloud-native environments, a site reliability engineering approach will be needed more than ever to keep systems and sites scalable and reliable. Think: software engineering in one hand and development operations in the other, a powerful combination that enables fast issue resolution.

Brian Tagami is the managing director for TEKsystems’ communications, entertainment and media division. He has more than 15 years of industry experience managing enterprise customer relationships and delivering IT, creative, wireless engineering and field services solutions worldwide. Based in Seattle, he is continuously exploring and building new partnerships with industry leaders and analysts.
