Would you like to start a conversation with other industry leaders to brainstorm a challenge or to just know more on a particular topic?
Customer experience is a criticalmission statement for any successful business. Very often customer experience is influenced by the usability and accessibility of corporate websites, mobile apps, and other technology interfaces, which are expected to be always available 24/7. In addition, all companies are becoming quasi-technology companies, driving transformative changes. Companies are forced to build their entire business strategy around cloud capabilities, making this a significant operational challenge for them. The error margins are very low - almost negligible. Any degradation of performance and customer experience results in a loss of revenue, margins opportunities, and reputation.
We are living in an environment where “agility”, “5 9s”, “shift left”, and “self-help & self-heal” are becoming the buzz words of every engineer and C’s in the IT corporate world.
In this era, where both man and machine are finding a way to earmark a unique space for themselves, SRE (Site Reliability Engineering) is fast becoming the new mantra within the IT corridors. Several new and progressive companies have begun to talk and invest heavily in creating this new function.
The agile movement promotes the importance of collaborative efforts between cross-functional teams and this gave birth to DevOps. DevOps is about drilling down on your own organization’s specific problems and challenges. It is also about achieving speed, efficiency, and quality. In essence, it is a culture, a movement, a philosophy of values, principles, methods, and practices to achieve the desired outcome for the organization. This velocity has created some instability where developers were moving faster than ever and creating a challenge for the operations teams. The IT operations teams were not equipped to deal with such speed, creating significant bottlenecks and backlogs for them. Not being able to cope with the pace lead to uncontrolled instability in production where systems became unreliable. Thus, a need for SREs got created.
The SRE discipline allows teams to design and operate scalable and resilient systems using a software engineering approach. It replaces the traditional approach to operations with something more agile. It applies engineering expertise to operations and infrastructure problems, which increases reliability, quicker deployments, and a well-defined system environment. Gartner defines SRE as a “collection of systems and software engineering principles used to build and operate resilient distributed systems at scale”, or to put it in simple words, using automation to create ‘self-healing’ systems. SRE acts as a complement to DevOps practices through managing the risks of rapid change by promoting resilience, accountability and innovation.
The SRE role was developed by a Google employee, Ben Treynor Sloss, VP of Engineering as a way of making sure the software, sites, and applications that Google was deploying were running nearly all the time (even Google breaks 0.01% of the time). He developed a new way of constructing operations teams, so that operations and development would stop squabbling over who broke what and why. He puts it very simply by saying “that’s what happens when you ask a software engineer to design an operations function.” The SRE role is a combination of traditional system administrator tasks and coding.
In a typical development lifecycle, the development teams are keen to launch new features all the time and see them adopted by users. On the other hand, the operations teams want to make sure the service does not break while they are holding the fort. Since most outages are caused by change - a new configuration, a new feature launch - the two teams’ goals are fundamentally in tension. Expert site reliability engineers can craft solutions that create a balance between development and operations teams.
Incorporating aspects of software engineering into the operations and infrastructure functions has numerous benefits, the most notable being more constant uptime and service resiliency. Other benefits SRE offers include:
Key misunderstandings, and thereby the value from SRE, are mystified with confounding service level objectives which focusses more on early failure detection, cause & analysis, and firefighting in production than improve systems and tools.
There are a few things that need to be understood before an SRE can be setup in an organization:
SRE clearly needs commitment and follow-through to succeed. While the engineering teams are launching new features and products, SRE should make sure that the service doesn’t break, and these two goals or objective should work in synergy.
The SRE manager is in charge of managing a team of people dedicated to proactively building reliability into the product. Since reliability in highly complex, integrated systems typically crosses between multiple programming languages, third-party services and integrations – as well as software and hardware – an SRE team needs to be skilled in both IT operations and software development.
SRE managers also need a breadth of knowledge and an ability to pull different disciplines together for one common goal: proactively building resilience into the IT infrastructure and applications.
Amongst others, a few basic responsibilities of the SRE manager are enumerated below.
Along with the normal administrative work required from a people manager, SRE managers need to know how different disciplines can come together on an SRE team. It is important that the SRE managers are also well-connected with the broader IT, engineering and business teams – staying up-to-date on feature development and assessing how it could affect the system’s overall reliability.
Service-level objectives, service-level agreements and service-level indicators are essential to SRE teams’ activities. The SRE manager should define the availability SLO (internal metrics) of the system, provide an SLA to the business and engineering teams to determine the extent of availability they can promise to the customers. The teams can then start to track SLIs to evaluate whether the system is meeting the required percentage of availability.
SRE managers are also in charge of project planning and task prioritization. It is important that SRE managers sit in on quarterly planning with the larger teams. This will enable the SRE team to build features and functions that proactively monitor the health of new features, communicate observations to the rest of the team and add reliability to the overall architecture.
More technically, the SRE manager will be tasked with improving the overall observability of the team’s applications and infrastructure. In Google’s SRE eBook, they laid out four golden signals of SRE monitoring. These constitute latency, traffic, error rate and saturation. While these signals are only the start of building a highly observable system, implementing the four golden signals is a great start for any SRE manager to be able to identify areas for improvement, prioritize future work and learn from the way their system behaves.
SRE managers should proactively run tests through their applications and infrastructure and take advantage of “game days” to practice the human element of incident response. By doing so, SRE teams would be increasing system reliability at every turn.
There are several great examples of companies which have successfully implemented SREs within their organizations. Some of the shining ones are:
It is important for an enterprise to operate reliably. An SRE-mindset should exist throughout the enterprise – not only in a dedicated SRE team. In fact, many organizations are actively integrating SRE roles into all of operational/ IT roles. This only implies the addition of an SRE manager’s passion for technical expertise, collaboration and organizational transparency. It will further enable the SRE culture to be an integral part of the organizational ethos. As more organizations adopt the SRE process, it becomes ingrained in the organizational culture – making reliability a core principle for all business operations.
ABOUT THE AUTHOR
Sugandhi is the Vice President of Technology Operations at an MNC. She is a technology leader with 20+ years of global experience managing IT shared services, having led multi-function application and infrastructure shared service operations. She has worked for multiple organizations like Ramco system, Modi Xerox, American Express, AXA and Auotdesk and has led several transformation programs setting up COE’s for IT shared services.