Five ways to step up software reliability

Time for site reliability engineering: 'It is one thing to introduce new tools and agile and lean techniques, but if the culture of the organization is ineffective, the efforts will be futile.'
Written by Joe McKendrick, Contributing Writer

Now that DevOps is a necessity and no one can afford to have systems go down, or even slow down, the practice of site reliability engineering (SRE) has become a must-have. SREs, who connect operations and development, are in hot demand.


There is a major difference between companies with high-functioning SRE organizations and those that have yet to grasp the practice, a recent study published by Constellation Research finds. "Laggards are one major incident away from a disaster," says Andy Thurai, an analyst with Constellation and author of the report. "Having a mature DevOps organization is just not enough to win in a digital economy. A mature SRE organization that takes a software engineering approach to IT operations is necessary to provide reliability and resilience to the code velocity that comes out of mature DevOps organizations."

Culture and mindset are everything. "The mentality of IT as a cost center, or the thought that your systems are invincible, needs to change," says Thurai. "The whole idea of SRE is to make software reliable and to be prepared for unplanned downtime. It is one thing to introduce new tools and agile and lean techniques, but if the culture of the organization is ineffective, the efforts will be futile."

To develop a high-functioning SRE practice, Thurai offers the following recommendations:

Open up the organization: "Organizations need to foster one-team collaboration, the elimination of silos, a safe environment where people are free to raise concerns and issues, a continuous-improvement approach, autonomy for teams, and an empathetic approach to team negotiation," Thurai urges.  

Bring in artificial intelligence and machine learning: "Using AI and ML reduces a lot of noise and improves the noise-to-signal ratio. Avoiding alert fatigue helps reduce toil and burnout by enabling SRE professionals to chase only the major incidents and spend the rest of their time productively in coding and automation efforts." 

Invest in the right tools: AIOps, observability, incident management, and IT automation tools can play a critical role in boosting an SRE effort. "When it comes to crisis and incident management in the cloud/digital era, hope is not a strategy," says Thurai. Investing in the right tools is "key in enabling digitally efficient organizations to survive and thrive."

Automate the infrastructure: "Automating the infrastructure is a must to reduce or eliminate toil with SREs. In addition to scaling up/down based on demand, Kubernetes orchestration, and cluster management, organizations can also use automation during an incident to automate simpler fixes without the need to involve an engineer."
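To make the idea of automating simpler fixes concrete, here is a minimal Python sketch of one way it might look: a lookup table that maps known incident types to scripted fixes and escalates everything else to a person. This is an illustration only, not something drawn from the Constellation report. The incident kinds, the `RUNBOOK` table, and the fix functions are all hypothetical names, and the fixes return a description instead of touching real systems.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Incident:
    kind: str    # hypothetical incident type, e.g. "disk_full"
    target: str  # hypothetical host or service name


def clear_temp_files(incident: Incident) -> str:
    # Stand-in for a real cleanup action; only describes what it would do.
    return f"cleared temp files on {incident.target}"


def restart_service(incident: Incident) -> str:
    # Stand-in for a real restart action; only describes what it would do.
    return f"restarted {incident.target}"


# Hypothetical runbook: incident kinds considered safe to fix automatically.
RUNBOOK: Dict[str, Callable[[Incident], str]] = {
    "disk_full": clear_temp_files,
    "service_hung": restart_service,
}


def handle(incident: Incident) -> str:
    """Apply a scripted fix if one is registered, otherwise escalate."""
    fix = RUNBOOK.get(incident.kind)
    if fix is None:
        return f"escalated {incident.kind} on {incident.target} to on-call engineer"
    return "auto-fixed: " + fix(incident)


if __name__ == "__main__":
    print(handle(Incident("disk_full", "web-1")))
    print(handle(Incident("data_corruption", "db-1")))
```

In this sketch, only incident types someone has deliberately added to the table are fixed without a human, so anything unfamiliar still reaches an engineer.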

Hire and train the right personnel: "The initial mix of personnel should be geared toward incident identification, escalation, and manual fixes," Thurai advises. As things progress, "the toil should eventually decrease and the SRE team members should be able to concentrate on automating or doing other productive work rather than escalating and chasing incident tickets manually."
