Incident response: Sorting the big problems from the small ones

PagerDuty aims to help companies get the help they need when systems go down.
Written by Colin Barker, Contributor

PagerDuty CEO Tejada: The developer has become "the architect and the designer of the experience".

Image: Roger Jennings

Nobody likes to think of the code they've written, or the systems they've built, going wrong. But they do, all the time. The trick is to work out which are the big problems and which are the small ones.

PagerDuty wants to help. It's an enterprise incident resolution service that connects with DevOps monitoring stacks, with the aim of streamlining incident management when things break. It lets companies visualize the health of applications and infrastructure and coordinate the response, with the aim of reducing the MTTR -- the Mean Time To Repair -- for customers.
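As a rough illustration of the metric: MTTR is the average time between an incident being opened and being resolved. The sketch below uses made-up timestamps and a hypothetical helper, `mean_time_to_repair`. It is not PagerDuty's code or API, only the arithmetic behind the acronym.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average repair time over a list of (opened, resolved) datetime pairs.

    Hypothetical helper for illustration only.
    """
    if not incidents:
        raise ValueError("need at least one resolved incident")
    # Sum the individual repair durations, starting from a zero timedelta.
    total = sum((resolved - opened for opened, resolved in incidents), timedelta())
    return total / len(incidents)


# Invented example data: three incidents lasting 30, 90 and 60 minutes.
incidents = [
    (datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 1, 2, 30)),
    (datetime(2024, 1, 3, 14, 0), datetime(2024, 1, 3, 15, 30)),
    (datetime(2024, 1, 7, 9, 0), datetime(2024, 1, 7, 10, 0)),
]

mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

A single long outage pulls the average up sharply, which is one reason teams often look at the spread of repair times as well as the mean.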

The company's CEO Jennifer Tejada said people used to think about incident management as that moment when everything goes wrong and the lights go off.

"It's not that way anymore. There's stuff going wrong all the time and there are signals that help you build a stronger and more robust service," she said.

The company was founded seven years ago by three graduates of the University of Waterloo in Ontario. Waterloo has an advanced work experience program, and through it all three gained experience working at Amazon, which is where the idea for the company took shape.

"What happens when you are an engineer or developer at Amazon and you first walk in is you get an initiation and part of that initiation is that they hand you a pager and say, 'You're on pager duty'," she said. In other words, a new recruit becomes the person that's going to get called in overnight to deal with systems issues -- and through that gain a lot of experience.

One of the issues they discovered, said Tejada, is that when there is a problem, it's often hard even to know who to call to fix it. And this was in areas of the business that had real-time requirements, where network availability was critical.

Through this experience, the company's founders gained insight into a problem that needed solving. They went to work on it and got some funding to start building an app that could automate the process of dealing with these sorts of urgent repair calls.

Their timing was good because they were seeing a shift in IT operations, she says. "In traditional IT ops you had people sitting in, night after night to watch flickering red lights and trying to understand where they had a problem, escalate it and point it at somebody who could fix it."

The shift was towards a world in which companies wanted people to have 'full service ownership': the developer built the service or the code and then became 100 percent responsible for it when it went into production.

That would mean no incessant arguments about responsibility. Instead, only one call would need to be made and one person would take responsibility.

In the PagerDuty world, the developer takes 100 percent responsibility for a service, which can then be assigned to a team that has 100 percent responsibility for keeping that service up and running.

"What we have done in our business is recognise that these days there are a lot of things that are happening very rapidly and that is bringing PagerDuty to the forefront of the operations that sit behind the digital business," says Tejada.

"One of those is that the number of releases that are happening in any one day is increasing. The number of days when events are not necessarily going wrong, but are not working as planned, where dependencies between services cause hiccoughs, are happening frequently."

The other thing that is changing, she says, is that the developer is increasingly at the forefront of the customer experience. "These days you have to really think about the way your digital business is built", if only because, as she puts it, "revenue reliance is shifting to digital assets".

That means that the developer is no longer just the enabler of the backbone but also "the architect and the designer of the experience".

She believes that IT is not used to dealing with people in that way. "Historically, the tools for the developer were kind of crappy -- lots of green screens, pretty basic scripts, a very poor user experience and not a lot of collaboration capability," she says. "That has changed dramatically."

PagerDuty is all about applying machine learning to the processes involved: not just automating the consolidation of alerts and the notifications that follow, but also automating the work of deciding which alerts matter the most.
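To make the idea of consolidating and ranking alerts concrete, here is a minimal sketch. It is a simple rule-based stand-in, not machine learning and not PagerDuty's method. The `Alert` fields, the severity weights and the `triage` function are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical weights: higher means more urgent.
SEVERITY_WEIGHT = {"critical": 100, "error": 10, "warning": 1}


@dataclass(frozen=True)
class Alert:
    service: str
    check: str
    severity: str


def triage(alerts):
    """Consolidate alerts by (service, check), then rank the groups.

    Groups are ordered by their worst severity first, then by how many
    alerts they contain. Returns tuples of (weight, count, service, check).
    """
    groups = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.check)].append(alert)

    ranked = []
    for (service, check), members in groups.items():
        worst = max(SEVERITY_WEIGHT[a.severity] for a in members)
        ranked.append((worst, len(members), service, check))

    ranked.sort(reverse=True)  # biggest problems first
    return ranked


# Invented example data.
alerts = [
    Alert("checkout", "latency", "warning"),
    Alert("checkout", "latency", "warning"),
    Alert("payments", "db-down", "critical"),
    Alert("search", "disk", "error"),
    Alert("checkout", "latency", "error"),
]

ranked = triage(alerts)
for weight, count, service, check in ranked:
    print(f"{service}/{check}: weight={weight}, alerts={count}")
```

Five raw alerts collapse into three groups, with the single critical database alert ranked above the noisier but less severe latency alerts.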

"It's a matter of letting them see the problems and helping to tell them whether they have a big problem or a small problem and showing them the problem at the core," she says.

"We launched a product last year called Operations Command Console that is really starting to enable customers to understand the signals that come through a technology problem that can have a broader impact."

One of the features of that is "helping people to understand the linkage, to understand the issues that come through the systems and developers and the business outcomes".
