How Telstra recovered when BlackBerry went pear-shaped

You may think your network and services are running fine, but users are the ultimate judges of service quality, as David Braue explains.

Service providers often don't know anything has gone wrong until they're hit by a flood of user complaints. Such was the case recently for Telstra, which had the good fortune to be trialling a new network monitoring technology when its Sydney BlackBerry wireless e-mail services came crashing down one Monday morning last September.

The outage, which came just as BlackBerry services were finally gaining the traction in Australia that they have elsewhere, was spotted just after midnight that Monday morning by Virtuo, a service monitoring framework from Washington-based firm Vallent.

A specialist firm providing network availability monitoring capabilities to nearly 200 carriers globally -- including Optus and 3 in Australia -- Vallent had worked with Telstra to set up a trial of the Virtuo platform and, specifically, the company's NetworkAssure performance management and ServiceAssure quality management tools.

The Java-based Virtuo application, which is built on an Oracle database and runs on Sun Microsystems' Solaris or HP-UX operating systems, monitors network performance and availability by aggregating and analysing data from sources across the environment. These include network performance metrics from network switches and wireless base stations; performance logs from application servers; primary data sources such as call detail records; and other relevant data sources.

By correlating these many types of data and watching for significant changes in key performance indicators such as latency, the Vallent technology can pick up aberrations in overall performance even when the service is not compromised acutely enough to trigger built-in alarms.
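The idea of catching a shift in a key performance indicator before any built-in alarm fires can be sketched in a few lines of Python. This is a minimal illustration only, not Vallent's method: the function name, the window size, the threshold, the hard alarm limit and the latency figures are all assumptions made up for the example.

```python
from statistics import mean, stdev

# Hypothetical hard alarm limit of the kind built into network equipment.
HARD_ALARM_MS = 1000


def find_deviations(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the preceding window.

    Each sample is compared with the mean and standard deviation of the
    `window` samples before it, so the baseline follows normal drift.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all is a deviation.
            if samples[i] != mu:
                flagged.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Illustrative latency readings in milliseconds (invented data).
latency_ms = [120, 118, 123, 121, 119, 122, 120, 310, 121, 119]
flagged = find_deviations(latency_ms)
```

In this invented series the reading of 310 ms is flagged even though it sits well below the hypothetical 1000 ms alarm limit, which is the kind of aberration a fixed threshold would miss.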

"Just because the network is performing does not mean the quality is anywhere close to acceptable," says Lowell Anderson, Vallent's vice president of strategic marketing. "Visibility is one of the issues for operators, so that operators can understand why customers are experiencing degradation, and prioritise their operations to address that. Integration and data mediation provide the ability to drill down through the model and determine which elements of the network are causing the problem."

So it was, then, that the Virtuo technology noticed early one morning that something was seriously wrong with Telstra's BlackBerry service.

Monitoring of back-end systems and network devices confirmed that the infrastructure was running smoothly, but a range of red flags on the monitoring system showed that every message in the outgoing message queue had been delayed. Further analysis confirmed what many Sydney-area customers were soon to find out: the service wasn't available at all.

The culprit, as Telstra soon found, was simply that one of the carrier's back-end BlackBerry server licences had expired at midnight -- bringing the entire BlackBerry service offline. Even as Sydney customers rang Telstra after finding their BlackBerry service wasn't responding, the company was already ordering a new licence, and the system was back up and running later that day.

Monitoring the user's experience of a service has always been a problem, and is likely to become even more so as 3G, videoconferencing, mobile TV, and other wireless voice and data services increase both network complexity and user expectations of those networks.

Rapid identification of performance issues will therefore become critical for mobile carriers, which so far form the core market for Vallent's technology. A growing track record in the telco space, however, could also drive the company in new directions with the use of its BusinessAssure tool.

BusinessAssure, also a component of Virtuo, ties the results of NetworkAssure and ServiceAssure analysis to real-world business processes, which are defined in terms of the systems upon which they rely. This correlation provides invaluable guidance as to which user communities are likely to be affected by an outage -- putting the IT team on the front foot rather than forcing it to wait until something goes wrong.
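The mapping from business processes to the systems they rely on can be pictured as a simple lookup. The Python sketch below is an illustration of that idea only and says nothing about how BusinessAssure is actually built; every process name, system name and function here is hypothetical.

```python
# Hypothetical business processes, each defined by the systems it relies on.
PROCESS_DEPENDENCIES = {
    "mobile email": {"blackberry-server", "sydney-wireless-core"},
    "voice calls": {"sydney-wireless-core"},
    "billing": {"billing-db"},
}


def affected_processes(degraded_systems, dependencies):
    """Return, in sorted order, the processes that rely on any degraded system."""
    degraded = set(degraded_systems)
    return sorted(
        name
        for name, systems in dependencies.items()
        if systems & degraded  # non-empty intersection: the process is at risk
    )


# Example: only the (hypothetical) BlackBerry server is in trouble.
at_risk = affected_processes({"blackberry-server"}, PROCESS_DEPENDENCIES)
```

With a table like this, a fault on one back-end server can be translated straight into the list of services, and so the user communities, that are exposed to it.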

"We would like to get this [feedback] loop to the point where it's near real time," says Anderson. "That way, when customers are experiencing service degradation, operators can understand which customers are impacted and can prioritise operations based on levels of impact. This loop exists today, but many manual processes cause time delays where data has to be collected. Automating it allows customers to spot trends and manage key quality indicators long before they become a problem."
