Telstra has provided a more detailed update on its outage of National Broadband Network (NBN) and ADSL services last Friday, saying it was caused by modems misbehaving after a software failure on a domain name system (DNS) server. The telecommunications provider is now resorting to sending out new modems to the thousands of customers still affected.
The DNS failure over a week ago caused tens of thousands of modems to continually reboot, Telstra COO Kate McKenzie explained during a call with media on Friday morning.
"A software update to one of our domain name servers caused that server to go down. It had a flow-on effect to our customers' modems, and they couldn't undertake the regular check they normally do in that environment," McKenzie said.
"We actually fixed the software bug overnight on Thursday night, and on Friday morning, we had thought that the problem was resolved ... we didn't anticipate the flow-on effect; a number of the modems sitting on the end of the network didn't behave in the way that they were supposed to."
Since not all customers were able to get back online after the telco had repaired the issue, it was forced to address individual customer complaints.
"We have had to go through pretty much modem by modem and domain by domain and down to customer by customer to figure out what the issues were and get them back up and running," McKenzie said.
"We're very confident that we've got the vast majority of that resolved, although there are probably still a few residual issues for a very, very small number of customers, and we are working through customer by customer."
Specifically, around 10 percent of customers, or 370,000, were affected overnight on Thursday; by Friday morning, only 1 percent of customers, or 33,000, were impacted; and the "residual" number still affected is around 0.5 percent, or 15,000, although McKenzie noted that some of those customers could just be having "completely unrelated" issues.
Those who are still affected are being sent free modems by Telstra.
"Where we've gone through the process of ordering a factory reset of their modems and the modems are still not working, in the last couple of days we have started to send out free modems to any customer who is not back up and running," McKenzie said.
Mike Wright, group managing director of Networks, further explained that the modems' unexpected behaviour was to blame for the outage's continuing consequences, adding that Telstra undertakes hundreds of updates every night of the week without issue.
"This particular change really just caused a short outage in a DNS and should have had no consequential impact, but what we discovered is we've got a heartbeat signal in our gateways -- designed in fact to improve service, so we know the gateways are online -- and during the duration of that short outage of that DNS, what modems did when they didn't see the heartbeat [is] they went through a reset," Wright said.
"That uncovered an unknown behaviour in the modems, which we spent the night recovering from with the network, but the next day, what we found was a few of the modems had a residual problem in software that caused them to continually reboot, and that's really been what we've been managing and recovering from for the most of this time.
"It's really the behaviour of these modems that didn't behave in a way expected after that loss of heartbeat."
Wright argued that the NBN, as well as the increased usage of high-bandwidth services, is making networks more complex environments, with outages to be expected as it is rolled out and switched on.
"What you're starting to see is the complexity emerging in these environments as we go particularly to NBN; these gateways are going from a simple box in the corner to something that's increasingly carrying complex voice calls, HD video calls, HD video streaming, so we're building a more and more sophisticated environment, and indeed that heartbeat was designed to help us manage that more," he said.
To prevent the same issue from recurring, Wright said Telstra is working with its software partners and vendors to test a fix and then update the modem software to protect it from that particular bug.
"We are having individual conversations with individual customers who have been impacted, and we will talk to them about an appropriate apology," the COO said.
"Some customers received a AU$25 credit off their bill where we thought that was a commensurate response to their level of inconvenience."
Previously, Telstra had said simply that its NBN and ADSL outage was caused by a "complex" network management device fault.
"The issue we identified is extremely complex, but in simple terms, there was a fault with the device that manages the interaction between our network and all of the different types of customer modems," a spokesperson said.
The incident on Friday affected customers across New South Wales, Queensland, Western Australia, the Northern Territory, and South Australia, with the telecommunications provider saying that a "significant restoration" of services had occurred by 11am AEST.
By Tuesday, the effects of the outage were still being felt, with some customers complaining on social media that they still could not connect. Telstra said it was directly contacting customers still affected by the outage, with users being advised to reset their modems.
Wright also provided an update on Sunday's mobile data services outage, saying it was due to a hardware failure in one of the telco's gateways in NSW.
"Every state has two gateways that lets our customers contact the internet," Wright explained.
"It was a hardware failure in a card, and while it switched over, what the devices often have to do is be put back into either flight mode or turned off and on so they reconnect to the other gateways, so Sunday morning that was basically a fail over a bit of hardware."
Sunday's outage was an intermittent issue affecting 3G services and some 4G services mainly in NSW, although several customers across Melbourne, Brisbane, and Perth were also affected. All services were restored within two hours.
Telstra has had a rough start to 2016, with customers subjected to three other outages over a period of six weeks: The first on February 22, which affected prepaid and post-paid mobile services and was caused by "embarrassing human error"; the second on March 17, which involved an hours-long national mobile data and voice outage; and the third on March 22, which was a smaller voice outage.
Those first three outages led Telstra to commit to spending an additional AU$50 million on improving the monitoring and recovery times of its network.