In every telecom organization, the Network Operations department interacts with both internal and external clients. I was working in a team consisting of three engineers who rotated the standby telephone every week. It was my week on standby. I was managing two Ericsson engineers who were supporting our systems on behalf of the vendor. They had arrived at the company’s premises early and we discussed the tasks of the day. They had scheduled a couple of minor configuration changes, which were prerequisites for a Saturday night activity that had been planned beforehand. I gave them some time to get a cup of coffee and make themselves comfortable before getting to work. They started working at 10 am and I monitored my tools for any alerts or critical errors. I scanned the system logs to ensure that everything was operational. At 1 pm, a colleague from the IT department called and asked about some alarms he had observed regarding the communication between the systems we were both supporting. I said I would do a quick search and a health check and call him back with feedback (Figure 2).
Searching for evidence, I realized that of the 15 connectivity flows with Information Systems, one had a communication error. I concluded that my system was not at fault: the software and the operating system were clear, and the interface was connected to the rest of the flows. Our system, called EMA, showed 14 green connections and one red. The faulty element was the Account Finder (AF) server, which handles queries about subscribers’ profiles regarding credits and debits on their prepaid accounts. It is connected to EMA, which controls the prepaid traffic for each customer.
When I called my colleague back, he said, “I have contacted my vendor support engineer and triggered a customer service request with high priority. I do not have any evidence of what could be wrong. That’s why I raised an emergency!”
I said, “Ok George, keep me updated. From all I can see, EMA is flawless”, and after the conversation I returned to my routine. I did not halt the Ericsson engineers; I kept helping them close the maintenance window on EMA as soon as possible. I sipped my coffee and glanced out of the window. “This is not a good sign for a Friday”, I was thinking. “Such problems have kept me in the office until late hours”.
About three hours later, at 4 pm, my manager called me to ask for clarifications. “What is going on with the prepaid provisioning? What’s this all about? I received an email saying that EMA faces interface errors and the Network Operations department is working on it. Is this true?”
I said, “No, not at all”.
“Are they expecting anything from us? A troubleshoot, an action, anything?” he continued.
“No sir, the problem is on the AF. I have checked the alerts, though the description is generic… I also talked to George, who informed me that he has raised an emergency call to Ericsson support. They will handle the case.” I felt my mouth go dry, but I kept talking. “Their investigation is ongoing and I’m waiting for an update. In the meantime, I am supervising a maintenance task along with Efthimia and Harry, about…”
“Excuse me?” he asked.
“I am in contact with Ericsson to finish the job and be ready for tomorrow night”, I replied.
“Are you kidding? Why on earth haven’t you told me that we are in the middle of a configuration change in the first place?”
“Craig, come on…Why should I?”
“Because I talked with the IS director and I didn’t mention that, that’s why. Now how can we prove…?”
“Craig, wait, wait. We started in the morning, the AF errors appeared at noon. How could it…?”
I hadn’t finished speaking when Craig said, “The director is calling me, Athan. I have to hang up”.
[…]
The Network director, whose name was Manousso, called to clarify the issue. He asked more or less the same questions. There were customer complaints at the stores regarding wrong billing for prepaid value-added services. The IT announcement had put the blame on my team and we had to find a solution. We triggered an emergency service request to Ericsson support and ordered an investigation. At the same time, we began to roll back the configuration changes that had taken place, because we had to prove that the problem had a different root cause. We finished the rollback procedure at 8 pm. “Now we are clear”, I was thinking. But (what a surprise!) the problem remained. The alert was still active. We were stuck in the middle of a grey area where nobody could pinpoint the root cause of the problem.
I decided to take action and called my boss to explain that any further investigation on our side would be in vain. He answered with a raised voice, saying that I should have replied to the first IT email and clarified that the Network department was not responsible for the alert.
“Craig, don’t yell at me! You’re getting angry and you’re not listening to what I’m saying!” He switched to a calmer tone and we talked about the problem for a few minutes.
We stayed involved as a team until one o’clock in the morning, as did Efthimia and Harry, who were handling the emergency on behalf of Ericsson. I felt that my communication with my supervisors was weak, because their main concern was to show IT that they were playing nice. But worst of all, I felt that management could not advocate on my behalf.
The next morning, after all this mess, IT decided that they had to do something as well; they could not stay idle. Of course, upper management knew that IT were not to blame, since their own announcement said so, but they had to perform an investigation too. So they decided to put the redundant infrastructure into service. They switched production traffic to the standby AF and the problem was resolved. It took twenty-four hours of troubleshooting and cost the company some hundreds of thousands of euros.