Laugh or cry?
I had a “routine” production implementation yesterday, the only notable feature being that the deployment itself would run for an estimated 4-5 hours, making it a full day on-site instead of a half-day. If only…
First off, there’s this other financial system involved. Their backups take 10-12 hours (!) and their restores an anticipated 2-3 days (!!). This was known going into the release and flagged as a risk, but it’s been this way for over a year and nobody’s cared to improve it. Well, yesterday their backup ran slow: 2x or 3x slower than usual.
This threw management into a spin, naturally. Although our implementation’s risk was quite low, they couldn’t proceed without a solid backup of this other system, so we went on hold while they decided what to do. And decided. And decided. We were on hold for 5+ hours before the decision was made *to proceed* regardless. Our deployment’s runtime then blew out by 3+ hours thanks to competition from the still-running (ultimately cancelled) backup – I hate SANs – making us 8++ hours late. It made for a very long day.
The business users weren’t happy either: it was 9:30pm before the BVT could start, except they still couldn’t start because of a connection problem. Naturally the techs were offshore, the first choice was on leave, and … it took a while. Some BVT took place last night, wrapping up well after midnight; the rest happened this morning.
Why this morning? Most of the original BVT resources are based in Manila. What nobody realised until yesterday was that (i) they’re not able to work past 7pm local time, and (ii) they’re not able to work on Sundays. That caused a problem; the gap was eventually filled with local resources, …
Points arising:
1) This production financial system, which “normally” takes 12 hours to back up and would “normally” take 2-3 days to restore… WTF? Even if that’s “acceptable”, it should hardly be a surprise that the day would come when it ran even slower. All kinds of little itty-bitty things – tightening up password policy and the like – get chased up with the highest priority because they’re best practice (and pending audit items), yet something this big stays off the auditors’ radar and is allowed to persist until it inevitably causes a problem.
2) 5+ hours, really? How many managers in how many layers were involved in that decision? There were two choices: go ahead (without significant delay), or call it off and reschedule. Procrastination produced an inferior third option: a huge delay, then going ahead anyway. For all the talk of Inspired Leadership, there was none in evidence yesterday; it was management by consensus/committee.
3) Why was the backup 2x or 3x slower than usual? If they couldn’t get an answer while it was running, it’s going to be so much harder to find one now. Part of the problem was that all the tech resources were finger-pointing (not my problem, ask them), with in-fighting over how serious it was and repeated downgrading/upgrading of the severity. Process-oriented bullshit, with nobody focusing on results.
4) The offshoring of business users puts hard constraints on their availability. Leaving aside that nobody at the operational level knew of these constraints until yesterday, it raises the question of whether this is acceptable at all. Suppose there’s a disaster and we need full-on business participation in the recovery … but oh dear, it’s after 7pm, it’s a public holiday tomorrow, and our business users don’t have a building they can work from…