Follow the process…

Application has problems. Batch window (yes, it’s old school) projected to breach SLA. Local SME (me) sounds the klaxons, points the finger at the hardware. Yes, the application is crap, but everything was fine until we moved data centres… The data centre guys, of course, just point right back at how crap the application is.

Management takes an interest. Understandably, they hire an acronym company to get a second opinion. In preliminary discussions all stakeholders agree that, while they’ll look at other stuff, the focus is the hardware, and figuring out the i/o bottleneck is the criterion for success or failure.

The acronym comes back with a mammoth report. Management is pleased. SME is not. There are two throwaway lines about the hardware and the i/o; the rest is all about how crap the application is, plus some boilerplate cut and pasted from a manual or the web. On close examination the report is mainly vague and non-specific, and when it does drill down it’s not too hot on detail.

Management runs off with the report. Next thing, another external company is charged with raising a business case so we can get investment to implement the “recommendations” in the report.

I think I’m getting a Cassandra complex.

There’s a reason why they don’t call it redundancy any more

People actually being fired for underperforming is quite rare in the corporate world. Mostly, in my experience, people are retrenched for reasons having nothing to do with their individual performance or circumstances – in mass serial waves, year after year – either as standalone budget issues, or as part of a corporate plan to offshore its resources.
(In these retrenchment waves, it is noticeable how few, if any, of managerial rank are retrenched, while the ranks below are gutted.)
Any national laws meant to protect people from being retrenched unnecessarily or unfairly are treated with contempt by any company big enough to have lawyers on payroll, and retrenched individuals (who are supposedly redundant, i.e. performing a role that is no longer necessary) are almost always called upon in their final days/weeks/hours to hand over to, or “train”, their (cheaper, younger, offshore) replacements.

At the end of the day, it’s metric-driven.
The executives and managers are rewarded based on what’s measured. Employee costs are measured, to the last cent. Employee productivity and outputs are not adequately measured. For a while, this makes the exercise look successful on paper: costs are down, and no one really has a clue what it’s done to productivity (which is fortunate for those responsible, as productivity has dropped by a bigger ratio than the costs).
The bubble bursts eventually, but it lasts long enough for all involved to be handsomely rewarded before moving on to another company.

Why bureaucracies fail

You have this bureaucracy, doesn’t matter but let’s say it’s a bank.
It’s making billions of dollars in profits, year in, year out.

Yet every year there are cuts. Cuts everywhere. “More, with less.” Retrenchments. Outsourcing, offshoring.

Why? What drives this? The business is not in trouble. But sooner or later those cuts will get it in trouble. Key-man risks emerge. Knowledge is lost. Systems fail. Processes are forgotten.

Take offshoring. It doesn’t save money.
Trade one experienced onshore resource for one experienced offshore resource, and you save money.
Trade one experienced onshore resource for one inexperienced offshore resource, and you may save money, but you get less (or even nothing) done.
Scale up, and trade a group of experienced onshore resources for a group of largely inexperienced offshore resources: you save money, but less gets done. So you get some more (offshore) resources.
My perception is that by the time you’re paying ~90% of what you used to pay, you’re achieving ~20% of what you used to. One in four of the resources are okay, and value for money. The rest are “ballast”. And they all cost the same.
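
To put numbers on that perception – and the ~90% / ~20% figures are only my rough estimates, not anything measured – the arithmetic works out something like this:

```python
# Rough arithmetic on the estimates above (~90% of the old spend buying
# ~20% of the old output). Nothing here is measured data.
cost_ratio = 0.90      # spend relative to the original onshore team
output_ratio = 0.20    # output relative to the original onshore team

cost_per_unit_of_output = cost_ratio / output_ratio
print(f"Cost per unit of output: {cost_per_unit_of_output:.1f}x what it was")
# -> Cost per unit of output: 4.5x what it was
```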

But I digress.

What happens is this. Your CEO says to his minions, “cut your budgets by 10%”. This will earn him brownie points with the board. Perhaps his bonus depends on it. He makes sure his minions’ bonuses depend on it. The CEO’s minions pass the pain down the line. Line managers have to cut their budgets by 10%, no matter what, but still do the same job as before.

Of course they can’t. Not after the last few cuts eliminated any fat that may have been in the system. So what they do is produce less. A 10% budget cut is virtually guaranteed to produce a >10% drop in productivity (given there’s no fat or waste left to cut).

That doesn’t matter; the key things are that (1) the budget cut is easily measured, and there’s no way round it, and (2) the output/productivity is often less easily measured, and there are many ways to juke the stats so as to make the output/productivity seem not as bad as it really is.

So at all levels from the executive down to the line management, everybody plays the same game:-
1) Cut the expenditure, or lose your job/promotion/bonus prospects.
2) Juke the stats to make it look (in the short term) like cutting the expenditure was no big deal.

Cut by cut, productivity falls – far faster than the budget has been cut, and faster with every subsequent cut.
This may indirectly put pressure on the budget next time round, leading to more cuts in a never ending vicious spiral.
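
Purely as an illustration of that spiral (the 10% cut per round is from above; the 15% productivity hit per round is an invented number, not a measurement), the compounding looks something like this:

```python
# Illustrative only: a 10% budget cut each round, with productivity falling
# faster than the budget (15% per round here, picked arbitrarily).
budget, productivity = 1.0, 1.0
for year in range(1, 6):
    budget *= 0.90          # the mandated 10% cut
    productivity *= 0.85    # the (unmeasured) larger hit to output
    print(f"Year {year}: budget {budget:.0%} of original, "
          f"output {productivity:.0%} of original, "
          f"cost per unit of output {budget / productivity:.2f}x")
# After 5 rounds: ~59% of the budget is buying ~44% of the output,
# so each unit of output costs ~1.33x what it did.
```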

SNAFU

Just a routine troubleshooting incident. A policy “disappeared” before a user’s eyes.

It didn’t really disappear, but the status was screwed up.

The user clicked a button, which was supposed to perform a two part operation: invalidate the old record, and add a new valid record. This is supposed to be in a transaction.

What actually happened was this: the transaction handler is somehow screwed, and the user clicked the button twice in succession without realising it. The first operation succeeded, then repeated itself: the freshly added record was invalidated, but the insertion of another valid record failed. With no transaction to roll back, and the error handling no help, we ended up with no valid records at all.

This is hardly unique to this patch of code. Most programmers wouldn’t think to program defensively against a double click. Nobody has the time to worry about things like that anyway, all the coding is seat of the pants. And the transaction handler has been broken for >5 years, but remains unfixed despite being a known source of problems.
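
For what it’s worth, here’s a minimal sketch of what the button handler should have been doing – both steps in one transaction, with a guard so a stale double click becomes a no-op instead of wiping out the only valid record. This isn’t the actual application code; the table and column names are invented for illustration.

```python
# A minimal sketch, not the actual application code: table and column names
# are invented. Both steps run in one transaction, and the UPDATE only touches
# the exact record the user's screen was showing, so a double click (or a
# concurrent edit) finds nothing to invalidate and becomes a harmless no-op.
import sqlite3

def replace_policy(conn: sqlite3.Connection, old_record_id: int,
                   policy_id: int, details: str) -> bool:
    with conn:  # sqlite3: commits on success, rolls back on any exception
        cur = conn.execute(
            "UPDATE policy SET status = 'INVALID' "
            "WHERE record_id = ? AND status = 'VALID'",
            (old_record_id,),
        )
        if cur.rowcount != 1:
            # The record on the user's screen is no longer the valid one:
            # a repeated click or a stale form. Do nothing, rather than
            # invalidate the replacement that was just created.
            return False
        conn.execute(
            "INSERT INTO policy (policy_id, status, details) "
            "VALUES (?, 'VALID', ?)",
            (policy_id, details),
        )
        return True
```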

In the good old days,
* management would have seen the transaction handler problem as a MAJOR issue, and fixed it
* lead developers would know about guarding against double-clicks (etc) and set standards accordingly
* mid-rank developers would be peer reviewing each others’ work looking for this kind of thing
* junior developers would be spoon fed baby work until they’d proven they weren’t menaces

But now,
* management take no interest in the application beyond high-level resourcing and schedule issues
* there are no lead developers, or at least none doing that kind of work
* mid-rank developers do their own thing, and no one mentors them or checks their work
* junior developers are just thrown in the deep end, without supervision

It’s not just this team, this application; if anything, it’s above average for the neighbourhood.

What happened in the industry, that this is normal, and only a few dinosaurs appreciate that this is a problem?

Do The Math

I recently overheard how much offshoring really costs.

We have 30 testers: 3 onshore (quite good ones), 27 offshore (not sure if all of them actually exist, and it seems no more than a handful are productive).

We pay a blended rate to a 3rd-party outsourcer which works out to an amount per tester that happens to be a middling tester’s salary in Sydney, according to industry surveys.

So the saving amounts to eliminating superannuation, insurance and other overheads. 15%? Still, substantial enough to be worth it.

However, offshore, you’ll be lucky if 1 out of 3 are good hires. And you’ll never know which is which while they stay offshore: you don’t know exactly who’s doing what, and all you see is a blended output that averages out good hires, marginal hires, and waste-of-space hires in roughly equal proportions.
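
A quick back-of-envelope on those figures – the headcounts and the “one local salary each” blended rate are from above, the salary number itself is just a placeholder, and the 1-in-3 ratio is my own guess:

```python
# Back-of-envelope using the figures above. The $100k salary is a placeholder;
# the "1 in 3 offshore hires are good" ratio is the author's estimate.
salary = 100_000                      # illustrative Sydney tester salary
onshore, offshore = 3, 27
total_spend = (onshore + offshore) * salary    # blended rate ~= one salary per head

productive = onshore + offshore / 3   # good locals + ~1 in 3 good offshore hires
print(f"Total spend:                 ${total_spend:,.0f}")
print(f"Productive testers:          {productive:.0f}")
print(f"Spend per productive tester: ${total_spend / productive:,.0f}")
# -> roughly $250,000 per productive tester, i.e. ~2.5 local salaries each,
#    and that's before crediting the ~15% overhead saving.
```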

Seems there’s a good argument to be made for offshoring when done properly. Police the hires, make sure you only keep the good hires, and cycle through the rest as fast as possible until you end up with good hires only – and enjoy your 15%-20% saving over an equivalent local hire.

Problem is, the price will go up if the outsourcer cottons on that you only want the better hires. It’s only cheap because your blended rate cross-subsidises currently unproductive hires – training them up for the future benefit of the outsourcer.

No one with a business brain would enter into this kind of arrangement. However, this is a bureaucracy first and a business second.

Ah, Consultants

We had some consultants in recently. I don’t know how much they cost, but I’m guessing between $50k and $100k was spent on them.

They were supposed to discover why our backups were taking a variable amount of time, and were generally slow at that – a SAN issue was suspected.

What they didn’t do is make headway on the one actual problem we didn’t know how to solve ourselves.
Instead what they actually did was flag all the other known problems our application has (by interviewing the local SMEs), and put them in a report suitable for digestion by senior managers. (You know, lots of pictures, words of one syllable, primary colours.)

On the surface, a waste of time. They told us nothing we didn’t know already.

But on the other hand, it was essential. It depends on your point of view.

It’s clear we have problems with our hardware, even if we don’t know exactly what, and one possible solution is to buy better hardware. Our managers, however, can’t go and spend on the hardware based on the word of the local SMEs alone. Their solution is to engage some consultants to tell them what the problems are (even if that’s just documenting what the local SMEs are saying). Then they’ll have the justification to buy the hardware.

The consultants are happy, they’ve been paid big bucks for doing bugger all. The SMEs are happy, the problems they’ve been bitching about for years might actually get some attention. The business are happy, they might actually see some results now. The managers are happy, they can now spend the problem away without fear of recriminations downstream, because in the worst case they can always blame the consultants for getting it wrong.

Everybody’s happy here in paradise.

Situation Normal.

Update:
The final report came out recently, and it was worse than expected. They didn’t crack the SAN problem; in fact they said it was all hunky dory and working as expected – while simultaneously recommending that we move to a dedicated storage array. But those little nuggets could easily be missed: a couple of lines buried in a long laundry list of things that could be done to improve the application, only some of which were accurate; some items were miscommunicated; many were just plain wrong.

Maybe Weird Al can redo “Money For Nothing” with IT consultants instead of rock stars…

How hard can it be to raise an invoice?

When starting/renewing a contract at a certain organisation, at the best of times it takes up to 2 weeks for a work order to be raised and approved.   Mind you, the decision to hire/renew has already been taken, so it’s not entirely clear why any “approval” is involved.  There’s no prospect of the work order NOT being approved, because the person is already busy at work.

So anyway, up to 2 weeks is normal.  After my renewal work order ran over the 2 weeks by a couple of days, I escalated.   Turns out the software had broken because of a data integrity issue, so my work order was stuck in limbo.  Eventually that was sorted.   But it was a struggle.

Then the next problem: no sooner is my work order ready to use than I find I’ve lost my access. Also a data integrity issue. By now, I’ve missed no less than 4 pay periods and I can see the next one ticking away.

At first sight, it seems no one wants to help or is able to help.   But it’s not true.   Help is at hand, it’s just in slow motion.   You log a call.  It sits in a queue.  You chase it, it gets assigned to someone.  You chase it, that someone might look at it, and pass it on to the next queue.   Rinse and repeat, over and over again, with at least a day between each 5 minutes of activity.   There are cycles: acknowledge/assign/diagnose/act.   Each step in the cycle could take a day.  The final action might be no more than pass on to another queue.  Or in some cases, get approval to ask someone else to raise another call to assign to another queue…

And meanwhile, I haven’t been paid for 4-5 weeks.   If I owed a bank a similar amount of money, and was being tardy and using excuses like “my software has a glitch” to avoid payment, you can bet they’d be adding penalties and interest and before long threatening lawyers and collections agencies.

The fall of standards / the rise of the executive

When I was a lad, entering the workforce with a newly minted Oxbridge degree in Mathematics, alongside a peer group of graduates most of whom had just completed 3 years of a computer science degree, I spent the first 6 weeks of my first job on an intensive series of training courses. Then I was spoon-fed little “black box” programs by my new team, little harmless projects to keep me out of harm’s way. Certainly I was let nowhere near production code.

These days, standards are different.    You don’t need an Oxbridge degree, nor a computer science degree, nor frankly any kind of academic qualification.   You don’t need a 6 week training course before starting work, nothing longer than an informal hand-over and an hour with the manuals.   And you don’t get “training wheels” projects to ease you in… from day one, you’re writing code without supervision that is destined for production.   Sure, it’ll be tested, but when the testers have even less arduous qualifications than the developers, that’s hardly a comfort.

Where are the executive and upper management in all of this? They’re travelling all over the region on junkets, having lots of photo opportunities, telling their own success stories (tall tales, to be sure) and blogging their philosophy. Their philosophy turns out to be a lot of abstract nouns they’ve picked up from a thesaurus, padding out some central theme like “We’ve got to make more money” or “We’ve got to be more efficient” or “We’ve got to reduce costs” or “We’ve all got to work together as one team”. Sheer genius! No wonder they get paid the big bucks.

Offshore Replacements

What’s the *minimum* experience you think you’d want a DBA to have in the relevant database to support a production system, say for a major financial institution?

5 years?

2 years?

How about ~2 months’ experience. Part-time. When I say part-time, I mean 10% FTE at best. So maybe the equivalent of 1 week’s full-on experience, at best.

That’s what my “replacements” will have when they “officially” go on roster to “remove” me as a key-man risk.   Unofficially, I expect to be around for a while…  until management play musical chairs again, and some new guy believes the BS that the “replacements” have it covered and there’s no need to keep on the “previous” DBA.

Sensible, responsible policy?

Organising a pi**-up in a brewery…

Laugh or cry?:

I had a “routine” production implementation yesterday, the only notable feature being that the deployment itself would run for an estimated 4-5 hours, making it a full day on-site instead of a half-day.  If only…

First off, there’s this other financial system involved. Their backups take 10-12 hours(!) and their restores would take an anticipated 2-3 days(!!). This was known going into the release and flagged as a risk, but it’s been this way for over a year and nobody’s cared to improve on it. Well, yesterday their backup ran slowly. Yes, x2 or x3 slower than usual.

This threw management into a spin, naturally.   Although our implementation’s risk was quite low, they couldn’t proceed without a solid backup of this other system, so we went on hold while they decided what to do.   And decided.   And decided.   We were on hold for 5+ hours before the decision was made *to proceed* regardless.   Our deployment’s runtime was blown out by 3+ hours because of competition from this ongoing (ultimately cancelled) backup – I hate SANs – making us 8++ hours late, so it made for a very long day.

The business users weren’t happy either: it was 9:30pm before the BVT could start, except they still couldn’t start because of a connection problem. Naturally the techs were offshore, and the first choice was on leave, and … it took a while. Some BVT took place last night, wrapping up well after midnight, and the rest this morning.

Why this morning? Most of the original BVT resources are based in Manila. What nobody realised until yesterday was that (i) they’re not able to work past 7pm local time and (ii) they’re not able to work on Sundays. That caused a problem; the gap was eventually filled with local resources, …

Points arising:

1) This production financial system, that “normally” takes 12 hours to back up and would “normally” take 2-3 days to restore… WTF? Even if that’s “acceptable”, it should hardly be a surprise that the day would come when it ran even slower. All kinds of little itty-bitty things are chased up with the highest priority, tightening up password policy and the like, because it’s best practice (and because it’s a pending audit item), but something this big is off-radar to the auditors and allowed to persist until it inevitably causes a problem.

2) 5+ hours, really? How many managers in how many layers were involved in that decision? There were two choices: go ahead (without significant delay), or call it off and reschedule. Procrastination led to an inferior third option: a huge delay, but go ahead anyway. For all the talk of Inspired Leadership, there was none in evidence yesterday; it was management by consensus/committee.

3) Why was the backup x2 or x3 slower than usual? If they couldn’t get an answer while it was running, it’s going to be so much harder now. Part of the problem was that all the tech resources were finger-pointing (not my problem, ask them); there was in-fighting over how serious it was, and repeated downgrading/upgrading of the severity. Process-oriented bullshit, with nobody focusing on results.

4) The offshoring of business users puts hard constraints on their availability. Leaving aside that nobody at the operational level knew of these constraints until yesterday, it raises the question of whether this is acceptable at all. Suppose there’s a disaster and we need full-on business participation in the recovery… but oh dear, it’s after 7pm, it’s a public holiday tomorrow, and our business users don’t have a building they can work from…
