This is the 36th excerpt from the second book in the Defen series: BIT: Business Information Technology: Foundations, Infrastructure, and Culture
The Windows data center case study
(Long and sad, but a personal favorite - the Happy Valley Tax Authority, from 2002)
When Unix is a four letter word: the Happy Valley Tax Authority
The "I" in this scenario is that of a hapless systems consultant who didn't do his homework before setting off to meet the client.
The Happy Valley Tax Authority, its staff, and its mandates are fabrications, but the situation presented and the remedies offered reflect the author's recent experience with real-world clients facing similar problems: the conditions, decisions, and outcomes described are broadly based on real events.
The Happy Valley Tax Authority was set up as a regional co-operative to administer tax programs for local governments along an eighty-mile stretch of highway. At the time of incorporation none of the players would agree to use the largest municipality's name for the joint effort and so the tax co-operative was named for a local tourist attraction: the Happy Valley Ranch. Although now a federally funded national heritage site, the ranch house had been built in the 1890s as a second-generation cattle baron's imitation of an English country manor, acquired a rather different cultural status during Prohibition, and been razed to the ground in an uprising of the local moral majority in 1957.
In the twelve years since start-up, the tax authority has acquired duties that go beyond simple property tax assessment and collection. One town has a hotel room tax, another provides school tax credits for couples with two or more children, while a third has an industrial land development program with both rebate and tax relief schemes to attract tenants. Today the authority collects 32 different levies from about 45,000 taxpayers, administers eight rebate, direct support, or tax relief programs, and collects tolls on one road bridge and two park entrances.
About eight months ago a town councilman with strong connections to the firm I usually work with got his council to hire the firm to do an operational audit of the authority's effectiveness and assess what value the town was getting for its continued support of the authority's mandate. That report was duly produced with the usual platitudinous result and recommendations for minor change.
The councilman had lunch sometime last week with the managing partner, and now Barb Rush and I have been dispatched with specific orders to "review and evaluate the use of information systems technology with a view to making recommendations to improving the efficiency and effectiveness of operations." She works in the insolvency practice, but apparently has audit experience with the client so we're to spend a day there and a day working out our report.
The tone for the day gets set at the 8:30 AM meeting. Not only am I the only male not strapped into a suit and tie, but Barb switches from normal to distant as soon as we enter the building. In the boardroom to which we're conducted to wait, the informal chatter preceding the executive director's arrival amounts to a secret handshake among initiates to a cult - I'm not good at this but Barb and the systems director deftly negotiate the protocol by talking about upgrading their home systems from ME to Windows 2000 -Professional- while others listen and nod approvingly.
Their behavior reminds me uncomfortably of ants rubbing antennae together to establish their mutual allegiance to the same queen. According to their words, neither one can get the thing to work, but what they're actually saying to each other is "I'm no threat." There's nothing I can contribute to this, but when eventually asked for an opinion there's a sudden hush and I get marked as an outsider when I say I don't use Windows and have no idea what the problems might be.
When the boss arrives a precise eleven minutes late, Barb smoothly slides the knife in as she introduces me: "Paul," she tells the group, "is a Unix/Mac expert who sometimes works with us."
I'd have been better off, I think, if she'd introduced me as a child pornographer out on day parole. Their board has told them to answer questions, including mine, but everyone else at the meeting, including Barb, makes it clear that they constitute a group - to whose membership I need not aspire.
The director, who attaches himself to us for the day, tells us that his systems department has a staff of nine including himself but that the systems budget is both too complex and too sensitive to share. Barb promptly agrees we don't need it, leaving me floundering - and unwilling to ask for the service level agreement because I have no idea what she'll say.
The authority employs a total of 68 people of whom 61 have PC desktops. All of these have Windows NT 4.0 Workstation with SP1 on 550MHz P3 chips, 15" screens and 64MB. The NT Servers are in four rackmounts and there's one real surprise: an HP K220 that turns out to have 24GB of Oracle 7.31 dataspace on two refrigerator-sized external disk packs that each have 32 650MB drives.
Given the patent hostility in the place, finding a K220 is a bit like unexpectedly running into an old friend. I tell my two tour guides what great machines these were, but there isn't even a pretense of interest.
The primary property tax application had been custom written for one of the municipalities involved, originally for a McDonnell Douglas Microdata running PICK. As part of the authority's start-up this package had been ported, mostly by outside consultants, to Oracle on HP-UX with Windows 3.11 clients and they're in the process now of porting it to NT with SQL-Server. That's why there are nine staff: four are client-server developers on this project. When I ask whether they're working with SQL-Server 7.0 it turns out they started over a year ago - but they're sure the upgrade will be no problem at all.
Overall, the production software doesn't look too bad. There are individually licensed copies of Microsoft Office on all the PCs - but all of them, including the rackmounts, look like they came from a basement assembler, so I ask Barb if the auditors had verified their license status. She doesn't know.
There's no formal workflow or document management program in place, but most of the smaller tax programs have some useful form of automated support. In many cases, this is fairly minimal; but the volume is so small I don't see cause for concern. The hotel room tax, for example, applies to a total of six establishments and 82 rooms. So what if it's managed from an old Lotus 1-2-3 spreadsheet that's been ported to Excel? You could pretty much do this kind of thing on paper without missing a beat.
The missing workflow and tracking applications raise intriguing audit issues but Barb's not interested in that either. On balance I'm starting to wonder what she plans to do all day and why we're even here when we get the first hint of serious trouble. A couple of assessment clerks take our visit as an opportunity to harass the systems director about database crashes. This happens all the time, they say; and often means doing work over. Right now, they say, they're re-entering data from yesterday.
To me, this doesn't make sense: Oracle's 7.31 was one of those stopped clock moments in product development. Particularly on an HP K-class, it should take serious effort - something like high explosives or an idiot with the root password - to cause it to fail. So I ask the director about the PC client and the network as more probable suspects - and trigger an off-scale defensive reaction.
Networking, it turns out, is incredibly complicated stuff for them. Despite using NT Workstation they rely on Windows for Workgroups for PC networking. The TCP/IP access needed for Oracle and SQL-Net is layered on using a dedicated DHCP server accessed after someone booting a PC logs into the appropriate departmental file and print server. Not only are they running half a dozen LANs on the same wiring, but they're using 124.0.0.x for the DHCP server and cycling through 200 or so IPs to allocate new addresses for every PC reboot - triggering new rounds of cleanup and resource authentication effort each time.
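An aside for readers wondering why the 124.0.0.x detail is worth mentioning: one reading, and it is only my assumption about what the narrator found odd, is that the range lies outside the private address blocks reserved by RFC 1918 (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), so an internal DHCP pool there borrows publicly routable addresses. A minimal Python sketch of that check follows; the helper name `is_rfc1918` and the sample host addresses are hypothetical.

```python
import ipaddress

# The three private IPv4 blocks reserved by RFC 1918.
RFC1918_BLOCKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_rfc1918(addr: str) -> bool:
    """Return True if addr falls inside one of the RFC 1918 private blocks."""
    ip = ipaddress.ip_address(addr)
    return any(ip in block for block in RFC1918_BLOCKS)

if __name__ == "__main__":
    # Hypothetical sample hosts: one from a 124.0.0.x pool, one from a private pool.
    for host in ("124.0.0.17", "192.168.1.17"):
        print(host, "private" if is_rfc1918(host) else "not private")
```

The check only says whether the address space is private; it says nothing about the separate problem of handing every PC a fresh address on each reboot.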
There's an external web server and a firewall too. Both are NT machines but there's no actual content on the web site yet. It has Crystal Reports and IIS hooked up, but the application they're developing for it isn't ready. The director explains that the building permits system will allow people to access and pay for local building permits on the internet. Right now, he says, the municipalities do this manually and the property assessments often don't get updated. By putting this on his web servers, he'll integrate the databases and save the local governments money while raising tax revenues. It's an experimental program, he says, that they've had to put on hold for a few months because they have had so much trouble with Oracle that they decided to expedite the conversion to SQL-Server before proceeding with the web site services mandate.
Although Barb worked on the audit team, she doesn't admit to knowing anything about project authorizations and I don't want to ask the director for his file because I've been trying to reduce his hostility and suspicion. Instead, I offer to check out the problem with the K220 for him.
He does not want me to look at the machine, but I do my best humble cowboy shuffle and, mainly because he's deeply conflicted between wanting to throw me out and not being sure what credibility his bosses will attach to my report, he decides to accept my offer to "just check it over and see if it's something obvious."
There are lots of obvious somethings, but nothing bad enough to cause frequent failure. It's still running 10.20 and has no patches more recent than 1998 but it's got four 120 MHz CPUs, 768MB, and separate narrow SCSI controllers that go out to each of the disk packs. Swap and system disks are dedicated 1GB internal devices and there's a DDS3 tape drive that they tell me is used for backups.
The Oracle installation isn't what I'd do either. Cooked files, no obvious attempt to balance dataspaces across controllers, and no log mirroring at all. Inefficient, poorly structured and unmaintained, but not remotely sufficient to explain their problems. Why does it fail so often?
The answer is that it doesn't. What's happening is that years of disk fragmentation, inappropriate kernel parameter settings, the use of cooked files, and very long record formats, derived from the PICK system the software was first written for, sometimes combine to force Oracle to issue thousands of sequential page reads to satisfy relatively simple lookup requests. That causes long delays during which the system seems to lock up. When the DBA reboots [!] the machine to clear it, Oracle data written to system buffers but not yet flushed to disk is lost, causing long recovery times and work rollbacks.
The DBA, who acts as the Unix admin, thinks the IPL stuff he sees on screen at boot time is Unix and tries to run Oracle via a socket connection from his NT workstation. When he can't get Oracle's attention this way, he just power cycles the server. What's really astonishing is that he's had the job for over a year and reports quite cheerily that this always works.
"You need to get some help here," I tell the systems director, "get Oracle set up right; use raw devices, balance the load across some bigger disks - you could probably pay for some 9GB disks just on power savings - maybe contract out for a part-time Unix admin. This thing should never fail. There are K220s out there with the same Oracle release that haven't been restarted literally in years; get it set up right and it will have years of life left in it. Get some 10,000 RPM disks instead of the 3300 RPM units in the machine and it will easily outperform SQL-Server too."
While I was working on the machine the director's natural friendliness had started to override his fears but all the faces around me close up tightly when I say these things. I'm contradicting absolute and revealed truths here: Unix is obsolete, disks are many times more expensive than for Windows, the technology is hopelessly unreliable and hard to manage - just look at their experience with it for proof. No, no, no, NT will fix everything, that K220 is the enemy incarnate, and, by association, so am I.
On my better days, I'm smart enough to know that there's no point in arguing with a client, but this wasn't a good day. I point out that the table definitions used in the Oracle implementation will be impractical in SQL-Server, they're going to have to redo all the data structures, rebuild the application as a set of stored procedures, and rebuild the client. All that will take time and testing, so why not buy a whole bunch of time by fixing the K220 setup now?
It's not going to happen. They have an all NT strategy, the director says with finality, and that's where they're going. Microsoft, I say, is making your work with NT and SQL-Server 6 obsolete (this is mid 2000); you'll have to do it again for Win2K and SQL-Server 2000. This, of course, is my worst blasphemy yet: they know that their code will work perfectly with future Windows releases. I compound my mistake by asking them if this was true for the upgrade from Windows 3.11 to Windows 95, from NT 3.51 to NT 4 Server, or from Windows ME to Windows 2000 desktop.
The result is a further change in attitude. The hostility and suspicion visible before get submerged into the kind of bemused condescension usually reserved for other people's obnoxious children or the very old. After some verbal fencing, he asks, do I have any other recommendations?
I do, but they aren't anatomically possible, so, instead, I ask about his schedule for meeting the public information access mandate. This is a bluff, I'd never heard of it until he used the magic word earlier, but he doesn't know that.
The deadline set by the board wasn't realistic, he explains. It's all the fault of Oracle running on that old HP box. They have installed an NT server outside their firewall with an IIS front end that lets users run Crystal Reports but Unix doesn't work across the firewall, so they put the project on hold until they get the SQL-Server conversion done.
I'm dumbfounded; and finally silenced. So he fills the silence by talking about having had to sign a non-disclosure agreement to get early access to Microsoft E-commerce Server for his building permits initiative. Microsoft's sales guy is even talking, he boasts, about partnering with the tax authority on this. "We'll be helping them sell their package to other municipalities," Barb chimes in, helpfully.
It takes a minute or so to absorb all of that. A couple of hours ago, while looking at the K220 I'd seen the support modem with its dedicated phone line properly in place and asked what help they were getting from HP on their Oracle problem. They'd cancelled support right after starting the conversion project - about a year and something ago - expensive, and unnecessary because they have the NT server already, he said - but when I called the number helpfully listed on the back of the K220's cabinet, the modem had answered. This is more of the same, but worse.
By now even I know it's hopeless, but I don't want to think about what he just said and seize on an issue that's been niggling at me most of the day: if I have the chronology right, they started the permits application more than a year earlier, and there are four people. This just seems way off scale for something that any competent Linux developer could whack together with a MySQL/PHP combo in a matter, probably, of a few weeks to prototype and maybe two months to final product. So, I ask for a project plan - but the person who has the licensed copy of Microsoft Project isn't in right now; and of course they're using object techniques but their expert on using UML within Visual Studio is the same missing person, so they'll send me the information later.
At this point Barb finally proves useful: she sidesteps a possible assault charge by pulling me away to go in search of the executive director because he'd asked to see us before we left for the day. When we get into his office it's clear that he's been briefed on my discreditable knowledge of Unix but has a message of his own to convey. The web issue isn't important, he says. The board caved to one town's demand for public access and didn't really mean it, so the delay doesn't matter. Furthermore, they're going to love the building permits initiative, he says, and accept any delay as the cost of getting it right.
Message delivered, he and the systems director rub antennae with Barb again and we're ushered out of the building.
Next day, I do what I should have done first: find and read the file from the previous review. It is, of course, signed by one of the insolvency partners Barb reports to - and he thought the systems operation was "forward looking", "innovative", "on track", and "meeting best practices standards for highly effective computing."
The report was, if I do say so myself, a masterpiece in the art of weasel wording. "The problems and delays encountered by the authority were," I wrote, "due largely to rapid and unexpected change in necessary third party components affecting both hardware and software." Then I pointed out that the need to replace or substantially upgrade all of the Authority's computers and applications as Windows 2000 comes in "creates a unique opportunity for the Authority to formally consider its longer term systems infrastructure and staffing ratios."
"Such a review," I wrote, "would focus on accountability, cost, and performance as the Authority's senior management adapts to changes in its mandate, the emerging public awareness of information security issues, and the availability of inexpensive outsourcing solutions."
Then I washed my hands, cleaned the keyboard, changed the printer toner, and resolved never again to darken the authority's doorway.
- These excerpts don't (usually) include footnotes and most illustrations have been dropped as simply too hard to insert correctly. (The WordPress HTML "editor" as used here enables a limited HTML subset and is implemented to force frustrations like the CP/M line delimiters from MS-DOS).
- The feedback I'm looking for is what you guys do best: call me on mistakes, add thoughts/corrections on stuff I've missed or gotten wrong, and generally help make the thing better. Notice that getting the facts right is particularly important for BIT - and that the length of the thing plus the complexity of the terminology and ideas introduced suggest that any explanatory anecdotes anyone may want to contribute could be valuable.
- When I make changes suggested in the comments, I make those changes only in the original, not in the excerpts reproduced here.