This is the 13th excerpt from the second book in the Defen series: BIT: Business Information Technology: Foundations, Infrastructure, and Culture
Note that the section this is taken from, on the evolution of the data processing culture, includes numerous illustrations and note tables omitted here.
Roots (Part Four: sample System 360 best practices)
Clearly Defined Line Management Structure with rigid role separation
At a minimum there should be:
- An operations unit responsible for day-to-day execution of scheduled processor tasks;
- A control group responsible for data collection and other user input;
- A systems development group responsible for development work;
- A help desk manager responsible for PC operations;
- A capacity and utilization management unit;
- An end user support manager responsible for business applications operations;
- A data management unit containing a data architect, data manager, and one or more database administrators;
- A technical support group responsible for managing and updating system software;
- A license and documentation management unit responsible for tracking PC licensing and application documentation;
- A contracts manager;
- A security and related (personnel) policies administrator; and,
- A PC systems administration unit responsible for PC hardware and software.
SLA includes annually budgeted operations
The service level agreement is the contract between the data center and the user community. This is the peace treaty in the battle for resources and control between user groups and the data center. As such it governs expectations and is renegotiated annually as part of the budget process.
The SLA should be integrated with the overall systems governance process and be administered by a systems steering committee that includes senior executives.
Clearly documented SDLC standards
Data centers that run only packaged applications tend to stagnate. The growth and service potential lies in new development, new deployments, and the discharge of ever-increasing corporate responsibilities.
Early System 360 adopters generally underestimated development complexities and limitations, and therefore tended to over-promise. As most projects failed while a few succeeded, the critical success factors for developers soon became clear; high among these was the use of a clearly enunciated and strongly enforced systems development lifecycle methodology, or SDLC.
Developers who obtained user sign-off at each stage of a project's lifetime and then incorporated the resulting expectations into service level agreements generally found that users who had been co-opted during project design accepted weaker results as successes and were less likely to rebel at budget increases.
The typical SDLC is defined in terms of steps leading to deliverables and sign-offs rather than working code or reviewable systems documentation. Many of these steps are inherently technical but the focus is on the signoffs and processes rather than the contents of each deliverable, thus decoupling the systems development management process from systems development and testing.
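One way to picture this decoupling is a minimal sketch of stage-gated sign-off tracking. This is an illustration only, not taken from the book: the stage names and deliverables are typical examples, not a prescribed methodology, and the point is simply that management tracks deliverables and sign-offs rather than the technical content of each stage.

```python
# Hypothetical sketch of an SDLC as a sequence of stages, each gated by a
# user sign-off on a named deliverable. Stage names are illustrative.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Stage:
    name: str
    deliverable: str       # what management reviews - not the code itself
    signed_off: bool = False

STAGES = [
    Stage("Feasibility", "feasibility study"),
    Stage("Requirements", "requirements specification"),
    Stage("Design", "system design document"),
    Stage("Construction", "tested code and documentation"),
    Stage("Implementation", "user acceptance sign-off"),
]

def next_stage(stages: list[Stage]) -> Optional[Stage]:
    """Return the first stage still awaiting user sign-off, if any."""
    for stage in stages:
        if not stage.signed_off:
            return stage
    return None  # all deliverables signed off - project complete

# Example: the feasibility study has been signed off, so the project may
# not proceed past Requirements until that deliverable is also approved.
STAGES[0].signed_off = True
print(next_stage(STAGES).name)  # Requirements
```

Note that nothing here inspects the deliverables themselves; the gating logic works entirely on sign-off status, which is exactly the decoupling the text describes.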
"Lights out" 24 x 7 operation
Automated, or "lights out", operation is normally presented as a means of saving costs - not having to run a night shift means not paying those salaries. But, in reality, the people assigned operational functions during these shifts tend to be low cost, so savings are usually negligible on the scale of the overall data center budget.
The management value of lights out operation as a best practice derives from something else entirely: the fact that it is functionally impossible to achieve without first implementing a series of related practices ranging from proper job scheduling to accurate capacity planning, effective abend minimization, and automated report distribution.
Use of Automated Tape Library
Use of an automated tape library coupled with vaulted third party off-site storage for backups is a common best practice mainly because it reduces both data loss and tape mount errors.
Disaster Recovery or Business Continuity Plan
A documented disaster recovery plan must exist.
The traditional first step in a mainframe disaster recovery planning effort is the classification of systems (meaning applications groups) according to the severity of the impacts associated with processing failure. Thus most plans are ultimately predicated on the time frames within which processing is to resume for each of a set of jobs grouped under headings like Critical, Vital, Sensitive, or Non-Critical.
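The classification approach can be sketched in a few lines. This is an illustration only, not from the book: the specific hour values and application names are hypothetical examples, while the class names follow the text above.

```python
# Hypothetical sketch: mapping severity classes to maximum recovery time
# frames, then ordering an application portfolio for restoration. The
# hour values are example figures, not prescribed standards.
RECOVERY_WINDOWS = {
    "Critical": 8,        # must resume within the first shift
    "Vital": 24,          # must resume within a day
    "Sensitive": 72,      # must resume within a few days
    "Non-Critical": 168,  # can wait up to a week
}

def recovery_deadline_hours(app_class: str) -> int:
    """Return the maximum allowed outage, in hours, for a severity class."""
    return RECOVERY_WINDOWS[app_class]

def recovery_order(apps: dict[str, str]) -> list[str]:
    """Order applications so the most time-critical are restored first."""
    return sorted(apps, key=lambda name: recovery_deadline_hours(apps[name]))

# Example portfolio (application names are invented for illustration).
portfolio = {
    "general ledger": "Vital",
    "order entry": "Critical",
    "sales history archive": "Non-Critical",
    "exception reporting": "Sensitive",
}
print(recovery_order(portfolio))
# ['order entry', 'general ledger', 'exception reporting', 'sales history archive']
```

The ordering is the whole point of the classification: in an actual recovery, effort and interim capacity go first to the applications whose outage windows are shortest.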
The more common recovery strategies are built around:
- Hot site agreements with commercial service organizations, under which the company regularly transfers tapes to the hot site operator and the operator assures the company of access to physical and processing facilities for the duration of any emergency. These agreements come in multiple "temperatures": a cold site, for example, offers little more than space and a physical facility, with none of the company's code preloaded and no communications links pre-tested.
- Internal systems duplication, in which the company maintains two or more independent data centers and uses each as backup for the other.
Disasters are extremely rare. When they do occur weaknesses in the recovery plan are usually found in one or more of three main places:
- The materials needed to resume processing - including things like network information, back-up applications, libraries, and data, licensing, or report distribution maps - turn out not to have been updated sufficiently recently to allow full functionality to resume without significant and unexpected recovery effort;
- The organizational effort to re-route manpower and re-assign personnel to the interim facility often turns out to be much greater than expected and an initial period of apparent chaos ensues as roles, assignments, and authorities are worked out; and,
- Third party access to, or from, the interim facility often fails, resulting in missed file transfers which, in turn, affect scheduled batches; the applications in which those batch runs figure then start to "go off the rails" - ultimately requiring database rollbacks and imposing extensive re-processing on users.
As a result it is common in real processing disasters to find the data center director reporting full functionality at the interim site several days before users can resume normal operations.
- These excerpts don't (usually) include footnotes and most illustrations have been dropped as simply too hard to insert correctly. (The WordPress html "editor" as used here enables a limited html subset and is implemented to force frustrations like the CP/M line delimiters MS-DOS inherited).
- The feedback I'm looking for is what you guys do best: call me on mistakes, add thoughts/corrections on stuff I've missed or gotten wrong, and generally help make the thing better.
Notice that getting the facts right is particularly important for BIT - and that the length of the thing plus the complexity of the terminology and ideas introduced suggest that any explanatory anecdotes anyone may want to contribute could be valuable.
- When I make changes suggested in the comments, I make those changes only in the original, not in the excerpts reproduced here.