With such an infrastructure in place, the next challenge is keeping it shipshape. Key to maintaining any IT infrastructure is meeting the service-level agreements (SLAs) that companies have set internally or with their end users. With IT increasingly used as a strategic business tool, SLAs have a direct impact on a company's bottom line.
Keeping an IT infrastructure healthy with minimum downtime requires proactive maintenance. Put simply, you should be spending your time preventing problems, not solving them.
So what are proactive steps that can ensure your SLAs are met? And how can you do this and still manage new releases and upgrades of the different products that exist in the integrated stack that is your production environment?
Let us look at some best practices seen from our perspective as a system integrator:
Step 1: Be a doctor of applications
The chart below shows that 80% of application downtime is typically attributed to software failure and human error.
Figure 1: Causes of unplanned application downtime. (Source: Gartner Group)
One key to reducing application failure-related downtime is root cause analysis, as most downtime is spent identifying the root cause of the problem. To do this, it is essential to identify, understand and continuously monitor the business processes that your IT systems enable.
You need to ask probing questions, especially those that might throw up more questions. Remember, you want to get to the source of your problems. And don't ignore the simple, obvious ones: How are your orders being processed? How do shoppers access your web site? Are your online reservations always available? How do your brokers conduct online trading? Understanding these processes will help you pinpoint the root causes of problems that surface.
Many tools are available for business process monitoring and root cause analysis. Implementing them reduces downtime by shortening the time it takes to identify the root cause of a problem.
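To make the idea concrete, here is a minimal sketch of dependency-based root cause analysis: given which services a business process depends on and the latest health-check results, the failed services whose own dependencies are healthy are the likely root causes. The service names and dependency map are illustrative assumptions, not any particular tool's model.

```python
# Minimal root-cause sketch: a failed service whose dependencies all
# passed their checks is a likely root cause (illustrative names only).

DEPENDS_ON = {
    "web_storefront": ["app_server"],
    "app_server": ["database", "directory_service"],
    "database": [],
    "directory_service": [],
}

def root_cause_candidates(check_results):
    """Return failed services all of whose dependencies passed their checks."""
    failed = {s for s, ok in check_results.items() if not ok}
    return sorted(
        s for s in failed
        if all(check_results.get(dep, True) for dep in DEPENDS_ON.get(s, []))
    )

# Storefront and app server are down, but only the database failed on its
# own -- so the database is flagged as the likely root cause.
results = {
    "web_storefront": False,
    "app_server": False,
    "database": False,
    "directory_service": True,
}
print(root_cause_candidates(results))  # -> ['database']
```

Commercial monitoring suites do essentially this at much larger scale, correlating events across the stack so operators spend minutes, not hours, locating the failing component.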
Step 2: Know your OS stacks
Two basics apply here:
- Be judicious in monitoring for the latest OS patches, versions and release upgrades.
- Do so with a planned schedule.
Finally, create your matrix in the order in which support needs to be provided. Before you apply a patch, test your matrix, including the applications, in a staging or test environment.
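The support matrix can be as simple as a table mapping each OS patch level to the application versions certified for it, checked before any patch is rolled out. The sketch below shows the idea; the patch levels, product names and version numbers are made up for illustration.

```python
# Hedged sketch of a support-matrix check: before applying an OS patch,
# verify every deployed application is certified for the target level.
# Patch levels, products and versions here are illustrative assumptions.

SUPPORT_MATRIX = {
    # os_patch_level: {application: certified versions}
    "OS-5.10-p3": {"app_server": {"9.1", "9.2"}, "database": {"11.2"}},
    "OS-5.10-p4": {"app_server": {"9.2"}, "database": {"11.2", "12.1"}},
}

def patch_is_safe(target_patch, deployed):
    """Return (ok, problems): deployed apps not certified for the patch."""
    certified = SUPPORT_MATRIX.get(target_patch, {})
    problems = [
        f"{app} {ver}" for app, ver in deployed.items()
        if ver not in certified.get(app, set())
    ]
    return (not problems, problems)

deployed = {"app_server": "9.1", "database": "11.2"}
ok, problems = patch_is_safe("OS-5.10-p4", deployed)
print(ok, problems)  # app_server 9.1 is not certified for the p4 level
```

A spreadsheet serves the same purpose; what matters is that the check is performed, in staging, before the patch touches production.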
Step 3: Know the rest of your stacks
Apply step 2 to the rest of your IT infrastructure stack, namely, the data management, application infrastructure, application and network stacks.
Here is what each stack comprises in detail:
- The data management stack pertains to patches and upgrades for file systems, volume managers, cluster layers, HBA and SAN fabric. As in step 2, you can use the support matrix that accompanies the release. Test your matrix similarly before applying your patches.
- The application infrastructure layer includes the application server, database, directory services, identity management and the middleware layer. In general, patches or version upgrades will have an impact on the OS version matrix. Hence, as you generate the IT infrastructure integrated stack matrix, note also the corresponding supported server OS, file system, volume manager, cluster layer, server hardware, firmware versions, storage drivers, SAN fabric, storage, application server and database layers.
- The application layer should include all your applications, and the relevant patches and updates. Remember, as in point 2 above, that patches and updates go hand-in-hand with your OS versions and the patches they support. Generate your matrix as above. Remember to test the applications as well.
- The network stack has minimal impact on the other IT infrastructure stacks. However, as a practice, apply the templates above and generate the requisite matrix. Do this for completeness, as well as to reduce the chance of application failure due to human error.
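Putting steps 2 and 3 together, the integrated stack matrix can be walked top-down: each layer records which versions of the layer beneath it are certified, and any uncertified pairing is flagged before an upgrade proceeds. The layer names and versions below are assumptions for illustration only.

```python
# Illustrative walk of an integrated-stack matrix: each layer lists the
# certified versions of the layer beneath it; an upgrade plan is checked
# pair by pair. All names and versions are made-up examples.

STACK = ["application", "app_server", "database", "os", "firmware"]

CERTIFIED = {
    ("application", "2.4"): {"app_server": {"9.2"}},
    ("app_server", "9.2"): {"database": {"11.2", "12.1"}},
    ("database", "12.1"): {"os": {"OS-5.10-p4"}},
    ("os", "OS-5.10-p4"): {"firmware": {"fw-7"}},
}

def validate_stack(versions):
    """Return (upper, lower) layer pairs not found in the certified matrix."""
    gaps = []
    for upper, lower in zip(STACK, STACK[1:]):
        allowed = CERTIFIED.get((upper, versions[upper]), {}).get(lower, set())
        if versions[lower] not in allowed:
            gaps.append((upper, lower))
    return gaps

versions = {
    "application": "2.4", "app_server": "9.2",
    "database": "12.1", "os": "OS-5.10-p4", "firmware": "fw-6",
}
print(validate_stack(versions))  # firmware fw-6 is not certified for this OS
```

One uncertified pairing anywhere in the stack is enough to hold the upgrade back until the staging tests in step 4 clear it.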
Step 4: Create your staging and testing area
You will need a staging or test area: a scaled-down version of your production environment for the testing described above. To save money, consider outsourcing some of your testing to a systems integrator that can provide a shared environment for testing purposes.
Step 5: Repeat your stacks for disaster recovery
Enterprises that maintain disaster recovery (DR) sites as part of their business continuity plan will need to repeat all of the steps and considerations outlined above at the DR site. This ensures that the production and DR sites stay synchronized and aligned, so that business can continue when disaster strikes.
One last point: before applying your support matrices at your DR site, remember to first test them in a staging environment, just as you did before running them in production.
Gowthaman M is the technical director of Frontline Technologies, a Singapore-based regional IT service provider, where he leads a team in providing business solutions, professional services and engineering. A 15-year IT services and operations veteran, Gowthaman's current interest is in the area of tech investment optimization.