Humans make mistakes: Is cloud automation the answer?

Q&A with Jason McKay, Logicworks: how software automation can take the human error out of your cloud infrastructure.

Use of the cloud continues to rise -- but that doesn't mean it's always simple for companies to get complex projects right.

Logicworks says its automation technologies can help improve the performance and security of cloud-based infrastructure. ZDNet recently caught up with the company's CTO, Jason McKay.

ZDNet: Why did you decide to focus on AWS?

McKay: We wanted our engineers to build tooling and frameworks to enable repeatable automation across our clients' environments while allowing for customisation.


Logicworks CTO Jason McKay: "Our Cloud Control [product] grew organically."

Image: Logicworks

We have always had some traction in the automation space, and many of our clients needed to be compliant with PCI DSS [the Payment Card Industry Data Security Standard], where you can improve things considerably by eliminating the human-error aspect. Automation also helped us make things predictable; repeatable components help with security.

So what became our Cloud Control [product] grew kind of organically. It was a case of understanding what would work commonly across our user base that we could then build into something that was repeatable.

One thing we recognised early on was the value of basic sanity checks that ran across our client environments: checks that could be customised but were generally applicable.

So we built out what is generally referred to now as the "scanners" portion of Cloud Control, and that is something we are constantly adding to. We scan across our client base, which is made simple for us by Amazon's API accessibility.

What are you looking for when you scan?

Anomalies, violations of best practice and so on, and when we find a high-risk factor we will automatically correct those violations.
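The scan-and-remediate loop described here can be sketched in a few lines. This is an illustrative sketch, not Logicworks' actual code: the `Finding` shape, risk levels, and rule names are all hypothetical, standing in for whatever the real scanners report.

```python
# Hedged sketch of a "scan, report, auto-correct high-risk" loop.
# All names here are illustrative, not a real product API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Finding:
    resource_id: str
    rule: str   # which best-practice rule was violated
    risk: str   # "low", "medium", or "high"

def scan_and_remediate(findings, remediators: dict):
    """Report every finding; automatically correct only the high-risk ones
    for which a remediation callable is registered."""
    corrected, reported = [], []
    for f in findings:
        if f.risk == "high" and f.rule in remediators:
            remediators[f.rule](f)             # e.g. close an open port
            corrected.append(f.resource_id)
        else:
            reported.append(f.resource_id)     # surface for human review
    return corrected, reported
```

The design choice matches the interview: only high-risk violations are corrected automatically; everything else is merely reported, keeping a human in the loop for ambiguous cases.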

We want to automate for several purposes. One is scalability but the benefit that most immediately becomes apparent is that humans make mistakes. That's just the way things work.

Someone goes in and, either through negligence or simply by mistake, deploys resources into the wrong region. Something like: they want to set up a test lab, which is no big deal, but they set it up in Singapore where it won't conflict with or affect any of [their] production or DR workloads, which are running in the US.

The problem is that now you have resources that are unaccounted for, running in another region, unchecked by the governance folks. So we have a scanner that looks for new resources in each region, and then we can do something about it.
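The "wrong region" scanner is essentially a policy filter over a resource inventory. A minimal sketch, assuming the inventory has already been pulled (in practice via the AWS APIs) into `(resource_id, region)` pairs, and that the approved-region list comes from client policy:

```python
# Illustrative sketch: flag resources running outside governance-approved
# regions. The inventory format is an assumption for this example.
def find_out_of_region(resources, allowed_regions):
    """Return (resource_id, region) pairs for resources running outside
    the set of regions governance has approved."""
    return [(rid, region) for rid, region in resources
            if region not in allowed_regions]
```

A finding like the Singapore test lab above would surface immediately, even though nothing about the resource itself is misconfigured; the violation is purely one of location.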

Then the other thing we do is follow best practice and run AWS CloudTrail and AWS Config: two services, one that provides an API audit history and one that gives you a version-control-style history of changes across the environment.

These are two must-have services in an Amazon environment, especially when, as we do, you manage the AWS environment as code. They give you that audit trail of 'we know who did what and when'. That is key.


It also helps for forensic purposes: it lets you see what changes have been made over a period of time and what effect those changes may have had on the application.
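These audit services are only useful if they are actually switched on, which is itself something a scanner can decide. A hedged sketch of that decision: the input dictionaries are simplified stand-ins for what the real AWS status calls return (roughly `cloudtrail.get_trail_status` and `config.describe_configuration_recorder_status` in boto3), not the exact response shapes.

```python
# Illustrative check: which audit services still need enabling?
# Input shapes are simplified assumptions, not exact AWS API responses.
def missing_audit_services(trail_statuses, recorder_statuses):
    """Return the names of audit services that are not currently logging."""
    missing = []
    if not any(t.get("IsLogging") for t in trail_statuses):
        missing.append("CloudTrail")
    if not any(r.get("recording") for r in recorder_statuses):
        missing.append("Config")
    return missing
```

A remediation step would then enable whatever comes back in the list, which is exactly the "if they haven't, we can enable them" behaviour described next.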

That also means that when we scan we can see whether CloudTrail and Config have been enabled, and if they haven't, we can enable them. Now that is the top-level view of the Amazon tooling, but the other, more interesting part is the automation framework that we built.

This came about as a result of our trying to make repeatable, scalable builds for our clients.

That consists of tooling we put together to create cloud implementation frameworks into which we can inject some metadata.

These take care of all the initial setup and feed into Puppet or some other configuration jobs that handle risk management and can connect a build to Active Directory, a domain controller, or a service for centralised user authentication and so on: the automated services that we want on every build.

Once we have done that, we can use Puppet to bring in any service modules that a particular client may want to run. That lets us build something that is very repeatable and scalable.

We find ourselves constantly working on the automation framework as we get new demands from clients. Throughout, we are trying to limit the human intervention required.

The aim is that nobody has to log into a production server at all. All the work for the production environment should be done automatically. Like all ideals, we know that it is most likely unachievable, but we try to get as close as we can.

What are the issues you come across?

Small things, mainly. Sometimes you can see that your time protocols have not been configured properly, which in turn has stopped the servers configuring properly. It could come down to one mistake in a config file. You may be tempted, for expediency, to just write a for loop in a shell script, run it across a hundred of your servers, and get them up and running. That may work, but you may also make a mistake.

What you should do is take a step back and ask yourself: 'What have I done? Is this something I can do programmatically and test in a lower environment first? And if so, is this something I can automate?' If you can, then build it, test it, and only then use it in production.
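The "step back and automate" approach above can be sketched as a testable fix rather than an ad-hoc shell loop. A minimal sketch, assuming a simplified dictionary-based server config (the real config format and the NTP server names are stand-ins for this example):

```python
# Illustrative sketch: express the fix as a pure, testable function you can
# verify in a lower environment before touching production.
# The config shape here is an assumption for the example.
def fix_time_config(config, ntp_servers):
    """Return a corrected copy of a server config with NTP servers set.
    The original config is left untouched."""
    fixed = dict(config)
    fixed["ntp_servers"] = list(ntp_servers)
    return fixed

def time_config_ok(config):
    """Verification step: does this config have any NTP servers at all?"""
    return bool(config.get("ntp_servers"))
```

Because the fix is a function with a verification check, you can assert it works on one test config, then apply the same code to a hundred servers, instead of hoping a hand-written loop got every case right.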

It is a matter of looking at where and when you log in, questioning the need for human effort, and minimising it.

The other issue is 'code pushes': making changes and enhancements while moving to a fully agile environment. These are the teams going from a release every three weeks to trying to release weekly. When you do that manually, FTP-ing artifacts and tarballs onto the server, it is simply not feasible. So it is about making your code deployment pipeline as air-tight as possible in getting the code onto the box.

What is your role?

I have been the CTO for just over a year, but I have been with the company for ten years, in engineering roles the whole time. It's great because it is a very fast-moving area. It is not the staid IT of the 1990s and early 2000s.

Read more about enterprise software and the cloud