Towards the end of 2014, NASA was looking at overhauling its information technology governance to ensure the security of its information technology systems. Intel, for its part, had been helping the NASA Center for Climate Simulation assess whether cloud infrastructure could stand in for Discover, the centre's purpose-built computing cluster.
Back in 2011, network architects at the NASA Center for Climate Simulation began investigating the viability of running the organisation's modelling and simulation applications on cloud infrastructure as an alternative to its purpose-built computing cluster named Discover.
Hoping to capture the inherent advantages of cloud infrastructure, such as agility and elasticity, they wanted to establish whether open cloud architecture could meet the applications' rigorous throughput and latency requirements.
In particular, they needed to ensure that overheads associated with virtualization would not limit performance.
As part of the shift to the cloud, the team hoped to converge the environment's backbone and management infrastructures onto 10-gigabit Ethernet. Using a single network fabric was expected to help optimize the flexibility and cost effectiveness of the overall solution.
The NASA Center for Climate Simulation's research on climate change and related phenomena requires extensive computer modelling, and contributes to efforts such as hurricane prediction, analysis of past weather patterns, and scientific support of government climate policy.
The Discover cluster had done this work for some years, combining supercomputing, visualization, and data-management technologies to deliver roughly 400 teraflops of capacity.
Its compute resources comprised 30,000 conventional Intel Xeon processor cores and 64 GPUs. The inter-node backbone used DDR and QDR InfiniBand, while management networking ran over Gigabit and 10-gigabit Ethernet (GbE and 10GbE). The data store was a ~4-petabyte RAID-based parallel file system (GPFS), plus a ~20-petabyte tape archive.
Discover is based entirely on non-virtualized machines, so adding capacity requires additional physical servers to be provisioned.
Reducing the traditional cost and complexity of those changes is one benefit of cloud computing. Moreover, cloud architectures add elasticity that aids in job scheduling and helps avoid operational bottlenecks associated with long-running jobs.
Intel suggested Nebula, based on OpenStack, as an alternative to Discover. First, though, the team had to establish whether Nebula could deliver performance equivalent to Discover's. In particular, the team needed to determine whether the virtualized environment on which Nebula is based would introduce overhead or other factors that would create unacceptable limitations compared with "bare-metal" clusters.
To advance the state of this preliminary testing, additional work was needed. In particular, the team had to test additional benchmarks and real-world applications, and extend the tests to include InfiniBand fabric and cloud infrastructures such as OpenStack and Eucalyptus.
To meet critical speed and latency requirements in node-to-node communication, NASA performance engineers worked with Intel to employ virtualization technologies to their full potential.
Together, the team established a test methodology to compare the two environments on several workloads, including the Nuttcp network performance measurement tool, the Ohio State University MPI Benchmarks, and the Intel Math Kernel Library (MKL) implementation of LINPACK. Analysis using these benchmarks enabled the team to measure and compare system throughput and latency between various types of physical or virtual servers.
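To make that comparison concrete, here is a minimal Python sketch of how throughput and latency figures from a bare-metal server and a virtualized one might be reduced to a relative overhead. The class, function, and field names are hypothetical, and the numbers are placeholders chosen purely for illustration; they are not NASA or Intel measurements, and this is not the team's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One benchmark result for one environment (hypothetical structure)."""
    throughput_gbps: float  # network throughput; higher is better
    latency_us: float       # message latency; lower is better


def overhead_percent(bare_metal: Measurement, virtual: Measurement) -> dict:
    """Express the virtualized result as a percentage penalty versus bare metal."""
    throughput_loss = (
        (bare_metal.throughput_gbps - virtual.throughput_gbps)
        / bare_metal.throughput_gbps * 100
    )
    latency_increase = (
        (virtual.latency_us - bare_metal.latency_us)
        / bare_metal.latency_us * 100
    )
    return {
        "throughput_loss_pct": throughput_loss,
        "latency_increase_pct": latency_increase,
    }


# Placeholder values for illustration only; not real benchmark results.
bare = Measurement(throughput_gbps=10.0, latency_us=20.0)
virt = Measurement(throughput_gbps=9.0, latency_us=25.0)

result = overhead_percent(bare, virt)
for name, value in result.items():
    print(f"{name}: {value:.1f}")
```

In practice the inputs would come from tools such as those named above, run once per environment, and a negative value would indicate that the virtualized server outperformed bare metal on that metric.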
Comparing multiple virtualization scenarios revealed the role that those virtualization technologies can play in meeting performance goals. The core conclusion from this testing was that cloud-based high-performance computing is viable. Continued testing will also cover additional hypervisors, such as Xen, and other VM operating systems, such as Red Hat Enterprise Linux and SUSE Linux.
NASA continues to refine its cloud-based infrastructure as a service, and it expects to realize more benefits in the areas of simplification, flexibility, and cost effectiveness. Looking ahead, the agency's high-performance computing workloads have begun the process of shifting to open infrastructures that use Ethernet fabric, and further acceleration seems inevitable.