Kaplan Test Prep graduates to a cloud-based data lake

Kaplan uses SnapLogic and Amazon Redshift to cut costs, optimize its product portfolio, and boost profits.


Kaplan Test Prep is well known for helping students prepare for college-entrance exams, such as the SAT and ACT; post-grad admissions tests, such as the GRE and GMAT; and licensure exams for medical, legal, nursing, financial, and other professional careers.

Unfortunately, the company wasn't making the grade when it came to using all available information for data-driven decision-making.

Founded in 1938, Kaplan has decades of historical data, scores of legacy systems, and diverse applications. From 2013 to 2015, it made a methodical move to a virtual private cloud (VPC) and cloud-based application stack on Amazon Web Services (AWS), an effort that helped Kaplan modernize its infrastructure and consolidate from 12 data centers down to four. From an analytical perspective, however, Kaplan continued to rely on siloed tools and reporting capabilities; it lacked a centralized store where it could consolidate and analyze data from its many sources.


"We had one, small [Microsoft SQL Server] data warehouse that was ingesting data from just two systems; that's it," says Tapan Parekh, director of analytics and data architecture. "It wasn't a complete view of data, and nobody was happy."


When he joined Kaplan in November 2015, Parekh immediately began developing an architecture for an analytical data platform. Given that the majority of data sources were now running on AWS, Parekh was considering Amazon Redshift, the vendor's columnar database service. His biggest challenge was figuring out how to get data into Redshift.

"We have many different applications using different underlying databases and technologies," says Parekh. "We had different velocities and volumes of data coming in. Ingesting from a relational database is straightforward, but we also have data coming in from streams, which is nonrelational JSON data, and we have one or two applications that are XML-based. So, a traditional [batch] approach wouldn't work."

Anticipated data-velocity requirements ranged from once-per-month loads from accounting systems, to daily, intraday, and microbatch loads from relational and NoSQL sources, to real-time requirements from Amazon Kinesis-based streaming applications.


Kaplan looked at integration options including Informatica, Microsoft SQL Server Integration Services, and hand-coding with Python, but it quickly narrowed its choice to SnapLogic, based on factors including ease of use, cost competitiveness, and security features, according to Parekh. But the selection wasn't finalized until SnapLogic and Redshift passed a proof-of-concept test in which data was loaded from Salesforce and Zuora SaaS services as well as from a homegrown system of record running in Kaplan's VPC on Amazon. Once the data was loaded into Redshift, the next step was to build a data mart making all these sources of data available for analysis.

"We were able to do it all within three months using all of the data within these systems, not just dummy data," says Parekh.

In the first year of the production deployment that followed, the focus was on getting data into the Redshift-based platform. The Kaplan team doing this work varied between three and four people. In one project after another, they built SnapLogic pipelines for data ingestion from more than 30 applications into Redshift. Most of these applications are still active, so Kaplan continues to load copies of incremental data changes at latencies ranging from monthly and daily to hourly, near-real-time, and streaming. Sources range from systems of record, learning management systems, and financial systems to Salesforce CRM, Workday, Zuora, and Google Analytics. Underlying database management systems include Oracle, PostgreSQL, Microsoft SQL Server, MongoDB, and DynamoDB.


In some cases, Kaplan is using Redshift to consolidate data through one-time migrations from legacy applications that have been retired or soon will be. Here, Kaplan moves all available data onto Redshift, retaining historical information that might fuel seasonality, time-series, and other long-term trend analyses.

Kaplan is using Redshift's Spectrum capability to provide access to semi-structured information, such as JSON data from Kinesis-based streaming applications and Mixpanel data on mobile-app clickstreams. This data is stored in the Amazon S3 object store; Redshift Spectrum SQL commands query it through external tables, effectively joining it with the structured data on the core platform. Kaplan is also exploring Amazon Athena as its unstructured-data querying needs expand.
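The Spectrum pattern described above can be sketched in SQL. This is a minimal illustration, not Kaplan's actual setup: the schema, table, columns, IAM role, and S3 bucket names are all hypothetical placeholders.

```sql
-- Register an external schema backed by the AWS Glue/Athena data catalog.
-- The role ARN and database name below are hypothetical.
CREATE EXTERNAL SCHEMA clickstream_ext
FROM DATA CATALOG
DATABASE 'clickstream_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Expose JSON event files sitting in S3 as an external table.
CREATE EXTERNAL TABLE clickstream_ext.app_events (
    user_id  VARCHAR(64),
    event    VARCHAR(128),
    event_ts TIMESTAMP
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://example-bucket/clickstream/';

-- Query the S3 data in place and join it to a structured table
-- that lives on the core Redshift cluster.
SELECT s.student_id, COUNT(*) AS clicks
FROM clickstream_ext.app_events e
JOIN core.students s ON s.user_id = e.user_id
GROUP BY s.student_id;
```

The point of the pattern is that the JSON never has to be loaded into Redshift proper: Spectrum scans it in S3 at query time, while the join executes against the warehouse's structured tables.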

As detailed in my latest case study, Kaplan Graduates to a Cloud-Based Data Lake on Amazon Web Services, Kaplan has already seen a greater than 10-times return on its investment, and the benefits keep coming. Not only has the company retired aging software and systems, to the tune of more than $1 million in one-time savings, but the new platform is also powering activity-based cost analyses that are streamlining operations and boosting profits. What's more, data-archiving workflows powered by SnapLogic are expected to cut CRM system storage costs by $150,000 annually. A free excerpt of the case study is available for download.
