AWS re:Invent Big Data spotlight shines on global footprint and storage access

With Google Cloud Spanner and Microsoft Azure Cosmos DB in its sights, AWS made global scale and resiliency key big data themes at re:Invent.
Written by Tony Baer (dbInsight), Contributor


With more than a half-dozen announcements, Amazon Web Services' expanding database portfolio was a major theme of CEO Andy Jassy's keynote yesterday at re:Invent. Among the highlights were scale-out announcements for Amazon's Aurora SQL and DynamoDB NoSQL platforms; they were clear responses to Google Cloud Spanner and Microsoft Azure Cosmos DB, respectively.

Aurora, Amazon's cloud-native SQL transaction database, is adding a multi-master capability designed to put write performance on par with reads. Until now, Aurora has had a master/slave topology: writes were made on the master node and propagated to as many as 15 read replicas across multiple Availability Zones (AZs) within a region; you could also have read-only replicas outside the home region.

In practice, with multiple read replicas, it wasn't surprising that read operations, and failover of read nodes, were much faster than writes. The new multi-master capability should level the playing field.

With the new Multi-Master capability, Aurora now supports multiple write master nodes across multiple AZs. That allows applications to tolerate failures of any master with subsecond failover -- making downtime practically nonexistent. This is especially critical for applications with extremely demanding throughput and availability requirements.

Also: Amazon Aurora PostgreSQL headlines pre-show announcements

Multi-master is made possible by Aurora's lightweight, log-based updates (conventional databases rewrite entire pages, by contrast), combined with an "optimistic" write-lock approach that keeps the database highly available. There will be obvious comparisons to Spanner, which just added multi-region support for a single read/write instance. For now, Aurora's multi-master approach is confined to a single region (with read-only replicas outside the region) and is initially available only on the more established MySQL engine; it is now in preview. Amazon Aurora Multi-Master will add multi-region support for globally distributed database deployments in 2018.

At the other end of the scale, Aurora has added a new serverless option designed for workloads that are highly variable and subject to rapid change. Being serverless, customers do not have to provision Aurora clusters. Being elastic, they pay (by the second) only for the database resources they use; when the run is over, customers pay only for storage. It is now in preview for Amazon Aurora MySQL.

Keep in mind that the reason the new Aurora features are for now available only on MySQL is that it is the more established engine; as we noted yesterday, Aurora PostgreSQL became generally available just over a month ago. We expect that as the Aurora PostgreSQL engine matures, these new capabilities will trickle over.
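To make the serverless model concrete, here is a minimal sketch of what a cluster request might look like through boto3's RDS API: instead of sizing instances, you declare capacity bounds and an auto-pause policy. The cluster identifier and the scaling numbers below are hypothetical examples, not values from Amazon's announcement.

```python
# Sketch: build parameters for RDS create_db_cluster in serverless mode.
# The identifier and scaling bounds below are hypothetical examples.

def serverless_cluster_request(cluster_id, min_acu=2, max_acu=16, pause_after=300):
    """Parameters for an Aurora Serverless (MySQL) cluster request."""
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "aurora",                 # the Aurora MySQL engine
        "EngineMode": "serverless",         # no instances to provision
        "ScalingConfiguration": {
            "MinCapacity": min_acu,         # capacity units to scale between
            "MaxCapacity": max_acu,
            "AutoPause": True,              # pause when idle...
            "SecondsUntilAutoPause": pause_after,  # ...after this long
        },
    }

# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("rds").create_db_cluster(**serverless_cluster_request("demo-cluster"))
```

The key design point is that capacity becomes a range, not a fixed fleet: the service scales within the declared bounds and, with auto-pause enabled, billing for compute stops entirely when the workload goes quiet.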

There was a similar scale-out theme behind DynamoDB's new Global Tables and On-Demand Backup features. Global Tables fills a gap in the DynamoDB platform: tables could be replicated globally, but until now that required a manual scripting workaround. Now you can create tables that are automatically replicated across two or more AWS Regions and, significantly, support multi-master mode for cross-region writes. This allows applications to perform low-latency reads and writes against local DynamoDB tables in the region where the application is running. So a consumer using a mobile app in North America experiences the same response times when traveling to Europe or Asia, without developers having to write complex replication logic, which in turn benefits software quality.

Amazon DynamoDB Global Tables also provide redundancy across multiple regions, so databases remain available to the application even in the unlikely event of a service level disruption in a single AZ or single region.

This is clearly an answer to Cosmos DB, which already had multi-region, multi-master writes, although the two platforms differ in their multi-model support and in how consistency is managed. Global Tables is generally available now.
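A sketch of what creating a global table might look like with boto3: you name the table and list the replica regions (the table name and regions below are hypothetical; the API also assumes an identically defined table with streams enabled already exists in each region).

```python
# Sketch: build parameters for DynamoDB's create_global_table call.
# Table name and regions are hypothetical examples.

def global_table_request(table_name, regions):
    """Parameters to join per-region tables into one global table."""
    return {
        "GlobalTableName": table_name,
        # One replication-group entry per participating region.
        "ReplicationGroup": [{"RegionName": r} for r in regions],
    }

# Usage (requires boto3, plus a matching table with streams enabled per region):
#   import boto3
#   boto3.client("dynamodb").create_global_table(
#       **global_table_request("Users", ["us-east-1", "eu-west-1", "ap-northeast-1"]))
```

After that, the application simply reads and writes the local table in whichever region it runs; DynamoDB handles the cross-region propagation.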

Another enhancement to DynamoDB is On-Demand Backup, a feature for taking full backups of even very large tables; Amazon cites capacity of up to 100 TBytes daily. This is useful for all the reasons you need backups: satisfying long-term data retention policies, along with short-term data protection and guarding against data loss from application errors. It provides a faster, more convenient way to perform backups, small or large, with virtually no impact on performance, with just a single click. On-Demand Backup is available now; a related capability, point-in-time restores going back up to 35 days, will be available early next year.
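The "single click" translates to a single API call. A minimal sketch using boto3's DynamoDB backup API (table and backup names are hypothetical):

```python
# Sketch: parameters for DynamoDB's on-demand backup API.
# Table and backup names are hypothetical examples.

def backup_request(table_name, backup_name):
    """Parameters for dynamodb.create_backup -- a full table backup on demand."""
    return {"TableName": table_name, "BackupName": backup_name}

# Usage (requires boto3):
#   import boto3
#   resp = boto3.client("dynamodb").create_backup(
#       **backup_request("Orders", "Orders-2017-11-30"))
#   # resp["BackupDetails"]["BackupArn"] identifies the backup for later restore.
```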

Also: Intuit to use AWS as its standard artificial intelligence platform

Amazon now finally has its own graph database. With Amazon Neptune, Amazon joins the ranks of pioneer Neo4j, plus household names like Microsoft, Oracle, Teradata, IBM, SAP, and DataStax. In Amazon's case, it is a full graph database, not an engine that superimposes a graph view on relational or JSON data structures. Compared to other graph databases, Neptune is less prescriptive about graph query: it supports Apache TinkerPop3 (and its Gremlin language) for property graph queries, and the less popular RDF model, queried with SPARQL, for more semantics-oriented use cases. The guiding notion is providing the right approach for the use case.

Amazon Neptune storage scales automatically without downtime or performance degradation. Neptune is highly available and durable, automatically replicating data across multiple AZs and continuously backing up data to Amazon S3.
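To illustrate the two query styles Neptune supports, here is the same hypothetical question -- who does a given person know -- expressed once as a TinkerPop/Gremlin traversal and once as SPARQL. The vertex labels, edge labels, and RDF predicates are invented for illustration, not taken from any Neptune schema.

```python
# The same hypothetical question in Neptune's two query styles.
# Vertex/edge labels and RDF predicates are invented for illustration.

def gremlin_friends(name):
    """Property-graph form: a TinkerPop/Gremlin traversal string."""
    return f"g.V().has('person', 'name', '{name}').out('knows').values('name')"

def sparql_friends(name):
    """RDF form: a SPARQL query over assumed example predicates."""
    return (
        "SELECT ?friendName WHERE { "
        f'?p <urn:ex:name> "{name}" . '
        "?p <urn:ex:knows> ?f . "
        "?f <urn:ex:name> ?friendName }"
    )
```

The property graph form navigates edges directly from a starting vertex, while the RDF form pattern-matches over subject-predicate-object triples; which reads more naturally depends on how the data is modeled, which is the point of supporting both.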

On the theme that cloud object storage is becoming the de facto data lake, Amazon opened the door last year with several services that exposed S3 to ad hoc SQL query and made it an extended data store for its Redshift data warehouse. Amazon isn't alone: Google, with BigQuery, and Microsoft, with Azure SQL Data Warehouse, have also extended their data warehousing offerings to access cloud storage. Meanwhile, Oracle is adding support in Database 18c for accessing cloud object stores directly, and we expect its rivals won't be far behind.

This year, Amazon is upping the ante with S3 Select, a new service that retrieves just a portion of the data stored within an S3 object using simple SQL SELECT statements. The example Amazon gives: if you aggregate sales for a couple dozen stores within a single S3 object, S3 Select lets you confine your download to just a single store. Amazon claims that, by making S3 data retrieval more selective, you could see up to a 4x performance improvement.

It differs from Athena, which is integrated with AWS Glue to provide a table-based view of S3 data. S3 Select instead uses a lower-level binary wire protocol for accessing data; it also has a Presto connector enabling SQL query from platforms like Amazon EMR. S3 Select is now in preview.
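Amazon's store example might look something like this through boto3's `select_object_content` call; the bucket, key, and the assumption that the first CSV column holds the store identifier are all hypothetical.

```python
# Sketch: parameters for s3.select_object_content, pulling one store's
# rows out of a CSV object. Bucket, key, and column layout are hypothetical.

def s3_select_request(bucket, key, store_id):
    """Parameters to retrieve only one store's sales rows via SQL."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        # s._1 is the first CSV column, assumed here to hold the store id.
        "Expression": f"SELECT * FROM S3Object s WHERE s._1 = '{store_id}'",
        "InputSerialization": {"CSV": {"FileHeaderInfo": "NONE"}},
        "OutputSerialization": {"CSV": {}},
    }

# Usage (requires boto3):
#   import boto3
#   resp = boto3.client("s3").select_object_content(
#       **s3_select_request("sales-bucket", "daily/sales.csv", "store-042"))
#   # The response is an event stream; 'Records' events carry the matching rows.
```

The filtering happens inside S3, so only the matching rows cross the network, which is where the claimed speedup comes from.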

A related offering, Glacier Select, provides access via simple SQL calls to data in Amazon's Glacier archival storage. Like S3 Select, it lets you query just a portion of the data. The difference is that, as an archival system, Glacier Select returns results without requiring you to conduct a time-consuming restore operation first. It is generally available now; next year it will add integration with Athena to extend ad hoc query reach deep down into the archives.
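In Glacier's job-oriented API, a select runs as a job whose results land in an S3 bucket. A rough sketch of the job-parameters payload, assuming the shape of Glacier's `initiate_job` call; the archive ID, expression, and bucket are hypothetical.

```python
# Sketch: jobParameters for glacier.initiate_job with a 'select' job type.
# Archive id, SQL expression, and output bucket are hypothetical; the exact
# payload shape is an assumption based on the Glacier job API.

def glacier_select_job(archive_id, expression, output_bucket):
    """Job parameters to run SQL against an archive without a full restore."""
    return {
        "Type": "select",
        "ArchiveId": archive_id,
        "SelectParameters": {
            "ExpressionType": "SQL",
            "Expression": expression,
            "InputSerialization": {"csv": {}},
            "OutputSerialization": {"csv": {}},
        },
        # Results are written to S3 rather than returned inline.
        "OutputLocation": {"S3": {"BucketName": output_bucket, "Prefix": "results/"}},
    }

# Usage (requires boto3):
#   import boto3
#   boto3.client("glacier").initiate_job(
#       vaultName="sales-archive",
#       jobParameters=glacier_select_job(
#           "ARCHIVE_ID", "SELECT s._1 FROM archive s", "query-results-bucket"))
```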
