Google pushes on with big data: Cloud Dataflow beta and BigQuery update

Along with new features for its Google BigQuery cloud analytics platform, the company's Cloud Dataflow managed data-processing service is now available in beta.
Written by Toby Wolpe, Contributor

Announced last summer and released in alpha in December, Google's Cloud Dataflow managed data-processing service is now publicly available as a beta, with what the company describes as improved elasticity and fine-tuning mechanisms.

The search-to-cloud giant has also unveiled new features for the Google BigQuery cloud analytics product, which like Cloud Dataflow also forms part of the Google Cloud Platform set of modular services.

BigQuery now has improved security and performance, with features including row-level permissions for easier data sharing, a higher default ingestion limit of 100,000 rows per second per table, and geographic data isolation for businesses that want data stored in Google Cloud Platform European zones.

The idea behind Cloud Dataflow is firms use its SDKs to write software that defines batch or streaming data-processing jobs. The service then takes care of running the jobs on Google Cloud Platform resources, using technologies such as Compute Engine, Cloud Storage, and BigQuery.
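The model described above - code defines a job as a series of transforms, and a managed service runs it - can be sketched with a toy pipeline abstraction in plain Python. This is illustrative only and is not the Cloud Dataflow SDK; the class and method names here are invented for the example.

```python
# Illustrative only: a toy pipeline abstraction, not the Cloud Dataflow SDK.
# A job is defined as a chain of transforms; a runner then executes it
# against a data source, which could be batch or streaming.

class Pipeline:
    def __init__(self):
        self.transforms = []

    def apply(self, fn):
        # Each transform is a function from one collection to another.
        self.transforms.append(fn)
        return self

    def run(self, source):
        data = source
        for fn in self.transforms:
            data = fn(data)
        return data

# Define a small batch job: normalise, then filter.
job = (Pipeline()
       .apply(lambda rows: (r.lower() for r in rows))
       .apply(lambda rows: [r for r in rows if "error" in r]))

print(job.run(["INFO ok", "ERROR disk full", "Error timeout"]))
# → ['error disk full', 'error timeout']
```

In the real service, the runner is Google's infrastructure rather than a local loop: the same job definition is handed off and executed across Compute Engine workers, reading from and writing to Cloud Storage or BigQuery.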

Google Cloud Dataflow product manager Eric Schmidt - one of several Google employees of that name, including executive chairman Eric Emerson Schmidt - said the refined elasticity offered by the beta is important in allowing Google to scale resources dynamically to meet a specific job's runtime needs.

"In alpha mode it was OK if the system runs in a relatively static environment, which is what people are used to. But we're effectively showing them, 'It took you 10 minutes to run your job. You can now run it in five minutes for fundamentally the same cost'. We're just deploying more resources and elastically managing it for you," he said.

"You can run faster but get the same accuracy, and you control costs. You can now deploy a cluster that autoscales intelligently."

Schmidt said the introduction of correctness controls to Cloud Dataflow, whose programming model is completely open-sourced, is essential for tuning the accuracy of streaming data.

"Batch systems are highly correct and reliable. We've been doing that for years. But when you move into the streaming world, time becomes your enemy. Time doesn't stop, so messages are coming in from different devices at different data rates, and you want to process them in real time," he said.

"But the challenge is you're never guaranteed to have all the data that you need that represents that window of time, because the upstream system could be lagging. Someone's phone could be having a difficult time, it gets to an edge node and that edge node tips over and it reboots, or it gets to a queuing system and that queuing system has a lag in it."

So the question becomes what should be done with data that is delayed - wait until it catches up or admit the data that has arrived and deal with the late data later?

"It's a very specific concept but it's also extremely powerful. It's a deficiency in pretty much all existing systems," Schmidt said.

The new correctness controls in Cloud Dataflow offer the options of processing late-arriving data but with notifications that it is late, dumping it, again with notifications, or accumulating it and then updating answers later.
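Those three options can be made concrete with a toy sketch in plain Python. This is not the Dataflow API - the function and policy names below are invented - but it shows the three choices for events that arrive after their event-time window has already closed.

```python
# Illustrative only: a toy sketch of late-data policies, not the Dataflow API.
# Events carry an event-time; the window [0, 10) has already fired.

WINDOW_END = 10

def handle_late(events, policy, prior_result=None):
    """Apply one of three late-data policies to events whose event-time
    falls inside an already-closed window. `prior_result` is the sum the
    window emitted earlier (used by the 'accumulate' policy)."""
    late = [e for e in events if e["time"] < WINDOW_END]
    if policy == "process_flagged":
        # Option 1: process late data, but flag each record as late.
        return [dict(e, late=True) for e in late]
    if policy == "drop":
        # Option 2: discard late data, with a notification of how much.
        return {"dropped": len(late)}
    if policy == "accumulate":
        # Option 3: fold the late data into the earlier answer and re-emit.
        return prior_result + sum(e["value"] for e in late)

late_events = [{"time": 3, "value": 5}, {"time": 7, "value": 2}]
print(handle_late(late_events, "drop"))                      # {'dropped': 2}
print(handle_late(late_events, "accumulate", prior_result=10))  # 17
```

The "accumulate and update" case is the interesting one: the system emits a provisional answer when the window closes, then revises it downstream as stragglers arrive.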

Also available with the Cloud Dataflow beta is improved worker, or virtual machine, scaling and management, according to Schmidt, with constant inspection of the throughput of each worker to spot laggards, whose work can then be redistributed.

"Imagine if a network card on that machine is going bad, and packets were being dropped, and its work times are increasing, or maybe your work code is processing a key on a record and that key structure happens to be super bizarre and the algorithm in your code is taking longer to run," he said.

"What would happen in a classic cluster environment is these would continue to lag and the entire stage would be affected, so even though some workers are working faster it can't complete until everything is done.

"If you take elasticity and you combine it with worker optimisation, you now have a model where we're maximising the resources that you're paying for and we're also minimising the clock time."
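The straggler handling Schmidt describes - inspect each worker's throughput, spot laggards, and redistribute their work - can be modelled with a short Python sketch. This is a hypothetical simplification, not how Dataflow's scheduler is actually implemented; the threshold and rebalancing rule here are invented for illustration.

```python
# Illustrative only: a toy model of straggler detection, assuming the
# service tracks per-worker throughput and steals work from slow workers.

def rebalance(queues, throughput, threshold=0.5):
    """Move remaining work items off any worker whose throughput is below
    `threshold` times the mean, onto the fastest worker."""
    mean = sum(throughput.values()) / len(throughput)
    fastest = max(throughput, key=throughput.get)
    for worker, rate in throughput.items():
        if worker != fastest and rate < threshold * mean:
            # This worker is lagging (bad NIC, pathological key, etc.):
            # hand its remaining queue to the fastest worker.
            queues[fastest].extend(queues[worker])
            queues[worker].clear()
    return queues

queues = {"w1": ["a", "b"], "w2": ["c", "d"], "w3": ["e"]}
rates = {"w1": 100.0, "w2": 10.0, "w3": 90.0}
rebalance(queues, rates)
# w2 is far below the mean rate, so its items move to w1 (the fastest),
# and the stage is no longer gated on the slowest machine.
```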

Potential use cases of Cloud Dataflow, which can run in batch or streaming mode over small or vast amounts of data, range from mobile game developers, who need to know in near real time whether what they have just pushed out is now causing critical user behaviour, to healthcare applications.

"The real use scenarios come down to this: people who want to do ETL [extract, transform, load], move data from point A to point B and along the way want to do something to it, filter it, maybe anonymise it, enrich it with other data and then maybe move to some place else to do the analysis or we can also do the analysis for you inline in classic MapReduce style or continuous analysis," Schmidt said.
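The ETL pattern Schmidt outlines - filter, anonymise, enrich, then move the data on - looks something like the following in plain Python. This is a sketch of the concept, not Dataflow SDK code; the field names and the lookup table are invented for the example.

```python
# Illustrative only: the kind of ETL step described above (filter,
# anonymise, enrich), sketched in plain Python rather than the Dataflow SDK.
import hashlib

COUNTRY_BY_PREFIX = {"44": "UK", "61": "AU"}  # hypothetical lookup table

def etl(records):
    for rec in records:
        if rec.get("phone") is None:
            continue  # filter: drop incomplete records
        phone = rec["phone"]
        yield {
            # anonymise: replace the raw phone number with a short hash
            "user": hashlib.sha256(phone.encode()).hexdigest()[:8],
            # enrich: join in a country code from a side lookup
            "country": COUNTRY_BY_PREFIX.get(phone[:2], "unknown"),
            "event": rec["event"],
        }

rows = [{"phone": "441234", "event": "login"},
        {"phone": None, "event": "click"}]
out = list(etl(rows))
# Only the complete record survives, anonymised and enriched with "UK".
```

In the managed service, the same logic would run as pipeline transforms, with the "move to someplace else" step being a write to BigQuery or Cloud Storage for downstream analysis.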
