Amazon AWS promises to let analysts do drag-and-drop data cleansing with DataBrew

The new program, an extension to the existing Glue software, lets non-coders such as data scientists and data analysts participate in the data preparation step using templates and drag-and-drop activities.
Written by Tiernan Ray, Senior Contributing Writer

Amazon today announced it has extended Glue, its program for data cleansing, with a visual user interface that automates some of the steps necessary to prepare data, simplifying the task for non-coders.

Called DataBrew, the program lets data analysts and data scientists carry out the steps known as extract, transform, and load, or ETL, which happen before any data can be analyzed in a data warehouse or another repository. 

Whereas Glue, which was introduced in 2016, was a visual tool for engineers to do ETL with some coding involved, DataBrew is meant to let analysts and data scientists perform the same data cleansing operations simply by clicking buttons and selecting radio buttons in a visual user interface.

AWS describes the service as consisting of "250 pre-built transformations to automate data preparation tasks (e.g. filtering anomalies, standardizing formats, and correcting invalid values) that would otherwise require days or weeks writing hand-coded transformations."


In a demonstration video, AWS shows how the DataBrew program can, for example, remove special characters, such as an ampersand, from a database entry, since such characters can't be used in data analysis.
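For comparison, the kind of hand-coded transformation this replaces might look like the following minimal Python sketch. It is not DataBrew's implementation or API; the function name and the sample entries are invented for illustration, and it assumes "special characters" means anything other than letters, digits, and whitespace.

```python
import re


def strip_special_characters(value: str) -> str:
    """Remove every character that is not a letter, digit, or whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9\s]", "", value)
    # Collapse the double spaces left behind where a character was removed.
    return re.sub(r"\s+", " ", cleaned).strip()


# Hypothetical sample entries, not taken from the AWS demonstration.
station_names = ["Broadway & W 58 St", "8 Ave & W 31 St"]
cleaned_names = [strip_special_characters(name) for name in station_names]
```

Each cleaned entry keeps its words and numbers but drops the ampersand.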

Similarly, a text string can be mapped to numeric values to make the entries analyzable, using a "categorical mapping function."

So, for example, a "user type" column whose entries are either "subscriber" or "customer" can be mapped to the values "1" and "2" by clicking the mapping button in the user interface and selecting a radio button, which produces a new column in which 1 and 2 stand in for the corresponding text entries.
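Written by hand, a categorical mapping of this sort could be sketched in plain Python roughly as follows. This is an illustration only, not how DataBrew does it; the helper name, the name of the new column, and the choice to treat labels case-insensitively and store the codes as integers are all assumptions.

```python
# Mapping from text label to numeric code, as in the "user type" example.
USER_TYPE_CODES = {"subscriber": 1, "customer": 2}


def add_mapped_column(rows, source, target, mapping):
    """Return copies of the rows with `target` holding the code for `source`."""
    mapped = []
    for row in rows:
        new_row = dict(row)  # leave the original row untouched
        label = str(row.get(source, "")).strip().lower()
        # Unknown or missing labels become None rather than a guessed code.
        new_row[target] = mapping.get(label)
        mapped.append(new_row)
    return mapped


# Hypothetical sample rows for illustration.
rows = [
    {"user type": "subscriber"},
    {"user type": "customer"},
    {"user type": "Subscriber"},
]
mapped_rows = add_mapped_column(rows, "user type", "user type mapped", USER_TYPE_CODES)
```

As in the article's description, the original text column is kept and the numeric codes land in a new column alongside it.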

A profiling function offers statistics about the data set, such as the number of missing entries.
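To give a sense of what one such statistic involves, here is a small Python sketch that counts missing entries per column. It assumes that a missing entry is one that is absent, None, or blank; the function name and the sample records are hypothetical, and DataBrew's own profiling may define "missing" differently.

```python
def count_missing(rows, columns):
    """Count, per column, the entries that are absent, None, or blank."""
    missing = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            value = row.get(column)
            if value is None or str(value).strip() == "":
                missing[column] += 1
    return missing


# Hypothetical sample records for illustration.
records = [
    {"user type": "subscriber", "birth year": 1985},
    {"user type": "", "birth year": None},
    {"user type": "customer"},
]
missing_counts = count_missing(records, ["user type", "birth year"])
```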

The Amazon initiative will presumably provide newfound competition for companies that specialize in data cleansing, such as Talend.

Amazon said it already had some customers using the software, including Japanese telecom giant NTT DoCoMo and energy giant bp plc.

For further information, there is also a Glue DataBrew blog entry on the product.
