Trifacta goes all in on the cloud

The last of the independent data prep players still standing is now making a major pivot to the cloud that will take the equivalent of its Google Cloud Dataprep service across AWS and Azure. Beyond data prep, the new Trifacta cloud service will cover data cleansing, validation, profiling, and monitoring of data pipelines.
Written by Tony Baer (dbInsight), Contributor

Trifacta, which has become the last pure-play data prep tools provider still standing, sees its future as a broader-based cloud software-as-a-service (SaaS) offering. This week, it is unveiling a new Data Engineering Cloud that will be delivered as a fully managed service on each of the major clouds. That will be in addition to, not instead of, Wrangler, its long-established on-premises prep suite.

Trifacta's niche will continue to be serving as the front end design studio where the data engineer, data scientist, or business developer creates the "recipes" for data preparation and transformation. The Trifacta Data Engineering Cloud will extend beyond data prep to encompass cleansing, validation, profiling, and the monitoring of data pipelines. But those pipelines will run in the downstream execution tool of choice. The Trifacta Data Engineering Cloud service won't replace the Databricks or Snowflakes of the world, but instead let users run data prep inside them.

In the run-up to the announcement, Trifacta has had a good dress rehearsal for the SaaS service as the OEM partner behind Google Cloud Dataprep. The GCP offering put the Trifacta suite on a cloud-native platform running on Kubernetes (K8s). While it was initially focused on ELT working with Google BigQuery and cloud storage, it recently gained a premium tier that supports non-Google data sources such as Oracle, SQL Server, MySQL, PostgreSQL, and Salesforce.com. The premium edition serves as a prelude to the new Trifacta Data Engineering Cloud offering, which uses the microservices and K8s architecture of the Google offering as the cookie-cutter template for rollout to other clouds.

Beyond multi-cloud support, the Trifacta offering broadens beyond the no-code, drag-and-drop tool for business analysts to provide multiple pathways for designing data preparation. It now offers three views. The original "grid" view provides the spreadsheet-style view for data preparation tasks, where values are reconciled to the right columns. Added to that are a "flow" view, which shows the entity relationships familiar to SQL developers, and a "code" view that is suited to Python programmers. While SQL developers can use dbt (data build tool) to write transformations as SQL SELECT statements, data scientists can write transforms in Python from their Jupyter notebooks; the results populate Trifacta recipes that are handed down to execution environments. A library of 180+ connectors is also provided. Once the recipes are created, they can be integrated through APIs into the data pipelines or workflows of external tools or services, such as Databricks.

When Trifacta emerged roughly a decade ago, data preparation was targeted at data lakes and viewed as a rough-cut alternative to traditional ETL tools. It typically used a spreadsheet-like interface where rudimentary machine learning capabilities would suggest column names and spot specific types of data patterns such as street addresses, names, or personally identifiable data such as account numbers. The tool would then suggest which columns could be consolidated, along with modest corrections to make data more correct or uniform.

These capabilities eventually became commoditized and, as such, ended up getting incorporated into ETL suites, data science tools, data catalogs, and so on. Unlike the old days of enterprise data warehousing, where IT or database developers handled data transformation, data preparation became a broad-based responsibility as end users, from business analysts to data scientists, clamored for self-service. Instead of forcing these users into separate tools, data prep became ubiquitous in their existing workspaces and tools of choice.

Not surprisingly, most of Trifacta's pure-play rivals have either disappeared or been acquired, among them Paxata, bought by DataRobot less than a year and a half ago. At this point, Alteryx, which also positions itself as an "analytics process automation" workbench for citizen data scientists, remains Trifacta's best-known rival.

With core data prep functions commoditized, the new Trifacta offering goes beyond them with "predictive" transformation that autodetects data formats and structures and infers transformation logic; "adaptive" data quality that statistically profiles data to identify complex patterns and suggest transformation rules; and "smart" data pipelines that model data flows. While data integration, data science, and analytic tools cover basic data prep, Trifacta is positioning its Data Engineering Cloud as a more deluxe service.

With the new cloud service, Trifacta is rolling out consumption-based pricing, in contrast to the traditional licensing of its Wrangler on-premises suite. It's an expected route for SaaS providers. For Trifacta, it is intended to open up the addressable market beyond large enterprises that begin with six-figure investments, through tiers that start with free trials and starter subscriptions at $80/month.

The service is patterned on, and expands on, the OEM service that Trifacta has delivered with Google for the past three years. There will be feature parity across AWS and Azure, in addition to GCP. Nonetheless, GCP will remain first among equals as a jointly supported and sold OEM offering natively integrated with BigQuery.

Trifacta's challenge is akin to that of third-party databases or analytic tools that are not the captive of a specific cloud provider, analytics tool, or data science workspace. It's the classic choice between umbrella platform vs. best of breed, and single cloud vs. multi-cloud. For Trifacta, the target is enterprises whose data assets and analytic platforms are heterogeneous and likely to remain so. With APIs, Trifacta aims to embed its data engineering services into the workflows of whatever runtimes business analysts, data engineers, or data scientists are using. Thanks to its three years running an OEM service on Google Cloud, Trifacta is not entering the world of SaaS as a rookie.
