How Trifacta is helping data wranglers in Hadoop, the cloud, and beyond
Trifacta is known for doing one thing, and doing it well: data wrangling. Because of this, the company has an informed, data-driven view on the big data and not-so-big data market. Trifacta's insights have driven its latest product release, but they also help sketch the bigger picture of big data.
The focus of Trifacta is enabling people who know their data best (analysts & business people) to effectively explore, structure, and join together diverse data sources for a variety of business purposes.
The company also just released a new product, which gave us a good opportunity to talk with Joe Hellerstein, Trifacta's co-founder and CSO, and Joe Scheuermann, VP of marketing, about machine learning, data wrangling, Hadoop, and more. According to Scheuermann:
"The majority of data wrangling work was previously done by technical users in coding environments or complex ETL systems, so a lot of our users are relatively new to this type of work and benefit greatly from intelligent suggestions for how they should explore or prepare their data. These predictive transformation capabilities are powered by machine learning.
Every click, drag, or select within Trifacta leads to a prediction where the system intelligently assesses the data at hand to recommend a ranked list of suggested transformations for users to evaluate or edit. For more advanced users, the automated guidance and parsing of data accelerates the efficiency of their work.
Machine learning and intelligence is a critical aspect of our user experience and a core company strategy. With our free Wrangler desktop product, we've been able to build the largest global community of data wranglers and we're able to leverage all of this rich, anonymized usage metadata information to constantly improve upon the intelligent suggestions and guidance in our product by training our machine learning algorithms."
This means Trifacta is sitting on tons of data about using data. But the company's founding vision, Scheuermann says, "was not exclusively focused on wrangling big data. When the company was originally created out of 15 years of joint research between Stanford and UC Berkeley, the focus was on helping anyone who works with data have a more productive and enjoyable experience making data useful for analysis."
For Hellerstein, it was data and a bit of a friendly push that took him from being a renowned academic researcher to co-founding a startup that has turned into a company operating worldwide. He was no stranger to "the real world", having served as an advisor and board member at a number of companies, but when data from the initial incarnation of the Wrangler tool started accumulating, it pointed to the fact that they were onto something.
"is because we saw tremendous inbound demand for an edition of Trifacta that supported the data wrangling requirements of teams that were working with data outside of Hadoop. Given the diversity and scale of wrangling challenges in platforms such as Hadoop, it was an obvious starting point and continues to be a focus area of our development and go-to-market efforts around our Wrangler Enterprise product.
Nearly all large enterprises that we engage with have a Hadoop initiative in various stages of implementation and adoption, so we continue to see a growing and maturing market for big data technologies. But we see demand for wrangling everywhere, on a user's desktop, in the cloud, amongst teams with diverse data in various systems and in the Hadoop/big data realm. Our focus is to help customers address these wrangling challenges regardless of the data's size, shape or location."
To the cloud and beyond
So have we reached "Peak Hadoop" yet? According to Scheuermann:
"It's less of a question of 'Spark vs. Hadoop' but more of a question of a sea change in how organizations are approaching analytics/big data initiatives. The cloud brings a lot of advantages for deploying infrastructure and with the constant innovation being driven by cloud service providers, we're seeing a lot of momentum with customers moving in that direction.
Our latest v4 release included extensive support for deploying Trifacta in these environments and integration with various cloud services is a big focus of ours moving forward. That being said, a lot of organizations will never move certain data to the cloud and we will continue to support their use cases whether on a Hadoop data lake or some other environment."
But how is Trifacta going to evolve in the future? Is there a danger of being superseded by vendors with extensive ecosystems that are already beginning to include intelligent data wrangling tools in their offerings, following in the footsteps of Wrangler? Hellerstein sees this as proof of Trifacta's success:
"When we started, we had to build software, but we also had to educate people on what it is we do and why it's valuable. Today others are trying to copy us, so we must be doing something right. Being part of an ecosystem can be tempting, but it does not necessarily mean you can displace a better product. We do believe we have a better product, and we are part of an ecosystem of our own via our partnerships. We have a good headstart and we are constantly evolving to stay ahead of the curve, so competition is welcome.