.NET for Apache Spark brings enterprise coders and big data pros to the same table

A year ago, Microsoft enabled .NET developers to work with Apache Spark using C# or F#, instead of Python or Scala. More functionality and performance enhancements have since been layered on. The result shows how software world demarcations can be transcended.
Written by Andrew Brust, Contributor

Enterprise software development and open source big data analytics technologies have largely existed in separate worlds. This is especially true for developers in the Microsoft .NET ecosystem. The reasons for this are many, including .NET's Windows heritage and the open source analytics stack's allegiance to Linux.  

But Microsoft's .NET Core, already in its third major version, is cross-platform, running not just on Windows but also on Linux and macOS. And Apache Spark, which largely eclipsed Hadoop as the open source analytics poster child, has made its way into numerous Microsoft platforms, including its flagship SQL Server database and Azure Synapse Analytics, Redmond's latest gambit in the cloud data warehouse wars. Despite these developments, coders on the Spark platform have largely stuck with Scala, Python, R and Java. What had been missing was something that connected the dots between .NET and Spark.

Casting a .NET

All this changed a year ago, when, at the Spark and AI Summit, Microsoft introduced the preview of its .NET for Apache Spark framework, which provides bindings for developers using the C# and F# languages on the .NET platform. And that plot thickened a couple of weeks ago, when Microsoft extended .NET for Apache Spark to support in-memory .NET DataFrames, something Brigit Murtaugh, Program Manager for .NET for Apache Spark, announced in a blog post.

I've been involved with .NET since it was still in its alpha days 20 years ago. And I've been involved in the big data world for almost half that time. I've wanted to see these two worlds converge and have argued for such a union. That aside, I hadn't really investigated the .NET for Apache Spark framework (hereafter, Spark.NET) until now, choosing instead to hobble along mostly in Python. Having now examined the framework more carefully, I like what I see and wanted to report back on it. The good news: Spark.NET works well and, beyond integrating the two technologies, makes their respective programming paradigms dovetail very nicely.

Getting started 

Microsoft has worked hard to make the Spark.NET barrier to entry quite low. Case in point: The .NET for Apache Spark website provides a big white "Get Started" button that guides developers through the process of installing the framework, creating a sample wordcount application and running it. It takes the developer through the installation of all required dependencies, configuration steps, installation of .NET for Apache Spark itself, and the creation and execution of the sample application.
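To give a flavor of what the tutorial builds, here's a sketch of a wordcount application against the Microsoft.Spark API as I understand it (file name and app name are placeholders; this requires the Microsoft.Spark NuGet package and a local Spark installation to actually run):

```csharp
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

namespace MySparkApp
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Spark session; connects to the local Spark instance
            SparkSession spark = SparkSession
                .Builder()
                .AppName("word_count_sample")
                .GetOrCreate();

            // Read the input file; each line becomes a row in the "value" column
            DataFrame lines = spark.Read().Text("input.txt");

            // Split each line into words, explode to one word per row,
            // then group and count occurrences
            DataFrame words = lines
                .Select(Explode(Split(lines["value"], " ")).Alias("word"))
                .GroupBy("word")
                .Count()
                .OrderBy(Col("count").Desc());

            words.Show();

            spark.Stop();
        }
    }
}
```

Anyone who has written PySpark or Scala Spark code will recognize the DataFrame idiom immediately; the C# version reads almost line-for-line the same.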

The entire guided procedure is designed to take 10 minutes, and assumes little more than a clean machine as the starting environment.  In large part it succeeds (with the caveat that I had to research and manually set the SPARK_LOCAL_IP environment variable to localhost to get the sample to run on my Windows machine), and I have to say it's quite a rush to get it working.
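For reference, the workaround and the final run step looked roughly like the following on my Windows machine (the microsoft-spark JAR file name varies with the Spark and package versions you installed, so the name below is a placeholder):

```shell
REM Workaround: bind Spark's driver to localhost
set SPARK_LOCAL_IP=localhost

REM Submit the compiled .NET app to Spark via the DotnetRunner bridge class
spark-submit ^
  --class org.apache.spark.deploy.dotnet.DotnetRunner ^
  --master local ^
  microsoft-spark-<spark-version>-<package-version>.jar ^
  dotnet MySparkApp.dll
```

Note that spark-submit launches a JAR, not your .NET assembly directly; the DotnetRunner class acts as the bridge, spawning the dotnet process and brokering communication between it and the JVM.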

Pick your environment

The tutorial is designed to do everything from the command line, including editing an input text file and the C# code, compiling the application, and running the resulting .NET console application by calling Spark's spark-submit utility. But experienced .NET developers who prefer to work in Visual Studio 2019 can use Spark.NET from there as well.

I verified this myself, in fact. After working through the Get Started tutorial, I created a new C# console application in Visual Studio 2019, used the NuGet package manager to add Spark.NET to my project, then replicated the coding steps in Microsoft's command line-oriented tutorial. After compiling everything in Visual Studio, I submitted the job to Spark and everything ran just fine.
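The equivalent steps outside Visual Studio are just a few dotnet CLI commands (Microsoft.Spark is the actual NuGet package name; the project name here is arbitrary):

```shell
# Scaffold a new console project and step into it
dotnet new console -o MySparkApp
cd MySparkApp

# Pull in the Spark.NET bindings from NuGet
dotnet add package Microsoft.Spark

# Build the assembly that spark-submit will later hand off to Spark
dotnet build
```
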

Ready for Spark prime time

After getting things to run locally on a dev machine, you'll want to try running on a full-fledged Spark cluster. These days, that's likely to be in the cloud. The tricky part is that you'll need to make sure Spark.NET is installed on the cluster before your own code can run. Microsoft says Spark clusters on its own Azure HDInsight service, as well as Spark pools in Synapse Analytics (currently in preview), already have Spark.NET on board.

Beyond that, though, Microsoft provides explicit instructions for deploying .NET for Apache Spark to Azure Databricks and to the Databricks Unified Analytics Platform service that runs on Amazon Web Services. Still not impressed? Microsoft also provides installation instructions for AWS' ubiquitous Elastic MapReduce (EMR) service.

Also read: Databricks comes to Microsoft Azure

You can deploy your .NET assembly to your Spark cluster and run it on a batch basis from the command line if you wish. But, for C# developers, Microsoft has also enabled the very common scenario of working interactively in a Jupyter notebook. That support includes a Jupyter notebook kernel that leverages the C# REPL (read-eval-print loop) technology, which is highly innovative in and of itself. Microsoft provides an F# kernel as well.

When you combine notebook support with Microsoft's enabling of Spark.NET-based Spark SQL UDFs (user-defined functions), support for .NET DataFrames, and that implementation's abstraction over Apache Arrow RecordBatch objects, you can see that Microsoft has worked hard not only to bring Spark into the .NET world, but also to bring .NET into each of several Spark programming use cases. It's also made things perform well -- Apache Arrow supports the sharing of columnar data in-memory, eliminating the overhead of converting the data into and out of different formats in order to process it.
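As a small illustration of the UDF support, here's a sketch based on my reading of the Microsoft.Spark API (it assumes a DataFrame named df with a string column "name" is already in scope, so treat it as illustrative rather than a complete program):

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

// Define a Spark SQL UDF as an ordinary C# lambda.
// Udf<TIn, TOut> wraps it so Spark can serialize it out to the workers,
// returning a Func<Column, Column> usable in DataFrame expressions.
Func<Column, Column> toUpper = Udf<string, string>(
    name => name == null ? null : name.ToUpper());

// Apply the UDF like any built-in Spark function
DataFrame result = df.Select(toUpper(df["name"]).Alias("name_upper"));
result.Show();
```

The appeal here is that the UDF body is plain C#, so existing .NET business logic can be dropped into a Spark pipeline with little ceremony, and the Arrow-based data exchange keeps the .NET-to-JVM hop from becoming a bottleneck.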

Also read: Apache Arrow unifies in-memory Big Data systems

What's the point?

Seasoned Spark developers are unlikely to switch from, say, Python to C# to do their work, and Microsoft has no illusions about that. But the number of lines of .NET code out there, created over the last 20 years, is staggering. Bringing even a small fraction of that code into the world of open source big data has a lot of value. So too does bringing the legions of .NET developers into the world of analyzing high-volume data sitting in data lakes, as well as the streaming data and machine learning use cases that Spark enables.

In other words, Microsoft's goal here is to make the worlds of enterprise software development, analytics and data science converge. Blending those communities, use cases and skill sets, rather than leaving them in separate silos, is logical and laudable, so there's that. But, more important, if we're going to be serious about data-driven decision-making, pervasive data culture and digital transformation, unification of these communities and sub-disciplines must happen -- doing so is critical, not discretionary.

What's more, Microsoft is integrating the communities and the tech by elegantly blending their paradigms, rather than making one subservient to the other. That subtlety provides practitioners in each community a portal to see the wonders of the other, and not just smoosh them together in some contrived fashion that would make for a worst-of-both-worlds outcome. Instead, the ethos of pragmatism and platform openness that Satya Nadella has engendered at Microsoft has made its way all the way down to a developer framework. There's nothing but upside in that.
