Last November, the US National Security Agency open-sourced its Niagarafiles, or NiFi, data-flow software.
That uncharacteristically overt operation raised a few eyebrows, although in fairness to the NSA it was not its first gift of technology to the open-source community. It had contributed the Accumulo database to Apache three years earlier.
The NiFi data-flow orchestration tool, drafted in as part of the NSA's duty "to respond to foreign-intelligence requirements", now finds itself on the front line of Internet of Things technology, according to Hortonworks CTO Scott Gnau.
"With the Internet of Things, there are a couple of really big technological problems that need to be solved. First, compared with traditional processing and even data at rest, sensors will send data in, and you may need to communicate back out to a sensor and in some cases you may want sensors to communicate with each other," he said.
"Instead of a one-way traditional streaming or data flow, it's bidirectional and point to point. That's a really big difference technologically and from a requirements perspective."
That distinction is what marks NiFi out from technologies such as stream-processing framework Apache Storm and real-time micro-batching tool Spark Streaming.
"Those things are unidirectional streaming mechanisms and they are certainly very good technology. There are a lot of use cases, and obviously we integrate with those technologies," Gnau said.
"But as they relate to the Internet of Things, there are some discrete limitations of those technologies. You would have to go deploy them individually for each point-to-point communication, which would not be particularly scalable.
"They're technologies that may sound very similar but they have very different use cases and they're frankly all part of the different platforms that you may want to deploy in a broad base of data architectures."
Hortonworks will be releasing its NiFi distribution later this month in the form of Hortonworks DataFlow, with a sandbox download available for experimentation, together with documentation, implementation tips and techniques, and support and consulting services.
The company describes DataFlow as a package of software that orchestrates, manages, and validates Internet of Things data flows.
"It's almost like the evolution of power companies from traditional models, where they would build a power plant and ship electrons through the network to customers in a one-way path - that's traditional data movement," Gnau said.
"Fast-forward to today where there are solar panels on the grids. So now there are a thousand different generation points and a million different customers. Optimising the flow of electrons is a very different problem to go solve. It's kind of like that in data flow for the Internet of Things."
The second important area, which NiFi can also help address, is the way Internet of Things use cases change the notion of the perimeter of control.
"In traditional data processing, in the data lake and all that, the secure perimeter is around your datacenter or, if you've deployed in the cloud, it's around your servers in your cloud. But it's a very finite group of servers that you need to protect," he said.
"In the new world, security, privacy, and data protection go out to the sensor, to what we're calling the jagged edge, because it can be millions of different sensors with different protocols in different places. Being able to create a secure passage and understand that perimeter is a very new and interesting problem that Apache NiFi is uniquely good at solving."
The third area is managing the huge and complex network of data flows created by the movement of information between devices.
"The management and infrastructure is a little bit more complex than traditional data flow and streaming. So there's this notion of data provenance coming onboard that's very interesting: being able to understand that the data you got from a sensor is incomplete, and being able to go back and get it automatically instead of processing, being able to trust and understand that the data is valid from point to point through each of the hops in the network," Gnau said.
"There are many reasons why provenance is interesting. Obviously from a privacy perspective, from a completeness of analytics perspective, and even potentially from a regulatory perspective, the traceability of that provenance will be very interesting as well. This too is something that Apache NiFi brings to the table."
Part of the way NiFi operates involves a process that runs on the sensor to collect the data and includes the security wrapper and the functionality for provenance.
"That, combined with the other-edge virtual machines, really creates that network where the security perimeter is managed from end to end with encryption and provenance built in," Gnau said.
The nature of NiFi also allows users to manage their bandwidth more efficiently, which is a significant issue given the potentially vast volumes of data generated by Internet of Things apps and the physical and economic restrictions on bandwidth that exist in many parts of the world.
"If you imagine the use case for IoT tends to be moving metal - things that move - in a connected car, I may have a sensor on my brakes that indicates that they've overheated and sends that information over a wireless network back to the manufacturer who's watching out for a warranty and trying to be proactive in caring for and maintaining the vehicle," he said.
"That's all well and good. But if you can imagine millions of those devices across different wireless networks in different parts of the world where bandwidth may be constrained, it's important to be able to prioritise the data that gets sent and maybe only send the summary level information and if an anomaly is detected. Then from the central processing area you can go back and request more data from that particular unit."
Alongside that functionality, which allows only relevant data to be sent, NiFi provides the processing algorithms and a graphical user interface to help monitor and manage the bidirectional data flow.
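The summary-first pattern in the connected-car example can be sketched in a few lines. This is an illustration under stated assumptions rather than anything taken from NiFi: the `Device` class, the `summarise` and `collect` functions, and the 400-degree brake threshold are all hypothetical. The device sends a compact summary by default, and the central side asks for the full readings only when the summary flags an anomaly.

```python
from statistics import mean

# Hypothetical threshold, chosen only for the example.
BRAKE_TEMP_LIMIT_C = 400.0


def summarise(readings, limit=BRAKE_TEMP_LIMIT_C):
    """Reduce raw readings to a small summary that is cheap to transmit."""
    peak = max(readings)
    return {
        "count": len(readings),
        "mean": mean(readings),
        "max": peak,
        "anomaly": peak > limit,
    }


class Device:
    """Stand-in for a sensor-side agent holding readings locally."""

    def __init__(self, readings):
        self.readings = list(readings)

    def send_summary(self):
        return summarise(self.readings)

    def send_full(self):
        # Only called when the central side explicitly asks for detail.
        return list(self.readings)


def collect(device):
    """Central side: take the summary, then request detail only on anomaly."""
    summary = device.send_summary()
    detail = device.send_full() if summary["anomaly"] else None
    return summary, detail


if __name__ == "__main__":
    for car in (Device([120.0, 130.0, 125.0]), Device([120.0, 450.0])):
        summary, detail = collect(car)
        print(summary["max"], "full data requested:", detail is not None)
```

The second request travels from the centre back out to the device, which is the bidirectional step a one-way streaming pipeline does not provide on its own.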
"Other examples could be if you have sensors that are sending in imaging data from crops, you may want to send the thumbnails in first. If you see something that looks unusual, you can go back and get the high-resolution version," he said.
"All that functionality is built-in on top of the superstructure. It's not a Storm or a Spark kind of thing. It's really suited towards devices and sensors communicating with each other and optimising that communication path."
Strategically, the allying of NiFi with Hadoop - data in motion and data at rest - is an important move for Hortonworks.
"Of course the data in motion will probably land somewhere. In all likelihood a lot of it will land in Hadoop. A lot of analytics will be created from this data and stream back out into the network of data in motion for better optimisation or for pushing actions back out to devices, which is a kind of a closed-loop operational thing," Gnau said.
"I see them interacting very heavily, a symbiotic relationship where obviously a lot more data is being created in Internet of Things than anything we've ever seen. That's an opportunity to do more expanded analytics using the traditional tools - and even tools maybe yet to be developed in the Hadoop area. The interface back out to NiFi is also critically important."