How Amazon's DeepLens seeks to rewire the old web with new AI
In a world where users interact with cloud-based servers using not only audio but video, text may already be out of fashion. That may be perfectly fine with Amazon, which is testing a deep learning-enabled camera like it's going for broke.
Every Amazon Echo, Google Home, Sonos One, or similar device in your house that recognizes your voice, and every smartphone through which you've ever spoken with Siri, Alexa, or Cortana, has an open and direct channel to servers someplace on the back end that are running artificial intelligence. By volume, the category of AI application that is at or near the top of the most actively deployed list is not decision making or forecasting but speech recognition. As bandwidth becomes more plentiful and functions on the cloud more accessible, object recognition from video or photographs will not lag behind for long.
In November 2017, Amazon Web Services launched a real-world experiment to determine just how soon object recognition may become both viable and reliable. Like Echo, DeepLens is an Amazon device suitable for at-home deployment. Unlike Echo, it's an Intel Atom X5-based, Ubuntu Linux-powered, quad-core computer, with an attached camera whose purpose is to scan for something or someone in particular. Behind the device, on the server side of the system, Amazon's servers run algorithms that scan the incoming contents of the video in search of something familiar: Any part of the image that its databases may have already tagged and identified.
Imagine what an AI-powered camera could do
"We believe that you'll be able to get started running your first deep learning, computer vision model in ten minutes, from the time that you've unboxed the camera," said AWS CEO Andy Jassy, announcing the product during his company's re:Invent conference in November 2017.
"You can program this thing to do almost anything you can imagine," continued Jassy. "For instance, you can imagine programming the camera with computer vision models where, if you recognize a license plate coming into your driveway, it'll open the garage door. Or you could program it to send you an alert when your dog gets on the couch."
As you've caught me saying in ZDNet before, that word "imagine" can be the tripwire. "Imagine" is the word vendors invoke when they need to leverage your own mind to fill in some gaps in their own products. DeepLens is far from ready for the retail consumer market. Thankfully, other than that dangerous word "imagine," it does not pretend to be.
DeepLens is an experiment, which for $249 you can partake in yourself, if you have some experience training convolutional neural networks (CNNs, and no, I didn't just make that up). It's an important experiment from Amazon's side of the system as well, specifically as an effort to make cloud-based AI services profitable.
In October 2017, Amazon cleverly inserted a front porch camera as part of the release of its Amazon Key delivery system. Then in February 2018, Amazon acquired Ring, a firm that had already produced a doorbell-based camera that alerts users through the web that people are at their doors. But DeepLens is not just about the front porch. Its objective appears... well, deeper: To establish the connections that bind the web, in an era when typing will be the standby method of communicating with the search engine.
Google became the champion of web search when it found a way to reap revenue from every query -- through keyword relevance-based advertising. As more searches and similar transactions become conducted using voice and, eventually, video -- which consume orders of magnitude greater bandwidth than text -- the cost of processing these transactions rises. And since the keyword ads market long ago became commoditized, raising rates is no longer an option.
If the keyword ads market collapses, the web as we know it will go away. So the time to discover the next generation of relevant, profitable transactions is now. DeepLens ventures beyond the territory of Echo, into the unsettled embryo of a new and largely indeterminate market, with all the tenacity of an organization with both the capital and wherewithal to make one or two missteps along the way... if it gives Amazon a chance to leverage its growing expertise in deep learning to corner a new industry.
Imagine a world, Amazon may be saying to itself, where Google isn't already 85 percent of the way toward whatever it is you dream to accomplish.
There's a handful of sample apps, one of which was inspired by an episode of the HBO series "Silicon Valley." It attempts to identify a hot dog, based on data it's already collected from having analyzed pictures and videos of hot dogs. (This last sentence should tell you everything you really need to know about DeepLens' readiness for prime time.)
There are a few others: One which estimates the number of degrees of tilt for a person's head; another which attempts to discern images of cats from images of dogs in a group of both; one which attempts to recognize the actions associated with about three dozen activities, including cutting food with a knife and playing the drum; and interestingly, one which attempts to apply the artistic style of a painting it's studied to a real-world image.
At a machine learning conference in London in April 2018, AWS technical evangelist Julien Simon demonstrated one of DeepLens' sample applications, identifying chairs, people, and bottles in the same room. What the demo managed to reveal is that, at least at moment one, DeepLens doesn't do much on its own. When one attendee placed a protein shake bottle in front of the camera, DeepLens only took a few seconds to correctly identify it as a bottle.
"Yay, deep learning!" declared Simon, though not too fervently. "That's the thing, it has only 20 things to pick from. So it's unlikely that it's gonna say, 'Sheep,' but you never know... Of course, what you want to do is to train your own models. You want to run your own project... That's the idea, really: To let people train and build fun apps."
"For AI in general, a lot of projects at the very beginning suffer from a paucity of data," remarked my long-time friend and colleague, and fellow ZDNet contributor, Ross Rubin. "As you have more developers bringing more scenarios to the table, the underlying technology can improve."
True consumer-grade artificial intelligence, at the level required to reliably identify multiple classes of objects collectively (as opposed to just one class, like people's faces) has yet to be developed. What Amazon is truly experimenting with here is the notion that its consumption models may be applied to people who can develop these capabilities -- that people would gladly pay for the opportunity to make future applications feasible.
"Clearly, many open source projects begin with the germ of an idea, and very basic functionality in the early days," added Rubin. "They attract interest and improve, and in many cases, become either competitive or best-in-class over time."
Beneath the surface, but not very far beneath, DeepLens is an experimental toolkit clearly geared for high-level programmers accustomed to low-level transactions with Linux command line-based clients. It may look more like a high-end, motion-tracking add-on for a game console, but in practice it's a lot more like the Altair 8800 in 1975 -- a toaster-sized box begging for something new to do.
Is it really open source?
For the most part, the software libraries used in the creation of DeepLens software are open source. That fact does not restrict the applications one produces with them to being open source as well. Depending on the license, commercial products may be built with open source underpinnings. A glance over the DeepLens license reveals that Amazon restricts its developers to using DeepLens "for personal, educational, evaluation, development, and testing purposes, and not to process your production workloads." That last part suggests DeepLens may not be used as part of someone else's commercial product development.
The license does not go so far as to imply that Amazon has de facto ownership, or right of first refusal, over anything a developer may create with it. However, it does suggest there's little else a developer may do with it besides experiment, limiting his commercial options to negotiating with Amazon. Participants in the DeepLens Challenge competitions have seen their works published to GitHub, presently the most prominent open source distribution channel.
Isn't Google doing something similar?
One month prior to Amazon's 2017 announcement, Google declared the existence of its own project, called Google Clips, utilizing a device it believes is small enough to be wearable. (Perhaps on Halloween, or to scare your children into believing you're part-Dalek.) Unlike DeepLens, Clips is not an AI experimental toolkit, but rather a device that utilizes motion detection, along with what Google qualifies as AI, to capture what it detects to be just the right moments without the need for human interaction. Clips is limited to this purpose only, but it maintains Google's place in the discussion of AI in an Internet of Things.
What DeepLens is really made of
As a whole, the DeepLens system is comprised of the Intel Atom / Ubuntu computer at the client side, the extensive AWS network of AI and services on the server side, and a less-than-insignificant bit of wire and paste in the middle called the internet. The entire system is required in order for DeepLens to complete any single task -- which for now falls under the category of identifying objects from still or motion video.
The purpose of the client-side device is to collect the video data needed for the very large data sets required to train the system as to how to recognize objects. By "recognize," I mean to be able to flag a pattern of adjacent pixels from the middle of a photo or a moving video as being similar enough to a pattern that has been observed before. And by "object," ironically, I'm being very subjective. What the system will characterize as an "object" is a pattern defined by rules that the AI algorithms will have determined for themselves. In other words, I can't exactly explain what makes an object an "object," other than the presumption that the system will have identified characteristics that, for its purposes, qualify as similarities. Think of how a baby learns to mimic sounds her parents make (without Amazon having printed a manual for her first) and you'll get something objectively similar to the point I'm trying to make.
The fact that DeepLens' hot dog recognition app is correct as often as it is, is somewhat impressive. It has already, to a reasonable extent, been trained. More precisely, the app has already been fed very large data sets (for now, that's the term for them), all of which the app trusts to represent the same thing. Its algorithms have identified correlations, which serve as its best-guess-estimates for why all these pictures include the same thing. The test is whether these correlations are meaningful enough for the app to effectively know a hot dog when it sees one.
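One way to make "correct as often as it is" concrete is to score a classifier's guesses against pictures whose contents are already known. What follows is only a minimal sketch of that bookkeeping, not anything taken from DeepLens itself: the function name `evaluate` and the six sample results are hypothetical, invented purely for illustration.

```python
def evaluate(predictions, truths):
    """Score yes/no guesses ("is this a hot dog?") against known answers."""
    pairs = list(zip(predictions, truths))
    true_pos = sum(1 for p, t in pairs if p and t)        # said yes, was a hot dog
    false_pos = sum(1 for p, t in pairs if p and not t)   # said yes, was not
    false_neg = sum(1 for p, t in pairs if not p and t)   # said no, but it was
    true_neg = sum(1 for p, t in pairs if not p and not t)

    accuracy = (true_pos + true_neg) / len(pairs)
    # Guard against dividing by zero when the app never says "yes"
    # or when the test set contains no hot dogs at all.
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical results for six test pictures (made-up data).
predictions = [True, True, False, True, False, False]
truths = [True, False, False, True, True, False]

scores = evaluate(predictions, truths)
print(scores)
```

Accuracy alone can flatter a model when hot dogs are rare in the test set, which is why the sketch also reports precision (how often a "yes" was right) and recall (how many real hot dogs were caught).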
What DeepLens needs in order to do anything serious
To seed interest in the device, AWS gave away its first batch to select developers attending its re:Invent 2017 conference. There, they could attend the first sessions devoted to how to program the thing.
"Probably the first thing we need is to be able to experiment in the cloud," said AWS' Julien Simon. "Obviously, even if we want to deploy at the edge, on camera, any kind of sensor, drones, etc., we still need to work with data sets in the cloud."
At the heart of the DeepLens system in the cloud is, for lack of any other abbreviation, the CNN. We could spend several volumes defining this phrase (believe me, I've tried), but I can instead give you a few basic images that give you the gist of the idea.
In computing, a neural network is a model for representing values based upon an old theory (which, ironically, neurologists contend has since been proven wrong) about how the brain learns to recognize patterns. The model involves associations, which the theory describes as relationships that are memorized. Those associations are physically represented by neurons, like sockets in an electrical chain linked by wires. As an association grows stronger, its wire is "yanked," if you will, by a representative weight that gives it higher priority when the brain tries to remember and recall something. What the weight is yanking -- the physical substance of the association -- is called an axon.
So a data set, such as an image or a set of frames from a video, impresses itself in sequence upon the neural network. As certain impressions are made more frequently, their representative weights grow heavier, and their axons are given higher priority.
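The "heavier weights for more frequent impressions" idea can be shown with a toy. This is a deliberately simplified sketch and not how DeepLens or any production framework trains a network (real systems adjust weights by measuring error, typically with gradient descent). The function `strengthen`, the learning rate, and the sample impressions are all assumptions made up for this example.

```python
def strengthen(weights, impression, rate=0.1):
    """Add a little weight to every association seen in one impression."""
    for association in impression:
        weights[association] = weights.get(association, 0.0) + rate
    return weights


# Hypothetical features noticed in three successive images (made-up data).
impressions = [
    {"long", "bun"},
    {"long", "bun", "mustard"},
    {"round", "bun"},
]

weights = {}
for impression in impressions:
    strengthen(weights, impression)

# The association seen most often ends up with the heaviest weight.
strongest = max(weights, key=weights.get)
print(strongest, round(weights[strongest], 2))
```

Here "bun" turns up in all three impressions, so it carries the most weight when the toy network is later asked what these images have in common.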
The convolutional part (the "C" part of the phrase, which would sound especially awesome if it were pronounced by James Earl Jones) is actually, quite literally, convoluted. It has two simultaneously correct meanings, one of which may be the ability to have two simultaneously correct meanings. Neurologists perceive a model where associations are twisted together like a jelly roll or spiraling strands of DNA. As these convoluted strands experience several layers of processing, over time the properties and characteristics of the things the brain perceives (size, color, shape) may take on actual, physical proportions -- a true handle on things that have similarities. That way, when the brain fires a signal that's essentially categorical (for example, a search for all things bigger than a breadbox), it can recall objects belonging to the class as easily as it can recall any single object. At least, that's the theory, assuming you believe it.
In computing, a convolutional model is one where tables of values can be transposed into long sequences, and/or vice versa. A CNN actually does the latter to achieve the former.
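For readers who would rather see the operation than read about it, here is a minimal sketch of the sliding-window arithmetic that deep learning frameworks call convolution (strictly speaking, cross-correlation). It treats an image as a plain table of brightness values. The tiny image and the edge-finding kernel are made up for illustration, and nothing here is drawn from DeepLens' own code.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# A 3x4 grayscale "image": dark on the left, bright on the right.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]

# A small kernel that responds where brightness jumps from left to right.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_kernel)
print(feature_map)  # [[0, 18, 0], [0, 18, 0]]
```

The output lights up only in the column where dark meets bright. A CNN stacks many such kernels in layers, and learns the kernel values from training data rather than having them written by hand.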
The software components of DeepLens
When you play with DeepLens, you're experimenting with the mechanisms with which such a model is produced. Here are some of the software tools you would use as part of DeepLens:
TensorFlow -- The leading open source framework for automating the distribution of actions and events over the various layers in DeepLens' data sets. A tensor is an arrangement of values in a space with any number of dimensions ("with n dimensions"). The values inherent in a single grayscale video frame would translate into a 2D tensor; color channels and successive frames each add a further dimension. A typical programming language assigns individual values to symbols or units of memory; with TensorFlow, you distribute actions over areas.
Amazon SageMaker -- Announced during the same week as DeepLens, SageMaker is AWS' cloud-based service for automating the training of models in a deep learning system. The hot dog recognition app was already trained; you would use SageMaker to create a new model for detecting a different object, or perhaps a category of objects with a similar characteristic. This is an Amazon product unto itself, requiring the user to have a fairly significant investment in AWS cloud services (ah, there's the rub!), including S3 storage buckets.
AWS Lambda -- Amazon's function-as-a-service platform, using what many call a "serverless" model (you don't think about the server much when you use the function). Unto itself, Lambda runs functions on its own servers, and renders the results as solutions. Many of DeepLens' necessary functions are provided to the system as Lambda functions.
AWS Greengrass -- Because it's impractical for the entire machine learning operation for video to be processed in the cloud, AWS devised a system for Lambda functions to be transmitted to remote devices -- for example, Internet of Things components -- and executed there, at "the edge." Intel's Atom X5 box is one of these components. With Greengrass, a Lambda function is still "consumed" the way it would be if it were being processed in Amazon's cloud, except it's actually being processed locally.
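To give a feel for what a function running at the edge might look like, here is a rough sketch written in the handler style Lambda uses for Python, a function that receives an `event` and a `context`. Everything else is an assumption for illustration: the shape of the event, the `CONFIDENCE_THRESHOLD` value, and the detections themselves are hypothetical, and the part where a real DeepLens project grabs frames and runs a model is deliberately left out.

```python
CONFIDENCE_THRESHOLD = 0.6  # hypothetical cutoff, chosen only for this sketch


def lambda_handler(event, context):
    """Keep only the confident detections, so less data travels upstream."""
    detections = event.get("detections", [])  # assumed event shape, not a real schema
    confident = [
        {"label": d["label"], "score": round(d["score"], 2)}
        for d in detections
        if d["score"] >= CONFIDENCE_THRESHOLD
    ]
    return {"objects": confident}


# Made-up detections standing in for one frame's model output.
sample_event = {
    "detections": [
        {"label": "bottle", "score": 0.91},
        {"label": "sheep", "score": 0.07},
    ]
}

result = lambda_handler(sample_event, None)
print(result)  # {'objects': [{'label': 'bottle', 'score': 0.91}]}
```

The point of the design is that the same function body could, in principle, run in the cloud or on the device. Running it locally means only the short list of confident labels needs to leave the house, not the video itself.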
When AWS' Simon refers to "the edge," he means the outer border of a communications system that is closest to its messages' destination. As regular readers of ZDNet may recall, there isn't really one edge (which makes the use of "the" in this context a bit presumptive). Here, what Simon is referring to is the DeepLens device's capability to pre-process, cleanse, and transform the video data to some extent, before uploading it to AWS' servers.
If bandwidth were ubiquitous and cloud servers infinitely powerful, the DeepLens device could be little more than a pocket calculator with a telephoto lens. It could send streaming video directly to the cloud, and you'd have your answer as to whether the images therein contain a hot dog or not, within a second or two.
In such an environment, there wouldn't need to be an "edge." All the processing could take place in the cloud, on the back end.
If you've ever uploaded a video to YouTube, you know full well why this isn't possible yet. The bandwidth does not yet exist, not just for real-time video uploads but, in addition, instant analysis.
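Some back-of-the-envelope arithmetic shows the scale of the problem. The sketch below assumes uncompressed 1080p video at 30 frames per second with three bytes per pixel, and a hypothetical 10 megabit-per-second home upload link. These are illustrative figures, not DeepLens specifications, and real video is compressed heavily before upload, which narrows the gap considerably without closing it.

```python
def raw_video_bits_per_second(width, height, bytes_per_pixel, frames_per_second):
    """Bit rate of uncompressed video: pixels x bytes x 8 bits x frames."""
    return width * height * bytes_per_pixel * 8 * frames_per_second


# Assumed figures, for illustration only.
bits_per_second = raw_video_bits_per_second(1920, 1080, 3, 30)
uplink_bits_per_second = 10_000_000  # hypothetical 10 Mbit/s upload link

# How many seconds of uploading one second of raw video would take.
seconds_per_video_second = bits_per_second / uplink_bits_per_second

print(bits_per_second)                     # 1492992000
print(round(seconds_per_video_second, 1))  # 149.3
```

At roughly 1.5 gigabits per second raw, one second of footage would take about two and a half minutes to push through that link, long before any analysis even begins.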
So DeepLens has to have some processing power right next to its user, at what Simon and Amazon perceive as the edge. Now that a great many people, myself included, have fully functional Linux computers in their pockets, it should shock no one that DeepLens' Intel Atom X5 device is a complete computer in a form factor close to that of an external hard drive or a hardback book.
The DeepLens form factor
The DeepLens device communicates with its user (whom it presumes to be a knowledgeable developer) using a "terminal" -- a Linux command line. While it could have used a graphical front-end or a web browser, Intel assumes the developer already has these things on her PC anyway, with which she can already communicate with AWS, and access the DeepLens Console. Adding another browser to the mix would be redundant.
With real-time digital video, there's no way yet to transmit the data, store it, then analyze it fast enough for the results of that analysis to still be valuable. That's why the Intel device is really a full-scale computer.
If you already own a full-scale computer, you may be asking yourself, why doesn't AWS simply sell the camera as a PC peripheral, and distribute the AI application for it over the web? Certainly that's feasible. But with the PC market subsiding, and Windows 10 generating about as much consumer enthusiasm as a flu shot, the first real generation of consumer-oriented neural network-based AI will also need to be something that captures consumers' hearts as well as their minds.
Is DeepLens intelligent?
"Artificial intelligence" is a largely subjective term. As I have defined it over the past decades, it has a clear borderline that may be relocated as people or machines either get generally smarter or go the other way.
To display artificial intelligence is for a mechanism of any kind to render a result whose outcome appears, to a reasonable person, to have required intelligence. Put another way, if something looks like the product of a smart person but is not the work of any living entity at all, its practitioner is likely to be AI.
At its outset, speech recognition was designated an AI category. Over time, as it has become more commonplace, human beings have attributed less and less intelligence, artificial or otherwise, to the concept of machines recognizing your voice and responding reasonably. In a way, it's the reasonable part that works against AI, dispelling its mystique and exposing it as more the product of high intelligence than the producer of it.
But Ross Rubin believes that the mystique won't go away. It'll just transfer itself to the next round of amazing, if not yet functional, applications. Today, Apple and others are experimenting with examining the faces of television viewers to detect expressions that could reveal their true sentiments about the programs they're watching. Rubin took that to the next step, in envisioning a way such an application could be made pertinent for an individual user.
"Someone who is giving a presentation in a conference," Rubin projects, "might be able to get real-time feedback, perhaps in an anonymized way: 'Hey, buddy, you're losin' 'em! This idea you proposed is sinkin' like a rock!' Or, 'You've definitely touched on something that there's some sensitivity to in this group.' That could be valuable feedback in an organization."
When a consumer product based on or inspired by DeepLens finally makes itself available, from Amazon or whoever crosses that milestone first, and it makes a judgment call that the person dressed as a meter reader in your back yard is not actually a meter reader, his successful arrest will probably make headlines and earn 20 seconds' mention on the evening news. Five years later, when these sorts of events are commonplace, the algorithms responsible for these correct judgment calls may no longer be generally considered artificially intelligent, but rather just reliable and practical.