A new study in deep learning, titled "How to make a pizza: Learning a compositional layer-based GAN model" and recently published on arXiv (PDF), explores how machine learning can be used to transform a single image of a dish into a step-by-step guide on how to create it.
The PizzaGAN project is described as an experiment in how to teach a machine to make a pizza by recognizing aspects of cooking, such as adding and subtracting ingredients or cooking the dish.
The Generative Adversarial Network (GAN) deep learning model is trained to recognize these different steps and objects, and by doing so it is able to view a single image of a pizza, peel apart each object or change as a 'layer,' and reconstruct a step-by-step guide to cooking it.
"Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image," the research paper explains. "The proposed model is able to decompose an image into an ordered sequence of layers by applying sequentially in the right order the corresponding removing modules."
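The adding and removing operators the paper describes can be illustrated with a toy sketch. In the real model these are trained GAN generators operating on pixels; here, purely as an assumption for illustration, an "image" is just an ordered list of ingredient layers, and the function names (`add_layer`, `remove_layer`, `decompose`) are hypothetical:

```python
# Toy sketch of PizzaGAN-style layer operators (hypothetical names).
# The paper's operators are trained GAN generators; here each one
# simply pushes or peels an ingredient "layer" on an image modeled
# as an ordered list of layers.

def add_layer(image_layers, ingredient):
    """Adding module: return a new image with the ingredient on top."""
    return image_layers + [ingredient]

def remove_layer(image_layers, ingredient):
    """Removing module: return a new image with the topmost
    occurrence of the ingredient peeled off."""
    layers = list(image_layers)
    for i in range(len(layers) - 1, -1, -1):
        if layers[i] == ingredient:
            del layers[i]
            break
    return layers

def decompose(image_layers):
    """Apply removing modules sequentially, top to bottom, to recover
    the ordered sequence of layers, as the paper describes."""
    sequence = []
    layers = list(image_layers)
    while layers:
        top = layers[-1]
        layers = remove_layer(layers, top)
        sequence.append(top)
    return sequence

print(decompose(["dough", "sauce", "cheese", "mushrooms"]))
# → ['mushrooms', 'cheese', 'sauce', 'dough']
```

Reversing the recovered sequence gives the build order, which is exactly the step-by-step guide the model outputs.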
Making a pizza breaks down into steps such as rolling out the dough, adding sauce and cheese, and then adding or removing various toppings.
As each task is completed, the appearance of the pizza changes, and if images of each step are fed into the neural network, the machine can begin to recognize and connect each procedure to the finished product.
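The idea of pairing each intermediate view with the operation that produced it can be sketched as follows. This is a loose illustration, not the paper's training setup: states stand in for images, and `build_sequence` is a hypothetical helper:

```python
# Toy sketch: walking through a pizza build so that each intermediate
# "image" (here just a tuple of layers) is paired with the operation
# that produced it, mirroring how per-step views relate to procedures.

def build_sequence(steps):
    """Return (intermediate_state, operation) pairs, one per step."""
    state = []
    pairs = []
    for ingredient in steps:
        state = state + [ingredient]
        pairs.append((tuple(state), "add " + ingredient))
    return pairs

for state, op in build_sequence(["dough", "sauce", "cheese", "olives"]):
    print(op, "->", state)
```

Each pair links a visual state of the pizza to the step that created it, which is the association the network has to learn.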
The pizza dataset in its first stage was composed of roughly 5,500 images, all of which were synthetic and created in a clipart style. The team behind the research say that this method saved time and allowed them to separate the toppings from the base for improved results in the neural network.
Once the synthetic images had been added to the dataset, the team fed the machine an additional 9,213 "real" pizza images harvested from the web. A total of 12 toppings were also added to the dataset, including pictures of arugula, bacon, broccoli, corn, basil, mushrooms, and olives.
"Given a test image, our proposed model detects first the toppings appearing in the pizza (classification)," the team says. "Then, we predict the depth order of the toppings as they appear in the input image from top to bottom (ordering)."
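The two-stage inference the team describes, detect which toppings are present, then predict their depth order, can be sketched with stand-in scores. The dictionaries and thresholds below are invented for illustration and stand in for the network's real outputs:

```python
# Toy sketch of PizzaGAN's two-stage inference (made-up scores):
# (1) classification - detect which toppings appear in the image,
# (2) ordering - predict their depth order from top to bottom.

def classify_toppings(scores, threshold=0.5):
    """Return the set of toppings whose detection score clears the threshold."""
    return {t for t, s in scores.items() if s >= threshold}

def order_toppings(present, depth_scores):
    """Sort detected toppings from top (highest depth score) to bottom."""
    return sorted(present, key=lambda t: depth_scores[t], reverse=True)

# Hypothetical network outputs for a single test image:
detection = {"bacon": 0.9, "olives": 0.7, "corn": 0.2}
depth = {"bacon": 0.3, "olives": 0.8, "corn": 0.1}

present = classify_toppings(detection)
print(order_toppings(present, depth))
# → ['olives', 'bacon']
```

Corn is dropped at the classification stage, and the surviving toppings are then ranked top to bottom, giving the order in which removing modules would peel them off.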
Shown a single input image of a pizza, PizzaGAN was then able to generate a step-by-step guide to its creation as output.
So far, PizzaGAN has performed this task with a strong degree of accuracy, although, understandably, the best results came from the synthetic image dataset.
It's an interesting project whose applications could eventually go beyond food to other, more valuable digital systems, as the researchers point out.
"Though we have evaluated our model only in the context of pizza, we believe that a similar approach is promising for other types of foods that are naturally layered such as burgers, sandwiches, and salads," MIT says. "It will be interesting to see how our model performs on domains such as digital fashion shopping assistants, where a key operation is the virtual combination of different layers of clothes."