AI still writes lousy poetry

Machine-generated poetry doesn’t show much thought, but it’s getting good enough to receive favorable nods from human readers.
Written by Tiernan Ray, Senior Contributing Writer

Her eyes, twin pools of mystic light,
Forever in her radiance white—, 
She sought the bosom of the Night. 
Away it came, that mystic sight!

— Anonymous human writer in collaboration with poetry algorithm  

A survey of recent literature in the machine learning category of artificial intelligence shows steady progress in the development of techniques for automatically generating poetry. 

The output remains fairly mediocre, but it is getting good enough that some human readers will give the poems respectable marks in controlled evaluations. And some people will even be fooled into ascribing human authorship to machine poetry. 

While the essentials of the most sublime form of human writing escape AI, the software is proficient enough to generate facsimiles that can pass a test. 

The poem at the top of this article was created in part by a machine. A Google software program called Verse by Verse, introduced in March, takes a piece of input from a human, the first line of text: "Her eyes, twin pools of mystic light." It then continues the stanza, automatically producing the subsequent three lines.

This quatrain, as it's called, was favorably regarded by crowd workers whom Google enlisted to compare the human-plus-machine creations with poems written entirely by a human. The program and the evaluations are described in the introductory paper by Google researchers David Uthus, Maria Voitovich and R.J. Mical.

Verse by Verse's ability to produce text is a result of the program having ingested a corpus of twenty-two different poets' work. On the program's Web site, a user is invited to choose up to three of the famous poets as "muses" to complete a poem for which the user supplies the first line.

The results are an amusing parlor game and a lousy poem. Here is what happens when Verse by Verse is fed the first line of Rainer Maria Rilke's Duino Elegies:


The four-line output requires the user to pick from among multiple suggested lines generated by the algorithm, each line conditioned to be in the vein of one of the three chosen poets. While some other combinations could be more desirable — choice and possibility are signature elements of automatic text — it's unlikely any ground-breaking verse would emerge even with many attempts.


The poetry machine: Google's Verse by Verse takes in a body of work as input to train a Transformer neural network to generate text in general, and then the final output of any user session can be conditioned on input samples of specific poets.

Uthus et al.

Verse by Verse is an example of a trend over the past decade among AI scholars in academia, and researchers at Google and other large corporations, to refine machine learning programs that either generate a completely new work or extend a phrase input by a person. 

The works all utilize what are called language models, programs that use machine learning to build a statistical representation of how words typically fall together in a sentence. The paradigmatic language model is GPT-3, a program released last year by the San Francisco startup OpenAI that has taken the world by storm. 

Many commentators have been absolutely smitten by the seemingly human text of GPT-3. The New Yorker's Stephen Marche has written that GPT-3 can write like Franz Kafka, citing a snippet of The Metamorphosis rewritten by GPT-3. 


In fact, the redux really doesn't sound like the original. But Marche and others have caught on to the main achievement of GPT-3 and other language models, which is to replicate the surface qualities of word combinations, which can include stylistic emulation. 

Poetry has been a popular choice to push the limits of what such language models can capture because most poetry is marked by formal qualities that can be statistically measured including meter, rhyme scheme, and assonance.  

Recent work on AI poetry seeks to mimic those formal elements with increasing rigor. 

Kevin Yang and Dan Klein of U.C. Berkeley in April published a paper on their invention, called FUDGE, which can automatically generate the second line of Shakespeare's couplets, spontaneously replicating iambic pentameter, the stress pattern that Shakespeare utilized so effectively. 

The result is clunky couplets but decent formalism. Here is Shakespeare's original Sonnet 48:

How careful was I when I took my way,
Each trifle under truest bars to thrust,
That to my use it might unused stay
From hands of falsehood, in sure wards of trust!
But thou, to whom my jewels trifles are,
Most worthy comfort, now my greatest grief,
Thou best of dearest, and mine only care,
Art left the prey of every vulgar thief. 
Thee have I not locked up in any chest,
Save where thou art not, though I feel thou art,
Within the gentle closure of my breast,
From whence at pleasure thou mayst come and part;
      And even thence thou wilt be stol'n I fear,
      For truth proves thievish for a prize so dear.

In FUDGE, Yang and Klein take the first line of the ending couplet, "And even thence thou wilt be stol'n I fear," and have the program write a new second line. What comes out is not as poetic, and bears no trace of the extended metaphors of the sonnet:

And even thence thou wilt be stol'n, I fear,
for this shall be the end. That's pretty clear.

Technologically, FUDGE is a tour de force. Yang and Klein have taken GPT-3's predecessor from 2019, GPT-2, and tweaked it. (GPT-2 is available for download, which makes it a popular choice for language model development, unlike GPT-3, whose use is restricted by OpenAI.) 

GPT-2 and GPT-3 don't know anything about iambic pentameter; they merely ape whatever example style they are given. By adding some lines of code, Yang and Klein were able to force FUDGE to reliably keep to iambic pentameter. Hence, FUDGE couplets fulfill a stylistic obligation in a consistent fashion. 
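The article does not describe the added code, so the following is only a toy sketch of one general way to hold a language model to a meter: take the model's ranked next-word candidates and penalize any that would break the unstressed/stressed alternation. The stress lexicon, the probabilities, and the function names are all invented for illustration; this is not FUDGE's implementation.

```python
import math

# Hypothetical stress lexicon: 0 = unstressed syllable, 1 = stressed syllable.
# Real systems would need a pronunciation dictionary; these entries are toy values.
STRESS = {
    "for": [0],
    "this": [1],
    "shall": [0],
    "never": [1, 0],
    "today": [0, 1],
}


def fits_iambic(words, max_syllables=10):
    """True if the words so far alternate unstressed/stressed within one line."""
    pattern = [s for w in words for s in STRESS[w]]
    if len(pattern) > max_syllables:
        return False
    return all(s == i % 2 for i, s in enumerate(pattern))


def pick_next_word(prefix, lm_candidates, penalty=1e-6):
    """Add a meter score to each candidate's language-model log-probability."""
    scored = []
    for word, lm_logp in lm_candidates:
        meter_prob = 1.0 if fits_iambic(prefix + [word]) else penalty
        scored.append((word, lm_logp + math.log(meter_prob)))
    return max(scored, key=lambda pair: pair[1])[0]


# Invented next-word probabilities standing in for a language model's output.
candidates = [
    ("never", math.log(0.6)),
    ("shall", math.log(0.3)),
    ("today", math.log(0.1)),
]
unconstrained = max(candidates, key=lambda pair: pair[1])[0]
constrained = pick_next_word(["for", "this"], candidates)
print(unconstrained, constrained)
```

In this toy case the model alone would prefer "never," which breaks the meter after "for this," while the rescored choice is "shall." The style is enforced from outside; the model itself still knows nothing about meter.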

Yang and Klein, to their credit, are under no illusions about the quality of the generated couplet. "Shakespeare is only included as a whimsical point of reference," they write, "our generations obviously do not hold a candle to Shakespeare's originals."

When machine poems fall flat, it is all the more obvious within a very particular formal tradition. Take limericks, those beloved five-line poems that have a consistent arrangement of syllables and a consistent rhyme scheme. 

Michael Palin, of Monty Python fame, has offered up limericks of his own creation:

They said of a midwife called Paula,
If there was any trouble just call her.
Her skills in the water
She learnt from a porter
Who delivered fish, fresh, off a trawler.

Whether or not such poems are funny, and they're generally supposed to be, they usually have one complete idea that is played out in the five lines, with a kind of twist or surprising turn that is designed to produce delight.

To try to auto-generate limericks, Jianyou Wang and colleagues at Duke University in March debuted LimGen. LimGen uses what's called a template, a set of rules for how limerick lines are formed, such as a subject plus a verb plus an object. The templates are derived from 300 example limericks, a relatively small selection. 

Wang and team add to the template another algorithm, popular in language models, called a beam search. It automatically scores the text generated by the template program to select the best output, as a kind of voting authority.
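For readers unfamiliar with beam search, here is a minimal generic sketch of the technique. It keeps only the few highest-scoring partial lines at each step instead of committing to the single best next word. The toy word table and its probabilities are invented for illustration, and LimGen's actual scoring is not described here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Beam = Tuple[List[str], float]  # (words so far, cumulative log-probability)


def beam_search(
    start: Sequence[str],
    expand: Callable[[Sequence[str]], List[Tuple[str, float]]],
    steps: int,
    beam_width: int,
) -> List[Beam]:
    """Extend partial lines word by word, keeping the top `beam_width` each step."""
    beams: List[Beam] = [(list(start), 0.0)]
    for _ in range(steps):
        candidates: List[Beam] = []
        for words, score in beams:
            options = expand(words)
            if not options:
                candidates.append((words, score))  # nothing to add; carry forward
                continue
            for word, logp in options:
                candidates.append((words + [word], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


# Hypothetical next-word probabilities, keyed by the previous word.
TOY_MODEL = {
    "a": [("loud", math.log(0.55)), ("quiet", math.log(0.45))],
    "loud": [("waitress", math.log(0.5)), ("bird", math.log(0.5))],
    "quiet": [("bird", math.log(0.9)), ("waitress", math.log(0.1))],
}


def toy_expand(words: Sequence[str]) -> List[Tuple[str, float]]:
    return TOY_MODEL.get(words[-1], [])


greedy = beam_search(["a"], toy_expand, steps=2, beam_width=1)
wide = beam_search(["a"], toy_expand, steps=2, beam_width=2)
print(greedy[0][0], wide[0][0])
```

With a beam width of 1 the search commits to "loud" and never sees the higher-scoring phrase that begins with "quiet." A width of 2 keeps both openings alive long enough to find it.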

The results in a sense work, by feeling reminiscent of limericks, but there's something flat about them:

There was a honest man named Dwight 
Who lost all his money in a fight.
His friends were so upset,
They were willing to bet, 
And they did not like feeling of spite. 

There was a loud waitress named Jacque, 
Who poured all her coffee in a shake. 
But the moment she stirred,
She was struck by a bird,
Then she saw it fly towards the lake. 

Although there is, roughly, continuity in these limericks, there is a strange dissolution toward the end of each stanza, as if the development of the idea has been surrendered to the formal constraints.


The limerick machine: A template assembles sentences automatically matching known constraints for what each line of a limerick should be, and then an automatic search function, called beam search, picks out the best candidate lines. 

Wang et al.

Given the unsatisfactory results of raw output, more programs will probably emulate Google's human collaboration in Verse by Verse. 

The usual term in AI for that combined effort is human-in-the-loop. A striking inversion that has emerged is "computer-in-the-loop." 

Imke van Heerden and Anil Bas, scholars from Koç University and Marmara University in Istanbul, in March debuted a computer-in-the-loop approach to enlist humans to effectively edit a machine-generated text into a poem. They focus on Afrikaans, one of the official languages in South Africa and other countries in the region, a language that hasn't traditionally gotten a lot of attention in AI language models.

Van Heerden and Bas's language model program, called AfriKI, for "Afrikaanse Kunsmatige Intelligensie," Afrikaans artificial intelligence, is explicitly trying to enhance, rather than displace, human work. 

"Whereas [natural language generation] in its quest for full automation may frown upon human involvement, our human-centered framework does the opposite," they write. 

"This study demonstrates that human-machine collaboration could enhance human creativity."

AfriKI ingests all 208,616 words of a single Afrikaans-language novel, Die Biblioteek aan die Einde van die Wêreld (The Library at the End of the World) by Etienne van Heerden. 

In a process that seems similar to Google's Verse by Verse, AfriKI generates hundreds of prose phrases, and the human chooses which phrases to use, and the order in which they should be assembled into a stanza. 

The results are short pieces that have some vivid imagery and some interesting uses of metaphor: 

Die konstabel se skiereiland

Afrika drink
onheil in die water.
Die landskap kantel sy rug
in sigbewaking en vlam. 
Ons oopgesnyde sake
brandtrappe vir die ander state. 
Hierdie grond word intimidasie.

The constable's peninsula

Africa drinks
disaster in the water.
The landscape tilts its back
in surveillance and flame. 
Our cut-open affairs
fire escapes for other states. 
This soil becomes intimidation.

As the authors note, there is enough figurative language and metaphor here to be reminiscent of some schools of poetry. "The language can be described as minimalist, evocative and abstract, and therefore open to interpretation, resembling Imagist and Surrealist poetry."

Up to a point. The poems still seem to be mostly tinged with colors, brush strokes, without having an idea. 

It's easy to see the common pitfall to which all of the language models succumb. Machine learning programs are transformation machines: their utility is to transform some input data into output in an automated fashion. 

Language models take example text, such as poems, and transform it into scores that sum up the relatedness of words by the frequency of their co-occurrence, as well as many other measurable things such as the frequency of sounds and syllable counts. 
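A rough sense of what such statistics look like can be had from a few lines of code. The sketch below is a deliberately crude illustration, not how GPT-2 or any system mentioned here works: it counts word frequencies, pairs of words that share a line, and the last three letters of each line as a stand-in for rhyme sound. The function names and the three-letter rhyme shortcut are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Three lines from the quatrain at the top of this article.
LINES = [
    "Her eyes, twin pools of mystic light,",
    "She sought the bosom of the Night.",
    "Away it came, that mystic sight!",
]


def tokenize(line):
    """Lowercase the words and strip surrounding punctuation."""
    words = (w.strip(".,!?;:").lower() for w in line.split())
    return [w for w in words if w]


def corpus_statistics(lines):
    """Return word counts, same-line word-pair counts, and line-ending counts."""
    word_counts = Counter()
    pair_counts = Counter()
    ending_counts = Counter()
    for line in lines:
        words = tokenize(line)
        word_counts.update(words)
        # Unordered pairs of distinct words appearing in the same line.
        pair_counts.update(combinations(sorted(set(words)), 2))
        if words:
            # Crude proxy for rhyme: the last three letters of the final word.
            ending_counts[words[-1][-3:]] += 1
    return word_counts, pair_counts, ending_counts


word_counts, pair_counts, ending_counts = corpus_statistics(LINES)
print(word_counts["mystic"], pair_counts[("light", "mystic")], ending_counts["ght"])
```

Even on three lines the tallies pick up the repeated "mystic" and the shared "-ght" ending. Nothing in them records what the lines are about.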


In that way, AI is performing a data compression action, compressing entire libraries into economical bundles of stats. The act of decompression reconstitutes the formal patterns of language in the generated text. 

What escapes such a process is another kind of compression, the human poet's compression of associations on a vastly larger scale. Poetry plays around the edges of things and what is left unsaid is what is able thereby to emerge.

Here are Romeo and Juliet romancing one another with word play:

ROMEO: If I profane with my unworthiest hand
This holy shrine, the gentle sin is this
My lips, two blushing pilgrims, ready stand
To smooth that rough touch with a gentle kiss

JULIET: Good pilgrim, you do wrong your hand too much,
Which mannerly devotion shows in this;
For saints have hands that pilgrims' hands do touch,
And palm to palm is holy palmers' kiss.

The lines contain not only the formal relationship of sound and imagery in the internal structure. They also contain the play of turning over ideas, regarding them from different angles, refracting them like light.  

Whether that can be captured in a statistical model, perhaps a more sophisticated one, is an interesting question. But right now, the state of the art in AI misses the point. 

The giveaway is the way that AI researchers speak of their endeavors. The various language algorithms are all working on "the problem of poem generation," as one paper puts it. But generation is probably the wrong term.

In his Letters to a Young Poet, Rilke wrote about the importance of solitude as something that strips away the business of the world, making clear what is essential. 

Consider again the first line of Rilke's Duino Elegies:

Wer, wenn ich schriee, hörte mich denn aus der Engel Ordnungen?

Who, if I cried, would hear me among the hierarchy of angels?

The novelist William Gass has said that Rilke didn't generate the Elegies so much as receive them. 

"'The Duino Elegies were not written,' observes William Gass, 'they were awaited,'" as the critic Lewis Hyde quotes him. (Gass himself said that his preoccupation with formal qualities of writing was an infantile phase, an obstacle. "It wasn't until I was ready to come out of my formal phase that I began to read Rilke," he has said.)

The human poet, rather than being a transformation machine, is something more like a finely tuned antenna, picking up what is already out there. The solitude that Rilke referred to allows that kind of attention. 

AI, as a transformation machine, runs in the opposite direction of solitude, quiet, silence. AI in a sense fears a vacuum. Its goal is to reconstitute total information, Big Data. More often than not, automatic language cannot help adding more stuff.

Some AI studies openly acknowledge the drawbacks of merely replicating formal qualities. A study by IBM researchers in 2018, called Deep-Speare, asked human crowd workers to judge whether a Shakespearean sonnet was actually by The Bard or was machine-generated. 

While many poems by the machine were judged successful on formal grounds — rhyme and meter, say — human crowd workers found poems dissatisfying in terms of emotional impact. So did a professor of English, Adam Hammond of the University of Toronto. 

As the authors write,

Despite excellent form, the output of our model can easily be distinguished from human-written poetry due to its lower emotional impact and readability. In particular, there is evidence here that our focus on form actually hurts the readability of the resulting poems.

Nevertheless, AI's predilection for information overload will only increase as more scholars subject human poetry to analytical techniques that employ massive surveys of texts, which can then be sliced and diced. 

For example, Thomas Nikolaus Haider of the University of Stuttgart, and Steffen Eger of the Technical University, Darmstadt, in 2019 compiled a corpus of 75,000 poems in German by 269 authors, from the 16th century to the present.  It is "the largest corpus of poetry to date," they note. 

The authors analyzed the "tropes" in the poems, meaning recurring patterns of expressing a given concept in language. These are, once again, things that can be quantified. 

Using a familiar machine learning technique, where juxtapositions of words are given a numeric score, the authors found that tropes such as the expression "love is magic" have increasing prevalence in the German Romantic period, the 18th and 19th centuries. They compare that to phrases that have diminishing currency, such as "the drums of love."

The point is, the study of the statistics of poetry, enabled by large collections of data and new analytical tools, lends support to the sense that there are patterns, at least formal patterns, that underlie the creative impulse, and that there must therefore be something that can be captured by the appropriate program.  

The punchline to all this is that most humans appear unable to tell the difference between a machine's writing and that of a person, and if they can tell at all, they don't necessarily care. 

In a paper in January, amusingly titled "Artificial Intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry," Nils Köbis and Luca D. Mossink of the University of Amsterdam and the Max Planck Institute asked people to pick which they preferred between two poems that each began with the same line, one completed by a person and one completed by GPT-2.

Across multiple test set-ups, the authors found that "people are not reliably able to identify human versus algorithmic creative content." 

Moreover, many people showed they were just fine with the machine-created poems even when they were told up-front that they were reading the work of an algorithm. 

In another study, Andrea Zugarini and colleagues at the Universities of Florence and Siena in 2019 generated tercets, three-line units within a poem, and challenged humans to tell them from Dante Alighieri's own tercets in his poetic masterpiece The Divine Comedy.

Naive human judges, those with no particular background in Dante studies, judged the generated tercets to be really written by Dante almost half the time, basically, a coin toss. Dante experts fared better. 

Zugarini and colleagues conclude that their work is able "to keep Divine Comedy's meter and rhyme" even if it fails in other respects. 

As researchers get better and better at constructing such formal evaluations, in which humans are willing to accept as valid a machine that approximates superficial qualities, the concerns of human art may fade.

Hence, a kind of golden age of human and AI collaboration may be set to unfold, with couplets, tercets, and quatrains exploding on the scene faster than you can say 

so much depends


a red wheel

