AI in Love: Crafting Love Poems with Large Language Models

Large Language Models (LLMs) have caused a substantial shift within the artificial intelligence sector, integrating into consumers’ daily lives by offering assistance with tasks like text classification, summarization, and question-answering. One field where LLMs particularly stand out is creative writing. They can craft emails, marketing slogans, or even full essays that are hard to distinguish from human-written content. Can these LLMs also woo us with poetry?

With Valentine’s Day approaching, Toloka asked three widely accessible LLM-powered chatbots — ChatGPT, Gemini (formerly Bard), and Copilot — to craft love poems for the occasion. This article examines the quality of the poems, the potential for personalization, and whether results could be consistently replicated by the average user.

LLM Comparison: Fundamentals of Poem Crafting

In the initial phase of our investigation, Toloka gave the chatbots a simple challenge: write a short poem dedicated to Valentine’s Day, with no supplementary details provided. The exact text used is shown below, alongside the chatbots’ answers.

The generated poems, while distinct in vocabulary, shared a structural symmetry: each comprised three four-line stanzas. The poems from ChatGPT and Gemini follow a consistent AABB rhyme scheme within each stanza, where the first two lines rhyme with each other, as do the last two lines. This stylistic choice was most likely picked up from the training data.
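As a rough illustration of what an AABB scheme means, here is a minimal Python sketch that labels the rhyme scheme of a stanza. It is not the method used in the study. It assumes, naively, that two lines rhyme when their final words share the same last two letters, which is only a crude spelling-based approximation of rhyme. The function name `rhyme_scheme` and the sample stanza are made up for this example and are not taken from any of the chatbots’ poems.

```python
def rhyme_scheme(lines, suffix_len=2):
    """Label each line with a letter; lines whose last words share
    the same ending letters get the same label (naive heuristic)."""
    labels = {}   # maps a word ending to its rhyme letter
    scheme = []
    for line in lines:
        # Take the last word of the line, without trailing punctuation.
        last_word = line.strip().rstrip(".,;:!?").split()[-1].lower()
        ending = last_word[-suffix_len:]
        if ending not in labels:
            # First time we see this ending: assign the next letter.
            labels[ending] = chr(ord("A") + len(labels))
        scheme.append(labels[ending])
    return "".join(scheme)


# Hypothetical stanza, invented purely for illustration.
stanza = [
    "The moon gives off a silver light,",
    "that carries us all through the night.",
    "You hold the key to my heart,",
    "and we will never be apart.",
]

print(rhyme_scheme(stanza))  # AABB
```

A spelling-based check like this misses many true rhymes (for example "you" and "true") and would need a pronunciation dictionary to be reliable, so it should be read only as a way to make the AABB pattern concrete.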

To dive deeper into the LLMs’ poetry-writing skills, we asked a group of literature experts to review these results using the Toloka Deep Evaluation Platform. They found that all of the poems conveyed emotions well and were grammatically and structurally sound. On the other hand, they criticized the clichéd vocabulary and the lack of continuity in the poems’ stories. All of the experts preferred Copilot’s poem, commenting on its slightly more elaborate vocabulary and the fact that it conjures quite a lovely picture of lovers walking in a rose garden.

All of the poems, however, lack any type of personalization that one would expect from hand-written poems, and we will address this aspect in the next phase of the experiment.

LLM Comparison: Crafting Customized Poetic Narratives

To add a layer of personalization, the second iteration of our study introduced additional elements in the prompt provided to the model, including the lover’s name, the setting of the couple’s initial encounter, and a brief account of their relationship’s evolution. Let’s examine how well each model wove these details into its poem.

Given the new prompt, all of the models maintained the four-line stanza format but increased the number of stanzas from three to four, most likely to incorporate all of the required details. The poems vividly brought Monica to life, capturing the genesis of the couple’s romance and their journey, showcasing the models’ impressive capacity for personalization.

Gemini and ChatGPT kept the same AABB rhyme scheme, enhancing the lyrical quality of their offerings, while Copilot Chat gave the poem a name, “Monica’s Wave”, and added a simple one-liner at the end for the occasion.

So what do our literature experts think about GenAI personalized poems? They strongly preferred the Copilot poem again due to its superior vocabulary and better flow. They also believed that the free-form style (no rhyme scheme) used in this poem comes off as more sincere and personal. On the other hand, they noticed that the perspective was wrong: the poem did not read as if it were written by the author to their girlfriend. Despite this small glitch, the experts agreed that this was the best AI-generated poem.

Once again, Gemini’s and ChatGPT’s creations were criticized for being “lazy” and overusing simplistic phrases and vocabulary. Gemini’s poem received extra points for mentioning and building the story over the two-year time period. Copilot also captured this detail, leaving ChatGPT as the only AI model that did not pick up this small but important nuance.

Comparing LLM vs. Human Crafted Poetry

To add an extra layer of analysis to this research, Nik Barkley, VP of brand marketing and experience design at PAN Communications, crafted his own Valentine’s Day poems using the same prompts provided to the LLMs. Nik is creative at heart and dabbles in many different mediums, everything from watercolor art to poetry.

We then ran a brief poll using Dynata, surveying 1,000 US consumers ages 18 and up to find out whether average consumers could tell if each poem was written by a human or by Generative AI.

For Prompt 1, 58% of respondents correctly identified the human-written poem. However, most respondents also believed that the three AI-generated poems were written by humans: ChatGPT (59%), Bard/Gemini (56%), and Copilot (61%). Of the four poems, 31% of respondents preferred the human-written poem, narrowly beating out the ChatGPT-generated poem (30%).

For Prompt 2, which asked for a more personalized poem, 55% of respondents correctly identified the human-written poem. Most respondents still believed that the three AI-generated poems were written by humans: ChatGPT (51%), Bard/Gemini (55%), and Copilot (52%). Respondents misjudged the AI poems more often for Prompt 1, meaning the less personalized poems came across as more “human” than those written from the more detailed prompt. Of the four poems, 31% of respondents preferred the human-written poem, beating out the Copilot-generated poem (28%).
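To make the comparison between the two prompts concrete, the short Python sketch below averages the percentages reported above for how often each AI-generated poem was judged to be human-written. The unweighted mean across the three chatbots is our own simplifying assumption for illustration, not a statistic reported by the survey, and the names `judged_human` and `mean_rate` are hypothetical.

```python
# Percent of respondents who believed each AI-generated poem was
# written by a human, as reported in the survey results above.
judged_human = {
    "Prompt 1": {"ChatGPT": 59, "Gemini": 56, "Copilot": 61},
    "Prompt 2": {"ChatGPT": 51, "Gemini": 55, "Copilot": 52},
}


def mean_rate(rates):
    """Unweighted mean across chatbots (an illustrative assumption)."""
    return sum(rates.values()) / len(rates)


averages = {prompt: mean_rate(rates) for prompt, rates in judged_human.items()}

for prompt, avg in averages.items():
    print(f"{prompt}: {avg:.1f}% judged human on average")

gap = averages["Prompt 1"] - averages["Prompt 2"]
print(f"Gap: {gap:.1f} percentage points")
```

Under this simple averaging, the AI poems were mistaken for human work about 58.7% of the time for Prompt 1 and about 52.7% for Prompt 2, a gap of roughly 6 percentage points.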

What this shows is that LLMs are very close to imitating humans in creative writing, especially in generalized contexts.

LLMs’ Role in the Future of Creative Writing

For those seeking to enchant their loved ones with a unique and personalized poetic gesture, LLMs present a promising avenue. Even though the experts preferred Copilot’s poems for their more original content, poetry is often a matter of taste, and opinions may differ. All of the models produced acceptable results, and we believe that the experiment showcases the impressive capabilities of current LLMs. This opens up conversations about the future of creative writing in the age of AI and shows that these tools can augment human creativity by helping us create personalized poems, texts, and stories.

So which poem was your favorite?

The example poems in this article were written by general-purpose models that are not specifically trained to produce poetry. Evaluating the output of these models is complex and nuanced because what makes a good poem is subjective. With Toloka Deep Evaluation, we can evaluate metrics such as grammar, construction, novelty, flow, and the use of metaphors or other linguistic techniques to judge models’ performance. Deep insights from expert evaluators can provide valuable feedback to improve models across all types of domains.
