Back in 2019, when conversational AI was still largely the stuff of science fiction and the full version of GPT-2 was still several months from release, a group of researchers submitted a paper to that summer’s annual meeting of the Association for Computational Linguistics. The paper, “Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper),” described a new database of annotated examples of sarcasm in speech, drawn from popular TV shows including Friends and The Big Bang Theory. The idea was that the database—dubbed the “Multimodal Sarcasm Detection Dataset,” or “MUStARD” for short—could serve as a resource for research into detecting sarcasm in conversation.
The nature of sarcasm means that it can be difficult to identify from the words alone: a sarcastic statement often says one thing while meaning another, so its actual meaning has to be inferred from other, more subtle cues. The original MUStARD paper identifies several examples of such cues—“a change of tone, overemphasis [on] a word, a drawn-out syllable, or a straight-looking face”—and argues that such “multimodal” analysis is essential for parsing sarcasm correctly.
In the intervening five years, natural-language human-computer interaction has gone from filmic plot device to everyday occurrence with head-spinning velocity. Sarcasm, however, remains difficult to detect, and two presentations at a joint meeting of the Acoustical Society of America and the Canadian Acoustical Association taking place this week in Ottawa were devoted to ways of improving sarcasm detection.
The first of these, from a team at the University of Groningen, described a neural network that builds on the approach set out in the 2019 paper. The network is trained on data from MUStARD, and The Guardian reports that it has been able to correctly identify sarcasm in unlabeled examples from the shows in the database 75% of the time. A short abstract of the research published on the meeting site explains how the model works: words are extracted from the audio data with automatic speech recognition and assigned an emoticon to denote their underlying sentiment. That sentiment signal is then combined with multimodal cues like tone of voice and the wider conversational context. The authors suggest their approach “leverages the strengths of each modality… [and] compensate[s] for limitations in pitch perception by providing complementary cues essential for accurate sarcasm interpretation.”
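For a rough sense of how that kind of fusion works in code, here is a minimal sketch in which a single sentiment score stands in for the emoticon label and a few per-utterance acoustic measurements stand in for the vocal cues; the features, the random toy data, and the simple classifier are all illustrative assumptions rather than the Groningen team's actual model.

```python
# Toy sketch of early multimodal fusion for sarcasm classification.
# The features, the random data, and the classifier are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_utterances = 200

# One sentiment score per utterance (standing in for the "emoticon" label)...
text_sentiment = rng.uniform(-1, 1, size=(n_utterances, 1))
# ...and a few acoustic cues per utterance (e.g., mean pitch, pitch range,
# speaking rate), here just random placeholders.
audio_features = rng.normal(size=(n_utterances, 3))

# Early fusion: concatenate both modalities so the classifier can pick up
# interactions such as "positive words delivered with exaggerated pitch".
X = np.hstack([text_sentiment, audio_features])
y = rng.integers(0, 2, size=n_utterances)  # toy sarcastic / sincere labels

clf = LogisticRegression().fit(X, y)
print("training accuracy:", clf.score(X, y))
```

A real system would learn the fusion jointly with a neural network rather than concatenating hand-picked features, but the underlying idea of letting one modality compensate for the blind spots of another is the same.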
Pitch is one of the most well-established cues for spotting sarcasm in speech, and the other presentation at the meeting to address sarcasm detection looked primarily at methods of pitch analysis. In particular, it focused on changes in F0, or fundamental frequency: the rate at which the vocal folds vibrate, which listeners hear as the pitch of the voice. Certain changes in this frequency often characterize sarcasm in English, and identifying these changes has thus been a reasonably reliable way of identifying a sarcastic phrase.
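Extracting an F0 track is straightforward with off-the-shelf tools. The sketch below runs librosa's pYIN pitch tracker over a hypothetical clip; the file name and frequency bounds are placeholder assumptions, and the conversion to semitones is just one common way of normalizing pitch across speakers.

```python
# Sketch: extract an F0 (fundamental frequency) contour from a speech clip.
# "utterance.wav" is a placeholder path; C2-C6 is a generous range for speech.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav")

# pYIN estimates F0 frame by frame and flags which frames are voiced.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),  # ~65 Hz
    fmax=librosa.note_to_hz("C6"),  # ~1047 Hz
    sr=sr,
)

# Keep voiced frames and express pitch in semitones relative to the speaker's
# median, so contours from different speakers are roughly comparable.
f0_voiced = f0[voiced_flag]
contour = 12 * np.log2(f0_voiced / np.median(f0_voiced))
print(f"{len(contour)} voiced frames, spanning "
      f"{contour.min():.1f} to {contour.max():.1f} semitones")
```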
The presentation, given by a team from the University of Michigan, looked in detail at how F0 changes over the course of an utterance—a pattern referred to as the “F0 contour”—when a person makes a sarcastic remark. The team identified certain acoustic signatures—“wiggliness” and “spaciousness”—that recurred in many subjects’ contours, and on further analysis of nine subjects’ speech, found that “wiggliness and spaciousness alone can capture some of the differences between sincere and sarcastic contour clusters for some speakers.” The presentation abstract cautions, however, that while “[m]any speakers produce contours characteristic of sarcasm or sincerity … these contours differ by speaker.”
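The abstract does not spell out how the two measures are defined, but plausible stand-ins can be computed from the kind of semitone contour produced in the previous sketch. The formulas below are illustrative guesses, not the Michigan team's published definitions.

```python
# Sketch: rough stand-ins for "wiggliness" and "spaciousness", computed from a
# per-utterance F0 contour in semitones (such as `contour` in the sketch above).
# These formulas are illustrative guesses, not the published definitions.
import numpy as np

def wiggliness(contour: np.ndarray) -> float:
    # How much the pitch track bends from frame to frame: the mean absolute
    # second difference is large for a wobbly contour, near zero for a smooth one.
    return float(np.mean(np.abs(np.diff(contour, n=2))))

def spaciousness(contour: np.ndarray) -> float:
    # How wide a pitch span the contour covers, using a 5th-95th percentile
    # range to dampen the effect of stray pitch-tracking errors.
    return float(np.percentile(contour, 95) - np.percentile(contour, 5))

# Two toy contours: one with an exaggerated pitch swing, one that stays level.
swooping = 6 * np.sin(np.linspace(0, 4 * np.pi, 80))
level = np.linspace(1.0, -1.0, 80)

for name, c in [("swooping", swooping), ("level", level)]:
    print(name, round(wiggliness(c), 3), round(spaciousness(c), 3))
```

Qualitatively, a swooping, exaggerated pitch track scores high on both measures while a level, matter-of-fact one scores low; whether that separation lines up with sarcasm, the findings suggest, depends on the individual speaker.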
This shows how difficult it can be to identify sarcasm by relying on any one aspect of conversation alone. This is as true for people as it is for computers, and while coverage of these presentations has focused largely on the possibility of incorporating sarcasm detection into chatbots driven by large language models, such as ChatGPT, there are potential benefits for actual people, too. The University of Groningen team suggests that its work could be helpful “for [people] with auditory processing challenges”—especially those with “disorders that affect pitch perception or those lacking contextual auditory cues”—and, more generally, for “the advancement of speech technology applications.”