What are deepfakes?
Deepfakes have appeared as a new type of online harassment, a potential medium for fake news, and a type of fraud. But what exactly are deepfakes?
Deepfakes are a form of synthetic media: artificially created or manipulated images, video, or audio. They are called “deep” because they are generated or manipulated by a type of machine learning model called a deep neural network, which is made from many stacked layers of computational units called perceptrons. The foundations for these models were described as early as the 1940s, and they saw a resurgence in the 1990s. However, only recently has enough computational power become available to use them so effectively. Graphic designers and audio engineers have long been able to create and manipulate media manually to produce very realistic effects. What is new is that deep models integrated into a mobile app or other easy-to-use format now allow anyone to generate realistic synthetic media without any special skills.
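To make the idea of “stacked layers of computational units” more concrete, here is a minimal sketch in Python. It is an illustration only, not a model used for deepfakes: the layer sizes and input values are made up, the weights are random rather than trained, and each unit uses a smooth sigmoid squashing function, which is one common variant of the original perceptron.

```python
import math
import random

def make_layer(n_inputs, n_units, rng):
    """One layer: each unit has a weight per input plus a bias (random, untrained)."""
    return [
        {"weights": [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)],
         "bias": rng.uniform(-1.0, 1.0)}
        for _ in range(n_units)
    ]

def unit_output(unit, inputs):
    """A single unit: weighted sum of its inputs, squashed into (0, 1) by a sigmoid."""
    total = sum(w * x for w, x in zip(unit["weights"], inputs)) + unit["bias"]
    return 1.0 / (1.0 + math.exp(-total))

def forward(layers, inputs):
    """Pass the inputs through every layer in turn; each layer feeds the next."""
    values = inputs
    for layer in layers:
        values = [unit_output(unit, values) for unit in layer]
    return values

rng = random.Random(0)
# Hypothetical sizes: 4 inputs, two hidden layers of 8 units, 1 output.
layer_sizes = [4, 8, 8, 1]
layers = [make_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
output = forward(layers, [0.2, -0.5, 0.9, 0.1])
```

A “deep” model follows the same pattern with far more layers and units, and its weights are learned from data rather than drawn at random.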
History of synthetic media
There is a long history of creating or altering media artificially, and efforts to replicate the human voice go back very far. In 1845, an Austrian inventor named Joseph Faber exhibited the Euphonia, a mechanical reproduction of the human vocal tract. A bellows acted as “lungs,” and mechanisms corresponding to the throat, tongue, and jaws were controlled with a piano-like keyboard. The Voder, invented by Homer Dudley from Bell Labs and exhibited at the 1939 World’s Fair, also used a keyboard to control electronically-generated waveforms to sound like speech.
Researchers in the 1970s crafted rules for mapping text to waveforms, and in the 1980s and 1990s, speech recordings were cut-and-pasted automatically to produce new sentences. From the early 2000s, researchers investigated machine learning approaches to teach models to learn relationships between linguistic features of text and acoustic features of speech from data. This opened up new possibilities for flexible and adaptable speech synthesis, since these models can be taught to speak in the voice of a different person, or in a different style, using only a small amount of new data.
Over the past decade, machine learning models have become more complex and powerful, and deep models can synthesise speech that sounds as natural as human speech — this is the technology behind speech deepfakes.
Open research problems in speech synthesis
Now that it is possible to synthesise speech that sounds as natural as human speech, it may seem like speech synthesis is a solved problem. However, this is not the case. Researchers at the National Institute of Informatics (NII) in Tokyo, Japan are tackling many open research problems to make speech synthesis more useful and accessible. While an individual’s voice can be “cloned” using as little as a few minutes of their speech, this speech must be recorded in a very quiet environment and the words they said must be transcribed. It remains a challenge to create a high-quality personalised voice using only a very short, untranscribed utterance that was perhaps recorded in noisy conditions. Furthermore, there are over 6,000 languages spoken in the world today, and it takes a huge data collection and engineering effort to create a synthesised voice for a new language. Lowering this barrier to entry is an important goal if the benefits of speech synthesis technology are to be more equally available to everyone.
Speech synthesisers are typically designed for a neutral speaking style, and high quality can be achieved in that case, but human vocal expression is much more diverse. Emotional speech synthesis, synthesis of non-verbal vocalisations such as laughter, and the modelling of different speaking styles such as storytelling can help to make synthesised speech more expressive. It’s also important to consider when and where people will listen to synthesised speech: since listeners may be in a noisy environment, speech intelligibility enhancement is also an important topic.
Uses of synthetic media
Some of the earliest uses of speech synthesis were for accessibility, and these remain important applications today. Screen readers enable vision-impaired individuals to hear information on their computer screens. Communication devices allow individuals who have lost the use of their voice to speak using a synthesiser. With the advent of model-based synthesis, someone who will lose their voice in the future due to a surgery or disorder can record their voice to create a personalised synthesiser for their later use, which is called “voice banking.” Speech synthesis also helps break down language barriers as a component of speech-to-speech translation. Virtual assistants, home devices, and GPS navigation devices all use speech synthesis to communicate and help us in our daily lives. Synthesised voices, characters, and images also appear in the music, computer game, and film industries.
Of course, any technology can be used for good or harm. The ability to synthesise speech in the voice of an individual using as little as a few minutes of recordings opens up the possibility of creating audio deepfakes: audio intended to deceive the public, to harass, or to conduct phone scams and defraud individuals. Deepfakes can also be used for “spoofing,” that is, to fool machines such as biometric voice authentication systems. How can we counteract these malicious uses of technology?
Detecting audio deepfakes
Just as we can create synthetic speech by training a deep neural network, we can also train models to discriminate between human and synthesised speech. Synthesised speech typically contains artifacts or other differences from human speech, and although these may be inaudible to the human ear, machines may nevertheless be able to detect them. Automatic detection of audio deepfakes has become an active research topic in both the biometrics and speech research communities. Since 2015, a team of researchers from NII and other organisations around the world has organised the biennial ASVspoof Challenge, a competition to develop countermeasures for spoofing attacks on automatic speaker verification (ASV) systems. In addition to encouraging research in this area, the outcomes of the challenges have served as a record of progress in this field.
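The idea of training a model to tell human from synthesised speech can be sketched with a toy example. Everything here is an assumption made for illustration: the two “features” per recording are made-up numbers rather than real acoustic measurements, and the model is a simple logistic regression (effectively a single unit) standing in for the deep networks used in practice. It is not the method of any ASVspoof participant.

```python
import math
import random

rng = random.Random(0)

def make_examples(n, centre, label):
    """Draw n made-up 2-D feature vectors around a centre; not real audio features."""
    return [([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], label)
            for _ in range(n)]

# Label 0 = human speech, label 1 = synthesised speech (toy data only).
data = make_examples(200, -2.0, 0) + make_examples(200, 2.0, 1)
rng.shuffle(data)
train, test = data[:300], data[300:]

def predict_prob(weights, bias, features):
    """Estimated probability that the features come from synthesised speech."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-score))

weights, bias, learning_rate = [0.0, 0.0], 0.0, 0.1
for _ in range(100):  # plain gradient descent on the logistic loss
    grad_w, grad_b = [0.0, 0.0], 0.0
    for features, label in train:
        error = predict_prob(weights, bias, features) - label
        grad_w = [g + error * x for g, x in zip(grad_w, features)]
        grad_b += error
    weights = [w - learning_rate * g / len(train)
               for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b / len(train)

# Evaluate on held-out examples the model never saw during training.
correct = sum((predict_prob(weights, bias, f) >= 0.5) == bool(y) for f, y in test)
accuracy = correct / len(test)
```

The toy classes are deliberately easy to separate. Real detectors face differences that are far subtler, which is why they rely on deep models and large databases of example recordings.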
Taking the ASVspoof Challenge 2019 as an example, the task was to discriminate real human voices from audio deepfakes generated using many advanced text-to-speech systems. The difference between real voices and the fake audio from some text-to-speech systems is barely discernible to human ears. Nevertheless, the five best detectors submitted by participants could correctly detect around 990 out of 1,000 such natural-sounding audio deepfakes.
In contrast, if we build deepfake detectors using the winning techniques from 2015, they make 15 times more errors. This progress was possible because researchers found more effective ways to use deep learning to detect artifacts in deepfake audio. Another contributing factor is the creation of large-scale audio deepfake databases, including those released by the ASVspoof Challenges and other research groups.
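To put these figures side by side, the short calculation below works through the arithmetic. It rests on two assumptions that the text does not spell out: that an “error” means a deepfake the detector fails to catch, and that “15 times more errors” means 15 times as many misses on the same 1,000 samples.

```python
# Figures taken from the text; the interpretation of "errors" is an assumption.
total_deepfakes = 1000
detected_2019 = 990

missed_2019 = total_deepfakes - detected_2019      # deepfakes that slip through
missed_2015_style = 15 * missed_2019               # "15 times more errors"
detected_2015_style = total_deepfakes - missed_2015_style

print(f"2019 detectors miss about {missed_2019} in {total_deepfakes}")
print(f"2015-style detectors would miss about {missed_2015_style} in {total_deepfakes}")
```

Under this reading, a detector built with the 2015 techniques would let roughly 150 of every 1,000 such deepfakes through, compared with roughly 10 for the best 2019 systems.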
Detecting deepfakes is inherently a cat-and-mouse problem: as countermeasures learn differences between human and synthesised speech, spoofing attacks may become more advanced and further reduce these differences.
But the prospect of detecting advanced audio deepfakes is not bleak. Audio deepfakes can forge natural-sounding voices using data-driven approaches, but they are not constrained to obey the physical laws of voice production. Even if it is not impossible, creating a deepfake that exactly reproduces physical voice production is a Herculean task. This gap between data-driven simulation and the physical process leaves traces behind in audio deepfakes. While human ears may not perceive these traces (as the voice perception system is not wired to do so), audio deepfake defenders, which are themselves data-driven machine learning models, are likely to spot them given a sufficient amount of data. Advanced speech synthesisers can themselves provide such training data for the deepfake defenders. At the same time, although the accuracy of automatic deepfake detectors is expected to improve, some synthesised audio is no longer obviously artificial to humans, so it is especially important to spread awareness and education about media literacy so that we can think critically about the content we encounter online.
* From Science & Technology Office Tokyo, Embassy of Switzerland in Japan: Together with researchers from the University of Zurich, the authors have recently presented their work on deepfakes at the Pre Agora session of Science Agora, one of the largest science communication forums in Japan. If you would like to learn more on this topic and experience actual examples with interactive deepfake detection games, have a look at the video from the event.