“As a Red Teamer, I’ve always believed the best defense is a great offense. And in this world, that means having a clear understanding of the tactics, techniques, and procedures, because you can’t defend against a threat without first understanding the threat and how it really works.” – Brandon Kovacs, Senior Security Consultant
There is no doubt about it: AI is shaping the future of social engineering. AI has given rise to a new class of cyberattacks in which attackers create hyper-realistic deepfake video and voice impersonations to manipulate or deceive their victims into divulging sensitive information or taking actions they otherwise wouldn’t take.
We saw this play out in the news earlier this year when a finance professional at a global corporation in Hong Kong was deceived into transferring $25 million to scammers, who leveraged deepfake and voice-cloning technology to impersonate the company's chief financial officer on a video conference call.
In light of this alarming incident and so many others, Senior Security Consultant Brandon Kovacs hosted a highly attended webcast exploring how deepfakes and voice clones are created and how companies can defend their organizations against these advanced attacks.
Deepfakes and voice impersonations are made by training models on data collected from public sources, including social media posts, podcasts, interviews, and earnings calls.
Let’s drill down into what that means. Here are a few terms to lay the foundation.
All models are trained by going through a series of steps, and this process is adapted depending on whether you’re creating a deepfake video or voice clone model.
The process begins with collecting data, whether text, photos, videos, audio files, or documents.
Using retrieval-based voice conversion (RVC), you can train your own vocal models with WAV audio files as the training data. A trained RVC model can then be used for inference to perform audio-to-audio conversion, or vocal cloning. Essentially, you train a model on a source audio dataset and are then able to manipulate your voice’s tone and pitch to match the subject’s voice with remarkable accuracy.
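To make “audio-to-audio conversion” concrete, here is a deliberately crude Python sketch using librosa’s pitch shifting. This is not RVC itself – RVC learns a target speaker’s timbre from training data – it only illustrates the simplest form of manipulating a voice’s pitch, and the file names are placeholders:

```python
# Toy illustration of audio-to-audio pitch manipulation -- NOT RVC itself.
# A trained RVC model reshapes timbre toward a learned target speaker;
# this sketch only shows the simplest "raise or lower the voice" case.
import librosa
import soundfile as sf

# Placeholder file names for illustration
y, sr = librosa.load("my_voice.wav", sr=None, mono=True)

# Shift the recording up four semitones -- a crude stand-in for the
# pitch/tone matching a trained voice model performs automatically.
shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)

sf.write("shifted_voice.wav", shifted, sr)
```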
Next, you must prepare the data by cleaning and formatting.
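What “cleaning and formatting” means varies by toolchain. For voice models, it typically includes converting recordings to a consistent mono sample rate and trimming silence so the dataset is dense with actual speech. Here is a minimal sketch, assuming librosa and soundfile; the target sample rate is an illustrative choice, not a requirement of any particular tool:

```python
# Minimal audio cleanup sketch: resample to a consistent mono rate and
# strip long silences so the training data is dense with actual speech.
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 40_000  # illustrative; check your tool's expected sample rate

y, _ = librosa.load("raw_interview.wav", sr=TARGET_SR, mono=True)

# Keep only the non-silent intervals (anything louder than -30 dB).
intervals = librosa.effects.split(y, top_db=30)
speech = np.concatenate([y[start:end] for start, end in intervals])

sf.write("cleaned_interview.wav", speech, TARGET_SR)
```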
Once the data has been cleaned and prepared, you are ready to train the model.
The software takes the audio files you’ve uploaded, analyzes them, and recognizes patterns in the voice. From there, it creates a trained model of that specific individual’s voice, which can then be used to infer what that person sounds like - in essence, the ability to impersonate or replicate an individual’s voice.
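As a rough illustration of what “patterns in the voice” means – and not how any particular tool actually scores voices – you can compare the averaged MFCC fingerprints of two recordings; the file names are placeholders:

```python
# Rough illustration of "patterns in a voice": compare the averaged MFCC
# (mel-frequency cepstral coefficient) fingerprints of two recordings.
# Real voice models learn far richer representations than this.
import librosa
import numpy as np

def voice_fingerprint(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16_000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)  # one averaged vector per recording

a = voice_fingerprint("speaker_sample_1.wav")
b = voice_fingerprint("speaker_sample_2.wav")

# Cosine similarity: closer to 1.0 means more similar spectral patterns.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"similarity: {similarity:.3f}")
```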
Vocal cloning technology offers a wide range of potential benefits, from creating personalized virtual assistants to improving accessibility for those who cannot speak. However, there are serious ethical and security implications associated with this technology that must be considered.
In the webcast, Brandon gave a live demo of a deepfake model, using an open-source tool, to impersonate his buddy Chris (with his permission, of course). In this section, you’ll see screenshots from the process Brandon completed and a clip of Brandon deepfaking Chris in real time.
To create the deepfake, Brandon used DeepFaceLab, a popular open-source tool for creating deepfake models that is “purportedly responsible for 95% of deepfakes on the internet” according to their GitHub repository. This technology uses advanced machine learning techniques to train a deepfake video model using photos as the training data. Through a process known as “merging,” the face is swapped from the source to the destination.
To create a deepfake, you will go through the same general method as vocal cloning - you will have to collect and prepare the data, then train the model, to achieve the intended results. Let’s dive in.
For deepfake videos, you may collect photos or videos. This is best done with high-quality video filmed in a studio but can also be accomplished by collecting footage from publicly available content. The videos of the subject should have a variety of angles, facial expressions, and lighting conditions.
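Because tools like DeepFaceLab ultimately train on individual images, collected video gets decomposed into frames. Here is a minimal sketch of that step with OpenCV, where the file names and sampling interval are illustrative:

```python
# Extract every Nth frame from a collected video so it can serve as
# image training data. File names and sampling interval are illustrative.
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("subject_interview.mp4")
frame_index = 0
saved = 0

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    if frame_index % 10 == 0:  # keep every 10th frame
        cv2.imwrite(f"frames/frame_{saved:05d}.jpg", frame)
        saved += 1
    frame_index += 1

cap.release()
print(f"saved {saved} frames")
```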
Data preparation for deepfakes revolves around identifying and defining the source and the destination. The source is the original person or object being cloned (i.e., the public figure). The destination is the target of the deepfake manipulation (i.e., the individual who is going to impersonate the source), onto whom the source’s face will be mapped.
Essentially, deepfake technology takes the source (the original content) and transforms it onto the destination to produce the manipulated or fabricated content. Preparing the data is a lengthy, multi-step process: frames are extracted from the source and destination footage, faces are detected and aligned, and a mask model (XSeg) is trained to isolate the faces.
During the masking phase, the model learned to recognize both Brandon’s and Chris’s faces while ignoring obstructions (such as hats).
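XSeg masking itself is specific to DeepFaceLab, but the face-detection step that feeds it can be illustrated with OpenCV’s bundled Haar cascade – a far simpler detector than what deepfake tools actually use:

```python
# Detect and crop faces from an extracted frame. DeepFaceLab uses much
# stronger detectors and aligners; the Haar cascade just shows the idea.
import os
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

img = cv2.imread("frames/frame_00000.jpg")  # placeholder path
assert img is not None, "frame not found"
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Returns bounding boxes (x, y, width, height) for each detected face.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

os.makedirs("faces", exist_ok=True)
for i, (x, y, w, h) in enumerate(faces):
    cv2.imwrite(f"faces/face_{i}.jpg", img[y : y + h, x : x + w])
```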
After the lengthy process of preparing the data and the XSeg mask model, you can train the deepfake video model. During the training step, the model learns to recognize patterns in both the source and the destination and makes a series of predictions based on what it knows at that point in time.
Top row: Predictions of source (Chris)
Middle row: Predictions of destination (Brandon)
Bottom row: Face-swapping prediction between the source and destination
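Those three rows map onto the classic deepfake architecture: a single shared encoder and two decoders, one per face. Each decoder learns to reconstruct its own person, and the swap comes from feeding one person’s encoding through the other person’s decoder. Here is a stripped-down PyTorch sketch of that structure; the dimensions and layers are illustrative, not DeepFaceLab’s actual network:

```python
# Stripped-down sketch of the shared-encoder / two-decoder deepfake
# architecture. Dimensions and layers are illustrative only.
import torch
import torch.nn as nn

LATENT = 256

encoder = nn.Sequential(  # shared: learns features common to both faces
    nn.Flatten(), nn.Linear(3 * 64 * 64, 1024), nn.ReLU(),
    nn.Linear(1024, LATENT),
)

def make_decoder() -> nn.Module:  # one decoder per person
    return nn.Sequential(
        nn.Linear(LATENT, 1024), nn.ReLU(),
        nn.Linear(1024, 3 * 64 * 64), nn.Sigmoid(),
    )

decoder_src = make_decoder()  # reconstructs the source (Chris)
decoder_dst = make_decoder()  # reconstructs the destination (Brandon)

loss_fn = nn.MSELoss()
opt = torch.optim.Adam(
    list(encoder.parameters())
    + list(decoder_src.parameters())
    + list(decoder_dst.parameters()),
    lr=1e-4,
)

# One training iteration with random stand-in batches of 64x64 RGB faces.
src_faces = torch.rand(8, 3, 64, 64)
dst_faces = torch.rand(8, 3, 64, 64)

# Rows 1 and 2 of the preview: each decoder reconstructs its own person.
loss = (
    loss_fn(decoder_src(encoder(src_faces)), src_faces.flatten(1))
    + loss_fn(decoder_dst(encoder(dst_faces)), dst_faces.flatten(1))
)
opt.zero_grad()
loss.backward()
opt.step()

# Row 3, the swap: encode the destination's face, decode with the
# source's decoder -- the source's face on the destination's pose.
with torch.no_grad():
    swapped = decoder_src(encoder(dst_faces))
```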
As the number of iterations increases, the images become more lifelike and realistic. With DeepFaceLive (a separate tool), you can take the model you created and trained in DeepFaceLab and run a live deepfake.
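Conceptually, the live version is just the trained model applied inside a webcam capture loop. In this hypothetical sketch, `swap_face` stands in for the real inference pipeline, which is considerably more involved:

```python
# Conceptual live-deepfake loop: read webcam frames, run the trained
# model on each one, and display the result in real time.
import cv2

def swap_face(frame):
    # Placeholder: a real pipeline would detect and align the face, run
    # the trained model, and merge the swapped face back into the frame.
    return frame

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("deepfake preview", swap_face(frame))
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()
```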
The following clip is a timelapse showing the evolution of the model as it is being trained. The two far-left columns represent the source-to-source prediction, the middle columns represent the destination-to-destination prediction, and the far-right column is the deepfake prediction, swapping Chris’s and Brandon’s faces.
Take a look at the final product. Here is the live demo from the webcast showing the results of the trained model, with Brandon deepfaking Chris in real time.
Deepfakes and voice clones are a new class of AI-powered cyberattacks with wide-reaching implications for society and business. Here’s how Brandon recommends incorporating this technology into your Red Team exercises to ensure your organization is equipped to defend against these types of threats.
Two examples include:
The bottom line with this technology is that it will affect everyone. Attackers are increasingly using it in phishing and vishing scams, and the frequency and sophistication of these attacks will only increase. It will impact small, medium, and large businesses alike, along with vulnerable and elderly individuals.
And if you are a public figure, you are even more at risk of being impersonated. A wealth of publicly available content exists – earnings calls, interviews, podcasts, social media posts, networking forums, company websites – and all of it can be used to train a video or voice model of you.
With that, we have two practical ideas for operational process improvements to mitigate the risk of fraud in your organization. In a scenario where an individual requests that an action be taken (such as transferring money) – especially while employing fear, uncertainty, and doubt (FUD) – you need to establish the caller’s identity to verify the validity of the request.
Two ways to do that are:
These are both low- or no-tech approaches that enable you to cut the attacker out of the loop. Even in the face of cutting-edge cyberattacks, sometimes old-school methods work best.
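Purely as an illustration of that principle, the check can be expressed as a simple gate in a request-handling workflow: no sensitive action proceeds until the requester’s identity is confirmed through a channel you control, not the channel the request arrived on. The contact directory and messages below are hypothetical placeholders:

```python
# Illustrative sketch of out-of-band verification: never act on the
# channel a request arrived on; confirm through one you already trust.
# The contact directory and messages are hypothetical placeholders.
KNOWN_CONTACTS = {"cfo@example.com": "+1-555-0100"}  # numbers you trust

def handle_request(requester: str, action: str) -> str:
    number = KNOWN_CONTACTS.get(requester)
    if number is None:
        return f"REJECT: {requester} is not in the trusted directory"
    # Call back on the known number (out of band), not the inbound line.
    return f"HOLD: '{action}' pending callback verification via {number}"

print(handle_request("cfo@example.com", "wire transfer"))
```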