Published on 10/17/2022
Last updated on 02/09/2024

The AI of Forgery: Synthetic media, deep fakes, and the new “uncanny valley”

It is easy to fool someone by simply directing them to focus on what matches their expectations. That way they are far less likely to notice the (sometimes amazingly obvious) attributes that would fail even the most rudimentary authentication.

un·can·ny val·ley

noun
used in reference to the phenomenon whereby a computer-generated figure or humanoid robot bearing a near-identical resemblance to a human being arouses a sense of unease or revulsion in the person viewing it.

Recent advances in synthetic media, not just deep fakes, present some interesting observations about the interactions between the individuals engaging with it and the context, content, and intent of the interaction. There are many aspects of AI-generated content where this is important to be aware of.

Many of the surveys that measure user responses don’t distinguish between (or don’t report the refined data for, if it is available) the types of systems the users are responding to.

Users who are seeking noncritical information, or information without a time constraint on the result, will often accept any interaction so long as it produces a satisfactory result. Additionally, if the user isn’t satisfied with the result or the speed at which it is achieved, then dismissing the action causes minimal frustration.

As the user’s need becomes more critical, the potential failure of the interaction changes how the user perceives the AI presenting the information. A simple example is when you, as a third-party observer, watch the system interact with someone else. Here the interaction involves two external parties. You, the observer, are more likely to perceive realism in the interaction because you are not personally involved; there is no personal stake in the result. However, when you are the direct participant with a need that must be met, your perception of the interaction changes significantly. The importance now rests far more on the result than on the realistic aspects of the AI. If your request isn’t simple, or is communicated in a colloquial but ambiguous manner that the AI may not properly understand, heavy-handed clarifications or misunderstandings can quickly cause frustration.

This is where the new uncanny valley is apparent. The traditional idea was that the closer an artificial person comes to appearing “real,” the more obviously, and unnervingly, artificial it becomes, making it “more creepy.” When you understand that the uncanny valley is based on a graph, you can see that the closer the fake gets to reality, the steeper the drop-off becomes between the accuracy of the fake and the “real thing.”

[Figure: the classic uncanny valley graph, plotting familiarity against human likeness]

This is essentially where and how the “creepiness factor” comes into play. When we look at something artificial without knowing beforehand that it is artificial, we tend to notice first the attributes and elements that are consistent with our expectations.

Consider our suspension of disbelief for anything that seems real or that we pretend is real. These situations are acceptable because there is always a discrete understanding and acceptance that they are not real. If we talk to a stuffed animal and pretend there is a response, we know that it isn’t actually going to respond. If it did, we would be shocked. There are elements of a stuffed animal that point to its similarity to a real animal, but there are many others (polyester fur, bead eyes, lack of articulated joints) that keep it from being perceived as a live animal.

However, the more elements that compile (over time) to confirm our acceptance of something as real, the more likely we are to notice a small inconsistency with what we know. This isn’t merely a flaw, but rather something that conflicts with a basic expectation in an unexpected way. Often these responses aren’t easily defined because there is no real-world analogy; it is a cognitive dissonance we’ve never encountered and therefore never needed to address. It could be an unnatural smoothness of skin, a freckle pattern outside of our experience, or something more subtle such as subsurface light scattering (how light behaves beneath the skin). This often sudden realization results in a level of revulsion that combines the disconnect from what is real with the fact that the fake is so close and yet so far from it. Just a few years ago, CGI effects in movies and animation produced characters that could be strikingly realistic in still images. But as soon as you saw them in action, there was the appearance of what was referred to as “dead eyes.” In this case, the creepy quality came from extremely subtle movements and interactions with light that made the moving character more than a little disconcerting.

This is not a paradigm unique to computer-generated content; it is based in real-life situations. You see something and, from an almost instantaneous initial perception, believe it to be alive. Then, through additional sensory input like touch (which has more primal impact than simply having information verbally relayed), you find out that the subject is not in fact alive. This creates a visceral response. If the viewer does not touch the subject, then, unless informed by other means, they will continue to believe that it is alive.* So the perception and assessment clearly progress over time.

[Photo illustration: DAVID PAUL MORRIS—BLOOMBERG/GETTY IMAGES. PHOTO ILLUSTRATION BY STEPHEN BLUE FOR TIME FOR KIDS]

Deep fakes

Think of the classic con man. This still-relevant term is an abbreviation of the original “confidence man”: someone who gained your confidence, trust, and belief through lies, distraction, and misdirection to trick you into freely giving them money or property. The process is known simply as “the con.”

This is the con of deep fakes. Most deep fake successes come from the target not observing too closely or too long. Of course, this reduced observation depends on the observer not having any critical need, or available time, to assess validity. Humans can perceive extremely fine variations in facial movements, imperfections, and details; when those are absent, the viewer may not be able to enumerate or describe them, but they are quite aware of something being “off,” missing, or wrong. Combine this with the fact that the effectiveness of a fake is time-based: the length of time a person spends perceiving the fake affects their decision about its authenticity. If they perceive there is sufficient information to assess something as authentic, and that decision is made quickly, the fake succeeds. This is not a new concept; art forgery has been documented since at least the Italian Renaissance (500-600 years ago). The effectiveness of a forgery is determined by the victim’s lack of knowledge, the time available to assess it, and the effort required to determine whether and how a forgery has been committed. Additionally, an individual’s openness to accept, or cynicism to reject, affects its effectiveness. Either of these attributes can be exploited to gain their mistaken confidence. Simple awareness of cognitive biases and their complexities shows how these interactions can be not only predicted but also directed.

When I have personally, unknowingly viewed deepfake videos, I realized that my recognition of a personality is the first thing I notice. Next come the contextual indicators: physical discrepancies, mannerisms, voice tone, inflection, cadence, dialect, and accent. Another aspect is content, context, and presentation. Any one of these things can create a level of doubt which, once initiated, sets up a situation of looking for other flags for confirmation or dismissal. I cannot say that I have never fallen for a deep fake, but I have seen and recognized many, even prior to having heard the “deep fake” moniker. At the point when you fully recognize it, though, you are typically in the uncanny valley.

The important consideration here is when biases like confirmation bias enter the perception: if the fake confirms what we want to be true, then we are more likely to believe it in spite of perceptual anomalies.

The faking of content in synthetic media isn’t just in video. It can be applied to other areas as well, including the arts, which of course include language. The misperception of a fake as real doesn’t have to have nefarious intent. To start with, consider some of the visual aspects of DALL-E 2.

If you look at output from DALL-E 2 (as one image generation model), it is easy to see the images from less complex inputs as being very realistic. This is simply because we tend to view them from the standpoint of “Look at how realistic this is.” The viewer’s perspective is one where the attributes of realism are the intended (directly or indirectly) focus. Unless you precede the direction to look at an image with “What’s wrong/missing/odd about this image?”, the viewer is not likely to focus on that. Of course, such a negative precedent would also bias a viewer against an authentic image, where the viewer may not have sufficient experiential knowledge to confidently and accurately assess it. Willing suspension of disbelief may be an unconscious choice we make to validate what we want to see. Most do this for entertainment’s sake. It is not that we are fully unaware of it, but it can become almost like breathing: something we can control if we wish to.

A new technology that surpasses an old one, or that simply introduces a new feature, tends to draw attention to that capability in a somewhat myopic manner. Consider the development of CGI. When raytracing was a new rendering technology, it was constantly referred to as photorealistic, because the capability of reproducing a specific physical aspect of light in a mathematically accurate way hadn’t existed before. The effect was amazing, and few people dismissed it for not addressing the radiosity of materials and colors. Most viewers were focused on the improvement from where we were, not on the distance between this new process and actual light on real surfaces.

Language

There are many programs and projects focused on (directly or indirectly) generating text content, combining NLUs, NLPs, and large language models (like GPT-3). When users know they are dealing with an NLU AI, they quickly curtail their dialog to be fairly concise, eliminating pleasantries or any extraneous information. When the NLU expresses pleasantries, the user is likely to react but is also likely to try to engage. If this engagement inadvertently reveals that they are not talking to a person, they are likely to have a negative reaction, whether anger, frustration, or embarrassment. There are ways for the AI to address this, such as using phrasing and directives that focus on the ego of the person. Complex questions can be evaded by asking the person to continually talk about themselves. This isn’t an inherently deceptive practice; many people use it in social situations to engage with someone. While not a 100% effective method, it still has a fairly high success rate in maintaining a conversation that can be perceived as successful.

Sounding natural in conversation doesn’t automatically correlate to having a substantive conversation. There are conversations that are clearly small talk, which is typically polite and follows, at a minimum, a base set of social rules, but these are usually non-productive and insubstantial. Commonly, though, conversations can lead to one or more subjects that elicit individual preferences, ideas, and conceptions. This is the area where even the best AI will fail. (It’s also why it’s not close to AGI.) Currently, the most advanced systems in the world can mimic fluid conversations and, given the consistency of ego, continue them for a long time. Ego can reinforce the belief that the conversation is with a real person: not only do people like to have their egos supported in subtle ways, but the more their egos are expanded, the more likely they are to defend the source of the expansion.

The problem happens when the human is looking for substance in the conversation. AI is not sufficiently advanced to propose independent and contextually relevant ideas to further a conversation. Also, humans can switch context midsentence to an abstract parallel or analogy without being concerned that the person they are talking to won’t easily recognize and process it on the fly, and then switch back without losing the train of thought on either idea. This is where a conversation requires both collaboration and creativity. In such a conversation, both parties are adding ideas and occasionally countering the other’s ideas. Here, AI has huge constraints: it would have to not only manage huge amounts of continually shifting information but also interpret the subtle difference between “creatively following a line of thought” and going off on a complete non sequitur.

Also, in conversation, it is common for the subject to branch off tangentially in a contextually relevant way and continue in a new direction. The conversation can also be easily guided back, but if it is guided back too quickly, without confirmation and acknowledgement (which most NLUs and AI systems require), it is likely to be perceived as rude at best, if not clearly unnatural.

The most advanced current experimental systems can have conversations that sound very natural and realistic, but if the person talking to the AI is not aware that it is an AI, then unless they are really bored, lonely, or egocentric, they are likely to end the conversation at best somewhat frustrated, if not worse.

A language-specific example.

Another issue prevalent in this conversational uncanny valley is that context is often determined at a granular level, since it is far easier to process a tremendous amount of data in a 2D matrix than to address the geometrically larger (and more human) aspects of hierarchical context. For example, consider a conversation about a technical problem of information entry, which has a clear context, where the response is a colloquial expression including a quote from a movie:

“If I answer all the questions on the form but I put my address in based on the way we always have referred to home as ‘The Regency’ rather than the street address, the system may say ‘I’m sorry Dave, I’m afraid I can’t do that.’”

In this conversational exchange, the speaker (not named Dave) is talking to a familiar colleague about filling out an online form. The speaker suspects that if they enter the address under its commonly expected label rather than the formal street address, it may be rejected. The speaker’s quote from “2001: A Space Odyssey” serves as a metaphorical tool for that rejection, in the form of a pop culture reference. This seems relatively straightforward, if somewhat complex.

Now, following this conversation, the colleague may respond later using a quote from another movie, perhaps a more obscure one, since pop-culture movie quotes have now been established as an acceptable metaphorical paradigm (not to be used too frequently, as high-frequency use would make it annoying). On the other hand, if the colleague doesn’t know the reference directly, they are still likely to understand the context and continue, in this case either ignoring the reference or perhaps asking about it later.

“Now that we’re back, we can do the work we’re best at ‘cuz you know: ‘There’s no place like home.’”

In a situation like this, an NLP may be able to correctly infer the sentiment but may not understand, and more importantly recognize, the context, and even more importantly the context tree. In that situation, there may be a follow-up interaction where the previously established context is referred to in a more obscure or equivocal manner. Now the context is lost because it was not defined. And in addition to the missing definition, there was no context tree or hierarchy:

e.g.
 popular culture:
  entertainment:
   movies:
    iconic:
     science fiction:
      “2001: A Space Odyssey”
       “I’m sorry Dave,….”

You can see that this gets pretty deep when you look at each branch. Also, if the system creates this context tree, how can it assess the accuracy of the tree and the ranking of the hierarchy? For that matter, if it proposes multiple context trees, which one does it validate and use, and for how long does it hold it in memory? If it does hold it in memory for a period of time, what is that time, and is there a time-based scale by which its relevance is progressively lowered before it is purged?
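
To make these questions concrete, here is a minimal sketch (mine, not from any production system; every name in it is hypothetical) of how such a context tree might be represented, with a per-node confidence score that the questions above would operate on:

from dataclasses import dataclass, field

@dataclass
class ContextNode:
    label: str
    confidence: float = 1.0              # how sure the system is of this branch
    children: dict = field(default_factory=dict)

    def add_path(self, labels, confidence=1.0):
        # Insert a branch such as popular culture > ... > a specific quote.
        node = self
        for label in labels:
            node = node.children.setdefault(label, ContextNode(label))
        node.confidence = confidence
        return node

# Build the hierarchy from the example above.
root = ContextNode("popular culture")
root.add_path(["entertainment", "movies", "iconic", "science fiction",
               '"2001: A Space Odyssey"', '"I\'m sorry Dave, ..."'])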

This hierarchy becomes more clearly important when the next quote reference could stem from anywhere in the hierarchy. The confidence increases based on the proximity to the last reference, but not in equal measure. The next reference is more likely to indicate the position in the hierarchy as a predictor. All of this assumes that the context hierarchy was correctly interpreted; if not, then a quick redefinition needs to be made.
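
One way to picture that unequal weighting (an illustrative assumption, not a described implementation) is to score a new reference by its distance in the tree from the previous one, with confidence falling off faster than the distance grows:

def proximity_confidence(distance, base=0.9):
    # distance = number of tree edges between the previous reference and the
    # new one; squaring makes confidence fall off "not in equal measure".
    return base ** (distance * distance)

print(proximity_confidence(1))   # ~0.90: a quote from the same film
print(proximity_confidence(4))   # ~0.19: a quote from a different genre entirely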

Add to this the temporal dimension: the likelihood, over time, of the persistence of the context and its implied hierarchy. Maintaining the context indefinitely is easy, but not only is that abnormal in conversation, it can become recursive to the point of grinding to a dead halt. When you consider recency bias, it helps to see how cognitive processes, which are irregular, progressively downgrade and discard context trees as time passes and other contexts replace them; the speed of this varies over time as contexts change. These shifts in context are based on the interactions of two independent systems (people) with different experiences, contexts, active processes, and perspectives at any one moment.
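
As a sketch of what that time-based downgrading might look like in code (the half-life and purge threshold here are illustrative assumptions, not values from any real system):

import math
import time

HALF_LIFE_SECONDS = 300.0   # assumed: a context's relevance halves every 5 minutes
PURGE_THRESHOLD = 0.1       # assumed: forget a context once it decays below this

def decayed_relevance(initial, established_at, now=None):
    # Exponential decay of a context tree's relevance as the conversation moves on.
    now = time.time() if now is None else now
    age = now - established_at
    return initial * math.exp(-math.log(2) * age / HALF_LIFE_SECONDS)

def prune(contexts, now=None):
    # contexts maps a context-tree name to (initial relevance, time established);
    # trees that have decayed past the threshold are purged.
    return {name: (rel, t0) for name, (rel, t0) in contexts.items()
            if decayed_relevance(rel, t0, now) >= PURGE_THRESHOLD}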

A more obvious and difficult-to-solve problem is that these language AI systems are still based on statistical likelihoods and proximal word usage. Shifting contexts and subtle cues, which can be based on many different paradigms like emotion, personal experience, culture, socioeconomics, and more, can significantly drive the meaning of a series of statements, rendering statistical accuracy useless if not catastrophically wrong. Here, averaging can create results that are mathematically accurate but only apply in the natural world within small timeframes. This doesn’t prevent averaging from being used, because it is perceived to be (in the short term) far more cost effective.
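
A toy illustration of that averaging problem (the numbers are invented for this example): sentiment scores for a conversation whose context flips halfway through average out to near zero, a “neutral” reading that matches neither half of the exchange.

first_half = [0.8, 0.9, 0.7, 0.8]       # enthusiastic about the first topic
second_half = [-0.7, -0.8, -0.9, -0.6]  # context shifts; the new topic is negative

overall = sum(first_half + second_half) / 8
print(overall)  # 0.025 -- "neutral", though neither half of the talk was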

Speed of reaction and distraction.

As has already been noted, this is where the fake can break down. The best ways to avoid close scrutiny are speed and distraction. The shorter the interaction, the fewer points of disconnect are likely to be perceived.

Sometimes the simplest methods are motion and movement. This could be accomplished through rapid changes in the content or through a constant redirection of focus (whether rhythmic, to maintain a context, or erratic, to maintain a level of disorientation). Sometimes even a “not quite” subliminal approach is used: seemingly making the focus one aspect while constantly reinforcing a less obvious one.

This is not the only method of distraction. Another is the manipulative use of confirmation bias through anger or fear of a shared target. Strong relational negative emotions are far more likely to engender acceptance of the content. This is how most commercial news outlets present events if they are not already agenda-driven. Bots, along with NLUs, have been quite prolific and successful at this in social media.

All of these distractions, combined with some element of speed, deflect the perceiver from gaining sufficient information to recognize the deep fake.

How do we become aware?

Often, when we experience a deepfake that is presented as such, we can see “what’s behind the curtain,” as it were.

When viewed from an historical perspective, an effective deception is first predicated on a situation where the person(s) being targeted are not only unsuspecting, but unaware. It is much more difficult to fool someone who is prepared or even just suspicious of being deceived. The deception needs to:

  • Be fast/brief
  • Manipulate at an emotional level (ego, mutual anger, empathy)
  • Use suitable distractions (surprise, beauty, deflections of mistrust)
  • Know the target’s weak points (not merely to exploit them but also to avoid the strong areas of awareness)

It’s interesting that the best way to address these deceptions is with some very familiar old approaches. If it seems too good to be true, it’s probably not true. If it doesn’t match what you know, and what you know is based on real personal experience, then there is good reason to suspect it. There is much to be said for healthy skepticism and for awareness through intentionally slowing your personal reaction. If you want to believe something is true and you don’t maintain a certain level of skepticism and critical thinking (in the literal sense), then you are likely to fall prey to simple confirmation biases.

*Research at the Planck Institute demonstrated that combining two or more sensory inputs for a subject was required to initiate empathetic responses.
