Biomarkers of Scientific Deceit
The iBorderCtrl system contains a Do-It-Yourself “lie detector”: travellers talk to an avatar on their own screens while secretive machine-learning AI in some server rack watches their facial expressions through their own cameras to judge whether they are lying. Is that even real? In this blog post, Dr. Vera Wilde examines some of the claims in detail.
The iBorderCtrl consortium (led by European Dynamics) repeatedly claims that its ADDS/Silent Talker “lie detector” component uses micro-expressions to detect deception, even going so far as to claim that the system will identify and classify people based on “biomarkers of deceit.” But that claim is (grossly) insufficiently evidence-based. And because it is such a central claim, it is a big red flag indicating that iBorderCtrl is engaging in pseudoscience for profit.
Just to be clear: there is absolutely no scientific basis for the assertion that unique “biomarkers of deceit” exist, or are about to be discovered after centuries of fruitless pursuit. Rather, the solid scientific consensus on physiological deception detection is that we can't do it. “Lie detection” doesn't exist, because there is no unique lie response to detect.
Micro-expressions to Detect Deception
the Automatic Deception Detection System (ADDS)* performs, controls and assesses the pre-registration interview by sequencing a series of questions posed to travellers by an Avatar. ADDS quantifies the probability of deceit in interviews by analysing interviewees [sic] non-verbal micro expressions. This, coupled with an avatar, moves this novel approach to deception detection to the pre-registration phase resulting in its deployment without an impact to the time spend [sic] at the border crossing by the traveller. The avatar also allows for a [sic] consistent and controllable stimuli across interviews in terms of both the verbal and non-verbal [sic] from the direction of the avatar agent to the traveller personalized to gender and language of the traveller, reducing variability compared to human interviewers and potentially improving the accuracy of the system. Despite the large use of biometrics on security applications including in border control with the advent of digital passports that contain fingerprints [sic] digital images and physical characteristics of individuals, a traveller with ill intentions using [sic] own documents, biomarkers would not reveal their attempted deceit. iBorderCtrl deploys well established as well as novel technologies together to collect data that will move beyond biometrics and onto biomarkers of deceit.1)
Using micro-expressions to detect deception is most commonly associated with Paul Ekman's work.2) Micro-expressions research remains in its infancy, with little peer-reviewed scientific work published to date.3) Ekman's work is poorly regarded in the relevant scientific and legal communities.4) Existing micro-expressions research more broadly tends to be problematic in a number of ways explored below. Automating it does not solve its main problems.
But first and foremost, the main claim embedded in iBorderCtrl's use of micro-expressions—that they can be used to detect deception—lacks sufficient theoretical rationale or empirical proof. There is simply no reason to suspect that fleeting facial expressions (or vocal changes, gestures, postures, or physiological responses) uniquely correlate with deception. No behavioral or physiological response has been shown to “tell” if we are lying. Rather, behavioral cues often thought to be such “tells,” such as throat-clearing, swallowing, looking away, and nervous tics of various sorts, can be just that—nervous tics, or expressions indicating a range of emotions or internal states ranging from anxiety, fear, excitement, and embarrassment to fatigue, pain, distraction due to internal or external intrusive stimuli (such as thoughts or noises), or processes relating to hunger, thirst, or digestion. There is no established one-to-one correspondence between a given emotional state, or disconnect between macro- and micro-expressed cognitive, emotional, or physiological states, and truthfulness or deception. Therefore it is incorrect to assert, as the iBorderCtrl team explicitly and consistently does, that there are known “biomarkers of deceit.”5)
The conflation of various emotional states (or behaviors in response to them) with deception is theoretically unmoored. The conflated states could include not just emotions like anxiety or signs of stress, but also efforts to refrain from strongly conveying one's true emotional state to others. But lots of people are nervous about travel or about being questioned for innocuous reasons, or are upset about other things in their lives—and they know better than to intentionally telegraph those irrelevant facts to an avatar or border guard.
In other words, we lack a theory for why the relevant behavioral and physiological measures should correlate uniquely with lying, as opposed to also correlating with other context-relevant states—such as fear of being labeled deceptive regardless of truthfulness (in lie detection generally), or fear of the consequences of being denied entry to a country when you have travel plans (in iBorderCtrl specifically). This is the project's core theoretical problem. It lacks a basis for inferring deception from whatever the computer says indicates deception (e.g., supposed signs of emotional states, or mismatches between emotional macro- and micro-expressions). Such a theoretical basis does not exist in the generally accepted scientific literature, and none has been proposed publicly by the developers.
In addition to this persistent core theoretical problem in physiological deception detection, micro-expressions research more broadly has numerous weaknesses that would in any case limit its utility in “lie detection” and other real-world decisions affecting human rights and security.6)
Existing Micro-expressions Research Is Insufficient to Support Its Use in Mass Screenings
iBorderCtrl's Automatic Deception Detection System (ADDS) is an Artificial Intelligence technology that uses Artificial Neural Networks to ostensibly detect deception—after first detecting and then classifying micro-expressions. But the cutting-edge research on micro-expressions suggests the first steps here are as problematic as the last.
The usual steps for facial expression recognition are: detection of facial expression, classification as macro or micro, classification as spontaneous or posed, and classification of micro-expression into emotion categories.7) Sometimes that last step means classification into Ekman's six categories—disgust, anger, fear, happiness, sadness, and surprise. Sometimes a seventh is added—contempt (aka hatred). Sometimes the categories are collapsed, for instance, into positive (only happiness), negative (most of the others), and surprise.
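To make that sequence concrete, here is a minimal Python sketch of the four-step pipeline. Everything in it—function names, thresholds, the stub logic—is our illustrative assumption for exposition, not iBorderCtrl's (undisclosed) implementation:

```python
# A minimal sketch of the four canonical steps described above. All
# names and stub logic are illustrative assumptions for exposition;
# iBorderCtrl's actual implementation is undisclosed.
from dataclasses import dataclass
from enum import Enum

class Emotion(Enum):
    DISGUST = 1
    ANGER = 2
    FEAR = 3
    HAPPINESS = 4
    SADNESS = 5
    SURPRISE = 6
    CONTEMPT = 7  # the sometimes-added seventh category

@dataclass
class LabeledExpression:
    start_frame: int
    end_frame: int
    is_micro: bool        # step 2: macro vs. micro
    is_spontaneous: bool  # step 3: spontaneous vs. posed
    emotion: Emotion      # step 4: emotion category

def detect_expressions(frames):
    """Step 1 (stub): spot candidate expression segments in the video."""
    return [(0, 12)]  # dummy: one segment spanning frames 0 through 12

def classify_micro(segment):
    """Step 2 (stub): micro-expressions last well under half a second."""
    start, end = segment
    return (end - start) <= 15  # assuming ~30 frames per second

def classify_spontaneous(segment):
    """Step 3 (stub): spontaneous vs. posed."""
    return True

def classify_emotion(segment):
    """Step 4 (stub): assign one of the canonical categories."""
    return Emotion.FEAR

def analyze(frames):
    results = []
    for seg in detect_expressions(frames):
        results.append(LabeledExpression(
            start_frame=seg[0],
            end_frame=seg[1],
            is_micro=classify_micro(seg),
            is_spontaneous=classify_spontaneous(seg),
            emotion=classify_emotion(seg),
        ))
    return results

print(analyze(frames=list(range(30))))
```

Note that each step feeds the next, so errors compound: a spotting mistake at step one poisons every classification downstream. That matters, given the accuracy figures below.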
The existing scientific literature on automated detection and classification of micro-expressions suggests it has poor accuracy. This research also shows what best practice in the field looks like—and it looks different from what the iBorderCtrl/Silent Talker team is doing in several ways. Most importantly, on the theoretical side, top micro-expressions researchers do not make unsubstantiated leaps from what they are doing to being able to detect deception. And on the empirical side, they are much more transparent about their research methods and data.
A recent survey of micro-expressions research found that in some recent studies, micro-expressions were only spotted at a 74% true positive and 44% false positive rate.8)
Another paper, on automatically analyzing spontaneous micro-expressions, cites detection accuracy rates ranging from 58.45% to 65.55% and recognition accuracy rates of 35.21% to 52.11%.9) Reliably classifying micro-expressions as distinct from other facial movements (including macro-expressions, blinking, and head tilting) is just a first step towards classifying micro-expressions into emotion categories. But even at this detection stage, accuracy rates are fairly low.
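To put those spotting rates in perspective, here is a back-of-the-envelope calculation, assuming for illustration that 5% of candidate facial movements are genuine micro-expressions (that prevalence is our assumption, not a figure from the papers):

```python
# Back-of-the-envelope: what 74% true positives and 44% false positives
# mean for spotting. The 5% prevalence is our assumption for
# illustration; micro-expressions are rare among all facial movements.
tpr = 0.74         # real micro-expressions correctly spotted
fpr = 0.44         # other facial movements wrongly flagged
prevalence = 0.05  # assumed share of movements that are micro-expressions

true_positives = tpr * prevalence
false_positives = fpr * (1 - prevalence)
precision = true_positives / (true_positives + false_positives)
balanced_accuracy = (tpr + (1 - fpr)) / 2

print(f"precision: {precision:.1%}")                  # ~8.1%
print(f"balanced accuracy: {balanced_accuracy:.1%}")  # 65.0%
```

At any plausibly low prevalence, most flagged movements are not micro-expressions at all, and balanced accuracy sits only modestly above coin-flipping—before classification into emotion categories even begins.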
It's hard to say more about these rates in relation to iBorderCtrl because of the project's extreme nontransparency, a problem we'll say more about later. But it's also worth acknowledging that such problems plague lie detection research generally. Federal polygraph programs are notoriously secret. They refused to release data on bias and efficacy to the National Academy of Sciences (NAS) in response to repeated requests while NAS researched and wrote its Congressionally commissioned report on the scientific evidence on polygraphs. They then also refused to release similar data to other researchers who pursued it through the courts.
There also tends to be a significant methodological quality problem in lie detection research, as lie detection's academic critics and government evaluators have long noted. In the case of micro-expressions research, as the literature survey authors note, some datasets are unpublished or defunct, and details of how results are calculated are often not disclosed. The historical commonality of these problems in lie detection research highlights one of the reasons this terrain itself should give us pause. When you hear “lie detection,” history suggests you should think “fraud.”
Low accuracy rates for spotting micro-expressions in the first place drop even lower when we try to classify them in addition to merely spotting them—and the resulting classifications also become harder to validate as correct. In this context, there's an additional accuracy-rate problem. To get a single accuracy number, researchers inevitably combine hit rates across (varying) emotion categories, on top of combining true positive and true negative rates. That's a lot of combining of different measurements, and the final accuracy rate should not really be represented as one number, because it does not reflect a singular construct in the data. It is more informative to report confusion matrices of performance by emotion category, along with the total numbers of true and false positives and negatives. Indeed, that seems to be standard practice in the micro-expressions literature.
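A toy example makes the point. The confusion matrix below is invented for illustration—it is not from any real micro-expressions dataset—but it shows how a single headline accuracy number can hide near-chance performance on most emotion categories:

```python
# Invented toy confusion matrix -- not from any real dataset -- showing
# how one headline accuracy figure hides per-emotion performance.
import numpy as np

emotions = ["happiness", "surprise", "fear", "sadness", "anger", "disgust"]
# Rows = true emotion, columns = predicted emotion.
confusion = np.array([
    [90,  3, 2, 2,  2,  1],  # happiness: over-represented, well recognized
    [ 4, 28, 3, 2,  2,  1],  # surprise
    [ 3,  6, 8, 4,  5,  4],  # fear: barely better than chance
    [ 4,  3, 5, 9,  5,  4],  # sadness
    [ 3,  3, 5, 4, 11,  4],  # anger
    [ 3,  2, 4, 4,  5, 12],  # disgust
])

overall = np.trace(confusion) / confusion.sum()
per_class = confusion.diagonal() / confusion.sum(axis=1)

print(f"overall accuracy: {overall:.1%}")  # ~61% -- one flattering number
for name, acc in zip(emotions, per_class):
    print(f"  {name:>9}: {acc:.1%}")       # happiness 90%, fear only ~27%
```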
Ironically, micro-expressions research has not really figured out what to do with mixed emotions. A field of study built on recognizing that we can leak information about anguish while putting on a happy face has not agreed on a way to code happy-sadness, or anger-disgust, or any other combination of emotions. But what if, rather than being like numerical values that cannot coexist, emotions are more like colors that can?
There's also not broad agreement on who should be the source of information on ground truth in micro-expressions—a human interpreter, or the person feeling the feeling. There are good theoretical reasons we might want to take both the internal experience and the external information received into account when coding emotions, especially fleeting ones that might be expressing unconscious feelings. But there is not broad agreement about how to deal with this foundational problem, underscoring the shaky theoretical basis that underpins the contentious leap from micro-expressions to deception detection.
These shaky foundations get even shakier when we dig into iBorderCtrl's specific accuracy claims and expectations about the future accuracy of their deception detection system.
Accuracy Rising?
iBorderCtrl claims they can get their accuracy up from 76% in the lab to 85% in the real world.10) To be absolutely clear, there is no public evidence of a 76% accuracy rate here. There is no public evidence of any accuracy rate at all. The 76% claim, which is not based on available, acceptably described scientific evidence, appears to be a conceptually incoherent amalgamation of different accuracy figures across emotion categories and error types. But set aside the importance of transparency and coherent accuracy-rate reporting in science in the public interest, and focus exclusively on the developers' repeated public claim that their tool's accuracy should rise significantly in the field: it seems much more likely to go the other way. The deception detection system's accuracy will likely plummet in the real world. Here are a few reasons why.
Different Environments
Testing of iBorderCtrl's deception detection software ADDS, and of its predecessor Silent Talker, was conducted under lab conditions. The controlled nature of lab conditions, versus the uncontrolled nature of field conditions, would generally be expected to artificially raise accuracy rates, resulting in decreased accuracy in the field. Lab results generalize poorly partly because measurement technologies (such as people's phone cameras), lighting conditions, head positions (such as head tilting), and facial features (such as glasses or beards) are more standardized in the lab and more divergent in the field. And this is just considering a very narrow definition of environment.
Different People
Going from a small sample relatively similar to the people who made up the initial micro-expressions database—and on whom researchers thus initially trained the AI—to a larger, more heterogeneous sample under real-world circumstances generally causes the performance of this type of AI to plunge. iBorderCtrl developers publicly and repeatedly state that they hold the opposite expectation of their technology, without giving a reason for that expectation. The history of this type of technology suggests the exact opposite.
For one thing, research conducted by the original Silent Talker team identified gender differences supposedly indicating lying. If true, this finding would suggest that the resultant technology would discriminate in scoring on the basis of gender, raising possible problems for non-binary and trans persons. It would also suggest different algorithm accuracies across gender categories. Would it be legally and normatively acceptable in contemporary European societies if the tool continued on its earlier trajectory of being more accurate for men than for women, having been initially trained on European males?
Similarly, facial biometrics technologies tend to identify different racial and ethnic groups at different accuracy rates. These technologies are likely similar to iBorderCtrl's ADDS/Silent Talker in several important ways. For instance, facial micro-expression spotting often involves the same initial steps as biometric identification (e.g., facial landmark detection and tracking, face masking and region retrieval). These sorts of biases tend to be specifically problematic in the neural networks context, where small datasets and imbalanced distribution of samples limit accuracy and generalizability. They predispose resultant models to over-fitting, a bias that makes them look highly accurate in a given small sample—and then fall apart, with accuracy plummeting in the bigger world outside that sample.
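The over-fitting dynamic is easy to demonstrate. The following sketch uses purely synthetic data with labels that carry no signal at all (our construction, unrelated to any real micro-expressions dataset); a small neural network still "learns" the tiny, homogeneous training sample nearly perfectly, then collapses to chance on a broader, shifted population:

```python
# Synthetic demonstration (invented data, illustrative only): a small
# neural network trained on a tiny, homogeneous sample looks nearly
# perfect in-sample and collapses on a broader, shifted population.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# "Lab" sample: 40 subjects from one narrow distribution. The labels are
# random -- there is no real signal to learn.
X_lab = rng.normal(loc=0.0, scale=1.0, size=(40, 10))
y_lab = rng.integers(0, 2, size=40)

# "Field" sample: a shifted, more heterogeneous population.
X_field = rng.normal(loc=0.7, scale=1.8, size=(400, 10))
y_field = rng.integers(0, 2, size=400)

model = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000,
                      random_state=0)
model.fit(X_lab, y_lab)

print(f"lab (training) accuracy:   {model.score(X_lab, y_lab):.0%}")      # ~100%
print(f"field (held-out) accuracy: {model.score(X_field, y_field):.0%}")  # ~50%
```

The network simply memorizes the small sample. In-sample accuracy on a small, homogeneous group proves very little about performance anywhere else.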
The AI that forms the basis for iBorderCtrl's deception detection software was originally trained on European men. While typical of micro-expressions, lie detection, and many other fields of study, this sort of demographic homogeneity has been identified as a frequent cause of racial/ethnic bias in technology: disproportionately white male developers train the algorithm on themselves and people who look (and act) like them, and then it works better for that group than for others. This seems to be exactly what happened in the development of iBorderCtrl. “The CA [total Classification Accuracy] for European men is higher than for person types not trained upon,” the developers note in an undated paper on Silent Talker, the technology on which iBorderCtrl's deception detection AI is based.11) It seems likely that the built-in bias problems in this technology remain as it rolls out in piloting across several EU border crossings—crossings that see a wide array of travelers who are not European men.
That sort of racial/ethnic bias is only one of many ways the artificial legacy of iBorderCtrl's lab development could cause its accuracy to drop in the field, where it must deal with different people. Another is that people from different backgrounds might express emotion differently for cultural reasons, resulting not just in randomly decreased accuracy for travellers who are not European men, but possibly also in systematic biases against other groups.12)
Another is that the structure of micro-expressions research may itself create biases against vulnerable groups, by skewing what the machine knows how to match in favor of people with positive affect. The reason is that spontaneous happiness is easier to elicit under lab conditions than other emotions, like fear or sadness (Coan and Allen, 2007). So micro-expressions databases will tend to be better at identifying happiness, because they contain more samples of it per emotion and per subject. This might bias deception detection systems reliant on micro-expressions against people who are not happy. Meanwhile, negative emotions like fear might be easier to elicit under real-world conditions, so the distribution of emotions under lab and field conditions is likely to differ.
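Here too a synthetic sketch (invented data and class counts, assumed purely for illustration) shows the mechanism: train a classifier on many happiness samples and few fear samples, and recall for the under-represented emotion craters on a balanced test set:

```python
# Synthetic sketch (all counts and distributions invented): training on
# many "happiness" samples and few "fear" samples skews recall toward
# the over-represented emotion on a balanced test set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def clips(n, center):
    """Stand-in for feature vectors extracted from expression clips."""
    return rng.normal(loc=center, scale=2.0, size=(n, 5))

# Training set: 200 happiness clips but only 20 fear clips.
X_train = np.vstack([clips(200, 0.0), clips(20, 1.0)])
y_train = np.array([0] * 200 + [1] * 20)  # 0 = happiness, 1 = fear

model = LogisticRegression().fit(X_train, y_train)

# Balanced test set: 100 clips of each emotion.
X_test = np.vstack([clips(100, 0.0), clips(100, 1.0)])
y_test = np.array([0] * 100 + [1] * 100)

pred = model.predict(X_test)
for label, name in [(0, "happiness"), (1, "fear")]:
    recall = (pred[y_test == label] == label).mean()
    print(f"{name} recall: {recall:.0%}")  # fear recall suffers badly
```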
It would be obviously, logically wrong to equate consistent happiness-expressing with truthfulness. It would be even less logical to equate inconsistent happiness-expression, negative emotional expressions, or apparent attempts to hide negative emotions with deception. Yet it is unclear how iBorderCtrl works without doing just that.
There's a lot more to say about bias problems in this type of technology in the context of racial profiling, as well as reasons to suspect that vulnerable groups are likely to be disproportionately and erroneously flagged as deceptive by these sorts of technologies. But we'll save those topics for later posts. Suffice it to say that it's been broadly recognized in the tech community for many years that the tools we build are not somehow magically neutral just because they are tools. To the contrary, technology is just as fraught with normative problems as the rest of society; tools are just as prone to bias as their creators. Tech doesn't give you a free pass on being human.
Different Interactions
Artificiality—a lack of realness that tends to plague lab experiments especially—is a huge problem in micro-expressions research. Many relevant databases use posed expressions. When that was widely criticized as too artificial to plausibly generalize to the real world, it became more common to use people's reactions to emotional movie clips. But these data, too, lack ecological validity, in the sense that they seem likely to differ from people's natural facial expressions in the real world in important ways. For instance, we may be more expressive when making eye contact with a loved one or a threat, because we may have evolved to excel at emotional communication in face-to-face social interactions with mates and threats. And we may express anxiety, fear, and fear of being perceived as deceptive quite differently when the stakes are real and potentially high than when there are low or no meaningful stakes at all.
But we don't know if or how this particular artificiality problem plagues iBorderCtrl, because we don't know what database they used for their micro-expressions component—or if they made their own, how. We don't know if they've done cross-database validation. There is simply so much that we don't know, underscoring the nontransparency preventing independent researchers like us from assessing the work of researchers who have huge financial stakes in this technology. Even with the best of intentions on all sides, secrecy and conflict of interest should not mix.
What we do know about iBorderCtrl is that its deception detection component has repeatedly used mock-crime studies as the basis for its development, both in its latest incarnation as ADDS and in its earlier incarnation as Silent Talker. In mock-crime studies, one group is typically assigned to commit a transgression, such as a petty theft, and the other is not. In lie detection mock-crime studies, both groups are then interviewed, and the system is evaluated on how well it discriminates the transgressors from the innocent.
Mock-crime studies are poorly regarded in the scientific community because they are highly artificial. Results from such studies are less likely to generalize to real-world contexts than results from research conducted under more realistic conditions. For one thing, the stakes in the lab tend to be low compared to those in the real world, so participants feel less threatened by the possibility of being judged deceptive.
Other lie detection research has shown that stakes in this sense matter for the psycho-physiological responses at issue—so much so that the accuracy rate of deception detection can drop precipitously when more realistic threat is introduced. This was the case in Patrick and Iacono's (1989) constructive replication of a Raskin and Hare (1978) polygraph prison study that had reported a hit rate of 96%. Patrick and Iacono's research added a threat manipulation, and found a hit rate of only 72%.13) Both studies were typical mock theft studies in the mock-crime genre apparently used in iBorderCtrl/Silent Talker and other lie detection research. But in the replication, inmates were told that if more than 21% of them failed the polygraph, then no one would get the study payment of $20, and a list of who failed would be circulated so that they could be held accountable for everyone missing out on the money. Of course all participants got paid and no such list was circulated. But the realistic contingency threat caused such different physiological responses that deception detection accuracy plummeted.
Lab studies used in iBorderCtrl's development could have looked to critical deception detection research in order to better structure their own. For instance, they could have attempted to incorporate realistic contingency threat components into mock-crime studies. They also could have used more appropriate study participants than their own employees. The science suggests they may well have gotten different results if they had.
Different Outcome Variables
iBorderCtrl co-developer and Silent Talker co-creator James O'Shea has argued that he and his colleagues have collected evidence that the tool works across cultures and genders.14) But the outcome variable of interest in the cited research is comprehension, not deception or its detection. In science, claims of generalizability are limited to the contexts in which relevant evidence actually establishes that generalizability.
In Sum
Overall, there are many reasons to doubt the accuracy of judgments of truthfulness or deception based on micro-expressions. While iBorderCtrl's developers claim that the tool's accuracy should grow in the field, there are numerous reasons to suspect it may plummet instead. And the project's extreme nontransparency, combined with the history of problematic accuracy and validity claims in lie detection more broadly, suggests that it may not be possible for independent scientists to evaluate these claims against the facts anytime soon.
As Stephen Fienberg, who chaired the NAS committee behind its Congressionally commissioned polygraph report, put it of TSA's SPOT behavior detection program:

“They've been used to test for things that have happened… They’ve now been gathered up… to stop people at security points in airports. And I just don’t see that we have the science to support that. And every time I ask about that, or suggest, as the committee on privacy and terrorism mentions, suggests, that there be systematic, independent evaluation of such technologies before they’re implemented—we’re constantly told no, don’t worry. We’re evaluating them, and by the way, you don’t have enough security clearance to actually see the results. And every time somebody tells me that, I get more suspicious. Because of the context of the polygraph, every time I ask—there really is a study, you just haven’t seen it yet. And then somebody who had the clearance would ask, and then, it didn’t exist. I don’t think there’s research here yet… It’s not that these might not work, but these folks are out there hyping all of this… pulled together in trailers and special arrangement. They pulled aside 150,000 people with behavioral observation techniques. And they haven’t caught a terrorist. So, how are we suppose [sic] to assess? That’s Project SPOT.”

The Government Accountability Office (GAO) has repeatedly echoed Fienberg's assessment that SPOT is insufficiently evidence-based and that TSA never produced validity or efficacy data of sufficient quality to support its introduction or continued use. In 2010, GAO issued a report warning that “TSA did not validate the science supporting the program or determine if behavior detection techniques could be successfully used across the aviation system to detect threats before deploying the SPOT program.” In other words, TSA rolled the program out without sufficient scientific support or efficacy data. In a 2013 report, GAO reiterated these concerns, reviewed meta-analyses suggesting the accuracy of such a system would be about 54% (slightly better than chance), and recommended TSA limit future funding for such “behavior detection” activities. In a 2017 report, GAO noted TSA had followed that recommendation and reduced funding for its behavior detection activities. But GAO's previous report on SPOT had also asked TSA to show the evidence supporting a revised behavioral indicators list for use in those activities. Apparently in response, TSA cited an array of news articles, articles citing secondary sources, and articles with primary data that did not meet generally accepted scientific standards. Thus GAO entitled its 2017 report “TSA Does Not Have Valid Evidence Supporting Most of the Revised Behavioral Indicators Used in Its Behavior Detection Activities.” But it made no further recommendations (e.g., kill the program because it's dangerous pseudoscience).

Alarmed by repeated warnings from scientists, GAO, and people reporting harassment and profiling, the American Civil Liberties Union (ACLU) sued to obtain TSA documents on SPOT. It has historically been impossible to obtain federal “lie detector” program data on bias or efficacy, even through the courts, so it's unsurprising that ACLU could not obtain data to assess SPOT for systematic bias. Nonetheless, the resulting document releases were enough to lead them to warn that the program creates a “license to harass.”
Compare this claim from another developer team:

“Broad variations in personality and mood were expected to be present among the applicant population. Thus, anomalous responses needed to be identified only based on an individual baseline—no individual measurements would be compared to a population norm. In this way, simple nervousness or stress about being in an interview would not cause an individual response to be flagged.”

Both developer teams are making unsubstantiated claims here—about conflating stress with deception, and about whether a baseline is or isn't required. If your facial expressions, head and bodily gestures, and physiological responses (micro or macro, steady or in response to baselines or prompts) are the measurements of interest, and your sensitivity to autonomic arousal is heightened and/or your ability to recover from stress is diminished because you're nervous, sick, or neurodiverse, then correcting for the baseline would not necessarily attenuate bias. For example, bias might still result from parasympathetic nervous system under-activation and/or sympathetic nervous system over-activation keeping your responses from normalizing after a stressor that occurs after the baseline is taken. So it is unclear why using, or refraining from using, a baseline measurement would produce higher accuracy for either system. There is insufficient theoretical rationale or empirical basis for either of these conflicting claims.
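A toy numeric sketch (all values invented) shows why. Subtracting an individual baseline does not neutralize differences in how strongly people react or how slowly they recover, so baseline-corrected scores can still single out the anxious, sick, or neurodiverse traveller:

```python
# Toy illustration (all numbers invented). "Arousal" stands in for any
# autonomic measure: heart rate, skin conductance, facial-movement rate.

def baseline_corrected(baseline_arousal, probe_arousal):
    """Score a response relative to the subject's own baseline, as both
    developer teams describe."""
    return probe_arousal - baseline_arousal

# Typical traveller: near resting level at baseline, mild probe reaction.
typical = baseline_corrected(baseline_arousal=70, probe_arousal=78)

# Anxious or neurodiverse traveller: a stressor *before* the probe (the
# queue, the interview itself) leaves arousal elevated, and slow
# autonomic recovery keeps it from normalizing between questions.
slow_recovery = baseline_corrected(baseline_arousal=88, probe_arousal=101)

print(f"typical traveller:       {typical}")        # 8
print(f"slow-recovery traveller: {slow_recovery}")  # 13 -> still scores
# as more "anomalous", though the gap reflects stress physiology, not
# deception.
```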