Why Most Published iBorderCtrl Research Findings Are Likely to Be False

In a previous post, Dr. Vera Wilde detailed problems with the iBorderCtrl team's repeated claims regarding the tool's accuracy, its likely future accuracy, and what it is the tool does in the first place. That post detailed why iBorderCtrl is almost certainly not as accurate as it claims.1) Its accuracy numbers are themselves not an accurate way of representing research findings of this type.2) And this type of AI “lie detection” will probably never be as accurate as the iBorderCtrl team repeatedly claims theirs will be in the immediate future,3) nor is it doing what it claims to be doing.4) Another previous post briefly noted iBorderCtrl's susceptibility to a number of vulnerabilities that make its published research findings more likely to be false. This critique extends those two. All three posts criticize iBorderCtrl from a scientific perspective, with a focus on methods. There are also important criticisms to consider from the standpoints of current legal and ethical norms, and of possible future threats to human rights from these sorts of technologies. Those will be explored in upcoming posts…


In 2005, physician and leading evidence-based medicine proponent John Ioannidis published a paper in the peer-reviewed open-access journal PLoS Medicine entitled “Why most published research findings are false.”5) His argument is at least as true for most published “lie detection” research findings as it is for findings in medicine and psychology, fields whose publications tend to receive greater institutional scrutiny (including typical university Institutional Review Board review for human subjects research and typical journal peer review) than industry- and government-led lie detection research does. This means we should probably expect an even greater likelihood that most published research findings on lie detection are false.

Ioannidis argues:

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; when effect sizes are smaller; when there is a greater number and lesser preselection of tested relationships; where there is greater flexibility in designs, definitions, outcomes, and analytical modes; when there is greater financial and other interest and prejudice; and when more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings may often be simply accurate measures of the prevailing bias.

Ioannidis identifies six risk factors that lessen a finding's likelihood of being true. iBorderCtrl has all six. Here's how.

Point One: “a research finding is less likely to be true when the studies conducted in a field are smaller”

Studies with sample sizes of fewer than forty participants have come up multiple times in this line of research.

Specifically, the studies conducted thus far with iBorderCtrl and its predecessor “AI lie detector” technology Silent Talker are, by all appearances, smaller. For example, Silent Talker was trained and tested on data from a sample of thirty-nine volunteers. A subsequent test of iBorderCtrl at fake borders used only thirty volunteers before the pilot testing was rolled out in multiple EU countries with land borders to non-EU territory (Latvia, Greece, and Hungary).

Thirty-nine and thirty would be small sample sizes in an experiment, but small samples might matter less for assessing causality if the experimental method were in play. However, it does not appear to be. Rather, these seem to be smaller studies that rely on machine learning processes to build models that are then disproportionately accurate for people like the subjects they were trained on. For example, Silent Talker was disproportionately trained on data from male volunteers of European ethnic origin, and its reported accuracy was then higher for that subgroup than for others. Small sample size matters in most scientific fields, but it matters even more in fields like machine learning, where the experimental method is not being used to advance causal knowledge.
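
To make that mechanism concrete, here is a minimal Python sketch, using invented feature distributions and group sizes (only the overall sample size of thirty-nine echoes the published work), of how a model trained mostly on one subgroup can come out more accurate for that subgroup than for an under-represented one:

```python
# Illustrative simulation only: it shows how training data skewed toward one
# subgroup can yield a classifier that is more accurate for that subgroup.
# All feature distributions and numbers are invented; this is not the
# Silent Talker model or its data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, baseline):
    """Simulate n subjects: a binary 'deceptive' label plus noisy features.
    `baseline` is a group-level shift in the features unrelated to deception."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] + baseline, scale=1.0, size=(n, 5))
    return X, y

# Skewed training set: 35 subjects from group A, 4 from group B
# (39 in total, echoing only the scale of the published Silent Talker sample).
Xa, ya = make_group(35, baseline=0.0)
Xb, yb = make_group(4, baseline=0.8)
model = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# Evaluate on larger held-out samples from each subgroup.
for name, baseline in [("group A (well represented)", 0.0),
                       ("group B (under-represented)", 0.8)]:
    X_test, y_test = make_group(2000, baseline)
    accuracy = (model.predict(X_test) == y_test).mean()
    print(f"{name}: accuracy ~ {accuracy:.2f}")
```

The specific numbers do not matter; the pattern does. The model learns the well-represented group's baselines and then misreads the under-represented group's ordinary variation as signal.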


Point Two: “a research finding is less likely to be true… when effect sizes are smaller”

It's actually not clear what the effect sizes are here: of what, on what. The independent variables of interest are ostensibly the measures taken across multiple channels and time points that help the algorithm learn to detect, and then actually detect, deception or truthfulness; the dependent variable of interest is ostensibly the resulting probability that a subject is deceptive or truthful. But the research published so far states neither clearly enough for a reader to quantify a single effect size or even a range of effect sizes. The research simply is not well-defined enough to do that as it stands.

In a perfect world, all published scientific findings would come with an abstract answering each of Ioannidis's concerns in one sentence. This would be especially handy in updating the probability that findings are likely to be true when there is not really an effect size to evaluate. Effect sizes in statistics are quantified measures of the magnitude of the phenomenon of interest. Estimation statistics more generally include both effect sizes and the confidence intervals around them, and there is currently a debate about whether estimation statistics should replace tests of statistical significance due to concerns about things like p-hacking, concerns that are especially pertinent to observational research that seeks to find signal (predictive relationships) in lots of noise (huge amounts of data) through large numbers of manipulations. This is also called data dredging; machine learning can be a glorified version of it.

Effect sizes are about impacts, and confidence intervals are about the uncertainty around those impacts. So, bottom line: bigger effects with less uncertainty are what we want to see in science, and we need to see both to know what we are talking about.
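
For readers who want the concrete version, here is a minimal sketch, on invented data, of what it looks like to report an effect size (Cohen's d) together with a bootstrapped confidence interval:

```python
# A minimal sketch, on invented data, of what reporting an effect size
# together with its uncertainty looks like. Nothing resembling this
# appears in the Silent Talker or iBorderCtrl research to date.
import numpy as np

rng = np.random.default_rng(1)
truthful = rng.normal(0.00, 1.0, 200)   # some hypothetical response measure
deceptive = rng.normal(0.25, 1.0, 200)  # with a small hypothetical true difference

def cohens_d(a, b):
    """Standardized mean difference (pooled standard deviation)."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd

# Bootstrap a 95% confidence interval for the effect size.
boot = [cohens_d(rng.choice(truthful, truthful.size),
                 rng.choice(deceptive, deceptive.size))
        for _ in range(5000)]
low, high = np.percentile(boot, [2.5, 97.5])
print(f"Cohen's d = {cohens_d(truthful, deceptive):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```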

In research thus far on Silent Talker, there are findings presented as effect sizes that are arguably not effect sizes, while confidence intervals are not reported. There are no peer-reviewed papers out yet on iBorderCtrl that we know of.

The main problem with talking about effect size in machine learning contexts like iBorderCtrl is that there is no well-defined manipulation, experimental intervention, or cause that the researchers posit and seek to measure the effect of (or, more technically, hypothesize and seek to falsify the null hypothesis of). There is also no static confidence interval. Too many relationships are being tested, tinkered with, and retested to assert a single relevant number here that could even be evaluated.

Machine learning researchers are generally interested in accurate predictions. A good machine learning researcher might argue that her outcome of interest is accuracy in predicting a particular dependent variable outcome value, like truthful/deceptive. That's fine, but then accuracy numbers need to be reported accurately. Instead, as a previous post and footnote explained, iBorderCtrl researchers are repeatedly misrepresenting the tool's accuracy. 
In addition, we have to ask what effect sizes here would actually mean in the real world. In the case of screening tools like iBorderCtrl, even if larger effect sizes somehow translated into high accuracy rates, base rates matter. A lot. If proponents want to talk about accuracy instead of effect size, then we also have to talk about efficiency: how frequently will innocent people be unfairly implicated as liars and subjected to heightened scrutiny, or worse, denied their rights? And how will that affect what it's really like to cross the border?

The National Academy of Sciences polygraph report noted “base rate makes more difference than the level of accuracy…” And in mass screening contexts, when base rates of offenses of interest are likely to be (far) less than 10%, “the false positive index is quite large even for tests with fairly high accuracy indexes.” In other words, when most people are telling the truth about important things, even highly accurate mass screening tests will incorrectly identify large numbers of them as lying.6)
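
A back-of-the-envelope calculation makes this concrete. The sensitivity, specificity, and base rate below are assumptions chosen for illustration (and generous ones at that), not iBorderCtrl's published figures:

```python
# Back-of-the-envelope base-rate arithmetic. The sensitivity, specificity,
# and base rate below are illustrative assumptions, not iBorderCtrl figures.
def screening_outcomes(n_travellers, base_rate, sensitivity, specificity):
    deceptive = n_travellers * base_rate
    truthful = n_travellers - deceptive
    true_positives = deceptive * sensitivity          # liars correctly flagged
    false_positives = truthful * (1 - specificity)    # truthful people flagged
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv

tp, fp, ppv = screening_outcomes(
    n_travellers=100_000, base_rate=0.01, sensitivity=0.85, specificity=0.85)
print(f"Flagged and actually deceptive:        {tp:,.0f}")
print(f"Flagged but truthful:                  {fp:,.0f}")
print(f"Chance that a flagged traveller lied:  {ppv:.0%}")
# With these assumptions: ~850 correct flags, ~14,850 truthful travellers
# flagged, and a flag is right only about 5% of the time.
```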

The perfect storm of lower effect size and other risk factors for false findings combined with low base rates means that mass “lie detection” screenings like iBorderCtrl threaten not just liberty and efficiency, but also the very security they are intended to prioritize. The National Academy of Sciences' conclusion that polygraph screening threatens security arguably also applies here: “Overconfidence in the polygraph—belief in its validity that goes beyond what is justified by the evidence—presents a danger to national security objectives because it may lead to overreliance on negative polygraph test results. The limited accuracy of all available techniques of employee security screening underlines the importance of pursuing security objectives in ways that reduce reliance on employee screening to detect security threats.”

A final point on effect size, and on an argument proponents might try: the National Academy of Sciences polygraph report noted that proponents of “lie detection” screening tools often argue the tools have a deterrent effect. That would be great news for iBorderCtrl and systems like it, because it might mean the tool's real-world effect would be exactly as intended even if the science were bunk, the effect size were so small the data had to be statistically manipulated to generate measurable “effects” in the first place, and the published findings were false.

But there is a pronounced lack of scientific evidence for deterrence. So it's not only likely that the published research findings on iBorderCtrl will be false, but it's also likely that the real-world effects of the tool—if it's ever introduced into broader use on the basis of those findings—will be negative for the very efficiency of border crossing it's ostensibly meant to promote.

Similarly, the NAS polygraph report concluded that “From the information available, we find that efforts to use technological advances in computerized recording to develop computer-based algorithms that can improve the interpretations of trained numerical evaluators have failed to build a strong theoretical rationale for their choice of measures. They have also failed to date to provide solid evidence of the performance of their algorithms on independent data with properly determined truth for a relevant population of interest. As a result, we believe that their claimed performance is highly likely to degrade markedly when applied to a new research population and is even vulnerable to the prospect of substantial disconfirmation.” (p. 196-7)

Lack of evidence on incremental validity is also relevant. The NAS polygraph report noted: “We have not located any scientific studies that attempt directly to measure the incremental validity of the polygraph when added to any of these information sources. That is, the existing scientific research does not compare the accuracy of prediction of criminal behavior or any other behavioral criterion of interest from other indicators with accuracy when the polygraph is added to those indicators.”

Point Three: “a research finding is less likely to be true… when there is a greater number and lesser preselection of tested relationships”

iBorderCtrl's predecessor “AI lie detector,” Silent Talker, involved assessing the changing relationships of 37 channels over time. That is a greater number of tested relationships. The researchers also stated their intent to tinker with which relationships were assessed in the future, which indicates lesser preselection of tested relationships.

As the National Academy of Sciences polygraph report noted, computerized polygraph (lie detection) scoring does not actually have a proven advantage over analogue scoring.

Computerized polygraph scoring procedures have the potential in theory to increase the accuracy of polygraph testing because they improve the ability to extract and appropriately combine information from features of psychophysiological responses, both obvious and subtle, that may have differing diagnostic values. However, existing computerized polygraph scoring methods have a purely empirical base and are not backed by validated theory that would justify use of particular measures or features of the polygraph data. Such theory simply does not yet exist. — The Polygraph and Lie Detection (2003), National Research Council, Chapter 7, “Uses of Polygraph Tests”

The same is true for AI lie detection like iBorderCtrl. The basic problem of a lack of theoretical basis for the causal inferences at issue is compounded when more relationships are tested with less preselection. This is data dredging and the old methods expression “garbage in, garbage out” applies.
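
That worry is easy to demonstrate in general terms. The simulation below probes pure noise for "significant" channel-to-label relationships; the sample and channel counts echo the published Silent Talker figures, but everything else is invented, so this is a generic statistical illustration rather than a reconstruction of that analysis:

```python
# Generic data-dredging illustration: probe enough relationships in pure
# noise and some will clear p < 0.05 by chance alone. This is a statistical
# point, not a reconstruction of the Silent Talker analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_subjects, n_channels = 39, 37          # echoing the published sample/channel counts
labels = rng.integers(0, 2, n_subjects)  # random "truthful"/"deceptive" labels
channels = rng.normal(size=(n_subjects, n_channels))  # pure noise "measurements"

spurious_hits = 0
for j in range(n_channels):
    _, p_value = stats.ttest_ind(channels[labels == 0, j], channels[labels == 1, j])
    if p_value < 0.05:
        spurious_hits += 1

print(f"{spurious_hits} of {n_channels} pure-noise channels look 'significant' at p < 0.05")
# Expect a couple of spurious hits per run; allow re-testing, interactions,
# and changing definitions over time, and the chances to find noise multiply.
```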

However, while it's possible to assess in general terms this aspect of the research's propensity to produce false findings, non-transparency makes it difficult to address the issue more precisely. For example, under EU Freedom of Information legislation, Italian journalist Riccardo Coluccini asked for the “project Deliverables” documents listed on the iBorderCtrl website (details and documents here). All documents relating to ethics questions surrounding iBorderCtrl were withheld in full. Only a few technical documents were released, with many pages entirely redacted. So we do not currently know how many relationships are being tested, or what the exact degree of preselection has been or is now.

In a perfect world, all published scientific findings would have started with a research design filed in a publicly accessible database. Researchers could specify what data they intend to collect and what statistical analyses they intend to run. Things might change along the way, and they could address that in the results write-up. But having that design on file from the start would help show how many preselected tested relationships there really were, something we cannot evaluate now given the existing lack of public scientific infrastructure.

In reality, it would be surprising if the iBorderCtrl researchers published a full list of tested relationships on a timeline showing what was preselected, what was rejiggered and why, and what the AI decided to include and how (something it cannot give a reason for, because artificial neural networks are not sentient), etc. They haven't done it yet, and may never. If and when they do, the list is likely to be long, with less preselection of tested relationships than would indicate likely true results.

Point Four: “a research finding is less likely to be true… where there is greater flexibility in designs, definitions, outcomes, and analytical modes”

Researchers reporting results from some of the research on which iBorderCtrl's “lie detection” component is based stated that they might add more channels including

nonverbal and auditory cues (e.g. finger/hand, foot, torso movement, voice pitch), speech content related (e.g. number of ‘self references’) or some other cue, such as brain wave activity. — “Silent Talker: A New Computer-Based System for the Analysis of Facial Cues to Deception,” by Janet Rothwell, Zuhair Bandar, James O'Shea, and David McLean, p. 30

The researchers framed adding more channels as a way to improve classification accuracy without presenting any theoretical foundation for these very vague claims. That is concerning with respect to research methods in general, and it raises the likelihood that the resulting published research findings will be false in particular. And while it's true that machine learning algorithms often rely on flexibility in definitions, outcomes, and analytical modes, that flexibility is itself a red flag that a lot of machine learning is just glorified data dredging.

Point Five: “a research finding is less likely to be true… when there is greater financial and other interest and prejudice”

The iBorderCtrl research team has apparent financial and other interest and prejudice. In an effort to better quantify the stakes for Manchester Metropolitan University researchers, we have made and continue to pursue a Freedom of Information request regarding documents that would allow one to outline the relationship between MMU on the one hand and Silent Talker Ltd., European Dynamics SA and/or the iBorderCtrl consortium on the other. This includes (but is not limited to) emails, meeting minutes, contracts, financial or other arrangements and agreements of any kind. Specifically, we would like to receive copies of all documents that list (or allow us to compile a list of) all MMU students and staff who work or have worked at or for any of the entities mentioned above, including all details available at MMU of what the work entailed, what the deliverables were, and who paid for any work performed, among other things. This request is important to assessing financial and other interest and prejudice in iBorderCtrl research that bears on the findings' credibility, because H2020 project co-investigator Dr. James D. O'Shea is also on the Silent Talker patent (along with Bandar, McLean, and Rothwell), as well as being a co-founder of and consultant for Silent Talker.

Point Six: “a research finding is less likely to be true… when more teams are involved in a scientific field in chase of statistical significance.”

Previously, it seemed that “Of these six risk factors for false-positive research results, the first five apply to physiological deception detection research in general and iBorderCtrl in particular.” The last one did not seem to apply, which was no surprise: since theory and empirics do not support mass lie detection screenings in any context, there seemed to be little reason to expect many teams to be chasing significance in this field.

However, since then I have been contacted for a BBC article focusing mostly on other AI lie detection projects. So it seems that, indeed, more teams are involved in this field in a chase of statistical significance. Mea culpa.

Conclusion

The iBorderCtrl research team has yet to publish its findings. But if and when it does, those findings are more likely to be false due to the risk factors pinpointed by Ioannidis and detailed above. That is, their likelihood of being false is raised by the facts that the studies appear to be smaller; that the effect sizes do not really exist in any form that a proper understanding of statistical estimation could attach to these research designs; that there appears to be a greater number and lesser preselection of tested relationships (though non-transparency makes it hard to know for certain); that there appears to be greater flexibility in designs, definitions, outcomes, and analytical modes; that there is greater financial and other interest and prejudice; and that more teams are involved in the field in a chase of statistical significance.

1)
As previously discussed in greater detail, existing micro-expressions and other lie detection research is insufficient to support its use in mass screenings. In addition to that general problem, iBorderCtrl accuracy figures specifically are probably artificially inflated by the use of different environments, people, and interactions than are relevant to field contexts.
2)
One-number accuracy claims like the ones iBorderCtrl has been consistently throwing out (i.e., 76%) aren't generally how results are reported in this type of research (e.g., micro-expressions research). Instead, confusion matrices are used, because the outcomes aren't binary: there are many more than two possible outcome variable values. And even if the outcomes of interest could be regrouped as binary (which would not be accurate, and therefore is not possible in this context as a matter of good science), researchers would still need to report those outcomes in a 2×2 Bayesian table to accurately reflect what is going on with accuracy. For examples of such tables in this context, please see the ones in this essay.
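For illustration, here is what even the simplified (and, per the above, scientifically inadequate) binary version of such reporting reveals, using invented counts:

```python
# Hypothetical 2x2 table for a binary screening decision, showing the
# quantities that a single accuracy number hides. All counts are invented.
tp = 80    # deceptive and flagged (true positives)
fn = 20    # deceptive but cleared (false negatives)
fp = 220   # truthful but flagged (false positives)
tn = 680   # truthful and cleared (true negatives)
total = tp + fn + fp + tn

print(f"Accuracy:                        {(tp + tn) / total:.0%}")
print(f"Sensitivity (liars caught):      {tp / (tp + fn):.0%}")
print(f"Specificity (truthful cleared):  {tn / (tn + fp):.0%}")
print(f"Precision (a flag is correct):   {tp / (tp + fp):.0%}")
# Here 'accuracy' is 76%, yet nearly three of every four flagged people
# are actually truthful.
```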
3)
The most obvious reason for this is that iBorderCtrl's lie detection component is based on a lie detection tool, Silent Talker, that was originally trained on disproportionately homogeneous samples skewed toward participants “of European origin.” We know from other AI, biometrics, and related research that results from a small, relatively homogeneous population don't usually generalize well to a larger, more diverse one. And that really matters in the case of iBorderCtrl, since it's ostensibly intended for use on non-EU citizens. So the target population is distinctly different from the population that the tool was trained on, and other research shows that will probably degrade accuracy in the actual population of interest.
4)
The iBorderCtrl team repeatedly claims their tool identifies biomarkers of deception. But leading scientific experts on the matter have long agreed the core scientific problem with lie detection is that there's no such thing as biomarkers of deception—there is no unique lie response to detect. So unless this team has a big scientific discovery to announce, on the order of a Nobel-winning advance in psychophysiology, their representation of their research is itself fundamentally dishonest.
5)
There are, of course, several Ioannidis-style heuristics out there for assessing what is good science. One of the best is Robert Abelson's MAGIC criteria from his Statistics as Principled Argument (1995). Abelson suggests the criteria for a persuasive statistical argument are Magnitude (bigger effect sizes are more compelling than smaller ones), Articulation (more precise statements are more compelling than imprecise ones), Generality (more general effects and applications that would interest a broad audience are more compelling than less general ones that wouldn't), Interestingness (more interesting and surprising effects are more compelling than less interesting or merely confirmatory ones), and Credibility (credible claims are more compelling than incredible ones). Obviously, the M here overlaps with Ioannidis's Point One. But after this, notice that the statistician (Abelson) is less concerned with the methods and more concerned with the humanity, while the doctor (Ioannidis) is more concerned with methods throughout. Is that because the practice of science and medicine so degraded between 1995 and 2005 that methodologists have to crack the whip better and harder to keep pace with what is at worst fraud or corruption and at best merely bad science? Or is there something about statistical methods that can seem paramount to outsiders, while statisticians tend to be more worried about things like the writing, whether the research is interesting, and other sorts of more aesthetic or social concerns?
6)
This application of Bayes' rule has been explored in depth in the NAS report on polygraphs, as well as in the refugee screening context here. This passage gives a fuller explanation of the NAS's position on base rates and polygraph mass screenings at the National Labs: “Given the very low base rates of major security violations, such as espionage, that almost certainly exist in settings such as the national weapons laboratories, as well as the scientifically plausible accuracy level of polygraph testing, polygraph screening is likely to identify at least hundreds of innocent employees as guilty for each spy or other major security threat correctly identified. The innocent will be indistinguishable from the guilty by polygraph alone. Consequently, policy makers face this choice: either the decision threshold must be set at such a level that there will be a low probability of catching a spy (thereby reducing the number of innocent examinees falsely identified), or investigative resources will have to be expended to investigate hundreds of cases in order to find whether there is indeed one guilty individual (or more) in a pool of many individuals who have positive polygraph results. Although there are reasons of utility that might be put forward to justify an agency’s use of a polygraph screening policy that produces a very low rate of positive results, such a policy will not identify most of the major security violators. In our judgment, the accuracy of polygraph testing for distinguishing actual or potential security violators from innocent test takers is insufficient to justify reliance on its use in employee security screening in federal agencies.”