a particular feature may be a source of bias. However, stylometric analyses also require a series of design decisions: biases can creep in with regard to the selection of comparison materials, the identification of features to be elicited, and the statistical or other methods of prediction applied. Argamon (2018) usefully discusses the various possible decision points and pitfalls for computational forensic authorship analysis, and every decision point is also a point at which conscious or unconscious bias can enter an analysis.
Second, unlike with the analysis of a fingerprint or a DNA sample, in authorship analyses it is often not possible to isolate the sampled material from the story of the investigation. In fingerprint examination, it may be possible, as advised by the UK Forensic Regulator, to avoid telling an examiner much of the contextual information about a crime, but in linguistic analysis it is often the case that the texts themselves contain part of the wider story of the investigation. This unavoidable knowledge of contextual information clearly gives rise to potential bias.
Third, in authorship analysis, there is strong acknowledgment that cross-genre and the various interpersonal sources of linguistic variation give rise to considerable uncertainty around any conclusion of authorship. The selection of materials of known authorship (either as direct comparison materials or as background materials to determine a base rate within a particular genre, context, or community of practice) is one of the most crucial decisions to be made in any authorship analysis. It is also one of the decisions most likely to produce an erroneous or biased result.
Opinion and judgment are always involved in assessing comparison material and in drawing a conclusion that the specific texts constitute an adequate comparison set in any particular problem. A bad choice of comparison corpus might mean that an analyst is led to believe that a particular feature is distinctive and associated with a particular author, when in fact the feature arises due to variation in register, genre, or a community of practice.2
For example, if the analyst noticed that a particular author used “because” as a preposition as opposed to a subordinating conjunction in their business emails, then they might want to investigate how distinctive this is. In such a case, what would make a good comparison corpus? A contemporary corpus of writings in computer-mediated communications might be available, or there may only be a slightly older corpus of business emails. The former might show that this is now fairly common usage across a range of computer-mediated communications such as social media and blog posts, and the latter may demonstrate that it was rare for most authors writing business emails (but a few years ago).
The risk of error here is that the analyst comes to believe that the feature is an authorship marker, when in fact it indicates only that the queried text belongs to the register or the community of practice from which it was drawn. This is what in social science is referred to as a validity issue (see Grant & Baker, 2001).3 The risk of confirmation bias is that a comparison corpus is selected to best support the analyst’s preconceived hypothesis, and this can be exacerbated by the limited time and resources available to investigate a particular case. Examination of emails from the individual’s company might show, for example, that “because-as-preposition” is locally common within emails from that particular business, but it would take considerable effort to discover this.
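The dependence of apparent distinctiveness on the choice of comparison corpus can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the regular expression is a crude, hypothetical stand-in for proper linguistic annotation of “because-as-preposition,” the function names are invented, and the two toy corpora are made up rather than drawn from any real data.

```python
import re

# Crude, hypothetical detector: "because" + a single word + clause-final
# punctuation (e.g. "because deadlines."), excluding "because of".
# A real analysis would rely on careful linguistic annotation instead.
BECAUSE_PREP = re.compile(r"\bbecause\s+(?!of\b)\w+\s*[.!?;,]", re.IGNORECASE)


def uses_feature(text):
    """Return True if the text contains at least one candidate instance."""
    return BECAUSE_PREP.search(text) is not None


def author_base_rate(corpus):
    """Share of authors in `corpus` who use the feature in at least one text.

    `corpus` maps an author label to a list of that author's texts.
    """
    if not corpus:
        raise ValueError("corpus is empty")
    users = sum(
        1 for texts in corpus.values() if any(uses_feature(t) for t in texts)
    )
    return users / len(corpus)


# Invented toy corpora, for illustration only.
older_business_emails = {
    "A": ["The meeting moved because of the audit."],
    "B": ["We are late because the courier was delayed."],
    "C": ["Running late because traffic."],
    "D": ["Please reply soon because the deadline is Friday."],
}
recent_cmc = {
    "E": ["Can't come tonight because homework."],
    "F": ["Skipping lunch because deadlines, sorry."],
    "G": ["I stayed in because it rained."],
    "H": ["So tired because reasons."],
}

older_rate = author_base_rate(older_business_emails)
recent_rate = author_base_rate(recent_cmc)
print(f"older business emails: {older_rate:.2f}")
print(f"recent CMC: {recent_rate:.2f}")
```

In this toy setting the same feature appears rare against one reference corpus and common against the other, which is why the choice between them cannot be treated as a neutral technical step.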
In considering the sources of these biases more generally, we have to recognize the internal and external psychological pressures brought to bear on the analyst. These range from a natural desire to help through to broader concerns about building a reputation or a business. These pressures create potential bias in the decision-making process even before any client hypothesis has been heard, particularly where the decisions are more finely balanced or nuanced. In any authorship analysis, significant decisions about the design of the analysis are very real and will affect the result. Very often, the best decision, and the most expert judgment, is that no analysis is possible with the provided material.
One of the most fundamental decisions to be made concerns how much information an analyst should know about a case. There is an important tension between knowing as little contextual information as possible, to avoid bias, and knowing as much as possible about the texts, to ensure that there is a basis for a sound comparison in the texts analyzed. These judgments of text selection and of which type of analysis to apply are at the heart of expertise in forensic authorship analysis and are not mitigated by taking a wholly computational approach. This issue and all of the sources of bias need proper consideration, and, where possible, the biases need to be mitigated or designed out of the analyses. In turning to the Starbuck case, we show how we attempted this balance, reflect on the methods applied in this case, and make recommendations for future developments in this area.
EVALUATING THE DATA
One key feature of how we tackled the Starbuck case was a separation of roles between the two analysts Grant and Grieve (hereafter TG and JG). We decided from the outset to design our analysis in a deliberate attempt to mitigate confirmation bias through controlling the flow of information in the authorship analysis. We split the tasks in the following way:
TG liaised with police, received and evaluated data, decided on comparison sets, and then passed these materials to JG in a particular order. As TG had the principal liaison role and the contextual overview that came with it, he took principal responsibility for writing the report based on JG’s analysis. As primary author of the report, he would have been more likely to be called to court as an expert witness (although the prosecution and the defense would each have had the right to call either TG or JG).
JG performed the authorship analysis, applying methods at his own discretion on the texts passed to him by TG. The analysis itself was purely his, and he reported his findings to TG at different stages, as will be detailed.
The police provided CFL with considerable data for analysis, at least for a case of forensic disputed authorship. The data set comprised three subsets.
1 The first subset consisted of 82 emails known to have been written by Debbie, totaling approximately 28,000 words. These were sent between August 2006 and April 2010, mostly covering the period before she had met Jamie and including those sent during her previous trips abroad.
2 The second subset consisted of 77 emails written by Jamie, totaling approximately 6,000 words. These were sent between January 22, 2009, and October 18, 2012, and were primarily from the period of travel after the wedding. A substantial number of the emails from this period were sent to a personal assistant whom Jamie had employed to help deal with his affairs while he was abroad. The genre of these “business” emails was thus very different from the personal emails Debbie had sent to describe her travels. In part, this explains the very clear difference in average email length between the two sets (approximately 370 words per email for Debbie compared with 70 words per email for Jamie).
3 Finally, a third subset consisted of 29 emails of disputed authorship, all sent from Debbie’s account between April 27, 2010, and May 23, 2012. All these emails were sent after the marriage on April 21, 2010—the majority after the couple had supposedly left on their honeymoon, but the first few before departure.
TG took on the initial evaluation of the data to determine whether an authorship analysis was possible in the first instance and determined the data provided was well suited to a forensic authorship analysis for several reasons.
First, it was clearly a closed-set problem—the briefing from the police was that there were only two candidates under consideration as the author of the disputed emails. Given the context of the case, this seemed like a reasonable assumption. The problem was therefore to determine which of the two candidate styles was the closer match for the disputed texts. This binary closed-set problem is perhaps the most straightforward structure for an authorship problem, because the analyst does not need to perform a strong identification as to who was the author of the questioned document(s). They simply need to provide an informed opinion on whose style is more consistent with the style of the questioned document(s) and to demonstrate that at least some of these points of consistency show comparative or pairwise distinctiveness (Grant, 2020).
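The structure of a binary closed-set problem can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the method applied in this case: the function names, the use of simple substring matching as a "feature," and the toy texts are all hypothetical. For each feature, the sketch asks only which candidate's usage rate lies nearer to the rate observed in the disputed texts.

```python
def usage_rate(texts, feature):
    """Proportion of texts containing the feature string (case-insensitive)."""
    return sum(feature in t.lower() for t in texts) / len(texts)


def closer_candidate(known_a, known_b, disputed, features):
    """For each feature, report which candidate's rate is nearer the disputed set's."""
    verdicts = {}
    for f in features:
        q = usage_rate(disputed, f)
        dist_a = abs(usage_rate(known_a, f) - q)
        dist_b = abs(usage_rate(known_b, f) - q)
        if dist_a < dist_b:
            verdicts[f] = "A"
        elif dist_b < dist_a:
            verdicts[f] = "B"
        else:
            verdicts[f] = "tie"
    return verdicts


# Invented toy texts for two hypothetical candidates and a disputed set.
known_a = [
    "hi all, having a great time. cheers",
    "hi mum, weather lovely. cheers",
]
known_b = [
    "hello, please pay the invoice. regards",
    "hello, please forward the mail. regards",
]
disputed = [
    "hello, all fine here. regards",
    "hello, just checking in. regards",
    "hi, please send money. cheers",
]

verdicts = closer_candidate(
    known_a, known_b, disputed, ["hello", "regards", "cheers", "please"]
)
print(verdicts)
```

The output is a per-feature comparison rather than an identification, and features may point in different directions. Whether any point of consistency is also distinctive between the two candidates remains a matter for the analyst's judgment.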
In