Group of authors

Methodologies and Challenges in Forensic Linguistic Casework



the weight of evidence for any style shift can be considered cumulatively after any identified break in style.

      Finally, one last complication with this data was that, although it consisted of emails, the police provided us with access only to screenshots of the texts. Because these were simple images, they could not be analyzed computationally. We therefore needed to convert the images into text using optical character recognition software. This was a relatively time-consuming process, and the output required thorough checking against the image files to ensure that even minor punctuation features were correctly digitized.

      The outcome of TG’s evaluation phase of the analysis was the judgment that this data set as a whole was well suited for analysis. Cases like this, with small, closed sets of authors, sufficient data, and register control, do occur with some regularity, despite claims to the contrary sometimes made in the stylometry literature in particular (e.g., Luyckx & Daelemans, 2011). Law enforcement agencies can often provide problems of this type, especially now that online language use leaves essentially permanent records of data. Researchers with relatively little forensic experience appear to focus their efforts on more and more challenging problems, and for practical casework these more complex research projects are less relevant. Such academic authorship studies are, of course, important, but many issues around the “easier” sorts of cases have not yet been resolved. By sharing actual investigative linguistic casework with researchers and the public, the forensic linguistic community can help provide a picture of the landscape of actual forensic problems.

      ANALYSIS

      As noted already, the purpose of separating the analysis into stages was to allow TG to pass the data in the case to JG in a controlled way. Specifically, in line with the protocol published in Grant (2012), and given the time series nature of the data, TG began by providing JG with only the two sets of known writings for Debbie and Jamie Starbuck. TG had requested from the police contact that he, too, should not be informed of any particular suspected breakpoint in the data series. In spite of this, the emails were provided to TG in two files of known and disputed emails. To resolve this, TG removed the last few emails from Debbie’s known emails and added them to the disputed set to create a blind test set of emails for JG’s analysis. The advantage of having a second party manage the data access for the primary analyst is that practical issues such as this can be taken out of the hands of the police, who may not fully understand why data needs to be provided in a particular way.
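
      The re-partitioning step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure TG actually followed: the `Email` class, the `make_blind_split` function, and the number of emails moved are all hypothetical, and the demo uses placeholder text and arbitrary dates in place of case data.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Email:
    sent: date  # date the email was sent
    text: str   # body text (placeholder strings in the demo below)


def make_blind_split(known, disputed, n_moved):
    """Move the last n_moved known emails (by date) into the disputed set.

    Returns (retained_known, blind_test_set), both in date order, so the
    analyst cannot infer the suspected breakpoint from the file boundary.
    """
    if not 0 <= n_moved < len(known):
        raise ValueError("n_moved must leave at least one known email")
    ordered = sorted(known, key=lambda e: e.sent)
    cut = len(ordered) - n_moved
    retained, moved = ordered[:cut], ordered[cut:]
    blind = sorted(moved + list(disputed), key=lambda e: e.sent)
    return retained, blind


if __name__ == "__main__":
    start = date(2000, 1, 1)  # arbitrary placeholder date
    known = [Email(start + timedelta(days=i), f"known email {i}") for i in range(6)]
    disputed = [Email(start + timedelta(days=10 + i), f"disputed email {i}") for i in range(3)]
    retained, blind = make_blind_split(known, disputed, n_moved=2)
    print(len(retained), "known emails retained;", len(blind), "emails in blind test set")
```

      The point of the sketch is that the person managing data access, rather than the analyst, chooses `n_moved`, so the boundary of the file handed to the analyst no longer coincides with the suspected breakpoint.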

| Feature | Debbie | Jamie |
| --- | --- | --- |
| Sentence length | Long sentences (24 words per sentence average). Example: “I’m now back in Oz, after 5 weeks In NZ—had a good time, though it felt so much more remote than here (guess it is!) and I really felt that, being there.” | Short sentences (10 words per sentence average). Example: “I knew I’d forget something. 2 things in fact.” |
| One-word sentences | No tokens | Occasional use. Example: “Sorry. I thought I’d replied.” |
| Run-on sentences | Relatively common. Example: “Are you enjoying your new car, what is it?” | No tokens |
| Awhile | No tokens | 3 tokens. Example: “Shouldv’e done that awhile ago.” |
| Inserts | Relatively uncommon. Example: “ha ha—you’re entirely responsible for how or where it goes” | Relatively common. Example: “Umm….you haven’t actully apologised for anthing despite your insistence otherwise.” |
| Emoticon usage | No tokens | 9 tokens. Example: “Its gorgeous:) hope you enjoyed your holiday.)” |

      The feature set produced by this initial stage of JG’s analysis in the Starbuck case was then passed to TG and also sent to the police. The purpose was to create an evidence trail showing that the feature set had been “locked” before JG received the disputed material. This step in itself did not strengthen the mitigation of confirmation bias, but it did strengthen the robustness of the analysis as an evidential product.
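
      The text does not say how the lock was recorded beyond sending the feature set to TG and the police. Purely as an illustration, and assuming the feature set can be written down as a simple mapping, one way to make such a lock independently checkable is to record a cryptographic digest of it. The function names and the feature encoding below are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(features):
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(
        features, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def lock_feature_set(features):
    """Return a small record that can be sent to a third party as the 'lock'."""
    return {
        "sha256": _digest(features),
        "locked_at": datetime.now(timezone.utc).isoformat(),
        "feature_count": len(features),
    }


def verify_feature_set(features, record):
    """True if the feature set is unchanged since the record was made."""
    return _digest(features) == record["sha256"]


if __name__ == "__main__":
    # Hypothetical encoding: feature name -> author who predominantly uses it.
    features = {
        "one_word_sentences": "Jamie",
        "run_on_sentences": "Debbie",
        "awhile": "Jamie",
        "emoticons": "Jamie",
    }
    record = lock_feature_set(features)
    print(verify_feature_set(features, record))
```

      Any later change to the feature set, however small, changes the digest, so a third party holding the record can confirm that the features applied to the disputed emails are the ones fixed beforehand.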

      JG identified a wide range of linguistic forms that distinguished between the styles of these two authors, features that were predominantly used by one or the other in the known sets of emails. The process through which these features were identified involved a combination of close manual stylistic analysis and also computational analysis giving rise to some stylometric features.
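
      Some of the features in the table lend themselves to simple counting. The sketch below is not JG’s method or tooling, which the text does not describe. It is a minimal illustration, under stated assumptions, of how mean sentence length, one-word sentences, tokens of “awhile”, and emoticons might be counted. The sentence splitter and the emoticon pattern are deliberately naive assumptions, and the function names are hypothetical.

```python
import re

SENTENCE_END = re.compile(r"[.!?]+")     # assumption: sentences end at . ! ?
EMOTICON = re.compile(r"[:;]-?[)(DPp]")  # assumption: a small emoticon inventory
WORD = re.compile(r"[A-Za-z’']+")        # letters plus straight/curly apostrophes


def split_sentences(text):
    """Naive splitter: break on sentence-final punctuation and keep only
    pieces that contain at least one letter or digit."""
    pieces = (p.strip() for p in SENTENCE_END.split(text))
    return [p for p in pieces if re.search(r"\w", p)]


def style_profile(text):
    """Count a few of the features listed in the table for one text."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    words = WORD.findall(text.lower())
    return {
        "mean_sentence_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "one_word_sentences": sum(1 for n in lengths if n == 1),
        "awhile_tokens": words.count("awhile"),
        "emoticons": len(EMOTICON.findall(text)),
    }


if __name__ == "__main__":
    # Example sentences taken from the feature table above.
    for sample in (
        "Sorry. I thought I’d replied.",
        "Shouldv’e done that awhile ago.",
        "Its gorgeous:) hope you enjoyed your holiday.)",
    ):
        print(sample, style_profile(sample))
```

      Counts like these would only supplement close manual analysis. Features such as run-on sentences and inserts depend on judgments that a pattern match of this kind cannot make reliably.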