Misogyny by numbers

Last week saw the launch of Reclaim the Internet, a campaign against online misogyny. Both the campaign and the (copious) media reports of it leaned heavily on research conducted by the think-tank Demos, which investigated the use of the words ‘slut’ and ‘whore’ in tweets sent from UK-based accounts over a period of about three weeks earlier this year. The study identified 10,000 ‘explicitly aggressive and misogynistic tweets’ containing the target words, sent to 6500 different Twitter-users. It also found that only half of these tweets were sent by men (or, as numerous media sources put it, that ‘half of all online abusers were women’).

So frequently and insistently was this statistic repeated, the message of the day almost became, ‘look, women are just as bad as men!’ Women like the journalist and feminist campaigner Caroline Criado-Perez, who were sought out for comment because of their experience of online abuse, got drawn into lengthy discussions about the misogyny of other women.

Of course, it isn’t news that some women call other women ‘sluts’ and ‘whores’ (or that women may be involved in the most serious forms of online abuse: one of the people prosecuted for sending death-threats to Criado-Perez was a woman). But ‘who sends abusive messages?’ is only one of the questions that need to be addressed in a discussion of online abuse. It’s also important to ask who the messages are typically addressed to and what effect they have, not just on their immediate recipient but on other members of the group that’s being targeted. But those questions weren’t addressed in this particular piece of research, and it was difficult to raise them when all the interviewers wanted to talk about was that ‘half of all abusers are women’ statistic.

These discussions reminded me of the way anti-feminists derail discussions of domestic violence with statistics supposedly showing that women are as likely to assault men as vice-versa. Feminists have challenged this claim by looking at the finer details of the data the figures are based on. They’ve pointed out, for instance, that female perpetrators are most commonly implicated in single incidents, whereas men are more likely to commit repeated assaults, and to do so as part of a larger pattern of coercive control. It’s also men who are overwhelmingly responsible for the most serious physical assaults, and for the great majority of so-called ‘intimate partner killings’.

Once you focus on the detail, it’s clear domestic violence isn’t an equal opportunity activity. Online misogyny probably isn’t either (especially if you focus on the kind that really does deserve to be called ‘abuse’—stalking, repeated threats to rape and kill, etc). But the Demos study didn’t capture any of the detail that would allow us to see what’s behind the numbers.

In this it is fairly typical of the kind of research which funders, policymakers and the media increasingly treat as the ‘gold standard’, involving hi-tech statistical analysis of very large amounts of information—what is often referred to as ‘big data’, though that term has come to be used rather loosely. Strictly speaking, the ‘big data’ label wouldn’t apply to the Demos study, whose sample of 1.5 million tweets is very small beer by big data standards. At the same time, it’s too much data to be analysed in detail by humans: the researchers employed NLP (natural language processing), using algorithms to make sense of text, and their findings are essentially statistical—figures for the frequency of certain kinds of messages, along with the gender distribution of their senders.

You may be thinking: but doesn’t it make sense to assume that ‘bigger is better’—that the more data you crunch through, the more reliable and useful your results will be? I would say, it depends. I’m certainly not against quantitative analysis or large samples: if the aim of a study is to provide information about the overall prevalence of something (e.g., online misogyny on Twitter), then I agree it makes sense to go large. Actually, you could argue that Demos didn’t go large enough: not only was their sample restricted to tweets which contained the words ‘slut’ and ‘whore’, the time-period sampled was short enough to raise suspicions that the findings were disproportionately affected by a single event (the surprisingly high number of woman-on-woman ‘slut/whore’ tweets may reflect the massive volume of abuse directed at Azealia Banks by fans of Zayn Malik after she attacked him publicly).

What I am against, though, is the idea that the combination of huge samples and quantitative methods must always produce better (more objective, more reliable, more revealing) results than any other kind of analysis. Different methods are good for different things, and all of them have limitations.

The forensic corpus linguist Claire Hardaker knows a lot about what can and can’t be done with the tools currently available to researchers, and she has explained on her blog why she’s sceptical about the Demos study. Her very detailed comments confirm something a lot of people immediately suspected when they first encountered the claim about men and women producing equal numbers of abusive tweets. That claim presupposes a degree of certainty about the offline gender of Twitter-users which is not, in reality, achievable. (This isn’t just because people disguise their identities online, though obviously a proportion of them do; Hardaker explains why it’s a problem even when they don’t.)

Another thing Hardaker is sceptical about is the researchers’ claim to have trained a classifier (a machine learning tool that sorts things into categories) to distinguish between different uses of ‘slut’ and ‘whore’, so that genuine expressions of misogyny wouldn’t get mixed up with ironic self-descriptions or mock insults directed at friends. Her observations on that point deserve to be quoted at some length:

We can guarantee that the classifier will be unable to take into account important factors like the relationships between the people using [the] words, their intentions, sarcasm, mock rudeness, in-jokes, and so on. A computer doesn’t know that being tweeted with “I’m going to kill you!” is one thing when it comes from an anonymous stranger, and quite another when it comes from the sibling who has just realised that you ate their last Rolo. Grasping these distinctions requires humans and their clever, fickle, complicated brains.

When you depend on machines to make sense of linguistic data, you have to focus on things a machine can detect without the assistance of a complicated human brain. A computer can’t intuit whether the sender of a message harbours particular attitudes or feelings or intentions; what it can do, though, is identify (faster and often more accurately than a human) every instance of a specific word. So, what happens in quite a lot of studies is that the researchers designate selected words as proxies for the attitudes, feelings or intentions they’re interested in. In the Demos study, these proxy words were ‘slut’ and ‘whore’, and the presence of either in a tweet was treated as a potential indicator of misogyny.
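To make the proxy-word approach concrete, here is a minimal Python sketch. It is not the Demos pipeline, whose implementation I haven't seen: the function name `contains_proxy_word`, the matching rule and the example messages are all invented for illustration. It simply flags any message that contains one of the target words as a whole word.

```python
import re

# The two proxy words come from the study; everything else here is an
# illustrative assumption, not the researchers' actual code.
PROXY_WORDS = ("slut", "whore")

# \b restricts matches to whole words; "s?" also accepts simple plurals.
PROXY_PATTERN = re.compile(
    r"\b(?:" + "|".join(PROXY_WORDS) + r")s?\b", re.IGNORECASE
)


def contains_proxy_word(message: str) -> bool:
    """Return True if the message contains any proxy word."""
    return PROXY_PATTERN.search(message) is not None


# Invented example messages.
examples = [
    "You are a worthless SLUT",          # hostile use: flagged
    "Slut-shaming has to stop",          # feminist complaint: also flagged
    "Women like you should keep quiet",  # hostile, but no proxy word: not flagged
]

flags = [contains_proxy_word(m) for m in examples]
```

The matcher finds every instance of the words reliably, but it cannot tell the first message from the second, and it never sees the third.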

One obvious problem with this is that it excludes any expression of misogyny that doesn’t happen to contain those particular words. The researchers themselves were well aware that tweets containing ‘slut’ and ‘whore’ would only make up a fraction of all misogynist tweets (one of them told the New Statesman they were ‘only scratching the surface’). But that point got completely lost once their research became a media story. The media need short-cuts: they’ve got no time for the endless qualifications that litter academics’ prose. Consequently, the figures given in the report for the frequency of ‘slut’ and ‘whore’ soon began to be presented as if they were a definitive measure of the prevalence of online misogyny in general.

The researchers were also aware that in context, ‘slut’ and ‘whore’ aren’t always expressions of misogyny. They may be being used in an ironic or humorous way; they may turn up in feminist complaints about ‘slut-shaming’ or ‘whorephobia’. So after searching for every instance of each word, the researchers used a classifier to filter out irrelevant examples and sort the rest into various categories.

Since the full write-up of the 2016 Demos study doesn’t seem to be available yet, I’ll illustrate how this works in practice using the report of a study which the same research group carried out in 2014, apparently using much the same methodology. In this earlier research they investigated three words, ‘slut’, ‘whore’ and ‘rape’. When they analysed the ‘rape’ tweets, they started by getting rid of irrelevant references to, for instance, ‘rapeseed oil’. Then they used a classifier to distinguish among tweets which were discussing an actual rape case or a media report about rape (these made up 40% of the total), tweets which were jokes or casual references to rape (29%), tweets which were abusive and/or threats (12%), and tweets which didn’t fit any of those categories and so were classified as ‘other’ (it’s possibly not a great sign that 27% of the sample ended up in the ‘other’ category).
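The two-stage procedure described here (discard irrelevant matches, then sort what remains into categories) can be sketched in Python as follows. The names `is_relevant`, `toy_classify` and `category_shares` are hypothetical, and the hand-written rules are a deliberately crude stand-in: the 2014 report describes a trained classifier, which I am not attempting to reproduce.

```python
import re
from collections import Counter

# Whole-word match only, so "rapeseed" and "grape" are not picked up.
TARGET = re.compile(r"\brape\b", re.IGNORECASE)


def is_relevant(tweet: str) -> bool:
    """Stage 1: keep only tweets that use the target word as a word."""
    return TARGET.search(tweet) is not None


def toy_classify(tweet: str) -> str:
    """Stage 2: a crude rule-based stand-in for the trained classifier."""
    text = tweet.lower()
    if "police" in text or "court" in text:
        return "news/discussion"
    if "lol" in text or "joke" in text:
        return "joke/casual"
    # No surface cue separates a threat from anything else,
    # so everything left over lands here.
    return "other"


def category_shares(tweets: list[str]) -> dict[str, int]:
    """Percentage of relevant tweets in each category, rounded."""
    relevant = [t for t in tweets if is_relevant(t)]
    if not relevant:
        return {}
    counts = Counter(toy_classify(t) for t in relevant)
    return {label: round(100 * n / len(relevant)) for label, n in counts.items()}


sample = [
    "Rapeseed oil is on offer this week",           # invented; dropped at stage 1
    "Police reopen rape inquiry",                   # invented
    "@^^^^ that was my famous rape face LOL Joke.",  # quoted in the 2014 report
    "@^^^^ can I rape you please, you'll like it",  # quoted in the 2014 report
]
shares = category_shares(sample)
```

With these toy rules, the last tweet falls into ‘other’ because nothing in its wording marks it as a threat or as a joke. Deciding which it is calls for context that the text alone doesn’t supply.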

This looks like the kind of classification task that computers aren’t very good at for the reasons explained by Claire Hardaker (distinguishing abuse from humour, for instance, calls for human-like judgments of tone). But the limitations of current technology may not be the only problem. As a check on the classifier’s reliability, a few hundred tweets from the sample were classified by human analysts. Some of these manually categorised examples are reproduced in the report to illustrate the different categories. To me, what these examples show is that once messages have been extracted from their context, there’s often enough ambiguity about their meaning to cause problems for a human, let alone a machine.

Here’s a straightforward case—a tweet the humans categorised as a joke.

@^^^^ that was my famous rape face 😉 LOL Joke.

This is unproblematic because the tweeter has taken a lot of trouble to make its status as a joke explicit (adding a winky face, LOL, and then the actual word ‘joke’). But how about this tweet, which comes from the 12% of rape references which the humans categorised as ‘abusive/threats’?

@^^^^ can I rape you please, you’ll like it

If you think that’s a repugnant thing to tweet at someone, you’ll get no argument from me. But I don’t think it’s self-evident that this strangely polite request is intended as a serious threat rather than a ‘mock’ or ‘joke’ threat. The original recipient will have decided how to take it by using contextual information (e.g., whether the tweeter was a friend or a random stranger, what if anything the tweet was responding to, whether there was any history of similar messages, etc.). Without any of that context, the significance of a message like this one for its original sender and recipient is something an analyst can only guess at.

The example I’ve used here is a ‘threat’ that might conceivably have been intended and taken as a ‘joke’, but it’s likely there are also cases of the opposite, tweets the researchers classified as ‘jokes’ which were intended or taken as serious threats. So I’m not suggesting that the proportion of actual rape threats was lower than the reported 12%; I’m suggesting that the classification—even when done by humans—is not sufficiently reliable to base that kind of claim on. And that the main reason for this unreliability is the way large-scale quantitative studies of human communication detach individual communicative acts from the context which is needed to interpret their meaning fully.

Whether we’re academics, journalists or campaigners, we all like to fling numbers around. There’s nothing like a good statistic to draw attention to the scale of a problem, and so bolster the argument that something needs to be done about it. And I’m not denying that we need the kind of (large-scale and quantitative) research which gives us that statistical ammunition. But two caveats are in order.

First, large-scale quantitative research is not the only kind we need. We also need research that illuminates the finer details of something like online misogyny by examining it on a smaller scale, but holistically, with full attention to the contextual details. There’s a lot we could learn about how online abuse works (and what strategies of resistance to it work) by using a microscope rather than a telescope.

Second, if we’re going to rely on numbers, those numbers need to be credible. In that connection, the Demos study hasn’t done us any favours: I’ve yet to come across any informed commentator who isn’t at least somewhat sceptical about its findings. While some of the problems people have commented on reflect the way the media reported the research—pouncing on the ‘women send half the tweets containing “slut” and “whore”’ claim and then reformulating that as ‘women are half of all online abusers’ (an assertion whose implications go well beyond what the evidence actually shows)—there are also problems with the researchers’ own claims.

‘My issue’, says Claire Hardaker, ‘is that serious research requires serious rigour’. When research is done on something that’s a matter of concern to feminists, its quality and credibility should be an issue for us too.