Some thoughts on Ethics and Big Social Media Research

This week there have been a few stories in the tech news around social media research ethics. These range from the controversial Kirkegaard and Bjerrekær case, involving data scraped from OkCupid and subsequently released publicly, to new UK Cabinet Office guidelines on data science ethics and a new report from the Council for Big Data, Ethics, and Society. This post is not a commentary on those stories; instead, they prompted me to share some notes on the topic that have been lurking on my hard drive for a wee while. They are not particularly polished or well structured at the moment, but hopefully there are a few useful nuggets in here, and post-PhD it would be nice to turn them into something more formal. Anyway, here we go for now, but be warned… there is law :)


1) There is a need to balance the end goals of data-driven research projects that aim at fostering some notion of the ‘greater good’ against the regulatory context. Utilitarian research goals do not exempt studies from legal compliance requirements. This is especially so for privacy and data protection considerations, as these are fundamental, inalienable human rights which often enable other human values like dignity, autonomy, or identity formation. When dealing with vulnerable populations, the need to respect these elements is heightened. Broadly, virtuous research goals do not override legal safeguards.

However, this becomes problematic when handling big social media data for research, because researchers face significant technical and regulatory issues. The ‘velocity, variety and volume’[1] of big data is a computational challenge. These large datasets often involve personal data, which brings EU data protection regulation to the fore. The UK ICO has many concerns around the use of big data analytics with personal data, yet it states ‘big data is not a game that is played by different rules’ and that the existing DP rules are fit for regulation.[2] It is particularly concerned about a number of issues, such as: ensuring sufficient transparency and openness with data subjects about how their data is used; reflecting on when data re-purposing is compatible with the original purposes of collection (eg when data collected for one purpose is reused for another); the importance of privacy impact assessments; how big data challenges the principle of data minimisation; and preserving data subjects’ access rights.[3]

2) Researchers, as data controllers in their research projects, i.e. those determining the purposes and means of data processing,[4] have a range of responsibilities under data protection rules. They need to ensure security, proportionate collection and a legal ground for processing, to name a few. Data subjects have many rights in data protection law, from knowing what data is collected, why, by whom and for what purposes, to objecting to processing on certain grounds. From a data subject’s perspective, they may not even know they are part of a big social media dataset, making it very hard for them to protect their own rights (eg with data scraped from Tweets around a particular topic). Furthermore, data subject rights are set to be extended in the new General Data Protection Regulation.[5] Subjects will be able to restrict data processing in certain circumstances, receive their data in a portable format they can move, and even exercise a right to erasure.[6] Researchers need to reflect on the range of subject rights and controller responsibilities in the GDPR, and consider how to protect the rights of data subjects whose personal data are within their big social media datasets. A particular challenge is obtaining user consent. The status quo position, argued by some, that these datasets are too large to reasonably obtain the consent of every ‘participant’, is not sustainable… (we return to that below…)

3) To understand why the distinction between personal and sensitive personal data is important, we need to unpack the nature of consent in data protection law. In general, unambiguous user consent is not the only legal ground for processing personal data (another is, eg, the legitimate interests of the controller), but for handling sensitive personal data, explicit consent is required. To clarify, personal data is “any information relating to an identified or identifiable natural person (‘data subject’)”,[7] whereas sensitive personal data is information about “racial or ethnic origin, political opinions, religious or philosophical beliefs, trade-union membership, and the processing of data concerning health or sex life”.[8] Similarly, legally speaking, consent is a “freely given specific and informed indication of his wishes by which the data subject signifies his agreement to personal data relating to him being processed”.[9] The ‘informed’ requirement demands that clearly visible and accessible, jargon-free, easily understandable information is provided directly to subjects before the point of consent.[10] Similarly, the ‘unambiguous’ element requires that “the indication by which the data subject signifies his agreement must leave no room for ambiguity regarding his/her intent”.[11]

With sensitive personal data, the consent needs to be ‘explicit’, but what this requires is not defined in the law. Importantly, for both types of consent, the EU Article 29 Working Party (an advisory institution for DP law) argues that the way an indication or agreement is made is not limited to writing, but extends to any indication of wishes, i.e. “it could include a handwritten signature affixed at the bottom of a paper form, but also oral statements to signify agreement, or a behaviour from which consent can be reasonably concluded”.[12] Importantly, passive behaviour is not enough; action is necessary.[13]

4) From the perspective of social media research, this leaves the mechanisms that can be used for obtaining explicit consent quite open, within these required parameters. However, more innovative approaches for communicating with and informing end users are required. One proposal for studies using Twitter might be to send direct messages to the Twitter handles of those whose data are captured, explicitly explaining that the data will be used in research and asking permission. Privacy-preserving approaches, like removing Twitter handles at the point of data collection, or at least pseudonymising them, might be another means. However, even these might not be compliant, because Twitter handles are themselves likely personal data, and handling them, even prior to providing information about the nature of processing to end users, or prior to anonymisation/pseudonymisation, would still be subject to DP rules. Whilst these are difficult challenges to address, they need to be considered.
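To make the pseudonymisation idea concrete, here is a minimal sketch (not a compliance recipe) of replacing Twitter handles with keyed-hash pseudonyms at the point of collection. The function names, the example record shape, and the placeholder key are all my own assumptions for illustration; the key would need to be stored separately and securely, and, as noted above, the raw handle is still personal data at the moment it is processed, so this mitigates rather than removes DP obligations.

```python
import hmac
import hashlib

# Hypothetical secret key: must be generated randomly and stored
# separately from the dataset, or the pseudonyms are trivially reversible.
SECRET_KEY = b"store-this-separately-and-securely"

def pseudonymise_handle(handle: str) -> str:
    """Map a Twitter handle to a stable pseudonym via a keyed hash (HMAC).

    The same handle always yields the same pseudonym, so analysis of a
    user's tweets over time remains possible without storing the handle.
    """
    digest = hmac.new(SECRET_KEY, handle.lower().encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}"

def pseudonymise_tweet(tweet: dict) -> dict:
    """Strip the handle from a collected tweet record at ingest time,
    keeping only the pseudonym and the text for analysis."""
    return {"user": pseudonymise_handle(tweet["handle"]),
            "text": tweet["text"]}
```

A keyed hash is used rather than a plain hash because the space of Twitter handles is small enough that unkeyed hashes can be reversed by brute force.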

5) Beyond compliance, we also need to consider the distinction between law and ethics. Form contracts, eg privacy policies, end user licence agreements or terms of service, are a dominant approach for obtaining end user consent. No-one reads these, and they cannot change the terms even if they did.[14] Whilst many of these contract terms are challengeable as unfair under consumer protection laws, for example jurisdiction clauses, this requires testing in a dispute, which costs money, time and resources that many consumers lack. Beyond this, the use of contracts highlights a distinction between law and ethics.

Organisations will argue that contracts including clauses that allow for research provide legal compliance, and research based on such contracts may be ‘legal’. Whether it is ethical is another question, as we saw with the Facebook ‘emotional contagion’ study.[15] However, people rarely read Ts&Cs, challenging any notion of ‘informed consent’,[16] and given the need for explicit consent to process data relating to the political opinions, health, sex life or philosophical views of subjects, it is hard to argue such form contracts are enough. Indeed, sentiment analysis of Tweets, for example, may often focus on topics like the political responses of different communities. However, even if a good legal orator could convince you that the legal footing for processing is sound, the uses are still questionable ethically. Fear of sanctions and the implications of non-compliance, like litigation, will likely foster more change than the aspiration of ethical practice. Sound ethical practices could be viewed as a carrot, and the law as a stick, but increasingly we need both. Attempts to find a complementary approach between the two are growing. A good example is the European Data Protection Supervisor’s recently established ethics advisory group, set up to help design a new set of ‘digital ethics’ that can foster trust in organisations.[17]

6) Publicly accessible research data like Tweets are often argued to be fair game for research, as they are broadcast to the online world at large, but this is not correct. As boyd argues,[18] information is often intended only for a specific networked public made up of peers, a support network or a specific community, not necessarily the public at large. When it is viewed outside of those parameters it can cause harm. Indeed, as Nissenbaum states, privacy harms are about information flowing out of the context it was intended for.[19] Legally, too, people have a reasonable expectation of privacy, even in public spaces (Von Hannover v Germany).[20]

7) As researchers, our position within these contexts is unclear. We are often in an asymmetric position of power with regard to our participants, and we need to adhere to higher standards of accountability and ethics, especially when dealing with vulnerable populations. How we maintain public trust in our work has to reflect this. It becomes a question of who is looking at this data, how, and in what capacity. The context of police analysis of open social media is a comparative example (i.e. not interception of private communications, but accessing publicly available information on social media, news sites etc). There, the systematic nature of their observation,[21] and their position as a state organisation, bring questions about the legality, proportionality and necessity of intrusions into private and family life to the fore. The same questions may not be asked of the general public looking at such data. The discussions and challenges around standards of accountability, transparency, and importantly legitimacy for the police using open social media[22] have parallels with those facing researchers.

8) The DPA provides exemptions from certain DP rules for research purposes, although ‘research’ is not well defined.[23] The UK ICO Anonymisation Code of Practice clarifies to an extent, stating that research includes not only[24] “statistical or historical research, but other forms of research, for example market, social, commercial or opinion research”.[25] Importantly, research should not support measures or decisions about specific individuals, nor be used in a way that causes, or is likely to cause, a data subject substantial damage or distress.[26] The ICO affirms the exemptions cover only specific parts of data protection law, namely incompatible purposes, retention periods and subject access rights, i.e. “if personal data is obtained for one purpose and then re-used for research, the research is not an incompatible purpose. Furthermore, the research data can be kept indefinitely. The research data is also exempt from the subject access right, provided, in addition, the results are not made available in a form that identifies any individual.”[27] All other provisions of DP law still apply, unless it involves “anonymised data for research, then it is not processing personal data and the DPA does not apply to that research”.[28]

That being said, the ICO has also stated: “anonymisation should not be seen merely as a means of reducing a regulatory burden by taking the processing outside the DPA. It is a means of mitigating the risk of inadvertent disclosure or loss of personal data, and so is a tool that assists big data analytics and helps the organisation to carry on its research or develop its products and services.”[29] Scholars like Ohm have also argued against the assumption that anonymisation of data is a good policy tool, because the dominant anonymisation techniques are at risk of easy deanonymisation.[30] Narayanan and Felten have argued similarly from a more technical perspective.[31] Anonymisation is hard because of the risks of linking data between and across databases, ‘singling out’ individuals from the data, or inferring attributes from the values of other attributes.[32]
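The ‘singling out’ risk the Article 29 Working Party describes can be illustrated with a toy sketch (the records below are invented for illustration): even with names removed, a combination of quasi-identifiers can pick out exactly one person. The standard way to measure this is k-anonymity, the size of the smallest group of records sharing the same quasi-identifier values.

```python
from collections import Counter

# Invented toy dataset: names already removed, but age band and city
# remain as quasi-identifiers alongside the analysed tweet topic.
records = [
    {"age_band": "20-29", "city": "Leeds",   "topic": "politics"},
    {"age_band": "20-29", "city": "Leeds",   "topic": "health"},
    {"age_band": "30-39", "city": "Glasgow", "topic": "politics"},
]

def k_anonymity(rows, quasi_identifiers):
    """Smallest equivalence-class size over the quasi-identifiers.

    k = 1 means at least one individual is uniquely singled out by
    that combination of attributes, despite the absence of names.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)
    return min(counts.values())

# The single 30-39/Glasgow record is unique, so k = 1: anyone who knows
# one 30-something Glaswegian in the study can read off their topic.
k = k_anonymity(records, ["age_band", "city"])
```

Linking this dataset with any external source that maps age band and city back to a person completes the re-identification, which is the pattern Ohm, and Narayanan and Felten, warn about.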


[1] ICO Big Data and Data Protection (2014) p6-7 available at

[2] ICO Big Data and Data Protection (2014) p4

[3] ICO Big Data and Data Protection (2014) p5-6

[4] Art 2 Data Protection Directive 1995

[5] XXXX

[6] Article 17, 17a, Art 18

[7] Art 2(a) DPD

[8] Art 8(1) DPD

[9] Article 2(h); Art 7 – unambiguous consent

[10] Opinion 15/2011 p19-20

[11] Opinion 15/2011 p19-20

[12] Opinion 15/ 2011 p11

[13] Opinion 15/ 2011 p11





[18] d boyd ‘It’s Complicated: The Social Lives of Networked Teens’ (2014)

[19] H Nissenbaum ‘Privacy in Context’ (2009)

[20] Von Hannover v Germany; Peck v UK

[21] Rotaru v Romania

[22] See L Edwards and L Urquhart ”Privacy in Public Spaces” (2016) Forthcoming –

[23] DPA 1998 s33

[24] p44 onwards

[25] ICO Anonymisation Code of Practice (2012) p44 – 48

[26] DPA 1998 s33(1)

[27] ICO Big Data and Data Protection (2014) para 84

[28] ICO Big Data and Data Protection (2014) para 86

[29] ICO Big Data and Data Protection (2014) para 46

[30] P Ohm “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization” (2010) UCLA Law Review p1704

[31] A Narayanan and E Felten “No Silver Bullet: De-identification still doesn’t work” (2014), where they argue: “Data privacy is a hard problem. Data custodians face a choice between roughly three alternatives: sticking with the old habit of de-identification and hoping for the best; turning to emerging technologies like differential privacy that involve some trade-offs in utility and convenience; and using legal agreements to limit the flow and use of sensitive data. These solutions aren’t fully satisfactory, either individually or in combination, nor is any one approach the best in all circumstances.”

[32] Article 29 Working Party “Opinion 05/2014 on Anonymisation Techniques” (2014)

Originally posted at