Cover: Digital Social Research by Giuseppe A. Veltri

Dedication

To my family and friends

Digital Social Research

Giuseppe A. Veltri











polity

Abbreviations

ALAAM
auto-logistic actor attribute models
API
application programming interface
CAQDAS
computer-assisted qualitative data analysis software
CART
classification and regression trees
CFA
confirmatory factor analysis
CHAID
chi-square automatic interaction detection
CSS
computational social science
CTM
correlated topic model
DGP
data-generating process
DMI
Digital Methods Initiative
EDI
electronic data interchange
ERGM
exponential random graph models
FA
factor analysis
GDPR
General Data Protection Regulation
http
hypertext transfer protocol
ICT
information and communication technology
IR
information retrieval
JSON
JavaScript Object Notation
LCA
latent class analysis
LDA
latent Dirichlet allocation
LSA
latent semantic analysis
MTMM
multitrait–multimethod matrix
NER
named entity recognition
NLP
natural language processing
NoSQL
not only SQL
OLFG
online focus group
OLS
ordinary least squares
PCA
principal component analysis
RCT
randomized controlled trial
RDBMS
relational database management system
REST
representational state transfer
RSS
rich site summary
SEM
structural equation modelling
SNA
social network analysis
SOAP
simple object access protocol
SQL
structured query language
SRS
simple random sample
SVD
singular value decomposition
URL
uniform resource locator
UTOS
units, treatments, observation operations, settings
VAU
Voson Activity Units
VGI
volunteered geographic information
WWW
World Wide Web

Introduction

In July 2014, a group of researchers published a paper titled ‘Experimental evidence of massive-scale emotional contagion through social networks’ (Kramer et al., 2014), a study conducted on the social media platform Facebook involving hundreds of thousands of its users as participants. This study went under the spotlight not so much because of its scientific findings but because it exemplified how social scientific research has been changing, in terms both of opportunities and of associated risks. Hundreds of thousands of participants were manipulated, in experimental terms, without knowing that they were part of a study that was being conducted on a scale unprecedented in social science research. Since then, terms like ‘big data’, digital research and web social science have inundated conferences and paper abstracts. Social scientists have fundamentally reacted in one of two ways: with wild enthusiasm or with stern scepticism. For the enthusiasts, the availability of digital data collected by multiple means of recording our digital traces represented the long-awaited turn in an increasingly difficult reality of data collection. For the sceptics, who nonetheless appreciated the potential, digital research raised questions about the quality of such data, posed questions about data access and ownership, and started a debate about integrating online and offline data. Of course, there are plenty of researchers who fall in between these two broad categories, and that is where this book ideally situates itself. Its approach is to provide a critical overview of the common methods used to carry out digital research and, at the same time, to have a particular sensitivity to methodological principles and theoretical issues that are not easily dismissed by the new availability of data about society and people.

The enthusiasm exists, perhaps, because of a personal academic history. During my PhD years, I experienced at first hand the move from analogue to digital data and the increasing presence of software-assisted data management and analysis. Often, at that time, learning to use the latest software would provide a sense of thrill, and new possibilities of research would open up in front of you. However, soon enough, after this feeling of great potential, old problems and questions of research methodology would return, throwing cold water on your enthusiasm.

A similar cycle of enthusiasm and doubt is part and parcel of being a digital researcher these days. There is little doubt that digital data are changing the way social science is done. One example of this is the perception of value that data now have. During my training as a researcher, I was taught that data are those precious things that are very hard to obtain and therefore, once in our grasp, should be exploited to the fullest. Today, most social scientists collecting digital data have datasets sitting unused on their hard disks or in the Cloud. When data were scarce, each data collection was often meticulously crafted in terms of the instruments used – for example, a questionnaire – with a pretty clear idea already in place of how the analysis would be conducted. As data have become much cheaper, the amount of planning and analytical strategy appears to be decreasing. This decrease is also due to the increased obsolescence of data. In a context of fast, continuous and affordable data collection, data become ‘old’ very quickly, and yet the use of archives is problematic because of access issues. As the entire research process speeds up, data are collected, analysed and archived very quickly in order to be able to move on to the next project.

At the same time, digital social research is fast-moving, and for several reasons. The first is that the division between online and offline research is increasingly fading. The usual distinction between research ‘about the Internet’ and ‘through the Internet’, where the first term refers to human behaviour and social phenomena specific to the online world, while the second label refers to using the Internet as a field of research to conduct a study that could also be conducted offline, does not hold up well these days. Digital data defy formal definition but, for practical reasons, we can describe them as the digital traces of human behaviour and opinions recorded by a wide set of digital services operating in different domains of society (e.g., financial, transportation, health, commercial, social). The nature of digital data is continuously expanding as digital services emerge and many different objects acquire the capacity of recording information about their use and the environment in which they operate, the so-called ‘Internet of Things’.

The second reason to consider digital data, especially those from social media, as fast-moving is that their use by people is evolving over time. For example, Facebook has become a main social arena for many people; the naive use of the platform that probably characterized its early days is long gone. People are now strategic in their use of their Facebook presence. One example of this is the positivity bias that Facebook content has (Spottswood and Hancock, 2016). Almost like a large-scale Hawthorne effect, in which individuals modify an aspect of their behaviour in response to their awareness of being observed, people know that their digital presence is observable and therefore adapt to this visibility.

The third reason is that social digital data are becoming increasingly complex. To illustrate this point, let’s take an example based on one of the fundamental research instruments in social research: the questionnaire. In the pre-digital age, a questionnaire would collect data designed by its makers in terms of answers to questions as well as some contextual data provided by interviewers if a door-to-door collection was part of the design. As surveys started to be conducted by telephone, different types of data became available to researchers – for example, the duration of the task of completing the survey (speed of response is used as a proxy of quality, as we will discuss in Chapter 3). Online surveys have enlarged the type of data that are collected in a questionnaire. Together with the outcome of questions, a plethora of metadata and paradata can be collected and analysed jointly with the ‘main’ data. Metadata (data about data) and paradata (data about process) are empirical measurements about the process of creating survey data, in other words recordings about the fieldwork process. They include time spent per screen, keystrokes and mouse clicks, change of answers, etc.
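The paradata described above can be pictured as a small data structure. The following is a minimal sketch in Python and does not refer to any particular survey platform: the class names, field names and the 0.5 cut-off are all hypothetical choices made for illustration. It stores the paradata alongside the answers and flags respondents whose total completion time falls well below the median, one simple way of treating speed of response as a proxy of quality.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import median


@dataclass
class ScreenParadata:
    """Paradata recorded for one questionnaire screen (hypothetical fields)."""
    seconds_on_screen: float
    mouse_clicks: int
    answer_changes: int


@dataclass
class Response:
    """One respondent: the 'main' answers plus the paradata behind them."""
    respondent_id: str
    answers: dict[str, int]
    screens: list[ScreenParadata]

    def total_seconds(self) -> float:
        return sum(s.seconds_on_screen for s in self.screens)

    def total_answer_changes(self) -> int:
        return sum(s.answer_changes for s in self.screens)


def flag_speeders(responses: list[Response], fraction: float = 0.5) -> set[str]:
    """Return ids whose completion time is below `fraction` of the median.

    The 0.5 cut-off is an illustrative assumption, not an established rule.
    """
    if not responses:
        return set()
    cutoff = fraction * median(r.total_seconds() for r in responses)
    return {r.respondent_id for r in responses if r.total_seconds() < cutoff}


# Invented example data: three respondents, two screens each.
responses = [
    Response("r1", {"q1": 4, "q2": 2},
             [ScreenParadata(12.0, 3, 0), ScreenParadata(18.0, 4, 1)]),
    Response("r2", {"q1": 5, "q2": 5},
             [ScreenParadata(2.0, 1, 0), ScreenParadata(3.0, 1, 0)]),
    Response("r3", {"q1": 3, "q2": 1},
             [ScreenParadata(15.0, 5, 2), ScreenParadata(20.0, 6, 0)]),
]
speeders = flag_speeders(responses)
```

With these invented values only `r2` is flagged, having taken 5 seconds against a median of 30. In practice the threshold, and whether speed alone is an adequate quality indicator, would need to be justified for each study.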

This is just one example of how a relatively uncomplicated type of data familiar to social scientists is now a potentially complex, multidimensional object of analysis. Needless to say, data collected from online sources are often of this type, with a degree of complexity sometimes much higher than previously encountered. This is a somewhat new situation for social scientists – too much choice can be overwhelming and confusing. The ‘digital challenge’ will be a crucial one because the ‘thirst’ for methodological innovation in social sciences is due to the enduring crisis that has characterized most of the widely used existing techniques. Surveys are exemplary in this case: a pillar methodology across so many different disciplines, they are suffering a long-lasting crisis due to the increased difficulty in assessing response rates and sampling frames, and a limited capacity to capture variables that are not self-reported measures but important proxies. Similar considerations concern the in-depth interview, another important instrument of data collection in social science. One criticism concerns the translation of a technique developed before the advent of digital media, and the related question of the implications of interviews carried out via computer-mediated communication.

Increasingly, self-reported surveys and interviews measuring human motivations and behaviour are under scrutiny and being compared to more ‘organic’ sources of data (Curti et al., 2015). This is not to say that digital data do not raise a substantial amount of concern regarding the tendency to consider these as ‘organic’: the current debate relates to the kind of critical awareness that should accompany all methods used by social scientists. Perhaps for historical reasons, the artificial nature of traditional methods has been long forgotten until recently, when their capacity for generating quality data has become increasingly problematic.

Such limitations are even clearer if we consider two further aspects: first, the vast majority of social science data from surveys and interviews are cross-sectional without a longitudinal temporal dimension (Abbott, 2001); second, most social science datasets are coarse aggregations of variables because of the limitations in what can be asked from self-reported instruments. Digital data are forcing innovation on both counts, moving from static snapshots to dynamic processes and from coarse aggregations to high resolutions of data. The interesting by-product of these innovations is the possibility of an increased focus in the social sciences on processes rather than structures. For the first time, we can obtain longitudinal baseline norms, variance, patterns and cyclical behaviour. This requires thinking beyond the simple causality of the traditional scientific method into extended systemic models of correlation, association and episode triggering. Network analysis is a good example here: the availability of longitudinal relational data sparked the recent methodological and theoretical innovations about the dynamics of networks (Barabási and Posfai, 2016).

The aim of this book is to provide an overview and understanding of the most used digital research methods of data collection and the associated analytical strategies, paying particular attention to the methodological and theoretical issues that still need further reflection and discussion. This work is the outcome of my research experience as well as of teaching done over five years at the University of Leicester on the MA course in new media and society, at the methodology summer school of the London School of Economics and at my current institution, the University of Trento, with the addition of several courses taught across Europe and Asia. The backbone of this book consists of material developed for two courses: ‘Research methods for the online world’ and its complementary and more applied module ‘New media, online persuasion and behaviour’.

Equally important has been the experience of several large-scale behavioural research projects conducted for several European institutions on topics such as online transparency of platforms (Lupiáñez-Villanueva et al., 2018), online advertising (Lupiáñez-Villanueva et al., 2016) and online gambling (Codagnone et al., 2014). These have all been invaluable in learning how to study digital phenomena within the context of providing evidence for the development of public policies.

There is a vast number of books dedicated to mastering specific research techniques and the aim here is not to emulate such texts. Instead, the aim is to contextualize each technique in the study of human behaviour and societies. The use of the term social sciences is meant to describe the reality that there are different disciplines dealing with human affairs. Sociology, political science, anthropology, psychology and economics all have their own epistemological positions and methodological preferences. Digital social research does not escape the same condition: it is conducted with different aims, theories and methods depending on the academic discipline in which it is carried out. Most of the content of this book should be applicable to all social science disciplines, but it is inevitable that some discipline-related emphasis is present. Therefore, it is better to make it explicit that most of the author’s digital research has been carried out in the context of social psychological, behavioural and sociological studies. While there is increasing collaboration across the social sciences, those familiar with different approaches and from different disciplines will recognize an implicit set of assumptions and research goals, and will not, hopefully, be put off.

The first chapter is dedicated to the nature of digital data from the perspective of a social scientist. This is because digital data are highly complex and require particular attention when used for social research. The second chapter is dedicated to one category of data collection methods for the digital world. This category is labelled ‘unobtrusive’, meaning that these methods do not require the active participation of individuals: data are collected without directly engaging with people, who are not aware, unless they are notified, that their data are being used for research. However, these methods pose new challenges to researchers in that they have to take into account the design of the platforms from which data are collected when they draw conclusions about social phenomena. Both the so-called ‘affordance’ of technological infrastructures and their political economy need to be considered (Madsen, 2015; Fuchs, 2015).

The third chapter concerns methods that have found their digital evolution: surveys, focus groups and experiments have found a new life online, albeit with some caveats. These are obtrusive methods; they require active engagement by participants. While they have a consolidated history of practice, their extension to the digital domain also poses challenges and raises opportunities.

Chapter 4 is dedicated to what I believe is a crucial issue: the epistemological and methodological changes and challenges that digital data are bringing to social science research. The emphasis here is on the increasingly common use of analytical methods coming from computer science in the domain of digital social science. This point has been the object of debate among methodologists and it is a crucial obstacle in finding common ground with computer scientists in joint projects.

Chapter 5 presents an overview of network analysis, a longstanding tradition in social science research that has found new life thanks to the availability of digital relational data, that is, data about relationships between actors. It is applied largely to datasets from social media, but also to other sources (e.g., emails, telephone calls, text messages).

Chapter 6 deals with one of the most interesting developments in methodology for the social scientist: text mining. Text has always been part of the data collected by social science research, particularly in the qualitative tradition. Content analysis has been the dominant way of quantifying text characteristics and the most common analytical strategy adopted by researchers who have to manage large quantities of text. The digital age has, among other changes, brought an exponential growth of text produced by people. This is the everyday experience of using social media, but also blogs and forums. Never before has so much text spontaneously produced by people been available. At the same time, ever more sophisticated methods of automatic analysis of texts have been developed, allowing researchers to analyse anything from a few hundred to millions of documents (see, for example, Sudhahar et al., 2015). While these types of automatic analysis do not aim at substituting the in-depth understanding that human analysis can provide, they do provide a unique bird’s-eye view of a large set of documents that was simply impossible to have before. Much like the Nazca Lines, a series of large ancient geoglyphs in the Nazca Desert in southern Peru that are observable in their entirety only from the sky above, the patterns or even structures common to many different documents become detectable through text-mining techniques.

The last chapter is dedicated to a few general remarks about doing digital social research, among which we will discuss the ethical aspects of this kind of study. The recent Cambridge Analytica scandal about the misuse, for commercial purposes, of Facebook data has grabbed the attention of millions of citizens across the globe. At the same time, the introduction in the European Union of the new GDPR (General Data Protection Regulation) has changed the rules of the game, including those for social scientists (European Commission, 2018). There is no doubt that this is a complex and important issue that deserves an entire book by itself. I cannot provide a lengthy discussion of the ethical and legal implications of using digital data and will therefore limit discussion to what I believe are the most salient points. I will also mention the issue of access in terms of inequality of research opportunities across the research community. It is no mystery that many large digital platforms, including the largest social media, are owned by American companies. The consequence has been that North American academic institutions have historically had stronger relationships with these private entities than European and Asian universities.

After this rather long introduction, we move next to discuss the nature of digital data for the purpose of social research. Exciting opportunities are emerging, while, at the same time, old and new methodological challenges are not easily settled. These challenges are the future of social science research: if we do not include our digital social life in our research practices, our capacity to understand human societies is greatly diminished. And yet, the critical eye that social scientists have learned to exercise needs to be sharp in a research domain in which digital data have become the most valuable asset for very large sectors of the economy and are also more and more crucial for the political evolution of democracies and non-democracies alike.