Researchers who wanted to study people, their behaviours and their patterns once had to budget most of a project’s timeline to observing, recording and capturing data. Today’s “smart” devices are denoted as such because they connect to the internet and store and share vast amounts of data. This means that mobile handsets, tablets, watches, televisions and other appliances can both shorten the time it takes to capture useful data and drastically increase the amount of data available for answering questions about people. This is the age of Big Data.
At NIC.br’s sixth Annual Workshop on Survey Methodology in São Paulo, Brazil, the theme was Big Data. There was no attempt to exhaust the theme in the time allocated, but no fewer than 11 speakers addressed participants on the nature, applications, challenges and opportunities of Big Data.
The processes of storing, managing and studying Big Data are involved and resource-intensive. The infrastructure and database requirements are much greater than for ordinary datasets. Large datasets in the human sciences rarely exceed a few hundred thousand records, bar some exceptions, which amounts to a few terabytes of digital information. The enormous amount of electronic information yielded by Big Data takes up petabytes (thousands of terabytes) of disk space, which presents severe infrastructure constraints.
Although the data is easier to store, since its digital nature radically decreases its size, it is not easily rendered into communicable and meaningful patterns. Complex algorithms need to be designed and run to filter out undesirable variables and match the relevant ones so that meaningful results can be discovered. In other words, people proficient in coding languages are needed to provide researchers with the tools to experiment with the data once it has been stored and organised.
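To make that concrete, the following is a minimal sketch, assuming tabular device-usage records and the pandas library; the column names and values are hypothetical and stand in for the filtering and aggregation step described above, not for any particular pipeline presented at the workshop.

```python
# A hypothetical sketch: drop an irrelevant variable, then aggregate the
# relevant ones into a pattern a researcher can interpret.
import pandas as pd

# In practice these records would be read from a much larger store,
# e.g. pd.read_parquet("usage_records.parquet"); values here are invented.
records = pd.DataFrame({
    "device_id": ["a1", "a2", "a1", "a3"],
    "event": ["browse", "call", "browse", "browse"],
    "duration_min": [12.0, 3.5, 7.2, 40.1],
    "firmware_debug_flag": [0, 1, 0, 0],   # an "undesirable" variable to filter out
})

relevant = records.drop(columns=["firmware_debug_flag"])
usage_by_event = relevant.groupby("event")["duration_min"].agg(["count", "mean"])
print(usage_by_event)
```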
Applications are then required by anyone who wants to interpret this data and answer specific questions. The relevance of those research questions, and the contribution of their answers, are determined by the data, the relationships within it and the results it yields. The processes of sourcing and interpreting data are usually where the controversies and research problems of Big Data become apparent.
One of the biggest challenges people have with Big Data is the manner in which the data is collected. The devices that capture users’ details are mostly purchased and owned privately, which complicates the notion that information about their use belongs to those who sold the product. To mitigate this invasion of privacy, data collection involves a complicated process of anonymisation, which translates personal user data into arbitrary codes that hide the personal identifiers associated with a particular set of data. The problem, though, is that such treatments and procedures must be carefully planned and recorded, leaving a paper trail, albeit a minuscule one.
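As an illustration only, the sketch below shows one simple pseudonymisation scheme (a salted hash of an identifier). Real anonymisation pipelines involve far more safeguards, such as key management and re-identification risk checks, than this assumed approach shows, and the identifier field used here is invented.

```python
# A minimal sketch of replacing a personal identifier with an arbitrary code.
import hashlib
import secrets

salt = secrets.token_hex(16)  # kept secret; if leaked, codes could be linked back

def pseudonymise(identifier: str) -> str:
    """Turn a personal identifier (e.g. a phone number) into an arbitrary code."""
    return hashlib.sha256((salt + identifier).encode("utf-8")).hexdigest()[:12]

record = {"msisdn": "+5511999990000", "data_used_mb": 512}   # hypothetical record
anonymised = {"user_code": pseudonymise(record["msisdn"]),
              "data_used_mb": record["data_used_mb"]}
print(anonymised)
```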
One of the workshop presenters nevertheless advocated for protecting personal data treated in this way, arguing that industry’s three V’s (volume, velocity and variety) need to be supplemented by three C’s: crumbs, capacities and community. His argument went on to illustrate how the latter includes a body of norms, since the data relates to people with defendable rights. While the value of the data must be recognised and protected, this prescriptive perspective makes the case for doing the same for people.
Additional scepticism about Big Data was illustrated by the claim that democracy arose in political states where resources had spread beyond the control of autocrats. The free movement of such value-laden items dispersed power among a greater number of stakeholders, who forced regimes to evolve. Something similar ought to take place in the field of Big Data so that its limitations and capabilities can be more acutely realised.
The need for reinforced research standards when working with Big Data was emphasised during the workshop. Some of the more technical problems with such ‘deeper, better, faster, cheaper’ datasets were presented, highlighting the risks of variability, veracity and complexity in this new style of data collection, which combine to compromise the consistency, accuracy and replicability of studies.
The fact that modern ICT devices are purchased and used as private market items violates the integrity of random selection and introduces that all too pervasive research mistake, self-selection bias. What is more, researchers are forced to work with limited demographic variables (sometimes incorrectly captured) and a lack of stability in the research process, further undermining the integrity of Big Data studies. As a means of countering some of these problems, presenters advocated strongly for conceptual frameworks and for the use of the UN’s Framework for Statistical Guidelines to maintain high research standards when using Big Data.
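A toy simulation, not drawn from the workshop and using invented numbers, can show why self-selection matters: if heavy users are more likely to appear in device-generated data, the resulting average overstates usage compared with a properly random sample.

```python
# Toy illustration: estimating average monthly data use (MB) when inclusion
# in the sample grows with usage itself (self-selection) vs. random sampling.
import random

random.seed(0)
population = [random.lognormvariate(3.0, 1.0) for _ in range(100_000)]  # MB/month

# Random sample: every person has the same chance of inclusion.
random_sample = random.sample(population, 1_000)

# Self-selected sample: the heavier the use, the likelier the inclusion.
max_use = max(population)
self_selected = [x for x in population if random.random() < x / max_use]

true_mean = sum(population) / len(population)
print(f"true mean          : {true_mean:8.1f} MB")
print(f"random sample mean : {sum(random_sample) / len(random_sample):8.1f} MB")
print(f"self-selected mean : {sum(self_selected) / len(self_selected):8.1f} MB")
```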
Strong conceptual frameworks and strict research standards can produce, and have indeed produced, powerful Big Data research. However, much of its value is extracted by supplementing it with more traditional data sources such as surveys. Research ICT Africa (RIA) collects a wealth of such data, mostly on ICT access and use at the household and individual levels, though it is not limited to these. Our demand-side studies are very important for evidence-based policy-making and can be complemented by the information captured through Big Data. These studies are built upon random sampling and representative datasets, which establish accurate relationships and contexts in the ICT sector, attributes desperately needed for quality Big Data research. RIA’s data is hosted by DataFirst and can be accessed for use here.
The future of research will be determined by how we treat Big Data now – it is critically useful and in need of traditional quality standards.
Find the full workshop report here.