Methods Big Data Research

Big data has become a paradigm for a supposed new type of data and a form of knowledge discovery.
1 The hyperbole, if not necessarily the actuality, that surrounds big data now has an impact on nearly all aspects of our social life and increasingly has altered the way researchers and practitioners work with digital information. The language of big data dichotomizes our work as new versus old and big versus small. It affects the primary subjects of our inquiry, furthering the roles of people as sensors, communication as content, and living as behavior. The impacts are numerous, for example, with lucrative markets for data brokerage that supply the demand for a “data-now” business environment.
2 Governments, from the municipal to national levels, are transitioning from old to new ways of administering public services and resources via, for example, data analytics.
3 Skepticism accompanies public and private sector initiatives as big data passes “the top of the Hype Cycle, and [is] moving toward the Trough of Disillusionment.”
4 Critiques within academia attempt to clarify the bewildering claims of big data by exposing its varied theoretical pitfalls and practical shortcomings to produce meaningful or actionable results.
5 Recent literature does little to dent the exaggerated hopes for big data, even as these hopes persist within the academy. As Wilson notes, social scientists are often quick to mistake big data as being a window into the daily life of its creators, claiming that “for many [of the general public], tweeting, posting, retweeting, and sharing is akin to breathing.”
6 We were drawn to big data for a pragmatic reason. An author of this blog submitted a journal article focusing on Web 2.0 mapping platforms applied to local community development. According to a well-meaning reviewer, the number of contributions (n < 100) was deemed too small and too slowly contributed to offer the rigor expected in making informed statements about community needs or desires.
We argue that this represents an increasing tendency to frame geospatial technologies and diverse approaches for conducting social science research such that they conform to approaches related to big data. Social processes are thus articulated in terms of volume and take place on cloud-based platforms. This type of data determinism, in which technology dictates the manner in which we understand society, can lead to a new form of publication bias. Properties of the specific technology overrule the value of very small and slow data and revive periodic arguments within geography on quantitative over qualitative methods.
We are not completely disheartened by the increasing role digital data has in social science inquiry, a phenomenon that can be studied and, contra, also can evince a phenomenon.
8 Our concern is that, by casting all data research in relation to big data, data scientists exacerbate internal issues within disciplines like geography and assist the momentum of a data-hungry social science. Framing data as needing to be big (e.g., large volume and high velocity) engenders expectations of validity and truth and creates a normativity in the social science of how research “ought to be” conducted. We address these issues by first discussing the multiple origins of the concept of big data, the emergence of small data in response to the failings of big data, and the plight of very small and slow data as it is compelled to complement the epistemology of big data.
Big Data, from Industry Purpose to Industrious Purpose
According to Hidalgo, little “hype has come from the actual people working with large datasets. Instead, it has come from people who see ‘big data’ as a buzzword and a marketing opportunity—consultants, event organizers, and opportunistic academics looking for their 15 minutes of fame.”
9 It is important at the outset to state that there is no unifying definition of big data. Instead “big data” as a term has diverse origins and serves multiple needs stemming from academia, industry, and media. Mashey was perhaps the first to explain the concept before the term became common.
10 He talked of “infrastress”: vast amounts of data were placing excessive demands on existing computing infrastructures and required new approaches for data storage, database structures, instant access, and analytic services. To Mashey, big data offered no newly opened window into the social or customer world.
11 Rather, big data presented a technical challenge to hardware and software engineers. Weiss and Indurkhya used the term in the name of a chapter in their book Predictive Data Mining: A Practical Guide.
12 The authors, with backgrounds in artificial intelligence, viewed big data handling as an opportunity that was not without inherent challenges: “Very large collections of data . . . are now being compiled into centralized data warehouses, allowing analysts to make use of powerful methods to examine data more comprehensively. In theory, ‘big data’ can lead to much stronger conclusions for data-mining applications, but in practice, many difficulties arise.”
13 What began as a data management issue soon turned into a novel way of thinking about data. Tony Hey and fellow researchers at Microsoft went so far as to proclaim the emergence of a “fourth paradigm” of science, in which big data called for new methods to address the “current scientific data deluge.”
14 Elwood, Goodchild, and Sui explicated the three prior dominant paradigms of twenty-first-century science: “the empirical (by describing natural phenomena), the theoretical (by using and testing models and general laws), and the computational (by simulating complex phenomena using fictional/artificial or small real-world data sets) approach.”
15 That data deluge, according to Chris Anderson of Wired magazine, heralded the end of theory.
16 Current theory-driven models were inadequate and inappropriate in the face of the opportunities afforded by volume. We no longer needed to rely on hypotheses to deduce relationships in complex systems.
17 Instead, society required a data-driven science to explore and discover relationships where none were previously known to exist. According to Jim Gray, another Microsoft employee, data-driven science was akin to a “microscope”— a situation in which research problems were investigated as though one were peering at millions of interactions through a microscope.
18 An inductive, data-intensive approach to science would “serve as the new ark upon which we can survive the current big-data deluge.”
19 The argument was that big data fundamentally differed from a very large data set and that this difference demanded new approaches that broke from traditional norms of science. In the data deluge, precision has been offered as a replacement for accuracy, which cannot be evaluated because there are too many data. Big data sources such as Wikipedia utilize crowdsourcing to refine data.
20 The most common articulation of the difference between very large data sets and big data came from Laney, who reflected on his first-hand experiences with the challenges of big data storage and management. Laney characterized this new type of massive data set through what became known as the essential three Vs. He argued that data sets were far more voluminous than before; there were new platforms on which data continuously streamed or were available at much higher velocities.
21 Data also were now accessible and analyzable as individual records, as opposed to entire datasets. Many of these data now manifest in highly unstructured and messy forms. Volume, variety, and velocity posed challenges in data integration and system interoperability. Today diverse accounts of big data are occasionally accompanied by a fourth, fifth, and even sixth V as necessary supports for the utility of specific big data applications (e.g., veracity, viability, variability, value). It is with the challenges (and opportunities) of these Vs that big data features began to substitute for a singular definition, which we illustrate with an equation:
big data = f(volume, variety, velocity, and maybe veracity, viability, variability, value . . .)
Big data has solidified around these first three features of volume, variety, and velocity. Difficulties in characterizing the phenomenon of big data nonetheless remain. A recent informal survey by the School of Information at the University of California, Berkeley, asked “more than 40 thought leaders in publishing, fashion, food, automobiles, medicine, marketing, and every industry in between how exactly they would define the phrase big data.”
22 Unsurprisingly their results revealed nearly forty unique definitions of big data that largely revolved around a particular contributor’s application needs. What becomes evident after just a few entries is that big data has become a colloquial phrase that valorizes the potential of realizing granular insights relevant to specific goals rather than the mythical access to a generalizable data set that can be leveraged for numerous unanticipated uses.
According to Floridi, many of the definitions of big data rest on ambiguity and circular reasoning: data is big only in relation to our current computational power.
23 As M. Graham and Shelton put it, the “modifier ‘big’ [is] always relative and represent a moving target.”
24 Contemporary “small data” were extraordinarily large a half-century ago, and contemporary notions of big data will likely be tiny just a half-century into the future. Batty argues that conceptual ambiguity has led to a joint focus on the creation and use of big data instead of concentrating merely on volume.
25 Just being big (e.g., larger than an Excel table) does not render data valuable. Big data, therefore, remains “an abstract concept” that is only set “apart from masses of data, [by] other features, which determine the difference between itself and ‘massive data’ or ‘very big data.’”
26 Additionally, big data becomes constitutively inextricable from the capabilities of available software and hardware of the day.
27 Thus if it exceeds the capacity of a spreadsheet, then perhaps it is still big. The irony is that big data is not entirely the product of machines. Big data can trace its origins to the period from the 1880s to the 1940s at the Harvard College Observatory, where one-half million observations of the night sky were amassed entirely by humans.
28 Despite ambiguities or perhaps because of them, big data quickly transitioned from a term used to describe data collection and management to a marketing slogan that promised to enhance business practices and target customers.
29 The slogans shifted from “Big Data as Boogeyman,” signaling that early misgivings over costs trumped the potential value of big data, to “the Big Data Gold Rush,” in which big data was credited with creating a data market worth $125 billion.
The emergence of big data as a commercial industry clarified big data as a commodity that could be leveraged for business intelligence and as a subject for data science that could lend a competitive advantage. Big data also foregrounds the role of information technology (IT) in businesses, from a function the firm relegated to the IT department (e.g., payroll, inventory, and projections) to a core function in which an agile firm responds quickly to constantly changing data.
31 Agility requires new IT investments to manage data flows through, for example, new analytics, visualizations, and user interfaces. In this way data as marketing slogan returns to the original realm of big data in computer science.
Early marketing campaigns were paired with phrases such as “data overload” or “info bloat,” which portrayed big data science as the solution to excessive and unwieldy content.
32 Outside of a few anecdotal cases, big data continues to fail in delivering on the insights and value. A survey of more than three hundred IT departments found many big data-oriented projects never left the planning stages, with the proofs-of-concept and prototypes failing to reap value for their firms.
33 The same survey found a lack of empirical knowledge has already resulted in costly mistakes. Survey results mirrored an increasing disillusionment in the business community. News sites like ZDNet and Forbes suggested that big data was “oversold” relative to results: “Big data is hard (and the domain of the few). Doing it at scale and waiting for the trickle-down benefits can take time.”
34 This has contributed to the ambiguity in big data definitions. Growing amounts of digital data are viewed as a panacea for industries and scientific endeavors while being accompanied by deflated expectations. To achieve the value promised by big data, it appeared we needed to bring in more data under its umbrella.
From Big Data to Small Data
Despite the promises of analytics or data visualization Eureka moments, big data was failing to extract value. It was too big, too fast, and too heterogeneous; it was incomprehensible and impersonal. “While companies (and computers!) like big data, most people only need small data,” asserted Fidelman, because, compared to big data, “it is easier to analyze and test small data sets to differentiate signal from noise to extract meaning.”
35 Segments of big data were increasingly seen as a cure for big data’s lack of utility: “one good strategy to solve the ‘curse of big data’ . . . is the intentional and purposeful breakdown of large data sets into smaller data sets.”
36 A definition soon emerged from the private sector to formalize the concept of small data, which “connects people with timely, meaningful insights (derived from big data and/or ‘local’ sources), organized and packaged—often visually—to be accessible, understandable, and actionable for everyday tasks.”
37 Small data echoed similar ambiguity found in big data, but small data was viewed as a way to deliver on the promises of big data without inducing extraordinary effort to extract value.
We argue that the concept of small data emerged for two reasons. First, small data offers a way to derive value from data sets using the same data science and analytics designed to reveal value in big data. Second, small data asserts the primacy of big data in framing all data. We would have no small data without big data, because “prior to 2008, data were rarely considered in terms of being ‘small’ or ‘big.’ All data were, in effect, what is now sometimes referred to as ‘small data’ regardless of their volume.”
38 We turn to recent literature for several perspectives on the emergence of small data vis-à-vis big data. The emergence of small data helps us construct an epistemology of big data but at some cost to the integrity of small data:
small data = big data − some data
The first perspective is that small data is merely a digestible chunk of big data. Timely and meaningful insights derive from the deliberative extraction of subsets. This smaller data set is extracted because it responds to a particular organization or need.
39 The initial process conducted on any big data is to reduce the data set in some meaningful way.
40 The process represents both a reduction and a recognition that “data in the wild” is never raw.
41 The reduction of big data minimizes the cost of data handling and presumably then maximizes the insights from otherwise bloated data sets. That is, utility is achieved by sampling and removing redundant, erroneous, and irrelevant data. In that way, we produce working data sets. According to Lu and Li, data scientists rarely conduct analyses on big data; they ultimately utilize small data: “Most of the time, the direct access to the entire data is neither possible nor computationally feasible, forcing people to probe the properties of the data by looking at a sample. Because of the huge size of the data, quite often even a sufficient sample is too costly to obtain considering the network traffic involved and daily quota imposed. For practical consideration, we are often limited to the smallest possible sample.”
42 Jacobs provides an example of the computational challenges of handling big data.
43 The researcher generated a synthetic database consisting of 6.75 billion 16-byte records that were intended to emulate a census-like record (e.g., age, religion, income, and address) for each person on the planet. Such a data source would be undeniably useful to geography researchers and others, and it was easy to store the records for the world’s entire population on a single consumer-grade laptop in 2009. Jacobs argued that data storage does not present the limiting factor; analysis space is the challenge.
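The storage claim can be checked with simple arithmetic. The sketch below is our own illustration rather than Jacobs's code; it assumes decimal gigabytes and counts only the raw record payload, ignoring any index or file-system overhead. The names `RECORDS`, `BYTES_PER_RECORD`, and `dataset_size_gb` are hypothetical.

```python
# Figures taken from the description of Jacobs's synthetic database.
RECORDS = 6_750_000_000      # one census-like record per person
BYTES_PER_RECORD = 16        # size of each record in bytes


def dataset_size_gb(records: int, bytes_per_record: int) -> float:
    """Raw payload size in decimal gigabytes (1 GB = 10**9 bytes)."""
    return records * bytes_per_record / 1e9


size_gb = dataset_size_gb(RECORDS, BYTES_PER_RECORD)
print(f"{size_gb:.0f} GB")  # prints "108 GB"
```

Roughly a hundred gigabytes is a modest volume to store, which is consistent with the point that storage is not the limiting factor.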
44 To derive insights (information) from massive number-crunching analyses, particularly when those data have temporal and spatial dimensions, requires the data scientist to respect the “aggregat[ion of] data in an order-dependent manner (for example, cumulative and moving-window functions, lead and lag operators, among others).”
45 The random access of most big data analytics destroys the temporal and spatial contexts of the data. Small data can maintain topology where big data could not.
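The order dependence described here can be made concrete with a toy example. The following sketch is an illustration under our own assumptions, not an analysis from Jacobs: it builds a hypothetical twelve-step series with a steady upward trend, shuffles a copy to mimic random access, and compares an order-independent aggregate (the sum) with an order-dependent one (a trailing moving average). The function name `moving_average` and the seed are arbitrary choices.

```python
import random


def moving_average(values, window):
    """Trailing moving average; the result depends on record order."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Hypothetical time-ordered series: a steady upward trend.
ordered = [float(t) for t in range(12)]

# Mimic random access by shuffling a copy with a fixed seed.
rng = random.Random(0)
shuffled = ordered[:]
rng.shuffle(shuffled)

ordered_ma = moving_average(ordered, 3)
shuffled_ma = moving_average(shuffled, 3)

# The sum ignores order, so it survives the shuffle.
print("sums equal:", sum(ordered) == sum(shuffled))
# The moving average encodes the temporal sequence, so it does not.
print("moving averages equal:", ordered_ma == shuffled_ma)
```

The records are identical in both cases; only the sequence differs, and that sequence is precisely the temporal context the moving-window function needs.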
Jacobs’s example illustrates a recurring contradiction to amassing large data sets. We acquire the data even as we fail to amass the concomitant technological resources to handle such troublesome “bigness,” and we may not acknowledge the uneven access to such big resources. Floridi clarifies the paradoxical challenge in which value from big data merits “more and better techniques and technologies, which will ‘shrink’ big data back to a manageable size.”
Thus this first formalization of small data emphasizes the distillation of data from larger counterparts to avoid existing computational limitations and analytic overload. Bigness realizes value only when it becomes small. Even as our storage capacities grow, we will still likely need to “chunk” the data so we can analyze it. Here small data becomes a datum of big data, which is groomed by machines to the needs of an individual actionable effort:
small data <= human brain
A reason for the failure of many big data projects can be attributed to a decision paralysis in the presence of all the possible tools, data sources, and potential applications available to big data.
47 The tools are essential since incomprehensibility is considered an intrinsic characteristic of big data. A second perspective on small data refers to its capacity to improve on understanding big data.
48 Here small data is cast as data that is small enough in size for human comprehension.
49 A working paper by Markowsky uses this humancentric definition of small data to justify human intervention in the subset of big data so that it “can be easily grasped by the human mind and easily visualized by the human eye.”
50 Small data in this perspective is similar to the above description in the attempt to render data into familiar and manageable small data models. Instead of relying purely on a technological solution to derive big data insights (i.e., through computational analytics), this perspective embraces a traditional approach to interpreting data. The human brain becomes the analytical computer rather than depending solely on algorithms or statistical correlations crunching otherwise incomprehensible data sets.
51 The purpose of this characterization of small data is to aid people in using big data, so they can derive information and establish their own insights. This process proves difficult to replicate and to subset intelligently.
52 It is also difficult to share, as we discuss below in the section “When Small Data Isn’t the Answer, Regardless of Size.”
small data = big data me
A different narrowing of the digital deluge has small data identified as that which is only about yourself. We find this perspective in the realm of the quantified-self movement, which comes from the rise of wearable and mobile technologies.
Devices such as Fitbit create relatively large volumes and velocities of content about individuals. These are the digital traces generated by nearly all aspects of technology we use, which can be in turn analyzed to derive insights about our own individual behaviors. Estrin and the Small Data Lab at Cornell University consider this type of data to be small data, which “we can think of . . . as [a] new kind of medical evidence, evidence where n = me because it complements traditional big-N population studies with data that are just about me (or you) over time.”
54 Compared to big data, big data me is intended to be neither anonymous nor aggregated. It is intended to be comprehensible because it concerns a specific individual and mirrors the attempts to value big data by discretizing the data into digestible chunks. Small data thus acquires a personalized characteristic absent the collectivity of big data. This small data represents a new, highly personal source of valorizable information, where data reaches deep into the body, for example with embedded WiFi-connected medical devices like pacemakers. Big data me reveals an underlying moral quandary for small data. Individuals may generate the data, but this type of small data is largely out of reach for most individuals to obtain or effectively use and is further obfuscated by (lack of) rights to data ownership and privacy considerations.
In the same way that small data represented a break from big data in terms of distillation and comprehension, this perspective highlights a distinction in who the user is. In big data the end user equals the analyst. In small data, the end user can be the source of data (or collector, as we will see with national censuses) but not necessarily the analyst.
It is likely the small data of the quantified self becomes aggregated across individuals for an analyst. The analytics and visualizations are built from the aggregations, which are then customized to an individual’s data stream. Neither the devices nor the software would be developed were it for a single individual. Thus we begin to see the inextricable interplay between big and small:
small data = big data/domain
The fourth perspective on small data refers to big data shrunk by specific domains like geography. It is often related to data about the self: “the data on my household energy use, the times of local buses, government spending—these are all small data.”
56 Importantly it also depicts a domain, for example, of energy, transportation, and public administration. Many of these feeds contain explicit (e.g., bus locations) or implicit (e.g., government spending, which is jurisdictionally bound) geolocations. At the 2012 meeting of the American Association of Geographers, there was a special session entitled “Whither Small Data? The Limits of Big Data and the Value of ‘Small Data’ Studies,” which led to a special edition of the GeoJournal.
57 At this session, Goodchild and Kitchin characterized geographic data under the category of small data. Their primary example is the national census because its volume resembles a common characteristic of big data and because of the central role a census plays in many geographic inquiries.
58 Goodchild reinforces the perspective that small data is domain based as opposed to volume based. He argues that, in just the space of two years, small data has evolved from acting as proxy for big data to being a general term that situates the “traditional geographic approach” within the practice of small data studies: “Big data is distinguished from what I propose to term small data by its lack of the normal processes of quality control, documentation, and rigorous sampling. . . .
Small data, exemplified by the products of the census, has supplied all of those things, with the result that analysis of small data readily leads to generalization.”
59 Here big data is either reduced to a specific domain or big data becomes more comprehensible when it is rendered amenable to specific domain methods. In a later piece, H. Miller and Goodchild assert the value that geography brings to big data.
60 They argue that geographers possess lengthy experience with data volume (e.g., with Landsat remotely sensed imagery), as well as data velocity and variety (e.g., with volunteered geographic information [VGI] of multimedia geolocated content from social media platforms).
Traditional methods were developed throughout the quantitative revolution in the field, survived the cultural geography backlash, and flourished in the GIScience backlash to the backlash. These rendered the discipline arguably better prepared than some others to engage with big data and fuse that engagement with smaller social science research.
61 Traditional methods could be applied to newer geographic data, such as VGI, which resembles big data in its “messiness,” is “unstructured, collected with no quality control, and frequently accompanied by no documentation or metadata.”
62 This perspective on small data opens up traditional geographic data sources, such as a national census, to new analytic techniques and new data, like VGI, to traditional geographic methods.
The implication is that disciplines can assert their relevance by transforming big data (“taming” data) into more meaningful data. In turn, a specific knowledge domain gains relevance by its association with a new source of valorization. By positioning a discipline’s data in relation to big data, a discipline is shown to be equipped to tackle a new data source and to be sufficiently important to be heeded by other disciplines.
The value of this positioning vis-à-vis big data within geography is energized by claims for its powerful and unique ability to peer into layered and complex social systems. These claims are advertised by statements like this: “imagine, for example, the human geography and broader social science research that could be undertaken with the data set put together by President Obama’s team for his 2008 and 2012 election campaigns.”
Hyman points out that media speculation artificially elevated the electoral data–crunching techniques utilized, which was accomplished with relatively small data capable of being analyzed with paper and pencil.
This mirrors the hype in the private sector about the promise of knowledge discovery that can combine domain expertise with big data and data-driven science. We desire to see the potential for big data and its analytics even when it may not exist.
Goodchild has posited small data as data with quality control, documentation, and rigor.
This mirrors Kitchin and Lauriault, who offer a formalized definition for small data in which “small data are . . . characterized by their generally limited volume, non-continuous collection, narrow variety, and are usually generated to answer specific questions.”
This usage of small data generally fails to distinguish between data and information or to specify what role data have in the various techniques of collection, analysis, and use within current geographic research. Despite this omission, the authors illuminate a critical difference between big and small data: most data prior to the vaunted bigness were targeted and organized with intent.
Small data is goal-driven data, created with specific goals and objectives. We will argue later that this intent, one of several distinguishing features of small data, can evaporate and thus damage the defense of small data. More significant for us, this definition cements the intrinsic ties between small data and big data, as the former is defined with the modifiers of the latter and, as acknowledged by the authors, is susceptible to big data’s science and practices:
First, despite the rapid growth of big data and associated analytics, small data will continue to flourish because they have a proven track record of answering specific questions. Second, the data from these studies will more and more be pooled, linked, and scaled through new data infrastructures, with an associated drive to try to harmonize small data with respect to data standards, formats, metadata, and documentation, in order to increase their value through combination and sharing.
Third, scaling small data exposes them to the new epistemologies of data science and to incorporation within new multi-billion data markets being developed by data brokers, thus potentially enrolling them in pernicious practices such as dataveillance, social sorting, control creep, and anticipatory governance, for which they were never intended.
The prior definition highlights one last hoped-for perspective on small data—that small data is not related to big data but still serves as an input parameter of big data analytics:
small data ≠ big data, but value = big data analytics(small data)
Our central conclusion is that these varied small data perspectives, rather than offering different lenses on data, instead reassert the discourse on big data. Small data is more comprehensible, possesses more rigor, and so on, especially when those data are about us. However, that small data is also positioned vis-à-vis big data to presumably reap all of big data’s advantages. As soon as we attempt to define big and small data we expose a circular problem: big data finds value only when made small, but small data, according to some, achieves value only when it is reassembled into something resembling big data.
We basically roll around and around between big and small data and consequently gain no greater clarity on either type. Small data perspectives also call attention to long relationships between a domain like geography and Internet-related technologies.
Small data as georeferenced data represents a separation from and an insertion into a big data epistemology in which a census may embed purpose to the data set but the multiple Vs begin to matter for all kinds of data. In these inclusions of nominally big data into small data, the proponents of the definition also recast geography by separating previous geographic data and practices from the tenets of future data. All data, particularly the eminently mashable geographic data, become part of big data.
When Small Data Isn’t the Answer, Regardless of Size
Increased power and control, disruption, and new insights accompany concepts of big and small data. Haklay, Singleton, and Parker identify how neologisms, especially those associated with the Internet, are common in many research fields that attempt to invoke legitimacy alongside dominant research agendas.
68 For us, neologisms serve as a shorthand for epistemology, a way of achieving truth that, for big data, lies in its capacity to be valued (e.g., monetized). By definition, most neologisms are benign or go unnoticed. Occasionally the mainstreaming of a neologism can offer less than productive framings. We argue that the neologism of big data can fail because it is ambiguous, often deliberately so. In large part, the ambiguity derives from a lack of context and intent, which presumably is remedied by small data.
Small data can likewise fail to retain intent and neglect the diversity in perspectives among researchers in both their theoretical and methodological understandings of data. In a special GeoJournal issue on big data and geography, Burns and Thatcher editorialize on the consequence of a neologism that provides less than productive framing:
“In organizing this issue, it became clear that even amongst a small set of authors working from a single set of prompts, big data, its influence upon society, and its meaning in day-to-day life will differ radically depending on the research contested as important, distinctive, or superfluous. What one author clearly demonstrates as a fundamental concern to epistemology stemming directly from big data analysis, another accepts as a prerequisite for consideration of another fundamental focus.”
Both Big and Small Data Experience Information Loss
Volume is the most important part of the neologism of big data. If size matters in the neologism, then the adage “more is better” captures the homage small data must pay to big data. Kitchin explains that big data lays claim to an exhaustive observation space where entire populations are captured compared to the planned sampling strategies representative of small data.
70 However, capturing entire populations is hardly the case in contemporary big data due to restricted access, unavoidable selection bias, and numerous other factors (e.g., digital divides of potential contributors and differing ontologies of online data sources). This focus on size is perhaps rooted in a conflation of data and information.
Wu is a principal data scientist at a big data firm but remains a skeptic. He explicates the “more is better” fallacy: “While data does give you information, the fallacy of big data is that more data doesn’t mean you will get ‘proportionately’ more information. In fact, the more data you have, the less information you gain as a proportion of the data. That means the information you can extract from any big data asymptotically diminishes as your data volume increases.”
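One way to picture Wu's point is a toy model, offered here only as an illustrative sketch and not as Wu's own analysis. Assume there are K distinct things to be learned and that each record reveals one of them uniformly at random. The names `expected_distinct`, `distinct_per_record`, and the value K = 1000 are hypothetical choices made for this example.

```python
def expected_distinct(k: int, n: int) -> float:
    """Expected number of distinct categories seen after n uniform draws from k."""
    return k * (1.0 - (1.0 - 1.0 / k) ** n)


def distinct_per_record(k: int, n: int) -> float:
    """New 'information' (distinct categories) obtained per record collected."""
    return expected_distinct(k, n) / n


K = 1000  # assumed number of distinct things there are to learn
volumes = [100, 1_000, 10_000, 100_000]
ratios = [distinct_per_record(K, n) for n in volumes]

for n, ratio in zip(volumes, ratios):
    print(f"{n:>7} records -> {ratio:.4f} distinct categories per record")
```

Under these assumptions the total amount learned keeps rising toward K, but the amount learned per record falls steadily as volume grows, which is the "proportion" that Wu describes.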
71 Paradoxically, the more data one has, the more information one may lose. The signal can become swamped by the noise and the biases. Small data supposedly offers greater contextual comprehension (data geared for the human brain) and therefore could decrease information loss. However, information can be lost in applying the neologism of big data to small data. The previous section mentions how Jacobs detailed the potential information loss to census analyses because the analysis cannot maintain the data’s topology.
72 This holds whether the data is big or small (recall that census data are considered small data by some). The analysis space may still be insufficient to the task. If small data is randomly sampled or “analyticalized” in a fashion similar to big data, then any underlying structure (e.g., the sequence of records) likely will be destroyed.
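The loss of sequence under random sampling can be sketched with a small, hypothetical example. It assumes records are simply numbered in their original order, and it counts how many neighbouring records remain neighbours after sampling. The helper `adjacent_pairs_kept`, the sample size, and the seed are illustrative assumptions, not anything drawn from Jacobs.

```python
import random


def adjacent_pairs_kept(records):
    """Count neighbours in `records` that were also neighbours in the original sequence."""
    return sum(1 for a, b in zip(records, records[1:]) if b == a + 1)


original = list(range(1000))        # record i is followed by record i + 1
rng = random.Random(42)             # fixed seed so the sketch is repeatable
sample = rng.sample(original, 200)  # simple random sample, in draw order

kept_original = adjacent_pairs_kept(original)
kept_sample = adjacent_pairs_kept(sample)
kept_sorted = adjacent_pairs_kept(sorted(sample))  # re-sorting restores order, not neighbours

print(kept_original, kept_sample, kept_sorted)
```

Even when the sample is put back in order, most of the original record-to-record links are gone, because the records that sat between the sampled ones were never drawn.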
Big data results in information loss about its life cycle, primarily due to the need for repurposability. However, information loss about the life cycle of data collection can be obscured even with small data. Armstrong and Armstrong critiqued Statistics Canada’s approach to collecting national census data.
They recommended reexamining data from the lens of those it was meant to represent and explicating data’s relation to the theoretical assumptions made throughout the various stages of each datum’s life cycle. This matches a popular ailment of big data in which the context of a datum’s creation is as important as the datum itself. As Snickars says in his critique of data mining used by companies like YouTube, “if the content is king, then context is its crown.”
Small Data Can Lose Verification as Easily as Big Data
Small data like a census seeks to be exhaustive in terms of capturing social demographics on entire populations at set periods in time. Census data lack the velocity and variety to be considered big data; such data are also constrained in terms of access because availability is restricted to sampled profiles. To protect the privacy of citizens, Statistics Canada limits the reporting of certain geodemographic characteristics to a 20 percent sample and provides the geodemographics in aggregated form unless the agency grants special authorization.
The advantage of this official or authoritative data set does not necessarily lie in its verified account of the population but in that it is a controlled and directed collection. This is consistent with the domain-based category of small data with its own internally consistent rules with regard to quality control, documentation, and rigor. These rules offer tangible means for understanding possible biases of samples and/or the entire data set, with the potential to compensate for systematic errors.
Elevation of the census as the quintessence of small data implicates this type of data source as a more verifiable or authentic account of social insights than big data. One would be hard-pressed to gain similar levels of geographic or demographic granularity through popular social media services for numerous reasons, whether because of restricted access to proprietary data or inherent degrees of uncertainty induced by unverifiable profile information. Wilson illustrates the unverifiable nature of social media data with a ruse turned viral reporting on the death of actor Morgan Freeman.
75 As Wilson reminds us, the truth is no prerequisite of big data, but mistaken authenticity is possible also in small data. 76
Small or Large, Researchers Become Big Data Scholars
Independence of specific domains implies a transformation in scholarship in those domains. Small data research could turn scholars into junior big data scholars. As small data is upscaled, Kitchin sees new opportunities for data science and increased availability of research funding.
This mirrors the emergence of GIScience mentioned earlier. GIScience, emerging from an academic backlash over cultural implications, as well as from a tool-versus-science debate about geographic information systems, centered on whether positioning GIS as a science conferred greater legitimacy to the research.
78 GIS as a form of tool use can be seen as inferior to a GIScience. Tool-using represents the domain of practitioners, whereas a science label could lead to greater standing in the academy, with the promise of more highly rated publications, larger grants, and more tenure lines. Transforming big data into a science and then positioning small data within big data could presumably achieve benefits similar to GIScience. We already see the positioning regarding tenure lines with advertisements for academic positions in geospatial data science. If we can aggregate small data sets, for examination with data mining, then small data sets achieve a renewed and rebranded value in the academy.
According to Kitchin and Lauriault, “the data from these studies will more and more be pooled, linked, and scaled through new data infrastructures, with an associated drive to try to harmonize small data with respect to data standards, formats, metadata, and documentation.”
79 Exhortations to harmonize data do not automatically result in harmonization, in large part because these digital forms of standardized aggregation can conflict with institutional cultures. Culturally, the ideals of the ivory tower may follow the democratic virtues touted by supporters of data sharing. That same culture can punish the pooling of data. The adage of publish or perish remains deeply embedded in the research culture.
In an increasingly neoliberal university, which injects market values like competition into academe, sharing data by enabling its pooling can mean a researcher loses one additional opportunity for career advancement and job security. Indeed, structuration itself can form part of research discovery: “Scientists now have too much choice when it comes to data formats. In fact, it’s quite common for researchers to invent formats for each new technique and sometimes each experiment. This makes the work of integrating large data sets significantly more difficult.”
80 Trevor Garret is lead researcher on the Dutch national project to create an international data-sharing infrastructure.
81 He argues that effective scaling through data structures resembles a kind of magical thinking. Infrastructures may be desired but fail to even approach their asserted objectives. Canada’s auditor general disclosed information about a taxpayer-funded $15.7 million project to build a “trusted digital repository for records, but due to a change in the approach it was never used.”
The goal was to collect government data back to 1890, yet the host of the repository, Library and Archives Canada, currently has a backlog of almost one hundred thousand boxes, some of which have been untouched for more than twenty years. The repository’s search functions are reported as inefficient, which is particularly problematic with respect to information on Canada's shameful Indian residential school system, which is needed for the Truth and Reconciliation Commission of Canada. Magical thinking has pervaded preparations for archiving paper documents. Canada has yet to craft a strategy to manage the imminent arrival of digital-only documents.
More is known about why individuals refuse to share their data than is known about why they would share. Wallis, Rolando, and Borgman surveyed users of or contributors to a sensor network-sharing platform, where participants’ greatest concern was information loss in the pooling of data: aggregation on these portals separated data from documentation context that would allow for proper attribution to the original contributors of data.
83 When data sharing did occur, it prevailed in person-to-person interactions and not through impersonal digital infrastructures. The authors confirmed that few institutional incentives exist for rendering data interoperable and then using that shared data. When there is little incentive to share data, it is difficult to envision funding support for an infrastructure to make interoperable the “richness and variance that is likely to exist in . . . slices of the long tail of science and technology research.”
84 Whether the issue is small data or big data, enabling interoperability can demand a profound shift in what is valued in the research process. A drive toward interoperability can move the focus in the means-ends idiom. Instead of serving as a means to generate findings, data become an end unto themselves. We have seen this shift in GIS implementation, in which data have long achieved a value separate from the reasons for their generation.
85 One struggle in GIS has been documenting data sufficiently to retain institutional memory about its provenance, classification, and intent. Difficulties in creation and upkeep of spatial metadata have long been known; automation has not markedly improved its collection.
86 There also is the challenge of preparing data in a way that anticipates repurposing of that data for unknown audiences and undetermined usages. Repurposability is a crucial assumption of big data, but compliance can move resources from data use to data preparation. Most researchers and practitioners are not meant to be data producers (i.e., producing data for the sake of data) but data collectors, in situations where data fuel predefined objectives.
The Plight of Very Small and Slow Data
We argue that we will see an impact on hyperlocal and very small and slow-to-achieve-results projects amid an urgency to transition to big data and its accompanying data-driven science. These activities risk being transferred to what we call, for want of a better phrase, very small and slow data studies. We contend that very small and slow data is not necessarily subverted by big data but compelled to complement big data approaches.
This realignment occurs on numerous fronts, including creating expectations of having a “bigness” to one’s data set, which then represents the importance of a study, the data, and the rigor of its methods, as well as access to resources (e.g., funding set aside for big data–like studies).
Very small and slow data can be considered part of the process of qualitative social science methods, such as case studies, ethnographic reports, or biographical accounts. These data sets may well be normative, for example, exploring aspects of social justice. Very small and slow data is the size at which much of social science data is collected.
The data tend to be highly particularized and require lengthy time periods to collect because they supposedly offer nuanced reflections and deep topological relations, are embedded in historical and anthropological contexts, and, arguably, lie within human comprehension. According to Ballantyne, these types of studies should describe the messiness of what happens on the ground and be distilled into stories by which we explicate the data of our research.
87 Some argue that big data allows us to escape an era of scarce data so we can live in data-rich environments.
88 Another perspective is the promise that our meager data stores can describe rich environments. The appeal of these studies is that value can be found in the very noise that gets discarded from big data to achieve the signal. Very small and slow data can be the reasoning or speculation that occurs behind the key-value coding in many content analysis approaches, or it can be the thought processes of researchers and their subjects in determining their “choice as to what is most real.”
89 While we are supportive of a very small and slow approach to data collection, even when done digitally, we do not automatically advocate that the only good data would be the smallest and slowest. Numerous reasons preclude these types of studies (e.g., objectives of the study, resource constraints, or objections of participants). Our definition for very small and slow data is imperfect and subject to the very critiques we offered above. We choose the term as a provocation and simply question the drive to subsume all data to the assumptions embedded in a neologism.
We rely on the first equation, big data = f(volume, variety, velocity, and maybe veracity, viability, variability, value . . .), and the Vs to suggest some ways in which this shift is particularly toxic to very small and slow data.
Normative Positioning through Size
Very small and slow data brings the assumptions of volume into high relief in how big data normatively positions small-scale social science research. Because volume can be measured numerically and categorically (i.e., ordinally), it embeds a hierarchy. Bigger is better. Any hierarchical system or dichotomous pairing presupposes an ethic, either where one choice is instrumentally superior to another choice or where one “ought” to select one choice over another (i.e., there are right and wrong practices).
Smallness also reflects the type of data represented. Referring back to the introductory anecdote about Web 2.0 for community development, a small number of VGI observations could be considered flawed relative to a large number, in part because the observations are asserted and not emergent from experts.
By implication, very small and slow data sets would require strengthening, whether by imposing accuracy or, in a Wikipedia crowdsourced model, by precision. Only by this layering, this accretion of assertions, does one approach value. There may be no refinement, yet the quantity constructs validity. In an epistemology of big data in which ways of knowing are attached to large numbers of contributions that serve to triangulate each other, precision presumably fixes the mistakes.
A larger assumption concerns the way the size of big data (or small data as envisioned with very large data sets like a country’s census) convinces us that with volume, reach, and scalability we can obtain new insights. In comparison to big and small data, very small and slow data, unless we pool it, limits our ability to maximize insights.
Hardt exposes big data biases in his article, “How Big Data Is Unfair.” Hardt charts the methods by which big data can dilute minority views, which are statistically overwhelmed by the volume of majority opinions. Instead of reaching the long tail of public opinion, big data can result in a regression to a “white” mean that defuses minority voices while giving the appearance that minority voices are heard.
92 Just because anyone can participate on a social media platform does not mean everyone will participate. Contrary to the assumption that big data is neutral while very small and slow data is biased, both big data and the analytics offer a social mirror to our biases:
“As we’re on the cusp of using machine learning for rendering basically all kinds of consequential decisions about human beings in domains such as education, employment, advertising, health care and policing, it is important to understand why machine learning is not, by default, fair or just in any meaningful way.
This runs counter to the widespread misbelief that algorithmic decisions tend to be fair, because, ya know, math is about equations and not skin color.”
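The dilution that Hardt describes can be illustrated with deliberately simple, made-up numbers; this is a toy sketch and not Hardt's own analysis. Assume a population in which 950 majority records share one value and 50 minority records share another, and a single pooled estimate is fitted to everyone. The group sizes and values are hypothetical.

```python
from statistics import mean

# Hypothetical population: 95 percent majority, 5 percent minority.
majority = [1.0] * 950
minority = [5.0] * 50

# One estimate fitted to the pooled data, as a volume-driven analysis might do.
pooled_estimate = mean(majority + minority)

# How far that single estimate sits from each group's actual value.
majority_error = abs(pooled_estimate - mean(majority))
minority_error = abs(pooled_estimate - mean(minority))

print(f"pooled estimate: {pooled_estimate:.2f}")
print(f"error for majority: {majority_error:.2f}")
print(f"error for minority: {minority_error:.2f}")
```

The pooled estimate looks accurate on average because it sits close to the majority, while the minority is both included in the data and badly described by the result.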
Small data, as compared to big data, suffers from a lack of “real-time” velocity in both its creation and collection. Very small and slow data can allow us to contemplate the good things that come to those who wait. Often the waiting occurs whether researchers want it or not.
An important feature characterizing the collection of very small and slow data is the building of trust between the researcher and research subjects (not to be confused with algorithmically calculated trust used in many big data social network projects).
To gather very small and slow data from in-depth interviews, one must allow time to cultivate a personal relationship and trust. Interview respondent numbers may similarly be small in size with perhaps a few dozen respondents; interviews may have no or irregular periodicity (one-time interviews or a sequence over a number of months), possess weak relationality, and be limited in variety (e.g., only text transcripts).
Instead of assuming that very small and slow data is weak in insights, the information (insights) gathered from these kinds of data may be richly textured and supported by rigorous methods. Simultaneously, insights derived from slowness of certain methods may be incompatible with an instant-access age, in which data are constantly updated in a continuous stream. Numerous situations may require instant access. We may save lives because of the speediness afforded by citizens’ sensing of crises.
However, in conforming to the assumptions of big data we risk abandoning the slow study in preference to the speedy superficial.
Harmonizing the Smallest and Slowest Data
Finally, let us consider variety. Efforts needed to maintain the value of very small and slow data in a big data future may oblige researchers to ensure their data are linkable and scalable, as in the case of small data. The assumption underlying harmonization is that data gains value in its aggregation. The converse could be characterized as “a pixel unused is a pixel wasted.”
If the data exist only in the proprietary silo of a research report, then they fail to achieve their potential. Why should such data not be used again? As sharing and reuse, particularly digital repurposing, has become intrinsic to current research, the questions become argumentum ad hominem.
What does the objector have against the reuse of data, especially if that reuse generates new knowledge? The subtext is that the researcher is immoral for not attempting to wring more insights out of preexisting studies if insights can be accrued in combination with other data sets.
A clear expression of morality lies in the attribution of life-saving properties to linked data: “Examples of the power of linked data arise daily. In Britain, the Times picked up raw, linked data about bicycle accidents from DirectGov and published a mashup map showing where bicycle accidents had occurred, so cyclists could be aware of the many dangerous spots along the city’s roads.”
95 Very small and slow data, therefore, resembles this expression:
big data = n*(very small and slow data)
where n is the linkage threshold at which the data become legitimate in the epistemology of big data. However, very small and slow data allows us to examine the converse: “why shouldn’t we waste the pixel?” Certain data cannot and should not be repurposed. Sacred data exemplifies the conflicted nature of data sharing. Rundstrom wrote that in many indigenous cultures certain knowledge could be known only by a small number of people (e.g., elders).
96 Others had no rights to that knowledge.
Certain bands would accept the loss of indigenous knowledge if there was no incoming elder rather than allow that knowledge to be recorded. In another instance, an indigenous group would allow its sacred site to be destroyed rather than permit it to be mapped and potentially expose that knowledge to a broader public. The supposition that some data sets should be lost and not be repurposed or ever made public violates the ethos that all data should be available for linking.
When we conduct research with or about marginalized populations in very small and slow data studies, we frequently place our research in a critical context. These include positionality and subjectivity vis-à-vis the individuals we conduct research with or on (e.g., “One author is a white middle-class cis-gendered woman co-conducting research with indigenous peoples, who are actually a subset of a larger indigenous grouping”).
Technically a harmonized, linked-data approach can attach these details to the extracted data because of their polymorphism (e.g., a document file to an individual record). An initial linking, however, cannot guarantee the link is subsequently maintained. A linking also may exclude an ethics review. Indeed ethics may not permit a repurposing. Whether we are linking or pooling, we could lose much in the harmonization of very small and slow data. Perhaps certain data sets should not scale.
Conclusion
In this blog we moved from big data to small data to very small and slow data and back again. This structure allows us to meditate on the seesaw rhetoric of big data. Namely, big data is too big or fast to comprehend or to manage computationally. It fails to produce value as advertised, so we shrink the data to a manageable size.
Small and very small and slow data offer value through purpose-driven data collection, but they can be considered inconsequential for newer analytics, visualization, and, ultimately, insights. Consequently, we are urged to employ various aggregations to scale them to big data. But the resultant data may lose their context and become too big to comprehend. So we shrink the data; repeat the rhetoric as needed.
We do not argue against the value of any one size (i.e., the Vs suggested by big data) of data set over another. Instead, we argue that the hyperbole of big data permeates all data. Regardless of data size, the temptation is to position all data within the opportunities (the insights and the new valuations) offered by big data. Generalizing one size of data as being representative of all social science research misrepresents the nature of small and smaller data and the value of all sizes of data in the future of a discipline such as geography.
Even considering small data as a unary representation of geography issues a misapplied philosophical reduction to the discipline. Traditional unary representations and neologisms serve more to artificially obfuscate our work than elucidate the discipline’s future. We hoped to demonstrate this by the ironic coinage of our own term: very small and slow data.
We expect more numerous calls for very small and slow data to be repositioned as amenable to small data, which in turn is repositioned as a contributor to big data. We prefer the acceptance of diverse datasets and of varied approaches, one of which has some data never warehoused, shared, or linked. In this blog we sought to deflate some of the big data hubris being prematurely attached to small and smaller data, which likely will face difficulty in matching the exaggerated claims regarding the utility of resembling “big.”
We urge, along with many others, restraint in adopting such epistemologies for social science disciplines like geography, because they miss bigger issues that could derail the relevancy of big data in the future of social science research.
By adopting the pluralistic acceptance of different sizes of data in geography, we should strive to balance the critical and the opportunistic somewhat in the fashion of M. Graham and Shelton: “We believe that a broader conversation into the big data meme itself and the ways that it is able to redirect and displace attention, conversation, resources, and practices away from other pressing issues will not only allow us to avoid the most problematic implications of big data but also work toward a more productive integration of big data with existing research paradigms.”