What is Analytics?

Big Data, Analytics & Data Science: The Big, the Smart & the Sexy?

The Wild West

The growth of analytics, both in discussion and in practice, has coincided with the growing use of terms such as "big data" and "data science". But how are these concepts related? Are they synonyms for the same thing, complementary concepts, or entirely separate entities? This post attempts to address the question by briefly reviewing and comparing the literature on each.

This article is split into three parts: the first discussing big data, the second data science, and the third comparing each with analytics.

This is part one.

A definition of big data is seemingly more straightforward since, in the main, the term refers to what it describes, namely large datasets. However, as discussed on MIKE2.0, the open source information management community and framework, a dataset used in big data does not necessarily have to be particularly large in terms of the number of records it contains. Instead the ‘big’ aspect more accurately refers to complexity, such as "the number of useful permutations of sources making useful querying difficult (like the sensors in an aircraft) and complex interrelationships making purging difficult" (MIKE2.0, 2013).

Another perspective on big data, as promoted by IBM (2013), is to categorise it by four of its unique characteristics:

1. Volume: the scale of the data and the increased complexity of the datasets involved.

2. Velocity: the speed with which businesses seek to apply big data insights. Whilst not specifically referred to in the IBM framework, another perspective on this may be the velocity at which the data is generated, updated and dispersed.

3. Variety: big data typically includes many different forms of data, as opposed to only highly structured datasets such as financial or market research data. Unstructured datasets are common, such as user-generated content on social networks, videos, images and more.

4. Veracity: the final aspect refers to the reliability of the data. This is essentially a qualifying aspect. The argument goes that, whilst many datasets might be very large, complex and challenging, a dataset needs to be reliable and believed if it is to have value for a business. In other words, to be considered big data a dataset needs to be useful to a business (a prerequisite of which is that it is reliable), or else it is merely a large collection of useless information.

The development of big data can also be viewed from the perspective of the tools and techniques used to manage and manipulate it. In traditional business intelligence (BI) this has been through the creation of data warehouses. Data warehouses often represent the primary storage point for a business's information resources. To populate them, all the relevant business data needs to be collected and transformed into a uniform state via the process of Extract, Transform, Load (ETL). The data is then stored in a logical format, with rows representing records and columns representing pre-specified values (e.g. Massa and Testa, 2005; March and Hevner, 2007; Chaudhuri et al, 2011).
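As a rough illustration of the ETL process described above, the following Python sketch pulls rows from two differently formatted sources, converts them to one uniform schema, and loads them into a table with pre-specified columns. Everything in it is an assumption made for illustration: the two sample exports, the column names, the `orders` table, and the use of an in-memory SQLite database as a stand-in for a data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw exports from two source systems with inconsistent formats.
SALES_CSV = "order_id,amount,country\n1, 10.50 ,uk\n2,7.25,UK\n"
WEB_CSV = "id;total;region\n3;3,00;de\n"


def extract(text, delimiter):
    """Extract: read the raw rows of one source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform_sales(row):
    """Transform: map a sales-system row onto the uniform schema."""
    return (
        int(row["order_id"]),
        float(row["amount"].strip()),
        row["country"].strip().upper(),
    )


def transform_web(row):
    """Transform: the web export uses other column names and decimal commas."""
    return (
        int(row["id"]),
        float(row["total"].replace(",", ".")),
        row["region"].strip().upper(),
    )


def load(conn, records):
    """Load: insert uniform records into a table with pre-specified columns."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_etl():
    """Run the three steps against an in-memory database and return it."""
    conn = sqlite3.connect(":memory:")
    records = [transform_sales(r) for r in extract(SALES_CSV, ",")]
    records += [transform_web(r) for r in extract(WEB_CSV, ";")]
    load(conn, records)
    return conn
```

Once loaded, the data can be queried uniformly, for example with `run_etl().execute("SELECT country, SUM(amount) FROM orders GROUP BY country")`. The point of the sketch is that every source has to be reshaped to fit the same columns before it is stored.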

Whilst many of these techniques remain relevant in the majority of analytics applications, big data has introduced a variety of new challenges, not least data of such scale and complexity that loading and storing it in data warehouses becomes highly problematic. Meeting these challenges required a new approach, and therefore new technologies and techniques.

The most prominent of these are new technologies (such as NoSQL databases) and architectures, in particular Hadoop. In contrast to traditional BI architectures, data is not amalgamated into a single database and is instead stored across a distributed file system (DFS). This structure offers three significant benefits when running a query. Firstly, files can include additional metadata (i.e. ‘columns’ in a data warehouse) that may not occur in other files. Secondly, for data of these types and sizes, loading is considerably faster than with traditional ETL methods. Thirdly, and perhaps most importantly, this structure allows the user to process the large volumes of information associated with big data (e.g. Stonebraker, 2010).
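The idea of querying data where it is stored, rather than first amalgamating it into one database, can be sketched with a toy map and reduce step in plain Python. This is not Hadoop's API: the partitions, the record fields and the function names below are hypothetical, and threads merely stand in for the nodes of a cluster. Note that the records do not share the same fields, and that their structure is only interpreted when they are read.

```python
import json
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for a file block held on a
# different node. Records are raw JSON lines with differing fields.
PARTITIONS = [
    ['{"user": "a", "type": "post"}',
     '{"user": "b", "type": "image", "width": 640}'],
    ['{"user": "a", "type": "video", "seconds": 12}'],
    ['{"user": "c", "type": "post", "lang": "en"}',
     '{"user": "a", "type": "post"}'],
]


def map_partition(lines):
    """Map: parse records where they are stored and count them by type."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)  # structure is interpreted at read time
        counts[record.get("type", "unknown")] += 1
    return counts


def reduce_counts(partials):
    """Reduce: merge the per-partition counts into a single result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def count_by_type(partitions):
    """Process each partition independently, then combine the results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_counts(partials)
```

Calling `count_by_type(PARTITIONS)` counts three posts, one image and one video without any record having been transformed to a common schema or loaded into a central store beforehand.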

Whilst these technologies and techniques are becoming more commonplace, they are still in their infancy and still pose many significant issues, not least the loss of much of the control offered by traditional relational database management systems (RDBMS) and data warehouse approaches. This in itself can provide a suitable definition of big data. As traditional relational databases and data warehouses provide many benefits and easier control, a dataset or process that is deemed too large or complex for this architecture, and so requires tools such as Hadoop, can effectively be considered 'big data' (Jacobs, 2009). However, the proportion of data with potential value for businesses that is suitable to be stored and processed in relational databases is estimated to be less than 10% (Bizer et al, 2011).

In summary, big data has the potential to offer much value to businesses and to allow them to find new insights in the 90% of data that previous technologies and architectures left unused. However, the tools and approaches are in their infancy, and many benefits of traditional approaches are lost in these architectures. Whilst many suggest that "data warehouses are dead" (e.g. Howard, 2011; Woods, 2012), most businesses that already have such architectures in place will still see the benefit of these tools, although not all of the available data can be stored, nor all processes performed, within them.

Part two of this article will discuss the development of data science and data scientists, before the final part compares big data and data science with analytics.

 

REFERENCES

Chaudhuri S, Dayal U and Narasayya V (2011). 'An Overview of Business Intelligence Technology'. Communications of the ACM, 54: 88-98.

Howard P (2011). The EDW is dead, [Online]. Bloor Research. Available from: http://www.bloorresearch.com/blog/IM-Blog/2011/6/edw-dead.html, [accessed March 2013].

IBM (2013). What is Big Data?, [Online]. Available from: http://www-01.ibm.com/software/data/bigdata/, [accessed March 2013].

Jacobs A (2009). 'The Pathologies of Big Data'. Communications of the ACM, 52: 36-44.

Massa S and Testa S (2005). 'Data Warehouse-in-Practice: Exploring the Function of Expectations in Organizational Outcomes'. Information & Management, 42: 709-718.

March ST and Hevner AR (2007). 'Integrated Decision Support Systems: A Data Warehousing Perspective'. Decision Support Systems, 43: 1031-1043.

MIKE2.0 (2013). Defining Big Data, [Online]. Available from: http://mike2.openmethodology.org/wiki/Big_Data_Definition, [accessed March 2013].

Stonebraker M (2010). 'SQL Databases v. NoSQL Databases'. Communications of the ACM, 53: 10-11.

Woods D (2012). What Is a Data Scientist?: Michael O'Connell of TIBCO Spotfire, [Online]. Forbes. Available from: http://www.forbes.com/sites/danwoods/2012/01/25/what-is-a-data-scientist-michael-oconnell-of-tibco-spotfire/, [accessed March 2013].
