Big Data Systems Architecture E-Learning (Week 1)

To lay out the difference between Data Science and Big Data, it is best to start by defining each term in its own right.

Data Science (also called “data-driven science”) refers to an interdisciplinary synthesis of methods, processes, and systems for extracting knowledge from structured or unstructured data. It unifies statistics, mathematics, programming, and related techniques to understand and explain phenomena with data. In short, Data Science is “an umbrella term for techniques used when trying to extract insights and information from data” (Monnappa, 2017).

Big Data, on the other hand, refers to a holistic strategy for managing massive volumes of data on a day-to-day basis, using both traditional and contemporary applications. Big Data is commonly characterized by four V’s: Volume (the amount of data); Velocity (the rate at which data is received and acted on); Variety (new, largely unstructured data types); and Value (the inherent worth of the data, which must first be discovered) (Oracle, 2016).

Activity 1 - Differences in Evolution

Explain the differences between the history of data science and the history of big data. Also share your thoughts on the difference between Traditional Data Analysis and Data Science.

History of Data Science

Big Data 5Vs – Volume, Velocity, Variety, Veracity, and Value

The term Data Science (originally also used to denote Computer Science, or Datalogy) has been in use since the 1960s. In 1974, Danish computer scientist Peter Naur published Concise Survey of Computer Methods, a survey of contemporary data processing methods used in a wide range of applications. The book is organized around the idea of data as representations of facts or ideas, in a formalized manner capable of being communicated or manipulated by some process; once the science of dealing with data is established, the relation of the data to what they represent is delegated to other fields and sciences (Press, 2012).

In 1996, in Tokyo, the International Federation of Classification Societies (IFCS) included the term “Data Science” in the title of its conference. The aim of these classification societies is to support the study of the principles and practice of classification across a wide array of disciplines, including research on data and statistical analysis and on systematic methods of classification and clustering.

The launch of the journal Data Mining and Knowledge Discovery in 1997 underscored the need for tools and techniques to aid in automated data analysis: advances in data gathering, storage, and distribution had outpaced our ability to make sense of the resulting data. At this point there was rapidly growing interest in research and applications spanning statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing, and high-performance computing. Data Mining and Knowledge Discovery in Databases (KDD) is concerned with the scalability of the discovery process for extracting patterns and models from raw data stores, including data cleaning, noise modelling, and making discovered patterns comprehensible.

In 2001, William S. Cleveland, an American statistician known for his work on data visualization, published “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.” The publication proposed enlarging the technical work of the field of statistics; the plan was so ambitious, and implied such substantial changes, that the altered field warranted a new name, “Data Science.” Cleveland placed the proposed discipline in the context of computer science and contemporary work in data mining, advocating that statisticians look to computing for knowledge just as the field had looked to mathematics before; a merger of the two knowledge bases would produce a powerful force for innovation and for advances in working with data.

In 2002, the Committee on Data for Science and Technology (CODATA) started publishing the Data Science Journal, a publication focused on issues surrounding data science such as the description of data systems, their publication on the internet, applications, and legal issues. In 2003, Columbia University began publishing The Journal of Data Science, which serves as a platform for all data workers to present their views and exchange ideas.

In 2009, the National Science and Technology Council in Washington, DC, published the report Harnessing the Power of Digital Data for Science and Society, which describes the emergence of a new type of data science and management expert who receives little recognition yet is vital to the current and future success of the scientific enterprise. The report names new “information disciplines” such as digital curators, digital archivists, and data scientists.

In 2012, the Harvard Business Review article “Data Scientist: The Sexiest Job of the 21st Century” claimed that data scientists are “a new breed” and that their shortage is becoming a serious constraint in some sectors.

But for all the spread of Data Science in business environments, many academics and journalists see no distinction between data science and statistics. Gil Press wrote in Forbes that Data Science is “a buzzword without a clear definition” that has simply replaced “business analytics” in contexts such as graduate degree programs. And in his 2013 keynote address at the Joint Statistical Meetings of the American Statistical Association, the American statistician Nate Silver watered the term down further, saying that “data-scientist is a sexed up term for a statistician.”

History of Big Data

Big Data Storymap Infographic (by Bill Schmarzo, emc.com)

In 1944, Fremont Rider, an American librarian, published “The Scholar and the Future of the Research Library. A Problem and Its Solution,” in which he estimated that American university libraries were doubling in size every sixteen years. At that rate, he speculated, the Yale Library would eventually hold approximately 200,000,000 volumes, occupying over 6,000 miles of shelves and requiring a cataloging staff of more than six thousand people.

In 1961, Derek Price, credited as the father of scientometrics, published Science Since Babylon, a humanistic account of science. In it, Price charts the growth in the number of scientific journals and papers, concluding that the number of journals has grown exponentially, doubling every fifteen years and increasing by a factor of ten every fifty years.

To cope with this growth, Harry J. Gray and Henry Ruston published “On Techniques for Coping with the Information Explosion” in 1964. Their proposed solution to the problem of how to read everything being published was that only “short” papers be published, where “short” means no more than 2,500 characters, counting spaces and punctuation marks. An important after-effect of the suggested practice, they noted, would be a reduced burden on personnel selection committees.

In 1967, B. A. Marron and P. A. D. de Maine published “Automatic data compression,” arguing that the “information explosion” of recent years makes it essential that storage requirements for all information be kept to a minimum.

Meanwhile, in 1975, Japan’s Ministry of Posts and Telecommunications conducted the Information Flow Census, tracking the volume of information circulating in the country. The census introduced “amount of words” as a unifying unit of measurement across all communication channels. In 1978, the ministry reported that “the demand for information provided by mass media, which are one-way communication, has become stagnant,” while demand for information provided by personal telecommunications media, characterized by two-way communication, had drastically increased. The conclusion drawn was that priority had shifted toward segmented, detailed information meeting individual needs and away from conventional, mass-reproduced, uniform information.

In April 1980, I. A. Tjomsland gave a talk titled “Where Do We Go From Here?” in which he observed that those associated with storage devices long ago realized that Parkinson’s First Law may be paraphrased to describe the industry: “Data expands to fill the space available.” He believed that large amounts of data are retained because we have no way of knowing which data are already obsolete, and because the penalties for storing obsolete data are less apparent than the penalties for discarding potentially useful data.

In 1983, Ithiel de Sola Pool published “Tracking the Flow of Information,” which echoes the findings of Japan’s information flow census. Examining growth trends in 17 major communications media from 1960 to 1977, the study found that words made available to Americans over the age of 10 through these media grew at a rate of 8.9 percent per year. For most of the observation period, the growth in the flow of information was driven by the advance of broadcasting, but toward the end of the period the situation had changed: point-to-point media, such as telephone calls, were growing faster than broadcasting.

By 1996, technological advances had made digital storage more cost-effective than paper for storing data. In “The Evolution of Storage Systems,” R. J. T. Morris and B. J. Truskowski explain how storage systems have evolved over the decades to meet changing customer needs, and discuss how this trend has led to the development of autonomic storage.

In 1997, Michael Cox and David Ellsworth published “Application-controlled demand paging for out-of-core visualization,” which notes that visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. They call this the problem of big data; when data sets do not fit in main memory (in core) or on local disk, the most common solution is to acquire more resources. This is the first article in the ACM Digital Library to use the term “big data.”

By 1998, the internet was fast becoming a household necessity. The publication “The Size and Growth Rate of the Internet” found that the growth rate of internet traffic, even though lower than often cited, was still about 100% per year, much higher than for traffic on other networks. If those growth trends persisted, data traffic in the US alone would overtake voice traffic around the year 2002 and would be dominated by the internet (Coffman et al., 1998).

In 1999, the Communications of the ACM article “Visually exploring gigabyte data sets in real time” stated that powerful computers are a blessing to many fields of inquiry, yet also a curse: fast computations spew out massive amounts of data, and understanding the data that result from high-end computing is a significant effort. Although it is plainly difficult to look at all the numbers, the purpose of this computing is insight (Bryson et al., 1999).

In October 2000, Peter Lyman and Hal R. Varian at UC Berkeley published “How Much Information?” The study estimated that in 1999 the world produced roughly 1.5 exabytes of unique information, a phenomenon the authors call the “democratization of data.” This newly created information, flowing through four channels (telephone, radio, television, and the internet), is mostly born digital and is growing rapidly.

In 2001, Doug Laney, a Meta Group analyst, published “3D Data Management: Controlling Data Volume, Velocity, and Variety.” This research lays out the “3Vs” that have become the accepted defining dimensions of Big Data: Volume, Velocity, and Variety. Veracity and Value were later added to form Big Data’s 5Vs.

In 2005, the advent of Web 2.0 made data management a core competency of web enterprises. Tim O’Reilly, founder of O’Reilly Media, published “What Is Web 2.0,” claiming that “data is the next Intel Inside.” O’Reilly suggests that Web 2.0 applications are ‘infoware’ rather than just software.

In 2008, R. E. Bryant, R. Katz, and E. Lazowska published “Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society.” The authors argue that big-data computing may well be the biggest innovation in computing in the last decade, and that we have only begun to see its potential for collecting, organizing, and processing data. Kenneth Cukier likewise observes that the world contains a vast and rapidly growing amount of digital information, whose effects are felt everywhere, from business to science and from government to the arts; scientists and computer engineers have coined the term “Big Data” for this phenomenon (Cukier, 2010).

By 2015, big data had become a business operations tool, helping employees work more efficiently and streamlining the collection and distribution of Information Technology (IT) services. With big data principles applied, IT teams can predict potential issues and provide solutions before problems even occur. IT Operations Analytics (ITOA) has played a major role in systems management by providing platforms that aggregate data silos and generate insights from the system as a whole rather than from isolated pockets of data.

Big Data has yet to hit a brick wall; data are still being collected, saved, and stored endlessly. But experts have identified three major concerns with this continuous growth: data silos, a lack of data scientists, and a lack of communication.

The Difference

Although the histories of Data Science and Big Data both stem from the need to handle information, there are clear differences between them. The history of Data Science traces the evolution of an interdisciplinary field that applies a variety of techniques to massive data sets to gain value and insight from them. The history of Big Data, by contrast, traces the growing awareness of ever-increasing volumes of data and of the means to accommodate the complex information being generated constantly.

Traditional Data Analysis vs Data Science

The collection and analysis of data have become very important to the business world. Many companies rely on data analysts to identify actionable trends in their data and, more importantly, to turn that knowledge into executable business strategies. Although Traditional Data Analysis and Data Science may appear alike, they have distinguishing factors.

  • Descriptive vs. Predictive Analytics
    Data analysts employ descriptive and exploratory methods of analysis. These methods reveal performance results and patterns that can be linked to trends and issues within the collected data; analysts tend to focus on current and past data to produce reports or solutions for the problems disclosed within the data set. Data scientists, however, practice predictive and prescriptive analysis, forecasting emerging trends and recommending actions to optimize business goals. Constructing these predictions requires assumptions that analysts do not employ when interpreting data.
  • Data Architecture
    Traditional data analysis tends to use a centralized database architecture, in which a single computer system is used to solve problems. Data Science may employ a distributed database architecture, in which large blocks of information are spliced into smaller pieces and processed by different computers in a network; this distribution provides a more efficient way of managing large data sets (see the sketch after this list).
  • Types of Data
    Traditional data analysts rely on structured (“clean”) data, which tends to be numerical information such as web visits, satisfaction ratings, and other measurable metrics. The advantage of such structure is the comparative ease of compiling, storing, and organizing measurable data sets. Unstructured or “dirty” data consists of qualitative rather than quantitative information and comes from less tractable sources such as emails or engagement across social media, to name a few. Processing such information involves probabilistic and statistical algorithms that turn what is learned into applications for machine learning or artificial intelligence (AI).
  • Accuracy and Confidence
    Since traditional data analysis infrastructure is more expensive to maintain, not all of the data can be stored, which presents challenges for the accuracy of, and confidence in, the analysis. Data Science employs methods for storing voluminous data efficiently and can therefore provide more accurate results.
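
To make the distributed-architecture point above concrete, here is a minimal Python sketch (not part of the original course material): a large data set is spliced into smaller blocks, each block is processed independently by a worker, and the partial results are combined afterwards. The synthetic records, the chunk size, and the function names are illustrative assumptions only, and local worker processes merely stand in for the separate machines of a real distributed database.

# Minimal sketch only: records, chunk size, and worker count are made up,
# and local processes stand in for separate machines on a network.
from multiprocessing import Pool


def chunked(records, size):
    """Splice a large data set into smaller blocks."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


def partial_sum(block):
    """Work each worker performs independently on its own block."""
    return sum(block)


if __name__ == "__main__":
    records = list(range(1_000_000))           # stand-in for a large data set
    blocks = list(chunked(records, 100_000))   # ten smaller blocks

    with Pool(processes=4) as pool:            # workers play the role of networked machines
        partials = pool.map(partial_sum, blocks)

    print(sum(partials))                       # combine the partial results: 499999500000

Distributed frameworks such as Hadoop MapReduce or Apache Spark apply this same split, process, and combine pattern, but across a cluster of machines rather than local processes.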

 


Bibliography:
