Big data

Lecture



Big data ([ˈbɪɡ ˈdeɪtə]), in information technology, is a set of approaches, tools and methods for processing structured and unstructured data of huge volume and significant diversity in order to obtain human-perceptible results. The approach is effective under conditions of continuous data growth and distribution across numerous nodes of a computing network; it took shape in the late 2000s as an alternative to traditional database management systems and business intelligence solutions [1] [2] [3] . The category includes means of massively parallel processing of indefinitely structured data, above all NoSQL solutions, MapReduce algorithms, and the software frameworks and libraries of the Hadoop project [4] .

The "three Vs" are cited as the defining characteristics of big data: volume (the sheer physical volume of data), velocity (both the rate of data growth and the need for high-speed processing and delivery of results), and variety (the simultaneous processing of different types of structured and semi-structured data) [5] [6] .

Content

  • 1 History
  • 2 Sources
  • 3 Analysis methods
  • 4 Technologies
    • 4.1 NoSQL
    • 4.2 MapReduce
    • 4.3 Hadoop
    • 4.4 R
    • 4.5 Hardware solutions
  • 5 Notes
  • 6 Literature
  • 7 Links

History

The introduction of the term "big data" is attributed to Clifford Lynch, editor of the journal Nature, who prepared a special issue dated September 3, 2008 on the theme "How can technologies that open up opportunities to work with large volumes of data influence the future of science?", collecting materials on the phenomenon of explosive growth in the volume and diversity of processed data and on the technological prospects of a likely leap "from quantity to quality". The term was proposed by analogy with the metaphors "big oil" and "big ore" current in the English-speaking business environment [7] [8] .

Although the term was introduced in an academic environment, where above all the problem of the growth and diversity of scientific data was at issue, since 2009 it has spread widely in the business press, and by 2010 the first products and solutions devoted exclusively and directly to the problem of big data processing had appeared. By 2011 most of the largest providers of information technology for organizations were using the concept of big data in their business strategies, including IBM [9] , Oracle [10] , Microsoft [11] , Hewlett-Packard [12] and EMC [13] , and the leading analysts of the information technology market had devoted dedicated research to the concept [5] [14] [15] [16] .

In 2011 Gartner listed big data as trend number two in information technology infrastructure (after virtualization, and as more significant than energy saving and monitoring) [17] . It is predicted that the introduction of big data technologies will have the greatest impact on information technology in manufacturing, health care, trade and public administration, as well as in other areas and industries where the movements of individual resources are recorded [18] .

Since 2013, big data has been studied as an academic subject in newly created university programs on data science [19] and computational science and engineering [20] .

Sources

Examples of sources of big data include [21] [22] : continuously arriving data from measuring devices, events from radio-frequency identification (RFID) tags, message streams from social networks, meteorological data, Earth remote-sensing data, data streams on the location of cellular-network subscribers, and devices for audio and video recording. The expansion and widespread adoption of these sources is expected to drive the penetration of big data technologies into research and development, the commercial sector, and public administration.


Analysis methods

Methods and techniques of analysis applicable to big data, as highlighted in the McKinsey report [23] :

  • Data Mining class methods: association rule learning, classification (methods of categorizing new data based on principles previously applied to already available data), cluster analysis, regression analysis;
  • crowdsourcing - categorization and enrichment of data by a broad, indefinite circle of people engaged on the basis of a public offer, without entering into an employment relationship;
  • data fusion and integration - a set of techniques for integrating heterogeneous data from a variety of sources to enable in-depth analysis; examples of such techniques include digital signal processing and natural language processing;
  • machine learning, including supervised and unsupervised learning, as well as ensemble learning - the use of models built on the basis of statistical analysis or machine learning to produce complex predictions from constituent base models (cf. the statistical ensemble in statistical mechanics);
  • artificial neural networks, network analysis, optimization, including genetic algorithms;
  • pattern recognition;
  • predictive analytics;
  • simulation modeling;
  • spatial analysis - a class of methods that use topological, geometric and geographical information in the data;
  • statistical analysis; A/B testing and time series analysis are given as examples of such methods;
  • visualization of analytical data - presentation of information in the form of pictures, diagrams and charts, using interactive features and animation, both for presenting results and as input for further analysis.
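Cluster analysis, one of the Data Mining methods listed above, can be illustrated with a minimal k-means sketch in pure Python. This is illustrative only; the function, the toy 2-D data set, and the parameter choices are assumptions of this sketch, not something taken from the McKinsey report:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: alternate assignment and centroid-update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k initial centroids from the data
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: (p[0] - centroids[i][0]) ** 2
                                  + (p[1] - centroids[i][1]) ** 2)
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two well-separated groups of 2-D points
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
        (5.0, 5.1), (5.2, 5.0), (5.1, 5.2)]
centroids, clusters = kmeans(data, k=2)
```

With well-separated groups like these, the algorithm converges in a few iterations to one centroid per group; on real big data workloads the same assignment/update idea is run in a distributed, parallel fashion.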

Technologies

The most frequently cited basic principle of big data processing is the SN architecture (shared-nothing architecture), which provides massively parallel processing that scales without degradation to hundreds or thousands of processing nodes [citation needed] . At the same time, McKinsey, besides the NoSQL, MapReduce, Hadoop and R technologies considered by most analysts, also includes business intelligence technologies and relational database management systems with SQL support in the context of applicability to big data processing [24] .
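The MapReduce model mentioned above can be sketched in a few lines of Python. The map, shuffle and reduce phases below simulate on one machine what a shared-nothing cluster distributes across many nodes; the function names and the tiny word-count data set are illustrative assumptions, not part of any real framework's API:

```python
from collections import defaultdict
from itertools import chain

def map_phase(shard):
    """Map: each 'node' independently turns its shard of text into (word, 1) pairs."""
    return [(word, 1) for line in shard for word in line.split()]

def shuffle(pairs):
    """Shuffle: group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values; for word count, simply sum them."""
    return {key: sum(values) for key, values in groups.items()}

shards = [["big data big"], ["data data"]]  # input partitioned across two "nodes"
intermediate = list(chain.from_iterable(map_phase(s) for s in shards))
result = reduce_phase(shuffle(intermediate))
```

Because each map call sees only its own shard and each reduce call sees only one key's values, nothing is shared between workers, which is exactly what lets the scheme scale out across nodes.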

NoSQL

MapReduce

Hadoop

R

Main article: R (programming language)

Hardware solutions

There are a number of hardware-software complexes that provide pre-configured solutions for big data processing: the Aster MapReduce appliance (Teradata), the Oracle Big Data appliance, and the Greenplum appliance (EMC, based on the solutions of the acquired company Greenplum). These complexes are supplied as ready-to-install telecommunication cabinets for data centers, containing a cluster of servers and control software for massively parallel processing.

Hardware solutions for in-memory analytical processing, in particular the HANA hardware-software complexes (a pre-configured solution from SAP) and Exalytics (a complex from Oracle based on the TimesTen relational DBMS and the multidimensional Essbase), are also sometimes classed as big data solutions [25] [26] , even though such processing is not initially massively parallel and the amount of RAM in a single node is limited to a few terabytes.

In addition, big data solutions sometimes also include hardware-software complexes based on traditional relational database management systems (Netezza, Teradata, Exadata), as these are capable of efficiently processing terabytes and exabytes of structured information and solving the problems of fast search and analytical processing of huge volumes of structured data. It is noted that the first massively parallel hardware-software solutions for processing extremely large amounts of data were the machines of Britton Lee, first released in 1983, and of Teradata (in production since 1984; in 1990 Teradata absorbed Britton Lee) [27] .

Hardware DAS solutions (storage systems attached directly to processing nodes) are also sometimes attributed to big data technologies, given the independence of processing nodes in an SN architecture. It is with the advent of the big data concept that the surge of interest in DAS solutions in the early 2010s is associated, after they had been supplanted in the 2000s by network solutions of the NAS and SAN classes [28] .

