The concept of a data warehouse. Differences between data warehouses and databases

Lecture



A data warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data designed to support decision-making by various groups of users.

Because the data warehouse is subject-oriented, its organization is aimed at meaningful analysis of information rather than at automating business processes. This property determines the warehouse architecture and the data model design principles, which differ from those used in operational systems.

A data warehouse is built on a client-server architecture, a relational DBMS, and decision support tools.

Third-party software products are added to the data warehouse to build models based on intelligent rather than purely statistical data analysis and to uncover hidden patterns. These products draw on genetic algorithms, neural networks, nonlinear dynamics, clustering, and hybrid systems, which together give quite a large number of technologies for building models on top of the warehouse. They are needed when the volume of data is so large that direct search and statistical methods of analysis do not produce a result.

The main properties of a data warehouse:

1. Subject orientation

Operational databases contain megabytes of information that is not needed for analysis at all (addresses, postal codes, record identifiers, etc.). Such information is not stored in the data warehouse, which narrows the data considered in decision-making down to the necessary minimum.

Decision-making requires a strictly defined set of data, which is extracted from the operational database into the data warehouse; minor, non-essential attributes are dropped.
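The selection described above can be sketched as a simple projection. This is only an illustration: the field names, the `ANALYSIS_FIELDS` tuple, and the `to_warehouse_row` helper are assumptions made for the example, not part of any particular product.

```python
# Hypothetical list of attributes that analysts actually need.
ANALYSIS_FIELDS = ("region", "product", "amount", "sale_date")


def to_warehouse_row(operational_record: dict) -> dict:
    """Keep only the attributes needed for analysis; drop everything else."""
    return {field: operational_record[field] for field in ANALYSIS_FIELDS}


# Illustrative operational record with attributes that are useless for analysis.
order = {
    "order_id": 98231,
    "region": "north",
    "product": "laptop",
    "amount": 1200.0,
    "sale_date": "2023-04-05",
    "address": "1 Example Street",
    "postal_code": "00000",
}

row = to_warehouse_row(order)
print(row)
```

The address, postal code, and record identifier stay in the operational system; only the analysis attributes reach the warehouse.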

2. Integration (integrity and internal interconnection)

Although the data is loaded from various sources, it is unified by common naming conventions, common units of measurement for attributes, and so on. This matters a great deal for corporations in which computing systems of different architectures operate side by side and represent the same data in different ways. For example, several different date formats may be in use, or the same indicator may go by different names. During loading, such inconsistencies are eliminated automatically. This is the most time-consuming part of building a data warehouse.
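A minimal sketch of such unification during loading is given below. It assumes two hypothetical source systems, "crm" and "erp", each with its own date format and its own name for the same indicator; the `SOURCE_RULES` table and the `unify` function are illustrative names only.

```python
from datetime import datetime

# Hypothetical per-source rules: how each system writes dates and
# what it calls the indicator that the warehouse stores as "revenue".
SOURCE_RULES = {
    "crm": {"date_format": "%d.%m.%Y", "rename": {"turnover": "revenue"}},
    "erp": {"date_format": "%m/%d/%Y", "rename": {"sales_total": "revenue"}},
}


def unify(source: str, record: dict) -> dict:
    """Bring one source record to the common names and ISO date format."""
    rules = SOURCE_RULES[source]
    unified = {}
    for key, value in record.items():
        name = rules["rename"].get(key, key)
        if name == "date":
            # Parse with the source's own format, store as ISO 8601.
            value = datetime.strptime(value, rules["date_format"]).date().isoformat()
        unified[name] = value
    return unified


crm_row = unify("crm", {"date": "05.04.2023", "turnover": 100.0})
erp_row = unify("erp", {"date": "04/05/2023", "sales_total": 250.0})
print(crm_row)
print(erp_row)
```

Declaring the format per source, rather than guessing it from the value, avoids ambiguity between day-first and month-first dates.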

3. Time variance (binding to time)

Operational systems cover a short time interval, which is achieved through periodic archiving of data. A data warehouse, by contrast, contains data accumulated over a long time interval (roughly five to ten years).
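One common way to keep such history is to stamp every loaded row with the date it describes. The sketch below uses Python's built-in `sqlite3` module purely for illustration; the `stock_snapshot` table and its columns are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_snapshot (snapshot_date TEXT, product TEXT, quantity INTEGER)"
)

# Each load adds a new dated snapshot instead of overwriting the previous one.
conn.executemany(
    "INSERT INTO stock_snapshot VALUES (?, ?, ?)",
    [
        ("2021-12-31", "laptop", 40),
        ("2022-12-31", "laptop", 55),
        ("2023-12-31", "laptop", 30),
    ],
)

# The whole history stays available for trend analysis.
history = conn.execute(
    "SELECT snapshot_date, quantity FROM stock_snapshot "
    "WHERE product = ? ORDER BY snapshot_date",
    ("laptop",),
).fetchall()
print(history)
```

An operational system would typically hold only the latest quantity; here every past state remains queryable.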

4. Non-volatility (an immutable data set)

Data that has been loaded into the warehouse is not modified, since modification could violate its integrity.
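As an illustration of this rule, the sketch below makes a table effectively read-only after loading by using SQLite triggers that reject updates and deletes. The `sales_fact` table, the trigger names, and the `try_update` helper are assumptions for the example; real warehouses usually enforce this through permissions and load procedures rather than triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL);

    -- Reject any attempt to change or remove loaded rows.
    CREATE TRIGGER sales_fact_no_update BEFORE UPDATE ON sales_fact
    BEGIN SELECT RAISE(ABORT, 'warehouse rows are read-only'); END;

    CREATE TRIGGER sales_fact_no_delete BEFORE DELETE ON sales_fact
    BEGIN SELECT RAISE(ABORT, 'warehouse rows are read-only'); END;
    """
)

# Loading new rows is still allowed.
conn.execute("INSERT INTO sales_fact VALUES ('2023-04-05', 'laptop', 1200.0)")


def try_update() -> bool:
    """Return True if an UPDATE succeeds, False if the trigger blocks it."""
    try:
        conn.execute("UPDATE sales_fact SET amount = 0")
        return True
    except sqlite3.Error:
        return False


update_allowed = try_update()
amount = conn.execute("SELECT amount FROM sales_fact").fetchone()[0]
print(update_allowed, amount)
```

The insert goes through, while the update is rejected and the stored amount is left unchanged.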

Differences:

If the database is small and highly specialized, and there is a qualified programmer who can write non-standard queries that collect the data into an array and analyze it, then a regular database can be used instead of a data warehouse. A data warehouse, however, is not intended for programmers: an analyst, a manager, or anyone else without the skills to write complex queries should be able to use it.

Disadvantages of using an operational database for decision support:

· Invalid data;

· Poor performance with non-standard queries;

· Heterogeneous data cannot be reconciled, since it often lacks time stamps;

· Problems with the preparation of reports arise from the fact that:

- it is difficult to understand where the data needed for analysis and decision making is located;

- most databases are focused only on standard queries;

- programmers have to be involved to run non-standard queries.

Data Warehouse Features:

- Data warehouses contain information collected from several operational databases.

- A data warehouse is, as a rule, an order of magnitude larger than the operational databases, often ranging in size from hundreds of gigabytes to several terabytes.

- As a rule, the data warehouse is maintained independently of the operational databases of the organization, since the requirements for functionality and performance of analytical applications differ from the requirements for transaction systems.

- Data warehouses are created specifically for decision support applications and provide data that is accumulated over time, summarized, and consolidated, which suits analysis better than detailed individual records.

- The workload consists of non-standard, complex queries that access millions of records and perform a huge number of scan, join, and aggregation operations. In this case query response time matters more than transaction throughput.
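The difference in workload can be illustrated with two queries against the same toy table, again using Python's built-in `sqlite3` module. The `sales` table and its contents are invented for the example; a real warehouse query would scan millions of rows rather than five.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", 2022, 100.0),
        ("north", 2022, 50.0),
        ("north", 2023, 70.0),
        ("south", 2022, 30.0),
        ("south", 2023, 90.0),
    ],
)

# Operational style: fetch one specific record.
single = conn.execute("SELECT amount FROM sales WHERE rowid = ?", (1,)).fetchone()

# Analytical style: scan the whole table and aggregate it.
totals = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(single)
print(totals)
```

The first query touches a single row, while the second has to read every row to produce its summary, which is the kind of access a data warehouse is designed for.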
