Extract, transform, load (ETL)




In computing, extract, transform, load (ETL) refers to a process in database usage and especially in data warehousing. The ETL concept became popular in the 1970s. [1] Data extraction is where data is extracted from homogeneous or heterogeneous data sources; data transformation is where the data is converted into the proper format or structure for querying and analysis; data loading is where the data is loaded into the final target database, more specifically an operational data store, data mart, or data warehouse.

Since data extraction takes time, it is common to execute the three phases in parallel. While data is being extracted, another transformation process works on the data already received and prepares it for loading, and data loading begins without waiting for the completion of the previous phases.

ETL systems commonly integrate data from multiple applications (systems), typically developed and maintained by different vendors or hosted on separate computer hardware. The separate systems containing the source data are frequently managed and operated by different employees. For example, a cost accounting system may combine data from payroll, sales, and purchasing.

Contents

  • 1 Extract
  • 2 Transform
  • 3 Load
  • 4 Real-life ETL cycle
  • 5 Challenges
  • 6 Performance
  • 7 Parallel processing
  • 8 Rerunnability, recoverability
  • 9 Virtual ETL
  • 10 Dealing with keys
  • 11 Tools
  • 12 See also
  • 13 References

Extract

The first part of an ETL process involves extracting the data from the source system(s). In many respects, this is the most important aspect of ETL, since extracting the data correctly sets the stage for the success of subsequent processes. Most data warehousing projects combine data from different source systems. Each separate system may also use a different data organization and/or format. Common data source formats include relational databases, XML, and flat files, but may also include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even formats fetched from outside sources by means such as web spidering or screen scraping. Streaming the extracted data source and loading it on the fly into the target database is another way of performing ETL when no intermediate staging is required. In general, the extraction phase aims to convert the data into a single format appropriate for transformation processing.

An intrinsic part of the extraction involves data validation to confirm whether the data pulled from the sources has the correct or expected values in a given domain (such as a pattern/default or a list of values). If the data fails the validation rules, it is rejected entirely or in part. The rejected data is ideally reported back to the source system for further analysis to identify and rectify the incorrect records. In some cases, the extraction process itself may have to apply a data-validation rule in order to accept the data and pass it on to the next phase.
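As a rough illustration of extraction with validation, the sketch below pulls rows from a relational source and separates out records that fail a simple rule; the table and column names (customers, roll_no, age, salary) are hypothetical.

import sqlite3

def extract_customers(db_path):
    """Extract rows from a hypothetical 'customers' table and validate them.

    Rows with a missing or negative salary fail validation and are collected
    separately so they can be reported back to the source system.
    """
    accepted, rejected = [], []
    with sqlite3.connect(db_path) as conn:
        for roll_no, age, salary in conn.execute(
            "SELECT roll_no, age, salary FROM customers"
        ):
            if salary is None or salary < 0:
                rejected.append((roll_no, age, salary))
            else:
                accepted.append((roll_no, age, salary))
    return accepted, rejected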

Transform

In the data transformation stage, a series of rules or functions is applied to the extracted data in order to prepare it for loading into the end target. Some data does not require any transformation at all; such data is known as "direct move" or "pass through" data.

An important function of transformation is data cleansing, which aims to pass only "proper" data to the target. The challenge arises when different systems need to interface and communicate: character sets that are available in one system may not be available in others.

In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the server or data warehouse (a small sketch of a few of them follows the list):

  • Selecting only certain columns to load (or selecting null columns not to load). For example, if the source data has three columns (also called "attributes"), roll_no, age, and salary, then the selection may take only roll_no and salary. Or, the selection mechanism may ignore all those records where no salary is present (salary = null).
  • Translating coded values (for example, if the source system codes male as "1" and female as "2", but the warehouse codes male as "M" and female as "F")
  • Encoding free-form values (for example, mapping "Male" to "M")
  • Deriving a new calculated value (for example, sale_amount = qty * unit_price)
  • Sorting or ordering the data based on a list of columns to improve search performance
  • Joining data from multiple sources (for example, lookup, merge) and deduplicating the data
  • Aggregating (for example, rollup: summarizing multiple rows of data, such as total sales for each store and for each region)
  • Generating surrogate-key values
  • Transposing or pivoting (turning multiple columns into multiple rows or vice versa)
  • Splitting a column into multiple columns (for example, converting a comma-separated list, specified as a string in one column, into individual values in different columns)
  • Disaggregating repeating columns
  • Looking up and validating the relevant data from tables or referential files
  • Applying any form of data validation; failed validation may result in a full rejection of the data, partial rejection, or no rejection at all, and thus none, some, or all of the data is handed over to the next step depending on the rule design and exception handling; many of the above transformations may result in exceptions, for example, when a code translation parses an unknown code in the extracted data
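A minimal sketch of a few of these transformation types, assuming hypothetical input records of the form (roll_no, gender_code, qty, unit_price): it selects columns, translates coded values, derives a calculated value, and deduplicates on the business key.

GENDER_CODES = {"1": "M", "2": "F"}   # translation of coded values

def transform(records):
    """Yield cleaned records; inputs are (roll_no, gender_code, qty, unit_price)."""
    seen = set()
    for roll_no, gender_code, qty, unit_price in records:
        if roll_no in seen:                # deduplication on the business key
            continue
        seen.add(roll_no)
        yield {
            "roll_no": roll_no,                            # column selection
            "gender": GENDER_CODES.get(gender_code, "U"),  # unknown codes flagged
            "sale_amount": qty * unit_price,               # derived calculated value
        }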

Load

The load phase loads the data into the end target, which may be a simple delimited flat file or a data warehouse. Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating the extracted data is frequently done on a daily, weekly, or monthly basis. Other data warehouses (or even separate parts of the same data warehouse) may add new data in a historical form at regular intervals, for example, hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data. However, the entry of data for any one-year window is made in a historical manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded into the data warehouse.

As the load phase interacts with a database, the constraints defined in the database schema, as well as the triggers activated upon data load, apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data-quality performance of the ETL process.
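A rough sketch of the load phase against a hypothetical SQLite target table sales_fact; the schema constraints (primary key, NOT NULL) are enforced by the database as the rows are inserted.

import sqlite3

def load(db_path, rows):
    """Bulk-insert transformed (roll_no, gender, sale_amount) rows into the target."""
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales_fact ("
            " roll_no INTEGER PRIMARY KEY,"   # uniqueness enforced on load
            " gender TEXT NOT NULL,"          # mandatory field
            " sale_amount REAL)"
        )
        conn.executemany(
            "INSERT INTO sales_fact (roll_no, gender, sale_amount) VALUES (?, ?, ?)",
            rows,
        )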

  • For example, a financial institution might have information about a customer in several departments, and each department might have that customer's information listed in a different way. The membership department might list the customer by name, whereas the accounting department might list the customer by number. ETL can bundle all of these data elements and consolidate them into a uniform presentation, such as for storing in a database or data warehouse.
  • Another way that companies use ETL is to move information to another application permanently. For instance, the new application might use another database vendor and most likely a very different database schema. ETL can be used to transform the data into a format suitable for the new application to use.
  • An example would be an expense and cost recovery system (ECRS) such as those used by accountancies, consultancies, and law firms. The data usually ends up in a time and billing system, although some businesses may also utilize the raw data for employee productivity reports to Human Resources (personnel dept.) or equipment usage reports to Facilities Management.

Real-Life ETL Cycle

The typical real-life ETL cycle consists of the following execution steps (a sketch of such a cycle as a simple driver function follows the list):

  1. Cycle initiation
  2. Build reference data
  3. Extract (from sources)
  4. Validate
  5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
  6. Stage (load into staging tables, if used)
  7. Audit reports (for example, on compliance with business rules; also, in case of failure, helps to diagnose/repair)
  8. Publish (to target tables)
  9. Archive
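A minimal sketch of such a cycle as a driver function; the step functions (extract, validate, transform, stage, publish, archive) are placeholders supplied by the project, not a real API.

import logging
import uuid

def run_etl_cycle(sources, extract, validate, transform, stage, publish, archive):
    """One ETL cycle: extract -> validate -> transform -> stage -> publish -> archive."""
    run_id = uuid.uuid4().hex                   # identifies this cycle for auditing
    logging.info("ETL cycle %s started", run_id)
    try:
        raw = [row for src in sources for row in extract(src)]
        good, rejected = validate(raw)
        logging.info("cycle %s: %d rows rejected", run_id, len(rejected))  # audit report
        cleaned = [transform(row) for row in good]
        stage(cleaned)                          # load into staging tables, if used
        publish(cleaned)                        # publish to the target tables
    finally:
        archive(run_id)                         # archive the artifacts of this run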

Challenges

ETL processes can involve considerable complexity, and significant operational problems can arise from improperly designed ETL systems.

The range of data values or data quality in an operational system may exceed the expectations of designers at the time validation and transformation rules are specified. Data profiling of a source during data analysis can identify the data conditions that must be managed by transformation rule specifications, leading to an amendment of the validation rules explicitly and implicitly implemented in the ETL process.
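As an illustration of data profiling, a small sketch that summarizes each column of extracted records (null count, distinct count, min/max) so that unexpected data conditions can be fed back into the validation rules; rows are assumed to be dicts keyed by column name.

def profile(records, columns):
    """Summarize null counts, distinct counts, and value ranges per column."""
    stats = {c: {"nulls": 0, "values": set()} for c in columns}
    for row in records:
        for c in columns:
            value = row.get(c)
            if value is None:
                stats[c]["nulls"] += 1
            else:
                stats[c]["values"].add(value)
    return {
        c: {
            "nulls": s["nulls"],
            "distinct": len(s["values"]),
            "min": min(s["values"]) if s["values"] else None,
            "max": max(s["values"]) if s["values"] else None,
        }
        for c, s in stats.items()
    }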

Data warehouses are typically assembled from a variety of data sources with different formats and purposes. As such, ETL is a key process for bringing all the data together in a standard, homogeneous environment.

Design analysis should establish the scalability of an ETL system across the lifetime of its usage, including an understanding of the volumes of data that must be processed within service level agreements. The time available to extract from source systems may change, which may mean the same amount of data has to be processed in less time. Some ETL systems have to scale to process terabytes of data to update data warehouses with tens of terabytes of data. Increasing volumes of data may require designs that can scale from daily batch to multiple-day micro batch to integration with message queues or real-time change data capture for continuous transformation and update.

Performance

ETL vendors benchmark their record systems at multiple TB (terabytes) per hour (or ~1 GB per second) using powerful servers with multiple CPUs, multiple hard drives, multiple gigabit network connections, and plenty of memory.

In real life, the slowest part of an ETL process usually occurs in the database load phase. Databases may perform slowly because they have to take care of concurrency, integrity maintenance, and indexes. Thus, for better performance, it may make sense to employ:

  • Direct-path extract or bulk unload whenever possible (instead of querying the database) to reduce the load on the source system while getting a high-speed extract
  • Most of the transformation processing outside of the database
  • Bulk-load operations whenever possible

Still, even using bulk operations, database access is usually the bottleneck in the ETL process. Some common methods used to increase performance are listed below (a sketch of the drop-index/bulk-load/rebuild pattern follows the list):

  • Partition tables (and indices): try to keep partitions similar in size (watch for null values that can skew the partitioning)
  • Do all validation in the ETL layer before the load: disable integrity checking (disable constraint ...) in the target database tables during the load
  • Disable triggers (disable trigger ...) in the target database tables during the load: simulate their effect as a separate step
  • Generate IDs in the ETL layer (not in the database)
  • Drop the indices (on a table or partition) before the load, and recreate them after the load (SQL: drop index ...; create index ...)
  • Use parallel bulk load when possible: it works well when the table is partitioned or there are no indices (Note: attempting to do parallel loads into the same table (partition) usually causes locks, if not on the data rows then on the indices)
  • If a requirement exists to do insertions, updates, or deletions, find out which rows should be processed in which way in the ETL layer, and then process these three operations in the database separately; you can often do bulk load for inserts, but updates and deletes commonly go through an API (using SQL)
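A rough sketch of the drop-index/bulk-load/rebuild pattern referred to above, using SQLite for illustration; the table and index names are hypothetical, and a production warehouse would use the vendor's own bulk-load interface.

import sqlite3

def bulk_load(db_path, rows):
    """Drop the index, bulk-insert the rows, then rebuild the index once."""
    with sqlite3.connect(db_path) as conn:
        conn.execute("DROP INDEX IF EXISTS idx_sales_fact_gender")
        conn.executemany(
            "INSERT INTO sales_fact (roll_no, gender, sale_amount) VALUES (?, ?, ?)",
            rows,
        )
        # recreate the index once, after all rows are in place
        conn.execute("CREATE INDEX idx_sales_fact_gender ON sales_fact (gender)")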

Whether to do certain operations in the database or outside it may involve a trade-off. For example, removing duplicates using DISTINCT may be slow in the database; thus, it makes sense to do it outside. On the other hand, if using DISTINCT significantly (x100) decreases the number of rows to be extracted, then it makes sense to remove duplications as early as possible, in the database, before unloading the data.

A common source of problems in ETL is a big number of dependencies among ETL jobs. For example, job "B" cannot start while job "A" is not finished. One can usually achieve better performance by visualizing all processes on a graph and trying to reduce the graph, making maximal use of parallelism and making the "chains" of consecutive processing as short as possible. Again, partitioning of big tables and their indices can really help.
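A small sketch of running a set of dependent ETL jobs with as much parallelism as the dependency graph allows, using only the Python standard library; the job names and bodies are placeholders.

from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

def run_jobs(jobs, dependencies, max_workers=4):
    """jobs: {name: callable}; dependencies: {name: iterable of prerequisite names}."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()           # jobs whose prerequisites are finished
            futures = {pool.submit(jobs[name]): name for name in ready}
            for future, name in futures.items():
                future.result()                  # propagate failures
                sorter.done(name)                # unlock the jobs that depend on it

For example, run_jobs({"A": job_a, "B": job_b}, {"A": set(), "B": {"A"}}) runs "B" only after "A" has finished, while independent jobs run in parallel.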

Another common issue occurs when data is spread across several databases and processing is done in those databases sequentially. Sometimes database replication may be involved as a method of copying data between databases; this can significantly slow down the whole process. The common solution is to reduce the processing graph to only three layers:

  • Sources
  • Central ETL layer
  • Targets

This approach allows processing to take maximum advantage of parallelism. For example, if you need to load data into two databases, you can run the loads in parallel (instead of loading into the first and then replicating into the second).

Sometimes processing must take place sequentially. For example, dimensional (reference) data is needed before one can get and validate the rows for the main "fact" tables.

Parallel processing

A recent development in ETL software is the implementation of parallel processing. It has enabled a number of methods for improving the overall performance of ETL when dealing with large volumes of data.

ETL applications implement three basic types of parallelism:

  • Data: splitting a single sequential file into smaller data files to provide parallel access
  • Pipeline: allowing the simultaneous running of several components on the same data stream, for example, looking up a value on record 1 at the same time as adding two fields on record 2
  • Component: the simultaneous running of multiple processes on different data streams in the same job, for example, sorting one input file while removing duplicates on another file

All three types of parallelism usually operate combined in a single job; a simple sketch of the data-parallel case follows.
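A minimal sketch of the data-parallel case, assuming the input has already been split into partition files and a hypothetical process_partition function does the per-file work:

from concurrent.futures import ProcessPoolExecutor

def process_partition(path):
    """Placeholder per-partition work; here it just counts the rows in one file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)

def process_all(partition_paths):
    """Run the per-partition work concurrently across CPU cores."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(process_partition, partition_paths))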

An additional difficulty comes with making sure that the data being uploaded is relatively consistent. Because multiple source databases may have different update cycles (some may be updated every few minutes, while others may take days or weeks), an ETL system may be required to hold back certain data until all sources are synchronized. Likewise, where a warehouse may have to be reconciled with the contents of a source system or with the general ledger, establishing synchronization and reconciliation points becomes necessary.

Rerunnability, Recoverability

Data warehousing procedures usually subdivide a big ETL process into smaller pieces running sequentially or in parallel. In order to keep track of data flows, it makes sense to tag each data row with a "row_id" and tag each piece of the process with a "run_id". In case of a failure, having these IDs helps to roll back and rerun the failed piece.

Best practice also calls for checkpoints, which are states reached when certain phases of the process are completed. Once at a checkpoint, it is a good idea to write everything to disk, clean out some temporary files, log the state, and so on.
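A rough sketch of both ideas: tagging rows with a run_id and row_id, and appending a checkpoint record after each completed phase; the checkpoint file name and format are assumptions for illustration.

import json

def tag_rows(rows, run_id):
    """Attach the run_id and a sequential row_id to every data row (rows are dicts)."""
    return [
        dict(row, run_id=run_id, row_id=i)
        for i, row in enumerate(rows, start=1)
    ]

def checkpoint(run_id, phase, path="etl_checkpoint.jsonl"):
    """Record that a phase of this run completed, so a rerun can resume after it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"run_id": run_id, "phase": phase}) + "\n")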

Virtual ETL

As of 2010, data virtualization had begun to advance ETL processing. The application of data virtualization to ETL allowed solving the most common ETL tasks of data migration and application integration for multiple dispersed data sources. Virtual ETL operates with an abstracted representation of the objects or entities gathered from a variety of relational, semi-structured, and unstructured data sources. ETL tools can leverage object-oriented modeling and work with entities' representations persistently stored in a centrally located hub-and-spoke architecture. Such a collection, which contains representations of the entities or objects gathered from the data sources for ETL processing, is called a metadata repository, and it can reside in memory [2] or be made persistent. By using a persistent metadata repository, ETL tools can transition from one-time projects to persistent middleware, performing data harmonization and data profiling consistently and in near-real time.

Dealing with keys

Unique keys play an important part in all relational databases, as they tie everything together. A unique key is a column that identifies a given entity, whereas a foreign key is a column in another table that refers to a primary key. Keys can comprise several columns, in which case they are composite keys. In many cases, the primary key is an auto-generated integer that has no meaning for the business entity being represented, but solely exists for the purposes of the relational database; this is commonly referred to as a surrogate key.

As there is usually more than one data source being loaded into the warehouse, the keys are an important concern to be addressed. For example: customers might be represented in several data sources, with their Social Security number as the primary key in one source, their phone number in another, and a surrogate in the third. Yet a data warehouse may require the consolidation of all the customer information into one dimension.

A recommended way to deal with this concern involves adding a warehouse surrogate key, which is used as a foreign key from the fact table. [3]

Usually, updates occur to a dimension's source data, which obviously must be reflected in the data warehouse.

If the primary key of the source data is required for reporting, the dimension already contains that piece of information for each row. If the source data uses a surrogate key, the warehouse must keep track of it even though it is never used in queries or reports; this is done by creating a lookup table that contains the warehouse surrogate key and the originating key. [4] This way, the dimension is not polluted with surrogates from various source systems, while the ability to update is preserved.

The lookup table is used in different ways depending on the nature of the source data. There are 5 types to consider; [5] three are included here (a sketch of the fully logged variant follows the descriptions):

Type 1

The dimension row is simply updated to match the current state of the source system; the warehouse does not capture history; the lookup table is used to identify the dimension row to update or overwrite

Type 2

A new dimension row is added with the new state of the source system; a new surrogate key is assigned; the source key is no longer unique in the lookup table

Fully logged

A new dimension row is added with the new state of the source system, while the previous dimension row is updated to reflect that it is no longer active, along with its time of deactivation.
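A small sketch of the fully logged variant against in-memory structures: lookup maps (source_system, source_key) to the current warehouse surrogate key, and dimension maps surrogate keys to attribute dicts; both structures and their field names are assumptions for illustration.

from datetime import datetime, timezone

def fully_logged_update(lookup, dimension, source_system, source_key,
                        new_attributes, new_surrogate):
    """Add a new dimension row and deactivate the previous one, recording the time."""
    now = datetime.now(timezone.utc)
    old_surrogate = lookup.get((source_system, source_key))
    if old_surrogate is not None:
        dimension[old_surrogate]["active"] = False          # previous row is closed
        dimension[old_surrogate]["deactivated_at"] = now
    dimension[new_surrogate] = dict(new_attributes, active=True, loaded_at=now)
    lookup[(source_system, source_key)] = new_surrogate     # point at the newest row
    return new_surrogate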

Tools

By using an established ETL framework, one may increase one's chances of ending up with better connectivity and scalability. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into Enterprise Application Integration, or even Enterprise Service Bus, systems that now cover much more than just the extraction, transformation, and loading of data. Many ETL vendors now have data profiling, data quality, and metadata capabilities. A common use case for ETL tools includes converting CSV files to formats readable by relational databases. A typical translation of millions of records is facilitated by ETL tools that enable users to input CSV-like data feeds/files and import them into a database with as little code as possible.
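As an illustration of that last point, a minimal way to import a CSV feed into a relational table with very little code, here using pandas and SQLite as one possible choice of tools; the file and table names are hypothetical.

import sqlite3
import pandas as pd

def csv_to_table(csv_path, db_path, table_name):
    """Read a CSV feed and append it to a relational table in one step."""
    frame = pd.read_csv(csv_path)                  # parse the CSV feed
    with sqlite3.connect(db_path) as conn:
        frame.to_sql(table_name, conn, if_exists="append", index=False)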

ETL tools are typically used by a broad range of professionals: from students in computer science looking to quickly import large data sets, to database architects in charge of company account management, ETL tools have become a convenient tool that can be relied upon to get maximum performance. In most cases, ETL tools contain a GUI that helps users conveniently transform data, using a visual data mapper, as opposed to writing large programs to parse files and modify data types.

While ETL tools have traditionally been for developers and IT staff, the new trend is to provide these capabilities to business users so that they can themselves create connections and data integrations when needed, rather than going to the IT staff. [6] Gartner refers to these non-technical users as Citizen Integrators. [7]

See also

  • Architecture patterns (EA reference architecture)
  • Create, read, update and delete (CRUD)
  • Data cleansing
  • Data integration
  • Data mart
  • Data mediation
  • Data migration
  • Electronic data interchange (EDI)
  • Enterprise architecture
  • Expense and cost recovery system (ECRS)
  • Hartmann pipeline
  • Legal Electronic Data Exchange Standard (LEDES)
  • Metadata discovery
  • Online analytical processing
  • Spatial ETL