Processor cache memory: levels 1, 2, and 3

Lecture



Cache [1] [2] [3] [4] [5] [6] (English cache, from French cacher, "to hide"; pronounced [kæʃ]) - an intermediate fast-access buffer that holds the information most likely to be requested. Access to data in the cache is faster than fetching the source data from slower memory or from a remote source, but its capacity is significantly smaller than that of the source data store.

History

The word "cache" was first used in a computing context in 1967, during preparation of an article for the IBM Systems Journal. The article concerned memory improvements in the upcoming Model 85 of the IBM System/360 series. The journal's editor, Lyle Johnson, asked for a more descriptive term than "high-speed buffer"; for lack of better ideas, he himself suggested the word "cache". The article was published in early 1968, the authors were honored by IBM, their work was widely adopted and later improved, and "cache" soon became a standard term in the computer literature. [7]

Functioning

CPU cache memory diagram: levels 1, 2, and 3

A cache is memory with a higher access speed, designed to speed up access to data held permanently in memory with a lower access speed (hereinafter "main memory"). Caching is used by CPUs, hard drives, browsers, web servers, DNS, and WINS.

A cache consists of a set of entries. Each entry is associated with a data element or a data block (a small piece of data) that is a copy of a data element in main memory. Each entry has an identifier that defines the correspondence between the data elements in the cache and their copies in main memory.

When a cache client (a CPU, web browser, or operating system) accesses data, the cache is examined first. If an entry is found in the cache whose identifier matches that of the requested data element, the data element from the cache is used. This is called a cache hit. If no entry containing the requested data element is found in the cache, it is read from main memory into the cache and becomes available for subsequent accesses. This is called a cache miss. The percentage of accesses for which the result is found in the cache is called the hit rate or cache hit ratio.

For example, a web browser checks its local disk cache for a local copy of the web page matching the requested URL. In this example, the URL is the identifier and the content of the web page is the data element.
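As an illustration of hits, misses, and the hit ratio, here is a minimal, hypothetical sketch in C: a tiny direct-mapped cache that stores only identifiers and counts how many lookups it can serve (all names and sizes are invented for the example).

#include <stdio.h>
#include <stdbool.h>

#define CACHE_ENTRIES 4          /* deliberately tiny for the example   */

static long tags[CACHE_ENTRIES]; /* identifier stored in each entry     */
static bool valid[CACHE_ENTRIES];
static long hits, misses;

/* Look up an identifier in a direct-mapped cache: each identifier can
 * live in exactly one entry, chosen by a simple modulo function.       */
static bool cache_lookup(long id)
{
    int slot = (int)(id % CACHE_ENTRIES);
    if (valid[slot] && tags[slot] == id) {
        hits++;
        return true;             /* cache hit                           */
    }
    misses++;                    /* cache miss: fetch from main memory  */
    tags[slot] = id;             /* ...and keep a copy for next time    */
    valid[slot] = true;
    return false;
}

int main(void)
{
    long requests[] = {1, 2, 1, 3, 1, 2, 7, 1};
    for (unsigned i = 0; i < sizeof requests / sizeof requests[0]; i++)
        cache_lookup(requests[i]);
    printf("hits=%ld misses=%ld hit ratio=%.2f\n",
           hits, misses, (double)hits / (hits + misses));
    return 0;
}

For this access sequence, half of the lookups are served from the cache, giving a hit ratio of 0.50.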

If the cache is limited in size, then on a miss it may be necessary to discard some entry to free up space. Various replacement (eviction) algorithms are used to choose which entry to discard.

When data elements in the cache are modified, they are also updated in main memory. The time delay between modifying the data in the cache and updating main memory is governed by the so-called write policy.

In a write-through (immediate write) cache, every change causes a synchronous update of the data in main memory.

In a write-back (deferred write) cache, the update occurs when the data element is evicted, periodically, or at the client's request. To keep track of modified data elements, cache entries store a modification flag (modified, or "dirty"). A miss in a write-back cache may require two accesses to main memory: one to write back the evicted data from the cache, and another to read the required data element.

If the data in main memory can be changed independently of the cache, a cache entry may become stale. The protocols for communication between caches that keep the data consistent are called cache coherence protocols.

Hardware implementation

CPU cache

As processor clock frequencies grew faster than the performance of the RAM subsystem, the data transfer interface to memory became a bottleneck of the computing system.

Cache memory can give a significant performance gain when the RAM clock frequency is much lower than the processor clock frequency. A number of processor models have their own cache in order to minimize access time to RAM, which is slower than the registers (the registers and I/O buffers can be regarded as a zero-level cache). The clock frequency of the cache is usually only slightly lower than that of the CPU.

In processors that support virtual addressing, a small, fast translation lookaside buffer (TLB) is often included. Its speed is important because it is consulted on every memory access.

The problem of synchronizing data between different caches (both within one processor and across several processors) is solved by cache coherence.

There are three ways of exchanging information between cache levels, or, as they are called, cache architectures: inclusive, exclusive, and non-exclusive.

Exclusive cache memory assumes that the information held at different cache levels is unique (the approach preferred by AMD). Inclusive cache memory allows the same information to be duplicated across levels (the approach Intel follows).

Non-exclusive caches may behave either way.

Cache levels

The CPU cache is divided into several levels. The maximum number of levels is four. In today's general-purpose processors, the number of levels can reach three. A level N+1 cache is usually larger, and slower in access time and transfer rate, than the level N cache.

  • The fastest is the first-level cache (L1 cache). It is effectively an integral part of the processor, since it is located on the same chip and belongs to its functional blocks. In modern processors, L1 is usually split into two caches: the instruction cache and the data cache (Harvard architecture). Most processors cannot function without an L1 cache. L1 operates at the processor frequency and, in general, can be accessed every clock cycle; it is often able to perform several read/write operations simultaneously.
  • Second in speed is the second-level cache (L2 cache), which, like L1, is usually located on the same chip as the processor. In earlier processors it was implemented as a separate memory chip on the motherboard. L2 capacity ranges from 128 KB to 1-12 MB. In modern multi-core processors the second-level cache, while on the same die, is per-core memory: with a total L2 size of n MB and c cores, each core gets n/c MB.
  • The third-level cache (L3) is the slowest, but it can be very large - more than 24 MB. L3 is slower than the previous caches, but still considerably faster than RAM. In multiprocessor systems it is shared and serves to synchronize the data of the different L2 caches.
  • There is also a fourth cache level, whose use is justified only in high-performance multiprocessor servers and mainframes. It is usually implemented as a separate chip.

Cache Associativity

One of the fundamental characteristics of cache memory, its level of associativity, reflects its logical segmentation: sequentially scanning all cache lines in search of the required data would take dozens of cycles and would nullify the gain from using memory built into the CPU. Therefore, RAM cells are rigidly bound to particular cache lines (each line can hold data only from a fixed set of addresses), which significantly reduces search time.

For the same cache size, the scheme with the highest associativity will be the slowest but the most effective (beyond roughly four ways, the gain in "specific efficiency" per additional way grows little).
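As a sketch of how this binding works, the following C fragment splits an address into the tag, set index, and offset fields of a hypothetical set-associative cache (the 32 KB / 64-byte-line / 8-way geometry is an illustrative assumption, not taken from a specific processor).

#include <stdio.h>
#include <stdint.h>

/* Illustrative geometry: 32 KB cache, 64-byte lines, 8 ways
 * => 32768 / (64 * 8) = 64 sets.                                       */
#define LINE_SIZE   64u
#define NUM_WAYS    8u
#define CACHE_SIZE  (32u * 1024u)
#define NUM_SETS    (CACHE_SIZE / (LINE_SIZE * NUM_WAYS))

/* Split an address into offset, set index, and tag. The requested data
 * may reside in any of the NUM_WAYS lines of its set.                  */
static void decode_address(uint32_t addr)
{
    uint32_t offset = addr % LINE_SIZE;
    uint32_t set    = (addr / LINE_SIZE) % NUM_SETS;
    uint32_t tag    = addr / (LINE_SIZE * NUM_SETS);
    printf("addr=0x%08x -> set=%u (of %u), tag=0x%x, offset=%u\n",
           addr, set, NUM_SETS, tag, offset);
}

int main(void)
{
    decode_address(0x00001040);   /* maps to some set...                 */
    decode_address(0x00002040);   /* ...a different address, same set    */
    return 0;
}

Addresses that share a set index compete for the same 8 ways, which is exactly what limits the lookup to a few comparisons instead of a scan of the whole cache.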

Caching external drives

Many peripheral storage devices use an internal cache to speed up operation. In particular, hard drives use 1 to 64 MB of cache memory (models with NCQ/TCQ support use it for storing and reordering requests), and CD/DVD/BD drives also cache the information they read to speed up repeated access.

The operating system also uses part of RAM as a cache for disk operations (for example, for external devices that have no cache of their own, including hard disks, flash memory, and floppy disks). Often all of the free RAM (not allocated to processes) is given over to caching hard drives.

Caching of external drives is justified by the following factors:

  1. the processor can access RAM hundreds of times (or more) faster than the memory of external drives;
  2. the performance of disk storage devices (hard, floppy, and optical disks) is highest when several blocks are read or written in sequence, and drops significantly for isolated requests to different places on the disk, owing to the inertia of the mechanical head actuator;
  3. different memory blocks of external drives are accessed with extremely uneven frequency:
    1. some blocks are used by several processes simultaneously, for both reading and writing (for example, in databases);
    2. some blocks are read very frequently (index files, directories in the file system);
    3. some blocks are written very frequently (log files, databases, file system metadata).

When reading, the cache makes it possible to read a block once, keep one copy of it in RAM for all processes, and return its contents "instantly" (compared with a disk request). There is also a "read-ahead" technique: in the background, the operating system reads the next few blocks (after the required one) into the cache.
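The operating system performs this read-ahead on its own, but on POSIX systems an application can also hint at its access pattern through the advisory posix_fadvise call. A minimal sketch (the file path is illustrative):

#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/var/log/syslog", O_RDONLY);    /* illustrative file  */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Tell the kernel we will read the file sequentially, so it can
     * read the following blocks into the page cache ahead of time.     */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    /* Ask it to start loading the first megabyte right away.           */
    posix_fadvise(fd, 0, 1024 * 1024, POSIX_FADV_WILLNEED);

    char buf[4096];
    while (read(fd, buf, sizeof buf) > 0)
        ;                                          /* process the data   */
    close(fd);
    return 0;
}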

When writing, the cache makes it possible to group short writes into larger ones that are handled more efficiently by the drive, or to avoid writing intermediate modifications at all. In this case, all intermediate states of a block remain visible to processes from RAM.

Caching of external storage devices significantly increases system performance by optimizing I/O usage. The advantage of the technology is transparent (invisible to programs) automatic optimization of disk and memory use, while the logic of applications working with files remains unchanged.

The disadvantage of write caching is the delay between a program's write request and the actual writing of the block to disk, as well as the reordering of writes, which can lead to loss of information or inconsistency of data structures in the event of a power failure or system hang. This problem is mitigated by forced periodic synchronization (writing out modified cache lines) and by file system journaling.

Software implementation

Write caching policy

When reading data, cache memory gives an unambiguous performance gain. When writing data, a gain can be obtained only at the cost of reduced reliability. For this reason, different applications may choose different cache write policies.

There are two basic write cache policies: write-through and write-back:

  1. Write-through - the write is made directly to main memory (and duplicated in the cache); that is, writes are not deferred in the cache.
  2. Write-back (deferred write) - the data are written to the cache; the write to main memory is performed later (on eviction or after a timeout), grouping several writes to neighboring cells into one operation. Write-back makes the data in main memory stale for a while; this staleness is not visible to the CPU itself, but before another bus master (a DMA controller or a bus-mastering PCI device) accesses memory, the cache must be flushed to memory forcibly. When write-back is used in a multiprocessor system, the caches of the different CPUs must be kept coherent (or the processors must share the same cache memory).
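A minimal sketch contrasting the two policies for a single cached cell (hypothetical names; real caches operate on whole lines):

#include <stdio.h>
#include <stdbool.h>

static int  main_memory;          /* the "slow" backing store            */
static int  cached_value;         /* the copy held in the cache          */
static bool dirty;                /* set when cache and memory diverge   */

/* Write-through: the cache and main memory are updated together.       */
static void write_through(int value)
{
    cached_value = value;
    main_memory  = value;          /* synchronous, slower write           */
}

/* Write-back: only the cache is updated; memory is refreshed later.    */
static void write_back(int value)
{
    cached_value = value;
    dirty = true;                  /* memory is now stale                 */
}

/* Called on eviction, periodically, or before DMA/another CPU reads.   */
static void flush(void)
{
    if (dirty) {
        main_memory = cached_value;
        dirty = false;
    }
}

int main(void)
{
    write_through(1);
    printf("write-through: cache=%d memory=%d\n", cached_value, main_memory);
    write_back(2);
    printf("write-back (before flush): cache=%d memory=%d dirty=%d\n",
           cached_value, main_memory, dirty);
    flush();
    printf("write-back (after flush):  cache=%d memory=%d\n",
           cached_value, main_memory);
    return 0;
}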

Deferred-write cache algorithm

Initially, all buffer headers are placed on the free buffer list. When a process intends to read or modify a block, it executes the following algorithm:

  1. it tries to find the header of a buffer with the given block number in the hash table;
  2. if the found buffer is busy, it waits for the buffer to be released;
  3. if the buffer is not found in the hash table, it takes the first buffer from the tail of the free list;
  4. if the free buffer list is empty, the replacement algorithm is executed (see below);
  5. if the obtained buffer is marked "dirty", it performs an asynchronous write of the buffer's contents to external memory;
  6. it removes the buffer from the hash table, if it was placed there;
  7. it places the buffer into the hash table under the new block number.

The process reads the data into the obtained buffer and then releases it. If it modified the data, the process marks the buffer as "dirty" before releasing it. When released, the buffer is placed at the head of the free buffer list.

In this way:

  1. if a process has read some block into a buffer, it is likely that another process reading this block will find the buffer already in RAM;
  2. writing to external memory is performed only when there are not enough "clean" buffers, or on explicit request.
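A heavily simplified, single-threaded sketch of this scheme in C is shown below; the "busy buffer" waiting, the release of buffers back to the free list, and asynchronous I/O are omitted, and disk accesses are replaced by print statements (all names and sizes are illustrative).

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define NUM_BUFFERS 4
#define HASH_SIZE   8
#define BLOCK_SIZE  512

/* Simplified buffer header: single-threaded, so no busy flag needed.   */
struct buf {
    long  blkno;                  /* number of the cached disk block     */
    bool  valid;                  /* header currently maps a block       */
    bool  dirty;                  /* contents differ from the disk       */
    char  data[BLOCK_SIZE];
    struct buf *hash_next;        /* chain inside one hash bucket        */
};

static struct buf  buffers[NUM_BUFFERS];
static struct buf *hash_table[HASH_SIZE];
static int         free_list[NUM_BUFFERS];   /* buffer indices, FIFO     */
static int         free_head, free_count = NUM_BUFFERS;

static unsigned hash(long blkno) { return (unsigned)blkno % HASH_SIZE; }

static void hash_remove(struct buf *b)
{
    struct buf **p = &hash_table[hash(b->blkno)];
    while (*p && *p != b)
        p = &(*p)->hash_next;
    if (*p)
        *p = b->hash_next;
}

/* The steps of the deferred-write algorithm described above.           */
static struct buf *get_buffer(long blkno)
{
    /* 1. look for the block in the hash table                          */
    for (struct buf *b = hash_table[hash(blkno)]; b; b = b->hash_next)
        if (b->blkno == blkno)
            return b;                         /* already cached           */

    /* 3./4. take a buffer from the free list (eviction omitted here)   */
    if (free_count == 0)
        return NULL;                          /* would run the replacement algorithm */
    struct buf *b = &buffers[free_list[free_head]];
    free_head = (free_head + 1) % NUM_BUFFERS;
    free_count--;

    /* 5. write back a "dirty" victim before reusing it                 */
    if (b->valid && b->dirty)
        printf("writing block %ld back to disk\n", b->blkno);
    /* 6./7. rehash the buffer under the new block number               */
    if (b->valid)
        hash_remove(b);
    b->blkno = blkno;
    b->valid = true;
    b->dirty = false;
    printf("reading block %ld from disk\n", blkno);   /* fills b->data    */
    b->hash_next = hash_table[hash(blkno)];
    hash_table[hash(blkno)] = b;
    return b;
}

int main(void)
{
    for (int i = 0; i < NUM_BUFFERS; i++)
        free_list[i] = i;

    struct buf *b = get_buffer(42);           /* miss: read from disk     */
    strcpy(b->data, "hello");
    b->dirty = true;                          /* modified in cache only   */
    get_buffer(42);                           /* hit: no disk access      */
    return 0;
}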

Replacement algorithm

If the free buffer list is empty, the buffer replacement (eviction) algorithm is executed. The replacement algorithm significantly affects cache performance. The following algorithms exist:

  1. LRU (Least Recently Used) - the buffer that has gone unused the longest is evicted;
  2. MRU (Most Recently Used) - the most recently used buffer is evicted;
  3. LFU (Least Frequently Used) - the least frequently used buffer is evicted;
  4. ARC (Adaptive Replacement Cache) - an eviction algorithm combining LRU and LFU, patented by IBM.

Which algorithm to use depends on the data caching strategy. LRU is most effective if the data are guaranteed to be reused soon. MRU is most effective if the data are guaranteed not to be reused soon. If an application explicitly specifies a caching strategy for a particular data set, the cache will work most efficiently.
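As an illustration, a minimal LRU sketch for a tiny, fully associative cache (illustrative sizes; a global counter stands in for the time of last use):

#include <stdio.h>

#define NUM_SLOTS 3

static long keys[NUM_SLOTS];       /* cached block numbers               */
static long last_used[NUM_SLOTS];  /* "timestamp" of the last access     */
static int  used;                  /* how many slots are occupied        */
static long now;                   /* global access counter              */

/* Return the slot holding `key`, loading it and evicting the least
 * recently used entry if necessary.                                     */
static int access_block(long key)
{
    now++;
    for (int i = 0; i < used; i++)
        if (keys[i] == key) {                /* hit                       */
            last_used[i] = now;
            return i;
        }
    int victim;
    if (used < NUM_SLOTS) {
        victim = used++;                     /* still a free slot         */
    } else {
        victim = 0;                          /* find the oldest entry     */
        for (int i = 1; i < NUM_SLOTS; i++)
            if (last_used[i] < last_used[victim])
                victim = i;
        printf("evicting block %ld (LRU)\n", keys[victim]);
    }
    printf("loading block %ld\n", key);      /* miss: fetch from memory   */
    keys[victim] = key;
    last_used[victim] = now;
    return victim;
}

int main(void)
{
    long trace[] = {1, 2, 3, 1, 4, 2};       /* simple access trace       */
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        access_block(trace[i]);
    return 0;
}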

Operating system caching

The RAM cache consists of the following elements:

  1. a set of RAM pages divided into buffers equal in length to the data block of the corresponding external memory device;
  2. a set of buffer headers describing the state of the corresponding buffers;
  3. a hash table mapping block numbers to buffer headers;
  4. a list of free buffers.

Caching web pages

When transferring information over a network, caching of Internet pages can be used: frequently requested documents are stored on (intermediate) proxy servers or on the user's machine in order to avoid constantly downloading them from the origin server and to reduce traffic. The information thereby moves closer to the user. Caching is controlled by means of HTTP headers.
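For example, a server could mark a page as cacheable for one hour and supply a validator, with response headers roughly like the following (the values are illustrative):

HTTP/1.1 200 OK
Cache-Control: public, max-age=3600
ETag: "v5-homepage"
Content-Type: text/html

On a repeat request the client can send If-None-Match: "v5-homepage", and if the page has not changed the server can answer 304 Not Modified without resending the body.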

Alternatively, web page caching can be performed by the CMS of a particular site to reduce server load under heavy traffic. Caching can be done both in memory and in a file cache [8]. A drawback of caching is that changes made in one browser may not be reflected immediately in another browser that takes its data from the cache.

Caching work results

Many programs save intermediate or auxiliary results of their work somewhere so that they do not have to be recomputed every time they are needed. This speeds up the work but requires additional memory (RAM or disk). An example of such caching is database indexing.

Processor cache memory: levels 1, 2, and 3

It makes sense to equip a multi-core processor with dedicated memory shared by all available cores. In this role, a fast L3 cache can provide quick access to the data the cores need most, so that the cores do not have to wait for slow system RAM.

That, at least, is the theory. A recent test of the AMD Athlon II X4, which is essentially a Phenom II X4 without the Level 3 cache, showed that an L3 cache is not always that necessary.

How does the cache work?

Before going deep into our tests, it is necessary to understand a few basic concepts. The principle of a cache is quite simple: data are buffered as close to the processor core as possible in order to avoid fetching them from slow, distant sources.

Requests for data that can be answered from within the cache are called hits. If the cache does not contain the data being sought and the request has to go on to the next memory structure, it is a miss. Modern computers have up to three cache levels before they resort to slow system memory. The second and especially the third cache level serve not only to buffer data; their purpose is also to keep the processor bus from being overloaded by data exchange between the cores.

Cache hits and misses

The effectiveness of a cache architecture is measured by its hit rate. A data request that can be served from the cache is called a hit. Data that cannot be taken from the cache, and instead have to be fetched from slow system memory, constitute a miss. In other words, hits make the work fast and misses make it slow.

Cache writes, exclusivity, coherency

The replacement policy governs how new information entering the cache is placed. Information written to the cache must also be available in system memory. The system can either write the new information through to memory immediately, or mark the modified cache location as "dirty", letting the system know that this location must be written back before it can be freed.

Data at the different cache levels can be stored exclusively, i.e. with no redundancy: you will not find the same data at two different cache levels. Alternatively, the cache can work inclusively, with the contents of a lower-level cache also held in a higher-level cache. AMD's Phenom works with an exclusive L3 cache, while Intel follows an inclusive cache strategy. Coherency protocols take care of keeping the data consistent across levels, cores, and even processors.

Cache size

A larger cache stores more information, but it usually comes with higher latencies. Since a cache consumes a large number of processor transistors, it is important to strike a balance between transistor cost, power consumption, and performance.

Associativity

An N-way set-associative cache means that information located at a given RAM address can be cached in N places in that cache. Higher associativity provides more flexible caching, because existing cache data do not have to be overwritten as often. In other words, higher N-way associativity yields higher hit rates, but it also increases latency, because comparing the ways on a lookup takes more time. High associativity makes sense for the last-level cache, because that is where most of the cache capacity is concentrated and it spares the processor from working with the slower RAM.

A few examples: the Core i5 and Core i7 processors work with a 32 KB 8-way set-associative first-level data cache and a 32 KB 4-way set-associative first-level instruction cache. Clearly, Intel wants instructions to execute quickly while increasing the number of first-level cache hits. The second-level cache is also 8-way. The third-level cache is somewhat smarter: it uses 16-way associativity, which increases the number of hits.

AMD, however, follows a different policy in its Phenom II X4: the first-level cache there is 2-way set-associative, which provides lower latencies. To compensate for possible misses, its capacity is doubled: 64 KB each for data and for instructions. The second-level cache is the same as Intel's, but the L3 cache works with 48-way associativity.
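As a rough worked example (assuming the common 64-byte cache line size), a 32 KB 8-way cache is organized as 32768 / (64 * 8) = 64 sets; any given memory address maps to exactly one of those 64 sets and may be placed in any of its 8 ways.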

It is impossible to judge much from all this without looking at the processor architecture as a whole. Naturally, only benchmarks (performance tests) can show real results, and the point of this whole tour is to show the complexity of multi-level caching.

