The impact of cache memory on computer performance. What is hard drive cache memory and why is it needed? What is the cache size responsible for?

It's not about cash, it's about cache: processor cache memory and more. Marketers have turned cache size into another commercial fetish, especially for central processors and hard drives (video cards have cache too, but marketing hasn't gotten to them yet). So, there is an XXX processor with a 1 MB L2 cache, and a seemingly identical XYZ processor with a 2 MB cache. Guess which one is better? Ah, don't answer right away!

Cache memory is a buffer that stores whatever can or needs to be set aside for later. While the processor works, situations arise where intermediate data must be stored somewhere. In the cache, of course: it is orders of magnitude faster than RAM, because it sits on the processor die itself and usually runs at the same frequency. Some time later, the processor fishes this data back out and processes it again. Roughly speaking, it is like a potato sorter on a conveyor belt who, every time he comes across something other than potatoes (say, carrots), throws it into a box. When the box is full, he gets up and carries it to the next room. At that moment the conveyor stands still and downtime occurs. The volume of the box is the cache in this analogy. How much of it do you need: 1 MB or 12? Clearly, if the volume is too small, too much time will be spent carting it away and the conveyor will idle; but past a certain volume, further increases give nothing. Give the sorter a box for 1000 kg of carrots, and he won't see that many in his entire shift; the big box will NOT make him twice as fast! There is one more subtlety: a large cache may increase access latency, firstly, and the likelihood of errors in it grows, for example during overclocking, secondly. (You can read here about HOW to determine processor stability or instability in this case and find out that the error occurs specifically in the L1 or L2 cache.) Thirdly, cache eats up a decent share of the chip area and of the transistor budget of the processor design. The same goes for hard drive cache memory. If the processor architecture is strong, a cache of 1024 KB or more will be in demand in many applications. If you have a fast HDD, 16 MB or even 32 MB is appropriate.
But no 64 MB of cache will make a drive faster if it is a cut-down model sold as a "green" version (WD Green) spinning at 5900 rpm instead of the required 7200, even if the 7200 rpm drive has only 8 MB. Intel and AMD processors also use the cache differently (generally speaking, AMD's is more efficient, and their processors are often comfortable with smaller values). In addition, Intel's cache is shared, while AMD's is private to each core. The fastest cache, L1, is 64 KB for data and 64 KB for instructions on AMD processors, twice as much as on Intel's. A third-level L3 cache is usually present in top processors such as the AMD Phenom II X6 1055T (Socket AM3, 2.8 GHz) or its competitor, the Intel Core i7-980X. Games love large caches most of all. And many professional applications do NOT care about cache (see "A computer for rendering, video editing and professional applications"). More precisely, the most demanding ones are generally indifferent to it. What you definitely should not do is choose a processor by cache size. The old Pentium 4 in its final incarnations also had 2 MB of cache at operating frequencies well over 3 GHz; compare its performance with that of a cheap dual-core Celeron E1*** running at about 2 GHz. The youngster will leave no stone standing of the old man. A more relevant example: the high-frequency dual-core E8600, which costs almost $200 (apparently because of its 6 MB cache), versus the Athlon II X4 620 at 2.6 GHz with only 2 MB. That does not stop the Athlon from butchering its competitor.

As the graphs show, no cache will replace additional cores. An Athlon with 2 MB of cache (red) easily beats a Core 2 Duo with 6 MB of cache, even at a lower frequency and almost half the cost. Many people also forget that video cards have cache, because, generally speaking, they also have processors. A recent example is the GTX 460, where they manage to cut not only the bus and memory capacity (which the buyer will guess) but also the shader CACHE, from 512 KB to 384 KB (which the buyer will NOT guess). And this also makes its own negative contribution to performance. It is also interesting to see how performance depends on cache size. Let's examine how quickly it grows with cache size using essentially the same processor: as you know, the E6***, E4*** and E2*** series differ only in cache size (4, 2 and 1 MB, respectively). Operating at the same frequency of 2400 MHz, they show the following results.

As you can see, the results do not differ much. I will say more: had a processor with 6 MB been involved, the result would have grown only a little further, because processors reach saturation. With 512 KB models, however, the drop would be noticeable. In other words, 2 MB is enough even for games. To summarize, we can draw the following conclusion: cache is good when everything else is ALREADY plentiful. It is naive and stupid to trade hard drive speed or the number of processor cores for cache size at the same cost, because even the most capacious sorting box will not replace another sorter. But there are good examples too. For instance, the early 65 nm revision of the Pentium Dual-Core had 1 MB of cache for two cores (the E2160 series and similar), while the later 45 nm revision of the E5200 series has 2 MB, all other things (and most importantly, the PRICE) being equal. Of course, you should choose the latter.

A cache is a fast-access intermediate buffer containing the information most likely to be requested. Accessing data in the cache is faster than fetching the original data from main memory (RAM) or from external memory (a hard drive or solid-state drive), which reduces the average access time and increases the overall performance of the computer system.

A number of central processing unit (CPU) models have their own cache in order to minimize accesses to random access memory (RAM), which is slower than the registers. Cache memory can provide a significant performance benefit when the RAM clock speed is significantly lower than the CPU clock speed, since the clock speed of cache memory is usually not much lower than that of the CPU.

Cache levels

The CPU cache is divided into several levels. In a general-purpose processor today, the number of levels can be as high as 3. Level N+1 cache is typically larger in size and slower in access speed and data transfer than level N cache.

The fastest memory is the first level cache - L1-cache. In fact, it is an integral part of the processor, since it is located on the same chip and is part of the functional blocks. In modern processors, the L1 cache is usually divided into two caches, the instruction cache and the data cache (Harvard architecture). Most processors without L1 cache cannot function. The L1 cache operates at the processor frequency, and, in general, can be accessed every clock cycle. It is often possible to perform multiple read/write operations simultaneously. Access latency is usually 2–4 core clock cycles. The volume is usually small - no more than 384 KB.

The second fastest is the L2 cache, a second-level cache usually located on the chip like L1; in older processors it was a set of chips on the motherboard. L2 cache volume ranges from 128 KB to 1–12 MB. In modern multi-core processors, a second-level cache located on the same die is often per-core memory: with a total L2 size of nM MB, each of nC cores gets nM/nC MB. Typically, the latency of an L2 cache located on the core die is 8 to 20 core clock cycles.

The third-level cache is the least fast, but it can be very impressive in size - more than 24 MB. The L3 cache is slower than the previous caches, but still significantly faster than RAM. In multiprocessor systems it is shared and is intended for synchronizing the data of the different L2 caches.

Sometimes there is also a 4th level cache, usually it is located in a separate chip. The use of Level 4 cache is justified only for high-performance servers and mainframes.

The problem of synchronization between different caches (both one and multiple processors) is solved by cache coherence. There are three options for exchanging information between caches of different levels, or, as they say, cache architectures: inclusive, exclusive and non-exclusive.

How important is L3 cache for AMD processors?

Indeed, it makes sense to equip multi-core processors with dedicated memory that will be shared by all available cores. In this role, a fast third-level (L3) cache can significantly speed up access to data that is requested most often. Then the cores, if possible, will not have to access slow main memory (RAM).

At least in theory. Recently AMD announced the Athlon II X4 processor, which is a Phenom II X4 model without L3 cache, hinting that it is not that necessary. We decided to directly compare two processors (with and without L3 cache) to test how the cache affects performance.


How does the cache work?

Before we dive into the tests, it's important to understand some basics. The principle of how the cache works is quite simple. The cache buffers data as close to the processing cores of the processor as possible to reduce CPU requests to more distant and slow memory. On modern desktop platforms, the cache hierarchy includes as many as three levels that precede access to RAM. Moreover, caches of the second and, in particular, third levels serve not only to buffer data. Their purpose is to prevent the processor bus from becoming overloaded when the cores need to exchange information.

Hits and misses

The effectiveness of cache architectures is measured by hit rate. Data requests that can be satisfied by the cache are considered hits. If this cache does not contain the necessary data, then the request is passed further along the memory pipeline, and a miss is counted. Of course, misses lead to more time required to obtain information. As a result, “bubbles” (idles) and delays appear in the computing pipeline. Hits, on the contrary, allow you to maintain maximum performance.

Cache writes, exclusivity, coherence

Replacement policies dictate how space is freed in the cache for new entries. Because data written to the cache must eventually appear in main memory, the system may write to memory at the same time as writing to the cache (write-through), or may mark the written areas as "dirty" (write-back) and flush them to memory when they are evicted from the cache.

Data in several cache levels can be stored exclusively, that is, without redundancy: you will not find the same data lines in two different cache levels. Alternatively, caches can work inclusively, meaning the lower (larger) cache levels are guaranteed to contain the data present in the upper levels (those closer to the processor core). AMD Phenom uses an exclusive L3 cache, while Intel follows an inclusive cache strategy. Coherency protocols ensure the integrity and freshness of data across different cores, cache levels, and even processors.

Cache size

A larger cache can hold more data, but tends to increase latency. In addition, a large cache consumes a considerable number of processor transistors, so it is important to find a balance between the transistor budget, die size, power consumption and performance/latency.

Associativity

Entries in RAM can be direct-mapped to the cache, meaning there is only one cache position where a copy of a given piece of RAM data may reside, or they can be n-way associative, meaning there are n possible locations in the cache where the data may be stored. Higher degrees of associativity (up to fully associative caches) provide greater caching flexibility, because existing data in the cache need not be evicted as often. In other words, a high n-degree of associativity promises a higher hit rate, but it also increases latency, because it takes more time to check all those ways for a hit. Typically, the highest degree of associativity is reasonable for the last level of cache, since the maximum capacity is available there, and searching for data beyond this cache means the processor must access slow RAM.

Here are some examples: Core i5 and i7 use 32 KB of L1 cache with 8-way associativity for data and 32 KB of L1 cache with 4-way associativity for instructions. It's understandable that Intel wants instructions to be available faster and the L1 data cache to have a maximum hit rate. The L2 cache on Intel processors has 8-way associativity, and the Intel L3 cache is even smarter, since it implements 16-way associativity to maximize hits.

However, AMD follows a different strategy in the Phenom II X4 processors, which use a 2-way associative L1 cache to reduce latency. To compensate for possible misses, the cache capacity was doubled: 64 KB for data and 64 KB for instructions. The L2 cache has 8-way associativity, like the Intel design, but AMD's L3 cache operates with 48-way associativity. Still, the decision to choose one cache architecture over another cannot be assessed without considering the entire CPU architecture. It is only natural that test results have practical significance, and our goal was precisely a practical test of this entire complex multi-level caching structure.

Every modern processor has a dedicated cache that stores processor instructions and data, ready for use almost instantly. This level is commonly referred to as Level 1 or L1 cache, and it was first introduced in the 486DX processors. Recently, 64 KB of L1 cache per core (for data and for instructions) has become standard on AMD processors, while Intel processors use 32 KB of L1 cache per core (likewise for data and for instructions).

After the 486DX, the L1 cache became an integral feature of all modern CPUs.

Second-level cache (L2) appeared on all processors after the release of the Pentium III, although the first implementations of it on packaging were in the Pentium Pro processor (but not on-chip). Modern processors are equipped with up to 6 MB of on-chip L2 cache. As a rule, this volume is divided between two cores on an Intel Core 2 Duo processor, for example. Typical L2 configurations provide 512 KB or 1 MB of cache per core. Processors with a smaller L2 cache tend to be at the lower price level. Below is a diagram of early L2 cache implementations.

The Pentium Pro had the L2 cache in the processor packaging. In subsequent generations of Pentium III and Athlon, the L2 cache was implemented through separate SRAM chips, which was very common at that time (1998, 1999).

The subsequent shrink of process technology to 180 nm finally allowed manufacturers to integrate the L2 cache onto the processor die.


The first dual-core processors simply used existing designs that included two dies per package. AMD introduced a dual-core processor on a monolithic chip, added a memory controller and a switch, and Intel simply assembled two single-core chips in one package for its first dual-core processor.


For the first time, the L2 cache began to be shared between two computing cores on Core 2 Duo processors. AMD went further and created its first quad-core Phenom from scratch, and Intel again used a pair of dies, this time two dual-core Core 2 dies, for its first quad-core processor to reduce costs.

The third-level cache has existed since the days of the Alpha 21164 processor (96 KB, introduced in 1995) and the IBM Power4 (256 KB, 2001). However, in x86-based architectures the L3 cache first appeared with the Intel Itanium 2, the Pentium 4 Extreme Edition (Gallatin, both in 2003), and the Xeon MP (2006).

Early implementations simply provided another level in the cache hierarchy, although modern architectures use the L3 cache as a large, shared buffer for inter-core data transfer in multi-core processors. This is underlined by its high degree of associativity: it is better to spend a little longer searching the cache than to have several cores making very slow accesses to main RAM. AMD first introduced an L3 cache on a desktop processor with the already mentioned Phenom line. The 65 nm Phenom X4 contained 2 MB of shared L3 cache, and the modern 45 nm Phenom II X4 already has 6 MB. Intel Core i7 and i5 processors use 8 MB of L3 cache.

Modern quad-core processors have dedicated L1 and L2 caches for each core, as well as a large L3 cache shared by all cores. The shared L3 cache also allows for the exchange of data that the cores can work on in parallel.


Hello guys! Let's talk about the processor, or more precisely, about its cache. A processor's cache can vary: for example, I now have a Pentium G3220 (socket 1150), a modern processor with 3 MB of cache. Meanwhile, the old Pentium D965 (socket 775) has 4 MB of cache, yet the G3220 is several times faster than the D965. What I mean is that cache is good, but the main thing is that the cache is modern. The cache memory of older processors is much slower than that of new ones; keep this in mind.

Let's talk about the devices that affect performance. Take a hard drive: does it have a cache? Yes, but a small one, though it does slightly affect performance. What comes next? Next comes RAM: everything a program or the processor works with is placed in RAM. If data is not in RAM, it is read from the hard drive, and that is very slow. RAM is already very fast, and there can be quite a lot of it. But RAM is only fast compared to a hard drive; for the processor it is still not fast enough, which is why the processor also has its own cache, and that one is truly super fast!

What does the processor cache affect? It is in this cache that the processor stores what it uses often, that is, all sorts of commands and instructions. It would seem that the more of it there is, the better, but this is not entirely true. How much cache do you have? If you don't know, I'll show you how to find out; it's simple. But look what an interesting situation arises if we go back to the old processors again. It would seem that a lot of cache is a good thing. Yet there is the Q9650 processor (socket 775) with 12 MB of cache, and it doesn't come close to modern Core i5 or even Core i3 models. The i5 has much less cache, just 6 MB, and the i3 even less, only 3 MB.

I understand that, in general, modern processors are simply much faster than old ones. But that's not my point. Cache and cache are two different things: the top-end Q9650 simply has slow cache compared to processors on a modern socket, so those 12 MB are of no use. All I mean is: don't chase quantity, chase quality. I wrote all this as a note for you; I hope you find it useful.

Here is a simple picture of how the cache works:

And here is another picture; one more device is shown in it, the controller, which determines whether the requested data is in the cache or not:

Cache memory is super fast. I'm not that knowledgeable about processors, but it would be interesting to know: if the cache were, say, 100 MB, or even 1 GB, would the processor be faster? That is fantasy even now, although there are already processors with a huge amount of cache, around 30 MB or more. I'm not sure about this, but it seems this cache memory is very expensive and generally hard to fit into a processor, I mean a large volume of it.

Now let me show you how to find out how much cache a processor has. If you have Windows 10, great, because it can show all the caches; there are three levels there. The third level seems to be the most important one, and it is also the largest. So look: open Task Manager, go to the Performance tab, and on the CPU tab you can see information about the cache; here it is:

Here you can see that I have a Pentium G3220, a fairly good processor, albeit inexpensive. It is actually faster than many socket 775 models that could be called near-top and that have much more cache... Such are the things...

But I'll tell you frankly, this is not the clearest way to see how much cache a processor has. I advise you to use the CPU-Z utility. If you're thinking something like, "ugh, a program, I'd have to install it and all that," then stop! This program is used by serious overclockers when overclocking their processors. During installation the utility does not create a bunch of files; the installation is really just unpacking the program into Program Files, after which cpuz.exe can be copied anywhere and run, and it will work! You launch it, it collects the information, and you look! It is easy to download from the Internet, since it is available on every corner; just make sure you don't pick up any viruses along the way, so download it from a well-known software portal (just search for CPU-Z there). CPU-Z works on almost all versions of Windows, except the most ancient ones...

In fact, you can download it from the official site, cpuid.com; I honestly didn't know about it and was used to downloading from other sites!

Well, I hope you can download it without problems. Launch it and everything about the processor is at your fingertips. So I launched CPU-Z, and this is what it showed about my Pentium G3220:

The box I circled is where the cache is displayed. As for "way", where it says 8-way or 12-way: that is the cache's associativity, which was explained earlier. As you can see, besides the cache you also get other information: frequency, cores and threads. Interestingly, the cache may be shown here as one block or as two. Mine simply says 3 MBytes, that is, I have 3 MB of cache.

With the top-end Q9650, for example, the situation is a little different: although it has 12 MB of cache, that is essentially two blocks of 6 MB each, and CPU-Z detects this:

By the way, as you can see, it is overclocked to 4 GHz, which is not bad. Such an overclock may well run on air cooling. But that's a completely different story...

By the way, another interesting thing: models on socket 775 do not have a third-level L3 cache... That is, there are only L1 and L2... I didn't know that...

So that's how things are. I hope I explained everything clearly. I repeat once again: do not chase quantity. I don't really regret it, but still... In short, I went and built myself a computer on socket 1150. Well, I thought, everything is fine. But I felt a little cheated when I found out that socket 1151 had been released at the same price, or even a little cheaper, with processors that are actually faster... Oh well. I bought the computer to last for ages, and I was glad that my board, an Asus Gryphon Z87, supports processors based on the Devil's Canyon core! That was a gift, because Intel had previously stated that these processors would be supported only by the Z97 chipset, but I got them on my Z87!

In short, that's how it is.

That's all, guys. I hope everything goes well for you and that this information was useful. Good luck!

07/30/2016

virtmachine.ru

The impact of cache memory on computer performance

All users are well aware of such computer components as the processor, which is responsible for processing data, and random access memory (RAM), which is responsible for storing it. But probably not everyone knows that there is also processor cache memory (CPU cache), that is, the processor's own RAM (so-called ultra-fast RAM).

Cache function

What is the reason that prompted computer designers to use dedicated memory for the processor? Isn't the computer's RAM capacity enough?

Indeed, for a long time, personal computers did without any cache memory. But, as you know, the processor is the fastest device on a personal computer and its speed has increased with each new generation of CPU. Currently, its speed is measured in billions of operations per second. At the same time, standard RAM has not significantly increased its performance during its evolution.

Generally speaking, there are two main memory chip technologies: static memory and dynamic memory. Without delving into the details of their design, we will only say that static memory, unlike dynamic memory, does not require refreshing; in addition, static memory uses 4–8 transistors per bit of information, while dynamic memory uses 1–2. Accordingly, dynamic memory is much cheaper than static memory, but also much slower. Currently, RAM chips are manufactured on the basis of dynamic memory.

Approximate evolution of the ratio of the speed of processors and RAM:

Thus, if the processor took information from RAM all the time, it would have to wait for slow dynamic memory, and it would be idle all the time. In the same case, if static memory were used as RAM, the cost of the computer would increase several times.

That is why a reasonable compromise was developed. The bulk of the RAM remained dynamic, while the processor got its own fast cache memory based on static memory chips. Its volume is relatively small - for example, the size of the second level cache is only a few megabytes. However, it’s worth remembering that the entire RAM of the first IBM PC computers was less than 1 MB.

In addition, the advisability of caching technology is supported by the fact that different applications located in RAM load the processor unevenly; as a result, some data requires priority processing compared to the rest.

Cache history

Strictly speaking, before cache memory moved to personal computers, it had already been successfully used in supercomputers for several decades.

Cache memory first appeared in PCs based on the i80386 processor, a mere 16 KB of it. Today, modern processors use several cache levels, from the first (the fastest and smallest, typically tens of kilobytes per core) to the third (the slowest and largest, up to tens of MB).

At first, the processor's external cache was located on a separate chip. Over time, however, this caused the bus located between the cache and the processor to become a bottleneck, slowing down data exchange. In modern microprocessors, both the first and second levels of cache memory are located in the processor core itself.

For a long time, processors had only two cache levels, but the Intel Itanium CPU was the first to feature a third-level cache, common to all processor cores. There are also developments of processors with a four-level cache.

Cache architectures and principles

Today, two main types of cache memory organization are known, originating from the first theoretical developments in the field of cybernetics: the Princeton and Harvard architectures. The Princeton architecture implies a single memory space for storing data and commands, while the Harvard architecture implies separate ones. Most x86 personal computer processors use the separate (Harvard) type of cache memory. In addition, a third type of cache has appeared in modern processors: the translation lookaside buffer (TLB), designed to speed up the conversion of the operating system's virtual memory addresses into physical addresses.

A simplified scheme of the interaction between the cache and the processor can be described as follows. First, the processor checks whether the information it needs is present in the fastest first-level cache, then in the second-level cache, and so on. If the required information is not found at any cache level, this is called a cache miss, and the processor has to fetch the data from RAM or even from external memory (the hard drive).

The order in which the processor searches for information in memory:


To control the operation of the cache memory and its interaction with the computing units of the processor, as well as RAM, there is a special controller.

Scheme of organizing the interaction of the processor core, cache and RAM:


The cache controller is the key link between the processor, RAM and cache memory

It should be noted that data caching is a complex process that uses many technologies and mathematical algorithms. Among the basic concepts used in caching are cache writing methods and cache associativity architecture.

Cache Write Methods

There are two main methods for writing information to cache memory:

  1. Write-back method – data is written first to the cache, and then, when certain conditions occur, to RAM.
  2. Write-through method – data is written simultaneously to RAM and cache.

Cache associativity architecture

Cache associativity architecture defines the way in which data from RAM is mapped to the cache. The main options for caching associativity architecture are:

  1. Direct-mapped cache - a specific section of the cache is responsible for a specific section of RAM
  2. Fully associative cache - any part of the cache can be associated with any part of RAM
  3. Set-associative (mixed) cache - each section of RAM maps to one of a small set of possible cache locations

Different cache levels can use different associativity architectures. Direct-mapped caching offers the fastest lookup, since only one location needs to be checked, but it suffers the most conflict misses. A fully associative cache, in turn, has the fewest cache misses, but the slowest and most expensive lookup. This is why most real caches are set-associative, a compromise between the two.

Conclusion

In this article, you were introduced to the concept of cache memory, cache memory architecture and caching methods, and learned how it affects the performance of a modern computer. The presence of cache memory can significantly optimize the operation of the processor, reduce its idle time, and, consequently, increase the performance of the entire system.

biosgid.ru

Gallery of processor cache effects

Almost all developers know that the processor cache is a small but fast memory that stores data from recently visited memory areas - the definition is short and quite accurate. However, knowing the boring details about the cache mechanisms is necessary to understand the factors that affect code performance.

In this article we will look at a number of examples illustrating various features of caches and their impact on performance. The examples will be in C#; the choice of language and platform does not greatly affect the performance assessment and final conclusions. Naturally, within reasonable limits, if you choose a language in which reading a value from an array is equivalent to accessing a hash table, you will not get any interpretable results. Translator's notes are in italics.


Example 1: Memory Access and Performance
How much faster do you think the second loop is than the first?

int[] arr = new int[64 * 1024 * 1024];

// first loop
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// second loop
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

Example 2: Influence of cache lines

Let's dig deeper and try other step values, not just 1 and 16:

for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;

Please note that with step values ​​from 1 to 16, the operating time remains virtually unchanged. But with values ​​greater than 16, the running time decreases by about half every time we double the step. This does not mean that the loop somehow magically starts running faster, just that the number of iterations also decreases. The key point is the same operating time with step values ​​from 1 to 16.

The reason for this is that modern processors access memory not byte by byte, but in small blocks called cache lines. Typically a cache line is 64 bytes. When you read any value from memory, at least one whole cache line is brought into the cache, and subsequent accesses to any value from that line are very fast. Because 16 int values occupy 64 bytes, loops with steps from 1 to 16 touch the same number of cache lines - in fact, all the cache lines of the array. At step 32 the loop touches every second line; at step 64, every fourth. Understanding this is very important for some optimization techniques. The number of main-memory accesses depends on how the data is laid out in memory: for example, unaligned data may require two accesses instead of one, and, as we found out above, the running speed will then be two times lower.
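The stride arithmetic above can be checked without any hardware at all. Here is a minimal Python sketch (assuming the typical values used in this article: 64-byte cache lines and 4-byte ints) that counts how many distinct cache lines a strided loop touches:

```python
LINE_SIZE = 64   # typical cache line size in bytes
INT_SIZE = 4     # size of an int in bytes

def lines_touched(n_elems, step):
    """Count distinct cache lines touched by: for i in range(0, n_elems, step): arr[i]"""
    return len({(i * INT_SIZE) // LINE_SIZE for i in range(0, n_elems, step)})

n = 1024 * 1024  # one million ints = 4 MB
# Steps 1..16 touch every cache line of the array, so the memory
# traffic -- and hence the running time -- is essentially the same.
assert lines_touched(n, 1) == lines_touched(n, 16) == n * INT_SIZE // LINE_SIZE
# Doubling the step beyond 16 halves the number of lines touched.
assert lines_touched(n, 32) * 2 == lines_touched(n, 16)
```

This is only a model of memory traffic, not a timing experiment, but it shows why the measured time is flat up to step 16 and halves thereafter.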

Example 3: Level 1 and 2 cache sizes (L1 and L2)
Modern processors typically have two or three levels of caches, usually called L1, L2, and L3. To find out the cache sizes at each level, you can use the CoreInfo utility or the Windows API function GetLogicalProcessorInformation. Both also report the cache line size for each level. On my machine, CoreInfo reports 32 KB L1 data caches, 32 KB L1 instruction caches, and 4 MB unified L2 caches. Each core has its own private L1 caches; each L2 cache is shared by a pair of cores:

Logical Processor to Cache Map:
*--- Data Cache          0, Level 1,   32 KB, Assoc   8, LineSize  64
*--- Instruction Cache   0, Level 1,   32 KB, Assoc   8, LineSize  64
-*-- Data Cache          1, Level 1,   32 KB, Assoc   8, LineSize  64
-*-- Instruction Cache   1, Level 1,   32 KB, Assoc   8, LineSize  64
**-- Unified Cache       0, Level 2,    4 MB, Assoc  16, LineSize  64
--*- Data Cache          2, Level 1,   32 KB, Assoc   8, LineSize  64
--*- Instruction Cache   2, Level 1,   32 KB, Assoc   8, LineSize  64
---* Data Cache          3, Level 1,   32 KB, Assoc   8, LineSize  64
---* Instruction Cache   3, Level 1,   32 KB, Assoc   8, LineSize  64
--** Unified Cache       1, Level 2,    4 MB, Assoc  16, LineSize  64

Let's check this information experimentally. To do this, we'll walk over our array, incrementing every 16th value - an easy way to touch every cache line. When we reach the end, we return to the beginning. We'll try different array sizes; we should see drops in performance when the array no longer fits into a cache level. The code is:

int steps = 64 * 1024 * 1024; // number of iterations
int lengthMod = arr.Length - 1; // array size is a power of two

for (int i = 0; i < steps; i++)
{
    // x & lengthMod == x % arr.Length, because arr.Length is a power of two
    arr[(i * 16) & lengthMod]++;
}

Test results:

On my machine, there are noticeable drops in performance after 32 KB and 4 MB - these are the sizes of the L1 and L2 caches.
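The `& lengthMod` trick in the code above replaces an expensive modulo with a bitmask, and it is valid only because the array length is a power of two. A quick Python check of the identity (the values here are arbitrary illustrations):

```python
def mod_pow2(x, n):
    """x % n computed with a bitmask; valid only when n is a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return x & (n - 1)

n = 64 * 1024  # a power-of-two array length
for x in (0, 1, 63, 64, 12345, 10**9):
    # The bitmask and the modulo operator agree for every x.
    assert mod_pow2(x, n) == x % n
```

This is why benchmark loops like the one above use power-of-two array sizes: wrapping around the array costs a single AND instruction instead of a division.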

Example 4: Instruction Parallelism
Now let's look at something else. In your opinion, which of these two loops will execute faster?

int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// first loop
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// second loop
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

The second loop turns out to be roughly twice as fast. In the first loop both increments target a[0], so each operation depends on the result of the previous one; in the second loop the increments of a[0] and a[1] are independent, and a modern superscalar core can execute them in parallel.

Example 5: Cache Associativity
One of the key questions that must be answered when designing a cache is whether data from a certain memory region can be stored in any cache cell or only in some of them. There are three possible approaches:

  1. Direct-mapped cache: each line of RAM can be stored in only one predefined cache cell. The simplest mapping is: line_index_in_memory % number_of_cache_cells. Two lines mapped to the same cell cannot be in the cache at the same time.
  2. N-way set-associative cache: each line can be stored in one of N different cache cells. For example, in a 16-way cache, a line may be stored in any of the 16 cells that make up its set. Typically, lines whose indices share the same least significant bits share one set.
  3. Fully associative cache: any line can be stored in any cache cell. In behavior it is equivalent to a hash table.
Direct-mapped caches are prone to conflicts: when two lines compete for the same cell, they continually evict each other and efficiency drops sharply. Fully associative caches are free of this drawback, but are very complex and expensive to implement. Set-associative caches are the typical trade-off between implementation complexity and efficiency. For example, on my machine the 4 MB L2 cache is 16-way set-associative. All of RAM is divided into sets of lines by the least significant bits of their indices; lines from each set compete for one group of 16 L2 cache cells.

Since the L2 cache has 65,536 cells (4 × 2^20 / 64) and each set consists of 16 cells, there are 4,096 sets in total. Thus, the lower 12 bits of the line index determine which set the line belongs to (2^12 = 4,096). As a result, lines whose addresses differ by a multiple of 262,144 bytes (4,096 × 64) share the same set of 16 cells and compete for space in it.
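This mapping can be verified with a short Python sketch, plugging in the figures from my machine (4 MB cache, 16-way, 64-byte lines):

```python
LINE = 64                      # cache line size in bytes
CACHE = 4 * 2**20              # 4 MB L2 cache
WAYS = 16                      # 16-way set-associative
CELLS = CACHE // LINE          # number of cache cells
SETS = CELLS // WAYS           # number of sets

def set_index(addr):
    """The low 12 bits of the line index select the set."""
    return (addr // LINE) % SETS

assert CELLS == 65536 and SETS == 4096
# Addresses a multiple of SETS * LINE = 262,144 bytes apart map to the
# same set and therefore compete for the same 16 cells.
assert set_index(0) == set_index(262144) == set_index(5 * 262144)
assert set_index(0) != set_index(262144 + LINE)
```

The experiment below does exactly this in hardware: it hammers many lines whose addresses are 262,144 bytes apart and watches the cache run out of ways.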

For the effects of associativity to show up, we need to repeatedly access a large number of lines from the same set, for example using the following code:

public static long UpdateEveryKthByte(byte[] arr, int K)
{
    const int rep = 1024 * 1024; // number of iterations
    Stopwatch sw = Stopwatch.StartNew();
    int p = 0;
    for (int i = 0; i < rep; i++)
    {
        arr[p]++;
        p += K;
        if (p >= arr.Length) p = 0;
    }
    sw.Stop();
    return sw.ElapsedMilliseconds;
}

The method increments every Kth element of the array; when it reaches the end, it starts over from the beginning. After a fairly large number of iterations (2^20) it stops. I made runs for different array sizes and values of the step K. Results (blue - long running time, white - short):

Blue areas correspond to cases where, as the data is updated over and over, the cache cannot hold all the required data at once. Bright blue indicates a running time of about 80 ms; nearly white, about 10 ms.

Let's deal with the blue areas:

  1. Why do vertical lines appear? They correspond to step values at which too many lines (more than 16) from one set are accessed. For these values, my machine's 16-way cache cannot hold all the necessary data.

    Some of the bad stride values are powers of two: 256 and 512. For example, consider stride 512 and an 8 MB array. With this stride, the array contains 32 regions (8 × 2^20 / 262,144) that compete with each other for cells in 512 cache sets (262,144 / 512). There are 32 regions, but the cache has only 16 cells per set, so there is not enough room for all of them.

    Other stride values that are not powers of two are simply unlucky, causing a large number of accesses to the same cache sets and producing the vertical blue lines in the figure. At this point, lovers of number theory are invited to ponder why.

  2. Why do the vertical lines break off at the 4 MB boundary? With an array of 4 MB or less, the 16-way cache behaves like a fully associative one, i.e. it can accommodate all of the array's data without conflicts: no more than 16 regions are fighting for any one cache set (262,144 × 16 = 4 × 2^20 = 4 MB).
  3. Why is there a big blue triangle at the top left? Because with a small stride and a large array, the cache simply cannot hold all the required data. The degree of associativity plays a secondary role here; the limit is the size of the L2 cache itself. For example, with a 16 MB array and stride 128, we access every 128th byte, thus modifying every other cache line of the array. Storing every other line requires 8 MB of cache, but my machine has only 4 MB.

    Even if the cache were fully associative, it would not allow 8 MB of data to be stored in it. Note that in the already discussed example with a stride of 512 and an array size of 8 MB, we only need 1 MB of cache to store all the necessary data, but this is impossible due to insufficient cache associativity.

  4. Why does the left side of the triangle gradually gain intensity? The maximum intensity occurs at a stride of 64 bytes, equal to the cache line size. As we saw in the first two examples, sequential accesses within the same line cost practically nothing: with a 16-byte stride, we get four memory accesses for the price of one. Since the number of iterations is the same for every stride value, a cheaper stride yields a shorter running time.
The discovered effects persist at large parameter values:

Cache associativity is an interesting thing that can manifest itself under certain conditions. Unlike the other problems discussed in this article, it is not so serious. It's definitely not something that requires constant attention when writing programs.

Example 6: False Cache Line Sharing
On multi-core machines, you may encounter another problem - cache coherence. Processor cores have partially or completely separate caches. On my machine, the L1 caches are separate (as usual), and there are also two L2 caches shared by each pair of cores. The details may vary, but in general, modern multi-core processors have multi-level hierarchical caches. Moreover, the fastest, but also the smallest caches belong to individual cores.

When one core modifies a value in its cache, other cores can no longer use the old value. The value in the caches of other cores must be updated. Moreover, the entire cache line must be updated, since caches operate on line-level data.

Let's demonstrate this problem with the following code:

private static int[] s_counter = new int[1024];

private void UpdateCounter(int position)
{
    for (int j = 0; j < 100000000; j++)
    {
        s_counter[position] = s_counter[position] + 3;
    }
}

If on my four-core machine I call this method with parameters 0, 1, 2, 3 simultaneously from four threads, the running time is 4.3 seconds. But if I call it with parameters 16, 32, 48, 64, the running time is only 0.28 seconds. Why? In the first case, all four values processed by the threads are likely to end up in one cache line. Each time one core increments its value, it marks the cache cells containing that line in the other cores as invalid, and those cores have to fetch the line again. This defeats the caching mechanism and kills performance.
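Why positions 0, 1, 2, 3 conflict while 16, 32, 48, 64 do not comes down to simple cache-line arithmetic. A small Python sketch (assuming 4-byte ints and 64-byte lines, as elsewhere in this article):

```python
LINE = 64  # cache line size in bytes
INT = 4    # size of an int in bytes

def line_of(index):
    """Which cache line the array element s_counter[index] lives on."""
    return index * INT // LINE

# Positions 0..3 share one line: every increment by one core
# invalidates the other cores' copies of that line.
assert len({line_of(i) for i in (0, 1, 2, 3)}) == 1
# Positions 16, 32, 48, 64 land on four different lines: no false sharing.
assert len({line_of(i) for i in (16, 32, 48, 64)}) == 4
```

A practical rule of thumb follows directly: counters updated by different threads should be padded or spaced at least one cache line apart.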

Example 7: Hardware Complexity
Even now, when the principles of cache operation are no secret to you, the hardware will still give you surprises. Processors differ from each other in optimization methods, heuristics and other implementation subtleties.

The L1 cache of some processors can service two accesses in parallel if they fall into different groups, but only sequentially if they fall into the same one. As far as I know, some can even access different quarters of the same cache line in parallel.

Processors may surprise you with clever optimizations. For example, the code from the previous example about false sharing does not work on my home computer as intended - in the simplest cases, the processor can optimize the work and reduce the negative effects. Modify the code a little and everything falls into place. Here is another example of strange hardware quirks:

private static int A, B, C, D, E, F, G;

private static void Weirdness()

{

for (int i = 0; i < 200000000; i++)
{
    // increment some of the fields A..G here
}
}

If you substitute three different loop bodies, you can get the following results:

Incrementing fields A, B, C, D takes longer than incrementing fields A, C, E, G. What's even weirder is that incrementing fields A and C takes longer than incrementing fields A, C, E, and G. I don't know for sure what causes this, but perhaps it is related to memory banks (yes, ordinary memory banks, not the three-liter jars people keep their savings in). If you have any thoughts on this, please share them in the comments.

On my machine, the above is not observed, however, sometimes there are abnormally bad results - most likely, the task scheduler makes its own “adjustments”.

The lesson to be learned from this example is that it is very difficult to completely predict the behavior of hardware. Yes, you can predict a lot, but you need to continually validate your predictions through measurement and testing.

Conclusion
I hope that everything discussed above has helped you understand how processor caches are designed. Now you can put this knowledge into practice to optimize your code.

One of the important factors that increases processor performance is the presence of cache memory, or rather its volume, access speed and distribution among levels.

For quite some time now, almost all processors have been equipped with this type of memory, which once again proves the usefulness of its presence. In this article, we will talk about the structure, levels, and practical purpose of cache memory, a very important processor characteristic.

What is cache memory and its structure

Cache memory is ultra-fast memory used by the processor to temporarily store data that is most frequently accessed. This is how we can briefly describe this type of memory.

Cache memory is built on flip-flops, which in turn consist of transistors. A group of transistors takes up much more space than the capacitors that make up RAM. This entails many manufacturing difficulties, as well as limits on capacity. That is why cache memory is very expensive memory available only in tiny volumes. But this structure is also the source of its main advantage: speed. Since flip-flops need no refresh, and the propagation delay of the gates they are built from is small, a flip-flop switches from one state to another very quickly. This allows cache memory to operate at the same frequencies as modern processors.

Also, an important factor is the placement of the cache memory. It is located on the processor chip itself, which significantly reduces access time. Previously, cache memory of some levels was located outside the processor chip, on a special SRAM chip somewhere on the motherboard. Now, almost all processors have cache memory located on the processor chip.


What is processor cache used for?

As mentioned above, the main purpose of cache memory is to store data that the processor uses frequently. The cache is a buffer into which data is loaded, and despite its small size (about 4-16 MB in modern processors), it gives a significant performance boost in any application.

To better understand the need for cache memory, let's imagine organizing a computer's memory like an office. The RAM will be a cabinet with folders that the accountant periodically accesses to retrieve large blocks of data (that is, folders). And the table will be a cache memory.

There are elements that are placed on the accountant’s desk, which he refers to several times over the course of an hour. For example, these could be phone numbers, some examples of documents. These types of information are located right on the table, which, in turn, increases the speed of access to them.

In the same way, data can be added from those large data blocks (folders) to the table for quick use, for example, a document. When this document is no longer needed, it is placed back in the cabinet (into RAM), thereby clearing the table (cache memory) and freeing this table for new documents that will be used in the next period of time.

Cache memory also works ahead: if some data is likely to be accessed again, it is loaded from RAM into the cache. Very often this happens by loading, together with the current data, the data that is most likely to be needed next. That is, the hardware makes predictions about what will be used "after" - and these prefetching heuristics are quite complex.
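The "keep what was used recently close at hand" policy from the office analogy can be sketched as a tiny least-recently-used (LRU) cache in Python. This is a simplified software model for intuition only - real hardware caches use fixed sets and approximate (pseudo-LRU) replacement, not a full ordered structure like this:

```python
from collections import OrderedDict

class LRUCache:
    """Toy cache: keeps the most recently used items, evicts the oldest."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def access(self, key, load):
        """Return (value, hit). `load` plays the role of slow RAM."""
        if key in self.data:                 # cache hit: refresh recency
            self.data.move_to_end(key)
            return self.data[key], True
        value = load(key)                    # cache miss: fetch from "RAM"
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict least recently used
        return value, False

ram = {k: k * 10 for k in range(100)}        # pretend this is main memory
cache = LRUCache(capacity=2)
_, hit = cache.access(1, ram.get); assert not hit  # first touch: miss
_, hit = cache.access(1, ram.get); assert hit      # reused soon: hit
cache.access(2, ram.get); cache.access(3, ram.get) # capacity 2: key 1 evicted
_, hit = cache.access(1, ram.get); assert not hit
```

Like the accountant's desk, the model keeps hot items nearby and sends the stalest one back to the "cabinet" when space runs out.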

Processor cache levels

Modern processors are equipped with a cache, which often consists of 2 or 3 levels. Of course, there are exceptions, but this is often the case.

In general, there can be the following levels: L1 (first level), L2 (second level), L3 (third level). Now a little more detail on each of them:

First level cache (L1) - the fastest cache level, working directly with the processor core. Thanks to this tight coupling, it has the shortest access time and operates at frequencies close to the processor's. It serves as a buffer between the processor and the second-level cache.

We will consider the volumes using a high-performance processor, the Intel Core i7-3770K. This processor is equipped with four 32 KB L1 caches: 4 × 32 KB = 128 KB (32 KB per core).

Second level cache (L2) - the second level is larger than the first, but slower as a result. Accordingly, it serves as a buffer between the L1 and L3 levels. In our Core i7-3770K example, the L2 cache size is 4 × 256 KB = 1 MB.

Level 3 cache (L3) - the third level, again slower than the previous two, but still much faster than RAM. The L3 cache in the i7-3770K is 8 MB. While the previous two levels are private to each core, this level is shared by the entire processor. That figure is quite solid, but not exorbitant: for Extreme-series processors such as the i7-3960X it is 15 MB, and some new Xeon processors have more than 20 MB.



 

