The eight-byte size can reduce conflicts in double precision applications; four bytes is the default setting. When I tested this and set it to eight bytes, nvvp still showed me a bank size of four, so I am not sure whether this actually does anything. The available amount of shared memory and how it can be configured depend on the GPU's compute capability. The most common values are either 64 KB or 96 KB per streaming multiprocessor. A table of the maximum sizes of all memory types, and a lot more information on the available compute capability versions, can be found here.
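If you want to experiment with this yourself, a minimal sketch of setting and reading back the bank size could look like the one below (on GPUs where the bank size is not configurable the call is accepted but has no effect, which may be exactly what nvvp was showing):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Ask for eight-byte banks, which can help with double precision data.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Read the setting back; on GPUs that ignore the hint this may still
    // report four bytes, matching the nvvp observation above.
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("bank size: %s\n",
           cfg == cudaSharedMemBankSizeEightByte ? "eight bytes" : "four bytes");
    return 0;
}
```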
From there we can see that the shared memory region is quite small, so we have to think carefully about which data we load into it. You can check out your GPU's limits with the deviceQuery sample, which ships with the CUDA samples. On a lot of GPUs the shared memory and the L1 cache live in the same on-chip region and share a fixed amount of memory. Here we can configure how this space should be divided between the two.
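If you do not want to run deviceQuery, the same limits can be queried programmatically. A small sketch, assuming device 0, could look like this:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("L2 cache size:           %d bytes\n",  prop.l2CacheSize);
    return 0;
}
```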
Starting with Turing cards (compute capability 7.5) the driver chooses this split automatically. However, we can still give it a hint about what to prefer. This can be done for each individual kernel with the cudaFuncSetAttribute function. A legacy way of setting the ratio is the function cudaFuncSetCacheConfig, which configures a fixed preference between L1 and shared memory.
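A sketch of both variants, using a hypothetical kernel called myKernel, might look like this:

```cuda
__global__ void myKernel(float *data) { /* ... */ }

void configureCacheSplit()
{
    // Per-kernel hint: prefer to carve out 50% of the unified L1/shared
    // memory region as shared memory (the driver may still override this).
    cudaFuncSetAttribute(myKernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 50);

    // Legacy variant: a fixed preference for shared memory over L1.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
}
```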
More information can be found here. There are two ways in which we can allocate shared memory: static and dynamic. If we know the amount of required shared memory at compile time, we can use static shared memory; it always has to be declared inside a kernel, as in the first snippet below. If the size of the data is unknown at compile time, we have the option to allocate the memory dynamically. For this we have to calculate the size of the shared memory chunk in bytes before calling the kernel and then pass it to the kernel, as in the second snippet below.
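A sketch of both variants follows; the kernel names and buffer sizes are only placeholders:

```cuda
// Static allocation: the size must be a compile-time constant and the
// declaration lives inside the kernel (here: one float per thread of a
// 256-thread block).
__global__ void staticKernel(float *in)
{
    __shared__ float tile[256];
    tile[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    // ... work on the tile ...
}

// Dynamic allocation: compute the size in bytes on the host and pass it as
// the third execution configuration parameter of the launch.
__global__ void dynamicKernel(float *in, int n);   // defined further below

void launchDynamic(float *d_in, int n)
{
    size_t smemBytes = n * sizeof(float);  // size of the shared memory chunk in bytes
    dynamicKernel<<<1, n, smemBytes, nullptr>>>(d_in, n);
}
```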
The fourth launch argument, here nullptr, can be used to pass a CUDA stream to the kernel. Streams enable more parallelism in certain situations.
We will have a separate article on this in the future. Streams are clearly outside the scope of shared memory and too big of a topic to cover here. Now, inside the kernel we have to declare the shared memory slightly differently than before, as shown in the snippet after this paragraph. This gives you the opportunity to test for yourself how much performance impact you get from bank conflicts. To be free of bank conflicts, data must be read from shared memory by a warp from a contiguous section.
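Inside the kernel the dynamically sized buffer is declared as an unsized extern array; a sketch matching the launch above might look like this:

```cuda
__global__ void dynamicKernel(float *in, int n)
{
    extern __shared__ float tile[];   // unsized: the size comes from the launch
    int i = threadIdx.x;
    if (i < n) {
        tile[i] = in[i];
    }
    __syncthreads();
    // ... work on the tile ...
}
```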
This means that if each thread in a warp reads one value from shared memory, the values have to be stored in adjacent memory cells, i.e. in different banks. If several threads read from the same cell, the value is broadcast, which is not a conflict. In this example we have two kernels, one with statically and one with dynamically allocated shared memory; the dynamic variant is the one to use when the amount of shared memory is not known at compile time.
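The statically allocated kernel of that example might look roughly like this (the fixed buffer of 64 elements assumes the array being reversed fits into it and that the kernel is launched with n threads):

```cuda
__global__ void staticReverse(int *d, int n)
{
    __shared__ int s[64];   // fixed size, known at compile time
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();        // make sure every element is in shared memory
    d[t] = s[tr];
}
```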
In this case the shared memory allocation size per thread block must be specified in bytes using an optional third execution configuration parameter, as in the following excerpt. The dynamic shared memory kernel, dynamicReverse, declares the shared memory array using an unsized extern array syntax, extern __shared__ int s[] (note the empty brackets and the use of the extern specifier).
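A sketch of the dynamic version and its launch, under the same assumptions as above:

```cuda
__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];   // empty brackets: size is set by the launch
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];
    __syncthreads();
    d[t] = s[tr];
}

void launchDynamicReverse(int *d_d, int n)
{
    // Third execution configuration parameter: shared memory size in bytes.
    dynamicReverse<<<1, n, n * sizeof(int)>>>(d_d, n);
}
```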
The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse kernel.
What if you need multiple dynamically sized arrays in a single kernel? You must declare a single extern unsized array as before and use pointers into it to divide it into multiple arrays, as in the excerpt below. To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously.
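The excerpt for carving several arrays out of one dynamic allocation might look like this (nI, nF and nC are the element counts of the three sub-arrays):

```cuda
__global__ void multiArrayKernel(int nI, int nF, int nC)
{
    extern __shared__ int s[];

    int   *integerData = s;                          // nI ints
    float *floatData   = (float *)&integerData[nI];  // nF floats
    char  *charData    = (char  *)&floatData[nF];    // nC chars

    // ... use the three views; the launch must request
    // nI*sizeof(int) + nF*sizeof(float) + nC*sizeof(char) bytes ...
}
```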
Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank. If multiple addresses map to the same bank, however, the accesses are serialized: the hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, which results in a broadcast.
Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, so several threads of a warp reading the same location are served simultaneously. To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks, and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16; a shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp.
Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads. For devices of compute capability 2.x and higher, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as it is on devices of compute capability 1.x, so bank conflicts can occur between threads in the first half and the second half of the same warp.
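To make the mapping concrete, here is a small sketch that produces a conflict-free and a heavily conflicting access pattern on a 32-bank device, assuming a 32x32 thread block:

```cuda
__global__ void bankConflictDemo(float *out)
{
    __shared__ float tile[32][32];   // one 32-bit word per bank along a row

    int x = threadIdx.x;             // 0..31 within a warp
    int y = threadIdx.y;
    tile[y][x] = (float)x;           // row write: consecutive words, no conflict
    __syncthreads();

    float row = tile[y][x];          // each thread of the warp hits a different bank
    float col = tile[x][y];          // column read: stride of 32 words, so every
                                     // thread hits the same bank (32-way conflict)
    out[y * 32 + x] = row + col;

    // A common fix is to pad the inner dimension: __shared__ float tile[32][33];
    // the extra column shifts each row by one bank and removes the conflict.
}
```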
Devices of compute capability 3.x have a configurable bank size, which can be set to either four bytes (the default) or eight bytes; setting it to eight bytes can help avoid shared memory bank conflicts when accessing double precision data.

A question from a forum thread illustrates the per-block limit: I am trying to solve this problem myself but I can't, so I would like your advice. I am writing kernel code like this.
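The original snippet is not preserved; as a purely hypothetical illustration, a kernel that statically requests far more shared memory than the per-block limit would trigger such an error (in this case already when ptxas compiles the kernel):

```cuda
__global__ void tooMuchShared(float *out)
{
    // 64 K floats = 256 KB of static shared memory, far above the per-block
    // limit of most devices, so the build (or launch) fails with an error.
    __shared__ float buffer[64 * 1024];
    buffer[threadIdx.x] = (float)threadIdx.x;
    out[threadIdx.x] = buffer[threadIdx.x];
}
```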
And I know the GTX has 16 processors, and a thread block is executed on a processor. But my program produces an error. I don't really understand what the maximum size of shared memory per block means. Please give me advice.

Thank you very much, that's the answer I wanted. And I have one more question.
I'm sorry, but is it possible to reuse the shared memory space? I mean, can thread blocks that will be executed on a processor use the shared memory space that was used by previous thread blocks which have completed their job?