cuda,gpu,shared-memory,bank-conflict , purposely causing bank conflicts for shared memory on CUDA device

purposely causing bank conflicts for shared memory on CUDA device


Tag: cuda,gpu,shared-memory,bank-conflict

It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program

#include <cuda_runtime.h>
#include <stdio.h>

#define nblc 13
#define nthr 1024

//[email protected]

__device__ int inwarpD[nblc];

__global__ void kernel(){
__shared__ int mywarp;

for (int i=0;i<5;i++) mywarp += (10000*threadIdx.x+1);

//[email protected]

int main(int argc, char **argv){
int inwarpH[nblc];

kernel<<<nblc, nthr>>>();

cudaMemcpyFromSymbol(inwarpH, inwarpD, nblc*sizeof(int), 0, cudaMemcpyDeviceToHost);

for (int i=0;i<nblc;i++) printf("%i : %i\n",i, inwarpH[i]);

and ran it on K80 GPU. Since several threads are having access to the same shared memory variable I was expecting that this variable will be updated 5*nthr times, albeit not at the same cycle because of the bank conflict. However, the output indicates that mywarp shared variable was updated only 5 times. For each blocks different threads accomplished this task:

0 : 35150005
1 : 38350005
2 : 44750005
3 : 38350005
4 : 51150005
5 : 38350005
6 : 38350005
7 : 38350005
8 : 51150005
9 : 44750005
10 : 51150005
11 : 38350005
12 : 38350005

Instead, I was expecting


for each block. Can someone kindly explain me where my understanding of shared memory fails. I would like also to know how would it be possible that all threads in one block access (update) the same shared variable. I know it is not possible at the same MP cycle. Serialisation is fine for me because it is going to be a rare event.


Lets walk through the ptx that it generates.

//Declare some registers
.reg .s32       %r<5>;
.reg .s64       %rd<4>;

// demoted variable
.shared .align 4 .u32 _Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp;

//load tid in register r1
mov.u32         %r1, %tid.x;

//multiple tid*5000+5 and store in r2
mad.lo.s32      %r2, %r1, 50000, 5;

//store result in shared memory
st.shared.u32   [_Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp], %r2;

bar.sync        0;

//load from shared memory and store in r3
ld.shared.u32   %r3, [_Z6kernelv$__cuda_local_var_35411_30_non_const_mywarp];

mov.u32         %r4, %ctaid.x;
mul.wide.u32    %rd1, %r4, 4;
mov.u64         %rd2, inwarpD;
add.s64         %rd3, %rd2, %rd1;

//store r3 in global memory   [%rd3], %r3;

So basically

for (int i=0;i<5;i++)
    mywarp += (10000*threadIdx.x+1);

is being optimized down to


so you're not experiencing a bank-conflict. You are experiencing a race-condition.


Wierd behavior with Visual Studio Debugger

I experienced some weird behavior with Visual Studio's Debugger when running VS with the Dedicated GPU. What is weird is that when I terminate the program I am building, the debugger stays on. I don't see this when running VS with the integrated graphics. Also - I checked if there...

Understanding Memory Replays and In-Flight Requests

I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...

Why use memset when using CUDA?

I saw in a CUDA code example that memset is used to initialize vectors to all 0's that will store the sum of two others vectors. For example: hostRef = (float *)malloc(nBytes); gpuRef = (float *)malloc(nBytes); memset(hostRef, 0, nBytes); memset(gpuRef, 0, nBytes); What purpose does this serve if nothing else...

Is prefix scan CUDA sample code in gpugems3 correct?

I've written a piece of code to call the kernel in gpugem3 but the results that I got is a bunch of negative numbers instead of prefix scan. I'm wondering if my kernel call is wrong or there is something wrong with the gpugem3 code? here is my code: #include...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...

direct global memory access using cuda

q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; q2- would i have to include pointer to global memory array as argument in...

Understanding Dynamic Parallelism in CUDA

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...

Faster Matrix Multiplication in CUDA

Currently, I made a neural networks program in the cuda c. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM. I use the following code for MM. I was wondering if any one has some advice to make it faster which can be very...

nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

When I compile the following code containing the design C++11, I get errors - it does not compile. I've tried with different flags, but I haven't found a solution. My setting: CUDA 6.5, gcc 4.4.7 I am not able to change the settings. How can I still make this work?...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: As you can see, when...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

I am building openMPI 1.8.5 on Ubuntu 12.04 with CUDA 6.5 installed and tested with default samples. I intend to run it on a single node with following configuration: Dell Precision T7400 Dual Xeon X5450 Nvidia GT730/Tesla C1060 The configure command issued was $ ./configure --prefix=/usr --with-cuda=/usr/local/cuda In the generated...

Tesla k20m interoperability with Direct3D 11

I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

Update a D3D9 texture from CUDA

I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...

Using a data pointer with CUDA (and integrated memory)

I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

cudaMemcpyToSymbol in pycuda

I am using pycuda and i would like to know if there is an equivalent to the function cudaMemcpyToSymbol I would like to copy a constant from the host to the device like below import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy from sys import path...

Do I need to free device_ptr returned by thrust?

I have a function to get the minimum value of an array and it's executed within a loop. thrust::device_ptr<float> min_ptr = thrust::min_element(populationFitness, populationFitness + POPULATION); Do I have to free the returned device_ptr? I tried with thrust::device_free(min_ptr) but an exception is thrown....

cudaMalloc vs cudaMalloc3D performance for a 2D array

I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...

Building a tiny R package with CUDA and Rcpp

I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/

CUDA: Group every n-th point of array passed to GPU

I am trying to implement k-means algorithm on CUDA using Tesla card on external Unix. I read input file and store coordinates of all data points in dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory....

cuda thrust: selective copying and resizing results

I am copying items selectively between two thrust device arrays using copy_if as follows: thrust::device_vector<float4> collated = thrust::device_vector<float4> original_vec.size()); thrust::copy_if(original_vec.begin(), original_vec.end(), collated.begin(), is_valid_pt()); collated.shrink_to_fit(); The is_valid_pt is implemented as: struct is_valid_kpt { __host__ __device__ bool operator()(const float4 x) { return x.w >= 0; } }; Now after running this code,...

Best way to achieve CUDA Vector Diagonalization

What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this? So far, I start of...

Practice computing grid size for CUDA

dim3 block(4, 2) dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y); I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed?...

Running CUDA programs on Quadro K620m

I have laptop which has Quadro K620m GPU. I am trying to learn CUDA programming and downloaded the network installer from NVIDIA site. During CUDA SDK installation, just when its checking the hardware of the machine, it displays Do you want to Continue? This graphics driver could not find compatible...

How many parallel threads i can run on my nvidia graphic card in cuda programming?

Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...

What is version of cuda for nvidia 304.125

I am using ubuntu 14.04. I want to install CUDA. But I don't know which version is good for my laptop. I trace my drive that is $cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...

Stream compaction with Thrust; best practices and fastest way?

I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...

How to pass struct containing array to the kernel in CUDA?

In the following code I have an array in a struct which I need to pass to the kernel function. I can't seem to find the proper way. I tried looking at other posts on SO but do not understand their methods that well. In my actual code, I receive...

How does CUDA's cudaMemcpyFromSymbol work?

I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

Linear algebra libraries and dynamic parallelism in CUDA

With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...

purposely causing bank conflicts for shared memory on CUDA device

It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 //[email protected] __device__ int inwarpD[nblc]; __global__ void kernel(){...

NVCC CUDA cross compiling cannot find “-lcudart”

I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA$ nvcc -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/ when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...

How to load data in global memory into shared memory SAFELY in CUDA?

My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...

cuda-memcheck fails to detect memory leak in an R package

I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. $ cuda-memcheck --leak-check full...

Threads syncronization in CUDA

I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...


I have a very simple scala jcuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...

Access violation reading location when calling cudaMemcpy2DToArray

I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...

Reduce by key on device array

I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...

How do you build the example CUDA Thrust device sort?

I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the THURST install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...

CUDA strange behavior accessing vector

I have implemented a simple fft program in cuda. This is the kernel function: __global__ void fftKernel(cuComplex* dev_samples, size_t length, size_t llog, Direction direction) { int tid = threadIdx.x + blockDim.x * blockIdx.x; if (tid < length / 2) { // First step, sorts data with bit reversing and compute...

Why the CPU compiles the GPU shader?

To understand in general how GPU's cache support work, I read some information and understood this: CPU compiles shader and transmit resulting code of shader to GPU to execute and also save it to the disk. If necessary to execute the same shader, GPU get it saved binary code directly...

how to generalize square matrix multiplication to handle arbitrary dimensions

I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...

Cuda cub:Device Scan

I'm using cub to implement device scan. When I run the default example for device scan I keep getting : identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined Does anyone have any idea about this problem? Thanks,...

Can an unsigned long long int be used to store the output from clock64()?

I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?

How can I pass a struct to a kernel in JCuda

I have already looked at this which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...

thrust exception bulk_kernel_by_value in transform_reduce

I'm working on a optimization problem which contains various math functions which resembles in similar form, so I warp them in a FunctionObj template <typename T> struct FunctionObj { T a; FunctionObj(): a(1) { } }; And defines a FuncEval to evaluate template <typename T> __host__ __device__ inline T FuncEval(const...