How shared memory works on CUDA devices is a mystery to me. I was curious to count the threads that have access to the same shared memory. For this I wrote a simple program: #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 __device__ int inwarpD[nblc]; __global__ void kernel(){...

I've been working on optimizing some code and ran into an issue with the shared memory bank conflict report in the CUDA Nsight performance analysis. I was able to reduce it to a very simple piece of code that Nsight reports as having a bank conflict, when it doesn't seem...

I'm developing a face detection app on the Android platform using OpenCL. The face detection algorithm is based on the Viola-Jones algorithm. I tried to write the kernel code for the cascade classification step, and I placed the classifier data of cascade stage 1 in local memory (__local) because the classifier data are used for all...