cuda,nvidia , Practice computing grid size for CUDA

## Question:

Tag: cuda,nvidia
``````dim3 block(4, 2)
dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y);
``````

I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. `nx` is the number of columns and `ny` is the number of rows.

Can you explain how the grid size is computed? Why is `block.x` added to `nx` and then subtracted by `1`?

There is a preview (https://books.google.com/books?id=_Z7rnAEACAAJ&printsec=frontcover#v=onepage&q&f=false) but page 53 is missing.

This is the standard CUDA idiom for determining the minimum number of blocks in each dimension (the "grid") that completely cover the desired input. This could be expressed as `ceil(nx/block.x)`, that is, figure out how many blocks are needed to cover the desired size, then round up.

But full floating point division and ceil is more expensive than necessary. Instead, since C defines integer division as a "floor" operation, you can add the divisor - 1 before dividing to the get the effect of a "ceiling" operation.

Try a few examples: If `nx = 10`, then `nx + block.x - 1` is 13, and by integer divison, you need 3 blocks of size 4.

As you noted in the comment, +block.x pushes up floor to ceiling and the -1 is for numbers that divide perfectly into the divisor. e.g. (12 + 4)/4 would be 4 when we actually want (12+4-1)/4 which 3

# Related:

## How many parallel threads i can run on my nvidia graphic card in cuda programming?

cuda
Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...

## How do you build the example CUDA Thrust device sort?

c++,visual-studio-2010,sorting,cuda,thrust
I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the THURST install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...

## direct global memory access using cuda

c++,cuda
q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; q2- would i have to include pointer to global memory array as argument in...

## cudaMemcpyToSymbol in pycuda

python,cuda,pycuda
I am using pycuda and i would like to know if there is an equivalent to the function cudaMemcpyToSymbol I would like to copy a constant from the host to the device like below import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy from sys import path...

## NVCC CUDA cross compiling cannot find “-lcudart”

linux,cuda,ld,nvcc
I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA\$ nvcc helloworld.cu -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/libcudart.so when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...

## Cuda cub:Device Scan

cuda,gpu,nvcc,cub,scan
I'm using cub to implement device scan. When I run the default example for device scan I keep getting : identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined Does anyone have any idea about this problem? Thanks,...

## OpenGL: GL_FRAMEBUFFER_UNSUPPORTED on specific combinations of framebuffer attachments

c++,opengl,textures,nvidia,framebuffer
Im trying to attach multiple targets to a framebuffer object. I have the following problem: There is no error, when using float texture attachments and a depth attachment. There is also no error, when using float texture attachments and integer texture attachments. Although these combinations work, I cant use float,...

## cuda-memcheck fails to detect memory leak in an R package

r,memory-leaks,cuda,valgrind
I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. \$ cuda-memcheck --leak-check full...

## Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla
I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

## Practice computing grid size for CUDA

cuda,nvidia
dim3 block(4, 2) dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y); I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed?...

## Best way to achieve CUDA Vector Diagonalization

matrix,cuda
What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this? So far, I start of...

## CUDA: Group every n-th point of array passed to GPU

c++,c,arrays,cuda
I am trying to implement k-means algorithm on CUDA using Tesla card on external Unix. I read input file and store coordinates of all data points in dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory....

## Can an unsigned long long int be used to store the output from clock64()?

cuda
I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?

## CUDA strange behavior accessing vector

c++,cuda
I have implemented a simple fft program in cuda. This is the kernel function: __global__ void fftKernel(cuComplex* dev_samples, size_t length, size_t llog, Direction direction) { int tid = threadIdx.x + blockDim.x * blockIdx.x; if (tid < length / 2) { // First step, sorts data with bit reversing and compute...

## thrust exception bulk_kernel_by_value in transform_reduce

c++,c++11,cuda
I'm working on a optimization problem which contains various math functions which resembles in similar form, so I warp them in a FunctionObj template <typename T> struct FunctionObj { T a; FunctionObj(): a(1) { } }; And defines a FuncEval to evaluate template <typename T> __host__ __device__ inline T FuncEval(const...

## onSensorChanged() not fired on Android (nVidia Shield Tablet)

android,nvidia,android-sensors
I'm trying to create an augmented reality app for the nVidia shield. I tried my app on another Android device and it works. Unfortunately, on the shield, the onSensorChanged event is not fired. Here's my code: _sensorEventListener = new SensorEventListener() { public void onSensorChanged(SensorEvent event) { AndroidAttitude.this.processSensorEvent(event); } public void...

## purposely causing bank conflicts for shared memory on CUDA device

cuda,gpu,shared-memory,bank-conflict
It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 //[email protected] __device__ int inwarpD[nblc]; __global__ void kernel(){...

## Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum
I've written a piece of code to call the kernel in gpugem3 but the results that I got is a bunch of negative numbers instead of prefix scan. I'm wondering if my kernel call is wrong or there is something wrong with the gpugem3 code? here is my code: #include...

## How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda
I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...

## 'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

c,cuda
I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...

## CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas
I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

## how to generalize square matrix multiplication to handle arbitrary dimensions

c,cuda,parallel-processing,matrix-multiplication
I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...

## Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

cuda
There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: http://i.stack.imgur.com/rT676.png As you can see, when...

I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...

## How does CUDA's cudaMemcpyFromSymbol work?

cuda
I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

## Why use memset when using CUDA?

c,cuda,nvidia
I saw in a CUDA code example that memset is used to initialize vectors to all 0's that will store the sum of two others vectors. For example: hostRef = (float *)malloc(nBytes); gpuRef = (float *)malloc(nBytes); memset(hostRef, 0, nBytes); memset(gpuRef, 0, nBytes); What purpose does this serve if nothing else...

## cuda thrust: selective copying and resizing results

cuda,thrust
I am copying items selectively between two thrust device arrays using copy_if as follows: thrust::device_vector<float4> collated = thrust::device_vector<float4> original_vec.size()); thrust::copy_if(original_vec.begin(), original_vec.end(), collated.begin(), is_valid_pt()); collated.shrink_to_fit(); The is_valid_pt is implemented as: struct is_valid_kpt { __host__ __device__ bool operator()(const float4 x) { return x.w >= 0; } }; Now after running this code,...

## Understanding Memory Replays and In-Flight Requests

caching,cuda
I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...

## Reduce by key on device array

cuda,parallel-processing,thrust
I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...

## Understanding Dynamic Parallelism in CUDA

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...

## Building a tiny R package with CUDA and Rcpp

r,cuda,rcpp
I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/rcppcuda.cu...

## Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

ubuntu,cuda,mpi
I am building openMPI 1.8.5 on Ubuntu 12.04 with CUDA 6.5 installed and tested with default samples. I intend to run it on a single node with following configuration: Dell Precision T7400 Dual Xeon X5450 Nvidia GT730/Tesla C1060 The configure command issued was \$ ./configure --prefix=/usr --with-cuda=/usr/local/cuda In the generated...

## Stream compaction with Thrust; best practices and fastest way?

c++,cuda,gpgpu,thrust,sparse-array
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...

## cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE

java,scala,ubuntu,cuda,jcuda
I have a very simple scala jcuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...

## Faster Matrix Multiplication in CUDA

c,cuda,matrix-multiplication
Currently, I made a neural networks program in the cuda c. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM. I use the following code for MM. I was wondering if any one has some advice to make it faster which can be very...

## Point Grey Bumblebee2 firewire 1394 with Nvidia Jetson TK1 board

opencv,ubuntu-14.04,nvidia
I have successfully interfaced Point Grey Bumblebee2 firewire1394 camera with Nvida Jetson TK1 board and I get the video using Coriander and video for Linux loop back device is working as well. But when I tried to access camera using OpenCV and Coriander at the same time, I have conflicts....

## How to pass struct containing array to the kernel in CUDA?

c,arrays,cuda
In the following code I have an array in a struct which I need to pass to the kernel function. I can't seem to find the proper way. I tried looking at other posts on SO but do not understand their methods that well. In my actual code, I receive...

## Access violation reading location when calling cudaMemcpy2DToArray

c++,arrays,opencv,cuda
I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...

## Update a D3D9 texture from CUDA

c#,cuda,sharpdx,direct3d9,managed-cuda
I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...

## What is version of cuda for nvidia 304.125

ubuntu,cuda,ubuntu-14.04,nvidia
I am using ubuntu 14.04. I want to install CUDA. But I don't know which version is good for my laptop. I trace my drive that is \$cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...

## Using a data pointer with CUDA (and integrated memory)

c++,memory-management,cuda
I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...

## cudaMalloc vs cudaMalloc3D performance for a 2D array

c,cuda
I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...

## How to load data in global memory into shared memory SAFELY in CUDA?

c++,cuda,shared-memory
My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...

## nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

c++11,gcc,cuda,future
When I compile the following code containing the design C++11, I get errors - it does not compile. I've tried with different flags, but I haven't found a solution. My setting: CUDA 6.5, gcc 4.4.7 I am not able to change the settings. How can I still make this work?...

## Firewire 1394 camera with OpenCV

c++,opencv,camera,nvidia,firewire
I am trying to run face detection demo from OpenCV using firewire 1394 camera. While doing so I got the following error. Unable to stop the stream.: Bad file descriptor In capture ... VIDIOC_STREAMON: Inappropriate ioctl for device Unable to stop the stream.: Inappropriate ioctl for device Here is the...

## Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu
With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...