Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum
I've written a piece of code to call the kernel in gpugems3, but the results I got are a bunch of negative numbers instead of a prefix scan. I'm wondering if my kernel call is wrong or if there is something wrong with the gpugems3 code. Here is my code: #include...

CUDA: Group every n-th point of array passed to GPU

c++,c,arrays,cuda
I am trying to implement k-means algorithm on CUDA using Tesla card on external Unix. I read input file and store coordinates of all data points in dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory....

using dictionary in pycuda

python,cuda,pycuda
I have a dictionary and I would like to know if it is possible to use it as a parameter of a kernel. For instance, I have the CUDA kernel signature __global__ void calTab(Tableaux), where Tableaux is a C structure corresponding to typedef struct { float *Tab1; float *Tab2; } Tableaux;...

Do I need to free device_ptr returned by thrust?

c++,pointers,cuda,thrust
I have a function to get the minimum value of an array and it's executed within a loop. thrust::device_ptr<float> min_ptr = thrust::min_element(populationFitness, populationFitness + POPULATION); Do I have to free the returned device_ptr? I tried with thrust::device_free(min_ptr) but an exception is thrown....

GPU Programming Strategy

c++,cuda,gpu
Recently, I am trying to program a type of neural network using c in CUDA. I have one basic question. For the programming, I can either use big arrays or different naming strategy. For example for the weights, I can put all the weights in one big array or use...

CUDA thread execution order

multithreading,cuda
I have the following C code for CUDA program: #include <stdio.h> #define NUM_BLOCKS 4 #define THREADS_PER_BLOCK 4 __global__ void hello() { printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x); } int main(int argc,char **argv) { // launch the kernel hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(); // force the printf()s to flush cudaDeviceSynchronize();...

Does cuda 7 fully support lambda on device code?

c++,c++11,lambda,cuda
If I use thrust::transform on thrust::host, the lambda usage is fine thrust::transform(thrust::host, a, a+arraySize,b,d,[](int a, int b)->int { return a + b; }); However, if I change the thrust::host to thrust::device, The code wouldn't pass the compiler. Here is the error on vs 2013 The closure type for a lambda...

Is there any way I can have a barrier within Device code that is controlled by Host?

cuda,barrier
For example, my code is something like this (but it doesn't work and the kernel stalls): __device__ __managed__ int x; __global__ void kernel() { // do something while(x == 1); // a barrier // do the rest } int main() { x = 1; kernel<<< 1, 1 >>>(); x =...

OpenGL 4.5 Buffer Texture : extensions support

opengl,cuda,textures,buffer,opengl-4
I use OpenGL Version 4.5.0, but somehow I can not make texture_buffer_object extensions work for me ("GL_EXT_texture_buffer_object" or "GL_ARB_texture_buffer_object"). I am quite new with OpenGL, but if I understand right, these extensions are quite old and even already included in the core functionality... I looked for the extensions with "OpenGL...

purposely causing bank conflicts for shared memory on CUDA device

cuda,gpu,shared-memory,bank-conflict
It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 //[email protected] __device__ int inwarpD[nblc]; __global__ void kernel(){...

In CUDA / Thrust, how can I access a vector element's neighbor during a for-each operation?

c++,cuda,thrust
I am trying to do some scientific simulation using Thrust library in CUDA, but I got stuck in the following operation which is basically a for-each loop: device_vector<float> In(N); for-each In(x) in In Out(x) = some_calculation(In(x-1),In(x),In(x+1)); end I have already looked up stackoverflow.com and find some similar questions: Similar questions...

Problems with floating-point additions. Ignoring some small values

math,cuda,floating-point
I'm looking up a book about CUDA. On the chapter which explains the floating points of CUDA, I found something odd. The book says that (1.00 * 1) + (1.00 * 1) + (1.00 * 0.01) + (1.00* 0.01) = 10. All the numbers are binaries. 0.01 refers to decimal...

Reading event counters with concurrent execution

cuda,profiling,nvidia
I am trying to read performance counters with nvprof while executing two kernels concurrently. nvprof --concurrent-kernels on --events fb_subp0_write_sectors ./myprogram However by doing this the kernel execution seems to serialize. What I want out of this is exactly how they perform when they are running concurrently. Is it possible at...

cudaMemcpy doesn't work in 64 bits

cuda,32bit-64bit
I made a very simple CUDA kernel which populates an array of 100 elements with f[i]=i (and checked using assert in another kernel that it had indeed done so). #include<stdio.h> #include<assert.h> //definizione di gpuErrchk __global__ void setToItself(int* vect){ vect[threadIdx.x] = threadIdx.x; } int main(){ int* a_d; gpuErrchk( cudaMalloc(&a_d, 100 *...

How many parallel threads can I run on my NVIDIA graphics card in CUDA programming?

cuda
Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...

Finding reason for the inaccurate results, copying code from research paper

cuda,parallel-processing,research
I am trying to replicate the linear programming solver that this person has made http://www.idi.ntnu.no/~elster/master-studs/spampinato/spampinato-linear-prog-gpu-report.pdf. First off, the device I am using is a Quadro FX 1800M with compute capability 1.2. My problem is that when I launch more than 22 threads per block then most of the time I get...

Making some, but not all, (CUDA) memory accesses uncached

caching,cuda,gpgpu
I just noticed it's at all possible to have (CUDA kernel) memory accesses uncached (see e.g. this answer here on SO). Can this be done... For a single kernel individually? At run time rather than at compile time? For writes only rather than for reads and writes? ...

Memory allocation on GPU for dynamic array of structs

c,struct,cuda,dynamic-memory-allocation
I have a problem with passing an array of structs to a GPU kernel. I based my code on this topic (cudaMemcpy segmentation fault) and wrote something like this: #include <stdio.h> #include <stdlib.h> struct Test { char *array; }; __global__ void kernel(Test *dev_test) { for(int i=0; i < 5; i++) { printf("Kernel[0][i]: %c...

how to generalize square matrix multiplication to handle arbitrary dimensions

c,cuda,parallel-processing,matrix-multiplication
I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...

Stream compaction with Thrust; best practices and fastest way?

c++,cuda,gpgpu,thrust,sparse-array
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...

How to load data in global memory into shared memory SAFELY in CUDA?

c++,cuda,shared-memory
My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...

Practice computing grid size for CUDA

cuda,nvidia
dim3 block(4, 2); dim3 grid((nx+block.x-1)/block.x, (ny+block.y-1)/block.y); I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed?...
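
The grid expression in this excerpt is a ceiling division: `(n + b - 1) / b` gives the smallest number of blocks of size `b` that covers `n` elements. Below is a minimal host-only sketch of that arithmetic. `Dim2`, `ceil_div`, `grid_for`, and `check_grid_examples` are hypothetical names introduced here for illustration (they are not from the book or the CUDA API), and `Dim2` merely stands in for `dim3` so the code compiles without nvcc.

```cpp
#include <cassert>

// Host-only stand-in for CUDA's dim3 (assumption: only x and y matter here).
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Ceiling division: smallest number of blocks of size d that covers n elements.
constexpr unsigned ceil_div(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Grid dimensions needed to cover an nx-by-ny domain with the given block shape.
constexpr Dim2 grid_for(unsigned nx, unsigned ny, Dim2 block) {
    return Dim2{ceil_div(nx, block.x), ceil_div(ny, block.y)};
}

// Small self-check using a 4x2 block, as in the quoted snippet.
inline void check_grid_examples() {
    const Dim2 block{4, 2};
    assert(grid_for(16, 16, block).x == 4);  // 16 columns / 4 threads = 4 blocks
    assert(grid_for(16, 16, block).y == 8);  // 16 rows / 2 threads = 8 blocks
    assert(grid_for(17, 16, block).x == 5);  // 17 columns need one extra, partly used block
}
```

When the domain size is not a multiple of the block size, the last block in each dimension is only partly used, so a kernel launched this way would normally guard its accesses with a bounds check on the computed index.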

cudaMemcpyToSymbol in pycuda

python,cuda,pycuda
I am using pycuda and i would like to know if there is an equivalent to the function cudaMemcpyToSymbol I would like to copy a constant from the host to the device like below import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy from sys import path...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

c,cuda
I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...

Using a data pointer with CUDA (and integrated memory)

c++,memory-management,cuda
I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...

Why my cuda program became slower after using 128 threads on blocks?

c++,cuda,tesla
I have a simple cuda application with the following code: #include <stdio.h> #include <sys/time.h> #include <stdint.h> __global__ void daxpy(int n, int a, int *x, int *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; y[i] = x[i]; int j; for(j = 0; j < 1024*10000; ++j) { y[i] += j%10; }...

Shuffle instruction in CUDA not working

c++,multithreading,cuda,shuffle
I have got problem with shuffle instruction in CUDA 5.0. This is snippet of my kernel. It is inside the loop. Print is there only for debug purpose because I can't use ordinary debugger: ... tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos); if (threadIdx.x == 0) { left = left_value[y_pos];...

Understanding Dynamic Parallelism in CUDA

multithreading,cuda
Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...

Function pointer (to other kernel) as kernel arg in CUDA

c++,cuda,function-pointers,gpgpu
With dynamic parallelism in CUDA, you can launch kernels on the GPU side, starting from a certain version. I have a wrapper function that takes a pointer to the kernel I want to use, and it either does this on the CPU for older devices, or on the GPU for...

How to run CUDA/OpenGL interop (particle) sample from a remote machine

ubuntu,cuda,x11,freeglut,nsight
I am trying to run the CUDA particle sample on a remote Ubuntu machine from a host ubuntu machine. I followed this tutorial: http://devblogs.nvidia.com/parallelforall/remote-application-development-nvidia-nsight-eclipse-edition/ and it runs on my host, but not on my remote machine. I get the following output in Nsight: CUDA Particles Simulation Starting... grid: 64 x...

Storing non-square images in a volume

cuda
I have been trying to store non-square images into a volume. My code works when I want to store, 512 times, a square image of size w512 x h512, but it does not work for a non-square image of size w512 x h1024. I get an error with a description...

Amount of cores per SM and threads per block in CUDA

cuda
As NVIDIA GPUs evolve, the number of cores per SM changes: in Fermi we have 32 of them, but in Maxwell the number is 128 according to the white papers. So, my questions are the following: Is it better to create grids with blocks containing 128 threads each? Will such code...

Why use max and min macro in __global__ kernel of CUDA not giving correct answer?

c++,cuda,macros
I was trying to write a simple CUDA function to blur images. I use my own max and min macros, defined as #define min(a, b) ((float)a > (float)b)? (float)b: (float)a #define max(a, b) ((float)a > (float)b)? (float)a: (float)b The part of the __global__ kernel is: float norm; float sum = 0;// when...
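
One likely pitfall with macros written this way is that neither the arguments nor the whole expansion is parenthesized, so surrounding operators can bind to only one branch of the conditional. The following host-only sketch illustrates that effect. `UNSAFE_MAX`, `SAFE_MAX`, and the two wrapper functions are hypothetical names chosen here to avoid clashing with the standard `min`/`max`, and whether this is the actual cause of the wrong answer in the question cannot be confirmed from the truncated excerpt.

```cpp
#include <cassert>

// Same shape as the macro in the excerpt: no parentheses around the
// arguments or around the expansion as a whole.
#define UNSAFE_MAX(a, b) ((float)a > (float)b)? (float)a: (float)b

// Fully parenthesized variant (illustrative fix).
#define SAFE_MAX(a, b) (((float)(a) > (float)(b)) ? (float)(a) : (float)(b))

inline float unsafe_max_plus_one(float x, float y) {
    // Expands to: (x > y) ? x : (y + 1.0f)
    // The "+ 1.0f" attaches to the false branch only.
    return UNSAFE_MAX(x, y) + 1.0f;
}

inline float safe_max_plus_one(float x, float y) {
    // Expands to: ((x > y) ? x : y) + 1.0f
    return SAFE_MAX(x, y) + 1.0f;
}

// Small self-check of the difference between the two expansions.
inline void check_macro_examples() {
    assert(unsafe_max_plus_one(3.0f, 2.0f) == 3.0f);  // the +1 is silently dropped
    assert(safe_max_plus_one(3.0f, 2.0f) == 4.0f);    // max(3, 2) + 1
}
```

In device code, the built-in `fminf` and `fmaxf` functions avoid the problem entirely, since they are ordinary functions and not textual substitutions.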

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas
I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

Can an unsigned long long int be used to store the output from clock64()?

cuda
I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?

Update a D3D9 texture from CUDA

c#,cuda,sharpdx,direct3d9,managed-cuda
I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...

MVAPICH on multi-GPU causes Segmentation fault

cuda,mvapich2
I'm using MVAPICH2 2.1 on a Debian 7 machine. It has multiple cards of Tesla K40m. The code is as follows. #include <cstdio> #include <cstdlib> #include <ctime> #include <cuda_runtime.h> #include <mpi.h> int main(int argc, char** argv) { MPI_Status status; int rank; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); cudaSetDevice(0); if (rank == 0)...

How does CUDA's cudaMemcpyFromSymbol work?

cuda
I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

Faster Matrix Multiplication in CUDA

c,cuda,matrix-multiplication
Currently, I made a neural networks program in the cuda c. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM. I use the following code for MM. I was wondering if any one has some advice to make it faster which can be very...

Template __host__ __device__ calling host defined functions

cuda
During implementation of CUDA code I often need some utility functions, which will be called from device and also from host code. So I declare these functions as __host__ __device__. This is OK, and possible device/host incompatibilities can be handled by #ifdef __CUDA_ARCH__. Problems come when the utility function is...

Threads synchronization in CUDA

c++,multithreading,cuda
I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...

cuda 7.0 installation compatible hardware not found

windows,cuda,installation
I have a GeForce 8400 GS and compatible updated driver 341.44 installed on my Windows 8.1 x64 machine. When I tried to install CUDA 7.0 it showed that it could not find compatible hardware but I went along and chose to not install the provided driver and looking at this...

Building a tiny R package with CUDA and Rcpp

r,cuda,rcpp
I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/rcppcuda.cu...

What is version of cuda for nvidia 304.125

ubuntu,cuda,ubuntu-14.04,nvidia
I am using ubuntu 14.04. I want to install CUDA. But I don't know which version is good for my laptop. I trace my drive that is $cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...

Returning to host code in pyCUDA after asynchronous kernel launch

python,python-2.7,cuda,ipython,pycuda
I am trying to launch a kernel in pyCUDA and then terminate the kernel by writing to a GPU global memory location. Here is a simple example kernel that I would like to be able to terminate at some point after it enters the infinite while loop: __global__ void countUp(u16...

Best way to achieve CUDA Vector Diagonalization

matrix,cuda
What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this? So far, I start of...

What's wrong with this simple cuda program?

cuda
I'm just trying to use CUDA to blank an image. But "before" and "after" I get the same original image. Can't figure out the problem. sumKernel.cu: #include "sumKernel.h" __global__ void _sumKernel(char *image, int width, int height, char *kernel, int kerwidth, int kerheight) { int idx = blockIdx.x * blockDim.x +...

cudaMalloc vs cudaMalloc3D performance for a 2D array

c,cuda
I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...

Extracting raw data from template for use in CUDA

c++,templates,opencv,cuda
The following code is a snippet from the PCL (point cloud) library. It calculates the integral sum of an image. template <class DataType, unsigned Dimension> class IntegralImage2D { public: static const unsigned dim_fst = Dimension; typedef cv::Vec<typename TypeTraits<DataType>::IntegralType, dim_fst> FirstType; std::vector<FirstType> img_fst; //.... lots of methods missing here that actually...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

cuda
There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: http://i.stack.imgur.com/rT676.png As you can see, when...

NVCC CUDA cross compiling cannot find “-lcudart”

linux,cuda,ld,nvcc
I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA$ nvcc helloworld.cu -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/libcudart.so when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...

Reduce by key on device array

cuda,parallel-processing,thrust
I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...

caffe Debug build: stray '"' character in nvcc command

compilation,cuda,nvcc,caffe
I am trying to build my C++ application that uses caffe, in Debug Mode, VS2013 community, x64. To be able to build version that do not need cuda to run, I added to wrapped each .cu file as indicated below: #ifndef CPU_ONLY // .cu file contents #endif The project was...

Why does cuSOLVER cusolverSpDcsrlsvchol not work?

c++,cuda,linear-algebra,solver,cusolver
We are experiencing problems while using cuSOLVER's cusolverSpScsrlsvchol function, probably due to misunderstanding of the cuSOLVER library... Motivation: we are solving the Poisson equation -divgrad x = b on a rectangular grid. In 2 dimensions with a 5-stencil (1, 1, -4, 1, 1), the Laplacian on the grid provides a...

How to pass struct containing array to the kernel in CUDA?

c,arrays,cuda
In the following code I have an array in a struct which I need to pass to the kernel function. I can't seem to find the proper way. I tried looking at other posts on SO but do not understand their methods that well. In my actual code, I receive...

Cuda cusolver can't link in Visual studio 2013

c++,visual-studio,visual-studio-2013,cuda,cusolver
I have tried basically everything and I can't get vs2013 to compile and link against the cusolver library. I have tried all the sample projects that came with the cuda installation package and basically all of the samples work fine. Though there are no samples using cusolver. The include files...

Compiler 'cl.exe' in PATH different than the one specified with -ccbin

visual-studio-2013,cuda,intel-c++
This is on Windows 7 Pro 64 bit with CUDA 6.5 and Intel Composer 2015, in Visual Studio 2013. I have a project that contains a mix of native c++ and CUDA. I'm trying to switch from the Microsoft compiler to the Intel compiler, for better vectorization and performance. However,...

Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu
With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...

How to give texture to a Point cloud? OpenGL / CUDA / C++

c++,opencv,opengl,cuda,kernel
I have a disparity image obtained using OpenCV. I am able to display a 640 x 360 grid of points using OpenGL and CUDA, and give to each point a Z value which corresponds to the disparity value. The result: Now I wish to give to each point in the...

Why use memset when using CUDA?

c,cuda,nvidia
I saw in a CUDA code example that memset is used to initialize vectors to all 0's that will store the sum of two others vectors. For example: hostRef = (float *)malloc(nBytes); gpuRef = (float *)malloc(nBytes); memset(hostRef, 0, nBytes); memset(gpuRef, 0, nBytes); What purpose does this serve if nothing else...

sharing cuda texture among multiple streams

cuda,textures
I have a texture which I am using for reading an image. So, the texture is defined as: texture<uchar4, 2, cudaReadModeNormalizedFloat> text; I have a CUDA kernel which uses this texture to read some image pixel value as: __global__ void resample_2D(float4* result, int width, int height, float* x, float* y)...

Access violation reading location when calling cudaMemcpy2DToArray

c++,arrays,opencv,cuda
I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...

CUDA, “illegal memory access was encountered” in Memcpy

c++,cuda
I have this cuda file: #include "cuda.h" #include "../../HandleError.h" #include "Sphere.hpp" #include <stdlib.h> #include <CImg.h> #define WIDTH 1280 #define HEIGHT 720 #define rnd(x) (x*rand()/RAND_MAX) #define SPHERES_COUNT 5 using namespace cimg_library; __global__ void kernel(unsigned char* bitmap, Sphere* s) { // Map threadIdx/blockIdx to pixel position int x = threadIdx.x + blockIdx.x...

nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

c++11,gcc,cuda,future
When I compile the following code containing the design C++11, I get errors - it does not compile. I've tried with different flags, but I haven't found a solution. My setting: CUDA 6.5, gcc 4.4.7 I am not able to change the settings. How can I still make this work?...

Losing Unified Memory support in CUDA 7.0 under Windows 10

cuda
Recently I updated from CUDA 6.0 to CUDA 7.0, and my CUDA programs with unified memory allocation stopped working (other programs without unified memory still work, and the CUDA 7.0 template in Visual Studio 2013 still works). Following What is the canonical way to check for errors using the CUDA...

CUDA strange behavior accessing vector

c++,cuda
I have implemented a simple fft program in cuda. This is the kernel function: __global__ void fftKernel(cuComplex* dev_samples, size_t length, size_t llog, Direction direction) { int tid = threadIdx.x + blockDim.x * blockIdx.x; if (tid < length / 2) { // First step, sorts data with bit reversing and compute...

Tesla k20m interoperability with Direct3D 11

cuda,direct3d,tesla
I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

Cuda: Copy host data to shared memory array

c++,cuda
I have a struct defined on my host and on my device. In the host I initialize an array of this struct with values. MyStruct *h_s = (MyStruct *) malloc(objsize*sizeof(MyStruct)); h_s[0] = ... MyStruct *d_s; cudaMalloc( &d_s, objsize * sizeof(MyStruct)); cudaMemcpy( d_s, h_s, objsize * sizeof(MyStruct), cudaMemcpyHostToDevice ); init<<< gridSize,...

Branch and predicated instructions

cuda,simd
Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to better performance than the other. This comment suggests that branch instructions lead to...

cuda-memcheck fails to detect memory leak in an R package

r,memory-leaks,cuda,valgrind
I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. $ cuda-memcheck --leak-check full...

Understanding CUDA profiler output (nvprof)

cuda,memcpy,nvprof
I'm just looking at the following output and trying to wrap my mind around the numbers: ==2906== Profiling result: Time(%) Time Calls Avg Min Max Name 23.04% 10.9573s 16436 666.67us 64.996us 1.5927ms sgemm_sm35_ldg_tn_32x16x64x8x16 22.28% 10.5968s 14088 752.18us 612.13us 1.6235ms sgemm_sm_heavy_nt_ldg 18.09% 8.60573s 14088 610.86us 513.05us 1.2504ms sgemm_sm35_ldg_nn_128x8x128x16x16 16.48% 7.84050s 68092...

cuda calc distance of two points

cuda
Here I want to calculate the distance of each two points, and decide if they are neighbours. here is my simple code in cuda. __global__ void calcNeighbors(const DataPoint* points, const float doubleRadius, bool* neighbors) { int tid = threadIdx.x + blockIdx.x * blockDim.x; float dis = 0.0f; while (tid <...

cusolverSpDcsrlsvlu or QR method using CUDA

cuda,cusolver
I have searched everywhere but have been unable to solve this problem! "Unhandled exception at 0x00007FFF3AD3D430 (cusolver64_70.dll) in cusolver test.exe: 0xC0000005: Access violation reading location 0x0000000400960004." I want to solve Ax=B using a least-squares solver or the QR method. My code compiles without error, but later on I get...

CUDA with OpenGL: all CUDA-capable devices are busy or unavailable

c++,opengl,cuda,interop
I am following the CUDA-by-example tutorial to set up OpenGL for graphics interoperation with CUDA. Here is what I'm following. When I get to the point of registering the buffer with CUDA runtime as a graphics resource and run the code, I get an error stating that all CUDA-capable devices...

direct global memory access using cuda

c++,cuda
Q1: Let's say I have copied one array onto the device through stream 1 using cudaMemcpyAsync; would I be able to access the values of that array in a different stream, say stream 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; Q2: Would I have to include a pointer to the global memory array as an argument in...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

ubuntu,cuda,mpi
I am building openMPI 1.8.5 on Ubuntu 12.04 with CUDA 6.5 installed and tested with default samples. I intend to run it on a single node with following configuration: Dell Precision T7400 Dual Xeon X5450 Nvidia GT730/Tesla C1060 The configure command issued was $ ./configure --prefix=/usr --with-cuda=/usr/local/cuda In the generated...

cuda thrust: selective copying and resizing results

cuda,thrust
I am copying items selectively between two thrust device arrays using copy_if as follows: thrust::device_vector<float4> collated = thrust::device_vector<float4>(original_vec.size()); thrust::copy_if(original_vec.begin(), original_vec.end(), collated.begin(), is_valid_pt()); collated.shrink_to_fit(); The is_valid_pt is implemented as: struct is_valid_kpt { __host__ __device__ bool operator()(const float4 x) { return x.w >= 0; } }; Now after running this code,...

Calling a CUDA “Hello World” from Haskell using the FFI gives wrong results

haskell,cuda,ffi
This is the standard Hello World CUDA file: #include <stdio.h> #include "hello.h" const int N = 7; const int blocksize = 7; __global__ void hello_kernel(char *a, int *b) { a[threadIdx.x] += b[threadIdx.x]; } #define cudaCheckError() { \ cudaError_t e=cudaGetLastError(); \ if(e!=cudaSuccess) { \ printf("Cuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e)); \ exit(0); \...

Cuda grid size limitations

cuda
Are there limitations as to what I can set the grid size of a CUDA kernel to be? I ran into a problem where kernels were not launching with a grid size of 33 x 33 but were able to launch when the grid size was 32 x 32. Is...

CUDA Nsight Debug Focus Block not active

c++,cuda,nsight
Original code: for (int row_idx = 0; row_idx < 1370-1; row_idx++){ for (int col_idx = 0; col_idx < 644-1; col_idx++){ register int idx = row_idx*644 + col_idx; //some calculations which involve setting d_depthMap[idx]=0; } } Parallised code using cuda: dim3 threadsPerBlock(8,8); dim3 numBlocks(644/threadsPerBlock.x, 1370/threadsPerBlock.y); Kernel <<<numBlocks,threadsPerBlock>>>(d_depthMap, d_dcf, d_inp, d_wdt); __global__...

CudaMemCpy returns cudaErrorInvalidValue on copying vector

c++,opencv,cuda
CudaMemCpy returns cudaErrorInvalidValue on copying vector onto the device. I have tried giving "&input", "&input[0]",... I always get the same error but don't understand why? Can you copy a vector using cudaMemcpy or do I need to copy the contents of that vector in a new array first? void computeDepthChangeMap(unsigned...

Why cuSparse is much slower than cuBlas for sparse matrix multiplication

matrix,cuda,multiplication,sparse,cublas
Recently when I used cuSparse and cuBlas in CUDA TOOLKIT 6.5 to do sparse matrix multiplication, I find cuSparse is much slower than cuBlas in all cases! In all my experiments, I used "cusparseScsrmm" in cuSparse and "cublasSgemm" in cuBlas. In the sparse matrix, half of the total elements are...

thrust exception bulk_kernel_by_value in transform_reduce

c++,c++11,cuda
I'm working on an optimization problem which contains various math functions of similar form, so I wrap them in a FunctionObj template <typename T> struct FunctionObj { T a; FunctionObj(): a(1) { } }; and define a FuncEval to evaluate them: template <typename T> __host__ __device__ inline T FuncEval(const...

Different results on CPU and GPU

c++,cuda,double,gpu,nvidia
I implemented the same algorithm on both the CPU and the GPU using C++ and CUDA C. To check that the results are correct, I compare the two arrays of doubles computed by each with a precision of 1.0E-8. And the result is that the...
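The kind of check described above can be sketched as an element-wise comparison with an absolute tolerance. This is only an illustration: the function name all_close is hypothetical, and the asker's actual comparison code is not shown in the excerpt.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: true if the arrays have the same length and every
// pair of elements differs by at most abs_tol.
bool all_close(const std::vector<double>& a,
               const std::vector<double>& b,
               double abs_tol = 1.0e-8) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Written as !(diff <= tol) so that a NaN on either side fails the check.
        if (!(std::fabs(a[i] - b[i]) <= abs_tol)) return false;
    }
    return true;
}
```

An absolute tolerance does not scale with the magnitude of the values, so for large results a relative tolerance may be the more meaningful test.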

cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE

java,scala,ubuntu,cuda,jcuda
I have a very simple Scala JCuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...

cuda device function and templates

c++,templates,cuda
I am using CUDA 7 and am trying to pass a function as a template parameter to a device function as follows: typedef float(*Op)(float, float); template<typename Op> __device__ bool is_maxima(float ax, float ay, cudaTextureObject_t current) { // I try to use the passed function as: float cv = tex2D<float>(current, ax,...

CUDA Matrix Addition Execution Time with Variation in Block and Grid Dimensions

cuda
#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <stdlib.h> #include <malloc.h> #include <time.h> #include <intrin.h> #include <stdint.h> uint64_t rdtsc() { return __rdtsc(); } void init_matrix(int *a,int size) { for(int i=0;i<size;i++) a[i]=i; } void print_matrix(int *a,int rows,int columns) { for(int i=0;i<rows;i++){ for(int j=0;j<columns;j++){ printf("%d ",a[j+i*columns]); } printf("\n"); } } __global__ void add_matrix(int...

Running CUDA programs on Quadro K620m

cuda,nvidia
I have a laptop with a Quadro K620m GPU. I am trying to learn CUDA programming and downloaded the network installer from the NVIDIA site. During CUDA SDK installation, just when it's checking the hardware of the machine, it displays: Do you want to Continue? This graphics driver could not find compatible...

How do you build the example CUDA Thrust device sort?

c++,visual-studio-2010,sorting,cuda,thrust
I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the Thrust install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...

Cuda cub:Device Scan

cuda,gpu,nvcc,cub,scan
I'm using cub to implement device scan. When I run the default example for device scan I keep getting: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined. Does anyone have any idea about this problem? Thanks,...

Integer min/max in CUDA

cuda
I see in the CUDA Math API documentation that there are functions for single and double precision min/max operations (e.g. fminf()). I assume these are highly optimized, etc. There don't seem to be functions like these for integers. Is this true? Is there a reason for that?

Understanding Memory Replays and In-Flight Requests

caching,cuda
I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...
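The row-major layout mentioned above means that the element in row iy and column ix of a matrix with nx columns sits at offset iy*nx + ix. A minimal host-side sketch of that indexing, with a reference transpose built on it, follows; the names row_major and transpose_host are hypothetical and are not taken from the book's example.

```cpp
#include <cstddef>
#include <vector>

// Row-major offset of the element in row iy, column ix of a matrix
// that has nx columns.
inline std::size_t row_major(std::size_t ix, std::size_t iy, std::size_t nx) {
    return iy * nx + ix;
}

// Hypothetical host reference: transpose a matrix with ny rows and nx columns
// (`in`) into a matrix with nx rows and ny columns.
std::vector<float> transpose_host(const std::vector<float>& in,
                                  std::size_t nx, std::size_t ny) {
    std::vector<float> out(in.size());
    for (std::size_t iy = 0; iy < ny; ++iy) {
        for (std::size_t ix = 0; ix < nx; ++ix) {
            // Element (row iy, col ix) of `in` becomes (row ix, col iy) of `out`,
            // and `out` has ny columns.
            out[row_major(iy, ix, ny)] = in[row_major(ix, iy, nx)];
        }
    }
    return out;
}
```

Reading along a row touches consecutive addresses, while reading down a column jumps by nx elements each step, which is the access-pattern difference the question is about.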

How can I pass a struct to a kernel in JCuda

java,struct,cuda,jni,jcuda
I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...

CUDA parallel program flow

c++,c,multithreading,cuda
The code here is CUDA code meant to find the shortest pair path using Dijkstra's algorithm. My code logic works perfectly in a C program, but not in CUDA. I'm using 1 block with N threads, N being user entered. First doubt: every thread has their own copy of...

Interfacing cuSOLVER-sparse using PyCUDA

python,cuda,ctypes,pycuda,cusolver
I'm trying to interface the sparse cuSOLVER routine cusolverSpDcsrlsvqr() (>= CUDA 7.0) using PyCUDA and am facing some difficulties: I have tried wrapping the methods the same way the dense cuSolver routines are wrapped in scikits-cuda (https://github.com/lebedov/scikits.cuda/blob/master/scikits/cuda/cusolver.py). However, the code crashes with a segmentation fault when calling the cusolverSpDcsrlsvqr() function....

Compiling and Linking pure C and CUDA code [warning: implicit declaration of function]

c,compilation,cuda,gcc-warning,nvcc
I am trying to compile and link .c and .cu files and I am getting the warning "warning: implicit declaration of function". I have a function in the .cu file that I need to call from the .c file. The .c file is compiled using gcc and the .cu file is...

Passing dynamic array of structs to GPU kernel

c++,cuda,structure,dynamic-memory-allocation
I am trying to pass my dynamic array of structs to a kernel but it doesn't work. I get "Segmentation fault (core dumped)". My code (edited): #include <stdio.h> #include <stdlib.h> struct Test { unsigned char *array; }; __global__ void kernel(Test *dev_test) { } int main(void) { int n = 4;...

How to create a Cuda module without a host compiler

cuda
I would like to create a CUDA module for use in the CUDA Driver API without interacting with the host compiler. The main impetus for this is that the decisions in my group on when to change versions of host compilers and CUDA compilers are not always within our control....

unable to run openCV GPU cascade classifier sample

opencv,cuda,gpu
I am trying to run the cascade classifier GPU sample but unfortunately I am unable to run it. Here are my observations. I compiled cascadeclassifier.cpp with the help of the Jetson wiki page http://elinux.org/Jetson/Installing_OpenCV g++ cascadeclassifier.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab...

batch CUDA solution of sparse banded Ax=b for various b's

c++,cuda,sparse-matrix,matrix-factorization,cusolver
I have a sparse banded matrix A and I'd like to (direct) solve Ax=b. I have about 500 vectors b, so I'd like to solve for the corresponding 500 x's. I'm brand new to CUDA, so I'm a little confused as to what options I have available. cuSOLVER has a...