cuda,gpu,nvcc,cub,scan
I'm using cub to implement device scan. When I run the default example for device scan I keep getting : identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined Does anyone have any idea about this problem? Thanks,...
cuda,nvidia
I have a laptop with a Quadro K620M GPU. I am trying to learn CUDA programming and downloaded the network installer from the NVIDIA site. During CUDA SDK installation, just as it is checking the machine's hardware, it displays: Do you want to Continue? This graphics driver could not find compatible...
c++,memory-management,cuda
I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...
c,arrays,cuda
In the following code I have an array in a struct which I need to pass to the kernel function. I can't seem to find the proper way. I tried looking at other posts on SO but do not understand their methods that well. In my actual code, I receive...
c++,multithreading,cuda
I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...
c++,templates,cuda
I am using CUDA 7 and am trying to pass a function as a template parameter to a device function as follows: typedef float(*Op)(float, float); template<typename Op> __device__ bool is_maxima(float ax, float ay, cudaTextureObject_t current) { // I try to use the passed function as: float cv = tex2D<float>(current, ax,...
python,cuda,pycuda
I am using PyCUDA and I would like to know if there is an equivalent to the function cudaMemcpyToSymbol. I would like to copy a constant from the host to the device, as below: import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy from sys import path...
ubuntu,cuda,mpi
I am building openMPI 1.8.5 on Ubuntu 12.04 with CUDA 6.5 installed and tested with default samples. I intend to run it on a single node with following configuration: Dell Precision T7400 Dual Xeon X5450 Nvidia GT730/Tesla C1060 The configure command issued was $ ./configure --prefix=/usr --with-cuda=/usr/local/cuda In the generated...
c++,cuda,gpgpu,thrust,sparse-array
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...
c++,opencv,opengl,cuda,kernel
I have a disparity image obtained using OpenCV. I am able to display a 640 x 360 grid of points using OpenGL and CUDA, and give to each point a Z value which corresponds to the disparity value. The result: Now I wish to give to each point in the...
ubuntu,cuda,ubuntu-14.04,nvidia
I am using Ubuntu 14.04 and want to install CUDA, but I don't know which version is suitable for my laptop. I checked my driver version: $ cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...
r,memory-leaks,cuda,valgrind
I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. $ cuda-memcheck --leak-check full...
c++,cuda,sparse-matrix,matrix-factorization,cusolver
I have a sparse banded matrix A and I'd like to (direct) solve Ax=b. I have about 500 vectors b, so I'd like to solve for the corresponding 500 x's. I'm brand new to CUDA, so I'm a little confused as to what options I have available. cuSOLVER has a...
opengl,cuda,textures,buffer,opengl-4
I use OpenGL Version 4.5.0, but somehow I can not make texture_buffer_object extensions work for me ("GL_EXT_texture_buffer_object" or "GL_ARB_texture_buffer_object"). I am quite new with OpenGL, but if I understand right, these extensions are quite old and even already included in the core functionality... I looked for the extensions with "OpenGL...
cuda
There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: http://i.stack.imgur.com/rT676.png As you can see, when...
cuda,thrust
I am copying items selectively between two thrust device arrays using copy_if as follows: thrust::device_vector<float4> collated = thrust::device_vector<float4>(original_vec.size()); thrust::copy_if(original_vec.begin(), original_vec.end(), collated.begin(), is_valid_pt()); collated.shrink_to_fit(); The is_valid_pt is implemented as: struct is_valid_pt { __host__ __device__ bool operator()(const float4 x) { return x.w >= 0; } }; Now after running this code,...
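The excerpt is cut off before the symptom is described, so the following is only a guess at what may be relevant. Like std::copy_if, thrust::copy_if returns an iterator to the end of the range it wrote and does not resize the destination, so a destination sized to the input keeps its original length unless it is trimmed explicitly. A minimal host-side C++ sketch of that pattern, assuming a stand-in struct Pt4 in place of float4 and a hypothetical helper compact_valid (neither name is from the question):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Stand-in for CUDA's float4 so the sketch compiles without the CUDA headers.
struct Pt4 { float x, y, z, w; };

// Same test as the predicate in the question: a point is valid when w >= 0.
struct IsValidPt {
    bool operator()(const Pt4& p) const { return p.w >= 0.0f; }
};

// Hypothetical helper: copy the valid points, then trim the output to the
// number of elements that were actually written.
std::vector<Pt4> compact_valid(const std::vector<Pt4>& in) {
    std::vector<Pt4> out(in.size());  // worst case: every point is valid
    auto out_end = std::copy_if(in.begin(), in.end(), out.begin(), IsValidPt());
    out.erase(out_end, out.end());    // without this, out.size() == in.size()
    assert(out.size() <= in.size());
    return out;
}
```

With a thrust::device_vector the same idea would use the iterator returned by thrust::copy_if, for example by passing it to erase or resize, rather than relying on shrink_to_fit, which only affects capacity.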
python,cuda,pycuda
I have a dictionary and I would like to know whether it is possible to use it as a kernel parameter. For instance, I have the CUDA kernel signature __global__ void calTab(Tableaux), where Tableaux is a C structure defined as typedef struct { float *Tab1; float *Tab2; } Tableaux;...
c++,cuda,thrust
I am trying to do some scientific simulation using Thrust library in CUDA, but I got stuck in the following operation which is basically a for-each loop: device_vector<float> In(N); for-each In(x) in In Out(x) = some_calculation(In(x-1),In(x),In(x+1)); end I have already looked up stackoverflow.com and find some similar questions: Similar questions...
caching,cuda
I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...
cuda
I would like to create a Cuda module for use in the Cuda Driver API without interacting with the host compiler. The main impetus for this is that the decisions in my group on when to change versions of host compilers and cuda compilers is not always within our control....
c++,cuda
I have implemented a simple fft program in cuda. This is the kernel function: __global__ void fftKernel(cuComplex* dev_samples, size_t length, size_t llog, Direction direction) { int tid = threadIdx.x + blockDim.x * blockIdx.x; if (tid < length / 2) { // First step, sorts data with bit reversing and compute...
matrix,cuda
What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this? So far, I start of...
c++,visual-studio,visual-studio-2013,cuda,cusolver
I have tried basically everything and I can't get vs2013 to compile and link against the cusolver library. I have tried all the sample projects that came with the cuda installation package and basically all of the samples work fine. Though there are no samples using cusolver. The include files...
c++,cuda,shared-memory
My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...
cuda,textures
I have a texture which I am using for reading an image. So, the texture is defined as: texture<uchar4, 2, cudaReadModeNormalizedFloat> text; I have a CUDA kernel which uses this texture to read some image pixel value as: __global__ void resample_2D(float4* result, int width, int height, float* x, float* y)...
cuda
Are there limitations as to what I can set the grid size of a CUDA kernel to be? I ran into a problem where kernels were not launching with a grid size of 33 x 33 but were able to launch when the grid size was 32 x 32. Is...
cuda,32bit-64bit
I made a very simple CUDA kernel which populates an array of 100 elements with f[i]=i (and checked using assert in another kernel that it had indeed done so). #include<stdio.h> #include<assert.h> //definizione di gpuErrchk __global__ void setToItself(int* vect){ vect[threadIdx.x] = threadIdx.x; } int main(){ int* a_d; gpuErrchk( cudaMalloc(&a_d, 100 *...
cuda
Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...
c,cuda,parallel-processing,matrix-multiplication
I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...
matrix,cuda,multiplication,sparse,cublas
Recently when I used cuSparse and cuBlas in CUDA TOOLKIT 6.5 to do sparse matrix multiplication, I find cuSparse is much slower than cuBlas in all cases! In all my experiments, I used "cusparseScsrmm" in cuSparse and "cublasSgemm" in cuBlas. In the sparse matrix, half of the total elements are...
c++,cuda,function-pointers,gpgpu
With dynamic parallelism in CUDA, you can launch kernels on the GPU side, starting from a certain version. I have a wrapper function that takes a pointer to the kernel I want to use, and it either does this on the CPU for older devices, or on the GPU for...
c++,templates,opencv,cuda
The following code is a snippet from the PCL (point cloud) library. It calculates the integral sum of an image. template <class DataType, unsigned Dimension> class IntegralImage2D { public: static const unsigned dim_fst = Dimension; typedef cv::Vec<typename TypeTraits<DataType>::IntegralType, dim_fst> FirstType; std::vector<FirstType> img_fst; //.... lots of methods missing here that actually...
cuda,memcpy,nvprof
I'm just looking at the following output and trying to wrap my mind around the numbers: ==2906== Profiling result: Time(%) Time Calls Avg Min Max Name 23.04% 10.9573s 16436 666.67us 64.996us 1.5927ms sgemm_sm35_ldg_tn_32x16x64x8x16 22.28% 10.5968s 14088 752.18us 612.13us 1.6235ms sgemm_sm_heavy_nt_ldg 18.09% 8.60573s 14088 610.86us 513.05us 1.2504ms sgemm_sm35_ldg_nn_128x8x128x16x16 16.48% 7.84050s 68092...
c++,cuda,gpu
I have recently been trying to program a type of neural network using C in CUDA, and I have one basic question. I can either use big arrays or a different naming strategy. For example, for the weights, I can put all the weights in one big array or use...
multithreading,cuda
Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...
cuda,parallel-processing,thrust
I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...
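Counting elements per key is a reduction over runs of consecutive equal keys in which every element contributes the value 1. The host-side C++ sketch below shows that logic sequentially for the example in the question; it is an analogue of what reduce_by_key computes, not Thrust code. Int2 is a stand-in for int2, and count_by_first and example_counts are hypothetical names.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Stand-in for CUDA's int2 so the sketch compiles without the CUDA headers.
struct Int2 { int x, y; };

// Hypothetical helper: for each run of consecutive elements sharing the same
// .x, emit (key, run length). Like reduce_by_key, it only merges neighbours,
// so equal keys must already be adjacent (for example, after sorting).
std::vector<std::pair<int, int>> count_by_first(const std::vector<Int2>& v) {
    std::vector<std::pair<int, int>> runs;
    for (const Int2& e : v) {
        if (!runs.empty() && runs.back().first == e.x) {
            ++runs.back().second;     // same key as the previous element
        } else {
            runs.emplace_back(e.x, 1); // a new run starts here
        }
    }
    return runs;
}

// The array from the question: <1,2> <1,3> <1,4> <2,5> <2,7>.
inline void example_counts() {
    std::vector<Int2> a = {{1, 2}, {1, 3}, {1, 4}, {2, 5}, {2, 7}};
    auto runs = count_by_first(a);
    assert(runs.size() == 2);
    assert(runs[0].second == 3 && runs[1].second == 2);
}
```

In Thrust, one common way to express the same thing is to pass a thrust::constant_iterator<int>(1) as the values range, so that the per-key sum is the count.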
ubuntu,cuda,x11,freeglut,nsight
I am trying to run the CUDA particle sample on a remote Ubuntu machine from a host ubuntu machine. I followed this tutorial: http://devblogs.nvidia.com/parallelforall/remote-application-development-nvidia-nsight-eclipse-edition/ and it runs on my host, but not on my remote machine. I get the following output in Nsight: CUDA Particles Simulation Starting... grid: 64 x...
c++,cuda,tesla
I have a simple cuda application with the following code: #include <stdio.h> #include <sys/time.h> #include <stdint.h> __global__ void daxpy(int n, int a, int *x, int *y) { int i = blockIdx.x*blockDim.x + threadIdx.x; y[i] = x[i]; int j; for(j = 0; j < 1024*10000; ++j) { y[i] += j%10; }...
c++,opengl,cuda,interop
I am following the CUDA-by-example tutorial to set up OpenGL for graphics interoperation with CUDA. Here is what I'm following. When I get to the point of registering the buffer with CUDA runtime as a graphics resource and run the code, I get an error stating that all CUDA-capable devices...
visual-studio-2013,cuda,intel-c++
This is on Windows 7 Pro 64 bit with CUDA 6.5 and Intel Composer 2015, in Visual Studio 2013. I have a project that contains a mix of native c++ and CUDA. I'm trying to switch from the Microsoft compiler to the Intel compiler, for better vectorization and performance. However,...
cuda,profiling,nvidia
I am trying to read performance counters with nvprof while executing two kernels concurrently. nvprof --concurrent-kernels on --events fb_subp0_write_sectors ./myprogram However by doing this the kernel execution seems to serialize. What I want out of this is exactly how they perform when they are running concurrently. Is it possible at...
c++,c,multithreading,cuda
The code here is CUDA code meant to find the shortest pair path using Dijkstra's algorithm. My code logic works perfectly in a C program, but not in CUDA. I'm using 1 block with N threads, N being entered by the user. My first doubt: every thread has their own copy of...
cuda
Here I want to calculate the distance of each two points, and decide if they are neighbours. here is my simple code in cuda. __global__ void calcNeighbors(const DataPoint* points, const float doubleRadius, bool* neighbors) { int tid = threadIdx.x + blockIdx.x * blockDim.x; float dis = 0.0f; while (tid <...
cuda,gpgpu,gpu-programming,cublas
I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...
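cuBLAS assumes column-major storage, so element (row, col) of a matrix with leading dimension ld sits at offset row + col * ld. The host-side C++ sketch below packs the 3 x 4 matrix from the question into that layout. The helper names col_major_index and to_col_major are hypothetical, and the sketch makes no claim about what went wrong in the truncated code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major offset of element (row, col), 0-based, with leading dimension
// ld (the number of rows of the allocated matrix).
inline std::size_t col_major_index(std::size_t row, std::size_t col, std::size_t ld) {
    return row + col * ld;
}

// Hypothetical helper: repack a row-major rows x cols matrix into column-major
// order, the layout that cublasSetMatrix and cublasGetMatrix operate on.
std::vector<float> to_col_major(const std::vector<float>& row_major,
                                std::size_t rows, std::size_t cols) {
    assert(row_major.size() == rows * cols);
    std::vector<float> out(rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[col_major_index(r, c, rows)] = row_major[r * cols + c];
        }
    }
    return out;
}
```

For [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12] the packed array is 1 5 9 2 6 10 3 7 11 4 8 12, and the leading dimension to pass for a tightly packed matrix is the row count, 3.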
cuda
During implementation of CUDA code I often need some utility functions, which will be called from device and also from host code. So I declare these functions as __host__ __device__. This is OK and possible device/host incompatibilities can be handled by #ifdef __CUDA_ARCH__. Problems come when the utility function is...
cuda
I'm just trying to use CUDA to blank an image. But "before" and "after" I get the same original image. Can't figure out the problem. sumKernel.cu: #include "sumKernel.h" __global__ void _sumKernel(char *image, int width, int height, char *kernel, int kerwidth, int kerheight) { int idx = blockIdx.x * blockDim.x +...
linux,cuda,ld,nvcc
I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA$ nvcc helloworld.cu -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/libcudart.so when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...
cuda,gpu,shared-memory,bank-conflict
It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 //[email protected] __device__ int inwarpD[nblc]; __global__ void kernel(){...
python,cuda,ctypes,pycuda,cusolver
I'm trying to interface the sparse cuSOLVER routine cusolverSpDcsrlsvqr() (>= CUDA 7.0) using PyCUDA and am facing some difficulties: I have tried wrapping the methods the same way the dense cuSolver routines are wrapped in scikits-cuda (https://github.com/lebedov/scikits.cuda/blob/master/scikits/cuda/cusolver.py). However, the code crashes with a segmentation fault when calling the cusolverSpDcsrlsvqr() function....
cuda,gpu,nvidia,prefix-sum
I've written a piece of code to call the kernel in gpugem3 but the results that I got is a bunch of negative numbers instead of prefix scan. I'm wondering if my kernel call is wrong or there is something wrong with the gpugem3 code? here is my code: #include...
cuda,cusolver
I have searched everywhere but have been unable to solve this problem: "Unhandled exception at 0x00007FFF3AD3D430 (cusolver64_70.dll) in cusolver test.exe: 0xC0000005: Access violation reading location 0x0000000400960004." I want to solve Ax=B using a least-squares solver or the QR method. My code compiles without error, but later on I get...
cuda
#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> #include <stdlib.h> #include <malloc.h> #include <time.h> #include <intrin.h> #include <stdint.h> uint64_t rdtsc() { return __rdtsc(); } void init_matrix(int *a,int size) { for(int i=0;i<size;i++) a[i]=i; } void print_matrix(int *a,int rows,int columns) { for(int i=0;i<rows;i++){ for(int j=0;j<columns;j++){ printf("%d ",a[j+i*columns]); } printf("\n"); } } __global__ void add_matrix(int...
cuda
I have been trying to store non-square images into a volume. My code works when I want to store, 512 times, a square image of size w512 x h512, but it does not work for a non-square image of size w512 x h1024. I get an error with a description...
c,compilation,cuda,gcc-warning,nvcc
I am trying to compile and link .c and .cu files and I am getting a warning warning: implicit declaration of function I have a function in the .cu file that I need to call from the .c file. The .c file is compiled using gcc and .cu file is...
java,scala,ubuntu,cuda,jcuda
I have a very simple scala jcuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...
python,python-2.7,cuda,ipython,pycuda
I am trying to launch a kernel in pyCUDA and then terminate the kernel by writing to a GPU global memory location. Here is a simple example kernel that I would like to be able to terminate at some point after it enters the infinite while loop: __global__ void countUp(u16...
c++,pointers,cuda,thrust
I have a function to get the minimum value of an array and it's executed within a loop. thrust::device_ptr<float> min_ptr = thrust::min_element(populationFitness, populationFitness + POPULATION); Do I have to free the returned device_ptr? I tried with thrust::device_free(min_ptr) but an exception is thrown....
c++,c++11,lambda,cuda
If I use thrust::transform on thrust::host, the lambda usage is fine thrust::transform(thrust::host, a, a+arraySize,b,d,[](int a, int b)->int { return a + b; }); However, if I change the thrust::host to thrust::device, The code wouldn't pass the compiler. Here is the error on vs 2013 The closure type for a lambda...
c,cuda,matrix-multiplication
I have written a neural network program in CUDA C. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM; I use the following code instead. I was wondering if anyone has advice on making it faster, which can be very...
java,struct,cuda,jni,jcuda
I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...
c,cuda
I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...
cuda
Recently I updated from CUDA 6.0 to CUDA 7.0, and my CUDA programs with unified memory allocation stopped working (other programs without unified memory still work, and the CUDA 7.0 template in Visual Studio 2013 still works). Following What is the canonical way to check for errors using the CUDA...
c++,cuda,nsight
Original code: for (int row_idx = 0; row_idx < 1370-1; row_idx++){ for (int col_idx = 0; col_idx < 644-1; col_idx++){ register int idx = row_idx*644 + col_idx; //some calculations which involve setting d_depthMap[idx]=0; } } Parallelised code using CUDA: dim3 threadsPerBlock(8,8); dim3 numBlocks(644/threadsPerBlock.x, 1370/threadsPerBlock.y); Kernel <<<numBlocks,threadsPerBlock>>>(d_depthMap, d_dcf, d_inp, d_wdt); __global__...
haskell,cuda,ffi
This is the standard Hello World CUDA file: #include <stdio.h> #include "hello.h" const int N = 7; const int blocksize = 7; __global__ void hello_kernel(char *a, int *b) { a[threadIdx.x] += b[threadIdx.x]; } #define cudaCheckError() { \ cudaError_t e=cudaGetLastError(); \ if(e!=cudaSuccess) { \ printf("Cuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e)); \ exit(0); \...
cuda,simd
Section 5.4.2 of the CUDA C Programming Guide states that branch divergence is handled either by "branch instructions" or, under certain conditions, "predicated instructions". I don't understand the difference between the two, and why one leads to better performance than the other. This comment suggests that branch instructions lead to...
r,cuda,rcpp
I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/rcppcuda.cu...
c++,arrays,opencv,cuda
I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...
c,cuda
I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...
c++,visual-studio-2010,sorting,cuda,thrust
I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the Thrust install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...
c++,cuda
I have a struct defined on my host and on my device. In the host I initialize an array of this struct with values. MyStruct *h_s = (MyStruct *) malloc(objsize*sizeof(MyStruct)); hs[0] = ... Mystruct *d_s; cudaMalloc( &d_s, objsize * sizeof(MyStruct)); cudaMemcpy( d_s, h_s, objsize * sizeof(MyStruct), cudaMemcpyHostToDevice ); init<<< gridSize,...
cuda
I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?
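A host-side C++ sketch of the conversion in question, offered as an illustration rather than a definitive answer. In C++ a signed-to-unsigned conversion is defined modulo 2^64, so any non-negative counter value keeps its numeric value, and the difference of two unsigned samples stays correct even if the counter wraps between them. The helper names are hypothetical, and the sketch assumes the sampled counter values are non-negative.

```cpp
#include <cassert>

// Hypothetical helper: store a signed 64-bit counter sample in the unsigned
// type that the 64-bit atomic functions operate on. Non-negative inputs are
// value-preserving; the conversion itself is well-defined for all inputs.
inline unsigned long long to_unsigned_ticks(long long t) {
    return static_cast<unsigned long long>(t);
}

// Hypothetical helper: elapsed ticks between two stored samples. Unsigned
// subtraction wraps modulo 2^64, so a single wrap of the counter between
// start and stop still yields the right distance.
inline long long elapsed_ticks(unsigned long long start, unsigned long long stop) {
    return static_cast<long long>(stop - start);
}

inline void example_ticks() {
    assert(to_unsigned_ticks(5) == 5ULL);
    assert(elapsed_ticks(to_unsigned_ticks(100), to_unsigned_ticks(250)) == 150);
    // Counter wrapped: start is 2 ticks before the maximum value, stop is 3.
    assert(elapsed_ticks(~0ULL - 1ULL, 3ULL) == 5);
}
```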
cuda,barrier
For example, my code is something like this (but it doesn't work and the kernel stalls): __device__ __managed__ int x; __global__ void kernel() { // do something while(x == 1); // a barrier // do the rest } int main() { x = 1; kernel<<< 1, 1 >>>(); x =...
cuda
I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?
windows,cuda,installation
I have a GeForce 8400 GS and compatible updated driver 341.44 installed on my Windows 8.1 x64 machine. When I tried to install CUDA 7.0 it showed that it could not find compatible hardware but I went along and chose to not install the provided driver and looking at this...
caching,cuda,gpgpu
I just noticed it's at all possible to have (CUDA kernel) memory accesses uncached (see e.g. this answer here on SO). Can this be done... For a single kernel individually? At run time rather than at compile time? For writes only rather than for reads and writes? ...
c,cuda,nvidia
I saw in a CUDA code example that memset is used to initialize vectors to all 0's that will store the sum of two others vectors. For example: hostRef = (float *)malloc(nBytes); gpuRef = (float *)malloc(nBytes); memset(hostRef, 0, nBytes); memset(gpuRef, 0, nBytes); What purpose does this serve if nothing else...
opencv,cuda,gpu
I am trying to run cascade classifier GPU sample but unfortunately I am unable to run it. Here are my observations. I compiled cascadeclassifier.cpp with the help of jetson wiki page http://elinux.org/Jetson/Installing_OpenCV g++ cascadeclassifier.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab...
compilation,cuda,nvcc,caffe
I am trying to build my C++ application, which uses Caffe, in Debug mode with VS2013 Community, x64. To be able to build a version that does not need CUDA to run, I wrapped each .cu file as indicated below: #ifndef CPU_ONLY // .cu file contents #endif The project was...
cuda,gpu,gpgpu
With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...
math,cuda,floating-point
I'm looking up a book about CUDA. On the chapter which explains the floating points of CUDA, I found something odd. The book says that (1.00 * 1) + (1.00 * 1) + (1.00 * 0.01) + (1.00* 0.01) = 10. All the numbers are binaries. 0.01 refers to decimal...
c++,opencv,cuda
cudaMemcpy returns cudaErrorInvalidValue when copying a vector onto the device. I have tried giving "&input", "&input[0]",... I always get the same error but don't understand why. Can you copy a vector using cudaMemcpy or do I need to copy the contents of that vector in a new array first? void computeDepthChangeMap(unsigned...
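A std::vector stores its elements contiguously, so its contents can serve as the source of a raw byte copy without an intermediate array. The details matter, though: &input is the address of the vector object itself, while input.data() (or &input[0] for a non-empty vector) points at the elements, and the byte count is size() * sizeof(T), not sizeof(input). The host-side C++ sketch below captures that with a hypothetical HostBuffer struct and as_host_buffer helper. Whether this is the cause of the error in the question cannot be told from the truncated excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a host source buffer for a raw byte copy.
struct HostBuffer {
    const void* ptr;    // first element, not the vector object
    std::size_t bytes;  // number of bytes occupied by the elements
};

template <typename T>
HostBuffer as_host_buffer(const std::vector<T>& v) {
    return HostBuffer{ v.data(), v.size() * sizeof(T) };
}

inline void example_host_buffer() {
    std::vector<float> input(8, 1.0f);
    HostBuffer buf = as_host_buffer(input);
    assert(buf.bytes == 8 * sizeof(float));
    // The element storage lives elsewhere than the vector object itself.
    assert(buf.ptr != static_cast<const void*>(&input));
}
```

A copy would then look like cudaMemcpy(d_input, buf.ptr, buf.bytes, cudaMemcpyHostToDevice), where d_input is an assumed device pointer allocated with at least buf.bytes bytes.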
c++,c++11,cuda
I'm working on an optimization problem that contains various math functions of similar form, so I wrap them in a FunctionObj: template <typename T> struct FunctionObj { T a; FunctionObj(): a(1) { } }; And define a FuncEval to evaluate them: template <typename T> __host__ __device__ inline T FuncEval(const...
c++,cuda
I have this cuda file: #include "cuda.h" #include "../../HandleError.h" #include "Sphere.hpp" #include <stdlib.h> #include <CImg.h> #define WIDTH 1280 #define HEIGHT 720 #define rnd(x) (x*rand()/RAND_MAX) #define SPHERES_COUNT 5 using namespace cimg_library; __global__ void kernel(unsigned char* bitmap, Sphere* s) { // Map threadIdx/blockIdx to pixel position int x = threadIdx.x + blockIdx.x...
c++,cuda,macros
I was trying to write a simple CUDA function to blur images. I use my own max and min macros, defined as #define min(a, b) ((float)a > (float)b)? (float)b: (float)a #define max(a, b) ((float)a > (float)b)? (float)a: (float)b The part of __global__ kernel is: float norm; float sum = 0;// when...
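The excerpt stops before the symptom, but the macro definitions themselves are visible, and they have a well-known hazard: the expansion is not wrapped in parentheses, so an operator written after the macro call binds to the last branch of the conditional, and the casts apply only to the first token of each argument. The host-side C++ sketch below reproduces the min macro under a hypothetical name (QUESTION_MIN) and shows two possible fixes. It is a sketch of one plausible problem, not a diagnosis of the truncated kernel.

```cpp
#include <cassert>

// The min macro from the question, renamed (hypothetical name) so it does not
// clash with library definitions of min.
#define QUESTION_MIN(a, b) ((float)a > (float)b)? (float)b: (float)a

// One possible fix: parenthesize every argument and the whole expansion.
#define SAFE_MIN(a, b) (((float)(a) > (float)(b)) ? (float)(b) : (float)(a))

// Alternative that avoids macros altogether.
inline float min_f(float a, float b) { return a > b ? b : a; }

// Expands to: ((float)a > (float)b) ? (float)b : ((float)a + 1.0f)
// so the "+ 1.0f" is lost whenever a > b.
inline float question_min_plus_one(float a, float b) {
    return QUESTION_MIN(a, b) + 1.0f;
}

// Expands to: (min of a and b) + 1.0f, as intended.
inline float safe_min_plus_one(float a, float b) {
    return SAFE_MIN(a, b) + 1.0f;
}

inline void example_macros() {
    assert(question_min_plus_one(5.0f, 3.0f) == 3.0f);  // not the intended 4
    assert(safe_min_plus_one(5.0f, 3.0f) == 4.0f);
    assert(min_f(5.0f, 3.0f) + 1.0f == 4.0f);
}
```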
c++,cuda,structure,dynamic-memory-allocation
I try to pass my dynamic array of structs to a kernel but it doesn't work. I get "Segmentation fault (core dumped)". My code (edited): #include <stdio.h> #include <stdlib.h> struct Test { unsigned char *array; }; __global__ void kernel(Test *dev_test) { } int main(void) { int n = 4;...
c++,cuda,double,gpu,nvidia
I implemented the same algorithm both on CPU and GPU using C++ and CUDA C. In order to check if the results are correct I check if the 2 arrays of double calculated by both are the same with a precision of 1.0E-8 . And the result is that the...
c#,cuda,sharpdx,direct3d9,managed-cuda
I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...
cuda
As NVIDIA GPUs evolve, the number of cores per SM changes: in Fermi we have 32 of them, but in Maxwell the number is 128 according to the white papers. So, my questions are as follows: Is it better to create grids with blocks containing 128 threads each? Will such code...
c++,multithreading,cuda,shuffle
I have got problem with shuffle instruction in CUDA 5.0. This is snippet of my kernel. It is inside the loop. Print is there only for debug purpose because I can't use ordinary debugger: ... tex_val = tex2D(srcTexRef, threadIdx.x + w, y_pos); if (threadIdx.x == 0) { left = left_value[y_pos];...
cuda,parallel-processing,research
I am trying to replicate the linear programming solver that this person has made http://www.idi.ntnu.no/~elster/master-studs/spampinato/spampinato-linear-prog-gpu-report.pdf. First off, the device I am using is a Quadro FX 1800M with compute capability 1.2. My problem is that when I launch more than 22 threads per block, most of the time I get...
cuda,direct3d,tesla
I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...
c++,c,arrays,cuda
I am trying to implement k-means algorithm on CUDA using Tesla card on external Unix. I read input file and store coordinates of all data points in dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory....
multithreading,cuda
I have the following C code for CUDA program: #include <stdio.h> #define NUM_BLOCKS 4 #define THREADS_PER_BLOCK 4 __global__ void hello() { printf("Hello. I'm a thread %d in block %d\n", threadIdx.x, blockIdx.x); } int main(int argc,char **argv) { // launch the kernel hello<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(); // force the printf()s to flush cudaDeviceSynchronize();...
cuda
I see in the CUDA Math API documentation that there are functions for single and double precision min/max operations (e.g. fminf()). I assume these are highly optimized, etc. There don't seem to be functions like these for integers. Is this true? Is there a reason for that?
c,struct,cuda,dynamic-memory-allocation
I have a problem passing an array of structs to a GPU kernel. Based on this topic (cudaMemcpy segmentation fault), I wrote something like this: #include <stdio.h> #include <stdlib.h> struct Test { char *array; }; __global__ void kernel(Test *dev_test) { for(int i=0; i < 5; i++) { printf("Kernel[0][i]: %c...
cuda,nvidia
dim3 block(4, 2); dim3 grid((nx+block.x-1)/block.x, (ny+block.y-1)/block.y); I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed?...
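The expression (n + b - 1) / b is integer ceiling division: it gives the smallest number of blocks of size b that cover n elements, so the grid still covers the whole matrix when n is not a multiple of b. A host-side C++ sketch of that arithmetic follows; Dim2 is a stand-in for dim3, and ceil_div and grid_for are hypothetical helper names.

```cpp
#include <cassert>

// Stand-in for the x and y fields of CUDA's dim3.
struct Dim2 { unsigned x, y; };

// Smallest k such that k * d >= n, using integer arithmetic only.
constexpr unsigned ceil_div(unsigned n, unsigned d) {
    return (n + d - 1) / d;
}

// Hypothetical helper: grid dimensions that cover an nx-by-ny matrix
// (nx columns, ny rows) with blocks of the given shape.
inline Dim2 grid_for(unsigned nx, unsigned ny, Dim2 block) {
    assert(block.x > 0 && block.y > 0);
    return Dim2{ ceil_div(nx, block.x), ceil_div(ny, block.y) };
}

inline void example_grid() {
    // 16 x 16 matrix, 4 x 2 blocks: exact fit, 4 x 8 blocks.
    Dim2 g = grid_for(16, 16, Dim2{4, 2});
    assert(g.x == 4 && g.y == 8);
    // 17 columns need a fifth, partly empty block column.
    Dim2 h = grid_for(17, 15, Dim2{4, 2});
    assert(h.x == 5 && h.y == 8);
}
```

Because the last block in each dimension may extend past the matrix edge, kernels launched this way usually guard their accesses with a bounds check such as if (ix < nx && iy < ny).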
c++,cuda,linear-algebra,solver,cusolver
We are experiencing problems while using cuSOLVER's cusolverSpScsrlsvchol function, probably due to misunderstanding of the cuSOLVER library... Motivation: we are solving the Poisson equation -divgrad x = b on a rectangular grid. In 2 dimensions with a 5-stencil (1, 1, -4, 1, 1), the Laplacian on the grid provides a...
c++,cuda
Q1: Let's say I have copied one array onto the device through stream 1 using cudaMemcpyAsync; would I be able to access the values of that array in a different stream, say stream 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; Q2: Would I have to include a pointer to the global memory array as an argument in...
cuda,mvapich2
I'm using MVAPICH2 2.1 on a Debian 7 machine. It has multiple cards of Tesla K40m. The code is as follows. #include <cstdio> #include <cstdlib> #include <ctime> #include <cuda_runtime.h> #include <mpi.h> int main(int argc, char** argv) { MPI_Status status; int rank; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); cudaSetDevice(0); if (rank == 0)...
c++11,gcc,cuda,future
When I compile the following code containing the design C++11, I get errors - it does not compile. I've tried with different flags, but I haven't found a solution. My setting: CUDA 6.5, gcc 4.4.7 I am not able to change the settings. How can I still make this work?...