
Coprocessor accelerators compared to GPUs

gpu,intel,gpgpu,hardware-acceleration,xeon-phi
Are coprocessors like the Intel Xeon Phi supposed to be utilized much like GPUs, so that one should offload a large number of blocks executing a single kernel, so that only the overall throughput the coprocessor handles results in a speedup, OR will offloading independent threads (tasks) increase the efficiency...

nvcc -arch sm_52 gives error “Value 'sm_52' is not defined for option 'gpu-architecture'”

cuda,gpu,nvcc
I updated my CUDA toolkit from 5.5 to 6.5. Now the command nvcc -arch=sm_52 gives me an error: nvcc fatal : Value 'sm_52' is not defined for option 'gpu-architecture'. Is this a bug, or does nvcc 6.5 not support the Maxwell virtual architecture?...

CUDA: Find out if host buffer is pinned (page-locked)

c++,memory,cuda,gpu
A short description of my problem is as follows: I developed a function that calls a CUDA kernel. My function receives a pointer to the host data buffers (input and output of kernel), and has no control over the allocation of these buffers. --> It is possible that the host...
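One way to answer this (a sketch, assuming the CUDA runtime API with unified addressing, CUDA 4.0+): pageable host memory is unknown to the runtime, so `cudaPointerGetAttributes` fails for it, while page-locked memory registered via `cudaHostAlloc`/`cudaHostRegister` is recognized:

```cuda
#include <cuda_runtime.h>

// Sketch: returns 1 if ptr points to page-locked (pinned) host memory,
// 0 for plain pageable memory. Assumes unified addressing (CUDA 4.0+).
int is_pinned(const void *ptr)
{
    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        // Pageable host memory is not tracked by the runtime, so the
        // query fails; clear the sticky error state before returning.
        cudaGetLastError();
        return 0;
    }
    return attr.memoryType == cudaMemoryTypeHost;
}
```

This only detects memory the CUDA runtime knows about; memory pinned by other means (e.g. mlock) is still reported as pageable.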

Is it possible to offload function to graphic card?

c++,multithreading,gpu,intel,cilk
I have a C++ multithreaded application, and I want to get better performance and decrease the total CPU usage by using the Intel HD Graphics. I'm not using Cilk. (The application is written in pure C++.) I read the following link: How to offload computation to Intel(R) Graphics Technology But...

CUDA cub: Device Scan

cuda,gpu,nvcc,cub,scan
I'm using cub to implement a device scan. When I run the default example for device scan I keep getting: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined. Does anyone have any idea about this problem? Thanks,...

Is there a function in the cublas that can apply the sigmoid function with a vector?

cuda,gpu,cublas
As the title says, I want to apply a function element-wise to a vector. Is there any function in the cuBLAS library to do that?
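For context: cuBLAS only exposes BLAS-style operations (gemm, axpy, dot, ...), not arbitrary element-wise maps, so the usual approach is a small custom kernel. A minimal sketch (the kernel name and launch shape are illustrative, not part of any library):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical element-wise sigmoid kernel: one thread per element.
__global__ void sigmoid_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 1.0f / (1.0f + expf(-in[i]));
}

// Launch with enough blocks to cover all n elements, e.g.:
// sigmoid_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
```

Libraries like Thrust (`thrust::transform` with a functor) offer the same thing without writing the kernel by hand.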

Efficiently Calculate Frequency Averaged Periodogram Using GPU

performance,matlab,optimization,gpu,signal-processing
In Matlab I am looking for a way to most efficiently calculate a frequency averaged periodogram on a GPU. I understand that the most important thing is to minimise for loops and use the already built in GPU functions. However my code still feels relatively unoptimised and I was wondering...

purposely causing bank conflicts for shared memory on CUDA device

cuda,gpu,shared-memory,bank-conflict
It is a mystery to me how shared memory on CUDA devices works. I was curious to count the threads having access to the same shared memory. For this I wrote a simple program: #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 __device__ int inwarpD[nblc]; __global__ void kernel(){...

Which GPU model/brand is optimal for Neural Networks? [closed]

machine-learning,neural-network,gpu
This is not an unreasonable question. Nvidia and ATI architectures differ, enough so that for certain tasks (such as bitcoin mining) ATI is vastly better than Nvidia. The same could be true for Neural Network related processing. I have attempted to find comparisons of the 2 GPU brands in such...

GPU computing in matlab

matlab,gpu
My MATLAB version is R2014a. The norm function is a built-in gpuArray function in MATLAB. I think it should return a gpuArray when the input is a gpuArray, but in my MATLAB it returns a double. Could anyone tell me what happened? Example: a=gpuArray.randn(3,4); b=norm(a) — the type of b is "double" instead of "gpuArray"....

How can I find out which thread is getting executed on which core of the GPU?

cuda,gpu,nvidia
I'm developing some simple programs in CUDA and I want to know which thread is getting executed on which core of the GPU. I'm using Visual Studio 2012 and I have an NVIDIA GeForce 610M graphics card. Is it possible to do so... I've already searched a lot on Google...

GPU rendering images with Python cv2 on Raspberry Pi

python,opencv,computer-vision,gpu
I am using the Raspberry Pi 2 to load large-resolution images using OpenCV. I have a sketch running, but without apparent OpenGL support, as the OpenCV library states that it is not supported: OpenCV Error: no OpenGL support (Library was built without OpenGL support). I attempted to install pyOpenGL, but...

Compress “sparse data” with CUDA (CCL: connected component labeling reduction)

cuda,gpu,cudafy.net
I have a list of 5 million 32-bit integers (actually a 2048 x 2560 image) that is 90% zeros. The non-zero cells are labels (e.g. 2049, 8195, 1334300, 34320923, 4320932) that are not sequential or consecutive in any way (it is the output of our custom connected component labeling...

GPU Programming Strategy

c++,cuda,gpu
Recently, I have been trying to program a type of neural network in C using CUDA. I have one basic question. For the programming, I can either use big arrays or a different naming strategy. For example, for the weights, I can put all the weights in one big array or use...

large numbers and float and double in C

c,matrix,floating-point,double,gpu
I need to deal with very large matrices and/or large numbers and I don't know why double result = 2251.000000 * 9488.000000 + 7887.000000 * 8397.000000; gives me the correct output of 87584627.000000. Same with int result. However, if I use float result = 2251.000000f + ... etc, it gives...

is early exit of loops on GPU worth doing?

opengl-es,glsl,webgl,shader,gpu
We've written GLSL shader code to do ray tracing visualisation using the GPU. It seems to be pretty standard to put an early exit break in the ray marching loop, so if the light is extinguished, the loop breaks. But from what I know about GPU code, each render will...

Different results on CPU and GPU

c++,cuda,double,gpu,nvidia
I implemented the same algorithm both on the CPU and the GPU using C++ and CUDA C. In order to check if the results are correct, I check whether the two arrays of doubles calculated by both are the same to a precision of 1.0E-8. And the result is that the...

Weird behavior with Visual Studio Debugger

visual-studio-2013,gpu,visual-studio-debugging,directx-11
I experienced some weird behavior with Visual Studio's Debugger when running VS with the Dedicated GPU. What is weird is that when I terminate the program I am building, the debugger stays on. I don't see this when running VS with the integrated graphics. Also - I checked if there...

Set each gpu for each thread

cuda,gpu,gpu-programming,multi-gpu
For example, I have 2 GPUs and 2 host threads. I can't check it because the multi-GPU PC is far away from me. I want to make the first host thread work with the first GPU and the second host thread work with the second GPU. All host threads consist of...
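The standard pattern (a sketch, assuming CUDA 4.0+ where the device selection is per host thread): each thread calls `cudaSetDevice` once before any other CUDA work, and everything it does afterwards targets that GPU:

```cuda
#include <cuda_runtime.h>

// Sketch of a per-thread worker: thread 0 passes device_id 0,
// thread 1 passes device_id 1. All CUDA calls made by this thread
// after cudaSetDevice() go to the selected GPU.
void worker(int device_id)
{
    cudaSetDevice(device_id);

    float *d_buf;
    cudaMalloc(&d_buf, 1024 * sizeof(float));
    // kernel<<<blocks, threads>>>(d_buf);  // runs on this thread's GPU
    cudaDeviceSynchronize();
    cudaFree(d_buf);
}
```

Note that pointers allocated on one device are not valid on the other unless peer access or unified addressing is set up explicitly.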

OpenGL, measuring rendering time on gpu

opengl,gpu,timing,opengl-4,elapsedtime
I have some big performance issues here, so I would like to take some measurements on the GPU side. After reading this thread I wrote this code around my draw functions, including the GL error check and the swapBuffers() (auto-swapping is indeed disabled): gl4.glBeginQuery(GL4.GL_TIME_ELAPSED, queryId[0]); { draw(gl4); checkGlError(gl4); glad.swapBuffers();...

When is a Three.js Texture sent to the GPU?

javascript,three.js,textures,gpu
I am building an application that dynamically loads images from a server to use as textures in the scene, and I am working on how to load/unload these textures properly. My simple question is: where, in the Three.js call graph, do textures get loaded and/or updated onto the GPU? Is...

Why does the CPU compile the GPU shader?

caching,compilation,shader,gpu,cpu
To understand in general how GPU shader caching works, I read some information and understood this: the CPU compiles the shader, transmits the resulting shader code to the GPU to execute, and also saves it to disk. If the same shader needs to be executed again, the GPU gets the saved binary code directly...

Basic GPU application, integer calculations

c,gpu,gpgpu
Long story short, I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C now, and at the same time I am searching for existing possibilities to use some GPU power to take load off the CPU...

CUDA shared memory bank conflicts report higher

cuda,gpu,shared-memory,bank-conflict
I've been working on optimizing some code and ran into an issue with the shared memory bank conflict report with the CUDA Nsight performance analysis. I was able to reduce it to a very simple piece of code that Nsight reports as having a bank conflict, when it doesn't seem...

How does printf work on CUDA compute >= 2

c,cuda,printf,gpu
In the earlier days printf was not supported, and we would either run CUDA programs using the emulator or copy the variable back and forth and print on the host side. Now that CUDA (arch 2 and greater) supports printf, I am curious to know how this works. I mean how...
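In brief: device-side printf writes its arguments into a ring buffer in global memory, and the host driver copies the buffer out and prints it at synchronization points, not as each thread executes. A minimal sketch:

```cuda
#include <cstdio>

// Device-side printf requires compute capability >= 2.0. Each thread
// appends its formatted record to a fixed-size buffer in global memory;
// output order across threads is therefore not guaranteed.
__global__ void hello()
{
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

// Launch and flush:
// hello<<<2, 4>>>();
// cudaDeviceSynchronize();  // buffered output is printed here
```

The buffer size can be adjusted with `cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes)`; records that overflow it are silently dropped.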

“invalid configuration argument” error for the call of CUDA kernel

cuda,kernel,gpu
I have a GeForce 620M and my code is: int threadsPerBlock = 256; int blocksPerGrid = Number_AA_GPU / threadsPerBlock; for(it=0;it<Number_repeatGPU;it++) { Kernel_Update<<<blocksPerGrid,threadsPerBlock>>>(A, B, C, D, rand(), rand()); } and I get: invalid configuration argument. What could be the reason?...

QOpenGLShaderProgram: is possible to make error output nice?

qt,opengl,error-handling,shader,gpu
I'm implementing some numerical algorithms on the GPU via OpenGL and Qt, but I am not very familiar with it. I want to extract some functions from my current shaders into a "shader library" and use them in my other shaders via string interpolation. It is not hard to implement, but I...

OpenCL clGetDeviceIDs returns -1 when GPU and 0 when CPU

opencl,gpu
I was running my OpenCL/C++ code on Ubuntu 14.04.2 LTS (NVIDIA Corporation GM204 [GeForce GTX 980]). It works correctly on the CPU, but the clGetDeviceIDs call returned -1 when I changed CL_DEVICE_TYPE_CPU to CL_DEVICE_TYPE_GPU. The code in question: ret = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, &ret_num_devices); cout << ret; checkError(ret, "clGetDeviceIDs"); Outputs -1....

opencl - use image object with local memory

memory,opencl,gpu,shared-memory,gpgpu
I'm trying to program with OpenCL. There are two types of memory objects: one is the buffer and the other is the image. Some blogs, web sites, and white papers say 'the image object is a little bit faster than the buffer because of the cache'. I'm trying to use the image object, and the reason for that...

Can GPU be used to run programs that run on CPU?

gpu
Can the GPU be used to run programs that run on the CPU, like getting input from the keyboard and mouse, playing music, or reading the contents of a text file, using the Direct3D and OpenGL APIs?

Do gpu cores switch tasks when they're done with one?

gpu,gpgpu,c++-amp
I'm experimenting with C++ AMP. One thing that's unclear from the MS documentation is this: if I dispatch a parallel_for_each with an extent of, say, 1000, then that would mean that it spawns 1000 threads. If the GPU is unable to take on those 1000 threads at the same time, it...

Cuda program not working for more than 1024 threads

cuda,gpu,nvidia
My program is an odd-even merge sort and it doesn't work for more than 1024 threads. I have already tried increasing the block size to 100, but it still doesn't work for more than 1024 threads. I'm using Visual Studio 2012 and I have an NVIDIA GeForce 610M. This is my...

In a CNN with Caffe, can I set up an initial caffemodel?

gpu,restart,convolution,caffe
I have been training a CNN using Caffe; however, the system was forcibly terminated. I have the caffemodel produced so far. Can I restart learning from this point using the current caffemodel? Thanks,...

Unhandled exception… Access Violation reading location

c,visual-studio-2010,cuda,gpu
I'm following a tutorial on using NVIDIA/CUDA/etc. here: http://www.nvidia.com/content/gtc-2010/pdfs/2131_gtc2010.pdf I'm trying to add two vectors in parallel, but I am having trouble with these memory access violations mentioned in the title of my post. The error is occurring at my printf line (I will post my code below), but if...

Is prefix scan CUDA sample code in gpugems3 correct?

cuda,gpu,nvidia,prefix-sum
I've written a piece of code to call the kernel in gpugems3, but the results that I got are a bunch of negative numbers instead of a prefix scan. I'm wondering if my kernel call is wrong or if there is something wrong with the gpugems3 code? Here is my code: #include...

How does Tesseract use OpenCL?

parallel-processing,opencl,gpu,tesseract
I am working on a project that requires me to speed up the process of text recognition using Tesseract. I came across an article which said Tesseract is working in conjunction with OpenCL to offload some of the compute intensive tasks onto the CPU or GPUs available. Is there a...

Issues concatenating in Arrayfun with GPU processing. MATLAB

matlab,concatenation,gpu
I am having issues using arrayfun in MATLAB with GPU processing. I have simplified my situation below. I have 4 large matrices (video data as (x,y,t)). I'll use random data for this example. A = gpuArray(rand(10,10,100)); B = gpuArray(rand(10,10,100)); C = gpuArray(rand(10,10,100)); D = gpuArray(rand(10,10,100)); I wish to take each pixel...

Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu
With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...

Confusion regarding frequent updates of instanced array using glBufferSubData

c++,opengl,buffer,gpu
I'm rendering large patches of grass using instanced rendering and for that I use an instanced array consisting of a large number of 4x4 transformation matrices. I use a LOD algorithm on the grass leaves to determine which leaves to render based on their distance to the camera. For this...

How does Matlab implement GPU computation in CPU parallel loops?

matlab,parallel-processing,gpu
Can we improve performance by calculating some parts of the CPU's parfor or spmd blocks using gpuArray or GPU functions? Is this a rational way to improve performance, or are there limitations in this procedure? I read somewhere that we can use this procedure when we have some GPU units. Is...

unable to run openCV GPU cascade classifier sample

opencv,cuda,gpu
I am trying to run cascade classifier GPU sample but unfortunately I am unable to run it. Here are my observations. I compiled cascadeclassifier.cpp with the help of jetson wiki page http://elinux.org/Jetson/Installing_OpenCV g++ cascadeclassifier.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab...

Incorrect behavior of OpenCL/C++ script on CPU and GPU

c++,opencl,gpu,cpu
I have transferred an OpenCL/C++ script to a new machine (Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz, NVIDIA TESLA C2070). I ran it successfully on the GPU and got correct results; but when I tried to run it on the CPU, it gave me incorrect results (0). Then I wanted to run it on the CPU...