How to do calculation using OpenGL ES 2.0/3.0?

opengl-es,gpgpu
I'm wondering whether it is possible to do array calculations using OpenGL ES on mobile devices. For example, I used glTexImage2D to pass the shader a float array (which contains some 0.0s and 1.0s, such as {0.0, 1.0, 1.0, 0.0, 0.0...}), and I wish to figure out how...
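
A minimal host-side sketch of the usual ES 2.0 pattern (render-to-texture GPGPU), assuming the OES_texture_float extension is available; W, H, and data stand in for the caller's array and are not from the original question:

    /* Upload the float array as a W x H texture (needs OES_texture_float on ES 2.0) */
    GLuint tex;
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, W, H, 0, GL_LUMINANCE, GL_FLOAT, data);
    /* Then: attach an output texture to a framebuffer object, draw a full-screen
       quad whose fragment shader does the per-element arithmetic, and read the
       result back with glReadPixels. */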

Making some, but not all, (CUDA) memory accesses uncached

caching,cuda,gpgpu
I just noticed that it is possible at all to have (CUDA kernel) memory accesses uncached (see e.g. this answer here on SO). Can this be done... for a single kernel individually? At run time rather than at compile time? For writes only rather than for reads and writes? ...
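
For reference, the compile-time route is -Xptxas -dlcm=cg, which affects the whole compilation unit; per-instruction control is possible with inline PTX, which also covers the per-kernel and run-time parts since it applies only where you use it. A minimal sketch (the .cg qualifier bypasses L1 and caches in L2 only; st.global.wt gives a write-through store):

    __device__ float load_bypass_l1(const float* p) {
        float v;
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
        return v;
    }

    __device__ void store_writethrough(float* p, float v) {
        asm volatile("st.global.wt.f32 [%0], %1;" :: "l"(p), "f"(v));
    }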

OpenCL strange behavior

opencl,gpgpu
Good day. I think I've tried everything to figure out where the problem is, but I couldn't find it. I have the following code for the host:

    cl_mem cl_distances = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                         2 * sizeof(cl_uint), NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_distances);
    cl_event event;
    clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                           &global_workers, &local_workers, 0, NULL, &event);

...
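
The excerpt cuts off before the actual symptom, but code in this shape is worth hardening first: every call above discards its status, and nothing waits on the event. A hedged rewrite of the same calls with checks:

    cl_int err;
    cl_mem cl_distances = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                         2 * sizeof(cl_uint), NULL, &err);
    /* check err == CL_SUCCESS after every call */
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &cl_distances);
    cl_event event;
    err = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                 &global_workers, &local_workers, 0, NULL, &event);
    err = clWaitForEvents(1, &event);  /* ensure the kernel finished before reading results */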

How to explain this GPU computation?

image-processing,opencl,gpgpu
I am working on implementing a 2D image-processing kernel on my GPU using OpenCL, and I am getting very puzzling results. The code uses a 2x2 stencil, computes the average of the input samples in the stencil, and adds the computed average to each sample...
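
A hedged sketch of what such a kernel typically looks like; the kernel name and the exact stencil placement are assumptions, since the excerpt is truncated:

    __kernel void stencil_avg(__global const float* in, __global float* out,
                              int width, int height) {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x >= width - 1 || y >= height - 1) return;  /* stay inside the 2x2 window */
        float avg = 0.25f * (in[y*width + x]     + in[y*width + x + 1] +
                             in[(y+1)*width + x] + in[(y+1)*width + x + 1]);
        out[y*width + x] = in[y*width + x] + avg;       /* add the average to the sample */
    }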

Is sort_by_key in thrust a blocking call?

cuda,gpgpu,thrust
I repeatedly enqueue a sequence of kernels:

    for 1..100:
        for 1..10000:
            // Enqueue GPU kernels
            Kernel 1 - update each element of array
            Kernel 2 - sort array
            Kernel 3 - operate on array
        end
        // run some CPU code
        output "Waiting for GPU to finish"
        // copy from device...
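
Thrust makes no blanket guarantee either way, but sort_by_key in particular allocates temporary storage, which tends to force a synchronization. A hedged way to observe this from the host (stream 0 is the default stream):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        thrust::device_vector<int> keys(1 << 20), vals(1 << 20);
        thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());
        /* If sort_by_key returned before the GPU finished, the stream is still busy */
        cudaError_t s = cudaStreamQuery(0);
        std::printf("stream is %s after sort_by_key\n",
                    s == cudaSuccess ? "idle (call blocked)"
                                     : "busy (call was asynchronous)");
        return 0;
    }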

Is it possible to say which pointer was allocated by cudaMalloc and which by malloc?

c,memory-management,cuda,gpgpu,nvidia
For example, I have a float pointer in the host code: float *p. Is it possible to determine the type of memory (device/host) to which it points?...
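
With unified virtual addressing (64-bit process, Fermi or later), the runtime can answer this. A minimal sketch using cudaPointerGetAttributes; note the attribute field is named memoryType in older toolkits and type in newer ones:

    #include <cuda_runtime.h>

    int is_device_pointer(const void* p) {
        cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
            cudaGetLastError();  /* plain malloc'd pointers report an error; clear it */
            return 0;
        }
        return attr.type == cudaMemoryTypeDevice;
    }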

What is the difference between the CUDA toolkit and the CUDA SDK

cuda,gpgpu,nvidia
I am installing CUDA on Ubuntu 14.04 with a Maxwell card (GTX 9** series), and I think I have installed everything properly with the toolkit, as I can compile the samples. However, I have read in places that I should install the SDK (this appears to be talked about...

How do images work in opencl kernel?

c++,c,opencl,gpgpu
I'm trying to find ways to copy multidimensional arrays from host to device in OpenCL, and thought one approach was to use an image... which can be a 1-, 2-, or 3-dimensional object. However, I'm confused because when reading a pixel from an array, the examples use vector datatypes. Normally...
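
The vector types are inherent to the image API: image reads always return a 4-component vector regardless of the channel order the image was created with. A minimal kernel sketch:

    __kernel void read_image(__read_only image2d_t src, __global float4* dst, int width) {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
        int x = get_global_id(0);
        int y = get_global_id(1);
        float4 px = read_imagef(src, smp, (int2)(x, y));  /* always a float4 */
        dst[y * width + x] = px;
    }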

Unable to generate gpg keys in linux [closed]

linux,gpgpu,gnupg,gpgme
I'm not able to generate GPG keys in Linux.

    sudo gpg --gen-key   # the command used to try to generate the key

The error:

    You need a Passphrase to protect your secret key.
    gpg: problem with the agent: Timeout
    gpg: Key generation canceled.

Please let me know what I'm doing wrong...

opencl - use image object with local memory

memory,opencl,gpu,shared-memory,gpgpu
I'm trying to program with OpenCL. There are two types of memory objects: one is the buffer and the other is the image. Some blogs, web sites, and white papers say the image object is a little faster than the buffer because of caching. I'm trying to use an image object, and the reason for that...

local and global work sizes in OpenCL

parallel-processing,opencl,gpgpu
I am trying to learn OpenCL, but there is a source of confusion I do not understand right now; it is related to lines such as:

    size_t global_item_size = LIST_SIZE; // Process the entire lists
    size_t local_item_size = 64;         // Divide work items into groups of 64
    ret = clEnqueueNDRangeKernel(command_queue,...
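
For context on those two lines: the global size is the total number of work-items launched (one per list element here), the local size is how many of them form one work-group (sharing local memory and barriers), and in OpenCL 1.x the global size must be an exact multiple of the local size or the enqueue fails with CL_INVALID_WORK_GROUP_SIZE. An annotated restatement of the excerpt's call:

    size_t global_item_size = LIST_SIZE;  /* total work-items: one per element */
    size_t local_item_size  = 64;         /* work-items per group              */
    /* LIST_SIZE must be a multiple of 64 here, giving LIST_SIZE/64 groups     */
    cl_int ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
                                        &global_item_size, &local_item_size,
                                        0, NULL, NULL);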

Monetdb, cuda, opencl, amd bolt

gpgpu,computed-columns,monetdb
I'm wondering if it's possible to create a user function that calls into any of the GPGPU languages. Is it possible to invoke CUDA functions from MonetDB? I've been reading about AMD Bolt; will it be possible to use it with MonetDB? Thanks...

accessing file system using cpu device in opencl

opencl,gpgpu
I am a newbie to OpenCL. I have a doubt about how OpenCL functions when a kernel is running on a CPU device. Suppose we have a kernel running on a CPU device: can it read from a file on the disk? If yes, then how? If no, then why not?...

Linear algebra libraries and dynamic parallelism in CUDA

cuda,gpu,gpgpu
With the advent of dynamic parallelism in CUDA architectures 3.5 and above, is it possible to call linear algebra libraries from within __device__ functions? Can the cuSOLVER library in CUDA 7 be called from a kernel (__global__) function?...

Why does the “cpu” accelerator report “No” for the supports_double_precision data member?

visual-c++,gpgpu,c++-amp
If you check the "cpu" accelerator with MS C++ AMP, you will get "no" for the supports_double_precision. Now, I was under the impression that a CPU has better precision than a GPU... is this just because MSVC++'s math library is not precise enough? Example code to get the output: #include...
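
A short sketch to enumerate the accelerators and their double-precision flags; the "cpu" accelerator is a data-staging device rather than an execution target, which is why it reports no compute capabilities at all:

    #include <amp.h>
    #include <iostream>

    int main() {
        for (const auto& acc : concurrency::accelerator::get_all()) {
            std::wcout << acc.description << L": double precision "
                       << (acc.supports_double_precision ? L"yes" : L"no") << L"\n";
        }
    }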

DXGI_ERROR_DEVICE_HUNG resulting from concurrency::copy on C++ AMP

gpgpu,direct3d11,c++-amp
I have created some C++ AMP code for performing background gradient removal on astronomical images. They come in as 16-bit unsigned integers for RGB. All of my application's processing and output occurs in single precision floating point, so I convert the input data, run the C++ AMP code, and then...

Differences between clBLAS and ViennaCL?

opencl,gpgpu,viennacl
Looking at the OpenCL libraries out there, I am trying to get a complete grasp of each one. One library in particular is clBLAS; its website states that it implements BLAS levels 1, 2, and 3. That is great, but ViennaCL also has BLAS routines, linear algebra solvers, and supports OpenCL...

Coprocessor accelerators compared to GPUs

gpu,intel,gpgpu,hardware-acceleration,xeon-phi
Are coprocessors like the Intel Xeon Phi supposed to be utilized much like GPUs, where one offloads a large number of blocks executing a single kernel so that only the overall throughput the coprocessor handles results in a speed-up, or does offloading independent threads (tasks) increase the efficiency...

Basic GPU application, integer calculations

c,gpu,gpgpu
Long story short, I have done several prototypes of interactive software. I use pygame now (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C now, and at the same time I am searching for existing possibilities to use some GPU power to take load off the CPU...

Percentile of array (in CUDA ) without sort?

cuda,gpgpu,percentile
I have a 2560x2048 array of float values (5,242,880 as a 1D vector) for which I need the 25th and 75th percentile values. My first thought was to use a bitonic sort and fetch the values at 25% and 75%, but the bitonic sort I have is for power-of-...
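
A sort-free alternative is a histogram plus prefix scan: bucket the 5,242,880 samples (one atomicAdd per sample in a small kernel), scan the bin counts, and take the first bin whose cumulative count reaches the target rank, refining within that bin if more precision is needed. A hedged sketch of the selection step in Thrust:

    #include <thrust/device_vector.h>
    #include <thrust/scan.h>
    #include <thrust/binary_search.h>

    /* hist: per-bin counts already built on the device; q in [0,1]; n = sample count */
    int percentile_bin(const thrust::device_vector<int>& hist, long long n, double q) {
        thrust::device_vector<long long> cum(hist.size());
        thrust::inclusive_scan(hist.begin(), hist.end(), cum.begin());
        long long rank = (long long)(q * n);
        /* cum is non-decreasing, so lower_bound yields the first bin reaching the rank */
        return (int)(thrust::lower_bound(cum.begin(), cum.end(), rank) - cum.begin());
    }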

Accelerating __device__ function in Thrust comparison operator

cuda,parallel-processing,gpgpu,thrust
I'm running a Thrust parallelized binary search-type routine on an array:

    // array and array2 are raw pointers to device memory
    thrust::device_ptr<int> array_ptr(array);

    // Search for first position where 0 could be inserted in array
    // without violating the ordering
    thrust::device_vector<int>::iterator iter;
    iter = thrust::lower_bound(array_ptr, array_ptr+length, 0, cmp(array2));

A custom...

OpenCL kernel definition syntax

opencl,gpgpu
I'm trying to clarify some structs and syntax in OpenCL. Currently I'm working with VS2013 and the OpenCL Emulator-Debugger. I started with the demo project that comes with the emulator and got stuck on this:

    __Kernel(hello)
    __ArgNULL
    {
        ...
    }

Just two lines above there is this:

    //__kernel void
    //hello()

What's...
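
For comparison, those macros stand in for the standard OpenCL C form, which the commented-out lines above them hint at; a plain kernel is declared as:

    __kernel void hello() {
        /* standard OpenCL C kernel declaration: __kernel qualifier, void return */
    }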

CUDA cublasGetMatrix / cublasSetMatrix fails | Explanation of arguments

cuda,gpgpu,gpu-programming,cublas
I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12], stored in column-major format as x, by first copying it to a matrix d_x on an NVIDIA GPU using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...
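
For reference, both calls take (rows, cols, elemSize, src, lda, dst, ldb), where the leading dimensions are the row counts of the column-major allocations. A hedged sketch for the 3x4 matrix above:

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main() {
        float x[12] = {1,5,9,  2,6,10,  3,7,11,  4,8,12};  /* 3x4 column-major, lda = 3 */
        float y[12];
        float* d_x;
        cudaMalloc((void**)&d_x, 12 * sizeof(float));
        cublasSetMatrix(3, 4, sizeof(float), x, 3, d_x, 3);  /* host -> device */
        cublasGetMatrix(3, 4, sizeof(float), d_x, 3, y, 3);  /* device -> host */
        cudaFree(d_x);
        return 0;
    }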

Not all work-items being used opencl

c++,linux,opencl,gpgpu,nvidia
So I'm able to compile and execute my kernel; the problem is that only two work-items are being used. I'm basically trying to fill a float array[8] with {0,1,2,3,4,5,6,7}, so this is a very simple hello-world application. Below is my kernel:

    // Highly simplified to demonstrate
    __kernel void...
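
A hedged sketch of the intended pattern; when only two work-items appear to run, the first things to check are the global size passed to clEnqueueNDRangeKernel and whether the kernel indexes with get_global_id rather than get_local_id:

    __kernel void fill(__global float* out) {
        size_t i = get_global_id(0);  /* one work-item per array element */
        out[i] = (float)i;
    }
    /* Host side: the global work size must be 8 for float array[8]:
       size_t global = 8;
       clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL); */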

OpenCL barrier of finding max in a block

parallel-processing,max,opencl,gpgpu,barrier
I've found a piece of OpenCL kernel sample code on Nvidia's developer site. The purpose of the function maxOneBlock is to find the biggest value in the array maxValue and store it in maxValue[0]. I fully understood the looping part, but I am confused about the unrolled part: why does the unrolled part...
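
The unrolled part is the classic last-warp trick: once the number of active work-items fits within one hardware warp, the sample drops the loop counter and, on the hardware it was written for, the barriers, relying on the warp executing in lock-step. A hedged sketch of the pattern, assuming a work-group size of 64 and scratch sized to it; note that scratch must be volatile for the barrier-free tail, and implicit warp synchronism is not portable to all devices:

    __kernel void max_reduce(__global const float* in, __global float* out,
                             __local volatile float* scratch) {
        uint lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);
        for (uint s = get_local_size(0) / 2; s > 32; s >>= 1) {
            if (lid < s) scratch[lid] = max(scratch[lid], scratch[lid + s]);
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid < 32) {  /* last warp: fully unrolled, no loop, no barriers */
            scratch[lid] = max(scratch[lid], scratch[lid + 32]);
            scratch[lid] = max(scratch[lid], scratch[lid + 16]);
            scratch[lid] = max(scratch[lid], scratch[lid + 8]);
            scratch[lid] = max(scratch[lid], scratch[lid + 4]);
            scratch[lid] = max(scratch[lid], scratch[lid + 2]);
            scratch[lid] = max(scratch[lid], scratch[lid + 1]);
        }
        if (lid == 0) out[get_group_id(0)] = scratch[0];
    }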

What's the difference between parallel cost and parallel work?

performance,algorithm,optimization,parallel-processing,gpgpu
I read a paper in which the parallel cost of (parallel) algorithms is defined as CP(n) = p * TP(n), where p is the number of processors, TP(n) the parallel processing time, and n the input size. An algorithm is cost-optimal if CP(n) is approximately constant, i.e. if the algorithm uses two processors...
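
A worked example separates the two notions. Summing n numbers with a binary tree on p = n/2 processors takes TP(n) = O(log n), so the parallel cost is CP(n) = p * TP(n) = (n/2) * O(log n) = O(n log n); the parallel work, however, counts only operations actually performed, n - 1 additions in total, so W(n) = O(n). Cost charges idle processors for every step; work does not, which is why this algorithm is work-optimal but not cost-optimal.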

Thrust : reduce_by_key is slower than expected

performance,cuda,parallel-processing,gpgpu,thrust
I have the following code:

    thrust::device_vector<int> unique_idxs(N);
    thrust::device_vector<int> sizes(N);
    thrust::pair<thrust::device_vector<int>::iterator,
                 thrust::device_vector<int>::iterator> new_end =
        reduce_by_key(idxs.begin(), idxs.end(),
                      thrust::make_constant_iterator(1),
                      unique_idxs.begin(), sizes.begin());
    int unique_elems = new_end.first - unique_idxs.begin();
    sizes.erase(new_end.second, sizes.end());

where idxs is a sorted device vector of indices, unique_idxs are the unique indices, and sizes...

Function pointer (to other kernel) as kernel arg in CUDA

c++,cuda,function-pointers,gpgpu
With dynamic parallelism in CUDA, you can launch kernels on the GPU side, starting from a certain version. I have a wrapper function that takes a pointer to the kernel I want to use, and it either does this on the CPU for older devices, or on the GPU for...
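
A hedged sketch of the device side of that pattern (compute capability 3.5+, compiled with -rdc=true; all names are illustrative). The catch is obtaining the pointer: taking &kernel on the host does not give a valid device function pointer, so it is usually fetched from a __device__ variable via cudaMemcpyFromSymbol:

    typedef void (*kernel_fn)(int*);

    __global__ void work(int* data) { data[threadIdx.x] *= 2; }

    /* device-side copy of the pointer, readable with cudaMemcpyFromSymbol */
    __device__ kernel_fn d_work = work;

    __global__ void launcher(kernel_fn f, int* data, int n) {
        if (threadIdx.x == 0)
            f<<<1, n>>>(data);  /* dynamic-parallelism launch through the pointer */
    }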

OpenCL: 2-10x run time summing columns rather than rows of a square array?

opencl,gpgpu
I'm just getting started with OpenCL, so I'm sure there are a dozen things I can do to improve this code, but one thing in particular is standing out to me: if I sum columns rather than rows (basically contiguous versus strided, because all buffers are linear) in a 2D array...
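
The gap is almost certainly memory coalescing: what matters on a GPU is not whether one work-item walks contiguous memory, but whether adjacent work-items touch adjacent addresses in the same instruction. A hedged sketch of the friendly pattern for a row-major N x N array:

    /* Work-item i accumulates column i: in each loop iteration the work-items
       of a warp read consecutive addresses a[row*N + i], which coalesces.
       Having each work-item walk its own contiguous row (a[i*N + col]) puts
       neighbouring work-items N floats apart and serializes the transactions. */
    __kernel void sum_cols(__global const float* a, __global float* out, const int N) {
        int i = get_global_id(0);
        float s = 0.0f;
        for (int row = 0; row < N; ++row)
            s += a[row * N + i];
        out[i] = s;
    }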

Do gpu cores switch tasks when they're done with one?

gpu,gpgpu,c++-amp
I'm experimenting with C++ AMP. One thing that's unclear from the MS documentation is this: if I dispatch a parallel_for_each with an extent of, say, 1000, then that would mean it spawns 1000 threads. If the GPU is unable to take on those 1000 threads at the same time, it...

Optimizing local memory use with OpenCL

memory-management,opencl,gpgpu
OpenCL is of course designed to abstract away the details of hardware implementation, so going down too much of a rabbit hole with respect to worrying about how the hardware is configured is probably a bad idea. Having said that, I am wondering how much local memory is efficient to...
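
One portable data point is to ask the device and kernel directly rather than guessing at the hardware; a short sketch (device and kernel come from the surrounding setup code):

    cl_ulong local_mem = 0;  /* total local memory per compute unit, often 16-64 KiB */
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);

    size_t max_wg = 0;       /* largest work-group this kernel can launch with */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);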

Stream compaction with Thrust; best practices and fastest way?

c++,cuda,gpgpu,thrust,sparse-array
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...
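
For reference, the canonical Thrust form of keep-the-nonzeros is copy_if with a predicate; a minimal self-contained sketch:

    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <vector>

    struct is_nonzero {
        __host__ __device__ bool operator()(int x) const { return x != 0; }
    };

    int main() {
        std::vector<int> h = {0, 3, 0, 0, 7, 1, 0, 4};
        thrust::device_vector<int> in(h.begin(), h.end());
        thrust::device_vector<int> out(in.size());
        auto end = thrust::copy_if(in.begin(), in.end(), out.begin(), is_nonzero());
        out.resize(end - out.begin());  /* out now holds {3, 7, 1, 4} */
        return 0;
    }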

Coalesced memory access to 2d array with CUDA

c++,arrays,cuda,gpgpu,nvidia
I'm working on a piece of CUDA C++ code and need each thread to, essentially, access a 2D array in global memory in BOTH row-major AND column-major order. Specifically, I need each thread block to: generate its own 1D array (let's say, gridDim elements); write these to global memory; read...

sending oclMat to function creates huge difference in runtime

opencv,image-processing,opencl,gpgpu
I wrote a function (masking) with 3 inputs: inputOCL, an oclMat; comparisonValue, a double value; and method, an int determining the comparison method. For my example I chose method=1, which stands for CMP_GT, testing whether inputOCL > comparisonValue element-wise. The purpose of the function is to zero out all...

OpenCL / AMD: Deep Learning

sdk,opencl,neural-network,gpgpu,deep-learning
While "googl'ing" and doing some research I were not able to find any serious/popular framework/sdk for scientific GPGPU-Computing and OpenCL on AMD hardware. Is there any literature and/or software I missed? Especially I am interested in deep learning. For all I know deeplearning.net recommends NVIDIA hardware and CUDA frameworks. Additionally...