cuda,parallel-processing,thrust , Reduce by key on device array

Reduce by key on device array


Tag: cuda,parallel-processing,thrust

I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values .

For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2.


struct compare_int2 : public thrust::binary_function<int2, int2, bool> {
__host__ __device__ bool operator()(const int2 &a,const int2 &b) const{
return (a.x == b.x);}

compare_int2 cmp;
int main()
        int n,i;

        int2 *h_list = (int2 *) malloc(sizeof(int2)*n);
        int *h_ones = (int *) malloc(sizeof(int)*n);

        int2 *d_list,*C;
        int  *d_ones,*D;
        cudaMalloc((void **)&d_list,sizeof(int2)*n);
        cudaMalloc((void **)&d_ones,sizeof(int)*n);
        cudaMalloc((void **)&C,sizeof(int2)*n);
        cudaMalloc((void **)&D,sizeof(int)*n);

                int2 p;
                printf("Value ? ");
                scanf("%d %d",&p.x,&p.y);
                h_list[i] = p;
                h_ones[i] = 1;

        thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp);
        return 0;

The above code is showing Segmentation Fault . I ran the above code using gdb and it reported the segfault at this location.

thrust::system::detail::internal::scalar::reduce_by_key > (keys_first=0x1304740000,keys_last=0x1304740010,values_first=0x1304740200,keys_output=0x1304740400, values_output=0x1304740600,binary_pred=...,binary_op=...)

at /usr/local/cuda-6.5/bin/../targets/x86_64-linux/include/thrust/system/detail/internal/scalar/reduce_by_key.h:61 61
InputKeyType temp_key = *keys_first

How to use reduce_by_key on device arrays ?


Thrust interprets ordinary pointers as pointing to data on the host:

    thrust::reduce_by_key(d_list, d_list+n, d_ones, C, D,cmp);

Therefore thrust will call the host path for the above algorithm, and it will seg fault when it attempts to dereference those pointers in host code. This is covered in the thrust getting started guide:

You may wonder what happens when a "raw" pointer is used as an argument to a Thrust function. Like the STL, Thrust permits this usage and it will dispatch the host path of the algorithm. If the pointer in question is in fact a pointer to device memory then you'll need to wrap it with thrust::device_ptr before calling the function.

Thrust has a variety of mechanisms (e.g. device_ptr, device_vector, and execution policy) to identify to the algorithm that the data is device-resident and the device path should be used.

The simplest modification for your existing code might be to use device_ptr:

#include <thrust/device_ptr.h>
thrust::device_ptr<int2> dlistptr(d_list);
thrust::device_ptr<int>  donesptr(d_ones);
thrust::device_ptr<int2> Cptr(C);
thrust::device_ptr<int>  Dptr(D);
thrust::reduce_by_key(dlistptr, dlistptr+n, donesptr, Cptr, Dptr,cmp);

The issue described above is similar to another issue you asked about.


OpenMP shared variable seems to be private

I don't understand why in this code only the thread 0 has n = 1 while the other ones have n = 0 with shared n: int main() { int n, tid; #pragma omp parallel shared(n) private(tid) { tid = omp_get_thread_num(); n = 0; if (tid == 0) { n++;...

How to perform synchronous parallel functions in Clojure?

I have an application which has an initial flow that can be done in parallel: Fetch two JSON documents (I use clj-http for that) Parse these documents (extract only required data) Join the results Dump them into a one file So there's something like that: some-entry-point /\ / \ /...

How to measure if a program was run in parallel over multiple cores in Linux?

I want to know if my program was run in parallel over multiple cores. I can get the perf tool to report how many cores were used in the computation, but not if they were used at the same time (in parallel). How can this be done?...

cudaMalloc vs cudaMalloc3D performance for a 2D array

I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...

clEnqueueNDRangeKernel fills up entire memory

I am currently trying to write an OpenCL application doing some memory intensive calculations. To track the progress of all the calculations I created a for loop which creates different kernel groups. Unfortunately, the calculation fills up my whole memory. My guess is that the kernels are not done executing...

running a thread in parallel

I am inside a threat updating a graph and I go into a routine that makes a measurement for 4 seconds. The routine returns a double. What I am noticing is that my graph stops showing activity for 4 seconds until I am done collecting data. I need to start...

Faster Matrix Multiplication in CUDA

Currently, I made a neural networks program in the cuda c. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM. I use the following code for MM. I was wondering if any one has some advice to make it faster which can be very...

direct global memory access using cuda

q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; q2- would i have to include pointer to global memory array as argument in...

how to make the program pause when actor is running

For example import scala.actors.Actor import scala.actors.Actor._ object Main { class Pong extends Actor { def act() { var pongCount = 0 while (true) { receive { case "Ping" => if (pongCount % 1000 == 0) Console.println("Pong: ping "+pongCount) sender ! "Pong" pongCount = pongCount + 1 case "Stop" => Console.println("Pong:...

Access violation reading location when calling cudaMemcpy2DToArray

I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...

Java 8 parallelStream for concurrent Database / REST call

Here I am using Javaparallel stream to iterate through a List and calling a REST call with each list element as input. I need to add all the results of the REST call to a collection for which I am using an ArrayList. The code given below is working fine...

How many parallel threads i can run on my nvidia graphic card in cuda programming?

Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...

Using a data pointer with CUDA (and integrated memory)

I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...

How does CUDA's cudaMemcpyFromSymbol work?

I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

Async await usage for MongoDB repository

I have a MongoDB repository class as you see below: public class MongoDbRepository<TEntity> : IRepository<TEntity> where TEntity : EntityBase { private IMongoClient client; private IMongoDatabase database; private IMongoCollection<TEntity> collection; public MongoDbRepository() { client = new MongoClient(); database = client.GetDatabase("Test"); collection = database.GetCollection<TEntity>(typeof(TEntity).Name); } public async Task Insert(TEntity entity) { if...

Threads syncronization in CUDA

I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: As you can see, when...


I have a very simple scala jcuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...

Reduce by key on device array

I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...

Parallel.ForEach loop is performing like a serial loop

I've spent about 8+ hours searching online for help and I couldn't find anything so, here goes. I'm working with Team Foundation Server and C# and I'm trying to acquire a list of work items and convert them into a generic object we made to be bound to a special...

cuda-memcheck fails to detect memory leak in an R package

I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. $ cuda-memcheck --leak-check full...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...

how to generalize square matrix multiplication to handle arbitrary dimensions

I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...

Java 8, using .parallel in a stream causes OOM error

In the book Java 8 In Action, section 7.1.1, the authors state that a stream can benefit from parallel processing by adding the function .parallel(). They provide a simple method called parallelSum(int) to illustrate this. I was curious to see how well it worked so I executed this code: package...

NVCC CUDA cross compiling cannot find “-lcudart”

I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA$ nvcc -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/ when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...

How do you build the example CUDA Thrust device sort?

I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the THURST install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...

Tesla k20m interoperability with Direct3D 11

I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

How to perform parallel processes for different groups in a folder?

I have a folder containing a lot of images. I have a code which transforms these images into black and white format and then use tesseract to convert them into text files. I have been using the following code to split these files into subgroups: i=0; for f in *;...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

Is prefix scan CUDA sample code in gpugems3 correct?

I've written a piece of code to call the kernel in gpugem3 but the results that I got is a bunch of negative numbers instead of prefix scan. I'm wondering if my kernel call is wrong or there is something wrong with the gpugem3 code? here is my code: #include...

Running java application parallely on multicore cluster nodes

I have a Java program that calculates semantic similarity between two documents. The program retrieves documents from a specified file system and calculates similarity. There are around 2,00,000 such docs. I have created 10 threads for this task and I have assigned data chucks to each of the thread. For...

Basic Java threading (4 threads) slower than non-threading

I have a four core CPU. I create 4 threads and run a cpu intensive loop, and it takes > 4x longer than running it all procedurally in one thread. I created two projects to compare, one with threading and one without. I'll show the code and run times. Just...

ElasticSearch Multiple Scrolls Java API

I want to get all data from an index. Since the number of items is too large for memory I use the Scroll (nice function): client.prepareSearch(index) .setTypes(myType).setSearchType(SearchType.SCAN) .setScroll(new TimeValue(60000)) .setSize(amountPerCall) .setQuery(MatchAll()) .execute().actionGet(); Which works nice when calling: client.prepareSearchScroll(scrollId) .setScroll(new TimeValue(600000)) .execute().actionGet() But, when I call the former method multiple times,...

Reactive pipeline - how to control parallelism?

I'm building a straightforward processing pipeline where an item is fetched as an input, it is being operated by multiple processors in a sequential manner and finally it is output. Image below describes the overall architecture: The way it is currently working: Pipeline is fetching items from the provider as...

using “foreach” for running different classifiers in R

I am trying to use foreach to run different classifiers on my data, but it doesn't work. In fact it doesn't return me anything. my purpose is to parallelize my process. here is the simplified of my code: library(foreach) library(doParallel) no_cores <- detectCores() - 1 cl<-makeCluster(no_cores) registerDoParallel(cl) registerDoParallel(no_cores) model_list<-foreach(i =...

OpenMP Matrix-Vector Multiplication Executes on Only One Thread

I have this code (outlined below) for parallelizing matrix-vector multiplication. But whenever I run it, I discover that it is executing on just one thread (even though I specified 4). How can I separate parts of the parallel code to run on separate threads. Any help will be highly appreciated....

How to run DEoptim in parallel?

have developed a hydrological model in R, which has 8 parameters in the function, and I wish to calibrate the model using DEoptim. Due to the length of time it takes to run each function I would like to parallel-ise the DEoptim function as it has an option to run...

Understanding Memory Replays and In-Flight Requests

I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...

Update a D3D9 texture from CUDA

I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...

Understanding Dynamic Parallelism in CUDA

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...

What is version of cuda for nvidia 304.125

I am using ubuntu 14.04. I want to install CUDA. But I don't know which version is good for my laptop. I trace my drive that is $cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...

Julia parallel programming - Making existing function available to all workers

I am faced with the following problem: I have a function called TrainModel that runs for a very long time on a single thread. When it finishes computing, it returns a function as an output argument, let's call it f. When I enquire the type of this f, Julia returns:...

Can an unsigned long long int be used to store the output from clock64()?

I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?

Multiprocessing a python script

I learnt about the multiprocessing tool in python: Say I have a python program which is complicated and fleshed out, but it does not use up all my cores when running. So it uses 100% of one core and takes forever to complete. It is hard for me to...

How to load data in global memory into shared memory SAFELY in CUDA?

My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...

Algorithm for [inclusive/exclusive]_scan in parallel proposal N3554

Proposal N3554 (A Parallel Algorithms Library) for C++14, proposes (among other things), what seem to be parallel versions of the current std::partial_sum, e.g.: template< class ExecutionPolicy, class InputIterator, class OutputIterator, class BinaryOperation> OutputIterator inclusive_scan( ExecutionPolicy &&exec, InputIterator first, InputIterator last, OutputIterator result, BinaryOperation binary_op); With the explanation Effects: For each...

Proper way to handle a group of asynchronous calls in parallel

I have a console app that calls a web api and gets a list of services. It then loops through and makes calls to each service. I have the following: static int Main(string[] args) { ... Task.WaitAll(Process()); } private static async Task BeginProcess() { using(HttpClientHandler handler = new HttpClientHandler()) {...

How can I pass a struct to a kernel in JCuda

I have already looked at this which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...