cuda , Can an unsigned long long int be used to store the output from clock64()?


Can an unsigned long long int be used to store the output from clock64()?

Question:

Tag: cuda

I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?


Answer:

There are various atomic functions which support atomic operations on unsigned long long int (ie. a 64-bit unsigned integer), such as atomicCAS, atomicExch and atomicAdd. And if you have a cc3.5 or higher GPU you have even more options.

Referring to the documentation on clock64():

long long int clock64(); when executed in device code, returns the value of a per-multiprocessor counter that is incremented every clock cycle.

So, since it is a 64-bit signed quantity, it is bit-wise identical to an unsigned long long int until it becomes negative. Let's assume the counter is reset to zero either at the start of your kernel, the start of the cuda context, or machine power-on. This counter will not become negative until around:

2^63(cycles)/1,000,000,000(cycles/s) = ~292 years after whichever of the above events is the actual reset point.

(I'm using 1GHz here as an estimate of the GPU core clock)

So for the first 200-300 years (after machine power-on, let's say), the clock64() function will not return a negative value. So I'd say it's pretty safe to consider it as "always" positive, and therefore always identical to unsigned long long int, meaning you can safely cast it to that, and use it in one of the atomic functions that support unsigned long long int.

On the other hand, it's probably not safe to cast it into an unsigned quantity. That arithmetic would be:

2^32(cycles)/1,000,000,000(cycles/s) = ~4 seconds (after machine power on)

So in about 4 seconds, the clock64() function will numerically exceed the value that can be safely recorded in an unsigned quantity.


Related:


How to pass struct containing array to the kernel in CUDA?


c,arrays,cuda
In the following code I have an array in a struct which I need to pass to the kernel function. I can't seem to find the proper way. I tried looking at other posts on SO but do not understand their methods that well. In my actual code, I receive...

Running CUDA programs on Quadro K620m


cuda,nvidia
I have laptop which has Quadro K620m GPU. I am trying to learn CUDA programming and downloaded the network installer from NVIDIA site. During CUDA SDK installation, just when its checking the hardware of the machine, it displays Do you want to Continue? This graphics driver could not find compatible...

Amount of cores per SM and threads per block in CUDA


cuda
As NVIDIA GPU evolve the amount of cores per SM changes: in Fermi we have 32 of them, but in Maxwell the number is 128 according to the white papers. So, my questions are following: Is that better to create grids with blocks, containing 128 threads each? Will such code...

Linear algebra libraries and dynamic parallelism in CUDA


cuda,gpu,gpgpu
With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...

cuMemcpyDtoH yields CUDA_ERROR_INVALID_VALUE


java,scala,ubuntu,cuda,jcuda
I have a very simple scala jcuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...

Can an unsigned long long int be used to store the output from clock64()?


cuda
I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?

Tesla k20m interoperability with Direct3D 11


cuda,direct3d,tesla
I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

How can I pass a struct to a kernel in JCuda


java,struct,cuda,jni,jcuda
I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...

Problems with floating-point additions. Ignoring some small values


math,cuda,floating-point
I'm looking up a book about CUDA. On the chapter which explains the floating points of CUDA, I found something odd. The book says that (1.00 * 1) + (1.00 * 1) + (1.00 * 0.01) + (1.00* 0.01) = 10. All the numbers are binaries. 0.01 refers to decimal...

cuda-memcheck fails to detect memory leak in an R package


r,memory-leaks,cuda,valgrind
I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. $ cuda-memcheck --leak-check full...

How do you build the example CUDA Thrust device sort?


c++,visual-studio-2010,sorting,cuda,thrust
I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the THURST install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...

Is prefix scan CUDA sample code in gpugems3 correct?


cuda,gpu,nvidia,prefix-sum
I've written a piece of code to call the kernel in gpugem3 but the results that I got is a bunch of negative numbers instead of prefix scan. I'm wondering if my kernel call is wrong or there is something wrong with the gpugem3 code? here is my code: #include...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?


cuda
There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: http://i.stack.imgur.com/rT676.png As you can see, when...

how to generalize square matrix multiplication to handle arbitrary dimensions


c,cuda,parallel-processing,matrix-multiplication
I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...

Reduce by key on device array


cuda,parallel-processing,thrust
I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...

Access violation reading location when calling cudaMemcpy2DToArray


c++,arrays,opencv,cuda
I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...

Why use memset when using CUDA?


c,cuda,nvidia
I saw in a CUDA code example that memset is used to initialize vectors to all 0's that will store the sum of two others vectors. For example: hostRef = (float *)malloc(nBytes); gpuRef = (float *)malloc(nBytes); memset(hostRef, 0, nBytes); memset(gpuRef, 0, nBytes); What purpose does this serve if nothing else...

Do I need to free device_ptr returned by thrust?


c++,pointers,cuda,thrust
I have a function to get the minimum value of an array and it's executed within a loop. thrust::device_ptr<float> min_ptr = thrust::min_element(populationFitness, populationFitness + POPULATION); Do I have to free the returned device_ptr? I tried with thrust::device_free(min_ptr) but an exception is thrown....

purposely causing bank conflicts for shared memory on CUDA device


cuda,gpu,shared-memory,bank-conflict
It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 //[email protected] __device__ int inwarpD[nblc]; __global__ void kernel(){...

Using a data pointer with CUDA (and integrated memory)


c++,memory-management,cuda
I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h


ubuntu,cuda,mpi
I am building openMPI 1.8.5 on Ubuntu 12.04 with CUDA 6.5 installed and tested with default samples. I intend to run it on a single node with following configuration: Dell Precision T7400 Dual Xeon X5450 Nvidia GT730/Tesla C1060 The configure command issued was $ ./configure --prefix=/usr --with-cuda=/usr/local/cuda In the generated...

Stream compaction with Thrust; best practices and fastest way?


c++,cuda,gpgpu,thrust,sparse-array
I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...

Understanding Dynamic Parallelism in CUDA


multithreading,cuda
Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...

NVCC CUDA cross compiling cannot find “-lcudart”


linux,cuda,ld,nvcc
I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA$ nvcc helloworld.cu -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/libcudart.so when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...

cudaMemcpyToSymbol in pycuda


python,cuda,pycuda
I am using pycuda and i would like to know if there is an equivalent to the function cudaMemcpyToSymbol I would like to copy a constant from the host to the device like below import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy from sys import path...

CUDA strange behavior accessing vector


c++,cuda
I have implemented a simple fft program in cuda. This is the kernel function: __global__ void fftKernel(cuComplex* dev_samples, size_t length, size_t llog, Direction direction) { int tid = threadIdx.x + blockDim.x * blockIdx.x; if (tid < length / 2) { // First step, sorts data with bit reversing and compute...

cudaMalloc vs cudaMalloc3D performance for a 2D array


c,cuda
I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments


cuda,gpgpu,gpu-programming,cublas
I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

Update a D3D9 texture from CUDA


c#,cuda,sharpdx,direct3d9,managed-cuda
I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...

Why does cuSOLVER cusolverSpDcsrlsvchol not work?


c++,cuda,linear-algebra,solver,cusolver
We are experiencing problems while using cuSOLVER's cusolverSpScsrlsvchol function, probably due to misunderstanding of the cuSOLVER library... Motivation: we are solving the Poisson equation -divgrad x = b on a rectangular grid. In 2 dimensions with a 5-stencil (1, 1, -4, 1, 1), the Laplacian on the grid provides a...

How to load data in global memory into shared memory SAFELY in CUDA?


c++,cuda,shared-memory
My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...

What is version of cuda for nvidia 304.125


ubuntu,cuda,ubuntu-14.04,nvidia
I am using ubuntu 14.04. I want to install CUDA. But I don't know which version is good for my laptop. I trace my drive that is $cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...

How does CUDA's cudaMemcpyFromSymbol work?


cuda
I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

Practice computing grid size for CUDA


cuda,nvidia
dim3 block(4, 2) dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y); I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed?...

thrust exception bulk_kernel_by_value in transform_reduce


c++,c++11,cuda
I'm working on a optimization problem which contains various math functions which resembles in similar form, so I warp them in a FunctionObj template <typename T> struct FunctionObj { T a; FunctionObj(): a(1) { } }; And defines a FuncEval to evaluate template <typename T> __host__ __device__ inline T FuncEval(const...

cuda thrust: selective copying and resizing results


cuda,thrust
I am copying items selectively between two thrust device arrays using copy_if as follows: thrust::device_vector<float4> collated = thrust::device_vector<float4> original_vec.size()); thrust::copy_if(original_vec.begin(), original_vec.end(), collated.begin(), is_valid_pt()); collated.shrink_to_fit(); The is_valid_pt is implemented as: struct is_valid_kpt { __host__ __device__ bool operator()(const float4 x) { return x.w >= 0; } }; Now after running this code,...

Faster Matrix Multiplication in CUDA


c,cuda,matrix-multiplication
Currently, I made a neural networks program in the cuda c. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM. I use the following code for MM. I was wondering if any one has some advice to make it faster which can be very...

Interfacing cuSOLVER-sparse using PyCUDA


python,cuda,ctypes,pycuda,cusolver
I'm trying to interface the sparse cuSOLVER routine cusolverSpDcsrlsvqr() (>= CUDA 7.0) using PyCUDA and am facing some difficulties: I have tried wrapping the methods the same way the dense cuSolver routines are wrapped in scikits-cuda (https://github.com/lebedov/scikits.cuda/blob/master/scikits/cuda/cusolver.py). However, the code crashes with a segmentation fault when calling the cusolverSpDcsrlsvqr() function....

Threads syncronization in CUDA


c++,multithreading,cuda
I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...

How many parallel threads i can run on my nvidia graphic card in cuda programming?


cuda
Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D


c,cuda
I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...

Understanding Memory Replays and In-Flight Requests


caching,cuda
I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...

Building a tiny R package with CUDA and Rcpp


r,cuda,rcpp
I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/rcppcuda.cu...

CUDA: Group every n-th point of array passed to GPU


c++,c,arrays,cuda
I am trying to implement k-means algorithm on CUDA using Tesla card on external Unix. I read input file and store coordinates of all data points in dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory....

Best way to achieve CUDA Vector Diagonalization


matrix,cuda
What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this? So far, I start of...

direct global memory access using cuda


c++,cuda
q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; q2- would i have to include pointer to global memory array as argument in...

nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7


c++11,gcc,cuda,future
When I compile the following code containing the design C++11, I get errors - it does not compile. I've tried with different flags, but I haven't found a solution. My setting: CUDA 6.5, gcc 4.4.7 I am not able to change the settings. How can I still make this work?...

MVAPICH on multi-GPU causes Segmentation fault


cuda,mvapich2
I'm using MVAPICH2 2.1 on a Debian 7 machine. It has multiple cards of Tesla K40m. The code is as follows. #include <cstdio> #include <cstdlib> #include <ctime> #include <cuda_runtime.h> #include <mpi.h> int main(int argc, char** argv) { MPI_Status status; int rank; MPI_Init(&argc, &argv); MPI_Comm_rank(MPI_COMM_WORLD, &rank); cudaSetDevice(0); if (rank == 0)...

cuda device function and templates


c++,templates,cuda
I am using CUDA 7 and am trying to pass a function as a template parameter to a device function as follows: typedef float(*Op)(float, float); template<typename Op> __device__ bool is_maxima(float ax, float ay, cudaTextureObject_t current) { // I try to use the passed function as: float cv = tex2D<float>(current, ax,...

Cuda cub:Device Scan


cuda,gpu,nvcc,cub,scan
I'm using cub to implement device scan. When I run the default example for device scan I keep getting : identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined Does anyone have any idea about this problem? Thanks,...