cuda,direct3d,tesla , Tesla k20m interoperability with Direct3D 11

Tesla k20m interoperability with Direct3D 11


Tag: cuda,direct3d,tesla

I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11?

I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability].

Tesla k20 doesn't have a display adapter (physically remote adapter )

I managed to do so with Tesla C2075, however with K20 I can't receive the device adapter ( EnumAdapters command ).

Is it possible to work with Tesla K20 and Direct3D ?

Frankly speaking, this code was written in notepad


  IDXGIFactory* factory = 0 ; 
  IDXGIAdapter* adapter = 0 ; 
  int dev = 0;  

  CreateDXGIFactory( __uuidof(IDXGIFactory) , (void**)&factory); 
  for (unsigned int i = 0 ; !adapter ; ++i ) 
         if ( FAILED( factory->EnumAdapters (i , &adapter )))
         if ( cudaD3D11GetDevice(&dev , adapter) == cudaSuccess )


No, this won't be possible.

K20m can be used (with some effort) with OpenGL graphics on Linux, but at least up through windows 8.x, you won't be able to use K20m as a D3D device in Windows.

The K20m does not publish a VGA classcode in PCI configuration space, which means that niether windows nor the NVIDIA driver will build a proper windows display driver stack on this device. Without that, you cannot use it as a D3D device. Additional evidence of this is visible through the nvidia-smi utility, which will show the K20 device as being in TCC mode. Any attempts to switch it to WDDM mode will fail (in some fashion - the failure may not be evident until a reboot).

If you find another GPU (such as Tesla C2075) for which it's possible, it invariably means, among other things, that the GPU is publishing a VGA classcode in PCI config space.

This general document covers classcode location in the PCI header on slide 62. This ECN excerpts the classcode definition. A VGA classcode is 0x0300, whereas a 3D controller classcode (I believe that is what K20m publishes) is 0x0302.

There are a few limited exceptions to the above. For example, a Tesla M2070Q in one configuration does not publish a VGA classcode (there is another configuration where it does publish a VGA classcode), but instead publishes a 3D controller classcode. In this configuration, it is usable by Microsoft RemoteFX as a shared graphics device for multiple Hyper-V VMs. In this situation, some D3D capability (up through DX9) is possible in the VMs.

In linux, the difference between a "3D Controller" and "VGA Controller" is evident using the lspci command.

In windows, you can get a config space reader to look at the difference, or you can look in Device Manager. The Tesla C2075 should show up under "Display Adapters" whereas the K20m will show up somewhere else.


nvcc/CUDA 6.5 & c++11(future) - gcc 4.4.7

When I compile the following code containing the design C++11, I get errors - it does not compile. I've tried with different flags, but I haven't found a solution. My setting: CUDA 6.5, gcc 4.4.7 I am not able to change the settings. How can I still make this work?...

How to load data in global memory into shared memory SAFELY in CUDA?

My kernel: __global__ void myKernel(float * devData, float * devVec, float * devStrFac, int Natom, int vecNo) { extern __shared__ float sdata[]; int idx = blockIdx.x * blockDim.x + threadIdx.x; float qx=devVec[3*idx]; float qy=devVec[3*idx+1]; float qz=devVec[3*idx+2]; __syncthreads();//sync_1 float c=0.0,s=0.0; for (int iatom=0; iatom<Natom; iatom += blockDim.x) { float rtx =...

Understanding Memory Replays and In-Flight Requests

I'm trying to understand how a matrix transpose can be faster reading naively from columns vs. rows. (example is from Professional CUDA C Programming) The matrix is in memory by row, i.e. (0,1),(0,2),(0,3)...(1,1),(1,2) __global__ void transposeNaiveCol(float *out, float *in, const int nx, const int ny) { unsigned int ix =...

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine...

Linear algebra libraries and dynamic parallelism in CUDA

With the advent of dynamic parallelism in 3.5 and above CUDA architectures, is it possible to call linear algebra libraries from within __device__ functions? Can the CUSOLVER library in CUDA 7 be called from a kernel (__global__) function?...

Tesla k20m interoperability with Direct3D 11

I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

CUDA: Group every n-th point of array passed to GPU

I am trying to implement k-means algorithm on CUDA using Tesla card on external Unix. I read input file and store coordinates of all data points in dataX and dataY arrays. The next step is to select every centreInterval-th point and store it in another array allocated in GPU memory....

How many parallel threads i can run on my nvidia graphic card in cuda programming?

Operating System: Windows 8.1 Single Language, 64-bit DirectX version: 11.0 GPU processor: GeForce 840M Driver version: 353.06 Direct3D API version: 11.2 Direct3D feature level: 11_0 CUDA Cores: 384 Core clock: 1029 MHz Memory data rate: 1800 MHz Memory interface: 64-bit Memory bandwidth: 14.40 GB/s Total available graphics memory: 4096 MB...

cudaMemcpyToSymbol in pycuda

I am using pycuda and i would like to know if there is an equivalent to the function cudaMemcpyToSymbol I would like to copy a constant from the host to the device like below import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy from sys import path...

Is prefix scan CUDA sample code in gpugems3 correct?

I've written a piece of code to call the kernel in gpugem3 but the results that I got is a bunch of negative numbers instead of prefix scan. I'm wondering if my kernel call is wrong or there is something wrong with the gpugem3 code? here is my code: #include...

Why does Hyper-Q selectively overlap async HtoD and DtoH transfer on my cc5.2 hardware?

There's an old Parallel ForAll blog post that demonstrates using streams and async memcpys to generate overlap between kernels and memcpys, and between HtoD and DtoH memcpys. So I ran the full Async sample given on my GTX Titan X, and here's the result: As you can see, when...

Using a data pointer with CUDA (and integrated memory)

I am using a board with integrated gpu and cpu memory. I am also using an external matrix library (Blitz++). I would like to be able to grab the pointer to my data from the matrix object and pass it into a cuda kernel. After doing some digging, it sounds...

Problems with floating-point additions. Ignoring some small values

I'm looking up a book about CUDA. On the chapter which explains the floating points of CUDA, I found something odd. The book says that (1.00 * 1) + (1.00 * 1) + (1.00 * 0.01) + (1.00* 0.01) = 10. All the numbers are binaries. 0.01 refers to decimal...

Running CUDA programs on Quadro K620m

I have laptop which has Quadro K620m GPU. I am trying to learn CUDA programming and downloaded the network installer from NVIDIA site. During CUDA SDK installation, just when its checking the hardware of the machine, it displays Do you want to Continue? This graphics driver could not find compatible...

Understanding Dynamic Parallelism in CUDA

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive execution if (iSize == 1) return; // reduce block size to half int nthreads = iSize>>1; // thread 0 launches child grid...

Can an unsigned long long int be used to store the output from clock64()?

I need to update a global array storing clock64() from different threads atomically. All of the atomic functions in CUDA support only unsigned for long long int sizes. But the return type of clock64() is signed. Is it safe to store the output from clock64() in an unsigned?

thrust exception bulk_kernel_by_value in transform_reduce

I'm working on a optimization problem which contains various math functions which resembles in similar form, so I warp them in a FunctionObj template <typename T> struct FunctionObj { T a; FunctionObj(): a(1) { } }; And defines a FuncEval to evaluate template <typename T> __host__ __device__ inline T FuncEval(const...

NVCC CUDA cross compiling cannot find “-lcudart”

I have installed CUDA 5.0 and NVCC on my Ubuntu virtual machine and have had problems compiling even a basic CUDA C program. The error is as follows: [email protected]:~/CUDA$ nvcc -o helloworld.o -target-cpu-arch=ARM -ccbin=/usr/bin/arm-linux-gnueabi-gcc-4.6 --machine=32 /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible /usr/local/cuda-5.0/bin/../lib/ when searching for -lcudart /usr/lib/gcc/arm-linux-gnueabi/4.6/../../../../arm-linux-gnueabi/bin/ld: skipping incompatible...

Best way to achieve CUDA Vector Diagonalization

What I want to do is feed in my m x n matrix, and in parallel, construct n square diagonal matrices for each column of the matrix, perform an operation on each square diagonal matrix, and then recombine the result. How do I do this? So far, I start of...

D3D11 Post Shader Results in Dark Image

I'm trying to implement post shaders (pixel shaders) as a test. I have a texture to which all stuff is rendered, and the post shader simply copies this texture to the back buffer. If this texture is larger than the back buffer, the post shading process results in dimming: However,...

direct global memory access using cuda

q1- lets say i have copy one array onto device through stream1 using cudaMemCpyAsync; would i be able to access the values of that array in different stream say 2? cudaMemcpyAsync(da,a,10*sizeof(float),cudaMemcpyHostToDevice,stream[0]); kernel<<<n,1,0,stream[0]>>>(da); kernel<<<n,1,0,stream[1]>>>(da){//calculation involving da} ; q2- would i have to include pointer to global memory array as argument in...

Stream compaction with Thrust; best practices and fastest way?

I am interested in porting some existing code to use thrust to see if I can speed it up on the GPU with relative ease. What I'm looking to accomplish is a stream compaction operation, where only nonzero elements will be kept. I have this mostly working, per the example...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

cuda-memcheck fails to detect memory leak in an R package

I'm building CUDA-accelerated R packages, and I want to debug with cuda-memcheck. So in this minimal example (in the deliberate_memory_leak GitHub branch), I create a memory leak in someCUDAcode.c by commenting out a necessary call to cudaFree. Then, I see if cuda-memcheck can find the leak. $ cuda-memcheck --leak-check full...

Why use memset when using CUDA?

I saw in a CUDA code example that memset is used to initialize vectors to all 0's that will store the sum of two others vectors. For example: hostRef = (float *)malloc(nBytes); gpuRef = (float *)malloc(nBytes); memset(hostRef, 0, nBytes); memset(gpuRef, 0, nBytes); What purpose does this serve if nothing else...

Faster Matrix Multiplication in CUDA

Currently, I made a neural networks program in the cuda c. Because I needed to manipulate the matrix multiplication, I did not use CUBLAS for MM. I use the following code for MM. I was wondering if any one has some advice to make it faster which can be very...

Reduce by key on device array

I am using reduce_by_key to find the number of elements in an array of type int2 which has same first values . For example Array: <1,2> <1,3> <1,4> <2,5> <2,7> so no. elements with 1 as first element are 3 and with 2 are 2. CODE: struct compare_int2 : public...

Practice computing grid size for CUDA

dim3 block(4, 2) dim3 grid((nx+block.x-1)/block.x, (ny.block.y-1)/block.y); I found this code in Professional CUDA C Programming on page 53. It's meant to be a naive example of matrix multiplication. nx is the number of columns and ny is the number of rows. Can you explain how the grid size is computed?...

How can I pass a struct to a kernel in JCuda

I have already looked at this which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...

DirectX 9 not rendering after adding transforms

so far I got a cube rendered without any transforms (thus it was rendered in an orthographic perspective), and I am working on the previous code to get it into a perspective view, with all the matrices involved. I changed the Flexible Vertex Format so as not to have RHW...


I have a very simple scala jcuda program that adds a very large array. Everything compiles and runs just fine until I want to copy more than 4 bytes from my device to host. I am getting CUDA_ERROR_INVALID_VALUE when I try to copy more than 4 bytes. // This does...

Cuda cub:Device Scan

I'm using cub to implement device scan. When I run the default example for device scan I keep getting : identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined Does anyone have any idea about this problem? Thanks,...

cuda thrust: selective copying and resizing results

I am copying items selectively between two thrust device arrays using copy_if as follows: thrust::device_vector<float4> collated = thrust::device_vector<float4> original_vec.size()); thrust::copy_if(original_vec.begin(), original_vec.end(), collated.begin(), is_valid_pt()); collated.shrink_to_fit(); The is_valid_pt is implemented as: struct is_valid_kpt { __host__ __device__ bool operator()(const float4 x) { return x.w >= 0; } }; Now after running this code,...

how to generalize square matrix multiplication to handle arbitrary dimensions

I have written this program and I am having some trouble understanding how to use multiple blocks by using dim3 variable in the kernel call line. This code works fine when I am doing 1000*1000 matrix multiplication, but not getting correct answer for lower dimensions like 100*100 , 200*200. #include...

Threads syncronization in CUDA

I have a 3D grid of 3D blocks, and within each block I need to compute sequentially on the "z" layers of the block. In other words, I want to execute first all (x,y,0) threads, then all (x,y,1), etc. I need to execute my threads layer by layer (counting layers...

Amount of cores per SM and threads per block in CUDA

As NVIDIA GPU evolve the amount of cores per SM changes: in Fermi we have 32 of them, but in Maxwell the number is 128 according to the white papers. So, my questions are following: Is that better to create grids with blocks, containing 128 threads each? Will such code...

purposely causing bank conflicts for shared memory on CUDA device

It is a mystery for me how shared memory on cuda devices work. I was curious to count threads having access to the same shared memory. For this I wrote a simple program #include <cuda_runtime.h> #include <stdio.h> #define nblc 13 #define nthr 1024 //[email protected] __device__ int inwarpD[nblc]; __global__ void kernel(){...

Update a D3D9 texture from CUDA

I’m working on a prototype that integrates WPF, Direct3D9 (using Microsoft’s D3DImage WPF class), and CUDA (I need to be able to generate a texture for the D3DImage on the GPU). The problem is, CUDA doesn’t update my texture. No error codes are returned, the texture just stays unchanged. Even...

cuda device function and templates

I am using CUDA 7 and am trying to pass a function as a template parameter to a device function as follows: typedef float(*Op)(float, float); template<typename Op> __device__ bool is_maxima(float ax, float ay, cudaTextureObject_t current) { // I try to use the passed function as: float cv = tex2D<float>(current, ax,...

What is version of cuda for nvidia 304.125

I am using ubuntu 14.04. I want to install CUDA. But I don't know which version is good for my laptop. I trace my drive that is $cat /proc/driver/nvidia/version NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.125 Mon Dec 1 19:58:28 PST 2014 GCC version: gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1)...

Building a tiny R package with CUDA and Rcpp

I'm working on a tiny R package that uses CUDA and Rcpp, adapted from the output of Rcpp.package.skeleton(). I will first describe what happens on the master branch for the commit entitled "fixed namespace". The package installs successfully if I forget CUDA (i.e., if I remove the src/Makefile, change src/

How do you build the example CUDA Thrust device sort?

I am trying to build and run the Thrust example code in Visual Studio 2010 with the latest version (7.0) of CUDA and the THURST install that comes with it. I cannot get the example code to build and run. By eliminating parts of the code, I found the problem...

Access violation reading location when calling cudaMemcpy2DToArray

I allocated a 2D array in device and want to copy a 2D float array to device. ImgSrc is a Mat type in openCV that I copied the elements of it into a 2D float array named ImgSrc_f.then by using cudaMemcpy2DToArray() I copied my host 2D array(ImgSrc_f) to device 2D...

CUDA strange behavior accessing vector

I have implemented a simple fft program in cuda. This is the kernel function: __global__ void fftKernel(cuComplex* dev_samples, size_t length, size_t llog, Direction direction) { int tid = threadIdx.x + blockDim.x * blockIdx.x; if (tid < length / 2) { // First step, sorts data with bit reversing and compute...

Do I need to free device_ptr returned by thrust?

I have a function to get the minimum value of an array and it's executed within a loop. thrust::device_ptr<float> min_ptr = thrust::min_element(populationFitness, populationFitness + POPULATION); Do I have to free the returned device_ptr? I tried with thrust::device_free(min_ptr) but an exception is thrown....

How does CUDA's cudaMemcpyFromSymbol work?

I understand the concept of passing a symbol, but was wondering what exactly is going on behind the scenes. If it's not the address of the variable, then what is it?

cudaMalloc vs cudaMalloc3D performance for a 2D array

I want to know the impact on performance when using cudaMalloc or cudaMalloc3D when allocating, copying and accessing memory for a 2D array. I have code that I tried to test the run time on where on one I use cudaMalloc and on the other cudaMalloc3D. I have included the...

How to pass struct containing array to the kernel in CUDA?

In the following code I have an array in a struct which I need to pass to the kernel function. I can't seem to find the proper way. I tried looking at other posts on SO but do not understand their methods that well. In my actual code, I receive...

Building CUDA-aware openMPI on Ubuntu 12.04 cannot find cuda.h

I am building openMPI 1.8.5 on Ubuntu 12.04 with CUDA 6.5 installed and tested with default samples. I intend to run it on a single node with following configuration: Dell Precision T7400 Dual Xeon X5450 Nvidia GT730/Tesla C1060 The configure command issued was $ ./configure --prefix=/usr --with-cuda=/usr/local/cuda In the generated...