Why did my CUDA program become slower after using 128 threads per block?

I have a simple CUDA application with the following code:

#include <stdio.h>
#include <sys/time.h>
#include <stdint.h>

__global__ void daxpy(int n, int a, int *x, int *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    y[i] = x[i];
    int j;
    for(j = 0; j < 1024*10000; ++j) {
        y[i] += j%10;
    }...

PyCUDA test_cumath.py fails on cosh

I've installed PyCUDA on a machine with a Tesla C2075. I'm running Ubuntu 14.04 with the CUDA 6.0 compiler installed. Using Python 2.7.9 (via the Anaconda distribution) and NumPy 1.9.0, I installed PyCUDA 2014.1 from the ZIP file that Andreas Kloeckner provides on his website (http://mathema.tician.de/software/pycuda/). Running the tests...

Tesla K20m interoperability with Direct3D 11

I would like to know whether I can use an NVIDIA Tesla K20 with Direct3D 11. I'd like to render an image using Direct3D, then process the rendered image with CUDA (I know how to work out the CUDA interoperability). The Tesla K20 doesn't have a display adapter (physically remote...