c,cuda , 'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D

'an illegal memory access' when trying to write to a 2D array allocated using cudaMalloc3D


Tag: c,cuda

I am trying to allocate and copy memory of a flattened 2D array on to the device using cudaMalloc3D to test the performance of cudaMalloc3D. But when I try to write to the array from the kernel it throws 'an illegal memory access was encountered' exception. The program runs fine if I am just reading from the array but when I try to write to it, there is an error. Any help on this will be greatly appreciated. Below is my code and the syntax for compiling the code.

Compile using

nvcc -O2 -arch sm_20 test.cu 

Code: test.cu

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define PI 3.14159265 
#define NX 8192     /* includes boundary points on both end */
#define NY 4096     /* includes boundary points on both end */
#define NZ 1        /* needed for cudaMalloc3D */

#define N_THREADS_X 16
#define N_THREADS_Y 16

#define LX 4.0    /* length of the domain in x-direction  */
#define LY 2.0    /* length of the domain in x-direction  */
#define dx       (REAL) ( LX/( (REAL) (NX) ) )
#define cSqrd     5.0
#define dt       (REAL) ( 0.4 * dx / sqrt(cSqrd) )
#define FACTOR   ( cSqrd * (dt*dt)/(dx*dx) )

#define IC  (i + j*NX)       /* (i,j)   */
#define IM1 (i + j*NX - 1)   /* (i-1,j) */
#define IP1 (i + j*NX + 1)   /* (i+1,j) */
#define JM1 (i + (j-1)*NX)   /* (i,j-1) */
#define JP1 (i + (j+1)*NX)   /* (i,j+1) */

// Macro for checking CUDA errors following a CUDA launch or API call
#define cudaCheckError() {\
  cudaError_t e = cudaGetLastError();\
  if( e != cudaSuccess ) {\
    printf("\nCuda failure %s:%d: '%s'\n",__FILE__,__LINE__,cudaGetErrorString(e));\

typedef double REAL;
typedef int    INT;

void meshGrid ( REAL *x, REAL *y )

  INT i,j;
  REAL a;
  for (j=0; j<NY; j++) {
    a = dx * ( (REAL) j );
    for (i=0; i<NX; i++) {
      x[IC] =  dx * ( (REAL) i );
      y[IC] = a;

void initWave ( REAL *u, REAL *uold, REAL *x, REAL *y )
  INT i,j;
  for (j=1; j<NY-1; j++) {
    for (i=1; i<NX-1; i++) {
      u[IC] =  0.1 * (4.0*x[IC]-x[IC]*x[IC]) * ( 2.0*y[IC] - y[IC]*y[IC] );

  for (j=1; j<NY-1; j++) {
    for (i=1; i<NX-1; i++) {
      uold[IC] = u[IC] + 0.5*FACTOR*( u[IP1] + u[IM1] + u[JP1] + u[JM1] - 4.0*u[IC] );

__global__ void solveWaveGPU ( cudaPitchedPtr uold, cudaPitchedPtr u, cudaPitchedPtr unew )

 INT i,j;

 i = blockIdx.x*blockDim.x + threadIdx.x;
 j = blockIdx.y*blockDim.y + threadIdx.y;

 if (i>0 && i < (NX-1) && j>0 && j < (NY-1) ) {

  char *unewPtr  = (char *) unew.ptr;
  REAL *unew_row = (REAL *) (unewPtr + i * unew.pitch);

  REAL tmp = unew_row[j]; // no error on this line
  unew_row[j] = 1.2; // this is where I get the error


INT main(INT argc, char *argv[])

  INT nTimeSteps = 10;  

  // pointers for the host side
  REAL *unew, *u, *uold, *uFinal, *x, *y;

  // allocate memory on the host
  unew        = (REAL *)calloc(NX*NY,sizeof(REAL));
  u           = (REAL *)calloc(NX*NY,sizeof(REAL));
  uold        = (REAL *)calloc(NX*NY,sizeof(REAL));
  uFinal      = (REAL *)calloc(NX*NY,sizeof(REAL));
  x           = (REAL *)calloc(NX*NY,sizeof(REAL));
  y           = (REAL *)calloc(NX*NY,sizeof(REAL));

  // pointer for the device side
  size_t pitch = NX * sizeof(REAL);
  cudaPitchedPtr  d_u, d_uold, d_unew, d_tmp;
  cudaExtent myExtent = make_cudaExtent(pitch, NY, NZ);

  // allocate 3D memory on the device
  cudaMalloc3D( &d_u, myExtent );    cudaCheckError();
  cudaMalloc3D( &d_uold, myExtent ); cudaCheckError();
  cudaMalloc3D( &d_unew, myExtent ); cudaCheckError();

  // initialize grid and wave
  meshGrid( x, y );
  initWave( u, uold, x, y );

  // copy host memory to 3D device memory
  cudaMemcpy3DParms cpy3D = { 0 };
  cpy3D.kind = cudaMemcpyHostToDevice;

  // copying u to d_u
  cpy3D.srcPtr = make_cudaPitchedPtr(u, pitch, NX, NY);
  cpy3D.dstPtr = d_u;
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  // copying uold to d_uold
  cpy3D.srcPtr = make_cudaPitchedPtr(uold, pitch, NX, NY);
  cpy3D.dstPtr = d_uold;
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  //  set up the GPU grid/block model
  dim3 dimGrid  ( N_BLOCKS_X , N_BLOCKS_Y  );
  dim3 dimBlock ( N_THREADS_X, N_THREADS_Y );

  for ( INT n = 1; n < nTimeSteps + 1; n++ ) {
    solveWaveGPU <<< dimGrid, dimBlock >>> ( d_uold, d_u, d_unew );

    d_tmp  = d_uold;
    d_uold = d_u;
    d_u    = d_unew;
    d_unew = d_tmp;

  // copy the memory back to host
  cpy3D.kind = cudaMemcpyDeviceToHost;

  // copying d_unew to uFinal
  cpy3D.srcPtr = d_unew;
  cpy3D.dstPtr = make_cudaPitchedPtr(uFinal, pitch, NX, NY);
  cpy3D.extent = myExtent;
  cudaMemcpy3D( &cpy3D ); cudaCheckError();  

  free(u);    cudaFree(d_u.ptr);
  free(unew); cudaFree(d_unew.ptr);
  free(uold); cudaFree(d_uold.ptr);

  free(uFinal); free(x); free(y);

  return EXIT_SUCCESS;


The reason the error doesn't occur on this line:

REAL tmp = unew_row[j]; // no error on this line

is because the compiler is optimizing that line out. It doesn't do anything useful, and so the compiler completely eliminates it. The compiler warning:

xxx.cu(87): warning: variable "tmp" was declared but never referenced

is a hint to that effect.

Your code is very nearly correct. The issue is here:

REAL *unew_row = (REAL *) (unewPtr + i * unew.pitch);

It should be:

REAL *unew_row = (REAL *) (unewPtr + j * unew.pitch);

The i variable in your kernel is the width (i.e. X) dimension. The j variable is the height (i.e. Y) dimension.

The height is the one that refers to which row you are on, therefore the row pitch should be multiplied by the height parameter, i.e. j, not i.

Similarly, although it's not the source of the specific failure for your particular dimensions, this code may be not what you intended either:

REAL tmp = unew_row[j]; // no error on this line
unew_row[j] = 1.2; // this is where I get the error

If, for example, you were intending to compute the offset to the row and then index into the row (perhaps to set every element in the alocation, for example) then I think you would want to use i not j as your final index:

REAL tmp = unew_row[i]; // no error on this line
unew_row[i] = 1.2; // this is where I get the error

However, for this particular example, this is not the actual source of the illegal memory access.


Efficient comparison of small integer vectors

I have small vectors. Each of them is made of 10 integers which are between 0 and 15. This means that every element in a vector can be written using 4 bits. Hence I can concatenate my vector elements and store the whole vector in a single long type (in...

How to read string until two consecutive spaces?

A well known function of the scanf() functions is that you can pass a format to scan input according to this format. For my case, I cannot seem to find a solution searching this and this documentation. I have a string (sInput) as the following: #something VAR1 this is a...

Counting bytes received by posix read()

I get confused with one line of code: temp_uart_count = read(VCOM, temp_uart_data, 4096); I found more about read function at http://linux.die.net/man/3/read, but if everything is okay it returns 0, so how we can get num of bytes received from that? temp_uart_count is used to count how much bytes we received...

Is it safe to read and write on an array of 32 bit data byte by byte?

So I have a void * data of 32 bit unsigned integers which represents the pixels. Is it okay for me to access one of the pixels with a char * and modify the values directly? Or is it better to store my new pixel in a temporary uint32_t variable...

CallXXXMethod undefined using JNI in C

So I've tried to use the JNI interface to call Java methods from C. Calling static methods is no problem, but I get stuck when I want to call a method on an object. The code is as follows: #include <stdio.h> #include <string.h> #include <jni.h> int main() { JavaVMOption options[1];...

C language, vector of struct, miss something?

This is a part of my program that I want to create a vector of struct typedef struct { char nome[501]; int qtd; int linha; int coluna; } tPeca; tPeca* criarPecas(FILE *pFile, int tam) { int i; tPeca *pecaJogo = (tPeca*)malloc(tam*sizeof(tPeca)); if (pecaJogo == NULL) return NULL; for (i =...

Galois LFSR - how to specify the output bit number

I am trying to understand how change the galois LFSR code to be able to specify the output bit number as a parameter for the function mentioned below. I mean I need to return not the last bit of LFSR as output bit, but any bit of the LFSR (...

CGO converting Xlib XEvent struct to byte array?

I am creating a simple window manager (code based of the c code in tinywm) in Golang. To use Xlib, I am using cgo, so my header is: // #cgo LDFLAGS: -lX11 // #include <X11/Xlib.h> And I have a variable declaration, like: event := C.XEvent{} And then, I use this...

How can I pass a struct to a kernel in JCuda

I have already looked at this http://www.javacodegeeks.com/2011/10/gpgpu-with-jcuda-good-bad-and-ugly.html which says I must modify my kernel to take only single dimensional arrays. However I refuse to believe that it is impossible to create a struct and copy it to device memory in JCuda. I would imagine the usual implementation would be to...

Array breaking in Pebble C

I'm trying to create a simple dice-rolling application in Pebble using C on CloudPebble. I have an array of different die sizes you can scroll through using Up/Down, and you roll (currently just generate a random number, it'll get fancier later) using the middle button. There's also a label at...

What does `strcpy(x+1, SEQX)` do?

I'm wondering what this syntax of strcpy() does in line 65 and 66: 24 #define SEQX "TTCATA" 25 #define SEQY "TGCTCGTA" 61 M = strlen(SEQX); 62 N = strlen(SEQY); 63 x = malloc(sizeof(char) * (M+2)); /* +2: leading blank, and trailing \0 */ 64 y = malloc(sizeof(char) * (N+2)); 65...

Segmentation Fault if I don't say int i=0

void removeVowels(char* array){ int i,j,v; i=0; char vowel[]={'a','e','i','o','u'}; while(array[i]!='\0') { for(v=0;v<5;v++) { if (array[i]==vowel[v]) { j=i; while(array[j]!='\0') { array[j]=array[j+1]; j++; } i--; break; } } i++; } } in function removeVowels() if I don't include i=0; and just say int i; why does it give segmentation fault? Isn't it automatically...

Segmentation fault with generating an RSA and saving in ASN.1/DER?

#include <string.h> #include <openssl/aes.h> #include <openssl/rand.h> #include <openssl/bio.h> #include <openssl/rsa.h> #include <openssl/evp.h> #include <openssl/pem.h> #define RSA_LEN 2048 #define RSA_FACTOR 65537 int genRSA2048(unsigned char **pub,unsigned int *pub_l,unsigned char **priv,unsigned int *priv_l){ RSA *pRSA = NULL; pRSA = RSA_generate_key(RSA_LEN,RSA_FACTOR,NULL,NULL); if (pRSA){ pub_l = malloc(sizeof(pub_l)); *pub_l = i2d_RSAPublicKey(pRSA,pub); priv_l = malloc(sizeof(priv_l));...

How to increment the value of an unsigned char * (C)

I have a value stored as an unsigned char * (in C). This holds the SHA1 hash of a string. My goal is to cover the SHA1 key space. Since I'm using <openssl/evp.h> to generate the hashes, I end up with an unsigned char* holding the SHA1 value. Now I...

How convert unsigned int to unsigned char array

I just need to extract those bytes using bitwise & operator. 0xFF is a hexadecimal mask to extract one byte. For 2 bytes, this code is working correctly: #include <stdio.h> int main() { unsigned int i = 0x7ee; unsigned char c[2]; c[0] = i & 0xFF; c[1] = (i>>8) &...

Is post-increment operator guaranteed to run instantly?

Let's say I have the following code: int i = 0; func(i++, i++); The increment is happening right after returning the value? Is it guaranteed that the first argument will be 0, and the second argument will be 1?...

How to control C Macro Precedence

#define VAL1CHK 20 #define NUM 1 #define JOIN(A,B,C) A##B##C int x = JOIN(VAL,NUM,CHK); With above code my expectation was int x = 20; But i get compilation error as macro expands to int x = VALNUMCHK; // Which is undefined How to make it so that NUM is replaced first...

How can I align stack to the end of SRAM?

I have a STM32F103VCT6 microcontroller with 48kb of SRAM, and recently i've got a memory collision: I have some static variable (lets call it A) located in heap with size of 0x7000 and I wrote some simple function to get info about stack and heap: void check(int depth) { char...

OpenGL glTexImage2D memory issue

I'm loading a cubemap to create a skybox, everything is fine and the skybox renders properly with a correct texture application. However, I decided to check my program safety with valgrind, Valgrind gives this error: http://pastebin.com/seqmXjyx The line 53 in sky.c is: glTexImage2D(GL_TEXTURE_CUBE_MAP_POSITIVE_X + i, 0, GL_RGB, texture.width, texture.height, 0,...

Is there Predefined-Macros define about byte order in armcc

Is there Predefined-Macros define about byte order in armcc. I am a novice on the armcc.and sorry for my English. In gcc these are macros: __BYTE_ORDER__ __ORDER_LITTLE_ENDIAN__ __ORDER_BIG_ENDIAN__ __ORDER_PDP_ENDIAN__ ... Now I have to use armcc, Is there same like these with armcc? Thank a lot. by the way,the armcc...

Is i=i+1 an undefined behaviour?

I'm using codeblocks and it is giving a different output to other compilers and I can't find a solution to it.What's the undefined behaviour in this program and is there any solution to avoid it? This is the code to print the nth number in a number system with only...

execl() works on one of my code, but doesn't work on another

I already used execl() in code, and it worked well. But this time, I really have no idea why it doesn't work. So here's the code that do not work #include <unistd.h> #include <stdio.h> int main() { int i = 896; printf("please\n"); execl("home/ubuntu/server/LC/admin/admin", (char*)i, NULL); printf("i have no idea why\n");...

How does ((a++,b)) work? [duplicate]

This question already has an answer here: What does the comma operator `,` do in C? 8 answers In the below block of code, I am trying to understand how the line return reverse((i++, i)) is working. #include <stdio.h> void reverse(int i); int main() { reverse(1); } void reverse(int...

Set precision dynamically using sprintf

Using sprintf and the general syntax "%A.B" I can do this: double a = 0.0000005l; char myNumber[50]; sprintf(myNumber,"%.2lf",a); Can I set A and B dynamically in the format string?...

Infinite loop with fread

I'm trying to allocate an array 64 bytes in size and then loop over the array indexes to put a read a byte each from the inputfile. but when I don't malloc() the array indexes, the loop stays in index0 (so each time it loops it replaces the content in...

Multiple definition and file management

I'm writing a program for vocabulary training, for myself. And the program itself should be available in different languages, atm in German and English. What I want is to have a Main File which manage all and two separate files for the functions in the right language. I compile all...

VS2012 Identifer not found when part of static lib

Using VS2012 C/C++: I created and linked a static lib called "libtools" to my project. Calls to functions in the libtools lib worked as expected. I created and linked a second static lib called "shunt" to my project. But when I incorporate a call to a function in shunt, I...

C binary tree sort - extending it

I need some help in C Help me to extend the binary tree sort on C. I need to return a sorted array in sort function. here it is: #include <stdio.h> #include <stdlib.h> struct btreenode { struct btreenode *leftchild ; int data ; struct btreenode *rightchild ; } ; void...

Tesla k20m interoperability with Direct3D 11

I would like to know if I can work with Nvidia Tesla K20 and Direct3D 11? I'd like to render an image using Direct3D, Then process the rendered image with CUDA, [ I know how to work out the CUDA interoperability]. Tesla k20 doesn't have a display adapter (physically remote...

fread(), solaris to unix portability and use of uninitialised values

Valgrind found the following error and I, after reading the documentation, the code and other questions in here couldn't figure it out why. Valgrind: first warning ~$ valgrind --vgdb=yes --vgdb-error=0 --read-var-info=yes --leak-check=yes --track-origins=yes debitadmin* debitadmin ==20720== Conditional jump or move depends on uninitialised value(s) ==20720== at 0x4013BC6: initialise (dbg.c:199) ==20720==...

How does this code print odd and even?

#define MACRO(num, str) {\ printf("%d", num);\ printf(" is");\ printf(" %s number", str);\ printf("\n");\ } int main(void) { int num; printf("Enter a number: "); scanf("%d", &num); if (num & 1) { MACRO(num, "Odd"); } else { MACRO(num, "Even"); } return 0; } Please explain the above code (if/else condition and how...

Program to reverse a string in C without declaring a char[]

I need to reverse a given string and display it without using the value At[index] notation , I tried the below program using pointers,but it does not print anything for the reverse string, Please help! int main() { char* name=malloc(256); printf("\nEnter string\n"); scanf("%s",name); printf("\nYou entered%s",name); int i,count; count=0; //find the...

Does strlen() always correctly report the number of char's in a pointer initialized string?

As long as I use the char and not some wchar_t type to declare a string will strlen() correctly report the number of chars in the string or are there some very specific cases I need to be aware of? Here is an example: char *something = "Report all my...

Text justification C language

I have to solve a problem that involves left justification string length and leading zeros. I have the following table : BEGIN CLOSE CONCATENATE DELETE END INITIALIZE PRINT WRITE This is produced by a simple program. My problem is to find out how to convert it like that : It...

C++ / C #define macro calculation

Suppose I have #define DETUNE1 sqrt(7)-sqrt(5) #define DETUNE2 sqrt(11)-sqrt(7) And I call these multiple times in my program. Are DETUNE1 and DETUNE2 calculated every time it is called? Thanks. Please don't downvote this, I really want to know and a search didn't turn up anything definite. ...

getchar() not working in c

getchar() is not working in the below program, can anyone help me to solve this out. I tried scanf() function in place of getchar() then also it is not working. I am not able to figure out the root cause of the issue, can anyone please help me. #include<stdio.h> int...

scanf get multiple values at once

I need to get in one single shot different inputs from one single line. In particular I need to get a single char and then, depending on which char value I just read, it can be a string and an int or a string, an int and another string and...

Passing int using char pointer in C

I'm trying to figure out how to pass an int using a char pointer. It fails once the int value is too large for the char. This is what I'm trying to figure out: char *args[5]; int i = 20; /*some other code/assignments*/ args[2] = (char *)&i; execv(path, args); How...

free causing different results from malloc

Below is a C program i have written to print different combination of characters in a string. This is not an efficient way as this algorithm creats a lot of extra strings. However my question is NOT about how to solve this problem more efficiently. The program works(inefficiently though) and...

CUDA cuBlasGetmatrix / cublasSetMatrix fails | Explanation of arguments

I've attempted to copy the matrix [1 2 3 4 ; 5 6 7 8 ; 9 10 11 12 ] stored in column-major format as x, by first copying it to a matrix in an NVIDIA GPU d_x using cublasSetMatrix, and then copying d_x to y using cublasGetMatrix(). #include<stdio.h>...

What all local variables goto Data/BSS segment?

The man page of nm here: MAN NM says that The symbol type. At least the following types are used; others are, as well, depending on the object file format. If lowercase, the symbol is usually local; if uppercase, the symbol is global (external) And underneath it has "b" and...

Is there any way of protecting a variable for being modified at runtime in C?

I was wondering if there is any way of protecting a variable for being modified once initialized (something like "constantize" a variable at runtime ). For example: #include <stdio.h> #include <stdlib.h> int main(void) { int v, op; scanf( "%d", &op ); if( op == 0 ) v = 1; else...

Loop through database table and compare user input

I am trying to loop through the rows in a MySql table and compare the data in a certain column to some user input using C. Currently my code looks like this: MYSQL *cxn = mysql_init(NULL); MYSQL_RES *result; unsigned int num_fields; unsigned int num_rows; char *query_string; MYSQL_ROW *row; if (mysql_real_connect(cxn,...

Does realloc() invalidate all pointers?

Note, this question is not asking if realloc() invalidates pointers within the original block, but if it invalidates all the other pointers. I'm new to C, and am a bit confused about the nature of realloc(), specifically if it moves any other memory. For example: void* ptr1 = malloc(2); void*...

On entry to NIT parameter number 9 had an illegal value

I go this ex1.c file from Intel 11. However, when I execute it, it fails: [email protected]:~/konstantis$ ../mpich-install/bin/mpicc -o test ex1.c -I../intel/mkl/include ../intel/mkl/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group ../intel/mkl/lib/intel64/libmkl_intel_ilp64.a ../intel/mkl/lib/intel64/libmkl_core.a ../intel/mkl/lib/intel64/libmkl_sequential.a -Wl,--end-group ../intel/mkl/lib/intel64/libmkl_blacs_intelmpi_ilp64.a -lpthread -lm -ldl [email protected]:~/konstantis$ mpiexec -n 4 ./test { 0, 0}: On entry to DESCI{...

Disadvantages of calling realloc in a loop

I'm trying to implement some math algorithms in C on Windows 7, and I need to repeatedly increase size of my array. Sometimes it fails because realloc can't allocate memory. But if I allocate a lot of memory at once in the beginning it works fine. Is it a problem...

C programming - Confusion regarding curly braces

The following code is for replacing multiple consecutive spaces into 1 space. Although I manage to do it, I am confused in the use of curly braces. This one is actually running fine: #include <stdio.h> #include <stdlib.h> int main() { int ch, lastch; lastch = 'a'; while((ch = getchar())!= EOF)...

Recursive signal call using kill function

I'm now learning signals in computer system and I've stuck with a problem. There is a code given below; int i = 0; void handler(int s) { if(!i) kill(getpid(), SIGINT); i++; } int main() { signal(SIGINT, handler); kill(getpid(), SIGINT); printf("%d\n", i); return 0; } And the solution says that the...

Unexpected result when calculating a percentage - even when factoring in integer division rules

I am trying to express a battery voltage as a percentage. My battery level is a (global) uint16 in mV. I have a 16-bit CPU. Here is my code: static uint8 convertBattery(void){ uint16 const fullBattery = 3000; /* 3V = 3000mV */ uint8 charge; charge = ((battery*100)/fullBattery); return charge; }...

Reverse ^ operator for decryption

I'm trying to reverse the following code in order to provide a function which takes the buffer and decrypts it. void crypt_buffer(unsigned char *buffer, size_t size, char *key) { size_t i; int j; j = 0; for(i = 0; i < size; i++) { if(j >= KEY_SIZE) j = 0;...