CUDA: using variable sized shared memory

  c++, cuda, shared-memory, transpose

I am trying to implement matrix transposition using shared memory. In all examples that I have come across the programmer declares __shared__ T shared_mem[WIDTH][WIDTH], where WIDTH is usually some #define‘d constant.

I saw in the CUDA C programming manual that variable sized shared memory blocks are declared using extern __shared__ T shared_mem[]. So, this is what I came up with.

template <class value_type, class container_type = value_type*>
transpose_tiled(container_type input, container_type output, size_t width)
    // assuming square blocks
    extern __shared__ value_type input_tile[];

    auto index_y = blockDim.y*blockIdx.y + threadIdx.y;
    auto index_x = blockDim.x*blockIdx.x + threadIdx.x;
    auto matrix_index = width*index_y + index_x;

    auto tr_index_x = blockIdx.y*blockDim.x + threadIdx.x;
    auto tr_index_y = blockIdx.x*blockDim.y + threadIdx.y;
    auto tr_index = tr_index_y*width + tr_index_x;
    // coalesced global memory access
    auto shared_index = threadIdx.y*blockDim.x+threadIdx.x;
    input_tile[shared_index] = input[matrix_index];

    output[tr_index] = input_tile[shared_index];


template<class value_type, class container_type = value_type*>
transposeHost(container_type input, container_type output, size_t width)
    const dim3 blockConfig(16, 16);
    const size_t dimGrid =  (size_t)( ceil( ((double) width)/16.0l) );
    debug("dimGrid: %lu", dimGrid);
    const dim3 gridConfig(dimGrid, dimGrid);

    const auto mem_size = width * width * sizeof(value_type);
    const auto shared_mem_size = gridConfig.x * gridConfig.y * sizeof(value_type);
    value_type *d_input, *d_output;
    cudaMalloc((void **) &d_input, mem_size);
    cudaMalloc((void **) &d_output, mem_size);
    cudaMemcpy(d_input, input, mem_size, cudaMemcpyHostToDevice);

    transpose_tiled<value_type><<<gridConfig, blockConfig, shared_mem_size>>>(d_input, d_output, width);
    cudaMemcpy(output, d_output, mem_size, cudaMemcpyDeviceToHost);


I have used the extern __shared__ declaration and provided the shared memory size in the kernel launch as the third argument to the <<<...>>> specifier. Upon printing output, i.e. output matrix on the host, I get all 0s. Running with cuda-gdb causes the following error in transpose_tiled

CUDA Exception: Warp Out-of-range Address
The exception was triggered at PC 0x555555b57aa0 (

Thread 1 "a.out" received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 0, grid 1, block (0,0,0), thread (0,8,0), device 0, sm 0, warp 4, lane 0]
0x0000555555b57ac0 in transpose_tiled<double, double*>
   <<<(2,2,1),(16,16,1)>>> (input=0x7fffcfa00000,
    output=0x7fffcfa01e00, width=30) at
42      __syncthreads();

While I have a strong feeling that I am missing something trivial, I’ve been wrestling with this problem for too long now. What is it that I am doing wrong? What could I have done differently to debug this code?

Source: Windows Questions C++