CUDA C++: optimizing for slow branch convergence (one kernel vs multiple kernels)

  c++, cuda, gpu

I’m working with CUDA and am trying to understand thread blocks, warps, etc., and how to better optimize kernels in the process.

One general question I have is whether it’s better to run multiple kernels whose branches converge quickly, or to run one giant kernel with a branch that is slow to converge. Here’s an example of what I mean:

Using a single-kernel approach:

//inside kernel_1 loop

if (test(data[i]))  // ~50% of threads will test true
{
    data[i].veryHeavyComputation();
}
else
{
    data[i].anotherVeryHeavyComputation();
}
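Fully written out, that single kernel might look something like the sketch below. The `Data` type, `test()`, and the indexing scheme are just my assumptions for illustration, not from any particular codebase:

```cuda
__global__ void bigKernel(Data* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads in the same warp that disagree on test() diverge:
    // the warp executes both branches in sequence, with inactive
    // lanes masked off, so the warp pays for both heavy paths.
    if (test(data[i]))  // ~50% of threads will test true
        data[i].veryHeavyComputation();
    else
        data[i].anotherVeryHeavyComputation();
}
```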


Now imagine three kernels, where the second and third kernels perform the heavy computation on indices that are simply collected by the first kernel:

//inside kernel_1 loop

if (test(data[i]))  // only ~50% of threads will test true
{
    collectorA[atomicAdd(counterA, 1)] = i;
}
else
{
    collectorB[atomicAdd(counterB, 1)] = i;
}
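For reference, here is a self-contained sketch of that first pass (stream compaction via atomics). `Data`, `test()`, and the counter layout are my assumptions; `counterA`/`counterB` would be device ints zeroed before launch:

```cuda
// Pass 1: partition indices by branch outcome. The branch body is now
// tiny (one atomic + one store), so divergence costs almost nothing.
__global__ void collectKernel(const Data* data, int n,
                              int* collectorA, int* counterA,
                              int* collectorB, int* counterB)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (test(data[i]))  // ~50% of threads take this side
        collectorA[atomicAdd(counterA, 1)] = i;  // append i to list A
    else
        collectorB[atomicAdd(counterB, 1)] = i;  // append i to list B
}
```

Note that `atomicAdd` returns the old value, so each thread gets a unique slot, though the resulting order within each list is nondeterministic.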

and

//inside kernel_2 loop

data[collectorA[i]].veryHeavyComputation();

and

//inside kernel_3 loop

data[collectorB[i]].anotherVeryHeavyComputation();
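Put together, the host side of this approach might look like the following sketch. Copying the counters back with `cudaMemcpy` is one simple way to size the follow-up launches; all names here are illustrative:

```cuda
// Host side: run the three passes back to back.
collectKernel<<<(n + 255) / 256, 256>>>(d_data, n,
                                        d_collectorA, d_counterA,
                                        d_collectorB, d_counterB);

int hA = 0, hB = 0;
cudaMemcpy(&hA, d_counterA, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&hB, d_counterB, sizeof(int), cudaMemcpyDeviceToHost);

// Passes 2 and 3: every thread in each launch takes the same path,
// so the heavy kernels run with no warp divergence.
heavyKernelA<<<(hA + 255) / 256, 256>>>(d_data, d_collectorA, hA);
heavyKernelB<<<(hB + 255) / 256, 256>>>(d_data, d_collectorB, hB);
```

The trade-off is the extra pass over the data, the atomic traffic, and the scattered (uncoalesced) accesses through `collector` indirection, which is part of what I’m asking about.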

Basically I’m looking for "CUDA best practice"-style insight into whether multiple small kernels with lightweight branches are better than one big kernel with a slowly-converging (divergent) branch.

Source: Windows Questions C++