Coalesced Access
__global__ void kernel(float* data, int N) {
    // Adjacent threads touch adjacent floats.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] += 1.0f;  // guard the final partial block
}
^ Is this faster?
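__global__ void kernel(float* data, int N, int stride) {
    // Adjacent threads touch floats that are `stride` elements apart.
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < N) data[idx] += 1.0f;  // guard against running past the end
}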
^ Is this faster?
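One way to reason about the two kernels without a GPU is to count how many aligned memory segments a single warp touches. The sketch below is host-only code, so it models the access pattern rather than measuring it. The helper name `segments_touched_by_warp` is hypothetical. The 32-thread warp, 4-byte float, 128-byte segment, and segment-aligned base pointer are simplifying assumptions; real hardware granularity varies by architecture.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper (not from the slide): counts the distinct aligned
// memory segments one warp touches when lane t accesses the float at
// index t * stride.
// Assumptions: 32 threads per warp, 4-byte floats, 128-byte segments,
// and a base pointer aligned to a segment boundary.
int segments_touched_by_warp(int stride) {
    assert(stride >= 1);  // the model only covers positive strides

    const int kWarpSize = 32;
    const std::size_t kSegmentBytes = 128;

    std::set<std::size_t> segments;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        // Byte offset of the element this lane accesses.
        std::size_t byte_offset =
            static_cast<std::size_t>(lane) * stride * sizeof(float);
        // Index of the aligned segment that contains that byte.
        segments.insert(byte_offset / kSegmentBytes);
    }
    return static_cast<int>(segments.size());
}
```

Under these assumptions, `segments_touched_by_warp(1)` returns 1 (the first kernel: the whole warp fits in one segment), while `segments_touched_by_warp(32)` returns 32 (every lane lands in its own segment). That suggests the strided kernel moves far more memory for the same useful work.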
* For illustration purposes only; see the FAQ for more details.