Coalesced Access

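__global__ void kernel(float* data, int N) {
  // One thread per element: neighboring threads touch neighboring floats.
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {  // assuming N is the element count, skip threads past the end
    data[idx] += 1.0f;
  }
}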

^ Faster: adjacent threads access adjacent floats, so each warp's accesses coalesce.

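__global__ void kernel(float* data, int N, int stride) {
  // Neighboring threads touch elements `stride` floats apart.
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  if (idx < N) {  // assuming N is the element count, skip indices past the end
    data[idx] += 1.0f;
  }
}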

^ Slower when stride > 1: adjacent threads are `stride` floats apart, so each warp touches more memory segments.
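
To make the difference between the two kernels concrete, here is a rough host-side sketch (plain C++, no GPU needed) that counts how many memory segments one warp touches for a given stride. It is a simplified model, not a description of any specific GPU: the 32-thread warp, 4-byte element, 128-byte segment size, and segment-aligned base address are all assumptions, and `segments_touched` is a hypothetical helper, not a CUDA API.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Hypothetical helper (not part of CUDA): counts the distinct memory
// segments that one warp touches when lane k accesses element k * stride.
// Assumptions: 32 lanes per warp, 4-byte elements, 128-byte segments,
// and an array base address aligned to a segment boundary.
std::size_t segments_touched(std::size_t stride,
                             std::size_t warp_size = 32,
                             std::size_t elem_bytes = 4,
                             std::size_t segment_bytes = 128) {
    assert(segment_bytes > 0);
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        // Byte offset of the element this lane accesses.
        std::size_t byte_offset = lane * stride * elem_bytes;
        // Index of the segment that contains that byte.
        segments.insert(byte_offset / segment_bytes);
    }
    return segments.size();
}
```

Under these assumptions, `segments_touched(1)` returns 1, `segments_touched(2)` returns 2, and `segments_touched(32)` returns 32: the more segments a warp touches per access, the more memory traffic it generates for the same amount of useful data.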

* For illustration purposes only; see the FAQ for more details.