Misaligned Access

__global__ void kernel(float* data) {
  // Global thread index: one element per thread.
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Thread i touches element i.
  data[idx] += 1.0f;
}
^ Is this faster?
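__global__ void kernel(float* data) {
  // Global thread index: one element per thread.
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // Thread i touches element i + 1: every access is shifted by one float
  // (4 bytes), so the buffer needs one extra element at the end.
  data[idx + 1] += 1.0f;
}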
^ Is this faster?

* For illustration purposes only; see the FAQ for more details.