Misaligned Access
__global__ void kernel(float* data) {
    // Each thread updates the element at its own global index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] += 1.0f;
}
^ Is this faster?
__global__ void kernel(float* data) {
    // Same update, shifted by one element relative to the thread's global index.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx + 1] += 1.0f;
}
^ Is this faster?
* For illustration purposes only; see the FAQ for more details.
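
One way to reason about the two kernels is to count how many aligned memory segments a single warp touches in each case. The sketch below is host-side C++ only and is not taken from the slide. The helper name `segments_touched`, the 32-thread warp, the 4-byte `float`, and the assumption that `data` itself starts on a segment boundary are all illustrative assumptions. The segment size is a parameter because it depends on the hardware and cache configuration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the slide): counts how many aligned memory
// segments one warp touches when thread t accesses the float at element
// index (first_index + t).
// Assumptions: 4-byte floats, a base pointer aligned to segment_bytes,
// and consecutive threads accessing consecutive elements.
std::size_t segments_touched(std::size_t first_index,
                             std::size_t segment_bytes,
                             std::size_t warp_size = 32) {
    assert(segment_bytes > 0 && warp_size > 0);
    // Byte offsets of the first and last byte accessed by the warp.
    const std::size_t first_byte = first_index * sizeof(float);
    const std::size_t last_byte = (first_index + warp_size) * sizeof(float) - 1;
    // Number of segment_bytes-sized blocks spanned by [first_byte, last_byte].
    return last_byte / segment_bytes - first_byte / segment_bytes + 1;
}
```

Under these assumptions, `segments_touched(0, 128)` models the first warp of the `data[idx]` kernel and `segments_touched(1, 128)` models the first warp of the `data[idx + 1]` kernel. The shifted access spans one extra segment, which is why it may cost more memory traffic. Whether that shows up as a measurable slowdown depends on the GPU and its caches.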