The Swizzle Operator
__shared__ float smem[BLOCK_SIZE][BLOCK_SIZE + 1];
smem[threadIdx.y][threadIdx.x ^ threadIdx.y] = data[idx];
^ Is this faster?
__shared__ float smem[BLOCK_SIZE][BLOCK_SIZE];
smem[threadIdx.y][threadIdx.x] = data[idx];
^ Is this faster?
* For illustration purposes only; see the FAQ for more details.
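
One rough way to reason about the question is to count how many lanes of a warp land in the same shared-memory bank when the warp reads one logical column of the tile. The sketch below does that on the host. It assumes BLOCK_SIZE = 32 and 32 banks of 4-byte words. All function names are hypothetical, and the model ignores everything except the bank mapping, so it is not a benchmark.

```cpp
#include <algorithm>
#include <array>

constexpr int kTile  = 32;  // assumed BLOCK_SIZE
constexpr int kBanks = 32;  // assumed: 32 banks, one 4-byte word wide

// Each layout maps a logical element (row, col) to a word offset in smem.
int plain(int row, int col)    { return row * kTile + col; }                  // smem[N][N]
int padded(int row, int col)   { return row * (kTile + 1) + col; }            // smem[N][N + 1]
int swizzled(int row, int col) { return row * kTile + (col ^ row); }          // smem[N][N], x ^ y
int padded_swizzled(int row, int col) {                                       // smem[N][N + 1], x ^ y
    return row * (kTile + 1) + (col ^ row);
}

// Worst-case number of lanes hitting one bank when lane `row` of a warp
// reads logical element (row, col). 1 means conflict-free under this model.
template <typename AddrFn>
int max_lanes_per_bank(AddrFn word_index, int col) {
    std::array<int, kBanks> hits{};
    for (int row = 0; row < kTile; ++row) {
        ++hits[word_index(row, col) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, reading column 0 gives 32 for the plain layout, 1 for padding alone, 1 for the XOR swizzle alone, and 2 for padding combined with the XOR swizzle (the first snippet above). Row-wise stores like the ones shown, where threadIdx.x varies across the warp, are conflict-free in every layout in this model, so any difference shows up only on column-wise access.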