Kernel Fusion — Which is Faster?

__global__ void fused(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;   // guard threads past the end of the array
    float x = data[idx];
    x = x * 2.0f + 1.0f;    // op1
    x = sqrtf(x);           // op2
    data[idx] = x;
}

__global__ void op1(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = data[idx] * 2.0f + 1.0f;
}

__global__ void op2(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = sqrtf(data[idx]);
}

Kernel fusion combines multiple operations into a single kernel, reducing memory round-trips. Each separate kernel must read data from global memory and write results back. A fused kernel keeps intermediate values in registers, avoiding the global memory write-read between steps.
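One way to check this on your own hardware is to time both paths with CUDA events. The harness below is a minimal sketch, not a rigorous benchmark: the helper names run_pass and benchmark, the block size of 256, the array size, and the repetition count are all assumptions chosen for illustration, and error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same arithmetic as the kernels above, with a bounds guard.
__global__ void fused(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = sqrtf(data[idx] * 2.0f + 1.0f);
}
__global__ void op1(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = data[idx] * 2.0f + 1.0f;
}
__global__ void op2(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = sqrtf(data[idx]);
}

// One pass over the array: a single fused kernel, or op1 followed by op2.
static void run_pass(float* d, int N, bool use_fused) {
    const int threads = 256;  // assumed block size
    const int blocks = (N + threads - 1) / threads;
    if (use_fused) {
        fused<<<blocks, threads>>>(d, N);
    } else {
        op1<<<blocks, threads>>>(d, N);
        op2<<<blocks, threads>>>(d, N);
    }
}

// Hypothetical helper: average milliseconds per pass over `reps` timed passes.
float benchmark(int N, bool use_fused, int reps) {
    std::vector<float> host(N, 1.0f);
    float* d = nullptr;
    cudaMalloc(&d, N * sizeof(float));
    cudaMemcpy(d, host.data(), N * sizeof(float), cudaMemcpyHostToDevice);

    run_pass(d, N, use_fused);  // warm-up pass, excluded from timing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) run_pass(d, N, use_fused);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed kernels have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ms / reps;
}

int main() {
    const int N = 1 << 24;  // 16M floats (64 MiB), an assumed size
    const int reps = 20;    // assumed repetition count
    printf("fused:     %.3f ms per pass\n", benchmark(N, true, reps));
    printf("op1 + op2: %.3f ms per pass\n", benchmark(N, false, reps));
    return 0;
}
```

The measured gap depends on the GPU. If the whole array fits in the L2 cache, the second kernel's reads may be served from cache and the difference can shrink, so it is worth trying several array sizes.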

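For memory-bound operations like these, fusion can nearly double throughput by halving memory accesses: the two separate kernels perform two global loads and two global stores per element, while the fused kernel performs one of each and also saves a kernel launch. However, fusing more work into one kernel increases register pressure and may reduce occupancy, so there is a tradeoff.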
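The register side of this tradeoff can be inspected with the CUDA runtime rather than guessed. The sketch below queries registers per thread and the number of resident blocks per SM for each kernel; the helper name report and the block size of 256 are assumptions for illustration. For kernels as small as these the numbers will likely be close, and the effect tends to show up only when many operations are fused.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fused(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = sqrtf(data[idx] * 2.0f + 1.0f);
}
__global__ void op1(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = data[idx] * 2.0f + 1.0f;
}
__global__ void op2(float* data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) data[idx] = sqrtf(data[idx]);
}

using KernelFn = void (*)(float*, int);

// Hypothetical helper: prints register usage and occupancy for one kernel
// and returns the number of blocks that can be resident on one SM.
int report(const char* name, KernelFn kernel, int threads) {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel);  // attr.numRegs = registers per thread

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, kernel, threads, 0 /* no dynamic shared memory */);

    printf("%-6s regs/thread = %d, resident blocks/SM = %d\n",
           name, attr.numRegs, blocks_per_sm);
    return blocks_per_sm;
}

int main() {
    const int threads = 256;  // assumed block size
    report("fused", fused, threads);
    report("op1", op1, threads);
    report("op2", op2, threads);
    return 0;
}
```

If a fused kernel reports noticeably fewer resident blocks per SM than its unfused parts, that is a hint that register pressure may be eating into the memory-traffic savings.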

Frameworks like PyTorch and TensorFlow rely heavily on kernel fusion (via torch.compile and XLA, respectively) to achieve good GPU performance.
