Kernel Fission — Which is Faster?

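// heavy_compute is assumed to be a __device__ function defined elsewhere.
__global__ void compute(float* out, float* in, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)  // guard the tail when N is not a multiple of the block size
        out[idx] = heavy_compute(in[idx]);
}

__global__ void scatter(float* dst, float* src, int* map, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        dst[map[idx]] = src[idx];
}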

Kernel fission is the opposite of fusion: splitting one kernel into multiple specialized kernels. This can help when a kernel has both compute-heavy and memory-heavy parts with different resource requirements.
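For contrast, the fused form that the two kernels above would be split from might look like the sketch below. The `fused` kernel, the body of `heavy_compute` (an arbitrary arithmetic loop), the reversal `map`, and the problem size are all assumptions made for illustration. The program runs both variants, checks that they agree, and times them with CUDA events; which one wins depends on the real `heavy_compute` and on the GPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the document's heavy_compute (assumption): an arbitrary FMA loop.
__device__ float heavy_compute(float x) {
    float acc = x;
    for (int i = 0; i < 256; ++i) acc = fmaf(acc, 0.999f, 0.001f);
    return acc;
}

// Hypothetical fused kernel: compute and scatter in a single pass.
__global__ void fused(float* dst, float* in, int* map, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) dst[map[idx]] = heavy_compute(in[idx]);
}

// Fissioned pair, as in the listing above.
__global__ void compute(float* out, float* in, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = heavy_compute(in[idx]);
}
__global__ void scatter(float* dst, float* src, int* map, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) dst[map[idx]] = src[idx];
}

int main() {
    const int N = 1 << 20, threads = 256, blocks = (N + threads - 1) / threads;
    std::vector<float> h_in(N);
    std::vector<int> h_map(N);
    for (int i = 0; i < N; ++i) {
        h_in[i] = i * 1e-6f;
        h_map[i] = N - 1 - i;  // reversal: a valid permutation, so no write conflicts
    }

    float *in, *tmp, *dstFused, *dstSplit;
    int* map;
    cudaMalloc(&in, N * sizeof(float));
    cudaMalloc(&tmp, N * sizeof(float));       // intermediate buffer needed only by the split form
    cudaMalloc(&dstFused, N * sizeof(float));
    cudaMalloc(&dstSplit, N * sizeof(float));
    cudaMalloc(&map, N * sizeof(int));
    cudaMemcpy(in, h_in.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(map, h_map.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    // Warm-up launches so one-time module loading is not charged to either variant.
    fused<<<blocks, threads>>>(dstFused, in, map, N);
    compute<<<blocks, threads>>>(tmp, in, N);
    scatter<<<blocks, threads>>>(dstSplit, tmp, map, N);
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);
    cudaEventRecord(t0);
    fused<<<blocks, threads>>>(dstFused, in, map, N);
    cudaEventRecord(t1);
    compute<<<blocks, threads>>>(tmp, in, N);
    scatter<<<blocks, threads>>>(dstSplit, tmp, map, N);
    cudaEventRecord(t2);
    cudaEventSynchronize(t2);

    float msFused = 0.0f, msSplit = 0.0f;
    cudaEventElapsedTime(&msFused, t0, t1);
    cudaEventElapsedTime(&msSplit, t1, t2);

    // Both variants apply the same arithmetic to the same inputs, so outputs should match.
    std::vector<float> a(N), b(N);
    cudaMemcpy(a.data(), dstFused, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy(b.data(), dstSplit, N * sizeof(float), cudaMemcpyDeviceToHost);
    bool same = (a == b);
    printf("fused: %.3f ms, split: %.3f ms, results %s\n",
           msFused, msSplit, same ? "match" : "DIFFER");

    cudaEventDestroy(t0); cudaEventDestroy(t1); cudaEventDestroy(t2);
    cudaFree(in); cudaFree(tmp); cudaFree(dstFused); cudaFree(dstSplit); cudaFree(map);
    return same ? 0 : 1;
}
```

A single timed run is noisy; averaging over many iterations, or using a profiler, gives a more trustworthy comparison.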

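Splitting lets each kernel be tuned independently. The compute kernel can be given the registers it needs, even if that lowers its occupancy, while the scatter kernel needs few registers and can run at high occupancy to hide the latency of its irregular writes. Each kernel can also use its own block size and launch configuration. When kernels from independent work (for example, different batches in separate streams) are in flight together, the hardware can overlap a compute-bound kernel with a memory-bound one.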
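One way to express per-kernel tuning is the `__launch_bounds__` qualifier, which tells the compiler the maximum block size and, optionally, a minimum number of resident blocks per SM, so it can budget registers for each kernel separately. In the sketch below the bounds (128, and 256 with 4 blocks) are illustrative guesses rather than tuned values, and `heavy_compute` is again a placeholder.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder for the document's heavy_compute (assumption).
__device__ float heavy_compute(float x) {
    float acc = x;
    for (int i = 0; i < 256; ++i) acc = fmaf(acc, 0.999f, 0.001f);
    return acc;
}

// At most 128 threads per block: the compiler may spend more registers per thread.
__global__ void __launch_bounds__(128)
compute(float* out, float* in, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = heavy_compute(in[idx]);
}

// At most 256 threads per block, and at least 4 resident blocks per SM requested:
// the compiler limits register use so that this residency is reachable.
__global__ void __launch_bounds__(256, 4)
scatter(float* dst, float* src, int* map, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) dst[map[idx]] = src[idx];
}

int main() {
    const int N = 1 << 16;
    std::vector<int> h_map(N);
    for (int i = 0; i < N; ++i) h_map[i] = i;  // identity map, just to keep writes conflict-free

    float *in, *tmp, *dst;
    int* map;
    cudaMalloc(&in, N * sizeof(float));
    cudaMalloc(&tmp, N * sizeof(float));
    cudaMalloc(&dst, N * sizeof(float));
    cudaMalloc(&map, N * sizeof(int));
    cudaMemset(in, 0, N * sizeof(float));
    cudaMemcpy(map, h_map.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    // Each launch must respect its own bound: block size <= the declared maximum.
    compute<<<(N + 127) / 128, 128>>>(tmp, in, N);
    scatter<<<(N + 255) / 256, 256>>>(dst, tmp, map, N);

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(in); cudaFree(tmp); cudaFree(dst); cudaFree(map);
    return err == cudaSuccess ? 0 : 1;
}
```

In a fused kernel a single bound has to serve both phases; after fission each kernel carries the bound that suits it.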

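Fission tends to pay off when the combined kernel has poor occupancy because of high register usage, or when its phases have fundamentally different access patterns. It is not free: the intermediate result makes an extra round trip through global memory and each additional kernel adds launch overhead, so the split version should be measured against the fused one.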
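Whether register pressure really limits occupancy can be checked before committing to a split. The sketch below, assuming a hypothetical `fused` kernel and a placeholder `heavy_compute`, asks the CUDA runtime for each kernel's register count and for the number of blocks that can be resident per SM at a given block size. With such a small placeholder the three kernels will likely report similar numbers; the point is the query itself, applied to the real kernels.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for the document's heavy_compute (assumption).
__device__ float heavy_compute(float x) {
    float acc = x;
    for (int i = 0; i < 256; ++i) acc = fmaf(acc, 0.999f, 0.001f);
    return acc;
}

// Hypothetical fused kernel, used only as a comparison point.
__global__ void fused(float* dst, float* in, int* map, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) dst[map[idx]] = heavy_compute(in[idx]);
}
__global__ void compute(float* out, float* in, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) out[idx] = heavy_compute(in[idx]);
}
__global__ void scatter(float* dst, float* src, int* map, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) dst[map[idx]] = src[idx];
}

// Print registers per thread and theoretical occupancy for one kernel.
template <typename Kernel>
void report(const char* name, Kernel kernel, int blockSize, int maxThreadsPerSM) {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernel);

    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, kernel, blockSize, 0);

    double occupancy = 100.0 * blocksPerSM * blockSize / maxThreadsPerSM;
    printf("%-8s regs/thread=%3d  blocks/SM=%2d  occupancy=%.0f%%\n",
           name, attr.numRegs, blocksPerSM, occupancy);
}

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("no CUDA device available\n");
        return 1;
    }
    const int blockSize = 256;  // illustrative choice
    report("fused", fused, blockSize, prop.maxThreadsPerMultiProcessor);
    report("compute", compute, blockSize, prop.maxThreadsPerMultiProcessor);
    report("scatter", scatter, blockSize, prop.maxThreadsPerMultiProcessor);
    return 0;
}
```

If the fused kernel reports many registers and low occupancy while a standalone scatter would need far fewer, that is the situation described above where fission is worth trying.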

References:

Wikipedia - Loop Fission
NVIDIA - Occupancy