Non-Temporal Write

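// 160 MB buffer, well beyond L3 cache
// arr must be 16-byte aligned; N is the number of 16-byte elements
__m128i val = _mm_set1_epi64x(42);
__m128i* p = (__m128i*)arr;
for (size_t i = 0; i < N; ++i) {
  _mm_stream_si128(p + i, val);  // non-temporal store: hints that the data should bypass the cache
}
_mm_sfence();  // order the streaming stores before any later stores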
^ Is this faster?
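// 160 MB buffer, well beyond L3 cache
// arr must be 16-byte aligned; N is the number of 16-byte elements
__m128i val = _mm_set1_epi64x(42);
__m128i* p = (__m128i*)arr;
for (size_t i = 0; i < N; ++i) {
  _mm_store_si128(p + i, val);  // regular aligned store: goes through the cache
}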
^ Or is this faster?

* The benchmark was run on an AMD Ryzen 9.

* For the full benchmark code, please refer here.

* For illustration purposes only; see the FAQ for more details.
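
To try the comparison yourself, here is a minimal sketch of how the two loops could be timed. It is not the benchmark code referred to above. The names `fill_stream`, `fill_store`, `time_ms`, `Timings`, and `compare_fills` are made up for this example, and it assumes an x86-64 compiler with SSE2 plus `_mm_malloc`/`_mm_free`.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics; also pulls in _mm_sfence and _mm_malloc
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Fill `count` 16-byte slots using non-temporal (streaming) stores.
void fill_stream(__m128i* p, std::size_t count, long long value) {
  // Both store intrinsics used here require a 16-byte-aligned address.
  assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
  const __m128i val = _mm_set1_epi64x(value);
  for (std::size_t i = 0; i < count; ++i) {
    _mm_stream_si128(p + i, val);
  }
  _mm_sfence();  // order the streaming stores before any later stores
}

// Fill `count` 16-byte slots using regular aligned stores.
void fill_store(__m128i* p, std::size_t count, long long value) {
  assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
  const __m128i val = _mm_set1_epi64x(value);
  for (std::size_t i = 0; i < count; ++i) {
    _mm_store_si128(p + i, val);
  }
}

// Wall-clock time of one fill pass, in milliseconds.
template <typename Fill>
double time_ms(Fill fill, __m128i* p, std::size_t count) {
  const auto t0 = std::chrono::steady_clock::now();
  fill(p, count, 42);
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

struct Timings {
  double stream_ms;
  double store_ms;
};

// Hypothetical driver: times one pass of each variant over `bytes` of memory.
// Returns {-1, -1} if the allocation fails.
Timings compare_fills(std::size_t bytes) {
  const std::size_t count = bytes / sizeof(__m128i);
  auto* p = static_cast<__m128i*>(_mm_malloc(count * sizeof(__m128i), 16));
  if (p == nullptr) return {-1.0, -1.0};

  fill_store(p, count, 0);  // touch every page first so page faults are not timed

  Timings t;
  t.stream_ms = time_ms(fill_stream, p, count);
  t.store_ms = time_ms(fill_store, p, count);

  // Read one value back so the compiler cannot treat the stores as dead.
  volatile long long sink = reinterpret_cast<long long*>(p)[0];
  (void)sink;

  _mm_free(p);
  return t;
}
```

Calling `compare_fills(std::size_t{160} * 1024 * 1024)` mirrors the 160 MB buffer from the snippets. A single pass is noisy, so a real measurement would repeat each variant several times and compare medians, and the result depends on the CPU, the compiler flags, and the buffer size.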