Non-Temporal Write
// 160MB buffer, well beyond L3 cache
__m128i val = _mm_set1_epi64x(42);
__m128i* p = (__m128i*)arr;
for (size_t i = 0; i < N; ++i) {
    _mm_stream_si128(p + i, val);
}
_mm_sfence();
^ This is Faster?
// 160MB buffer, well beyond L3 cache
__m128i val = _mm_set1_epi64x(42);
__m128i* p = (__m128i*)arr;
for (size_t i = 0; i < N; ++i) {
    _mm_store_si128(p + i, val);
}
^ This is Faster?
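Both snippets leave `arr` and `N` undefined, so here is a minimal sketch of how the two loops could be wrapped and timed. It is not the benchmark code referenced below. The helper names (`alloc_buffer`, `fill_stream`, `fill_store`, `buffer_equals`, `time_fill_ms`) are hypothetical, and the 64-byte alignment is an assumption chosen to match a typical cache line; both intrinsics only require 16-byte alignment.

```cpp
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_store_si128, _mm_sfence
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: allocate n 16-byte slots, 64-byte aligned (assumed
// cache-line size). Release the result with _mm_free.
__m128i* alloc_buffer(std::size_t n) {
    return static_cast<__m128i*>(_mm_malloc(n * sizeof(__m128i), 64));
}

// Non-temporal stores: written with a hint to bypass the cache hierarchy.
void fill_stream(__m128i* p, std::size_t n, __m128i val) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    for (std::size_t i = 0; i < n; ++i) {
        _mm_stream_si128(p + i, val);
    }
    _mm_sfence();  // order the streaming stores before any later stores
}

// Regular aligned stores: go through the cache as usual, no fence needed.
void fill_store(__m128i* p, std::size_t n, __m128i val) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 16 == 0);
    for (std::size_t i = 0; i < n; ++i) {
        _mm_store_si128(p + i, val);
    }
}

// Returns true if every 16-byte slot equals val (sanity check after a fill).
bool buffer_equals(const __m128i* p, std::size_t n, __m128i val) {
    for (std::size_t i = 0; i < n; ++i) {
        __m128i eq = _mm_cmpeq_epi32(_mm_load_si128(p + i), val);
        if (_mm_movemask_epi8(eq) != 0xFFFF) return false;
    }
    return true;
}

// Times a single call of fill(p, n, val) in milliseconds.
template <typename Fill>
double time_fill_ms(Fill fill, __m128i* p, std::size_t n, __m128i val) {
    auto t0 = std::chrono::steady_clock::now();
    fill(p, n, val);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

Assuming `N` counts 16-byte elements, a 160MB buffer corresponds to `n = 160 * 1024 * 1024 / 16`. A fair comparison would likely touch the buffer once before timing (so page faults are not measured) and repeat each `time_fill_ms(fill_stream, ...)` and `time_fill_ms(fill_store, ...)` call several times, since a single run is noisy.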
* The benchmark was run on an AMD Ryzen 9.
* For the full benchmark code, please refer here.
* For illustration purposes only, see FAQ for more details.