Misalign

std::vector<int64_t> arr(1'000'000);
int64_t sum = 0;
// Each element sits at its natural 8-byte alignment.
for (std::size_t i = 0; i < arr.size(); ++i) {
  sum += arr[i];
}
^ Is this faster?
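char buf[N * sizeof(int64_t) + 3];  // N: number of int64_t elements
char* p = buf + 3;  // 3-byte offset, so the reads are typically not 8-byte aligned
int64_t sum = 0;
for (int i = 0; i < N; ++i) {
  int64_t v;
  memcpy(&v, p + i * sizeof(int64_t), sizeof(int64_t));  // well-defined unaligned load
  sum += v;
}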
^ Is this faster?

* The benchmark was run on an AMD Ryzen 9.

* For the full benchmark code, see here.

* For illustration purposes only; see the FAQ for more details.
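
A minimal sketch of how the two loops above could be compared side by side. This is not the benchmark code referred to above; the names sum_aligned, sum_unaligned, make_offset_buffer and time_ms are hypothetical helpers introduced only for illustration, and the 3-byte offset simply mirrors the snippet above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Sum the elements through ordinary, naturally aligned loads.
std::int64_t sum_aligned(const std::vector<std::int64_t>& arr) {
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < arr.size(); ++i) {
    sum += arr[i];
  }
  return sum;
}

// Sum n int64_t values stored back to back starting at p.
// p may be misaligned; memcpy keeps each load well-defined.
std::int64_t sum_unaligned(const char* p, std::size_t n) {
  assert(p != nullptr || n == 0);
  std::int64_t sum = 0;
  for (std::size_t i = 0; i < n; ++i) {
    std::int64_t v;
    std::memcpy(&v, p + i * sizeof(std::int64_t), sizeof(std::int64_t));
    sum += v;
  }
  return sum;
}

// Hypothetical helper: copy arr into a byte buffer, starting `offset`
// bytes in. Heap buffers are typically aligned to 8 bytes or more, so an
// offset of 3 typically leaves every element misaligned.
std::vector<char> make_offset_buffer(const std::vector<std::int64_t>& arr,
                                     std::size_t offset = 3) {
  std::vector<char> buf(arr.size() * sizeof(std::int64_t) + offset);
  if (!arr.empty()) {
    std::memcpy(buf.data() + offset, arr.data(),
                arr.size() * sizeof(std::int64_t));
  }
  return buf;
}

// Hypothetical helper: wall-clock time of one call to f, in milliseconds.
template <class F>
double time_ms(F&& f) {
  const auto t0 = std::chrono::steady_clock::now();
  f();
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

To use it, fill a std::vector<std::int64_t>, build the offset buffer once, then pass each sum call to time_ms inside a lambda that stores the result in a variable you print afterwards, so the compiler cannot discard the loop. A single run is noisy; repeating each measurement several times and comparing the fastest runs gives a fairer picture.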