alignas(64) is slower/causes pessimization

  alignment, c++, c++17

While working through beginner tutorials on lock-free programming (and doing some exercises), I ran into the following.

Consider the following code, where essentially the only difference is that struct node2 uses alignas(64) while struct node1 is padded to 64 bytes by hand:

#include <mutex>
#include <benchmark/benchmark.h>

struct node1 {
    node1* next{ nullptr };
    int value{ 0 };
private:
    char align_bytes[64 - sizeof(node1*) - sizeof(value)];
};

struct alignas(64) node2 {
    node2* next{ nullptr };
    int value{ 0 };
};

static_assert(sizeof(node1) == 64);
static_assert(sizeof(node2) == 64);

std::mutex node_mutex;
node1* locked1{nullptr};
node2* locked2{nullptr};

auto handmade = [](benchmark::State& state) {
    for (auto _ : state)
    {
        node1* new_node{ new node1() };
        node1* old{ nullptr };
        {
            std::lock_guard<std::mutex> lock(node_mutex);
            old = locked1;
            locked1 = new_node;
        }
        delete old;
    }
};

auto aligned = [](benchmark::State& state) {
    for (auto _ : state)
    {
        node2* new_node{ new node2() };
        node2* old{ nullptr };
        {
            std::lock_guard<std::mutex> lock(node_mutex);
            old = locked2;
            locked2 = new_node;
        }
        delete old;
    }
};

BENCHMARK(handmade)->Threads(1)->UseRealTime();
BENCHMARK(aligned)->Threads(1)->UseRealTime();

BENCHMARK(handmade)->Threads(2)->UseRealTime();
BENCHMARK(aligned)->Threads(2)->UseRealTime();

BENCHMARK(handmade)->Threads(4)->UseRealTime();
BENCHMARK(aligned)->Threads(4)->UseRealTime();

BENCHMARK(handmade)->Threads(8)->UseRealTime();
BENCHMARK(aligned)->Threads(8)->UseRealTime();

BENCHMARK_MAIN();

When compiled with

g++ -std=c++17 -Wall -Werror -O3 -o alignas_bench alignas.cpp -lbenchmark

using g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

I got these results:

g++ -std=c++17 -Wall -Werror -O3 -o alignas_bench alignas.cpp -lbenchmark
./alignas_bench
2021-08-26 19:52:16
Running ./alignas_bench
Run on (4 X 2300 MHz CPU s)
CPU Caches:
  L1 Data 24K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 1024K (x2)
Load Average: 1.01, 1.13, 1.02
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
handmade/real_time/threads:1       53.4 ns         53.4 ns     12753880
aligned/real_time/threads:1         198 ns          198 ns      3503060

handmade/real_time/threads:2        134 ns          267 ns      5135622
aligned/real_time/threads:2         315 ns          630 ns      2073218

handmade/real_time/threads:4        207 ns          754 ns      3949912
aligned/real_time/threads:4         325 ns         1260 ns      2111384

handmade/real_time/threads:8        202 ns          807 ns      3555384
aligned/real_time/threads:8         327 ns         1303 ns      2268416

I cannot explain why the "handmade alignment" node1 version appears to do exactly the same work so much faster (roughly 1.6x to 3.7x by real time in the runs above), when the only difference appears to be the alignas(64).
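One thing the two static_asserts above do not check is alignof: hand padding changes the size of node1 but not its alignment, while alignas(64) changes both for node2. In C++17, a new-expression for a type whose alignment exceeds the default new alignment calls the std::align_val_t overload of operator new instead of the plain one. The sketch below only shows that the two `new` expressions do not go through the same allocation function; whether that accounts for the timings is not established here. The names `probe`, `probe_result`, `plain_calls`, `aligned_calls` and `sink` are hypothetical, and it assumes a C++17 compiler and a platform that provides std::aligned_alloc (glibc, for example).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// The two node types from the question.
struct node1 {
    node1* next{ nullptr };
    int value{ 0 };
private:
    char align_bytes[64 - sizeof(node1*) - sizeof(int)];
};

struct alignas(64) node2 {
    node2* next{ nullptr };
    int value{ 0 };
};

static_assert(sizeof(node1) == 64 && sizeof(node2) == 64, "same size");
static_assert(alignof(node1) == alignof(node1*), "padding leaves alignment alone");
static_assert(alignof(node2) == 64, "alignas raises the alignment");

// Hypothetical counters (not thread-safe, for illustration only).
static int plain_calls = 0;
static int aligned_calls = 0;

// Replacement for the ordinary allocation function.
void* operator new(std::size_t size) {
    ++plain_calls;
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

// Replacement for the C++17 aligned allocation function.
void* operator new(std::size_t size, std::align_val_t al) {
    ++aligned_calls;
    const std::size_t a = static_cast<std::size_t>(al);
    // aligned_alloc expects the size to be a multiple of the alignment.
    const std::size_t rounded = (size + a - 1) / a * a;
    if (void* p = std::aligned_alloc(a, rounded)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p, std::align_val_t) noexcept { std::free(p); }
void operator delete(void* p, std::size_t, std::align_val_t) noexcept { std::free(p); }

// Volatile sink so the compiler cannot elide the new/delete pair.
static void* volatile sink = nullptr;

struct probe_result {
    int plain;    // calls to operator new(size_t)
    int aligned;  // calls to operator new(size_t, align_val_t)
};

// Allocates and frees one T, reporting which operator new was used.
template <class T>
probe_result probe() {
    const int p0 = plain_calls;
    const int a0 = aligned_calls;
    T* n = new T();
    sink = n;
    // The returned storage must honor the type's alignment.
    assert(reinterpret_cast<std::uintptr_t>(n) % alignof(T) == 0);
    delete n;
    return { plain_calls - p0, aligned_calls - a0 };
}
```

Calling `probe<node1>()` and `probe<node2>()` from `main` and printing the two counters shows which overload each type used. On implementations where the default new alignment is below 64, node1 should go through the plain overload and node2 through the aligned one.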

Source: Windows Questions C++
