Why don’t modern compilers coalesce neighboring memory accesses?

  c++, compiler-optimization, optimization

Consider the following code:

bool AllZeroes(const char buf[4])
{
    return buf[0] == 0 &&
           buf[1] == 0 &&
           buf[2] == 0 &&
           buf[3] == 0;
}

Output assembly from Clang 13 with -O3:

AllZeroes(char const*):                        # @AllZeroes(char const*)
        cmp     byte ptr [rdi], 0
        je      .LBB0_2
        xor     eax, eax
        ret
.LBB0_2:
        cmp     byte ptr [rdi + 1], 0
        je      .LBB0_4
        xor     eax, eax
        ret
.LBB0_4:
        cmp     byte ptr [rdi + 2], 0
        je      .LBB0_6
        xor     eax, eax
        ret
.LBB0_6:
        cmp     byte ptr [rdi + 3], 0
        sete    al
        ret

Each byte is compared individually, but the whole check could have been optimized into a single 32-bit integer comparison:

bool AllZeroes2(const char buf[4])
{
    // Read all four bytes as one 32-bit value (keeps the const qualifier).
    return *(const int*)buf == 0;
}

Resulting in:

AllZeroes2(char const*):                      # @AllZeroes2(char const*)
        cmp     dword ptr [rdi], 0
        sete    al
        ret
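
As a side note, the pointer cast above is not strictly portable: it reads a char array through an int lvalue, which conflicts with the aliasing rules, and buf may not be suitably aligned for an int. A minimal sketch of the same single-comparison idea using std::memcpy follows. The name AllZeroesMemcpy is hypothetical and not part of the original code. Mainstream compilers typically lower a fixed 4-byte memcpy like this to one 32-bit load, but that is worth confirming on the target compiler.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the original post): same check as
// AllZeroes2, but the four bytes are copied into an integer instead of
// being read through a casted pointer.
bool AllZeroesMemcpy(const char buf[4])
{
    std::uint32_t word;
    std::memcpy(&word, buf, sizeof word);  // copies exactly 4 bytes
    return word == 0;
}
```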

I’ve also checked GCC and MSVC, and neither of them does this optimization. Is this disallowed by the C++ specification?

Edit:
Changing the short-circuiting logical AND (&&) to a bitwise AND (&) produces the optimized code. Also, changing the order in which the bytes are compared doesn’t affect the generated code: https://godbolt.org/z/Y7TcG93sP
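
For reference, the bitwise variant mentioned in the edit might look like the sketch below. The name AllZeroesBitwise is hypothetical. A likely explanation for the difference: with &&, the function as written never reads buf[1] when buf[0] is nonzero, so a caller may legally pass a pointer whose later bytes are not readable. With &, all four bytes are read unconditionally, which leaves the compiler free to merge them into one wider load.

```cpp
// Hypothetical variant of AllZeroes: each comparison is evaluated
// unconditionally, so all four bytes are always read.
bool AllZeroesBitwise(const char buf[4])
{
    return (buf[0] == 0) &
           (buf[1] == 0) &
           (buf[2] == 0) &
           (buf[3] == 0);
}
```

The trade-off is that this version must only be called with a pointer to at least four readable bytes, whereas the && version only needs bytes up to and including the first nonzero one.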

Source: Windows Questions C++
