How to load into __m256 from a float* but reading backwards in memory as opposed to forwards?

  avx, c++, intrinsics, x86-64

I’ve got an array of floats that I’d like to access in reverse order. In my non-vectorized code this is easy.

Here is a simplifed version of the data that I have.

float A[8] = {a, b, c, d, e, f, g, h};
float B[8] = {s, t, u, v, w, x, y, z};

Here is the operation I would like to do.

float C[8] = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};

I’d like to be able to do some kind of load_ps operation that will give me something like this:

__m256 A_Loaded         = _mm256_load_ps(&A[0]);
                        = {a, b, c, d, e, f, g, h};

__m256 B_LoadedReversed = _mm256_loadr_ps(&B[7]);
                        = {z, y, x, w, v, u, t, s};

__m256 Output = _mm256_mul_ps(A_Loaded, B_LoadedReversed);
              = {a*z, b*y, c*x, d*w, e*v, f*u, g*t, h*s};

One of the data sources I have is a lookup table, so could be reversed if push comes to shove, but would much prefer to avoid that as that would compilcate other areas of the program.

I’ve currently got a botch workaround using _mm256_set_ps() and manually pointing to the data I need, but that is not as performative as I would like.

I know there is a ‘reversed’ _mm256_set_ps() (_mm256_setr_ps()), but there doesn’t seem to be the _mm256_loadr_ps() that I need.

Any ideas and thoughts about this problem would be greatly appreciated! Thanks in advance.

Source: Windows Questions C++