I am vectorizing code that uses more than 50 AVX2 and SSE intrinsics in total, including gather, shuffle, pack, unpack, extract, and cast. While profiling, I noticed that a single call to _mm256_permute4x64_epi64 makes the code 10 times slower, while the rest of the code performs well.
I am testing on an AMD Ryzen 7 PRO 4750U. I would like to know:
- Has anybody else come across a similar problem?
- Any suggestions for replacing the following call with a more efficient instruction sequence?
tmp = _mm256_permute4x64_epi64(tmpp10, 0xD8);
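For reference, the immediate 0xD8 is binary 11 01 10 00, so the call selects source 64-bit lanes 0, 2, 1, 3. In other words, it swaps the two middle lanes. A minimal sketch of one candidate replacement follows, built from 128-bit unpack operations plus an extract and an insert. The function names (swap_middle_permute, swap_middle_unpack, run_permute, run_unpack) and the AVX2_FN macro are made up for illustration and are not from the original code. The sketch only shows that the two forms produce the same result; whether the replacement is actually faster on this CPU is not established here and would have to be measured.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical helper macro: lets the snippet build on GCC/Clang without
// passing -mavx2 globally. MSVC accepts AVX2 intrinsics without a flag.
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_FN __attribute__((target("avx2")))
#else
#define AVX2_FN
#endif

// Original form: 0xD8 picks source lanes 0, 2, 1, 3 (swaps the middle lanes).
AVX2_FN static inline __m256i swap_middle_permute(__m256i v) {
    return _mm256_permute4x64_epi64(v, 0xD8);
}

// Candidate replacement (an assumption to benchmark, not a known fix):
// split into 128-bit halves, interleave the 64-bit elements, recombine.
AVX2_FN static inline __m256i swap_middle_unpack(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);       // [v0, v1]
    __m128i hi = _mm256_extracti128_si256(v, 1);  // [v2, v3]
    __m128i new_lo = _mm_unpacklo_epi64(lo, hi);  // [v0, v2]
    __m128i new_hi = _mm_unpackhi_epi64(lo, hi);  // [v1, v3]
    return _mm256_inserti128_si256(_mm256_castsi128_si256(new_lo), new_hi, 1);
}

// Array-based wrappers so both forms can be compared from ordinary code.
AVX2_FN void run_permute(const std::int64_t in[4], std::int64_t out[4]) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), swap_middle_permute(v));
}

AVX2_FN void run_unpack(const std::int64_t in[4], std::int64_t out[4]) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), swap_middle_unpack(v));
}
```

Calling run_permute and run_unpack on the same four-element input and comparing the outputs is a quick way to confirm the equivalence before timing both variants inside the real loop.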