_mm256_permute4x64_epi64 makes the code an order of magnitude slower on AMD Ryzen 7 PRO 4750U

  amd-processor, c++, performance, vectorization

I am vectorizing code that uses more than 50 AVX2 and SSE instructions in total, including gather, shuffle, pack, unpack, extract, and cast. Profiling shows that a single call to _mm256_permute4x64_epi64 makes the code about 10 times slower, while the rest of the code performs well.
The processor I am testing on is an AMD Ryzen 7 PRO 4750U. I would like to know:

  1. Has anybody else come across a similar problem?
  2. Any suggestions for replacing the following call with a more efficient instruction?
tmp = _mm256_permute4x64_epi64(tmpp10, 0xD8);
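
For reference, the selector 0xD8 is binary 11 01 10 00, so the call keeps 64-bit elements 0 and 3 in place and swaps elements 1 and 2 (the usual fix-up after a 256-bit pack). One possible replacement, sketched below, builds the same result from 128-bit halves with unpack instructions instead of a lane-crossing permute. This is only a sketch: the names permute_d8_reference, permute_d8_via_unpack, Lanes, apply_reference, apply_via_unpack, and AVX2_TARGET are hypothetical helpers introduced for illustration, and whether this sequence is actually faster on the 4750U has not been measured here and would need to be benchmarked in the real code.

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Assumption: on GCC/Clang this attribute enables AVX2 per function, so the
// file also builds without -mavx2. MSVC needs no attribute for intrinsics.
#if defined(__GNUC__) || defined(__clang__)
#define AVX2_TARGET __attribute__((target("avx2")))
#else
#define AVX2_TARGET
#endif

// The call from the question: result = [q0, q2, q1, q3].
AVX2_TARGET inline __m256i permute_d8_reference(__m256i v) {
    return _mm256_permute4x64_epi64(v, 0xD8);
}

// Hypothetical alternative: same result using 128-bit halves.
AVX2_TARGET inline __m256i permute_d8_via_unpack(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);       // [q0, q1]
    __m128i hi = _mm256_extracti128_si256(v, 1);  // [q2, q3]
    __m128i new_lo = _mm_unpacklo_epi64(lo, hi);  // [q0, q2]
    __m128i new_hi = _mm_unpackhi_epi64(lo, hi);  // [q1, q3]
    // Recombine the two halves into one 256-bit register.
    return _mm256_inserti128_si256(_mm256_castsi128_si256(new_lo), new_hi, 1);
}

using Lanes = std::array<std::int64_t, 4>;

// Small array wrappers so both versions can be compared on plain data.
AVX2_TARGET inline Lanes apply_reference(const Lanes& in) {
    Lanes out{};
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in.data()));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()),
                        permute_d8_reference(v));
    return out;
}

AVX2_TARGET inline Lanes apply_via_unpack(const Lanes& in) {
    Lanes out{};
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in.data()));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out.data()),
                        permute_d8_via_unpack(v));
    return out;
}
```

If the permute only exists to undo the in-lane interleaving of an earlier pack or unpack, it may also be worth checking whether that step can be reordered so that no lane-crossing shuffle is needed at all. That depends on code not shown in the question.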

Source: Windows Questions C++