Category : vectorization

I have a vector of 1 billion ints that I want to initialize with the numbers 0 to 1 billion. Using the STL's std::iota (time: 0.63 s): std::iota(std::begin(v), std::end(v), 0); To make it parallel, we can use std::for_each with the par policy (time: 0.68 s, 8% slower than sequential iota): std::for_each(std::execution::par, v.begin(), v.end(), [&v](int& x) { x = (&x ..
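The lambda in the excerpt above is cut off, so the following is only a rough sketch of one way such a fill could work: it assumes each element's value is derived from its offset from the start of the buffer. The function names fill_with_iota and fill_by_address are hypothetical and do not come from the original question. The sketch runs sequentially; the parallel form would pass std::execution::par as the first argument to std::for_each (with <execution> included), which is left out here because some toolchains need an extra backend library for it.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Sequential baseline: fill the vector with 0, 1, 2, ... using std::iota.
std::vector<int> fill_with_iota(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    return v;
}

// Assumed variant: each element computes its own index from its address.
// Because no element depends on its neighbour, the same lambda could be
// handed to a parallel std::for_each overload.
std::vector<int> fill_by_address(std::size_t n) {
    std::vector<int> v(n);
    const int* base = v.data();  // address of element 0
    std::for_each(v.begin(), v.end(), [base](int& x) {
        // Pointer difference gives the element's index in the buffer.
        x = static_cast<int>(&x - base);
    });
    return v;
}
```

Both functions should produce the same contents for a given n, so comparing fill_by_address(n) with fill_with_iota(n) on a small n is a quick sanity check before timing anything at the billion-element scale.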

I have a minimal reproducible example, as follows:

template<typename type>
void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){
    for(size_t i=0; i < size * size; i++){
        result[i] = matA[i] + matB[i];
    }
}

int main(){
    size_t size = 8192;
    //std::cout<<sizeof(double) * 8<<std::endl;
    auto matA = (float*) aligned_alloc(sizeof(float), size * size * ..
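Since the sample above is truncated, here is a small self-contained sketch of the same element-wise addition, not the original poster's full program. It assumes 32-byte alignment is wanted (the width of a 256-bit vector register) instead of the sizeof(float) alignment in the excerpt, and it rounds the allocation size up to a multiple of the alignment, which std::aligned_alloc may require. The names add_matrices, alloc_floats and add_check are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Element-wise sum of two size x size matrices stored as flat arrays.
template <typename T>
void add_matrices(const T* a, const T* b, T* result, std::size_t size) {
    const std::size_t n = size * size;
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = a[i] + b[i];
    }
}

// Hypothetical helper: allocate n floats on a 32-byte boundary.
// The byte count is rounded up to a multiple of the alignment.
float* alloc_floats(std::size_t n) {
    const std::size_t alignment = 32;  // assumption: 256-bit vectors
    std::size_t bytes = n * sizeof(float);
    bytes = (bytes + alignment - 1) / alignment * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, bytes));
}

// Hypothetical check: fill two matrices, add them, verify every element
// and the alignment of the result buffer.
bool add_check(std::size_t size) {
    const std::size_t n = size * size;
    float* a = alloc_floats(n);
    float* b = alloc_floats(n);
    float* r = alloc_floats(n);
    bool ok = a && b && r;
    if (ok) {
        for (std::size_t i = 0; i < n; ++i) {
            a[i] = static_cast<float>(i);
            b[i] = static_cast<float>(2 * i);
        }
        add_matrices(a, b, r, size);
        ok = reinterpret_cast<std::uintptr_t>(r) % 32 == 0;
        for (std::size_t i = 0; ok && i < n; ++i) {
            ok = (r[i] == static_cast<float>(3 * i));
        }
    }
    std::free(a);  // std::free accepts null pointers
    std::free(b);
    std::free(r);
    return ok;
}
```

Calling add_check(64) exercises the loop on a small matrix. Whether the compiler actually vectorizes the loop depends on the optimization flags and target, which this sketch does not try to control.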

I am vectorizing code that uses more than 50 AVX2 and SSE instructions overall, including gather, shuffle, pack, unpack, extract, cast, etc. By profiling, I noticed that a single call to _mm256_permute4x64_epi64 makes the code 10 times slower, while the rest of the code performs well. The architecture that I am testing on ..
