#### Category: vectorization

Good day, I’m new to intrinsics in C++ and I’m still trying to learn them. I just want to ask: how can we load an array of ints into a __m128i variable, perform addition with it, then store its values into another array of the same size? This is the working function that I ..

For me, one of the most interesting features in languages such as R or Scilab is the possibility of parallelizing operations by vectorizing functions ("meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time", in the words of The ..

Suppose I have a type T in C++. It has all sorts of methods and can be used as a parameter to a bunch of functions, etc. Now suppose I want to work on k elements of type T, with k known at compile time and small (e.g. k = 2 or k = 3), and with most/all ..

I would like to take as input a very long list of objects, verify a condition for each object, and build as output a list of the objects that passed the verification. The current code I have is the following: for( auto& my_iterator : my_input_list ) { my_output_list.push_back( my_iterator ); if( ! my_output_list.back().my_condition_valid() ) ..

I have a vector of 1 billion ints that I want to initialize with the numbers 0 to 1 billion. STL’s std::iota (time: 0.63 s): std::iota(std::begin(v), std::end(v), 0); To make it parallel we can use std::for_each with the par policy (time: 0.68 s, 8% slower than sequential iota): std::for_each(std::execution::par, v.begin(), v.end(), [&v](int& x) { x = (&x ..

I have a minimal reproducible sample which is as follows – template<typename type> void AddMatrixOpenMP(type* matA, type* matB, type* result, size_t size){ for(size_t i=0; i < size * size; i++){ result[i] = matA[i] + matB[i]; } } int main(){ size_t size = 8192; //std::cout<<sizeof(double) * 8<<std::endl; auto matA = (float*) aligned_alloc(sizeof(float), size * size * ..

I am vectorizing code that overall uses more than 50 AVX2 and SSE instructions, including gather, shuffle, pack, unpack, extract, cast, etc. By profiling, I noticed that a single call to _mm256_permute4x64_epi64 makes the code 10 times slower, while the rest of the code performs well. The architecture that I am testing on ..

I am wondering if there is any way to vectorize the below loop? #define NUMEL 10000000 int a[NUMEL]; for (int i = 1; i < NUMEL; i++){ a[i] += a[i-1]; } I guess it needs a SIMD operation like the one below to be able to do that. Please let me know if that is possible. Regards ..