If I can use SSE3 or AVX, are older SSE versions such as SSE2 or MMX then also available, or do I still need to check for them separately?
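One way to sidestep the question of which extensions imply which is to query every feature you depend on explicitly. The following is only a sketch, assuming GCC or Clang on an x86 target, where the `__builtin_cpu_supports` builtin is available. The helper names (`has_mmx`, `has_sse2`, `has_sse3`, `has_avx`, `has_avx2`, `print_simd_support`) and the `CPU_HAS` macro are made up for illustration.

```c
#include <stdio.h>

/* Each helper returns 1 if the running CPU reports the feature, else 0.
   __builtin_cpu_supports is a GCC/Clang builtin for x86 targets and needs a
   string literal argument. On any other compiler or architecture this
   sketch conservatively reports 0 for everything. */
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define CPU_HAS(name) (__builtin_cpu_supports(name) ? 1 : 0)
#else
#define CPU_HAS(name) 0
#endif

int has_mmx(void)  { return CPU_HAS("mmx");  }
int has_sse2(void) { return CPU_HAS("sse2"); }
int has_sse3(void) { return CPU_HAS("sse3"); }
int has_avx(void)  { return CPU_HAS("avx");  }
int has_avx2(void) { return CPU_HAS("avx2"); }

/* Print one line per feature so the flags can be compared side by side. */
void print_simd_support(void)
{
    printf("MMX : %d\n", has_mmx());
    printf("SSE2: %d\n", has_sse2());
    printf("SSE3: %d\n", has_sse3());
    printf("AVX : %d\n", has_avx());
    printf("AVX2: %d\n", has_avx2());
}
```

Calling `print_simd_support()` once at startup shows whether the older extensions are reported alongside the newer ones on a given machine.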

I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will...

I've tried to implement the dot product of these two arrays using AVX (http://stackoverflow.com/a/10459028), but my code is very slow. A and xb are arrays of doubles, and n is an even number. Can you help me? const int mask = 0x31; int sum =0; for (int i = 0; i < n;...

I am performing a census transform on an image, doing 32 comparisons per pixel. I can efficiently generate a 256-bit vector of 0x0100010100010100... where each byte is either 0x00 or 0x01. The vector is identified below as 'comparisons'. I need to collapse this 256-bit vector to generate a 32-bit Hamming string....

As part of a compression algorithm, I am looking for the optimal way to achieve the following: I have a simple bitmap in a uint8_t, for example 01010011. What I want is a __m256i of the form: (0, maxint, 0, maxint, 0, 0, maxint, maxint) One way to achieve this...
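A possible approach to this kind of bit-to-lane expansion is to broadcast the byte to all eight 32-bit lanes, AND each lane with its own bit, and compare. The sketch below makes two assumptions that the question does not settle: bit i (counting from the least significant bit) maps to lane i, and "maxint" means all 32 bits set. The function names `expand_bits` and `expand_bits_avx2` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang features.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* AVX2 path. The target attribute lets this one function use AVX2
   intrinsics even when the file is compiled without -mavx2. */
__attribute__((target("avx2")))
static void expand_bits_avx2(uint8_t bits, int32_t *out)
{
    /* Lane i holds the single bit (1 << i). */
    const __m256i lane_bit = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);
    __m256i v = _mm256_set1_epi32(bits);            /* broadcast the byte */
    __m256i hit = _mm256_and_si256(v, lane_bit);    /* isolate bit i in lane i */
    /* Lanes whose bit is set become all ones, the others become zero. */
    __m256i mask = _mm256_cmpeq_epi32(hit, lane_bit);
    _mm256_storeu_si256((__m256i *)out, mask);
}

/* Writes 8 lanes to out: -1 (all bits set) where the bit is 1, else 0.
   Falls back to scalar code if the CPU does not report AVX2. */
void expand_bits(uint8_t bits, int32_t *out)
{
    assert(out);
    if (__builtin_cpu_supports("avx2")) {
        expand_bits_avx2(bits, out);
        return;
    }
    for (int i = 0; i < 8; ++i)
        out[i] = ((bits >> i) & 1) ? -1 : 0;
}
```

If the intended lane order is the reverse, only the constants in `lane_bit` need to be reordered.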

I have the following code extract written in C: double* res; posix_memalign((void **)&res, 32, sizeof(double)*4); __m256 ymm0, ymm1, ymm2, ymm3; ymm0 = _mm256_load_pd(vector_a); ymm1 = _mm256_load_pd(vector_b); ymm2 = _mm256_mul_pd(ymm1, ymm2); ymm3 = _mm256_store_pd((double*)res, ymm3); <--- problem line. When I compile, I get the following error message: error: assigning to '__m256'...
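Two things stand out in an extract like this: `_mm256_store_pd` returns `void`, so its result cannot be assigned, and the `_pd` intrinsics operate on `__m256d` (4 doubles) rather than `__m256` (8 floats). Here is a minimal sketch of a load, multiply, store sequence with those two points addressed. The function names `mul4` and `mul4_avx` are hypothetical, the unaligned load/store variants are used so the sketch makes no alignment assumptions, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>

/* AVX path: multiplies 4 doubles element-wise. */
__attribute__((target("avx")))
static void mul4_avx(const double *a, const double *b, double *res)
{
    __m256d va = _mm256_loadu_pd(a);       /* __m256d: 4 x double */
    __m256d vb = _mm256_loadu_pd(b);
    __m256d prod = _mm256_mul_pd(va, vb);
    _mm256_storeu_pd(res, prod);           /* returns void, nothing to assign */
}

/* res[i] = a[i] * b[i] for i in 0..3, with a scalar fallback when the
   running CPU does not report AVX. */
void mul4(const double *a, const double *b, double *res)
{
    assert(a && b && res);
    if (__builtin_cpu_supports("avx")) {
        mul4_avx(a, b, res);
        return;
    }
    for (int i = 0; i < 4; ++i)
        res[i] = a[i] * b[i];
}
```

With 32-byte aligned buffers (for example from `posix_memalign`), `_mm256_load_pd` and `_mm256_store_pd` could be substituted for the unaligned variants.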

When I run a simple series of load, subtract and multiply operations using the AVX intrinsics, I constantly get the following error: Process terminating with default action of signal 11 (SIGSEGV) ==2995== General Protection Fault. This is from the C code: double res[4] = {0.0, 0.0, 0.0, 0.0}; for(int i = 0; i...

I want to implement SIMD minmag and maxmag functions. As far as I understand, these functions are: minmag(a,b) = |a|<|b| ? a : b and maxmag(a,b) = |a|>|b| ? a : b. I want these for float and double, and my target hardware is Haswell. What I really need is code...

I'm writing an OpenCL benchmark in C. Currently, it measures the fused multiply-accumulate performance of both a CL device and the system's processor using C code. The results are then cross-checked for accuracy. I wrote the native code to take advantage of GCC's auto vectorizer, and it...

Recently I have been working on a numerical solver for computational electrodynamics using the finite-difference method. The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example: #pragma ivdep...

I have a union that looks like this union bareVec8f { __m256 m256; //avx 8x float vector float floats[8]; int ints[8]; inline bareVec8f(){ } inline bareVec8f(__m256 vec){ this->m256 = vec; } inline bareVec8f &operator=(__m256 m256) { this->m256 = m256; return *this; } inline operator __m256 &() { return m256; }...

I'm trying to work around an apparent bug in the clang compiler where using the AVX intrinsic _mm256_loadu_ps results in unnecessary instructions being output in assembly. In particular, first it does a vmovups on the first half of the input vector into an xmm register, then joins the second half...

What is the best way to check whether an AVX __m256 (a vector of 8 floats) contains any inf? I tried __m256 X=_mm256_set1_ps(1.0f/0.0f); _mm256_cmp_ps(X,X,_CMP_EQ_OQ); but this compares to true. Note that this method will find nan (which compare to false). So one way is to check for X!=nan && 0*X==nan:...
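One candidate approach, offered as a sketch rather than the best answer, is to clear the sign bit of each lane and compare the result for equality with +inf. NaN lanes fail the ordered comparison, so they are not reported. The function names `any_inf` and `any_inf_avx` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <math.h>

/* AVX path: returns 1 if any of the 8 floats at p is +inf or -inf. */
__attribute__((target("avx")))
static int any_inf_avx(const float *p)
{
    __m256 x = _mm256_loadu_ps(p);
    /* andnot computes (~mask) & x; with mask = -0.0f this clears the sign bit. */
    __m256 absx = _mm256_andnot_ps(_mm256_set1_ps(-0.0f), x);
    /* Ordered equality with +inf: true only for infinite lanes, false for NaN. */
    __m256 eq = _mm256_cmp_ps(absx, _mm256_set1_ps(INFINITY), _CMP_EQ_OQ);
    /* movemask gathers the sign bit of each lane into an 8-bit integer. */
    return _mm256_movemask_ps(eq) != 0;
}

/* Scalar fallback is used when the running CPU does not report AVX. */
int any_inf(const float *p)
{
    assert(p);
    if (__builtin_cpu_supports("avx"))
        return any_inf_avx(p);
    for (int i = 0; i < 8; ++i)
        if (isinf(p[i]))
            return 1;
    return 0;
}
```

Returning the raw `_mm256_movemask_ps` value instead of a boolean would also tell the caller which lanes are infinite.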

While playing around with overclocking and running burn tests, I noticed that the AVX-optimized version of LINPACK measured lower multithreaded floating-point throughput when Hyperthreading was enabled than with it disabled. This was on an Ivy Bridge i7 (3770k). I also noticed that with Hyperthreading disabled LINPACK resulted in higher core...

While preparing a presentation, it occurred to me that I don't know what the theoretical limit is for the number of integer operations a Haswell core can perform at once. I used to naively assume "Intel cores have HT, but that's probably parallelizing different kinds of work,...

For my current project I need to create a vector of 256-bit AVX vectors. I used myVector = vector<__m256d>(nrVars(), _mm256_set1_pd(1.0)); which worked fine once, but after executing the line twice it gives me a segmentation fault. I was able to come up with the following piece of code: vector<__m256d> temp;...

What is the best way to multiply each 32-bit entry of two __m256i registers with each other? _mm256_mul_epu32 is not what I'm looking for because it produces 64-bit outputs. Moreover, I'm sure that the multiplication of two 32-bit values will not overflow. Thanks!...
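AVX2 provides `_mm256_mullo_epi32`, which multiplies corresponding 32-bit lanes and keeps the low 32 bits of each product, so it may be what is wanted when the products are known to fit. A minimal sketch follows. The function names `mul8x32` and `mul8x32_avx2` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` assume GCC or Clang.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* AVX2 path: out[i] = low 32 bits of a[i] * b[i] for 8 lanes. */
__attribute__((target("avx2")))
static void mul8x32_avx2(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Lane-wise 32x32-bit multiply, keeping the low 32 bits of each result. */
    _mm256_storeu_si256((__m256i *)out, _mm256_mullo_epi32(va, vb));
}

/* Scalar fallback is used when the running CPU does not report AVX2. */
void mul8x32(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a && b && out);
    if (__builtin_cpu_supports("avx2")) {
        mul8x32_avx2(a, b, out);
        return;
    }
    /* Unsigned arithmetic wraps like the vector instruction does. */
    for (int i = 0; i < 8; ++i)
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
}
```

Because only the low 32 bits are kept, the same instruction serves signed and unsigned inputs as long as the true product fits in 32 bits.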

I'm trying to optimize some matrix computations and I was wondering whether it is possible to detect at compile time if SSE and/or AVX and/or AVX2 is enabled by the compiler. Ideally for G++ and Clang, but I can manage with only one of them. I'm not sure it is...
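GCC and Clang predefine macros such as `__SSE2__`, `__AVX__` and `__AVX2__` according to the instruction-set flags in effect (for example `-mavx2` or `-march=native`), so a compile-time check can be sketched with the preprocessor alone. The names `SIMD_LEVEL`, `SIMD_LEVEL_NAME`, `simd_level` and `simd_level_name` below are made up for illustration.

```c
/* Pick the highest instruction set the compiler was told it may use.
   This reflects compile-time flags, not what the CPU running the
   program actually supports. */
#if defined(__AVX2__)
#define SIMD_LEVEL 3
#define SIMD_LEVEL_NAME "AVX2"
#elif defined(__AVX__)
#define SIMD_LEVEL 2
#define SIMD_LEVEL_NAME "AVX"
#elif defined(__SSE2__)
#define SIMD_LEVEL 1
#define SIMD_LEVEL_NAME "SSE2"
#else
#define SIMD_LEVEL 0
#define SIMD_LEVEL_NAME "none"
#endif

/* Small accessors so the selected level can be inspected at run time. */
int simd_level(void) { return SIMD_LEVEL; }
const char *simd_level_name(void) { return SIMD_LEVEL_NAME; }
```

Running `gcc -dM -E -mavx2 - < /dev/null` (or the same with `clang`) lists the macros a given set of flags defines.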

I want to convert a vector of double-precision values to char. I have to implement two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2. __m128i sub_proc(__m256d& in) { __m256d _zero_pd = _mm256_setzero_pd(); __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd); __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd); __m128i ih_si =...

I am using the Microsoft Visual Studio compiler. I am trying to find out whether a 256-bit vector contains any non-zero values. I have tried res_simd = ! _mm256_testz_ps(*pSrc1, *pSrc1); but it does not work.
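A likely reason is that `_mm256_testz_ps` only examines the sign bit of each float lane, whereas `_mm256_testz_si256` tests all 256 bits. A sketch of the all-bits variant is shown below. The function names `any_bit_set` and `any_bit_set_avx` are hypothetical, and the `target` attribute and `__builtin_cpu_supports` are GCC/Clang features. With MSVC the intrinsic calls themselves would be the same, but the attribute and the runtime check would need to be dropped or replaced.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* AVX path: returns 1 if any bit in the 8 floats at p is set. */
__attribute__((target("avx")))
static int any_bit_set_avx(const float *p)
{
    /* Reinterpret the floats as integers; the cast generates no instruction. */
    __m256i v = _mm256_castps_si256(_mm256_loadu_ps(p));
    /* testz returns 1 when (v AND v) is all zero, so negate it. */
    return !_mm256_testz_si256(v, v);
}

/* Scalar fallback is used when the running CPU does not report AVX. */
int any_bit_set(const float *p)
{
    assert(p);
    if (__builtin_cpu_supports("avx"))
        return any_bit_set_avx(p);
    uint32_t bits[8];
    memcpy(bits, p, sizeof bits);   /* inspect the raw bit patterns */
    for (int i = 0; i < 8; ++i)
        if (bits[i] != 0)
            return 1;
    return 0;
}
```

Note that this is a bitwise test, so a lane holding -0.0f counts as non-zero. Comparing against zero with `_mm256_cmp_ps` and `_mm256_movemask_ps` would give a value-based test instead.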

I hope someone can help here. I have a large byte vector from which I create a small byte vector (based on a mask), which I then process with SIMD. Currently the mask is an array of baseOffset + submask (byte[256]), optimized for storage as there are...

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero, I want to reset the particle. The particle system is laid out in an SoA pattern like this: class ParticleSystem { private: float* mXPosition; float* mYPosition; float* mZPosition; .... Rest...