Does Hyperthreading have trouble with AVX?

While playing around with overclocking and running burn tests, I noticed that the AVX-optimized version of LINPACK measured lower multithreaded floating-point throughput when Hyperthreading was enabled than with it disabled. This was on an Ivy Bridge i7 (3770k). I also noticed that with Hyperthreading disabled LINPACK resulted in higher core...

GCC emits vastly different code using “-march=native” on similar architectures

I'm writing an OpenCL benchmark in C. Currently, it measures the fused multiply-accumulate performance of both a CL device and the system's processor using C code. The results are then cross-checked for accuracy. I wrote the native code to take advantage of GCC's auto vectorizer, and it...

Are older SIMD versions available when using newer ones?

If I can use SSE3 or AVX, are older SSE versions such as SSE2 or MMX also available, or do I still need to check for them separately?

32-bit Hamming String formation from 32 8-bit comparisons

I am performing a census transform on an image, doing 32 comparisons per pixel. I can efficiently generate a 256-bit vector of 0x0100010100010100... where each 8-bit element is 0x00 or 0x01. The vector is identified below as 'comparisons'. I need to collapse this 256-bit vector to generate a 32-bit Hamming string....

How to check inf for AVX intrinsic __m256

What is the best way to check whether an AVX intrinsic __m256 (vector of 8 floats) contains any inf? I tried __m256 X=_mm256_set1_ps(1.0f/0.0f); _mm256_cmp_ps(X,X,_CMP_EQ_OQ); but this compares to true. Note that this method will find nan (which compares to false). So one way is to check for X!=nan && 0*X==nan:...
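One possible approach, sketched here rather than taken from the thread, is to clear the sign bit and compare the magnitude against +inf with an ordered comparison, so that NaN lanes compare false. The function names `any_inf` and `any_inf_avx` are hypothetical, and the sketch assumes GCC or Clang on x86 (it relies on `__attribute__((target))` and `__builtin_cpu_supports`); a scalar fallback is used when AVX is unavailable at run time.

```c
#include <immintrin.h>
#include <math.h>

/* AVX path: returns 1 if any of the 8 floats at p is +inf or -inf.
   Assumes GCC/Clang on x86; the target attribute lets this compile without -mavx. */
__attribute__((target("avx")))
static int any_inf_avx(const float *p)
{
    const __m256 sign = _mm256_set1_ps(-0.0f);          /* only the sign bit set */
    const __m256 inf  = _mm256_set1_ps(INFINITY);
    __m256 x   = _mm256_loadu_ps(p);                    /* unaligned load is fine */
    __m256 mag = _mm256_andnot_ps(sign, x);             /* (~sign) & x == |x| */
    __m256 eq  = _mm256_cmp_ps(mag, inf, _CMP_EQ_OQ);   /* ordered: NaN gives false */
    return _mm256_movemask_ps(eq) != 0;                 /* any lane matched? */
}

/* Hypothetical wrapper: dispatches at run time, scalar fallback without AVX. */
int any_inf(const float *p)
{
    if (__builtin_cpu_supports("avx"))
        return any_inf_avx(p);
    for (int i = 0; i < 8; ++i)
        if (isinf(p[i]))
            return 1;
    return 0;
}
```

The wrapper takes a pointer instead of a `__m256` so that no vector value crosses between functions compiled for different targets.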

SIMD minmag and maxmag

I want to implement SIMD minmag and maxmag functions. As far as I understand these functions are minmag(a,b) = |a|<|b| ? a : b maxmag(a,b) = |a|>|b| ? a : b I want these for float and double and my target hardware is Haswell. What I really need is code...

AVX2 — multiply two __m256i integers

What is the best way to multiply each 32-bit entry of two __m256i registers with each other? _mm256_mul_epu32 is not what I'm looking for because it produces 64-bit outputs. Moreover, I'm sure that the multiplication of two 32-bit values will not overflow. Thanks!...
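A minimal sketch of one way to do this, assuming AVX2 is available: `_mm256_mullo_epi32` multiplies the eight 32-bit lanes pairwise and keeps the low 32 bits of each product. The function names `mul32x8` and `mul32x8_avx2` are hypothetical, and the run-time dispatch assumes GCC or Clang on x86.

```c
#include <immintrin.h>
#include <stdint.h>

/* AVX2 path: out[i] = low 32 bits of a[i] * b[i] for i = 0..7.
   Assumes GCC/Clang on x86; the target attribute avoids needing -mavx2. */
__attribute__((target("avx2")))
static void mul32x8_avx2(const int32_t *a, const int32_t *b, int32_t *out)
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i lo = _mm256_mullo_epi32(va, vb);   /* 8 products, low halves only */
    _mm256_storeu_si256((__m256i *)out, lo);
}

/* Hypothetical wrapper: uses AVX2 when the CPU reports it, else a scalar loop. */
void mul32x8(const int32_t *a, const int32_t *b, int32_t *out)
{
    if (__builtin_cpu_supports("avx2")) {
        mul32x8_avx2(a, b, out);
        return;
    }
    for (int i = 0; i < 8; ++i)   /* unsigned multiply avoids signed-overflow UB */
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
}
```

Because only the low 32 bits are kept, the result matches an ordinary 32-bit multiply whenever the product does not overflow, which is the case the question describes.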

Wrapper for __m256 producing segmentation fault with constructor

I have a union that looks like this union bareVec8f { __m256 m256; //avx 8x float vector float floats[8]; int ints[8]; inline bareVec8f(){ } inline bareVec8f(__m256 vec){ this->m256 = vec; } inline bareVec8f &operator=(__m256 m256) { this->m256 = m256; return *this; } inline operator __m256 &() { return m256; }...

C++ inline function wrapping single vmovups in GCC inline assembly

I'm trying to work around an apparent bug in the clang compiler where using the AVX intrinsic _mm256_loadu_ps results in unnecessary instructions being output in assembly. In particular, first it does a vmovups on the first half of the input vector into an xmm register, then joins the second half...

AVX2 Winner-Take-All Disparity Search

I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will...

Optimal uint8_t bitmap into an 8 x 32-bit SIMD “bool” vector

As part of a compression algorithm, I am looking for the optimal way to achieve the following: I have a simple bitmap in a uint8_t. For example 01010011 What I want is a __m256i of the form: (0, maxint, 0, maxint, 0, 0, maxint, maxint) One way to achieve this...

Intel SIMD - How can I check if an __m256* contains any non-zero values

I am using the Microsoft Visual Studio compiler. I am trying to find out if a 256 bit vector contains any non-zero values. I have tried res_simd = ! _mm256_testz_ps(*pSrc1, *pSrc1); but it does not work.
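For context, `_mm256_testz_ps` only examines the sign bit of each lane, so it cannot detect positive non-zero values. A sketch of one alternative is to reinterpret the vector as integers and use `_mm256_testz_si256`, which tests all 256 bits. The function names below are hypothetical, and although the question targets Visual Studio, this sketch assumes GCC or Clang on x86 for the run-time dispatch; the intrinsics themselves are the standard ones from `immintrin.h`.

```c
#include <immintrin.h>
#include <string.h>

/* AVX path: returns 1 if any bit of the 8 floats at p is set.
   Assumes GCC/Clang on x86; the target attribute avoids needing -mavx. */
__attribute__((target("avx")))
static int any_nonzero_bits_avx(const float *p)
{
    __m256i v = _mm256_castps_si256(_mm256_loadu_ps(p)); /* reinterpret, no conversion */
    return !_mm256_testz_si256(v, v);                    /* testz is 1 iff (v & v) == 0 */
}

/* Hypothetical wrapper: scalar byte comparison when AVX is not available. */
int any_nonzero_bits(const float *p)
{
    static const float zero[8] = {0};
    if (__builtin_cpu_supports("avx"))
        return any_nonzero_bits_avx(p);
    return memcmp(p, zero, sizeof zero) != 0;
}
```

This is a bitwise test, so -0.0f counts as non-zero. If the numeric value matters instead, comparing against zero with `_mm256_cmp_ps` and `_CMP_NEQ_UQ` followed by `_mm256_movemask_ps` would be the analogous value-based check.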

Segfault while creating a vector of avx vectors

For my current project I need to create a vector of 256-bit AVX vectors. I used myVector = vector<__m256d>(nrVars(), _mm256_set1_pd(1.0)); which worked fine once, but after executing the line twice it gives me a segmentation fault. I was able to come up with the following piece of code vector<__m256d> temp;...

How to reach AVX computation throughput for a simple loop

I have recently been working on a numerical solver for computational electrodynamics using the finite-difference method. The solver was very simple to implement, but it is very difficult to reach the theoretical throughput of modern processors, because there is only 1 math operation on the loaded data, for example: #pragma ivdep...

AVX - storing __m256 vector back to the memory (void**) in C

I have the following code extract written in C, double* res; posix_memalign((void **)&res, 32, sizeof(double)*4); __m256 ymm0, ymm1, ymm2, ymm3; ymm0 = _mm256_load_pd(vector_a); ymm1 = _mm256_load_pd(vector_b); ymm2 = _mm256_mul_pd(ymm1, ymm2); ymm3 = _mm256_store_pd((double*)res, ymm3); <--- problem line, When I compile, I get the following error message, error: assigning to '__m256'...

How to detect SSE/AVX/AVX2 availability at compile time?

I'm trying to optimize some matrix computations and I was wondering if it was possible to detect at compile time whether SSE, AVX, and/or AVX2 is enabled by the compiler. Ideally for G++ and Clang, but I can manage with only one of them. I'm not sure it is...
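As a rough illustration, GCC and Clang predefine macros such as `__SSE__`, `__SSE2__`, `__AVX__` and `__AVX2__` when the corresponding instruction set is enabled through `-m` flags or `-march`. The sketch below uses them to pick a level at compile time; the function name `simd_level` is hypothetical.

```c
#include <stdio.h>

/* Returns the highest SIMD level the compiler was told to target.
   The branch is chosen by the preprocessor, so no run-time CPU check happens here. */
const char *simd_level(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}
```

Calling `puts(simd_level());` prints the selected level, which depends on the flags used for the build. Running `gcc -march=native -dM -E - < /dev/null` lists the macros a given set of flags defines. These macros describe what the compiler may emit, not what the machine running the binary supports.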

Intel AVX multiplication error in C

When I run a simple series of load, subtract and multiply using the AVX intrinsics I'm constantly getting the following error, Process terminating with default action of signal 11 (SIGSEGV) ==2995== General Protection Fault from the C code, double res[4] = {0.0, 0.0, 0.0, 0.0}; for(int i = 0; i...

How many 32-bit integer ops can a Haswell core perform at once?

While preparing a presentation, it occurred to me that I don't know what the theoretical limit is for the number of integer operations a Haswell core can perform at once. I used to naively assume "Intel cores have HT, but that's probably parallelizing different kinds of work,...

Load vector from large vector with SIMD based on mask

I hope someone can help here. I have a large byte vector from which I create a small byte vector (based on a mask), which I then process with SIMD. Currently the mask is an array of baseOffset + submask (byte[256]), optimized for storage as there are...

Most efficient way to test a 256-bit YMM AVX register element for less than or equal to zero

I'm implementing a particle system using Intel AVX intrinsics. When the Y-position of a particle is less than or equal to zero I want to reset the particle. The particle system is ordered in a SOA-pattern like this: class ParticleSystem { private: float* mXPosition; float* mYPosition; float* mZPosition; .... Rest...

SSE - AVX conversion from double to char

I want to convert a vector of double precision values to char. I have to make two distinct approaches, one for SSE2 and the other for AVX2. I started with AVX2. __m128i sub_proc(__m256d& in) { __m256d _zero_pd = _mm256_setzero_pd(); __m256d ih_pd = _mm256_unpackhi_pd(in,_zero_pd); __m256d il_pd = _mm256_unpacklo_pd(in,_zero_pd); __m128i ih_si =...

How can I optimize my AVX implementation of dot product?

I've tried to implement the dot product of these two arrays using AVX (http://stackoverflow.com/a/10459028), but my code is very slow. A and xb are arrays of doubles, and n is an even number. Can you help me? const int mask = 0x31; int sum =0; for (int i = 0; i < n;...