AVX2 — multiply two __m256i integers

vectorization,sse,intrinsics,avx,avx2
What is the best way to multiply each 32-bit entry of two __m256i registers with each other? _mm256_mul_epu32 is not what I'm looking for because it produces 64-bit outputs. Moreover, I'm sure that the multiplication of two 32-bit values will not overflow. Thanks!...
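
One commonly used intrinsic for this case is _mm256_mullo_epi32, which keeps only the low 32 bits of each 32x32-bit product. The sketch below is an illustration, not a tuned implementation: the helper name mul32_lanes and the array-based interface are assumptions made for the example, and a scalar fallback is compiled when AVX2 is not enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical helper: out[i] = low 32 bits of a[i] * b[i], for 8 lanes. */
static void mul32_lanes(const int32_t a[8], const int32_t b[8], int32_t out[8])
{
    assert(a != NULL && b != NULL && out != NULL);
#ifdef __AVX2__
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* vpmulld: keeps the low 32 bits of each 32x32-bit product. */
    _mm256_storeu_si256((__m256i *)out, _mm256_mullo_epi32(va, vb));
#else
    /* Scalar fallback; unsigned arithmetic avoids signed-overflow UB. */
    for (int i = 0; i < 8; ++i)
        out[i] = (int32_t)((uint32_t)a[i] * (uint32_t)b[i]);
#endif
}
```

Build with -mavx2 (GCC or Clang) to take the intrinsic path; without it the loop gives the same lane-wise result.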

Segmentation fault in OpenMP program with SSE instructions with threads > 4

c++,multithreading,segmentation-fault,openmp,sse
I wrote a simple C++ OpenMP program that uses SSE instructions, and I get a segmentation fault when the number of threads is greater than 4. I am using g++ on Linux. #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/time.h> #include <emmintrin.h> #include <assert.h> #include <stdint.h> #include <omp.h> unsigned...

Intel SSE Intrinsics _mm_load_si128 segmentation fault

c,sse,simd,memory-alignment,intrinsics
I'm currently working with a 5 x 5 matrix using SSE features. I'm trying to load four 128-bit integer values into the xmm registers as follows: #include <emmintrin.h> #include <smmintrin.h> //===================================== Initialising matrix int* aligned_matrix; posix_memalign((void **)&aligned_matrix, 16, sizeof(int) * 25); for (ssize_t i = 0; i < 25; i++)...

SSE intrinsics: Convert 32-bit floats to UNSIGNED 8-bit integers

x86,sse,mmx
Using SSE intrinsics, I've gotten a vector of four 32-bit floats clamped to the range 0-255 and rounded to nearest integer. I'd now like to write those four out as bytes. There is an intrinsic _mm_cvtps_pi8 that will convert 32-bit to 8-bit signed int, but the problem there is that...

Modifying a function to use SSE intrinsics

c++,c++11,floating-point,sse,simd
I am trying to calculate the approximate value of the radical: sqrt(i + sqrt(i + sqrt(i + ...))) using SSE in order to get a speedup from vectorization (I also read that the SIMD square-root function runs approximately 4.7x faster than the innate FPU square-root function). However, I am having...

SSE division by integer

assembly,floating-point,x86-64,sse
I am currently working on a function that calculates the Taylor approximation of sin(x), using C and 64-bit assembly combined (C calling an asm function). I am moderately new to assembly and low-level programming, and I still don't get a few things. Let's call the function in C: float taylor(float fi, float n); where...

Load two 64-bit integers into lower & upper xmm, respectively

assembly,sse,cpu-registers
What's the easiest way to move two longs in say RDX, R8 into XMM0 where RDX is moved to the lower 64 bits and R8 to the upper 64 bits? MOVQ will only set the lower and 0 the upper. I am limited to SSSE3....

xmm, cmp two 32-bit float

assembly,floating-point,sse
I'm trying to understand how to compare two floating point numbers (32-bit) using the xmm registers. To test I've written this code in C (which calls the code in assembly): #include "stdio.h" extern int compare(); int main() { printf("Result: %d\n", compare()); return 0; } Here is the assembly, I want...

SIMD minmag and maxmag

assembly,floating-point,x86,sse,avx
I want to implement SIMD minmag and maxmag functions. As far as I understand these functions are minmag(a,b) = |a|<|b| ? a : b maxmag(a,b) = |a|>|b| ? a : b I want these for float and double and my target hardware is Haswell. What I really need is code...
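
As a point of reference for the definitions quoted above, here is a scalar sketch in C that a SIMD version could be checked against. It is not the SIMD code the question asks for; the names magf, minmagf, maxmagf, and minmag_examples are assumptions made for the example, and NaN inputs are not given any special treatment.

```c
#include <assert.h>

/* Magnitude without <math.h>, so the sketch links without -lm. */
static float magf(float x) { return x < 0.0f ? -x : x; }

/* minmag(a,b) = |a| < |b| ? a : b */
static float minmagf(float a, float b) { return magf(a) < magf(b) ? a : b; }

/* maxmag(a,b) = |a| > |b| ? a : b */
static float maxmagf(float a, float b) { return magf(a) > magf(b) ? a : b; }

/* Small usage check: the sign of the selected operand is preserved. */
static void minmag_examples(void)
{
    assert(minmagf(-1.0f, 2.0f) == -1.0f);
    assert(maxmagf(-3.0f, 2.0f) == -3.0f);
}
```

Note that when the magnitudes are equal, both functions return b, exactly as the ternary definitions specify.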

Why does _mm_mulhrs_epi16() always do biased rounding to positive infinity?

rounding,sse
Does anyone know why the pmulhrsw instruction or _mm_mulhrs_epi16(x, y) := RoundDown((x * y + 16384) / 32768) always rounds towards positive infinity? To me, this is terribly biased for negative numbers, because then a sequence like -0.6, 0.6, -0.6, 0.6, ... won't add up to 0 on average. Is this...
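
To make the formula quoted above concrete, here is a scalar model of a single 16-bit lane, written as a sketch. The name mulhrs_model is an assumption for the example, and explicit floor division is used because right-shifting a negative signed value is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one pmulhrsw lane: floor((x*y + 16384) / 32768). */
static int16_t mulhrs_model(int16_t x, int16_t y)
{
    int32_t p = (int32_t)x * (int32_t)y + 16384;
    int32_t q = p / 32768;          /* C division truncates toward zero */
    if (p % 32768 < 0)
        --q;                        /* adjust to floor for negative p */
    /* Only x == y == -32768 yields 32768, which does not fit in int16_t. */
    assert(q >= -32768 && q <= 32768);
    return (int16_t)q;
}
```

With y = 1 the model treats x as a value in units of 1/32768: 16384 (+0.5) gives 1 while -16384 (-0.5) gives 0, so exact ties go upward. Values off the tie behave symmetrically in this model: 19661 (about 0.6) gives 1 and -19661 gives -1.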

Square root of a OpenCV's grey image using SSE

c++,opencv,sse,simd
Given a grey cv::Mat (CV_8UC1) I want to return another cv::Mat containing the square root of the elements (CV_32FC1) and I want to do it with SSE2 intrinsics. I am having some problems with the conversion from 8-bit values to 32-bit float values to perform the square root. I would...

mingw-w64 is incapable of 32 byte stack alignment, easy workaround or switch compilers?

c++,windows,gcc,alignment,sse
I'm trying to work with AVX instructions on 64-bit Windows. I'm comfortable with the g++ compiler, so I've been using that; however, there is a serious bug reported here, and very rough solutions were presented here. Basically, an __m256 variable can't be aligned on the stack to work properly with AVX...

Effective way to extract from SSE vector on AMD processors

sse,simd,amd-processor
I'm looking for an efficient way to extract the lower 64-bit integer from an __m128i on AMD Piledriver. Something like this: static inline int64_t extractlo_64(__m128i x) { int64_t result; // extract into result return result; } Instruction tables say that the common approach, using _mm_extract_epi64(), is inefficient on this processor....

How to detect SSE/AVX/AVX2 availability at compile time?

gcc,clang,sse,avx,avx2
I'm trying to optimize some matrix computations and I was wondering whether it is possible to detect at compile time if SSE and/or AVX and/or AVX2 is enabled by the compiler. Ideally for G++ and Clang, but I can manage with only one of them. I'm not sure it is...
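
GCC and Clang predefine the macros __SSE__, __SSE2__, __AVX__, and __AVX2__ when the corresponding instruction sets are enabled (for example through -msse2, -mavx2, or -march=native), so a compile-time check can be sketched as below. The function names simd_level and simd_level_is are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

/* Returns the highest SIMD level the compiler was told to target. */
static const char *simd_level(void)
{
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__)
    return "SSE2";
#elif defined(__SSE__)
    return "SSE";
#else
    return "none";
#endif
}

/* Convenience check against a level name such as "AVX2". */
static int simd_level_is(const char *name)
{
    assert(name != NULL);
    return strcmp(simd_level(), name) == 0;
}
```

This reports what the compiler is allowed to emit, not what the CPU running the binary supports; that would need a separate runtime check.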

Why do MSVC optimizations break SSE code when function arguments are const refs to temporaries or temporaries copied by value?

c++,c++11,visual-c++,sse,msvc12
Ran into this yesterday, I will try to give clear and simple examples which fail for me with MSVC12 (VS2013, 120) and MSVC14 (VS2015, 140). Everything is implicitly /arch:SSE+ with x64. I will trivialize the issue to a simple matrix transpose example using defined macros _MM_TRANSPOSE4_PS for illustration purposes. This...

32-bit Hamming String formation from 32 8-bit comparisons

c++,c,sse,simd,avx
I am performing a census-transform on an image doing 32 comparisons per pixel. I can efficiently generate a 256-bit vector of 0x0100010100010100... where each 8-bits correspond to 0x00 or 0x01. The vector is identified below as 'comparisons'. I need to collapse this 256-bit vector to generate a 32-bit hamming string....

Is this a data alignment crash? (potentially involving stack misalignment, XNAMath, Visual Studio 2013)

c++,visual-studio-2013,sse,memory-alignment,xna-math-library
My Win32, DirectX game is crashing in release mode within code that is manipulating vectors and matrices. Specifically the crash occurs on this instruction: 014E2752 unpcklps xmm1,xmmword ptr [esp+3Ch] First-chance exception at 0x014E2752 in RodinaRelease.exe: 0xC0000005: Access violation reading location 0xFFFFFFFF I'm not too experienced with digging into assembly and...

OpenCV FAST corner detection SSE implementation walkthrough

c,performance,opencv,optimization,sse
Could someone help me understand the SSE implementation of the FAST corner detection in OpenCV? I understand the algorithm but not the implementation. Could somebody walk me through the code? The code is long, so thank you in advance. I am using OpenCV 2.4.11 and the code goes like this:...

VC++ SSE code generation - is this a compiler bug?

visual-c++,assembly,x86,sse,visual-studio-debugging
A very particular code sequence in VC++ generated the following instruction (for Win32): unpcklpd xmm0,xmmword ptr [ebp-40h] 2 questions arise: (1) As far as I understand the intel manual, unpcklpd accepts as 2nd argument a 128-aligned memory address. If the address is relative to a stack frame alignment cannot be...

Choosing SSE instruction execution domains in mixed contexts

assembly,vector,sse,sse-execution-domain
I am playing with a bit of SSE assembly code in which I do not have enough xmm registers to keep all the temporary results and useful constants in registers at the same time. As a workaround, for some constant vectors that have identical components, I “compress” several vectors into...

Is it possible to address the output SIMD register by using an input register

c++,c,sse,simd
Is it possible to use the scalar values of an input vector to index the output vector? I am trying to implement the following function in SIMD but I cannot find any solution. void shuffle(unsigned char * a, // input a unsigned char * r){ // output r for (i=0;...

Semantics of mov widths in x64 and SSE

assembly,64bit,sse,freepascal
Consider the following from here: mov BYTE PTR [ebx], 2 ; Move 2 into the single byte at the address stored in EBX. mov WORD PTR [ebx], 2 ; Move the 16-bit integer representation of 2 into the 2 bytes starting at the address in EBX. mov DWORD PTR [ebx],...

Are older SIMD-versions available when using newer ones?

c++,c,sse,simd,avx
When I can use SSE3 or AVX, are older SIMD versions such as SSE2 or MMX then available as well, or do I still need to check for them separately?

SSE intrinsics to copy bytes within a register

c++,c,sse,simd,intrinsics
Assume I have four floats loaded into a register (f0 to f3), as illustrated by the following pseudo code: __m128 xmm1 = < f0, f1, f2, f3 > Now I want to copy the first element to the other positions, so that I get a register that looks as follows:...
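
Assuming the goal is a register holding < f0, f0, f0, f0 > (the target layout is cut off in the excerpt above), one common way to express it is _mm_shuffle_ps with selector 0 for every lane. The sketch below wraps that in a hypothetical helper, broadcast_first, with a scalar fallback for builds without SSE.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __SSE__
#include <xmmintrin.h>
#endif

/* Hypothetical helper: writes {in[0], in[0], in[0], in[0]} to out. */
static void broadcast_first(const float in[4], float out[4])
{
    assert(in != NULL && out != NULL);
#ifdef __SSE__
    __m128 v = _mm_loadu_ps(in);
    /* shufps with selector 0 for every lane replicates element 0. */
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0)));
#else
    for (int i = 0; i < 4; ++i)
        out[i] = in[0];
#endif
}
```

Changing the _MM_SHUFFLE arguments to another index (1, 2, or 3) would replicate that element instead.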

comparing a xmmX vector

assembly,sse,cmp,xmm,eflags
So say you loaded an xmm1 vector with 4 single precision floating points {1.5, 1.5, 1.5, 1.5} and xmm2 with the same points, so xmm1 == xmm2. Now you want to compare them so you write in assembly: movaps %xmm1, %xmm2 cmpeqps %xmm0, %xmm2 Since cmpeqps doesn't set the eflags,...

Tell C++ that pointer data is 16 byte aligned

c++,gcc,sse,memory-alignment
I wrote some code with static arrays and it vectorizes just fine. float data[1024] __attribute__((aligned(16))); I would like to make the arrays dynamically allocated. I tried doing something like this: float *data = (float*) aligned_alloc(16, size*sizeof(float)); But the compiler (GCC 4.9.2) can no longer vectorize the code. I assume this...
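
One approach often suggested for GCC and Clang is the __builtin_assume_aligned builtin, which lets the optimizer assume a pointer's alignment. The sketch below is an illustration under that assumption, not a confirmed fix for the case in the question; the helper names scale_aligned and alloc_floats16 are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Scales n floats in place; the caller promises 16-byte alignment. */
static void scale_aligned(float *data, size_t n, float k)
{
    /* GCC/Clang builtin: tells the optimizer that p is 16-byte aligned. */
    float *p = __builtin_assume_aligned(data, 16);
    for (size_t i = 0; i < n; ++i)
        p[i] *= k;
}

/* Allocates n > 0 floats on a 16-byte boundary (C11 aligned_alloc). */
static float *alloc_floats16(size_t n)
{
    assert(n > 0);
    /* aligned_alloc wants a size that is a multiple of the alignment. */
    size_t bytes = (n * sizeof(float) + 15) / 16 * 16;
    float *p = aligned_alloc(16, bytes);
    assert(p != NULL);
    return p;
}
```

Passing a pointer that is not actually 16-byte aligned to scale_aligned is undefined behavior, so the promise has to hold at every call site. Whether the loop is then vectorized still depends on the compiler version and flags.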

SIMD signed with unsigned multiplication for 64-bit * 64-bit to 128-bit

c,x86,integer,bit-manipulation,sse
I have created a function which does 64-bit * 64-bit to 128-bit using SIMD. Currently I have implemented it using SSE2 (actually SSE4.1). This means it does two 64b*64b to 128b products at the same time. The same idea could be extended to AVX2 or AVX512 giving four or eight...

SIMD latency throughput

c++,sse,simd
In the Intel Intrinsics Guide, most instructions also have a value for both latency and throughput. Example: __m128i _mm_min_epi32

Performance
Architecture    Latency    Throughput
Haswell         1          0.5
Ivy Bridge      1          0.5
Sandy Bridge    1          0.5
Westmere        1          1
Nehalem         1          1

What exactly do these numbers mean? I guess...

assembly function with C segfault

c,assembly,x86,sse,fpu
I am trying to write an assembly function that uses SSE and the FPU for parallel calculations. Unfortunately, I am getting a segmentation fault (core dumped) error (while debugging, it doesn't show up in the assembly function). I also cannot step out of the assembly function. GDB shows: Warning: Cannot insert breakpoint 0. Cannot access memory at address...

Segmentation fault with __m128 in C

c,sse
I am getting a segmentation fault when running the following short C program: #include <pmmintrin.h> #include <stdio.h> #include <stdlib.h> #define VALUE 4242 typedef short int Type; void threshold(Type *dst, const Type *src, int len) { short int i, N=16; short int checkval[] = { VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE,VALUE }; const short int*...

Using SSE to mimic the standard Math.pow function

c,assembly,x86,sse,simd
I'm trying to learn how to work with SSE and I decided to realize a simple code that computes n^d, using a function that gets called by a C program. Here's my NASM code: section .data resmsg: db '%d^%d = %d', 0 section .bss section .text extern printf ; ------------------------------------------------------------...

GCC emits vastly different code using “-march=native” on similar architectures

c,gcc,assembly,sse,avx
I'm working on writing an OpenCL benchmark in C. Currently, it measures the fused multiply-accumulate performance of both a CL device, and the system's processor using C code. The results are then cross checked for accuracy. I wrote the native code to take advantage of GCC's auto vectorizer, and it...

What is the difference between non-packed and packed instruction in the context of SIMD-operations?

sse,simd
What is the difference between non-packed and packed instruction in the context of SIMD-operations? I was reading an article on optimizing your code for SSE: http://www.cortstratton.org/articles/OptimizingForSSE.php#batch and this question arose when I read "As an added bonus, movss is a non-packed instruction, which allows us to make better use of...

Counting the number of leading zeros in a 128-bit integer

c++,gcc,bit-manipulation,sse
How can I count the number of leading zeros in a 128-bit integer (uint128_t) efficiently? I know GCC's built-in functions: __builtin_clz, __builtin_clzl, __builtin_clzll __builtin_ffs, __builtin_ffsl, __builtin_ffsll However, these functions only work with 32- and 64-bit integers. I also found some SSE instructions: __lzcnt16, __lzcnt, __lzcnt64 As you may guess, these...

0xFFFF flags in SSE

c,vectorization,sse
I would like to create an SSE register with values that I can store in an array of integers, from another SSE register which contains flags 0xFFFF and zeros. For example: __m128i regComp = _mm_cmpgt_epi16(regA, regB); For the sake of argument, let's assume that regComp was loaded with { 0,...

sorting component-wise multi value (SIMD) array

algorithm,sorting,time-complexity,sse,simd
I'm trying to find an O(n∙log(n)) sorting method to sort several arrays simultaneously so that an element in a multi-value array will represent elements from 4 different single value arrays and the sorting method would sort the multi-value elements. For example: For a given 4 single value arrays An, Bn,...

Converting from __m128 to __m128i results in wrong value

c++,type-conversion,clang,sse,intrinsics
I need to convert a float vector (__m128) to an integer vector (__m128i), and I am using _mm_cvtps_epi32, but I am not getting the expected value. Here is a very simple example: __m128 test = _mm_set1_ps(4.5f); __m128i test_i = _mm_cvtps_epi32(test); The debugger output I get: (lldb) po test ([0] =...

AVX2 Winner-Take-All Disparity Search

c++,sse,avx,disparity-mapping,avx2
I am optimizing the "winner-take-all" portion of a disparity estimation algorithm using AVX2. My scalar routine is accurate, but at QVGA resolution and 48 disparities the runtime is disappointingly slow at ~14 ms on my laptop. I create both LR and RL disparity images, but for simplicity here I will...

Storing a constant in SSE register (GCC, C++)

c++,c,assembly,sse,inline-assembly
Hello StackOverflow community, I have encountered the following challenge: in my C++ application I have a quite complex (cubic) loop in which, at all depths, I perform the following: (1) compute 4 float values; (2) multiply all 4 values by a constant; (3) convert the floats to integers. This code is to be run...

C/C++: -msse and -msse2 Flags do not have any effect on the binaries?

c++,gcc,sse,sse2
I'm just playing around with gcc (g++) and the compiler flags -msse and -msse2. I have a little test program which looks like this: #include <iostream> int main(int argc, char **argv) { float a = 12558.5688; float b = 6.5585; float result = 0.0; result = a * b; std::cout <<...

GCC -msse2 does not generate SIMD code

c++,gcc,x86,sse,simd
I am trying to figure out why g++ does not generate SIMD code. Info GCC / OS / CPU: $ gcc -v gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) $ cat /proc/cpuinfo ... model name : Intel(R) Core(TM)2 Duo CPU P8600 @ 2.40GHz ... and here is my C++ code: #include...

How to check inf for AVX intrinsic __m256

c++,c,sse,intrinsics,avx
What is the best way to check whether an AVX __m256 vector (8 floats) contains any inf? I tried __m256 X=_mm256_set1_ps(1.0f/0.0f); _mm256_cmp_ps(X,X,_CMP_EQ_OQ); but this compares to true. Note that this method will find NaN (which compares to false). So one way is to check for X!=nan && 0*X==nan:...
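
A different way to phrase the test, shown here only as a sketch, is to clear the sign bit and compare the magnitude against +inf with an ordered compare. The helper name any_inf8 and the array-based interface are assumptions for the example, and a scalar fallback is compiled when AVX is not enabled.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#ifdef __AVX__
#include <immintrin.h>
#endif

/* Hypothetical helper: returns 1 if any of the 8 floats is +inf or -inf. */
static int any_inf8(const float x[8])
{
    assert(x != NULL);
#ifdef __AVX__
    __m256 v    = _mm256_loadu_ps(x);
    __m256 sign = _mm256_set1_ps(-0.0f);        /* only the sign bit set */
    __m256 absv = _mm256_andnot_ps(sign, v);    /* clear sign bit: |x| */
    __m256 inf  = _mm256_set1_ps(INFINITY);
    /* Ordered compare: NaN lanes compare false, so they are not reported. */
    __m256 eq   = _mm256_cmp_ps(absv, inf, _CMP_EQ_OQ);
    return _mm256_movemask_ps(eq) != 0;
#else
    for (int i = 0; i < 8; ++i)
        if (x[i] == INFINITY || x[i] == -INFINITY)
            return 1;
    return 0;
#endif
}
```

Build with -mavx to take the intrinsic path. In both paths NaN lanes are ignored, so only true infinities trigger a positive result.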

Packed masking in SSE

c,assembly,x86,nasm,sse
I need to build some kind of masking system for a packed single because I need to use packed operations on vectors that contain less than 4 elements. So, for example, I need to do something like this: section .data align 16 a: dd 1.5, 2.3, 5.0 align 16 x:...