Re: fast resize transparent images
Posted: Sun Mar 27, 2016 12:46 pm
Yes, I know that loading and processing the 4 values (BGRA) as one is faster.
But before thinking of converting to asm, I tried this:
Code:
var
  pc: PCardinal;
...............
pc := @dbLine^[0];
pc^ := ((sbLine1[xp1] * w11 + sbLine1[xp2] * w21 + sbLine2[xp1] * w12 + sbLine2[xp2] * w22) shr 16) +
       ((sbLine1[xp1 + 1] * w11 + sbLine1[xp2 + 1] * w21 + sbLine2[xp1 + 1] * w12 + sbLine2[xp2 + 1] * w22) shr 16) shl 8 +
       ((sbLine1[xp1 + 2] * w11 + sbLine1[xp2 + 2] * w21 + sbLine2[xp1 + 2] * w12 + sbLine2[xp2 + 2] * w22) shr 16) shl 16 +
       ((sbLine1[xp1 + 3] * w11 + sbLine1[xp2 + 3] * w21 + sbLine2[xp1 + 3] * w12 + sbLine2[xp2 + 3] * w22) shr 16) shl 24;
It works, but it's also 20% slower.
That's why I haven't tried to read all 4 values in one step in the asm.
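For anyone who wants to run the packed write in isolation, here is a minimal self-contained sketch of what it computes. The helper name BlendPixel, the sample arrays L1 and L2, and the weight values are made up for illustration. It assumes the four weights are 16.16 fixed-point fractions that sum to 65536, so every channel result stays in 0..255. It loops over the channels instead of spelling them out, so it says nothing about speed.

```pascal
program BlendDemo;
{$mode objfpc}
uses
  SysUtils;

// Hypothetical helper: blends one BGRA pixel from two source lines and
// returns the four result bytes packed into a single 32-bit value.
function BlendPixel(const Line1, Line2: array of Byte; xp1, xp2: Integer;
  w11, w21, w12, w22: Cardinal): Cardinal;
var
  c: Integer;
  v: Cardinal;
begin
  Result := 0;
  for c := 0 to 3 do
  begin
    // Same weighted sum as in the snippet, one channel at a time.
    v := (Line1[xp1 + c] * w11 + Line1[xp2 + c] * w21 +
          Line2[xp1 + c] * w12 + Line2[xp2 + c] * w22) shr 16;
    // Channel c lands in byte c of the result (assumes v <= 255).
    Result := Result or (v shl (8 * c));
  end;
end;

const
  // Made-up sample data: two BGRA pixels per line.
  L1: array[0..7] of Byte = (10, 20, 30, 40, 50, 60, 70, 80);
  L2: array[0..7] of Byte = (90, 100, 110, 120, 130, 140, 150, 160);
begin
  // Four equal weights of 0.25 in 16.16 fixed point (4 * 16384 = 65536).
  WriteLn(IntToHex(BlendPixel(L1, L2, 0, 4, 16384, 16384, 16384, 16384), 8));
  // prints 645A5046
end.
```

With these inputs the four channels come out as 70, 80, 90 and 100, which is $46, $50, $5A and $64 from the lowest byte up.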
As for your advice not to use x86/x64 instructions to fill the SSE2 registers: to my shame, I don't know how to do it with SSE2 instructions. This is my first SSE2 code, and I searched on Google but couldn't find any information.
I tried every SSE2 mov* instruction in the form "mov* xmm0, [xmm2 + xmm4]", but the compiler rejected all of them.
You say to work on 16-bit values, but the problem is that those lines of code use four 32-bit variables: w11, w12, w21, w22. So somehow I have to work in 32 bits.
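The advice itself isn't quoted here, so the following is only one possible reading of "work on 16-bit values": if the weights are reduced from 16-bit to 8-bit precision (0..256), each byte-times-weight product and the sum of all four fit in an unsigned 16-bit value, which is what 16-bit SSE2 lanes could hold. The sketch below is plain scalar Pascal, not SSE2. Blend32 and Blend16 are hypothetical names, and it assumes the original weights sum to 65536.

```pascal
program Weights8Demo;
{$mode objfpc}

// Hypothetical: one channel blended with 16.16 weights in 32-bit arithmetic.
function Blend32(p11, p21, p12, p22: Byte; w11, w21, w12, w22: Cardinal): Byte;
begin
  Result := Byte((p11 * w11 + p21 * w21 + p12 * w12 + p22 * w22) shr 16);
end;

// Hypothetical: same blend with weights cut down to 0..256.
// Largest possible sum is 255 * 256 = 65280, which still fits in a Word.
function Blend16(p11, p21, p12, p22: Byte; w11, w21, w12, w22: Cardinal): Byte;
var
  sum: Word;
begin
  sum := Word(p11 * (w11 shr 8) + p21 * (w21 shr 8) +
              p12 * (w12 shr 8) + p22 * (w22 shr 8));
  Result := Byte(sum shr 8);
end;

begin
  // Equal weights: both versions agree.
  WriteLn(Blend32(10, 50, 90, 130, 16384, 16384, 16384, 16384), ' ',
          Blend16(10, 50, 90, 130, 16384, 16384, 16384, 16384));
  // prints 70 70

  // Uneven weights: truncating the weights costs a little precision.
  WriteLn(Blend32(200, 100, 50, 25, 40000, 15000, 8000, 2536), ' ',
          Blend16(200, 100, 50, 25, 40000, 15000, 8000, 2536));
  // prints 152 151
end.
```

The second line shows the trade-off: the truncated weights no longer add up to exactly 256, so the result can come out one level darker.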
That's why I used xmm0, xmm2, xmm4 and xmm6 (each one is 16x2).
Plus, the result of multiplying two 32-bit values is a 64-bit value, so I had to "sacrifice" xmm2 and xmm6 by putting the 64-bit results in xmm0 and xmm4.
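A quick range check on those 64-bit products, as a sketch under one assumption: one factor is a pixel byte (at most 255) and the other is a weight of at most 65536 (1.0 in 16.16 fixed point). How the weights are actually computed isn't shown above, so that bound is a guess. If it holds, the upper 32 bits of every product are zero, even though a 32x32 multiply such as PMULUDQ always delivers a 64-bit result.

```pascal
program ProductRange;
{$mode objfpc}
const
  MaxPixel  = 255;    // largest byte value
  MaxWeight = 65536;  // assumed upper bound: 1.0 in 16.16 fixed point
var
  p: QWord;
begin
  // Largest product a single pixel * weight multiply can produce.
  p := QWord(MaxPixel) * QWord(MaxWeight);
  WriteLn('largest product: ', p);                    // 16711680
  WriteLn('high 32 bits:    ', p shr 32);             // 0
  WriteLn('fits in 32 bits: ', p <= High(Cardinal));  // TRUE
end.
```

Because the weights are assumed to sum to 65536, the sum of the four products is bounded by the same 255 * 65536, so it fits in 32 bits as well.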