madshi.net

Posted: **Sun Mar 27, 2016 12:46 pm**

Yes, I know that loading and processing the 4 values (BGRA) as one is faster.
But, before thinking of converting to asm, I tried this:

Code:

var
          pc: PCardinal;
...............
            pc := @dbLine^[0];
            pc^ := ((sbLine1[xp1] * w11 + sbLine1[xp2] * w21 + sbLine2[xp1] * w12 + sbLine2[xp2] * w22) shr 16) +
               ((sbLine1[xp1 + 1] * w11 + sbLine1[xp2 + 1] * w21 + sbLine2[xp1 + 1] * w12 + sbLine2[xp2 + 1] * w22) shr 16) shl 8 +
               ((sbLine1[xp1 + 2] * w11 + sbLine1[xp2 + 2] * w21 + sbLine2[xp1 + 2] * w12 + sbLine2[xp2 + 2] * w22) shr 16) shl 16 +
               ((sbLine1[xp1 + 3] * w11 + sbLine1[xp2 + 3] * w21 + sbLine2[xp1 + 3] * w12 + sbLine2[xp2 + 3] * w22) shr 16) shl 24;

It works, but it's also 20% slower.
That's why I haven't tried to read all 4 values in one step in the asm.

As for your advice not to use x86/x64 instructions to fill the SSE2 registers: to my shame, I don't know how to do it with SSE2 instructions. This is my first SSE2 code, and I searched on Google but couldn't find any information.
I tried every SSE2 mov* instruction in the form "mov* xmm0, [xmm2 + xmm4]", but the compiler rejected all of them.

You say to work on 16-bit values, but the problem is that those lines of code use four 32-bit variables: w11, w12, w21, w22. So somehow I have to work in 32 bits.
That's why I used xmm0, xmm2, xmm4 and xmm6 (each one is 16x2).
Also, multiplying two 32-bit values gives a 64-bit result, so I had to "sacrifice" xmm2 and xmm6 by putting the 64-bit results in xmm0 and xmm4.

Posted: **Sun Mar 27, 2016 4:20 pm**

You can use the x86/x64 general-purpose registers as pointers. Just don't load values into x86/x64 registers and then move them to SSE2 registers. That way you can only fill 4 bytes of an SSE2 register at a time.
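A minimal sketch of that pointer approach, written in C with SSE2 intrinsics rather than Delphi inline asm (the comments name the instruction each intrinsic corresponds to). The function name `load_bgra_as_words` is made up for illustration. The address sits in a general-purpose register, and the pixel is read from memory directly into an XMM register, then widened so each channel occupies its own 16-bit lane.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load one 4-byte BGRA pixel and widen each channel
   to a 16-bit lane. Only the address is held in a general-purpose
   register. */
static __m128i load_bgra_as_words(const uint8_t *px)
{
    int32_t raw;
    memcpy(&raw, px, sizeof raw);        /* alignment-safe 32-bit read; compilers
                                            typically fold this into the movd below */
    __m128i v = _mm_cvtsi32_si128(raw);  /* movd: pixel in the low 4 bytes, rest zero */

    /* punpcklbw with zero: bytes B,G,R,A become words B,G,R,A (lanes 0..3) */
    return _mm_unpacklo_epi8(v, _mm_setzero_si128());
}
```

Once the channels sit in 16-bit lanes, one multiply instruction handles all four of them, which is where the gain over byte-by-byte indexing would come from.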

You can't translate the code 1:1 to SSE2. You have to understand what the code does and then rewrite it for SSE2, so that you can work on 128 bits at once.
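To illustrate what such a rewrite might look like, here is a sketch in C with SSE2 intrinsics (not Delphi asm, and not the code from this thread) that blends all four BGRA channels of one output pixel at once. It rests on an assumption that differs from the Pascal code above: the four weights are pre-scaled to sum to 1 << 14 instead of 1 << 16, so that each one fits into a signed 16-bit lane and `pmaddwd` can be used. The names `blend_bgra_sse2`, `load_pixel` and `WEIGHT_SHIFT` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <string.h>

#define WEIGHT_SHIFT 14  /* assumption: w11 + w21 + w12 + w22 == 1 << 14 */

/* movd: read one 4-byte pixel into the low 32 bits of an XMM register */
static __m128i load_pixel(const uint8_t *px)
{
    int32_t raw;
    memcpy(&raw, px, sizeof raw);
    return _mm_cvtsi32_si128(raw);
}

/* Pack two 16-bit weights into every 32-bit lane (wa in the low word). */
static __m128i weight_pair(int16_t wa, int16_t wb)
{
    uint32_t pair = ((uint32_t)(uint16_t)wb << 16) | (uint16_t)wa;
    return _mm_set1_epi32((int)pair);
}

/* Weighted sum of four BGRA pixels, all channels in parallel.
   p11/p21 are the two source pixels of the upper row, p12/p22 of the lower. */
static uint32_t blend_bgra_sse2(const uint8_t *p11, const uint8_t *p21,
                                const uint8_t *p12, const uint8_t *p22,
                                int16_t w11, int16_t w21,
                                int16_t w12, int16_t w22)
{
    const __m128i zero = _mm_setzero_si128();

    /* punpcklbw twice: words become B11,B21,G11,G21,R11,R21,A11,A21 */
    __m128i top = _mm_unpacklo_epi8(
        _mm_unpacklo_epi8(load_pixel(p11), load_pixel(p21)), zero);
    __m128i bot = _mm_unpacklo_epi8(
        _mm_unpacklo_epi8(load_pixel(p12), load_pixel(p22)), zero);

    /* pmaddwd: each 32-bit lane = c1 * wa + c2 * wb for one channel */
    __m128i sum = _mm_add_epi32(
        _mm_madd_epi16(top, weight_pair(w11, w21)),
        _mm_madd_epi16(bot, weight_pair(w12, w22)));

    sum = _mm_srli_epi32(sum, WEIGHT_SHIFT);        /* psrld: drop the fraction */

    /* packssdw + packuswb: four 32-bit channels back to four bytes */
    __m128i bytes = _mm_packus_epi16(_mm_packs_epi32(sum, zero), zero);
    return (uint32_t)_mm_cvtsi128_si32(bytes);      /* B | G<<8 | R<<16 | A<<24 */
}
```

The returned value has the same byte layout as the `pc^` assignment in the Pascal version, so it could be stored with a single 32-bit write. Dropping two bits of weight precision is the price of staying in 16-bit lanes; whether that is acceptable depends on the quality requirements.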

Posted: **Mon Mar 28, 2016 9:12 am**

You're right.

But the problem is that your code is so complex (for assembler, I mean) and has so many variables. Even if I use pointers, it's not easy to find a way to parallelize the operations on the BGRA values and to use ONLY SSE2 registers.
This needs a genius in assembler. Unfortunately, I'm only a beginner.

Thank you very much for your effort and time to help me.
And sorry I couldn't do it.


fast resize transparent images
