fast resize transparent images

Yes, I know that loading and processing the 4 values (BGRA) as one is faster.
But, before thinking of converting to asm, I tried this:

    pc: PCardinal;

    pc := @dbLine^[0];
    // One 32-bit write per pixel: blend each channel separately, then pack the four results.
    pc^ := ((sbLine1[xp1] * w11 + sbLine1[xp2] * w21 + sbLine2[xp1] * w12 + sbLine2[xp2] * w22) shr 16) +                            // B
           ((sbLine1[xp1 + 1] * w11 + sbLine1[xp2 + 1] * w21 + sbLine2[xp1 + 1] * w12 + sbLine2[xp2 + 1] * w22) shr 16) shl 8 +      // G
           ((sbLine1[xp1 + 2] * w11 + sbLine1[xp2 + 2] * w21 + sbLine2[xp1 + 2] * w12 + sbLine2[xp2 + 2] * w22) shr 16) shl 16 +     // R
           ((sbLine1[xp1 + 3] * w11 + sbLine1[xp2 + 3] * w21 + sbLine2[xp1 + 3] * w12 + sbLine2[xp2 + 3] * w22) shr 16) shl 24;      // A

It works, but it's also 20% slower.
That's why I haven't tried to read all 4 values in one step in the asm.

And regarding your advice not to use x86/64 instructions to fill the SSE2 registers: to my shame, I don't know how to do it with SSE2 instructions. This is my first SSE2 code, and I searched on Google but couldn't find any information.
I tried every SSE2 mov* instruction in the form "mov* xmm0, [xmm2 + xmm4]", but the compiler rejected all of them.
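
For reference, "mov* xmm0, [xmm2 + xmm4]" cannot assemble with any SSE2 move: in SSE2 a memory operand is addressed through general-purpose registers only, and an XMM register cannot act as base or index. The usual pattern is to keep the pointer in a general-purpose register and let the SSE2 instruction read from memory directly. Below is a minimal sketch of that pattern, not code from this thread. It assumes a Delphi Win64 target (first parameter in RCX, second in RDX), and the name UnpackTwoPixels is made up for illustration.

```pascal
program UnpackDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, assuming Delphi Win64: Src arrives in RCX, Dst in RDX.
// Loads two adjacent BGRA pixels (8 bytes) and widens every byte to a 16-bit word.
procedure UnpackTwoPixels(Src: PCardinal; Dst: PWord);
asm
  movq      xmm0, qword ptr [rcx]   // 8 bytes straight from memory into the low half of xmm0
  pxor      xmm1, xmm1              // xmm1 := 0
  punpcklbw xmm0, xmm1              // interleave with zero bytes: 8 bytes become 8 words
  movdqu    [rdx], xmm0             // unaligned store of the 8 words (16 bytes)
end;

var
  Pixels: array[0..1] of Cardinal;
  Unpacked: array[0..7] of Word;
  i: Integer;
begin
  Pixels[0] := $FF102030;           // A=$FF, R=$10, G=$20, B=$30
  Pixels[1] := $80405060;           // A=$80, R=$40, G=$50, B=$60
  UnpackTwoPixels(@Pixels[0], @Unpacked[0]);
  for i := 0 to 7 do
    Write(Unpacked[i], ' ');        // prints 48 32 16 255 96 80 64 128
  Writeln;
end.
```

The general-purpose register only holds the address here; the pixel data itself never passes through it.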

You say to work on 16-bit values, but the problem is that those lines of code contain four 32-bit variables: w11, w12, w21 and w22. So somehow I have to work in 32 bits.
That's why I used xmm0, xmm2, xmm4 and xmm6 (each one is 16x2).
Plus, the result of multiplying two 32-bit values is a 64-bit value, so I had to "sacrifice" xmm2 and xmm6 by putting the 64-bit results in xmm0 and xmm4.
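
One possible way to get from 32-bit weights to 16-bit lanes, sketched here as an assumption rather than as what was actually proposed: if the four weights sum to 65536 (which the "shr 16" suggests) and 8 bits of weight precision are acceptable, they can be scaled down so that they sum to 256. A channel value is at most 255, so the whole weighted sum is at most 255 * 256 = 65280 and fits in one 16-bit word. The names ReduceWeights and BlendChannel16 are hypothetical.

```pascal
program Weights16Demo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Assumption: w11 + w21 + w12 + w22 = 65536.
// Scales the weights down to 8 bits; the last one absorbs the rounding loss
// so that the reduced weights still sum to exactly 256.
procedure ReduceWeights(w11, w21, w12, w22: Cardinal;
  out r11, r21, r12, r22: Word);
begin
  r11 := w11 shr 8;
  r21 := w21 shr 8;
  r12 := w12 shr 8;
  r22 := 256 - r11 - r21 - r12;
end;

// Blends one channel with the reduced weights.
// The accumulator never exceeds 255 * 256 = 65280, so a Word is enough.
function BlendChannel16(p11, p21, p12, p22: Byte;
  r11, r21, r12, r22: Word): Byte;
var
  acc: Word;
begin
  acc := p11 * r11 + p21 * r21 + p12 * r12 + p22 * r22;
  Result := acc shr 8;
end;

var
  r11, r21, r12, r22: Word;
begin
  ReduceWeights(16384, 16384, 16384, 16384, r11, r21, r12, r22);
  Writeln(BlendChannel16(10, 20, 30, 40, r11, r21, r12, r22));  // prints 25
end.
```

The price is precision: the result can differ slightly from the 32-bit version, so whether this is acceptable depends on the image quality required.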
You can use x64/86 registers as pointers. Just don't load values into x64/86 registers and then move them to SSE2 registers. That way you can only fill 4 bytes of the SSE2 registers.

You can't translate the code 1:1 to SSE2. You have to understand what the code does and rewrite the code for SSE2, so that you can work on 128 bit at once.
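
To make "rewrite the code for SSE2" a bit more concrete, here is a rough sketch of one output pixel computed with all four channels in parallel. It is not the code discussed in this thread and rests on several assumptions: a Delphi Win64 target (parameters in RCX, RDX, R8, result in EAX), the two source pixels of each line being adjacent in memory (xp2 = xp1 + 4), and weights already reduced to the range 0..256 with a sum of 256. The routine name and the weight table layout are made up for illustration.

```pascal
program BlendDemo;
{$APPTYPE CONSOLE}

// Hypothetical routine, assuming Delphi Win64: Top = RCX, Bottom = RDX, Weights = R8.
// Top and Bottom each point at two horizontally adjacent BGRA pixels.
// Weights points at 16 words: w11 x4, w21 x4, w12 x4, w22 x4 (each 0..256, sum 256).
function BlendPixelSSE2(Top, Bottom: PCardinal; Weights: PWord): Cardinal;
asm
  pxor      xmm5, xmm5              // xmm5 := 0, used for widening and packing
  movq      xmm0, qword ptr [rcx]   // two pixels of the upper line
  movq      xmm1, qword ptr [rdx]   // two pixels of the lower line
  punpcklbw xmm0, xmm5              // bytes -> words: B1 G1 R1 A1 B2 G2 R2 A2
  punpcklbw xmm1, xmm5
  movdqu    xmm2, [r8]              // w11 x4, w21 x4
  movdqu    xmm3, [r8 + 16]         // w12 x4, w22 x4
  pmullw    xmm0, xmm2              // 8 multiplications at once
  pmullw    xmm1, xmm3              // 8 more
  paddw     xmm0, xmm1              // upper line + lower line
  movdqa    xmm1, xmm0
  psrldq    xmm1, 8                 // move the right-hand pixel's words down
  paddw     xmm0, xmm1              // left + right: low 4 words hold the full sums
  psrlw     xmm0, 8                 // divide by 256
  packuswb  xmm0, xmm5              // words -> bytes
  movd      eax, xmm0               // low 4 bytes = blended BGRA pixel
end;

var
  Top, Bottom: array[0..1] of Cardinal;
  W: array[0..15] of Word;
  i: Integer;
begin
  Top[0] := $0A0A0A0A;    Top[1] := $14141414;     // every channel 10 and 20
  Bottom[0] := $1E1E1E1E; Bottom[1] := $28282828;  // every channel 30 and 40
  for i := 0 to 15 do
    W[i] := 64;                                    // four equal weights, 4 * 64 = 256
  Writeln(BlendPixelSSE2(@Top[0], @Bottom[0], @W[0]));  // prints 421075225 (= $19191919)
end.
```

Because the weights sum to 256, none of the 16-bit additions can overflow. A real resize loop would keep the zero register and the weight registers loaded across pixels instead of reloading them on every call.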
You're right.

But the problem is that your code is so complex (for assembler, I mean) and has so many variables. Even if I use pointers, it's not easy to find a way to parallelize the operations on the BGRA values while using ONLY SSE2 registers.
This needs an assembler genius. Unfortunately, I'm only a beginner.

Thank you very much for your effort and time to help me.
And sorry I couldn't do it.
