madshi.net

Posted: **Thu Mar 24, 2016 2:00 pm**

Hi.

Very nice collection.

I use it in Delphi XE8 update 1. OS: Windows 8.1.
For example, I need it to quickly resize many transparent bitmaps (pf32bit) and display them as an animation.
I tried the StretchBitmap function(s) from madGraphics.pas. But although I can pass in 32-bit bitmaps and it outputs 32-bit, it doesn't process the alpha channel (transparency).
For example, the only difference between Bilinear32 and Bilinear24 is that a few 3s turn into 4s. It outputs 32-bit pixels but doesn't seem to process the alpha field.

Could you please make it process the alpha too?

Thank you.

Posted: **Thu Mar 24, 2016 2:10 pm**

Hello,

I haven't worked on madGraphics for years. Of course it would be possible to add alpha processing. But to be honest, my to-do list is already more than full with work I'm earning money with. So right now I simply have no time left to work on the free parts of madCollection. That said, if you feel like changing madGraphics yourself, I'd be happy to include your changes in my source code base.

Posted: **Thu Mar 24, 2016 2:37 pm**

I understand.

But could you at least add some comments to the Bilinear32 function explaining what the code lines do? I know how to use a bitmap's ScanLine property, but I've never seen code like yours.
It would help me a lot.

Thank you in advance.

Posted: **Thu Mar 24, 2016 3:50 pm**

I guess I should have added more comments, but back when I wrote that code I wasn't used to writing a lot of comments. Anyway, I think this code computes the RGB values:

dbLine^[0] := (sbLine1[xp1    ] * w11 + sbLine1[xp2    ] * w21 + sbLine2[xp1    ] * w12 + sbLine2[xp2    ] * w22) shr 16;
dbLine^[1] := (sbLine1[xp1 + 1] * w11 + sbLine1[xp2 + 1] * w21 + sbLine2[xp1 + 1] * w12 + sbLine2[xp2 + 1] * w22) shr 16;
dbLine^[2] := (sbLine1[xp1 + 2] * w11 + sbLine1[xp2 + 2] * w21 + sbLine2[xp1 + 2] * w12 + sbLine2[xp2 + 2] * w22) shr 16;

So if you just add one more line like this:

dbLine^[3] := (sbLine1[xp1 + 3] * w11 + sbLine1[xp2 + 3] * w21 + sbLine2[xp1 + 3] * w12 + sbLine2[xp2 + 3] * w22) shr 16;

That might already take care of the alpha channel.
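As a rough illustration of what those four lines compute, here is a minimal C sketch of the same fixed-point blend applied to all four channels in a loop. It assumes the four weights are 16.16 fixed-point fractions that sum to 65536, which is what the `shr 16` suggests. The function name `blend_pixel_rgba` and its signature are hypothetical and not part of madGraphics.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of madGraphics: blends the four neighbouring
   source pixels into one destination pixel, one channel at a time.
   Assumption: w11 + w21 + w12 + w22 == 65536 (16.16 fixed point). */
void blend_pixel_rgba(const uint8_t p11[4], const uint8_t p21[4],
                      const uint8_t p12[4], const uint8_t p22[4],
                      uint32_t w11, uint32_t w21, uint32_t w12, uint32_t w22,
                      uint8_t dst[4])
{
    assert(w11 + w21 + w12 + w22 == 65536u);
    for (int c = 0; c < 4; ++c) {  /* c == 3 is the alpha byte of a pf32bit pixel */
        uint32_t sum = p11[c] * w11 + p21[c] * w21
                     + p12[c] * w12 + p22[c] * w22;
        dst[c] = (uint8_t)(sum >> 16);  /* same as "shr 16" in the Delphi code */
    }
}
```

Because the weights sum to 65536 and each channel is at most 255, the sum stays well inside 32 bits and the shifted result always fits in a byte.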

Posted: **Thu Mar 24, 2016 6:24 pm**

Yes, it works.

Thank you very much.

Posted: **Sat Mar 26, 2016 10:18 am**

Just asking a question:

I'm thinking of rewriting the Bilinear32 function in asm, maybe even with MMX/SSE.
Do you think it would perform a lot faster, so that it would be worth the work?

Posted: **Sat Mar 26, 2016 10:28 am**

I'd go with SSE2. Almost all modern CPUs support it, it's nicer to work with, and it should produce a very noticeable performance improvement.

Posted: **Sat Mar 26, 2016 10:51 am**

I'm not sure about SSE2; my application has to work on older CPUs too.
Btw, AMD's implementation of SSE2 doesn't perform as I expected. Until a year ago I had an AMD CPU, and the performance gain from SSE2-optimized code wasn't as large as on Intel CPUs.
I'm hoping MMX has a better implementation.

Posted: **Sat Mar 26, 2016 1:20 pm**

Just so you can have an idea about what I'm trying to make, here is a testing app:
https://drive.google.com/open?id=0ByKxA ... VlFQ1BzVlU
And a testing file:
https://drive.google.com/open?id=0ByKxA ... 2tGZEYybGs
Use the load button from the middle of the form to load the file and then click on Preview.
After it starts, use + and - to resize, or 0 to reset to the original size. When the size differs from the default, your code is used.
In the caption of the main window the "delay display" parameter will jump from ~5..10 ms to 30..50 ms (on my 2 GHz processor).
Well, I want to decrease that delay.

Btw, if you're interested I'll show you the code too.

Posted: **Sat Mar 26, 2016 1:28 pm**

I'm really short on time atm. But if you want to do "real time" animation scaling, you might want to consider using Direct3D. GPUs are much faster at that sort of stuff than even MMX/SSE/SSE2 etc.

Posted: **Sat Mar 26, 2016 1:38 pm**

madshi wrote: I'm really short on time atm. But if you want to do "real time" animation scaling, you might want to consider using Direct3D. GPUs are much faster at that sort of stuff than even MMX/SSE/SSE2 etc.

Yes, I know.
I already found something called DelphiX http://www.micrel.cz/Dx/
But the problem is I have to transfer all the frames into the video memory as textures so I can display them. And the animation I showed you is 3 GB uncompressed (!). Do you know a video card with 3+ GB video memory?

Posted: **Sat Mar 26, 2016 1:47 pm**

Many GPUs these days have 2GB, some 4GB, some even more.

Anyway, you don't have to upload all the frames at once. Just create a queue of 3 frames, and delete frames from GPU RAM which were already displayed. That's how video players work.
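The queueing idea can be sketched independently of any graphics API. The following C fragment is only an illustration under stated assumptions: `UploadFn` and `ReleaseFn` are hypothetical callbacks standing in for whatever the chosen API uses to create and free a texture, `demo_upload` and `demo_release` are stand-ins that do no real GPU work, and the queue depth of 3 is taken from the suggestion above.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 3  /* frames kept in GPU memory at once */

typedef int  (*UploadFn)(int frame_index);  /* hypothetical: uploads a frame, returns a handle */
typedef void (*ReleaseFn)(int handle);      /* hypothetical: frees an uploaded frame */

typedef struct {
    int    handles[QUEUE_DEPTH];  /* handles of the frames currently uploaded */
    size_t head;                  /* slot holding the next frame to display */
    size_t count;                 /* number of occupied slots */
} FrameQueue;

/* Uploads frames until the queue is full or the animation ends.
   Returns the index of the first frame that has not been uploaded yet. */
int queue_fill(FrameQueue *q, int next_frame, int frame_count, UploadFn upload)
{
    while (q->count < QUEUE_DEPTH && next_frame < frame_count) {
        size_t tail = (q->head + q->count) % QUEUE_DEPTH;
        q->handles[tail] = upload(next_frame++);
        q->count++;
    }
    return next_frame;
}

/* Handle of the frame that should be displayed next. */
int queue_front(const FrameQueue *q)
{
    assert(q->count > 0);
    return q->handles[q->head];
}

/* Frees the frame that was just displayed and makes its slot available again. */
void queue_advance(FrameQueue *q, ReleaseFn release)
{
    assert(q->count > 0);
    release(q->handles[q->head]);
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
}

/* Stand-in callbacks for trying the queue without a GPU: the "handle" is just
   the frame index, and g_resident counts frames that are currently uploaded. */
int g_resident = 0;
int  demo_upload(int frame_index) { g_resident++; return frame_index; }
void demo_release(int handle)     { (void)handle; g_resident--; }
```

A display loop would call `queue_fill`, show the frame returned by `queue_front`, and then call `queue_advance`, so that no more than `QUEUE_DEPTH` frames are resident at any time regardless of the total size of the animation.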

Posted: **Sat Mar 26, 2016 1:54 pm**

madshi wrote: Many GPUs these days have 2GB, some 4GB, some even more.

Not so many but, like I said, my app should work on older hardware too.

madshi wrote: Anyway, you don't have to upload all the frames at once. Just create a queue of 3 frames, and delete frames from GPU RAM which were already displayed. That's how video players work.

Good idea.
That's what I'm doing in RAM now.
Unfortunately, with DelphiX this is too slow. It takes a few hundred ms to transfer just a single frame (768x768).

I also thought about using DSPack (to make a sort of "video player").

Posted: **Sun Mar 27, 2016 10:02 am**

I started working on the asm conversion. And I understood why you recommended SSE2: MMX and SSE don't have 32-bit integer multiplication.

For now I just tried to convert a single line of code:

dbLine^[0] := (sbLine1[xp1] * w11 + sbLine1[xp2] * w21 + sbLine2[xp1] * w12 + sbLine2[xp2] * w22) shr 16;

The SSE2 asm version:

            asm
               mov       eax, [sbLine1]
               mov       edx, [xp1]
               movzx     ecx, byte ptr [eax+edx]
               movd      xmm0, ecx               // sbLine1[xp1]

               mov       edx, [xp2]
               movzx     ecx, byte ptr [eax+edx]
               movd      xmm4, ecx               // sbLine1[xp2]

               movd      xmm2, [w11]
               movd      xmm6, [w21]

               pmuludq   xmm0, xmm2              // sbLine1[xp1] * w11
               pmuludq   xmm4, xmm6              // sbLine1[xp2] * w21

               paddd     xmm0, xmm4              // packed integer add: sbLine1[xp1] * w11 + sbLine1[xp2] * w21

               movd      eax, xmm0
               push      eax                     // save the sbLine1 partial sum on the stack

               mov       eax, [sbLine2]
               movzx     ecx, byte ptr [eax+edx] // edx still holds xp2
               movd      xmm0, ecx               // sbLine2[xp2]

               mov       edx, [xp1]
               movzx     ecx, byte ptr [eax+edx]
               movd      xmm4, ecx               // sbLine2[xp1]

               movd      xmm2, [w22]
               movd      xmm6, [w12]

               pmuludq   xmm0, xmm2              // sbLine2[xp2] * w22
               pmuludq   xmm4, xmm6              // sbLine2[xp1] * w12

               paddd     xmm0, xmm4              // packed integer add: sbLine2[xp2] * w22 + sbLine2[xp1] * w12

               movd      eax, xmm0

               pop       edx                     // restore the sbLine1 partial sum from the stack

               add       eax, edx                // sum of all four weighted samples

               shr       eax, 16                 // drop the 16 fractional bits
               mov       edx, [dbLine]
               mov       [edx], al               // dbLine^[0]
            end;

But instead of being faster, the code is slower (!?).

I wonder what I'm doing wrong.

Posted: **Sun Mar 27, 2016 11:31 am**

Just using SSE2 instructions instead of normal x86/64 ASM instructions won't bring you any benefit. SSE2 doesn't multiply faster than x86/64. The purpose of SSE2 is not to do a single multiplication per instruction. It's to do 4 (dwords), 8 (words) or 16 (bytes) operations with one SSE2 instruction. Only if you do that do you get a speed improvement over x86/64.

So the proper way to use SSE2 is to:

1) Use an SSE2 instruction to load 16 bytes directly from RAM into an SSE2 register. Don't use x86/64 instructions to fill the SSE2 registers.
2) Use SSE2 instructions to operate on those 16 bytes directly somehow.
3) Use an SSE2 instruction to write the final result back to RAM.

Ideally you would do SSE2 operations on 16 different bytes (you know, 1 byte is one Red, Green, Blue or Alpha component of a 32bit RGBA pixel) "at once". Doing that will give you a very big speed gain. However, the code is more difficult to write, of course.
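To make the load / operate / store pattern concrete, here is a minimal sketch written with C compiler intrinsics rather than hand-written asm. Each intrinsic corresponds to one SSE2 instruction (`movdqu`, `punpcklbw`/`punpckhbw`, `pmullw`, `paddw`, `psrlw`, `packuswb`), so it should translate fairly directly to an asm block. It only blends two source rows vertically, 16 bytes (four RGBA pixels) per iteration; a full bilinear resize also needs a horizontal step, which is omitted. Several things here are assumptions and not from the thread: the function names, the use of 8-bit weights (0..256, summing to 256) so that every product fits a 16-bit lane, and a row length that is a multiple of 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Scalar reference: dst = (row1 * w1 + row2 * w2) >> 8 for every byte. */
void blend_rows_scalar(const uint8_t *row1, const uint8_t *row2, uint8_t *dst,
                       size_t byte_count, unsigned w1, unsigned w2)
{
    for (size_t i = 0; i < byte_count; ++i)
        dst[i] = (uint8_t)((row1[i] * w1 + row2[i] * w2) >> 8);
}

/* SSE2 version: same arithmetic, 16 bytes (4 RGBA pixels) per iteration.
   Assumptions: w1 + w2 == 256 and byte_count is a multiple of 16. */
void blend_rows_sse2(const uint8_t *row1, const uint8_t *row2, uint8_t *dst,
                     size_t byte_count, unsigned w1, unsigned w2)
{
    assert(w1 + w2 == 256 && byte_count % 16 == 0);
    const __m128i zero = _mm_setzero_si128();
    const __m128i vw1  = _mm_set1_epi16((short)w1);  /* weight in all 8 lanes */
    const __m128i vw2  = _mm_set1_epi16((short)w2);

    for (size_t i = 0; i < byte_count; i += 16) {
        /* 1) load 16 bytes straight from memory into SSE2 registers */
        __m128i a = _mm_loadu_si128((const __m128i *)(row1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(row2 + i));

        /* 2) widen bytes to 16-bit lanes, multiply by the weights, add, shift */
        __m128i lo = _mm_add_epi16(
            _mm_mullo_epi16(_mm_unpacklo_epi8(a, zero), vw1),
            _mm_mullo_epi16(_mm_unpacklo_epi8(b, zero), vw2));
        __m128i hi = _mm_add_epi16(
            _mm_mullo_epi16(_mm_unpackhi_epi8(a, zero), vw1),
            _mm_mullo_epi16(_mm_unpackhi_epi8(b, zero), vw2));
        lo = _mm_srli_epi16(lo, 8);
        hi = _mm_srli_epi16(hi, 8);

        /* 3) narrow back to bytes and store all 16 results at once */
        _mm_storeu_si128((__m128i *)(dst + i), _mm_packus_epi16(lo, hi));
    }
}
```

The 8-bit weights trade some precision against the 16.16 weights of the Pascal code; that is the price of staying within 16-bit lanes, where SSE2 can multiply eight values per instruction.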

fast resize transparent images
