Optimizing dw2hex_ex

Started by DoomyD, August 09, 2008, 09:54:14 PM


DoomyD

I've based my algorithm on the following.
I'll be happy to hear what you have to say, and hopefully to see some good results  :P

mmx_dw2hexA proc ;Value: EAX Buffer: EDX
.data
align 8
mmx_dw2hex_isomsk db 8 dup (0Fh)
mmx_dw2hex_cmpmsk db 8 dup (0Ah)
mmx_dw2hex_ashmsk db 8 dup (07h)
mmx_dw2hex_ascmsk db 8 dup (30h)

.code
bswap       eax         ;swap the bytes
movd        mm0, eax
movd        mm1, eax
psrlw       mm0, 4
punpcklbw   mm0, mm1    ;unpack & swap the nibble order
pand        mm0, qword ptr [mmx_dw2hex_isomsk]

movq        mm1, qword ptr [mmx_dw2hex_cmpmsk]  ;a mask that filters 0Ah~0Fh - (09)
pcmpgtb     mm1, mm0                            ;apply the mask, the result is reversed
pandn       mm1, qword ptr [mmx_dw2hex_ashmsk]  ;re-reverse the mask (NAND) - (07)
paddb       mm0, qword ptr [mmx_dw2hex_ascmsk]  ;assign ASCII encoding
paddb       mm0, mm1                            ;assign fixed ASCII encoding

movq        qword ptr [edx],mm0
retn
mmx_dw2hexA endp
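
For readers who want to check the arithmetic without an assembler, here is a minimal scalar sketch in C of what each byte lane of the routine above computes. It is a model, not a replacement for the MMX code: the function name dw2hex_model is hypothetical, and the loop handles one nibble at a time where the MMX version handles all eight at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of the MMX routine (hypothetical name): writes 8 uppercase
   hex digits for `value` into `out`, most significant digit first.
   Like the movq store, it does not write a terminating NUL. */
void dw2hex_model(uint32_t value, char out[8])
{
    assert(out != NULL);
    for (int i = 0; i < 8; i++) {
        /* isolate one nibble (the pand with 0Fh) */
        uint8_t nib = (uint8_t)((value >> (28 - 4 * i)) & 0x0F);
        /* pcmpgtb with 0Ah as destination: all ones when 0Ah > nibble */
        uint8_t is_digit = (0x0A > nib) ? 0xFF : 0x00;
        /* pandn with 07h: keeps 07h only where the compare was false (A..F) */
        uint8_t fix = (uint8_t)(~is_digit & 0x07);
        /* paddb 30h, then paddb the A..F correction */
        out[i] = (char)(nib + 0x30 + fix);
    }
}
```

The +7 correction works because 'A' (41h) sits 7 positions above the character that follows '9' (3Ah) in ASCII.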



[attachment deleted by admin]

NightWare

hi,
i haven't tested your code, just the algo you've posted and i have an error with 090ABCDEFh value.

your modified algo :

bswap eax
movd MM0,eax
movd MM1,eax
psrlq MM1,4
punpcklbw MM1,MM0
pand MM1,QWORD PTR [mmx_dw2hex_isomsk]
movq MM0,MM1
pcmpgtb MM1,QWORD PTR [mmx_dw2hex_cmpmsk]
pand MM1,QWORD PTR [mmx_dw2hex_ashmsk]
paddb MM0,QWORD PTR [mmx_dw2hex_ascmsk]
paddb MM0,MM1
movq QWORD PTR [edx],MM0
; mov BYTE PTR [edx+8],0
; bswap eax

qWord

Quote from: DoomyD on August 09, 2008, 09:54:14 PM
I'll be happy to hear what you have to say, and hopefully, to see some good results :P

I've used NightWare's modification for testing.

on core2duo:
9 clocks mmx_dw2hexA , 43 for dw2hex_ex

I have also tested my original algorithm (for comparison):

with bswap:    11-12 clocks
with pshufw:   10 clocks

so, one clock faster and using mmx only - congratulations  :thumbu


regards, qWord

FPU in a trice: SmplMath
It's that simple!

DoomyD

Hmm, in order to fix my code, set mmx_dw2hex_cmpmsk to 0Ah.
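
Why the constant matters: pcmpgtb is a signed greater-than with the constant in the destination operand, so the test in this routine is "constant > nibble". The sketch below models one byte lane in C with the constant passed in, so both values can be tried. It assumes the earlier constant was 09h (the thread does not state the old value), and nibble_to_hex is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts one nibble the way the routine does,
   with the compare constant (cmpmsk) as a parameter. */
char nibble_to_hex(uint8_t nib, uint8_t cmpmsk)
{
    assert(nib <= 0x0F);
    /* pcmpgtb mm1(cmpmsk), mm0(nibble): all ones when cmpmsk > nibble */
    uint8_t below = (cmpmsk > nib) ? 0xFF : 0x00;
    /* pandn with 07h: 07h only where the compare was false */
    uint8_t fix = (uint8_t)(~below & 0x07);
    return (char)(nib + 0x30 + fix);
}
```

With 0Ah the nibble 9 maps to '9'. With 09h the compare "9 > 9" is false, so 9 also receives the +7 correction and comes out as '@' (40h), which would explain a failure on a value such as 090ABCDEFh.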

lingo

"I've used NightWare's modification for testing."

"NightWare's modification" is from here:  :lol
http://www.masm32.com/board/index.php?topic=2974.msg23114#msg23114


NightWare

Quote from: lingo on August 10, 2008, 02:03:31 PM
"NightWare's modification" is from here:  :lol
http://www.masm32.com/board/index.php?topic=2974.msg23114#msg23114
no, i've honestly modified doomyd's algo, besides, if you look carefully, the variables are not exactly the same...  :wink

Mark_Larson


  I shaved off 2 cycles by converting it to a macro.

;eax = value, edx = buffer
macro_mmx_dw2hexA macro
      bswap       eax         ;swap the bytes
      movd        mm0, eax
      movd        mm1, eax
      psrlw       mm0, 4
      punpcklbw   mm0, mm1    ;unpack & swap the nibble order
      pand        mm0, qword ptr [mmx_dw2hex_isomsk]
      
      movq        mm1, qword ptr [mmx_dw2hex_cmpmsk]  ;a mask that filters 0Ah~0Fh - (09)
      pcmpgtb     mm1, mm0                            ;apply the mask, the result is reversed
      pandn       mm1, qword ptr [mmx_dw2hex_ashmsk]  ;re-reverse the mask (NAND) - (07)
      paddb       mm0, qword ptr [mmx_dw2hex_ascmsk]  ;assign ASCII encoding
      paddb       mm0, mm1                            ;assign fixed ASCII encoding
   
      movq        qword ptr [edx],mm0
endm
BIOS programmers do it fastest, hehe.  ;)

My Optimization webpage
htttp://www.website.masmforum.com/mark/index.htm

Mark_Larson

 I got it down to 5 processor cycles using a 64k lookup table.  I did two reads from the table to get the lower 16 bits and upper 16 bits of the conversion.  If you need to see the FULL code, I'll upload it.  The 64k buffer is really big :)  The routine is still a macro.  EDX no longer points to the start of the buffer.  It points to the middle.  That way I fill the buffer in backwards so I don't have to do a bswap.  That saves one cycle off the execution time.


push ebx
movzx ecx,ax
shr eax,16
movzx ebx,ax
mov cx, [convert + ecx]
mov bx, [convert + ebx]
mov [edx],cx
mov [edx-3],bx
pop ebx



Mark
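
The table itself is not shown in the post, so the following C sketch only illustrates the general idea of a two-read lookup conversion. Everything in it is an assumption: the names hex16, hex16_init and dw2hex_lut are hypothetical, and the layout (65536 entries of 4 ASCII digits, 256 KB in total) is a guess that may differ from the 64k table described above. It also writes the digits front to back rather than backwards from the middle of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: one entry per 16-bit value, 4 ASCII hex digits each. */
static char hex16[65536][4];

/* Fill the table once at start-up. */
void hex16_init(void)
{
    static const char digits[] = "0123456789ABCDEF";
    for (uint32_t v = 0; v < 65536; v++) {
        hex16[v][0] = digits[(v >> 12) & 0x0F];
        hex16[v][1] = digits[(v >> 8) & 0x0F];
        hex16[v][2] = digits[(v >> 4) & 0x0F];
        hex16[v][3] = digits[v & 0x0F];
    }
}

/* Two table reads: the upper 16 bits give the first 4 digits,
   the lower 16 bits give the last 4. No terminating NUL is written. */
void dw2hex_lut(uint32_t value, char out[8])
{
    assert(out != NULL);
    memcpy(out, hex16[value >> 16], 4);
    memcpy(out + 4, hex16[value & 0xFFFF], 4);
}
```

The trade-off is the usual one for tables: the conversion becomes two loads and two stores, but the table has to stay in cache for the timing to hold outside a tight benchmark loop.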

DoomyD

Quote from: Mark_Larson on August 11, 2008, 02:58:10 PM
I got it down to 5 processor cycles using a 64k lookup table. 
Out of curiosity, is this timing for one execution or two? (my executable calls each function twice)

RuiLoureiro

DoomyD,
                The results on my P4:

19 clocks       - mmx_dw2hexA   [output: 1234ABCD]
20 clocks       - dw2hex_ex     [output: 1234ABCD]
press any key to continue...   

Rui

Mark_Larson

Quote from: DoomyD on August 11, 2008, 04:08:04 PM
Out of curiosity, is the timing done for one execution or two? (my executable calls each function twice)

I did 2.  Same as you did in your code.  I wanted to keep it the same.  For the MMX I couldn't speed it up anymore except to do a macro.  So you did a good job. :)  That is why I went for the lookup table.  What kind of processor do you have?

DoomyD

My current processor is an Intel Core 2 Duo, 1.86 GHz
I would like to see your code (if it's possible).

@Rui: Thanks for testing, I'm just glad it doesn't fall behind, even if by 1 clock  :bg

I've attached a new executable. Let's see how it goes... =)

8 clocks - mmx_dw2hexA x2 (proc)
6 clocks - m_mmx_dw2hex x2 (macro)
1 clocks - m_mmx_dw2hex x1 (macro)

40 clocks - dw2hex_ex x2
19 clocks - dw2hex_ex x1

press any key to continue...

[attachment deleted by admin]

Mark_Larson

Quote from: DoomyD on August 11, 2008, 05:22:10 PM
My current processor is an Intel Core 2 Duo, 1.86 GHz
I would like to see your code (if it's possible).

Actually, in my first reply I cut and pasted my code.  The only thing that is missing is the 64k lookup table.  I forgot to cut and paste macro / endm.  I will do that real quick and re-post it.

mark_dw2hex macro
push ebx
movzx ecx,ax
shr eax,16
movzx ebx,ax
mov cx, [convert + ecx]
mov bx, [convert + ebx]
mov [edx],cx
mov [edx-3],bx
pop ebx
ret
endm

Mark_Larson

  My timings from your new code

8 clocks        - mmx_dw2hexA x2 (proc)
6 clocks        - m_mmx_dw2hex x2 (macro)
1 clocks        - m_mmx_dw2hex x1 (macro)

40 clocks       - dw2hex_ex x2
19 clocks       - dw2hex_ex x1

press any key to continue...


EDIT: I changed my code to only get called once, and it also runs in 1 cycle.  I love my core 2 duo! :)


1 clocks        - mark_dw2hex