Which is faster?

Started by Neil, May 01, 2009, 10:56:52 AM


Neil

I've been looking at Mark's optimisation webpage. Am I correct in thinking that this code:

               movzx eax, BYTE PTR [esi]
               inc esi                              ;or maybe add esi,1?

is faster than this:-

                mov al,[esi]                     ;lob
                inc esi                             ;Macro
                and eax,00000000000000000000000011111111b

jj2007

On a Celeron M, inc and add yield equal timings:

96      cycles for 100*movzx, inc esi
96      cycles for 100*movzx, add esi, 1
396     cycles for 100*mov al

Test it yourself:
.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

LOOP_COUNT = 1000000
.data
MainString db "This is a long string meant for testing the code", 0

.code
start:
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- test 1: movzx + inc ---------
mov esi, offset MainString ; the 100 reads run past the 49-byte string; only the timing matters here
REPEAT 100
movzx eax, BYTE PTR [esi] ; zero-extended byte load
inc esi
ENDM
counter_end
print str$(eax), 9, "cycles for 100*movzx, inc esi", 13, 10, 10 ; --------- end test 1 ---------

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- test 2: movzx + add ---------
mov esi, offset MainString
REPEAT 100
movzx eax, BYTE PTR [esi]
add esi, 1 ; same as test 1, but add instead of inc
ENDM
counter_end
print str$(eax), 9, "cycles for 100*movzx, add esi, 1", 13, 10, 10 ; --------- end test 2 ---------

counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- test 3: mov al + and ---------
mov esi, offset MainString
REPEAT 100
mov al, [esi] ; writes the low byte only
inc esi
and eax,00000000000000000000000011111111b ; clear the upper 24 bits
ENDM
counter_end
print str$(eax), 9, "cycles for 100*mov al", 13, 10, 10 ; --------- end test 3 ---------
inkey "--- ok ---"
exit
end start

Neil

This is what I got:-

95     cycles for 100*movzx, inc esi

95     cycles for 100*movzx, add esi,1

371    cycles for 100*mov al

So inc & add are the same & the first method is much quicker than the second.
Thanks JJ  :U
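
A possible explanation for the gap, not established anywhere in this thread: mov al writes only the low byte, so the and eax that follows has to read a register that was just partially written, and every mov al also depends on the previous value of eax. Older Intel cores are known to handle that pattern slowly. As a sketch only, here is a fourth block that could be dropped into JJ's program before the inkey line to see what clearing eax up front does. The label text is made up, and no timing is claimed for it:

```masm
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- hypothetical test 4: xor + mov al ---------
mov esi, offset MainString
REPEAT 100
xor eax, eax    ; clear all 32 bits first, so no and is needed afterwards
mov al, [esi]   ; low byte load; upper 24 bits are already zero
inc esi
ENDM
counter_end
print str$(eax), 9, "cycles for 100*xor eax, mov al", 13, 10, 10 ; --------- end test 4 ---------
```

Whether this gets anywhere near the movzx figure will depend on the processor, so it needs timing like everything else here.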


hutch--

Neil,

Whether INC or ADD REG, 1 is faster depends on the processor hardware. On the PIV family ADD is faster, on much other hardware INC is faster. As most speed issues are related to memory access speed, you need not lose any sleep over which one you choose. Go over the algo and reduce any memory accesses that you can and you may see it go faster. Twiddling between INC and ADD will very rarely give you any useful difference.

Neil

Thanks hutch, I'm going to stick with inc, it's quicker to type :bg

Jimg

Not to mention 1/3 the size!
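
For anyone who wants to check the size claim rather than take it on trust, a minimal sketch that lets the assembler report both instruction lengths. The label names are made up for this example, and it assumes a 32-bit MASM32 build like the test program earlier in the thread:

```masm
include \masm32\include\masm32rt.inc

.code
start:
inc_start:
    inc esi                     ; one-byte encoding: 46h
inc_end:
    add esi, 1                  ; three-byte encoding: 83h 0C6h 01h
add_end:
    mov eax, inc_end - inc_start    ; label difference is an assemble-time constant
    print str$(eax), 9, "byte(s) for inc esi", 13, 10
    mov eax, add_end - inc_end
    print str$(eax), 9, "byte(s) for add esi, 1", 13, 10
    inkey "--- ok ---"
    exit
end start
```

The two instructions do get executed on the way to the print lines, which is harmless here.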

dedndave

i must be missing sumpin - lol

     LODSB

Mark Jones

Neil, generally INC/DEC are considerably faster than ADD/SUB on the AMD Athlon processors.

As always, timing the code is the best bet. Of course, that requires actually owning these processors. Too bad there isn't some service out there which could time code snippets on all major processor types. (Or a relative comparison of processor instruction latency between all the major brands.)
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

Neil

Thanks for that Mark, my test was done on an Intel processor but I have a spare computer with an Athlon processor, I'll fire it up tomorrow & see what the test results are on that.

jj2007

Quote from: dedndave on May 01, 2009, 03:58:29 PM
i must be missing sumpin - lol

     LODSB

Sorry :bg

96      cycles for 100*movzx, inc esi
364     cycles for 100*lodsb
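
The block behind the lodsb figure was not posted. Presumably it looked something like the sketch below, which follows the pattern of the earlier test program and goes before its inkey line. With the direction flag clear, lodsb does the same job as mov al, [esi] followed by inc esi, and like mov al it leaves the upper 24 bits of eax alone:

```masm
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS ; --------- lodsb (reconstruction, not the posted code) ---------
cld                         ; make sure esi counts upwards
mov esi, offset MainString
REPEAT 100
lodsb                       ; al = [esi], esi = esi + 1
ENDM
counter_end
print str$(eax), 9, "cycles for 100*lodsb", 13, 10, 10 ; --------- end lodsb ---------
```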

Generally, the lods, scas, movs etc stuff is a bit slow - with one exception: rep movsd is blazingly fast for aligned memcopies, see inter alia this post by Hutch. I use lodsb if speed is not important.
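
For reference, a minimal sketch of the kind of aligned rep movsd copy meant here. The buffer names and the 16-byte size are assumptions for this example and are not taken from Hutch's post:

```masm
include \masm32\include\masm32rt.inc

COPY_BYTES = 16                     ; must be a multiple of 4 for movsd

.data
align 4
SrcBuf db "0123456789ABCDE", 0      ; 15 characters + terminator = 16 bytes

.data?
align 4
DstBuf db COPY_BYTES dup(?)

.code
start:
    cld                             ; copy upwards
    mov esi, offset SrcBuf          ; source, dword aligned
    mov edi, offset DstBuf          ; destination, dword aligned
    mov ecx, COPY_BYTES/4           ; count in dwords, not bytes
    rep movsd                       ; copy ecx dwords from [esi] to [edi]
    print offset DstBuf, 13, 10     ; show the copied string
    inkey "--- ok ---"
    exit
end start
```

A real memcopy would also need to deal with the leftover 1 to 3 bytes when the length is not a multiple of 4.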

dedndave

ahhhhh - that is good to know
i guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Mark's page
btw - REP LODS doesn't make much sense - lol
i don't think i have ever used that

Jimg

It really depends upon how you write the test.  This test uses repeat 1000, and only does it once.  lodsb is 3 times faster on my AMD, 4 times faster on my Celeron and about 15% slower on my 1.8 GHz Pentium M

[attachment deleted by admin]

dedndave

trying to locate Mark's page
heliosstudios says i don't have permission to access - is that the one ?

MichaelW

eschew obfuscation

Mark Jones

What's that? Oh that page is so antiquated, was started and never completed (like so many other things in my life, sigh.)

I thought you were talking about Mark Larson's page. That has some useful stuff on it. :bg