Which is faster?

Neil · May 01, 2009, 10:56:52 AM

I've been looking at Mark's optimisation webpage. Am I correct in thinking that this code:-

movzx eax, BYTE PTR [esi]
inc esi ;or maybe add esi,1?

is faster than this:-

mov al,[esi] ;lob
inc esi ;Macro
and eax,00000000000000000000000011111111b

jj2007 · May 01, 2009, 12:24:11 PM

On a Celeron M, inc and add yield equal timings:

96      cycles for 100*movzx, inc esi
96      cycles for 100*movzx, add esi, 1
396     cycles for 100*mov al

Test it yourself:

.nolist
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm

	LOOP_COUNT = 1000000
.data
MainString	db "This is a long string meant for testing the code", 0

.code
start:
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS		; --------- the traditional way ---------
		mov esi, offset MainString
		REPEAT 100
			movzx eax, BYTE PTR [esi]
			inc esi
		ENDM
	counter_end
	print str$(eax), 9, "cycles for 100*movzx, inc esi", 13, 10, 10	; --------- end traditional way ---------

	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS		; --------- movzx with add esi, 1 ---------
		mov esi, offset MainString
		REPEAT 100
			movzx eax, BYTE PTR [esi]
			add esi, 1
		ENDM
	counter_end
	print str$(eax), 9, "cycles for 100*movzx, add esi, 1", 13, 10, 10	; --------- end movzx with add esi, 1 ---------

	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS		; --------- mov al, inc esi, and eax ---------
		mov esi, offset MainString
		REPEAT 100
			mov al, [esi]		;lob
			inc esi		;Macro
			and eax,00000000000000000000000011111111b
		ENDM
	counter_end
	print str$(eax), 9, "cycles for 100*mov al", 13, 10, 10	; --------- end mov al, inc esi, and eax ---------
	inkey "--- ok ---"
	exit
end start

Neil · May 01, 2009, 12:46:04 PM

This is what I got:-

95 cycles for 100*movzx, inc esi

95 cycles for 100*movzx, add esi,1

371 cycles for 100*mov al

So inc & add are the same & the first method is much quicker than the second.
Thanks JJ :U

hutch-- · May 01, 2009, 01:09:14 PM

Neil,

Whether INC or ADD REG, 1 is faster depends on the processor hardware. On the PIV family ADD is faster; on much other hardware INC is faster. As most speed issues are related to memory access speed, you may not need to lose any sleep over which one you choose. Go over the algo and reduce any memory accesses that you can and you may see it go faster; twiddling between INC and ADD will very rarely give you any useful difference.

Neil · May 01, 2009, 02:50:04 PM

Thanks hutch, I'm going to stick with inc, it's quicker to type :bg

Jimg · May 01, 2009, 03:22:26 PM

Not to mention 1/3 the size!
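
For reference, the size difference comes from the instruction encodings. This is a small listing, assuming 32-bit code and the short forms an assembler such as MASM normally emits; the byte values are given in the comments.

```
	inc esi			; 46h         = 1 byte  (short form, opcode 40h + register number)
	add esi, 1		; 83h C6h 01h = 3 bytes (opcode, ModRM, sign-extended imm8)
```

The one-byte form of inc exists only in 16/32-bit code, so the ratio does not carry over to 64-bit mode.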

dedndave · May 01, 2009, 03:58:29 PM

i must be missing sumpin - lol

LODSB

Mark Jones · May 01, 2009, 04:47:32 PM

Neil, generally INC/DEC are considerably faster than ADD/SUB on the AMD Athlon processors.

As always, timing the code is the best bet. Of course, determining this requires that one actually own these processors. Too bad there isn't some service out there which could time code snippets on all major processor types. (Or a relative comparison of processor instruction latency between all the major brands.)

Neil · May 01, 2009, 05:38:50 PM

Thanks for that, Mark. My test was done on an Intel processor, but I have a spare computer with an Athlon processor; I'll fire it up tomorrow & see what the test results are on that.

jj2007 · May 01, 2009, 05:51:54 PM

Quote from: dedndave on May 01, 2009, 03:58:29 PM
i must be missing sumpin - lol

LODSB

Sorry :bg

Quote:
96 cycles for 100*movzx, inc esi
364 cycles for 100*lodsb
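
The block that produced the lodsb figure is not shown in the thread. As a guess at its shape, assuming it follows the same pattern as the three blocks in the earlier test program (same LOOP_COUNT, MainString and timer macros), it could look like this:

```
	cld						; make sure lodsb steps esi upwards
	counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS		; --------- lodsb (assumed block) ---------
		mov esi, offset MainString
		REPEAT 100
			lodsb				; al = [esi], then esi = esi + 1
		ENDM
	counter_end
	print str$(eax), 9, "cycles for 100*lodsb", 13, 10, 10
```

Note that lodsb leaves the upper 24 bits of eax unchanged, so it replaces only the mov al / inc esi pair, not the zero extension that movzx provides.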

Generally, the lods, scas, movs etc stuff is a bit slow - with one exception: rep movsd is blazingly fast for aligned memcopies, see inter alia this post by Hutch. I use lodsb if speed is not important.
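
A minimal sketch of the rep movsd copy mentioned above, in the same MASM32 setup as the test program. The buffer names SrcBuf and DstBuf and the 64-byte size are made up for illustration; both buffers are DWORD-aligned and the length is a whole number of DWORDs.

```
include \masm32\include\masm32rt.inc

.data
align 4
SrcBuf	dd 16 dup (12345678h)	; hypothetical source: 16 DWORDs = 64 bytes

.data?
align 4
DstBuf	dd 16 dup (?)		; hypothetical destination of the same size

.code
start:
	mov esi, offset SrcBuf	; source pointer
	mov edi, offset DstBuf	; destination pointer
	mov ecx, 16		; count is in DWORDs, not bytes
	cld			; direction flag clear = copy forwards
	rep movsd		; copy ecx DWORDs from [esi] to [edi]

	mov eax, DstBuf		; read back the first copied DWORD
	print str$(eax), 9, "first DWORD of the copy", 13, 10
	inkey "--- ok ---"
	exit
end start
```

If the byte count is not a multiple of 4, the usual approach is to copy count/4 DWORDs with rep movsd and then the remaining 0 to 3 bytes with rep movsb.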

dedndave · May 01, 2009, 08:41:33 PM

ahhhhh - that is good to know
i guess, when i do use LODSB (without the REP prefix), it is a case where speed is not critical
generally speaking, i use it in cases like parsing a command line
still, this is good info - i will have to take a look at Mark's page
btw - REP LODS doesn't make much sense - lol
i don't think i have ever used that

Jimg · May 01, 2009, 10:48:08 PM

It really depends upon how you write the test. This test uses repeat 1000, and only does it once. lodsb is 3 times faster on my AMD, 4 times faster on my Celeron and about 15% slower on my 1.8 GHz Pentium M.

[attachment deleted by admin]

dedndave · May 02, 2009, 02:59:49 AM

trying to locate Mark's page
heliosstudios says i don't have permission to access - is that the one ?

MichaelW · May 02, 2009, 03:24:46 AM

http://heliosstudios.net/index.html.disabled

Mark Jones · May 02, 2009, 04:09:21 AM

What's that? Oh, that page is so antiquated; it was started and never completed (like so many other things in my life, sigh.)

I thought you were talking about Mark Larson's page. That has some useful stuff on it. :bg
