8-bit vs 32-bit

Started by sinsi, August 25, 2007, 03:43:33 PM


sinsi

Is it better to use byte registers (e.g. AH and AL) or dword registers (e.g. EAX and EBX) for essentially 8-bit data?
Is there a speed penalty, or a prefetch/register stall, when using 8 bits (AH/AL or similar) vs 32 bits?

jeez I love Intel manuals :sarcasmwhereisitwhenyouneedit:
Light travels faster than sound, that's why some people seem bright until you hear them.

MichaelW

There can be a speed penalty, at least on some processors.

; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    include \masm32\include\masm32rt.inc
    .686
    include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    .data
      membyte db 1
    .code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
    LOOP_COUNT equ 1000000
    REPEAT_COUNT equ 100

    invoke Sleep, 3000

    ; Test 1: byte load from memory without clearing the register first, then 32-bit AND
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov al, membyte
        and eax, 1
        mov bl, membyte
        and ebx, 1
        mov cl, membyte
        and ecx, 1
        mov dl, membyte
        and edx, 1
      ENDM
    counter_end
    print ustr$(eax)," cycles",13,10

    ; Test 2: clear each register with XOR, then byte load from memory and 32-bit AND
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor eax, eax
        xor ebx, ebx
        xor ecx, ecx
        xor edx, edx
        mov al, membyte
        and eax, 1
        mov bl, membyte
        and ebx, 1
        mov cl, membyte
        and ecx, 1
        mov dl, membyte
        and edx, 1
      ENDM
    counter_end
    print ustr$(eax)," cycles",13,10

    ; Test 3: zero-extending load from memory (MOVZX), then 32-bit AND
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        movzx eax, membyte
        and eax, 1
        movzx ebx, membyte
        and ebx, 1
        movzx ecx, membyte
        and ecx, 1
        movzx edx, membyte
        and edx, 1
      ENDM 
    counter_end
    print ustr$(eax)," cycles",13,10

    ; Test 4: byte move from the high byte register to the low one, then 32-bit AND
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        mov al, ah
        and eax, 1
        mov bl, bh
        and ebx, 1
        mov cl, ch
        and ecx, 1
        mov dl, dh
        and edx, 1
      ENDM
    counter_end
    print ustr$(eax)," cycles",13,10

    ; Test 5: clear each register with XOR, then byte move (high to low) and 32-bit AND
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        xor eax, eax
        xor ebx, ebx
        xor ecx, ecx
        xor edx, edx
        mov al, ah
        and eax, 1
        mov bl, bh
        and ebx, 1
        mov cl, ch
        and ecx, 1
        mov dl, dh
        and edx, 1
      ENDM
    counter_end
    print ustr$(eax)," cycles",13,10

    ; Test 6: zero-extending move from the high byte register (MOVZX), then 32-bit AND
    counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
      REPEAT REPEAT_COUNT
        movzx eax, ah
        and eax, 1
        movzx ebx, bh
        and ebx, 1
        movzx ecx, ch
        and ecx, 1
        movzx edx, dh
        and edx, 1
      ENDM 
    counter_end
    print ustr$(eax)," cycles",13,10

    inkey "Press any key to exit..."
    exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start

P3:

3355 cycles
499 cycles
397 cycles
2559 cycles
599 cycles
403 cycles

P4:

398 cycles
552 cycles
398 cycles
486 cycles
697 cycles
486 cycles

I think on the P4 there are some smaller penalties that this code does not test for.
eschew obfuscation

Jimg

Not much difference on an AMD
________________________________________________

382 cycles
509 cycles
413 cycles
334 cycles
490 cycles
359 cycles

It really depends on just how you do the byte manipulations, and on alignment.

Rockoon


The biggest issue is that 8-bit register ops break the register renaming strategy that allows out-of-order execution. The CPU cannot allocate a new ('renamed') register for the work to be done in. It needs a fully up-to-date EAX before processing can commence on AL, AH, or even AX, because it cannot predict the future use of the register: later instructions might depend on the untouched portions being accurate, and the CPU cannot make that determination.
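
A minimal sketch of why the untouched portions matter, assuming the MASM32 SDK is installed at its default path and that its `print` and `uhex$` macros are available. The constants are arbitrary illustration values, not taken from the posts above. Each byte or word write has to be merged with whatever the rest of the register already holds:

```asm
    include \masm32\include\masm32rt.inc

    .code
start:
    ; EBX is used because it is preserved across the API calls made inside print.
    mov ebx, 12345678h      ; arbitrary illustration value
    mov bl, 0FFh            ; writes bits 0-7 only:  EBX = 123456FFh
    print uhex$(ebx),13,10
    mov bh, 0               ; writes bits 8-15 only: EBX = 123400FFh
    print uhex$(ebx),13,10
    mov bx, 0ABCDh          ; writes bits 0-15 only: EBX = 1234ABCDh
    print uhex$(ebx),13,10
    exit
end start
```

In every case the upper 16 bits (1234h) survive, which is the state the processor has to keep track of when a later instruction reads the full register.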

There are techniques that help the CPU here and shorten the stall when mixing 8-bit ops with 32-bit ones on the same register (see the second test in MichaelW's code above, which clears the register with XOR first), but the best approach is the third test: turn the 8-bit ops into 32-bit ones with sign or zero extension. Any stalls with the third approach will be for something other than register renaming or out-of-order issues. (I would hazard a guess that code similar to the third test would be limited by the short supply of memory read ports.)
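
The test code above only exercises zero extension (MOVZX). As a small sketch of both forms, again assuming the MASM32 SDK at its default path with its `print`, `ustr$` and `str$` macros; `uval` and `sval` are hypothetical names chosen for illustration:

```asm
    include \masm32\include\masm32rt.inc

    .data
      uval db 200           ; hypothetical unsigned byte (0C8h)
      sval db -56           ; hypothetical signed byte, same bit pattern (0C8h)

    .code
start:
    movzx eax, uval         ; upper 24 bits cleared:           EAX = 000000C8h (200)
    print ustr$(eax),13,10
    movsx eax, sval         ; upper 24 bits copy the sign bit: EAX = 0FFFFFFC8h (-56)
    print str$(eax),13,10
    exit
end start
```

Either way the instruction writes all 32 bits of the destination, so nothing has to be merged with the register's previous contents.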


When C++ compilers can be coerced to emit rcl and rcr, I *might* consider using one.