Using bytes (e.g. AH and AL) rather than dwords (e.g. EAX and EBX) for essentially 8-bit data.
Is there a speed penalty or a prefetch/register stall when using 8-bit registers (AH/AL or similar) instead of 32-bit ones?
jeez I love Intel manuals :sarcasmwhereisitwhenyouneedit:
There can be a speed penalty, at least on some processors.
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
.data
membyte db 1
.code
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
start:
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
LOOP_COUNT equ 1000000
REPEAT_COUNT equ 100
invoke Sleep, 3000
; Test 1: byte load from memory into a partial register, then a 32-bit AND
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov al, membyte
and eax, 1
mov bl, membyte
and ebx, 1
mov cl, membyte
and ecx, 1
mov dl, membyte
and edx, 1
ENDM
counter_end
print ustr$(eax)," cycles",13,10
; Test 2: as test 1, but the registers are zeroed with XOR before the byte loads
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
mov al, membyte
and eax, 1
mov bl, membyte
and ebx, 1
mov cl, membyte
and ecx, 1
mov dl, membyte
and edx, 1
ENDM
counter_end
print ustr$(eax)," cycles",13,10
; Test 3: MOVZX from memory, which writes the full 32-bit register
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
movzx eax, membyte
and eax, 1
movzx ebx, membyte
and ebx, 1
movzx ecx, membyte
and ecx, 1
movzx edx, membyte
and edx, 1
ENDM
counter_end
print ustr$(eax)," cycles",13,10
; Test 4: register-to-register byte move (high byte to low byte), then a 32-bit AND
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
mov al, ah
and eax, 1
mov bl, bh
and ebx, 1
mov cl, ch
and ecx, 1
mov dl, dh
and edx, 1
ENDM
counter_end
print ustr$(eax)," cycles",13,10
; Test 5: as test 4, but the registers are zeroed with XOR before the byte moves
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
xor eax, eax
xor ebx, ebx
xor ecx, ecx
xor edx, edx
mov al, ah
and eax, 1
mov bl, bh
and ebx, 1
mov cl, ch
and ecx, 1
mov dl, dh
and edx, 1
ENDM
counter_end
print ustr$(eax)," cycles",13,10
; Test 6: MOVZX from the high-byte register, which writes the full 32-bit register
counter_begin LOOP_COUNT, HIGH_PRIORITY_CLASS
REPEAT REPEAT_COUNT
movzx eax, ah
and eax, 1
movzx ebx, bh
and ebx, 1
movzx ecx, ch
and ecx, 1
movzx edx, dh
and edx, 1
ENDM
counter_end
print ustr$(eax)," cycles",13,10
inkey "Press any key to exit..."
exit
; «««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««««
end start
P3:
3355 cycles
499 cycles
397 cycles
2559 cycles
599 cycles
403 cycles
P4:
398 cycles
552 cycles
398 cycles
486 cycles
697 cycles
486 cycles
I think on the P4 there are some smaller penalties that this code does not test for.
Not much difference on an AMD:
________________________________________________
382 cycles
509 cycles
413 cycles
334 cycles
490 cycles
359 cycles
It really depends on just how you do the byte manipulations, and on alignment.
The biggest issue is that 8-bit register operations break the register renaming strategy that allows out-of-order execution. The CPU cannot allocate a new ('renamed') register for the work to be done in. It needs a fully up-to-date EAX before processing can commence on AL, AH, or even AX, because it cannot predict the future use of the register: later instructions might depend on the untouched portions being accurate, but the CPU cannot make that determination.
There are methods that help the CPU here and shorten the stall when mixing 8-bit operations with 32-bit ones on the same register (see the second example above, which zeroes the register with XOR first), but the best approach is the third example: turn the 8-bit operations into 32-bit ones with sign or zero extension (MOVSX or MOVZX). Any stalls with the third example will be for something other than register renaming or out-of-order issues. (I would hazard a guess that code similar to the third example would be limited by the short supply of memory read ports.)
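To make the "sign or zero extension" advice concrete, here is a minimal sketch of the rewrite. It is a fragment for illustration only, not part of the timed tests, and it assumes the membyte variable from the test program's .data section. The MOVSX line is an addition for the signed case, which the tests above do not time.

```
; Partial write followed by a full-register read:
; the pattern that was slow on the P3 in the first test.
    mov   al, membyte       ; writes only the low 8 bits of EAX
    and   eax, 1            ; reads all 32 bits of EAX

; Same result in EAX, but the whole register is written at once.
    movzx eax, membyte      ; zero extension, for unsigned byte values
    and   eax, 1

; For signed byte values, sign extension plays the same role.
    movsx eax, membyte      ; EAX = membyte sign-extended to 32 bits
```

Whether the rewrite gains anything depends on the processor, as the P3, P4 and AMD timings above show, so it is worth timing on the target CPU.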