Faster loop?

Eddy · May 24, 2005, 08:32:14 PM

Hi all
I don't know if this is the right forum to post this question.
Moderator, feel free to move this code if you think appropriate..

Don't let the PowerBASIC code scare you off. The subject of my question is related to a small portion of the assembler part only.

* What does the code do?
- it performs an integer addition of 2 operands that can have any length.
- addition (with carry) is performed DWORD per DWORD

* What's the question?
- The addition is done in a loop. Can the loop be made any faster?

* Constraints
- Code should run on any pentium system.
- Operands are little endian: the least significant DWORD is stored in the lowest memory address. That's why I need to INC ecx and I can't seem to use the LOOP instruction.

Any ideas?

Kind regards
Eddy

Code Select


#COMPILE EXE
#DIM ALL



FUNCTION PBMAIN AS LONG
    LOCAL p1        AS STRING 
    LOCAL p2        AS STRING 
    LOCAL pr        AS STRING 
    LOCAL p1Ptr     AS DWORD  
    LOCAL p2Ptr     AS DWORD  
    LOCAL prPtr     AS DWORD  
    LOCAL DWords    AS DWORD
    LOCAL Temp      AS DWORD
    LOCAL OLen      AS DWORD  
    LOCAL StartTime AS SINGLE 

        '--Fill addition operands with some (dummy) data
    p1 = "1111222233334444"     'Operand 1
    p2 = "1111222233334444"     'Operand 2
    
        '--Reserve space to store result (ignore carry for now)
    pr = "1234123412341234"     
    
        '--Get operand and result (end) addresses
        '  (StrPtr() returns the memory address of a string)
    OLen  = LEN(p1)             'Operand length (in bytes)    
    p1Ptr = STRPTR(p1) + OLen   'End address of p1
    p2Ptr = STRPTR(p2) + OLen   'End address of p2
    prPtr = STRPTR(pr) + OLen   'End address of pr (result of addition)
    
    DWords = LEN(p1) \ 4    'String length in DWords
    DWords = - DWords       'Make negative so we can INC and 
                            'check for zero
    
    ! mov  eax, p1Ptr       'copy end address of operand 1
    ! mov  ebx, p2Ptr       'copy end address of operand 2
    ! mov  esi, prPtr       'copy end address of result 
    ! mov  ecx, DWords      'copy number of DWords to add 

        '===Start of code to optimise=======================
    Add1:
    ! mov  edx, [eax+ecx*4] 'get DWORD of p1
    ! adc  edx, [ebx+ecx*4] 'add DWORD of p2
    ! mov  [esi+ecx*4], edx 'store result in pr
    ! inc  ecx
    ! jnz Add1
        '===End of code to optimise=========================
         
                 
END FUNCTION

Mark_Larson · May 24, 2005, 08:59:05 PM

Yea, really should be in the laboratory :)

A few things you can try.

1) try unrolling the loop more to remove loop overhead.
2) On P4 a ADC is very slow, try using MMX. On my P4 using MMX to avoid an ADC executes faster.
3) Try aligning your code labels
4) make sure both data arrays are dword aligned.

Eddy · May 24, 2005, 09:26:22 PM

" 1) try unrolling the loop more to remove loop overhead."
---Difficult to do, since the length of the operands is not fixed, so can be any length...

" 2) On P4 a ADC is very slow, try using MMX. On my P4 using MMX to avoid an ADC executes faster."
---Sorry for my ignorance, but are the MMX instructions available on all the Pentium versions? P I, P II ?

" 3) Try aligning your code labels"
" 4) make sure both data arrays are dword aligned."
---I'll check but I believe the PowerBASIC compiler handles this automatically.

Thanks!

Kind regards
Eddy

Quote

hutch-- · May 24, 2005, 11:32:03 PM

Eddy,

here is a quick play. It does an unroll by 2 and I tidied up the handling of ECX. The first block within the loop can be duplicated a number of times to test if unrolling help with the speed.

Code Select


    ! mov  eax, p1Ptr       ' copy end address of operand 1
    ! mov  ebx, p2Ptr       ' copy end address of operand 2
    ! mov  esi, prPtr       ' copy end address of result
    ! mov  ecx, DWords      ' copy number of DWords to add
    ! shr ecx, 2            ' divide by 4
    ! neg  ecx              ' invert sign

        '===Start of code to optimise=======================
  Add1:
    ! mov  edx, [eax+ecx*4] 'get DWORD of p1
    ! adc  edx, [ebx+ecx*4] 'add DWORD of p2
    ! mov  [esi+ecx*4], edx 'store result in pr
    ! add ecx, 1
    ! jz AddOut

    ! mov  edx, [eax+ecx*4] 'get DWORD of p1
    ! adc  edx, [ebx+ecx*4] 'add DWORD of p2
    ! mov  [esi+ecx*4], edx 'store result in pr
    ! add ecx, 1
    ! jnz Add1
        '===End of code to optimise=========================

  AddOut:

Phil · May 25, 2005, 12:47:26 AM

When you're unrolling a loop it can also be helpful to add extra code to the loop initialization and enter the unrolled loop at precisely the right point so you only need to make one test for loop completion at the bottom. In other words, if you've unrolled the loop to do four operations, check first and make sure your initial entry point leaves an even multple of four operations to do when the initial 1, 2, 3, or 4, operations are completed. Doing this adds complexity to the initialization and creates a penalty when the total number of operations is small but it can be a big boost when you're dealing with increasingly large iterations.

You can also code the proper offsets directly into your mov instructions when you've unrolled the loop and thus avoid doing all the individual increments for each operation. Sometimes it can help, sometimes it hurts. That way you can do one add ecx,4 at the end of the loop. However, that's where the overhead in the initialization code comes in. If you're going to do things like that then you have to adjust ecx to be a multiple of 4 and then enter the loop at the point where the offset and syncronization is correct.

Also, I didn't understand why the shr ecx ,2 and the neg ecx instructions where needed above. In the statements above the code Dwords = -Dwords had been specified. Maybe I missed something?

hutch-- · May 25, 2005, 01:35:44 AM

Quote
Also, I didn't understand why the shr ecx ,2 and the neg ecx instructions where needed above. In the statements above the code Dwords = -Dwords had been specified. Maybe I missed something?

Simple, its smaller and more efficient code and very easy to write. "Dwords = -Dwords" is an assignment where NEG ECX is simpler. The SHR is simper than a division and is preferred in tha context.

Jimg · May 25, 2005, 01:55:02 AM

I may be missing something here, but you can't add two characters like that and get meaningful results. "2" + "2" = "d" in this case. Is there some hidden PB function that is converting the input to binary, and the output back to string?

dsouza123 · May 25, 2005, 02:32:33 AM

Remember the comment '--Fill addition operands with some (dummy) data
The numeric characters in the strings are just to remind that the values for each character (byte) will be a number.
They serve the purpose of taking up a byte each, and are valid values for testing, 2 + 2 = d is really 50 + 50 = 100
and it is a very convenient way to fill a series of bytes. The input and result are easy to display, just print the strings.

roticv · May 25, 2005, 01:26:26 PM

Eddy,

You can have different routines optimised for different processors if you are really that concerned.

Eddy · May 25, 2005, 02:42:40 PM

Thanks all for replying so far!

- Hutch: Loop unrolling.. Nice concept! Hadn't heared of it before.. (that's why I had this code originally posted in the Campus..;) )
I will do the test later today to see if it makes a big difference. I guess the basic idea is to not interrupt the CPU's pipelining..?

- Phil: About the DWords = - DWords line:
I need to increment ecx (because of little endian), so if I start with an operand base address that points to the beginning of the operand, I have to do a CMP to check when ecx equals DWord (end of operand) to end theloop. Right? So that gives me a INC, a CMP and a JNE.
If I make the base address point to the end of the operand (that's why the 'STRPTR(p1) + OLen'), and I start off with a negative ecx, I can suffice to check when ecx equals zero at the end of the operand. So this gives me an INC and a JNZ only (no need for a CMP).
If I were not stuck to the little endian, I could simply use a LOOP, which automatically decrements ecx..:(

- Jimg: Don't worry about the data. Using a string is just an easy way for me to store some known number of bytes somewhere (I need a multiple of 4 bytes) and to display the result. So dsouza123 was right on the spot!

- Roticv: Yes I could do that, although I think I can safely ignore pre-pentium CPU's. Going MMX would be an option. Probably not that many Pentiums without MMX around..(?)

Kind regards
Eddy

hutch-- · May 25, 2005, 03:28:15 PM

Eddy,

the main win with a loop unroll is to reduce the number of taken jumps. A falled through jump is a lot faster than a taken jump. The short demo is an unroll by 2. If it seems to be any faster try an unroll by 4 as it may help. There tends to be a point of diminishing returns which is usually about 4 times but its worth a try.

PB aligns the data in arrays so its OK there, it may be worth seeing if you can do the calculation without the ADC as it is slower that the preferred set.

Eddy · May 25, 2005, 09:58:41 PM

Hutch,
Unrolling the loop surely is a good move.
Unrolling by a factor 2 (like your example) gives a run time from 30 to 50% compared to the original runtime.
I tested with operand lengths from 40 to 40K bytes. Of course I had to put the calculation in a loop to get a measurable run time.

I also tried adding a 3rd and a 4th section but gain was only minimal (a few percent).

Amazing for me was also to see that replacing 'ADD ecx, 1' with 'INC ecx' almost totally destroyed the benefit of unrolling the loop. I always thought that INC was more efficient...

This sure is a good improvement! I will also look into the MMX thing. See if that brings anything..
Thanks!!!!

Kind regards
Eddy

Mark Jones · May 25, 2005, 10:47:53 PM

Quote from: Eddy on May 25, 2005, 09:58:41 PM
Amazing for me was also to see that replacing 'ADD ecx, 1' with 'INC ecx' almost totally destroyed the benefit of unrolling the loop. I always thought that INC was more efficient...

INC/DEC is faster than ADD/SUB on AMD processors.

QvasiModo · May 27, 2005, 12:50:55 AM

Quote from: Mark Jones on May 25, 2005, 10:47:53 PM
Quote from: Eddy on May 25, 2005, 09:58:41 PM
Amazing for me was also to see that replacing 'ADD ecx, 1' with 'INC ecx' almost totally destroyed the benefit of unrolling the loop. I always thought that INC was more efficient...

INC/DEC is faster than ADD/SUB on AMD processors.

But slower on Intel processors.

http://www.agner.org/assem/

hutch-- · May 27, 2005, 01:16:11 AM

Welcome to the world of writing mixed model, mixed hardware code. :bg

News:

Faster loop?

Phil

dsouza123

roticv

QvasiModo