Hi all
I don't know if this is the right forum to post this question.
Moderator, feel free to move this thread if you think it appropriate.
Don't let the PowerBASIC code scare you off. The subject of my question is related to a small portion of the assembler part only.
* What does the code do?
- it performs an integer addition of 2 operands that can have any length.
- addition (with carry) is performed DWORD per DWORD
* What's the question?
- The addition is done in a loop. Can the loop be made any faster?
* Constraints
- Code should run on any Pentium system.
- Operands are little endian: the least significant DWORD is stored in the lowest memory address. That's why I need to INC ecx and I can't seem to use the LOOP instruction.
Any ideas?
Kind regards
Eddy
#COMPILE EXE
#DIM ALL
FUNCTION PBMAIN AS LONG
LOCAL p1 AS STRING
LOCAL p2 AS STRING
LOCAL pr AS STRING
LOCAL p1Ptr AS DWORD
LOCAL p2Ptr AS DWORD
LOCAL prPtr AS DWORD
LOCAL DWords AS DWORD
LOCAL Temp AS DWORD
LOCAL OLen AS DWORD
LOCAL StartTime AS SINGLE
'--Fill addition operands with some (dummy) data
p1 = "1111222233334444" 'Operand 1
p2 = "1111222233334444" 'Operand 2
'--Reserve space to store result (ignore carry for now)
pr = "1234123412341234"
'--Get operand and result (end) addresses
' (StrPtr() returns the memory address of a string)
OLen = LEN(p1) 'Operand length (in bytes)
p1Ptr = STRPTR(p1) + OLen 'End address of p1
p2Ptr = STRPTR(p2) + OLen 'End address of p2
prPtr = STRPTR(pr) + OLen 'End address of pr (result of addition)
DWords = LEN(p1) \ 4 'String length in DWords
DWords = - DWords 'Make negative so we can INC and
'check for zero
! mov eax, p1Ptr 'copy end address of operand 1
! mov ebx, p2Ptr 'copy end address of operand 2
! mov esi, prPtr 'copy end address of result
! mov ecx, DWords 'copy number of DWords to add
! clc 'clear carry: CF is undefined here, and the first ADC must not see a stale carry
'===Start of code to optimise=======================
Add1:
! mov edx, [eax+ecx*4] 'get DWORD of p1
! adc edx, [ebx+ecx*4] 'add DWORD of p2
! mov [esi+ecx*4], edx 'store result in pr
! inc ecx
! jnz Add1
'===End of code to optimise=========================
END FUNCTION
Yeah, this really should be in the Laboratory :)
A few things you can try.
1) try unrolling the loop more to remove loop overhead.
2) On a P4, ADC is very slow; try using MMX. On my P4, using MMX to avoid ADC executes faster.
3) Try aligning your code labels
4) make sure both data arrays are dword aligned.
" 1) try unrolling the loop more to remove loop overhead."
---Difficult to do, since the length of the operands is not fixed; it can be any length...
" 2) On P4 a ADC is very slow, try using MMX. On my P4 using MMX to avoid an ADC executes faster."
---Sorry for my ignorance, but are the MMX instructions available on all Pentium versions? Pentium I, Pentium II?
" 3) Try aligning your code labels"
" 4) make sure both data arrays are dword aligned."
---I'll check but I believe the PowerBASIC compiler handles this automatically.
Thanks!
Kind regards
Eddy
Quote
Eddy,
here is a quick play. It does an unroll by 2, and I tidied up the handling of ECX. The first block within the loop can be duplicated a number of times to test whether unrolling helps with speed.
! mov eax, p1Ptr ' copy end address of operand 1
! mov ebx, p2Ptr ' copy end address of operand 2
! mov esi, prPtr ' copy end address of result
! mov ecx, DWords ' copy number of DWords to add
! shr ecx, 2 ' divide by 4 (byte count to DWORD count)
! neg ecx ' invert sign
! clc ' NEG sets CF for a nonzero operand; clear it before the first ADC
'===Start of code to optimise=======================
Add1:
! mov edx, [eax+ecx*4] 'get DWORD of p1
! adc edx, [ebx+ecx*4] 'add DWORD of p2
! mov [esi+ecx*4], edx 'store result in pr
! add ecx, 1 ' caution: ADD rewrites CF, so the ADC carry chain is lost here (INC preserves CF)
! jz AddOut
! mov edx, [eax+ecx*4] 'get DWORD of p1
! adc edx, [ebx+ecx*4] 'add DWORD of p2
! mov [esi+ecx*4], edx 'store result in pr
! add ecx, 1 ' caution: ADD rewrites CF, so the ADC carry chain is lost here (INC preserves CF)
! jnz Add1
'===End of code to optimise=========================
AddOut:
When you're unrolling a loop it can also be helpful to add extra code to the loop initialization and enter the unrolled loop at precisely the right point, so you only need to make one test for loop completion at the bottom. In other words, if you've unrolled the loop to do four operations, check first and make sure your initial entry point leaves an even multiple of four operations to do once the initial 1, 2, 3, or 4 operations are completed. Doing this adds complexity to the initialization and creates a penalty when the total number of operations is small, but it can be a big boost when you're dealing with increasingly large iteration counts.
You can also code the proper offsets directly into your mov instructions when you've unrolled the loop, and thus avoid doing all the individual increments for each operation. Sometimes it helps, sometimes it hurts. That way you can do one add ecx,4 at the end of the loop. However, that's where the overhead in the initialization code comes in: if you're going to do things like that, then you have to adjust ecx to be a multiple of 4 and enter the loop at the point where the offset and synchronization are correct.
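To make the idea above concrete, here is a hypothetical C sketch (C is used only for illustration; the thread's actual code is inline asm). It peels off the leftover 0-3 elements first, so the unrolled body only needs one completion test per four operations. Carry is deliberately ignored here to keep the unrolling structure visible.

```c
#include <stddef.h>
#include <stdint.h>

/* Unroll-by-4 with a remainder peeled off up front, so the unrolled
   body always runs a whole number of times with one test per pass. */
static void add_arrays(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                       size_t n)
{
    size_t i = 0;
    /* Handle the remainder first so n - i is a multiple of 4. */
    for (; i < n % 4; i++)
        dst[i] = a[i] + b[i];
    /* Unrolled body: four operations per loop-completion test. */
    for (; i < n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
}
```

The same shape maps back onto the asm version: the peeling replaces the extra jz tests inside the unrolled loop.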
Also, I didn't understand why the shr ecx,2 and the neg ecx instructions were needed above. In the statements above the code, Dwords = -Dwords had already been specified. Maybe I missed something?
Quote
Also, I didn't understand why the shr ecx,2 and the neg ecx instructions were needed above. In the statements above the code, Dwords = -Dwords had already been specified. Maybe I missed something?
Simple: it's smaller, more efficient code and very easy to write. "Dwords = -Dwords" is an assignment, where NEG ECX is simpler. The SHR is simpler than a division and is preferred in that context.
I may be missing something here, but you can't add two characters like that and get meaningful results. "2" + "2" = "d" in this case. Is there some hidden PB function that is converting the input to binary, and the output back to string?
Remember the comment '--Fill addition operands with some (dummy) data
The numeric characters in the strings are just to remind that the values for each character (byte) will be a number.
They serve the purpose of taking up a byte each, and are valid values for testing, 2 + 2 = d is really 50 + 50 = 100
and it is a very convenient way to fill a series of bytes. The input and result are easy to display, just print the strings.
Eddy,
You can have different routines optimised for different processors if you are really that concerned.
Thanks all for replying so far!
- Hutch: Loop unrolling... Nice concept! I hadn't heard of it before. (That's why I originally posted this code in the Campus ;) )
I will do the test later today to see if it makes a big difference. I guess the basic idea is to not interrupt the CPU's pipelining..?
- Phil: About the DWords = - DWords line:
I need to increment ecx (because of little endian), so if I start with an operand base address that points to the beginning of the operand, I have to do a CMP to check when ecx equals DWords (end of operand) to end the loop. Right? So that gives me an INC, a CMP and a JNE.
If I make the base address point to the end of the operand (that's why the 'STRPTR(p1) + OLen'), and I start off with a negative ecx, I can suffice to check when ecx equals zero at the end of the operand. So this gives me an INC and a JNZ only (no need for a CMP).
If I were not stuck to the little endian, I could simply use a LOOP, which automatically decrements ecx..:(
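The negative-index trick Eddy describes can be sketched in C as well (C used purely for illustration; the function and names here are made up for the example). The base pointer is set one past the end of the array and the index runs from -n up to 0, so the loop condition becomes a plain "not zero" test with no compare against the length:

```c
#include <stdint.h>

/* Walk an array with a negative index counting up to zero, mirroring
   the INC/JNZ loop in the asm: no CMP against the length is needed. */
static uint32_t sum_neg_index(const uint32_t *data, long n)
{
    const uint32_t *end = data + n;  /* base points one past the last element */
    long i = -n;
    uint32_t sum = 0;
    do {
        sum += end[i];               /* end[-n] .. end[-1] covers the array */
        i++;
    } while (i != 0);                /* compiles down to an INC/JNZ pair */
    return sum;
}
```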
- Jimg: Don't worry about the data. Using a string is just an easy way for me to store some known number of bytes somewhere (I need a multiple of 4 bytes) and to display the result. So dsouza123 was right on the spot!
- Roticv: Yes I could do that, although I think I can safely ignore pre-pentium CPU's. Going MMX would be an option. Probably not that many Pentiums without MMX around..(?)
Kind regards
Eddy
Eddy,
the main win with a loop unroll is to reduce the number of taken jumps. A jump that falls through is a lot faster than one that is taken. The short demo is an unroll by 2. If it seems any faster, try an unroll by 4, as it may help. There tends to be a point of diminishing returns, usually at about 4 times, but it's worth a try.
PB aligns the data in arrays, so it's OK there. It may be worth seeing if you can do the calculation without the ADC, as it is slower than the preferred instruction set.
Hutch,
Unrolling the loop surely is a good move.
Unrolling by a factor of 2 (like your example) gives a run time of 30 to 50% of the original run time.
I tested with operand lengths from 40 to 40K bytes. Of course I had to put the calculation in a loop to get a measurable run time.
I also tried adding a 3rd and a 4th section but gain was only minimal (a few percent).
Amazing for me was also to see that replacing 'ADD ecx, 1' with 'INC ecx' almost totally destroyed the benefit of unrolling the loop. I always thought that INC was more efficient...
This sure is a good improvement! I will also look into the MMX thing. See if that brings anything..
Thanks!!!!
Kind regards
Eddy
Quote from: Eddy on May 25, 2005, 09:58:41 PM
Amazing for me was also to see that replacing 'ADD ecx, 1' with 'INC ecx' almost totally destroyed the benefit of unrolling the loop. I always thought that INC was more efficient...
INC/DEC
is faster than ADD/SUB on AMD processors.
Quote from: Mark Jones on May 25, 2005, 10:47:53 PM
Quote from: Eddy on May 25, 2005, 09:58:41 PM
Amazing for me was also to see that replacing 'ADD ecx, 1' with 'INC ecx' almost totally destroyed the benefit of unrolling the loop. I always thought that INC was more efficient...
INC/DEC is faster than ADD/SUB on AMD processors.
But slower on Intel processors.
http://www.agner.org/assem/
Welcome to the world of writing mixed model, mixed hardware code. :bg
Speed gain by loop unrolling is somewhat less dramatic than I mentioned above.
I got the big difference in running time by leaving the 'INC ecx' in my initial routine... ::)
After also using 'ADD ecx, 1' there, the running time of the optimised routine is about 95% of that of the original routine.
Still worth using..
I'm looking into MMX and x87 FPU now. Totally new to me, so have to study first... :dazzled:
Kind regards
Eddy
Eddy,
Is there a way you can achieve the same functionality without using ADC (add with carry) ? If you can perform the action without the ADC and use ADD instead, it is likely to be faster.
hutch, he needs ADC because he is doing big-integer addition. But of course he can replace it with MMX, for instance:
bnAdd2 proc bigintdes:dword,bigintsrc1,bigintsrc2
;bigintdes = bigint1 + bigint2
mov eax, [esp+4]
mov ecx, [esp+8]
mov edx, [esp+12]
movd mm0, [ecx]
movd mm1, [edx]
paddq mm1, mm0
movd [eax], mm1
psrlq mm1, 32
i = 4
REPT N/2 - 1
movd mm0, [ecx+i]
movd mm2, [edx+i]
paddq mm2, mm0
paddq mm2, mm1
movd [eax+i],mm2
psrlq mm2, 32
i = i + 4
movd mm0, [ecx+i]
movd mm1, [edx+i]
paddq mm1, mm0
paddq mm1, mm2
movd [eax+i], mm1
psrlq mm1, 32
i = i + 4
ENDM
movd mm0,[ecx+i]
movd mm2,[edx+i]
paddq mm2, mm0
paddq mm2, mm1
movq [eax+i], mm2
emms ; leave the MMX/FPU state clean before returning
retn 12
bnAdd2 endp
Just modify it slightly so that you can make the length of the big int variable.
Hutch,
I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.
Addition of the first (least significant) dword could be done with ADD, but this could already set the carry bit, so for the rest of the additions the carry bit has to be tagged along.
Depending on the operand values, the carry can 'ripple' through all the dwords, up to the end. So, when using ADD, you would have to add ways to handle the carry.
ADD sets the CF, so after every ADD I could check if a CF is set.
If CF is set, I could do ADD edx, 1 (where edx holds the next dword). I could test whether that gains something, but I fear the extra CF check destroys the benefit of using ADD over ADC.
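For what it's worth, the ADD-plus-explicit-carry scheme can be modelled in portable C, where the carry out of each limb is detected via unsigned wraparound rather than the flags register. (This is a hedged sketch in C, not the thread's asm; names are invented for the example.)

```c
#include <stddef.h>
#include <stdint.h>

/* Multi-precision add using plain adds and an explicit carry variable.
   An unsigned sum smaller than an addend means the add wrapped, i.e.
   a carry out occurred. Returns the final carry out of the top limb. */
static uint32_t bignum_add(uint32_t *r, const uint32_t *a, const uint32_t *b,
                           size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t s  = a[i] + carry;
        uint32_t c1 = (s < carry);   /* carry out of adding the carry-in */
        s += b[i];
        uint32_t c2 = (s < b[i]);    /* carry out of the main add */
        r[i] = s;
        carry = c1 | c2;             /* at most one of the two can carry */
    }
    return carry;
}
```

This is exactly the bookkeeping that ADC performs in one instruction, which is why the explicit version tends to cost extra unless ADC itself is slow on the target CPU.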
This said, does anyone know of a handy document (or tool) to look up the number of clock cycles that a certain instruction requires?
Could not find that kind of info in the Intel manuals I checked...
Thanks for the idea!
Kind regards
Eddy
Roticv,
Am I understanding your code correctly?
Do you use paddq instead of paddd to include the carry bit?
Something like this?
movd mm0, [ecx+i] '<--get dword of operand 1
movd mm2, [edx+i] '<--get dword of operand 2
paddq mm2, mm0 'Add 2 quadwords
paddq mm2, mm1 'Add carry bit (generated previously)
movd [eax+i],mm2 'Store addition result in mm2
psrlq mm2, 32 'Shift carry bit to bit 0 for next addition
Is there a way to do this with paddd instead of paddq ?
Kind regards
Eddy
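The paddq/psrlq technique discussed above has a direct scalar analogue that may help clarify it: widen each 32-bit limb addition to 64 bits so the carry lands in bit 32, then shift it down for the next limb. A hedged C sketch (C for illustration only; the accumulator plays the role of mm1/mm2):

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar mirror of the MMX quadword-add carry scheme: each 32-bit limb
   pair is summed in a 64-bit accumulator; the low dword is the result
   and the high dword (bit 32) is the carry into the next limb. */
static uint32_t bignum_add64(uint32_t *r, const uint32_t *a,
                             const uint32_t *b, size_t n)
{
    uint64_t acc = 0;                 /* carry-in lives in the low bits */
    for (size_t i = 0; i < n; i++) {
        acc += (uint64_t)a[i] + b[i]; /* like paddq: sum plus carry-in */
        r[i] = (uint32_t)acc;         /* like movd: store the low dword */
        acc >>= 32;                   /* like psrlq mmX, 32: keep carry */
    }
    return (uint32_t)acc;             /* final carry out */
}
```

This also shows why paddd would not work here: a dword add discards the carry, whereas the quadword add keeps it in bits 32 and up.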
Quote from: Eddy on May 27, 2005, 02:51:48 PM
This said, does anyone know of a handy document (or tool) to look up the number of clock cycles that a certain instruction requires?
Could not find that kind of info in the Intel manuals I checked...
There is really interesting document: http://www.agner.org/assem/pentopt.pdf
Quote from: Eddy on May 27, 2005, 02:51:48 PM
I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.
BTW, if you know how to measure the code speed, you can try the following modification. It may seem strange, but I remember that somebody said it is much quicker on P4. I don't have the proof, though.
Add1:
! mov edx, [eax+ecx*4] 'get DWORD of p1
! jnc over
! add edx, 1 ' or inc edx
over:
! add edx, [ebx+ecx*4] 'add DWORD of p2
! mov [esi+ecx*4], edx 'store result in pr
! inc ecx
! jnz Add1
Mazegen,
your code with the ADD instead of ADC does not give a shorter run time. Execution time is twice as long on my system (Centrino).
Thanks for your suggestion though!
Kind regards
Eddy
Another thing you could try is to first copy all of the first operand into the memory reserved for the result, then add the second operand to that memory. So the addition loop becomes
;ecx - contains length of 2nd operand
;esi - points to 2nd operand
;edi - points to result containing copy of 1st operand
xor eax,eax          ;clear index (also clears CF for the first adc)
AddDigit:
mov edx, [esi+eax*4] ;get DWORD of 2nd operand
adc [edi+eax*4], edx ;add it to the copy of the 1st operand in pr
inc eax              ;INC preserves CF; ADD would destroy the carry chain
loop AddDigit
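Posit's in-place idea can be sketched in C as follows (illustrative only; the function name is invented). The result buffer starts as a copy of operand 1, and operand 2 is then added into it limb by limb:

```c
#include <stddef.h>
#include <stdint.h>

/* In-place multi-precision add: r initially holds a copy of operand 1,
   and operand 2 (b) is accumulated into it, propagating the carry.
   Returns the carry out of the most significant limb. */
static uint32_t add_in_place(uint32_t *r, const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = (uint64_t)r[i] + b[i] + carry; /* like adc [r], b */
        r[i] = (uint32_t)s;
        carry = (uint32_t)(s >> 32);
    }
    return carry;
}
```

The trade-off is the extra up-front copy of operand 1, which is presumably why it did not pay off in Eddy's timing below.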
Posit,
Good idea! But this way still takes about 250% of the original routine's time.. :(
Kind regards
Eddy
Quote from: Eddy on May 27, 2005, 03:09:49 PM
Roticv,
Am I understanding your code correctly?
Do you use paddq instead of paddd to include the carry bit?
Something like this?
movd mm0, [ecx+i] '<--get dword of operand 1
movd mm2, [edx+i] '<--get dword of operand 2
paddq mm2, mm0 'Add 2 quadwords
paddq mm2, mm1 'Add carry bit (generated previously)
movd [eax+i],mm2 'Store addition result in mm2
psrlq mm2, 32 'Shift carry bit to bit 0 for next addition
Is there a way to do this with paddd instead of paddq ?
Kind regards
Eddy
Yes you are correct, the carry bit is stored in mm2/mm1 depending on which part of the loop you are referring to.
PS: I think the code was from Mark Larson.
Nice list of pentium opcodes and clock cycles:
http://home.zcu.cz/~dudacek/SOJ/manualy/80x86set.pdf
Kind regards
Eddy
I think it is outdated, because we are no longer using the original Pentium, and it does not apply to the P4.
I couldn't find a more recent one... :(
Thanks for the link Eddy.
If it were possible to time single instructions with Michael's timing macros, I'd code an app to time every instruction. Then we could see timings for every processor.
Quote from: Eddy on May 30, 2005, 04:48:48 PM
I couldn't find a more recent one... :(
Well known, updated one is here: http://www.agner.org/assem/pentopt.pdf
Quote from: Mark Jones on May 30, 2005, 05:01:17 PM
If it were possible to time single instructions with Michael's timing macros, I'd code an app to time every instruction. Then we could see timings for every processor.
On the old forum Mark Larson suggested a project like this, but it never really went anywhere. It is possible to derive a repeatable cycle count for a single instruction, but if the goal were to determine the minimum cycle count, it would be necessary to create optimal execution conditions for the instruction. I think for some instructions this would be no problem, but for others it would be difficult. If I recall correctly, Mark wanted to automate the process as much as possible, but I could not envision any workable method of doing that. It seemed to me that the task was just going to require a lot of coding and testing. But that was before any (to my knowledge) easy-to-use timing macros were available :bg
Hmm indeed. I was thinking of three aspects:
1. Use a lookup table (probably by macro) to get the next instruction(s) to test. This way a lot of redundant code could be mitigated. The table could be ordered from 386 --> 786 so that all available instructions for the current processor would be tested.
2. Loop each single instruction 100 times and subtract the accrued loop delay from the result. The various processors might have different loop timings, so the compensation value might be CPU-dependent. I tried timing a NOP (then a series of NOPs) 10M times and sometimes would get 0 clocks, or even -2. Which might be right or wrong but leads us to:
3. Better results if all CPU cache, look-ahead features, and optimizations can be turned off. IF they can be turned off, or otherwise mitigated. (Cache flush routine?)
But I agree, the semantics would be the most difficult. How would variable-length instructions be tested for instance? With a fast random number generator?
It's an intriguing idea. :)
Quote from: Mark Jones on May 31, 2005, 01:19:42 PM
3. Better results if all CPU cache, look-ahead features, and optimizations can be turned off. IF they can be turned off, or otherwise mitigated. (Cache flush routine?)
I agree that this would make it easier to get consistent results, but the results would not reflect the real world where processors are used with all of these performance features enabled. I think an optimal cycle count would be more useful, because it would allow you to readily judge the level of optimization for an instruction sequence by comparing the sum of the optimal cycle counts for the sequence to the actual cycle count.
Quote from: MazeGen on May 27, 2005, 03:12:06 PM
Quote from: Eddy on May 27, 2005, 02:51:48 PM
I can't really think of a way (other than by creating a lot of overhead) to avoid the ADC.
BTW, if you know how to measure the code speed, you can try the following modification. It may seem strange, but I remember that somebody said it is much quicker on P4. I don't have the proof, though.
Add1:
! mov edx, [eax+ecx*4] 'get DWORD of p1
! jnc over
! add edx, 1 ' or inc edx
over:
! add edx, [ebx+ecx*4] 'add DWORD of p2
! mov [esi+ecx*4], edx 'store result in pr
! inc ecx
! jnz Add1
I know someone already showed that this was slower, but this may be because (depending on the range of data you are generally adding) the jump taken over the addition of the carry will more often be mispredicted. There is some of my code doing something similar on the old forum, and I worked out that for me it was faster to add all the possible carries first, then jump over code to remove them if they weren't actually needed.
Yes, your code can be easily found on the old forum.
For completeness' sake, I was referring to this (http://www.old.masmforum.com/viewtopic.php?p=19951#19951) topic.