Odd timing behavior

Started by Merrick, November 03, 2006, 08:04:56 PM

Merrick

Several pieces of scientific software were developed on Macs "back in the day" and ported to PCs over a decade ago. Most behave normally, but some do really stupid things. One example is LabVIEW, which still saves all of its binary data in big-endian form. I have dealt with this problem on a number of occasions, and finally decided to develop an all-purpose DLL/LIB for handling BE <==> LE translations.

So, I developed six separate routines:
One translates words between BE and LE with a related routine that handles arrays of words.
One translates dwords between BE and LE with a related routine that handles arrays of dwords
    - with aliasing this works for both 4 byte integers and 4 byte reals.
One that translates qwords between BE and LE with a related routine that handles arrays of qwords.
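For readers who want something to check the assembly against, the six routines described above can be modelled in portable C. This is only a sketch: the names (swap16, swap32_array, swap4_any, and so on) are hypothetical and are not the DLL's exports. The input and output buffers are separate, as in the posted routines, and the "aliasing" remark is modelled by copying the bytes of any 4-byte object through a dword.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Single-value routines: reverse the bytes of a 2-, 4-, or 8-byte value. */
uint16_t swap16(uint16_t v) { return (uint16_t)((v << 8) | (v >> 8)); }

uint32_t swap32(uint32_t v) {
    return (v << 24) | ((v << 8) & 0x00FF0000u) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

uint64_t swap64(uint64_t v) {
    /* Swap each 32-bit half, then exchange the two halves. */
    return ((uint64_t)swap32((uint32_t)v) << 32) | swap32((uint32_t)(v >> 32));
}

/* Array routines: separate input and output buffers, one call per array. */
void swap16_array(const uint16_t *in, uint16_t *out, size_t count) {
    assert(count == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < count; i++) out[i] = swap16(in[i]);
}

void swap32_array(const uint32_t *in, uint32_t *out, size_t count) {
    assert(count == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < count; i++) out[i] = swap32(in[i]);
}

void swap64_array(const uint64_t *in, uint64_t *out, size_t count) {
    assert(count == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < count; i++) out[i] = swap64(in[i]);
}

/* "Aliasing": treat any 4-byte object (for example a float) as a dword.
   memcpy avoids pushing the swapped bit pattern through the FPU. */
void swap4_any(const void *in, void *out) {
    uint32_t bits;
    memcpy(&bits, in, sizeof bits);
    bits = swap32(bits);
    memcpy(out, &bits, sizeof bits);
}
```

Because the swap depends only on the bytes and not on the type, the same 4-byte routine serves integers and single-precision reals alike.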

They are fast, though I have made absolutely no attempt to optimize them. I measure about 18 clocks per word for word arrays, 32 clocks per dword for both integer and real arrays, and about 74 clocks per qword for double arrays. Using a VB front end I also tested conversion one value at a time, which is, not surprisingly, a lot slower. Much of the slowdown comes from the overhead of calling the DLL routine for every value instead of once for the entire array, plus the added overhead of VB, rather than the DLL, looping over the array.

Now, there is one very odd thing I'm seeing, and I was hoping that people might have some ideas about what is going on. For integer words and real singles and doubles the loop timings are 239, 244, and 256 clocks per value, respectively. But when I time integer dwords (which use exactly the same code as real singles) the timing is 321 clocks per value. That's even longer than qwords. I've been over the VB code a large number of times and am absolutely certain that there is nothing there (at the code level) that is responsible for this unexpected behavior. The basic routine is:
generate an array of random values of the appropriate type
start a timer
loop over each array a large number of times (n)
for the DLL array-handling routines, make one call per pass, n passes in total; for the DLL single-value routines, make one call per element, covering the entire array n times
stop the timer
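The measurement procedure listed above might be sketched as follows. This is a C stand-in, not the VB front end from the post: time_swaps, swap32_value and swap32_block are hypothetical names, clock() replaces whatever timer the VB code used, and because everything lives in one module it does not reproduce the cost of a VB-to-DLL call. It only shows the shape of the two loops being compared.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Stand-in for the single-value DLL routine. */
uint32_t swap32_value(uint32_t v) {
    return (v << 24) | ((v << 8) & 0x00FF0000u) |
           ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

/* Stand-in for the array DLL routine. */
void swap32_block(const uint32_t *in, uint32_t *out, size_t count) {
    for (size_t i = 0; i < count; i++) out[i] = swap32_value(in[i]);
}

/* Times `repeats` passes over `count` random values.
   per_value != 0: one call per element (the slow pattern in the post).
   per_value == 0: one call per pass over the whole array.
   Returns elapsed CPU seconds, or -1.0 if allocation fails. */
double time_swaps(size_t count, int repeats, int per_value) {
    assert(repeats >= 0);
    if (count == 0) return 0.0;

    uint32_t *in = malloc(count * sizeof *in);
    uint32_t *out = malloc(count * sizeof *out);
    if (in == NULL || out == NULL) { free(in); free(out); return -1.0; }

    /* Generate an array of random values. */
    for (size_t i = 0; i < count; i++) in[i] = (uint32_t)rand();

    clock_t start = clock();                      /* start the timer */
    for (int r = 0; r < repeats; r++) {
        if (per_value) {
            for (size_t i = 0; i < count; i++) out[i] = swap32_value(in[i]);
        } else {
            swap32_block(in, out, count);
        }
    }
    clock_t stop = clock();                       /* stop the timer */

    /* Read a result so the work cannot be discarded as unused. */
    volatile uint32_t sink = out[count - 1];
    (void)sink;

    free(in);
    free(out);
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Dividing the returned time by count * repeats gives a per-value figure comparable in spirit, though not in units, to the clocks-per-value numbers quoted in the post.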

Does anyone have any idea what might be causing this?

Here are the routines, if anyone's interested, though since both 4 byte integers and 4 byte reals use exactly the same code I doubt it's in the assembly, per se.


Endian2 proc endianInput:dword, endianOutput:dword
    mov edx, endianInput ;mem = B1 B2
    mov ax, word ptr[edx] ;ax = B2 B1
    xchg al, ah ;ax = B1 B2
    mov edx, endianOutput
    mov word ptr[edx], ax ;mem = B2 B1
    ret
Endian2 endp

Endian2Array proc endianInput:dword, endianOutput:dword, arrayCount:dword
    mov ecx, arrayCount
    test ecx, ecx
    je doneEndian2Array
nextEndian2Array:
    mov edx, endianInput ;mem = B1 B2
    mov ax, word ptr[edx+2*ecx-2] ;ax = B2 B1
    xchg al, ah ;ax = B1 B2
    mov edx, endianOutput
    mov word ptr[edx+2*ecx-2], ax ;mem = B2 B1
    dec ecx
    test ecx, ecx
    je doneEndian2Array
    jmp nextEndian2Array
doneEndian2Array:
    ret
Endian2Array endp

Endian4 proc endianInput:dword, endianOutput:dword
    mov edx, endianInput ;mem = B1 B2 B3 B4
    mov ax, word ptr[edx] ;ax = B2 B1
    mov bx, word ptr[edx+2] ;bx = B4 B3
    xchg al, ah ;ax = B1 B2
    xchg bl, bh ;bx = B3 B4
    shl eax, 16 ;eax = B1 B2 00 00
    mov ax, bx ;eax = B1 B2 B3 B4
    mov edx, endianOutput
    mov dword ptr[edx], eax ;mem = B4 B3 B2 B1
    ret
Endian4 endp

Endian4Array proc endianInput:dword, endianOutput:dword, arrayCount:dword
    mov ecx, arrayCount
    test ecx, ecx
    je doneEndian4Array
nextEndian4Array:
    mov edx, endianInput ;mem = B1 B2 B3 B4
    mov ax, word ptr[edx+4*ecx-4] ;ax = B2 B1
    mov bx, word ptr[edx+4*ecx-2] ;bx = B4 B3
    xchg al, ah ;ax = B1 B2
    xchg bl, bh ;bx = B3 B4
    shl eax, 16 ;eax = B1 B2 00 00
    mov ax, bx ;eax = B1 B2 B3 B4
    mov edx, endianOutput
    mov dword ptr[edx+4*ecx-4], eax ;mem = B4 B3 B2 B1
    dec ecx
    test ecx, ecx
    je doneEndian4Array
    jmp nextEndian4Array
doneEndian4Array:
    ret
Endian4Array endp

Endian8 proc endianInput:dword, endianOutput:dword
    mov edx, endianInput ;mem = B1 B2 B3 B4 B5 B6 B7 B8
    mov ax, word ptr[edx] ;ax = B2 B1
    mov bx, word ptr[edx+2] ;bx = B4 B3
    xchg al, ah ;ax = B1 B2
    xchg bl, bh ;bx = B3 B4
    shl eax, 16 ;eax = B1 B2 00 00
    mov ax, bx ;eax = B1 B2 B3 B4
    mov edx, endianOutput
    mov dword ptr[edx+4], eax ;mem = B4 B3 B2 B1

    mov edx, endianInput ;mem = B1 B2 B3 B4 B5 B6 B7 B8
    mov ax, word ptr[edx+4] ;ax = B6 B5
    mov bx, word ptr[edx+6] ;bx = B8 B7
    xchg al, ah ;ax = B5 B6
    xchg bl, bh ;bx = B7 B8
    shl eax, 16 ;eax = B5 B6 00 00
    mov ax, bx ;eax = B5 B6 B7 B8
    mov edx, endianOutput
    mov dword ptr[edx], eax ;mem = B8 B7 B6 B5
    ret
Endian8 endp


Endian8Array proc endianInput:dword, endianOutput:dword, arrayCount:dword
    mov ecx, arrayCount
    test ecx, ecx
    je doneEndian8Array
nextEndian8Array:
    mov edx, endianInput ;mem = B1 B2 B3 B4 B5 B6 B7 B8
    mov ax, word ptr[edx+8*ecx-8] ;ax = B2 B1
    mov bx, word ptr[edx+8*ecx-6] ;bx = B4 B3
    xchg al, ah ;ax = B1 B2
    xchg bl, bh ;bx = B3 B4
    shl eax, 16 ;eax = B1 B2 00 00
    mov ax, bx ;eax = B1 B2 B3 B4
    mov edx, endianOutput
    mov dword ptr[edx+8*ecx-4], eax ;mem = B4 B3 B2 B1

    mov edx, endianInput ;mem = B5 B6 B7 B8
    mov ax, word ptr[edx+8*ecx-4] ;ax = B6 B5
    mov bx, word ptr[edx+8*ecx-2] ;bx = B8 B7
    xchg al, ah ;ax = B5 B6
    xchg bl, bh ;bx = B7 B8
    shl eax, 16 ;eax = B5 B6 00 00
    mov ax, bx ;eax = B5 B6 B7 B8
    mov edx, endianOutput
    mov dword ptr[edx+8*ecx-8], eax ;mem = B8 B7 B6 B5

    dec ecx
    test ecx, ecx
    je doneEndian8Array
    jmp nextEndian8Array
doneEndian8Array:
    ret
Endian8Array endp



Also, my tabs before the comments got slammed. I guess that's standard HTML. Any ideas how to make those work out better?

PBrennick

Merrick,
I am not an optimizing guru, but I have done a lot of programming over the years and know for a fact that there is a better solution. Instead of using xchg (3-5 clocks when working at the byte level as you are), you should be using bswap (1 clock and a 4 byte approach, making this a no-brainer). Actually, bswap is the recommended approach for converting from big-endian to little-endian and not just my preference.


BSWAP - Byte Swap       (486+)

        Usage:  BSWAP   reg32
        Modifies flags: none

        Changes the byte order of a 32 bit register from big endian to
        little endian or vice versa.   Result left in destination register
        is undefined if the operand is a 16 bit register.

                                 Clocks                 Size
        Operands         808x  286   386   486          Bytes

        reg32             -     -     -     1             2


One last thing, about the tab problem, the code tags handle tabs properly whereas the quote tags do not. I do not feel that this is an error but is a matter of design. Be sure to use the code tags for anything that you want to have a column-like display such as tables or assembly source. It will look awful in the edit box but when you preview it to see what everyone else will see you will be very happy.

Paul
The GeneSys Project is available from:
The Repository or My crappy website

Merrick

Hi Paul,

Thanks for your input...

1. Thanks for the bswap suggestion. I'll definitely use it.

2. I'm about 99 44/100% sure I used the code tag. Of course, simply looking at the page source doesn't prove that since the page is reformatted for display (and I don't know how this software translates the tags), but I think that a snip of the source included below suggests that I used the code tag and not the quote tag.
I've actually had this problem trying to post code before, so I'm more than a little interested in working out the issue. For the record: I cut the code out of MS Development Environment 2003. The leading tabs on each line appear to have worked perfectly; the tabs between the end of the code and the beginning of the comment on each line did not.

3. Any ideas on the timing behavior? That's just weird!

4. I sent that first message about 1 hour after coming out of general anesthetic for extraction of impacted wisdom teeth. It's more coherent than I expected it to be!

Quote
s and 4 byte reals use exactly the code I doubt it&#039;s in the assembly, per se.<br /><br /><div class="codeheader">Code:</div><div class="code">Endian2 proc endianInput:dword, endianOutput:dword<br /><span style="white-space: pre;">   </span>mov edx, endianInput<span style="white-space: pre;">   </span><span style="white-space: pre;">   </span>;mem = B1 B2<br /><span style="white-space: pre;">   </span>mov ax, word ptr[edx]<span style="white-space: pre;">   </span><span style="white-space: pre;">   </span>;ax = B2 B1<br /><span style="white-space: pre;">

PBrennick

Merrick,
Would it be possible for you to give me a link to where you got the stuff that won't format so that I can get a firsthand copy of it for testing?

Paul

Merrick

Hi Paul,

Can't think of a place to post it. Fortunately, it's small...

[attachment deleted by admin]

PBrennick

Thanks for the source of your program but that is not what I asked for.  :eek

Quote
Would it be possible for you to give me a link to where you got the stuff that won't format so that I can get a firsthand copy of it for testing?

Paul

Merrick

Sorry then, Paul, I seem to have no idea what you're talking about. I just don't know what, "stuff that won't format," you are referring to if it's not my code. I assumed you were talking about the code I pasted that I was having formatting problems with. That is my code. If you're talking about something else, please help me understand what that is.

I did say that the page displayed on masmforum is reformatted from the input I placed on the page. If this is what you are referring to, then let me try to explain that again:

In the first message I typed...

{code} ... {/code} (but with square brackets, not curly brackets)

...then pasted my code in between the tags. As I said, the leading tabs in each line were reproduced flawlessly, but all the tabs on any given line which were in between code and the trailing comment on that line were not displayed so that there is no whitespace between code and comments. When I say, "the page is reformatted for display," I posted an example of that in quotes immediately below in that particular message. That quoted fragment was obtained by selecting View:Source while looking at the masmforum page with my post on it. So, if that's what you are referring to, you are staring at the, "link where you got the stuff that won't format." It is the masmforum. If you want to see that, click View:Source right now and see. You might even want to look at some code you've posted for curiosity.

Thanks.

PBrennick

The tab problem is not anything important to many as it really has no bearing on your program. I work with BB Code more than most so I wanted to see if there is an issue that should be reported to SMF. For that reason, I wanted the source so I can test the formatting. You have given me that source and that is all I really need. I have a few questions I would like to ask you and if it is not a bother, I will do this via PMs as I do not want to take the focus off of your timing issues.

Please let us know how bswap works for you.
Paul

ToutEnMasm


Extract from the Intel book:
BSWAP—Byte Swap
Description
Reverses the byte order of a 32-bit (destination) register: bits 0 through 7 are swapped with bits
24 through 31, and bits 8 through 15 are swapped with bits 16 through 23. This instruction is
provided for converting little-endian values to big-endian format and vice versa.
To swap bytes in a word value (16-bit register), use the XCHG instruction. When the BSWAP
instruction references a 16-bit register, the result is undefined.
IA-32 Architecture Compatibility
The BSWAP instruction is not supported on IA-32 processors earlier than the Intel486
processor family. For compatibility with this instruction, include functionally equivalent
code for execution on Intel processors earlier than the Intel486 processor family.
Operation
TEMP ← DEST
DEST[7..0] ← TEMP[31..24]
DEST[15..8] ← TEMP[23..16]
DEST[23..16] ← TEMP[15..8]
DEST[31..24] ← TEMP[7..0]
Flags Affected
None.
Exceptions (All Operating Modes)

                    ToutEnMasm
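The "Operation" lines in the extract above can be followed literally in C. This is an illustrative model of what the instruction does to a 32-bit register, not a replacement for it; the helper names (bits, bswap_model, xchg_model) are made up for this sketch. The 16-bit case is modelled separately because, as the extract says, BSWAP on a 16-bit register is undefined and XCHG is used instead.

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits hi..lo (inclusive) of a 32-bit value, right-aligned. */
uint32_t bits(uint32_t v, unsigned hi, unsigned lo) {
    assert(hi < 32 && lo <= hi);
    unsigned width = hi - lo + 1;
    uint32_t mask = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return (v >> lo) & mask;
}

/* Model of BSWAP on a 32-bit register, one statement per Operation line. */
uint32_t bswap_model(uint32_t dest) {
    uint32_t temp = dest;              /* TEMP         <- DEST         */
    dest  = bits(temp, 31, 24);        /* DEST[7..0]   <- TEMP[31..24] */
    dest |= bits(temp, 23, 16) << 8;   /* DEST[15..8]  <- TEMP[23..16] */
    dest |= bits(temp, 15, 8) << 16;   /* DEST[23..16] <- TEMP[15..8]  */
    dest |= bits(temp, 7, 0) << 24;    /* DEST[31..24] <- TEMP[7..0]   */
    return dest;
}

/* Model of "xchg al, ah" applied to AX: swap the two bytes of a word. */
uint16_t xchg_model(uint16_t ax) {
    uint8_t al = (uint8_t)(ax & 0xFFu);
    uint8_t ah = (uint8_t)(ax >> 8);
    return (uint16_t)(((unsigned)al << 8) | ah);
}
```

Applying bswap_model twice returns the original value, which is why one routine serves for both BE to LE and LE to BE.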

MichaelW

Merrick,

I can't see how the odd timing behavior could be anything other than VB... variants?

I played with your code to see how much difference using bswap would make, and the difference is greater than I expected. I arbitrarily sized the arrays for 10 elements. The difference would be greater for larger arrays. In the process I determined that the Endian8 and Endian8Array procedures are not performing the conversion correctly. Results on my P3:

Endian2:        00001122        00002211
_Endian2:       00002211        00001122
Endian2Array:   00001122        00002211
_Endian2Array:  00002211        00001122
Endian4:        11223344        44332211
_Endian4:       44332211        11223344
Endian4Array:   11223344        44332211
_Endian4Array:  44332211        11223344
Endian8:        1122334455667788        8877665555667788
_Endian8:       1122334455667788        8877665544332211
Endian8Array:   1122334455667788        8877665555667788
_Endian8Array:  8877665544332211        1122334455667788
26 cycles, Endian2
14 cycles, _Endian2
194 cycles, Endian2Array
66 cycles, _Endian2Array
32 cycles, Endian4
12 cycles, _Endian4
264 cycles, Endian4Array
55 cycles, _Endian4Array
54 cycles, Endian8
16 cycles, _Endian8
534 cycles, Endian8Array
77 cycles, _Endian8Array

[attachment deleted by admin]
eschew obfuscation

Merrick

MichaelW,

Thanks for catching the problem with Endian8. I thought I'd fixed that, but apparently not.

I got similar improvements when switching to bswap as well. Thanks all for that suggestion.

I had thought about the possibility of VB doing something odd like treating the 4-byte integer arrays as variants for some arcane reason, but surprisingly when I changed to the BSWAP versions the odd behavior for that one case went away. And the only change to the VB code was calling the BSWAP version of the routine as opposed to the XCHG version of the routine.

I can only assume the odd behavior for 4-byte ints is something VB is doing on the way to p-code or executable code. I'll have to check it out with WINDBG some day.

Thanks again...

Merrick

Paul,

Yes, please, ask away.

Best,
Merrick

Merrick

MichaelW,

It took me a while to figure out what was going on here, but if you take a closer look at my code you'll see that I intended the input and output values to be different memory locations. You have assumed (not unreasonably) the calculation is done in place and modified your code accordingly. The original code does work correctly, but not in place.
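The in-place failure can be reproduced with a small C model of the posted Endian8. The routine reverses the low four bytes into the upper half of the output first, so when input and output are the same address, the upper four input bytes are overwritten before they are read. The function names below are hypothetical, and endian8_alias_safe is just one possible fix (copy first, then write), not the approach used in MichaelW's versions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read four bytes into locals (as the assembly reads into registers),
   then store them reversed. */
void reverse4(const unsigned char *src, unsigned char *dst) {
    unsigned char b0 = src[0], b1 = src[1], b2 = src[2], b3 = src[3];
    dst[0] = b3; dst[1] = b2; dst[2] = b1; dst[3] = b0;
}

/* Same order of loads and stores as the posted Endian8. */
void endian8_as_posted(const unsigned char *in, unsigned char *out) {
    assert(in != NULL && out != NULL);
    reverse4(in, out + 4);      /* low half  -> output high half          */
    reverse4(in + 4, out);      /* high half -> output low half; if       */
                                /* out == in, in+4 was already clobbered  */
}

/* One way to tolerate in == out: snapshot the input before writing. */
void endian8_alias_safe(const unsigned char *in, unsigned char *out) {
    unsigned char tmp[8];
    assert(in != NULL && out != NULL);
    memcpy(tmp, in, sizeof tmp);
    for (int i = 0; i < 8; i++) out[i] = tmp[7 - i];
}
```

With the little-endian bytes of 1122334455667788 as input, calling endian8_as_posted with the same buffer for both arguments leaves 8877665555667788 in memory, matching the Endian8 line in the results; with separate buffers it gives the fully reversed 8877665544332211.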

Here are my results using your code:

Quote
Endian2:        00001122        00002211
_Endian2:       00002211        00001122
Endian2Array:   00001122        00002211
_Endian2Array:  00002211        00001122
Endian4:        11223344        44332211
_Endian4:       44332211        11223344
Endian4Array:   11223344        44332211
_Endian4Array:  44332211        11223344
Endian8:        1122334455667788        8877665555667788
_Endian8:       1122334455667788        8877665544332211
Endian8Array:   1122334455667788        8877665555667788
_Endian8Array:  8877665544332211        1122334455667788
4294967268 cycles, Endian2
4294966792 cycles, _Endian2
40 cycles, Endian2Array
32 cycles, _Endian2Array
4294967272 cycles, Endian4
4294966720 cycles, _Endian4
96 cycles, Endian4Array
4294967288 cycles, _Endian4Array
0 cycles, Endian8
4294966728 cycles, _Endian8
828 cycles, Endian8Array
28 cycles, _Endian8Array
Press any key to exit...
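A note on the very large counts in the listing above: they are what small negative numbers look like when printed as unsigned 32-bit values (4294967268 is 2^32 - 28). The sketch below shows the reinterpretation; the function name is hypothetical. Why the counts went negative is not established in the thread; one plausible cause, offered only as an assumption, is a timing macro subtracting a loop overhead that was larger than the measured time on this machine.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret a count printed as unsigned 32-bit as a signed value,
   without relying on implementation-defined conversions. */
int64_t as_signed_cycles(uint32_t printed) {
    int64_t result = (printed >= 0x80000000u)
                         ? (int64_t)printed - 4294967296LL
                         : (int64_t)printed;
    assert(result >= -2147483648LL && result <= 2147483647LL);
    return result;
}
```

Read this way, the listing's 4294967268, 4294966792 and 4294967288 are -28, -504 and -8 cycles, so those entries say nothing reliable about the routines' speed.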

MichaelW

Yes, when I change the test code to:

    print "Endian8:",9
    print uhex$(dword ptr _array8+4)
    print uhex$(dword ptr _array8),9
    invoke Endian8, ADDR _array8, ADDR _array8+8
    print uhex$(dword ptr _array8+8+4)
    print uhex$(dword ptr _array8+8),13,10
...
    print "Endian8Array:",9
    print uhex$(dword ptr __array8+4)
    print uhex$(dword ptr __array8),9
    invoke Endian8Array, ADDR __array8, ADDR __array8+8, 1
    print uhex$(dword ptr __array8+8+4)
    print uhex$(dword ptr __array8+8),13,10

Then the procedures do the conversion correctly. I did not actually examine your code in detail - I just tried to duplicate the function of the code. My versions will do the conversion in place or not. I assumed in place in the test code as a convenience, without realizing that doing so would trip up your code.