a counter of lines in mmx instructions

Started by ToutEnMasm, March 18, 2009, 09:20:18 AM

sinsi

Not looking at every post, I am wondering about the title (MMX) and jj (SSE2), but anyway...

Intel(R) Core(TM)2 Quad CPU    Q6600  @ 2.40GHz (SSE4)
Tests for correctness - 2*100 lines expected:
Mark Larson=    - /  - (throws exception)
jj2007=         100 / 100 lines
Lingo=          105 / 102 lines
ToutEnMasm=     105 / 102 lines

Counting lines of \masm32\include\windows.inc:

markl_CountFileLines (Mark Larson):
189     kilocycles for 22274 lines, 849788 bytes

getlinesJJ: (jj2007)
383     kilocycles for 22274 lines, 849788 bytes

getlines (Lingo):
493     kilocycles for 22274 lines, 849788 bytes

CompteurLignes: (ToutEnMasm)
1664    kilocycles for 22274 lines, 849788 bytes


edit: "there are mindreaders among us" ouch! that hits home jj. :bg
Light travels faster than sound, that's why some people seem bright until you hear them.

jj2007

Quote from: ToutEnMasm on March 20, 2009, 08:12:39 AM
I have gained a little time just by doing this:
the compare is made on 32 bytes instead of 16.
Memory is aligned to 16 by GlobalAlloc, so it is not necessary to realign it as getlinesJJ does.
The minimum size of the memory must be 32 bytes, or memory is read outside the buffer.
And the memory allocation must be rounded up to a multiple of 32 bytes.

The timing looks better now, only 4% slower than mine on my old P4. However, with your (and Lingo's) code you must add one more condition:

- There must not be any remainders from previous files in the buffer.

In other words, you need a fresh buffer for each file. Otherwise, the line count will be wrong if the second file is shorter and there are some CrLf's left after the zero terminator, as shown in the gltestA and gltestB strings.
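
To illustrate the stale-data problem with a made-up example (the gltestA and gltestB strings themselves are not reproduced here): the C sketch below loads a long text and then a shorter one into the same buffer. A scan that works on whole 32-byte blocks and ignores the zero terminator also sees the leftover CrLf's of the first text. All function names and test strings are assumptions for illustration only, not code from this thread.

```c
#include <stddef.h>
#include <string.h>

/* Count CR LF pairs, stopping at the zero terminator. */
size_t count_crlf_terminated(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (s[0] == '\r' && s[1] == '\n')
            ++n;
    return n;
}

/* Count CR LF pairs in the first `blocks` 32-byte blocks and ignore any
   zero terminator, the way a block-wise scan without an end check behaves. */
size_t count_crlf_blocks(const char *buf, size_t blocks)
{
    size_t n = 0, len = blocks * 32;
    for (size_t i = 0; i + 1 < len; ++i)
        if (buf[i] == '\r' && buf[i + 1] == '\n')
            ++n;
    return n;
}

/* Load a long text, then a shorter one, into the same 64-byte buffer. */
void load_twice(char buf[64])
{
    memset(buf, 0, 64);
    strcpy(buf, "one\r\ntwo\r\nthree\r\nfour\r\nfive\r\nsix\r\n");
    strcpy(buf, "a\r\nb\r\n");   /* overwrites only the first 7 bytes */
}
```

After `load_twice`, the terminator-aware count sees the 2 lines of the second text, while the block-wise count over the first 32 bytes also picks up 4 stale CrLf's behind the terminator.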

Re GlobalAlloc, MSDN: Memory allocated with this function is guaranteed to be aligned on an 8-byte boundary.

Since your code needs 16-byte alignment, this means some extra work, i.e. you must:
- allocate more space than the file length requires
- align the pointer to 16 bytes before loading the file, and
- keep a copy of the original pointer for GlobalFree.

My code does not require any of these conditions.
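
The three steps listed above could look roughly like this in C. This is only a sketch: `malloc` stands in for GlobalAlloc, and the names `AlignedBuf`, `aligned_buf_alloc` and `aligned_buf_free` are invented for illustration; they do not come from any of the posted sources.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper type: keeps both pointers together. */
typedef struct {
    void          *raw;   /* pointer returned by malloc, needed for free() */
    unsigned char *data;  /* same block, rounded up to a 16-byte boundary  */
} AlignedBuf;

/* Provide at least `size` usable bytes starting at a 16-byte boundary.
   malloc stands in for GlobalAlloc here (assumption). Returns 1 on success. */
int aligned_buf_alloc(AlignedBuf *b, size_t size)
{
    b->raw = malloc(size + 15);           /* step 1: allocate more than needed */
    if (b->raw == NULL) {
        b->data = NULL;
        return 0;
    }
    /* step 2: round the address up to the next multiple of 16 */
    uintptr_t p = ((uintptr_t)b->raw + 15) & ~(uintptr_t)15;
    b->data = (unsigned char *)p;
    return 1;
}

void aligned_buf_free(AlignedBuf *b)
{
    free(b->raw);                         /* step 3: free the ORIGINAL pointer */
    b->raw = NULL;
    b->data = NULL;
}
```

The file would then be loaded at `data`, while `raw` is kept only for the final free.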

ToutEnMasm


No problem with a possible second file reloaded into the same memory.
I never reuse the same allocated memory for another file, and I put a zero at the end of the file.
I don't want to use this in any other case. Reading outside the buffer is only acceptable if the size of the memory is known.
That is not always the case.



ToutEnMasm


I hope this is the last version.
This one can be used anywhere, without the risk of an outside read and a stupid crash.

Quote
CompteurLignes PROC uses ebx edi esi pmem:DWORD,taille:DWORD
         Local  Nblines:DWORD,count,reste
         local  theEnd:dword
   ;init
   mov Nblines,0
   mov reste,0   
   mov edx,pmem
   add edx,taille
   mov theEnd,edx
   mov edx,pmem
   mov esi,edx   
   and edx,0Fh
   .if edx != 0
      ;count line breaks in the unaligned leading bytes
      mov ecx,16
      sub ecx,edx
      @@:
      .if byte ptr [esi] != 0
         .if word ptr [esi] == 0A0Dh
            inc Nblines         
         .endif
      .else
         mov eax,Nblines
         jmp FindeCompteurLignes
      .endif
      inc esi
      dec ecx
      jnz @B      
   .endif
   ;esi now points to 16-byte aligned memory
   ;count the number of 32-byte blocks
   mov edx,0
   mov eax,theEnd
   sub eax,esi
   .if eax == 0
      mov eax,Nblines
      jmp FindeCompteurLignes      
   .endif
   .if eax < 32
      mov reste,eax
      mov eax,Nblines      
      jmp EndNonaligned
   .endif
   mov edx,0
   mov ecx,32
   div ecx
   mov count,eax      
   mov reste,edx
   ;--------------------------  search in aligned part -------------      
   ;init of various register
   mov eax, 0d0d0d0dh   ; four copies of ASCII 13 (0Dh), carriage return
   movd xmm6, eax
   pshufd xmm6, xmm6, 0   ; broadcast: 16 CR bytes for comparison in xmm6
   mov eax,Nblines   ;line counter
   ;ready
   NewBloc:
      ;------ 16-byte alignment required ----------
      ;1731187 cycles for 22274 lines
      movdqa xmm1,xmm6         ;load the CR (13) pattern
      movdqa xmm2,xmm6         ;load the CR (13) pattern
      pcmpeqb  xmm1,[esi]      ;compare with memory, first 16 aligned bytes
      pcmpeqb  xmm2,[esi+16]      ;compare with memory, second 16 aligned bytes
      pmovmskb ecx, xmm1 ; byte mask of the first 16 bytes in ecx
      pmovmskb edx, xmm2 ; byte mask of the second 16 bytes in edx
      shl edx,16
      add ecx,edx
      jz suite
      NbLineBreak:
      bsf   edx,   ecx
      jz suite
      .if    word   ptr [edx+esi] == 0A0Dh
         inc   eax
      .endif
      btr   ecx,   edx
      jmp NbLineBreak
   suite:
   lea esi,[esi+32]
   dec count
   jnz NewBloc
   
EndNonaligned:   
   .if reste != 0
      mov ecx,reste
      @@:
      .if word ptr [esi] == 0A0Dh
         inc eax
      .endif   
      inc esi
      dec ecx
      jnz @B                  
   .endif
   
FindeCompteurLignes:
ret
CompteurLignes endp
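
For readers who find C easier to follow, here is a rough sketch of the same idea as the proc above: compare 32 bytes at a time against a CR pattern, build a 32-bit mask, and check each hit for a following LF. It is not a translation of the MASM code. It uses unaligned loads instead of the separate unaligned prefix loop, it takes a length instead of relying on a zero terminator, and the name `count_crlf` is made up. On a compiler without SSE2 only the scalar loop is used.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Count CR LF pairs in buf[0..len). Only bytes inside the buffer are read. */
size_t count_crlf(const unsigned char *buf, size_t len)
{
    size_t n = 0, i = 0;
#ifdef HAVE_SSE2
    const __m128i cr = _mm_set1_epi8('\r');       /* 16 copies of 0Dh */
    /* Require more than 32 bytes left so that buf[i + 32] is readable. */
    while (len - i > 32) {
        __m128i a = _mm_loadu_si128((const __m128i *)(buf + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(buf + i + 16));
        uint32_t mask = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(a, cr))
                      | ((uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(b, cr)) << 16);
        while (mask != 0) {
            unsigned pos = 0;
            while (((mask >> pos) & 1u) == 0)     /* portable stand-in for bsf */
                ++pos;
            if (buf[i + pos + 1] == '\n')         /* CR must be followed by LF */
                ++n;
            mask &= mask - 1;                     /* clear lowest set bit, like btr */
        }
        i += 32;
    }
#endif
    /* Scalar tail (or the whole buffer when SSE2 is unavailable). */
    for (; i + 1 < len; ++i)
        if (buf[i] == '\r' && buf[i + 1] == '\n')
            ++n;
    return n;
}
```

Counting at the CR rather than at the LF means a pair that straddles two 32-byte blocks is still counted exactly once.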


UtillMasm

Hi jj2007,

C:\ml /c /coff /nologo CountLinesSSE2.asm
Assembling: CountLinesSSE2.asm
CountLinesSSE2.asm(3) : fatal error A1000: cannot open file : \masm32\include\CpuId.inc

=====================
I need your file.
Help me!!!

jj2007

Quote from: UtillMasm on March 20, 2009, 12:35:18 PM
fatal error A1000: cannot open file : \masm32\include\CpuId.inc
I need your file.

Attached, together with an update that includes ToutEnMasm's latest version (it works fine, congrats).

100+2 means 2 malformed strings found, i.e. LF only.
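
A small C sketch of how such a split count could be produced, assuming the notation reads as "CrLf line ends + LF-only line ends". The struct and function names are hypothetical and the code is not taken from the attachment.

```c
#include <stddef.h>

typedef struct {
    size_t crlf;     /* well-formed CR LF line ends           */
    size_t lf_only;  /* LF not preceded by CR ("malformed")   */
} LineEnds;

/* Classify every LF in buf[0..len) by the byte in front of it. */
LineEnds count_line_ends(const char *buf, size_t len)
{
    LineEnds r = {0, 0};
    for (size_t i = 0; i < len; ++i) {
        if (buf[i] != '\n')
            continue;
        if (i > 0 && buf[i - 1] == '\r')
            ++r.crlf;
        else
            ++r.lf_only;
    }
    return r;
}
```

A routine that only looks for the word 0A0Dh reports just the first number, while one that counts every LF reports the sum of both.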

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Tests for correctness - 100+2/100+6 lines expected,
first string 5-byte misaligned:
Mark Larson=    - /  - (throws exception)
jj2007=         100+2 / 100+6 lines
Lingo=          - / 102 lines
ToutEnMasm=     100 / 100 lines

Codesizes:
Mark Larson =           2104
getlinesJJ =            177
getlines Lingo =        191
CompteurLignes =        237

Counting lines of \masm32\include\windows.inc:

markl_CountFileLines (Mark Larson):
437     kilocycles for 22272 lines, 849759 bytes

getlinesJJ: (jj2007)
1120    kilocycles for 22272 lines, 849759 bytes

getlines (Lingo):
1306    kilocycles for 22272 lines, 849759 bytes

CompteurLignes: (ToutEnMasm)
1344    kilocycles for 22272 lines, 849759 bytes

[attachment deleted by admin]

jj2007

Just for fun, I tried WinExtra.inc instead of Windows.inc, and found one more bug.
Somewhat polished code attached. Both ToutEnMasm's and my version seem to work just fine.

Intel(R) Pentium(R) 4 CPU 3.40GHz (SSE3)
Tests for correctness - 100 / 100 lines expected,
first string 5-byte misaligned:
Mark Larson=    --- / --- (throws exception)
jj2007=         100 / 100 lines
Lingo=          --- / 102 lines
ToutEnMasm=     100 / 100 lines

Codesizes:
Mark Larson =           2104
getlinesJJ =            155
getlines Lingo =        191
CompteurLignes =        237

Counting lines of \masm32\include\winextra.inc:

markl_CountFileLines (Mark Larson):
372     kilocycles for 20001 lines, 807877 bytes    <------------- INCORRECT COUNT ------

getlinesJJ: (jj2007)
903     kilocycles for 20025 lines, 807877 bytes

getlines (Lingo):
1083    kilocycles for 20025 lines, 807877 bytes

CompteurLignes: (ToutEnMasm)
1120    kilocycles for 20025 lines, 807877 bytes

[attachment deleted by admin]

ToutEnMasm


The worst text format that I know of is .H header files.
They are full of extra characters.

I have also run some tests.
After the search for speed, I took a moment for a crash test.

test with a sznull db 0  ;len 1
test with this text:
Quote
Texte  db 13,10,13,10,13,10,13,10
   db "   windowsinc   FichMem <>",13,10
   db "InfosFichiers WIN32_FIND_DATA <>",13,10
db 13,10
db "   ;procéder à des essais sous surveillances",13,10
db "   ;en cas d'exception,si la pile n'est pas détruite",13,10
db "      ;------- ajouter des feuilles SDI,utiliser menu CODE --> créer SDI",13,10
db "               invoke ChargerFichierMem,SADR(\masm32\include\windows.inc),addr windowsinc",13,10
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
code:
Quote
   invoke lstrlen,addr Texte
   mov edx,eax
   invoke getlinesJJ,addr Texte,edx   

My latest version passes all tests without problems; I have taken care of this.
The others need some small changes.

jj2007

Quote from: ToutEnMasm on March 20, 2009, 04:19:23 PM

test with a sznull db 0  ;len 1
You mean len 0? No problem.

Quote
test with this text:
Quote
Texte  db 13,10,13,10,13,10,13,10
   db "   windowsinc   FichMem <>",13,10
   db "InfosFichiers WIN32_FIND_DATA <>",13,10
db 13,10
db "   ;procéder à des essais sous surveillances",13,10
db "   ;en cas d'exception,si la pile n'est pas détruite",13,10
db "      ;------- ajouter des feuilles SDI,utiliser menu CODE --> créer SDI",13,10
db "               invoke ChargerFichierMem,SADR(\masm32\include\windows.inc),addr windowsinc",13,10
db 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
code:
Quote
   invoke lstrlen,addr Texte
   mov edx,eax
   invoke getlinesJJ,addr Texte,edx   

My latest version passes all tests without problems; I have taken care of this.
The others need some small changes.

Which are the others that need changes? Lingo's code crashes if it is not 16-byte aligned, but otherwise his code yields the same result as yours and mine:


Texte, getlinesJJ=              11 lines
Texte, getlines Lingo=          11 lines
Texte, CompteurLignes=          11 lines

ToutEnMasm


Run more tests, adding some data before the text.

Quote
For the text:
STACK_TEXT:
0012ffb4 00401057 00404411 00000145 7c817067 minus!getlinesJJ+0x68 [F:\lignes\sse2.inc @ 318]
0012fff0 00000000 00401040 00000000 78746341 minus!start+0x17 [F:\lignes\minus.asm @ 83]


FAULTING_SOURCE_CODE: 
   314:    pcmpeqb xmm0, [edi]         ; compare packed bytes in [m128] and xmm0 for equality
   315:    pmovmskb edx, xmm0      ; set byte mask in edx for first 16 byte chunk
   316:
   317:    movdqa xmm0, xmm2         ; linefeeds in xmm0 & xmm1
>  318:    pcmpeqb xmm0, [edi+16]   ; compare packed bytes in [m128] and xmm0 for equality
   319:    pmovmskb ecx, xmm0         ; set byte mask in edx for second 16 byte chunk
   320:
   321:    lea edi, [edi+32]      ; point to next chunk
   322:    cmp edi, esi            ; test boundary
   323:    jae L1

For the sznull (len 0 or 1), the same thing happens:
Quote
(d0c.c20): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000000 ebx=7ffd5000 ecx=00000000 edx=00000000 esi=00531047 edi=00132ff0
eip=00401b66 esp=0012ffa0 ebp=0012ffb4 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010246
*** WARNING: Unable to verify checksum for minus.exe
minus!getlinesJJ+0x68:
00401b66 660f744710      pcmpeqb xmm0,xmmword ptr [edi+10h] ds:0023:00133000=????
0:000> !analyze -v

FAULTING_IP:
minus!getlinesJJ+68 [F:\lignes\sse2.inc @ 318]
00401b66 660f744710      pcmpeqb xmm0,xmmword ptr [edi+10h]

EXCEPTION_RECORD:  ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 00401b66 (minus!getlinesJJ+0x00000068)
   ExceptionCode: c0000005 (Access violation)
  ExceptionFlags: 00000000
NumberParameters: 2
   Parameter[0]: 00000000
   Parameter[1]: 00133000
Attempt to read from address 00133000

FAULTING_THREAD:  00000c20

DEFAULT_BUCKET_ID:  INVALID_POINTER_READ

PROCESS_NAME:  minus.exe

ERROR_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".

EXCEPTION_PARAMETER1:  00000000

EXCEPTION_PARAMETER2:  00133000

READ_ADDRESS:  00133000

FOLLOWUP_IP:
minus!getlinesJJ+68 [F:\lignes\sse2.inc @ 318]
00401b66 660f744710      pcmpeqb xmm0,xmmword ptr [edi+10h]

NTGLOBALFLAG:  70

APPLICATION_VERIFIER_FLAGS:  0

PRIMARY_PROBLEM_CLASS:  INVALID_POINTER_READ

BUGCHECK_STR:  APPLICATION_FAULT_INVALID_POINTER_READ

LAST_CONTROL_TRANSFER:  from 00401057 to 00401b66

STACK_TEXT: 
0012ffb4 00401057 00404410 00000000 7c817067 minus!getlinesJJ+0x68 [F:\lignes\sse2.inc @ 318]
0012fff0 00000000 00401040 00000000 78746341 minus!start+0x17 [F:\lignes\minus.asm @ 83]


FAULTING_SOURCE_CODE: 
   314:    pcmpeqb xmm0, [edi]         ; compare packed bytes in [m128] and xmm0 for equality
   315:    pmovmskb edx, xmm0      ; set byte mask in edx for first 16 byte chunk
   316:
   317:    movdqa xmm0, xmm2         ; linefeeds in xmm0 & xmm1
>  318:    pcmpeqb xmm0, [edi+16]   ; compare packed bytes in [m128] and xmm0 for equality
   319:    pmovmskb ecx, xmm0         ; set byte mask in edx for second 16 byte chunk
   320:
   321:    lea edi, [edi+32]      ; point to next chunk
   322:    cmp edi, esi            ; test boundary
   323:    jae L1


SYMBOL_STACK_INDEX:  0

SYMBOL_NAME:  minus!getlinesJJ+68

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: minus

IMAGE_NAME:  minus.exe

DEBUG_FLR_IMAGE_TIMESTAMP:  49c3d30d

STACK_COMMAND:  ~0s ; kb

FAILURE_BUCKET_ID:  INVALID_POINTER_READ_c0000005_minus.exe!getlinesJJ

BUCKET_ID:  APPLICATION_FAULT_INVALID_POINTER_READ_minus!getlinesJJ+68

WATSON_STAGEONE_URL:  http://watson.microsoft.com/StageOne/minus_exe/0_25_5_2005/49c3d30d/minus_exe/0_25_5_2005/49c3d30d/c0000005/00001b66.htm?Retriage=1

Followup: MachineOwner
---------






jj2007

Quote from: ToutEnMasm on March 20, 2009, 05:36:13 PM

Run more tests, adding some data before the text.


Please make that test with the current version posted some hours ago. Or post the string table, with alignment info, so that I can run it myself. Or even better, put source and executable into a zip file and post it here.

ToutEnMasm


Found it. I had a problem with the
Quote
OPTION PROLOGUE:none
OPTION EPILOGUE:none
Just a waste of time.
I only use standard procs, and it was not written that these options are needed.


jj2007

Quote from: ToutEnMasm on March 20, 2009, 06:34:28 PM

Found it. I had a problem with the
Quote
OPTION PROLOGUE:none
OPTION EPILOGUE:none
Just a waste of time.
I only use standard procs, and it was not written that these options are needed.


Yeah, I know it's a bad habit to remove the stack frame. But with such a simple algo, I couldn't resist. And it has its advantages, too :bg

Intel(R) Celeron(R) M CPU        420  @ 1.60GHz (SSE3)
Codesizes:

getlinesJJ =            155
CompteurLignes =        237

Counting lines of \masm32\include\winextra.inc:

getlinesJJ: (jj2007)
455     kilocycles for 20025 lines, 807877 bytes

CompteurLignes: (ToutEnMasm)
804     kilocycles for 20025 lines, 807877 bytes

ToutEnMasm


If I look at the results further up:
Quote
getlinesJJ: (jj2007)
903     kilocycles for 20025 lines, 807877 bytes

getlines (Lingo):
1083    kilocycles for 20025 lines, 807877 bytes

Can two different machines give such different results?



jj2007

Quote from: ToutEnMasm on March 20, 2009, 06:58:26 PM

If I look at the results further up:
Quote
getlinesJJ: (jj2007)
903     kilocycles for 20025 lines, 807877 bytes

getlines (Lingo):
1083    kilocycles for 20025 lines, 807877 bytes

Can two different machines give such different results?


Yes, that's not unusual. The P4 is a lot slower, and relative differences are smaller. My Celeron M runs getlinesJJ at 450, and Lingo's version at 600 kilocycles. It's a Core (not Core 2) CPU. Lingo's AMD might favour his own algo again: in the szLen thread, his code was marginally (1%) slower for very long strings on my Celeron but significantly (20%+) faster on several other CPUs. These differences make optimisation increasingly difficult.