News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

CCITT fax G3 compression

Started by bashir1961, February 16, 2012, 01:56:38 PM

Previous topic - Next topic

bashir1961

Hello everybody,
I am trying to optimize a C routine for compressing an image in TIFF fax G3 CCITT format.
The C routine is well optimized (it comes from www.libtiff.org), so I do not think there is much to do in C and so I am trying to code some "hot spot" routine in assembler.
The application essentially does a run length encoding at bit level, and profiling it it seems it spends most of its time in two routines, the first one counts the sequence of "0" (find0span for reference if you want to see the C source code, file is tif_fax3.c of the above mentioned libtiff), the other one counts the sequence of "1"s.
As I said before the routine is well optimized but the optimization is targeted at average images, i.e. with a long sequence of 0 / 1, and so tries to read 4 bytes at once etc.etc.
However we are experiencing troubles with images that seem designed to break the compressor, for instance two black bits, than a zero, two black bits again and so on.

So I started studying bit operations in x86; I discovered that the BT operation supports also memory operation, i.e. it can address a single bit in memory seen as a bit array, at least I understood so.

This is my first try, written with the help of __asm keyword of visual studio:
the routine take as first parameter the address of the memory array to scan, bit start and bit end.

However I must be missing something really obvious because it behaves as it tested the pointer, and not the memory pointed to.

INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
    INT32 result;
    __asm
    {
        xor eax,eax
        mov ecx,bs
inizioloop:
        bt DWORD PTR indice, ecx
        jc  uscitaloop
        add eax, 1
        add ecx, 1
        cmp ecx, be
        jne inizioloop
uscitaloop:
        mov result, eax
    }
    return result;
}

Thany you and best regards

qWord

hi,
pointers must be loaded into a register. Also take a look at the BSF/BSR, POPCNT and LZCNT instruction (POPCNT and LZCNT are only available on newer CPUs).
mov edx,indice ; edx is the pointer
bsf eax,DWORD ptr [edx]
jz @nobitset
;eax = index of first set bit


FPU in a trice: SmplMath
It's that simple!

bashir1961

Thanks qword

I changed to the version below and it seems to work, it seems very ssslooowww, I will make some more serious timing test.
I considered the bsf/bsr but I must count a sequence of bits starting from an arbitrary bit position and so they seemed to me not useful for this job.


INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
    INT32 result;
    __asm
    {
        xor eax,eax
        mov ecx,bs
        mov edi, indice
inizioloop:
        bt [edi], ecx
        jc  uscitaloop
        add eax, 1
        add ecx, 1
        cmp ecx, be
        jne inizioloop
uscitaloop:
        mov result, eax
    }
    return result;
}

dedndave

there are a few ways to speed that up
one thing i see right away is...

for each pass of the loop, you access the same memory location, over and over
        bt [edi], ecx

it would be best to load that value into EDX before the loop
        mov     edx,[edi]

then, inside the loop, you BT on EDX
        bt      edx,ecx

but - there are some tricks that can really speed this up
my understanding is that you want to count set bits in a given range
once a cleared bit or if the end of the range is encountered, you stop

you could do away with the loop, i think
let's play.....
        mov     ecx,bs     ;4
        mov     edx,[edi]  ;01110101
        shr     edx,cl     ;0111
        inc     edx        ;1000 - notice that all successive 1's are now 0, and the first 0 is now a 1
        bsf     eax,edx


you get the idea - should be much faster than a loop

bashir1961

Thanks dedndave,

well, I am not sure I understood correctly the BT instruction, It seemed interesting because it allows (but does it ?) to index an arbitrary bit in memory, i.e. in this example ecx can be greater than 32 and refer to successive memory positions.
So I wanted to avoid bit base / bit offset computations, etc.etc.

I will study your example, thanks a lot, if I get it right I should include it in a loop on all the memory locations, i.e. (bitend - bitstart) / 32 or something like that.
You are right, I need to count sequence of set bits in a given range but not limited to 0..31, the bit string can span theoretically several Mbytes.

dedndave

well - you can test the dword to see if it is -1 (all 1's) at the beginning
if it is, you add 32 to the total count and go to the next dword
if it is not, you know you have a cleared bit

a simple way to test for -1 is to add 1 and test for 0
        mov     edx,[edi]
        add     edx,1
        jz      is_all_ones

        dec     edx
;scan for cleared bit


oh - another way to test for a cleared bit is to invert the dword, then scan for a set bit

Tedd

If you're going for speed, BT probably isn't the best idea - it's quite slow.

As for avoiding bit-base/offsets, it's not really an issue. Any dword is either: all 1s, all 0s, or a mixture. A simple CMP can check for the first two, which results in your bit-count increasing by 32 and you continue onto the next dword; or it's a mixture and you count the individual bits as you would have to do in some form anyway.

For counting runs of bits, BSR would be nice but gives the first bit set (1), so you NOT the dword first, then BSR will effectively tell you how many bits were 1s; to continue for the rest of the dword you'll need to SHL by the number of bits, and also keep a count of the total bits counted in the dword, so you don't go over 32.
No snowflake in an avalanche feels responsible.

qWord

I've played a bit with it: the following proc scans  for zero-bits in given bit range. The buffer's size must be a multiple of 4 and should be aligned to 4. The bit indexes are zero based:
invoke zero_scan,...,0,31 ; scans a whole DWORD
invoke zero_scan,...,63,63; test bit 63


zero_scan proc uses edi esi ebx pBits:PCHAR,bs:DWORD,be:DWORD
LOCAL ns:DWORD

mov esi,pBits
mov edx,bs
mov edi,be
sub edi,edx ; edi = number of bits to scan
inc edi

xor ecx,ecx
and edx,31  ; edx = bit index
mov ebx,edx
jz @1
mov ecx,31
sub ecx,edx ; ecx = reserved index
lea edx,[ecx+1]
mov ns,edx
mov eax,80000000h
sar eax,cl ; create bitmask

xor ecx,ecx

mov edx,[esi]
and edx,eax ; mask bits
lea esi,[esi+4] ; update offset
mov ecx,ns
jz @1
mov ecx,ebx
neg ecx
jnz @bsf

align 16
@1: mov edx,[esi]
test edx,edx
jnz @bsf
add ecx,32
add esi,4
cmp ecx,edi
jb @1

mov eax,edi
ret

align 16
@bsf:
bsf eax,edx
add ecx,eax
cmp ecx,edi
cmovae eax,edi
cmovb eax,ecx
ret

zero_scan endp

The return value is the number of contiguous zero-bits starting from bs.
This procedure could be improved ...

qWord
FPU in a trice: SmplMath
It's that simple!

MichaelW

#8
Quote from: bashir1961 on February 16, 2012, 04:02:59 PM
well, I am not sure I understood correctly the BT instruction, It seemed interesting because it allows (but does it ?) to index an arbitrary bit in memory, i.e. in this example ecx can be greater than 32 and refer to successive memory positions.
So I wanted to avoid bit base / bit offset computations, etc.etc.

It does.

Edit: Added code to get an idea of how slow the BT instruction is.


;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
    bitarray db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
             db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
             db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
             db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
             db 0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0
.code
;==============================================================================
start:

    mov esi, OFFSET bitarray
    xor ebx, ebx
    .WHILE ebx < 640
        test ebx, 7
        jnz @F
        printf("\n")
      @@:
        bt [esi], ebx
        mov eax, 0
        adc eax, 0
        printf("%d", eax)
        inc ebx
    .ENDW
    printf("\n\n")

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    invoke Sleep, 3000

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            xor ebx, ebx
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 4
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 8
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 12
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            mov ebx, 1
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 5
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 9
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 13
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            xor ebx, ebx
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 128
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 256
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 512
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            xor ebx, ebx
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 129
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 257
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 513
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    inkey
    exit
;==============================================================================
end start


00000000
10000000
01000000
11000000
00100000
10100000
01100000
11100000
00010000
10010000
01010000
11010000
00110000
10110000
01110000
11110000
11110000
01110000
10110000
00110000
11010000
01010000
10010000
00010000
11100000
01100000
10100000
00100000
11000000
01000000
10000000
00000000
00000000
10000000
01000000
11000000
00100000
10100000
01100000
11100000
00010000
10010000
01010000
11010000
00110000
10110000
01110000
11110000
11110000
01110000
10110000
00110000
11010000
01010000
10010000
00010000
11100000
01100000
10100000
00100000
11000000
01000000
10000000
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000

Typical cycle counts on a P3:

21 cycles
21 cycles
21 cycles

21 cycles
21 cycles
21 cycles

21 cycles
21 cycles
21 cycles

21 cycles
22 cycles
21 cycles


Typical cycle counts on a P4 (Northwood AFAIK):

46 cycles
44 cycles
43 cycles

54 cycles
47 cycles
47 cycles

44 cycles
43 cycles
44 cycles

44 cycles
43 cycles
45 cycles


Using BT would simplify the bit indexing code, but I have doubts that the code would be enough simpler to make up for the slowness of the instruction and the additional code necessary to get/test the value of the indexed bit.

In the tif_fax3.c source, the find0span and find1span functions contain complex code that involves lookup tables. Is your goal to duplicate the results of these functions?

BTW, if you're going to use inline assembly in a function that will be called frequently, you can save a few cycles by eliminating the prologue and epilogue code.

//=============================================================================
#include <windows.h>
#include <conio.h>
#include <stdio.h>
// counter_c.c is available here:
//   http://www.masm32.com/board/index.php?topic=13100.msg101673#msg101673
#include "counter_c.c"
//=============================================================================

INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
    INT32 result;
    __asm
    {
        xor eax, eax
        mov ecx, bs
        mov edi, indice
inizioloop:
        bt [edi], ecx
        jc  uscitaloop
        add eax, 1
        add ecx, 1
        cmp ecx, be
        jne inizioloop
uscitaloop:
        mov result, eax
    }
    return result;
}

__declspec( naked ) INT32 find0spannaked(u_char* indice, INT32 bs, INT32 be)
{
    __asm
    {
        xor eax, eax
        mov ecx, [esp+8]
        mov edi, [esp+4]
inizioloop:
        bt [edi], ecx
        jc  uscitaloop
        add eax, 1
        add ecx, 1
        cmp ecx, [esp+12]
        jne inizioloop
uscitaloop:
        ret
    }
}

void main( void )
{
    INT32 r;
    u_char test = 0x10;

    printf("%d\n",find0spanasm(&test,0,7));

    printf("%d\n",find0spannaked(&test,0,7));

    Sleep(4000);

    CTR_BEGIN( 1, 1000, HIGH_PRIORITY_CLASS );
    CTR_END( 1 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 2, 1000, HIGH_PRIORITY_CLASS );
        r = find0spanasm(&test,0,7);
    CTR_END( 2 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 3, 1000, HIGH_PRIORITY_CLASS );
        r = find0spannaked(&test,0,7);
    CTR_END( 3 )
    printf( "%d cycles\n", ctr_cycles );

    getch();
}


Running on a P3:

4
4
0 cycles
54 cycles
44 cycles




eschew obfuscation

bashir1961

Hello,
first of all I wanted to thank you all of this forum for the support, I'm really impressed, my understanding of x86 tricks and traps has improved and I have many examples to study, thank you again.
@MichaelW:
yes, the idea was to evaluate a reimplementation of find0span, find1span in assembler;
they are complex functions since they are heavily optimized toward very long scans, at least I suppose so.
The situation in which I need improvement is the opposite, like a sequence of 0xF0, 0XF0 etc, a few bits on, a few bits off and so on.

@qWord:
thank for the code, i will study your example, will keep you informed, thanks again

dedndave

QuoteThe situation in which I need improvement is the opposite, like a
sequence of 0xF0, 0XF0 etc, a few bits on, a few bits off and so on

that sounds like a much more likely case in an average fax   :P
but, you can design a data-set that tests both short and long
also, MichaelW's timing macros are what many of us use to time code...
http://www.masm32.com/board/index.php?topic=770.0

attached is an example of using the timers...

MichaelW

I just noticed that this bit of code:

add ecx, 1
cmp ecx, be
jne inizioloop


Will stop short of using index be. I think the JNE should be a JNA.
eschew obfuscation