Text only | Text with Images

The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: bashir1961 on February 16, 2012, 01:56:38 PM

Title: CCITT fax G3 compression
Post by: bashir1961 on February 16, 2012, 01:56:38 PM

Hello everybody,
I am trying to optimize a C routine for compressing an image in TIFF fax G3 CCITT format.
The C routine is well optimized (it comes from www.libtiff.org), so I do not think there is much to do in C and so I am trying to code some "hot spot" routine in assembler.
The application essentially does a run length encoding at bit level, and profiling it it seems it spends most of its time in two routines, the first one counts the sequence of "0" (find0span for reference if you want to see the C source code, file is tif_fax3.c of the above mentioned libtiff), the other one counts the sequence of "1"s.
As I said before the routine is well optimized but the optimization is targeted at average images, i.e. with a long sequence of 0 / 1, and so tries to read 4 bytes at once etc.etc.
However we are experiencing troubles with images that seem designed to break the compressor, for instance two black bits, than a zero, two black bits again and so on.

So I started studying bit operations in x86; I discovered that the BT operation supports also memory operation, i.e. it can address a single bit in memory seen as a bit array, at least I understood so.

This is my first try, written with the help of __asm keyword of visual studio:
the routine take as first parameter the address of the memory array to scan, bit start and bit end.

However I must be missing something really obvious because it behaves as it tested the pointer, and not the memory pointed to.

INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
INT32 result;
__asm
{
xor eax,eax
mov ecx,bs
inizioloop:
bt DWORD PTR indice, ecx
jc uscitaloop
add eax, 1
add ecx, 1
cmp ecx, be
jne inizioloop
uscitaloop:
mov result, eax
}
return result;
}

Thany you and best regards

Title: Re: CCITT fax G3 compression
Post by: qWord on February 16, 2012, 02:37:45 PM

hi,
pointers must be loaded into a register. Also take a look at the BSF/BSR, POPCNT and LZCNT instruction (POPCNT and LZCNT are only available on newer CPUs).

Code Select

mov edx,indice ; edx is the pointer
bsf eax,DWORD ptr [edx]
jz @nobitset
;eax = index of first set bit

Title: Re: CCITT fax G3 compression
Post by: bashir1961 on February 16, 2012, 02:48:11 PM

Thanks qword

I changed to the version below and it seems to work, it seems very ssslooowww, I will make some more serious timing test.
I considered the bsf/bsr but I must count a sequence of bits starting from an arbitrary bit position and so they seemed to me not useful for this job.

INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
INT32 result;
__asm
{
xor eax,eax
mov ecx,bs
mov edi, indice
inizioloop:
bt [edi], ecx
jc uscitaloop
add eax, 1
add ecx, 1
cmp ecx, be
jne inizioloop
uscitaloop:
mov result, eax
}
return result;
}

Title: Re: CCITT fax G3 compression
Post by: dedndave on February 16, 2012, 03:47:27 PM

there are a few ways to speed that up
one thing i see right away is...

for each pass of the loop, you access the same memory location, over and over

Code Select

bt [edi], ecx

it would be best to load that value into EDX before the loop

Code Select

mov edx,[edi]

then, inside the loop, you BT on EDX

Code Select

bt edx,ecx

but - there are some tricks that can really speed this up
my understanding is that you want to count set bits in a given range
once a cleared bit or if the end of the range is encountered, you stop

you could do away with the loop, i think
let's play.....

Code Select

        mov     ecx,bs     ;4
        mov     edx,[edi]  ;01110101
        shr     edx,cl     ;0111
        inc     edx        ;1000 - notice that all successive 1's are now 0, and the first 0 is now a 1
        bsf     eax,edx

you get the idea - should be much faster than a loop

Title: Re: CCITT fax G3 compression
Post by: bashir1961 on February 16, 2012, 04:02:59 PM

Thanks dedndave,

well, I am not sure I understood correctly the BT instruction, It seemed interesting because it allows (but does it ?) to index an arbitrary bit in memory, i.e. in this example ecx can be greater than 32 and refer to successive memory positions.
So I wanted to avoid bit base / bit offset computations, etc.etc.

I will study your example, thanks a lot, if I get it right I should include it in a loop on all the memory locations, i.e. (bitend - bitstart) / 32 or something like that.
You are right, I need to count sequence of set bits in a given range but not limited to 0..31, the bit string can span theoretically several Mbytes.

Title: Re: CCITT fax G3 compression
Post by: dedndave on February 16, 2012, 04:15:59 PM

well - you can test the dword to see if it is -1 (all 1's) at the beginning
if it is, you add 32 to the total count and go to the next dword
if it is not, you know you have a cleared bit

a simple way to test for -1 is to add 1 and test for 0

Code Select

        mov     edx,[edi]
        add     edx,1
        jz      is_all_ones

        dec     edx
;scan for cleared bit

oh - another way to test for a cleared bit is to invert the dword, then scan for a set bit

Title: Re: CCITT fax G3 compression
Post by: Tedd on February 16, 2012, 05:30:58 PM

If you're going for speed, BT probably isn't the best idea - it's quite slow.

As for avoiding bit-base/offsets, it's not really an issue. Any dword is either: all 1s, all 0s, or a mixture. A simple CMP can check for the first two, which results in your bit-count increasing by 32 and you continue onto the next dword; or it's a mixture and you count the individual bits as you would have to do in some form anyway.

For counting runs of bits, BSR would be nice but gives the first bit set (1), so you NOT the dword first, then BSR will effectively tell you how many bits were 1s; to continue for the rest of the dword you'll need to SHL by the number of bits, and also keep a count of the total bits counted in the dword, so you don't go over 32.

Title: Re: CCITT fax G3 compression
Post by: qWord on February 16, 2012, 05:45:20 PM

I've played a bit with it: the following proc scans for zero-bits in given bit range. The buffer's size must be a multiple of 4 and should be aligned to 4. The bit indexes are zero based:

Code Select

invoke zero_scan,...,0,31 ; scans a whole DWORD
invoke zero_scan,...,63,63; test bit 63

Code Select

zero_scan proc uses edi esi ebx pBits:PCHAR,bs:DWORD,be:DWORD
LOCAL ns:DWORD

	mov esi,pBits
	mov edx,bs
	mov edi,be
	sub edi,edx 	; edi = number of bits to scan
	inc edi
	
	xor ecx,ecx
	and edx,31  	; edx = bit index
	mov ebx,edx
	jz @1
	mov ecx,31
	sub ecx,edx 	; ecx = reserved index
	lea edx,[ecx+1]
	mov ns,edx
	mov eax,80000000h
	sar eax,cl		; create bitmask

	xor ecx,ecx

	mov edx,[esi]
	and edx,eax			; mask bits
	lea esi,[esi+4]		; update offset
	mov ecx,ns
	jz @1
	mov ecx,ebx
	neg ecx
	jnz @bsf
	
	align 16
@1:	mov edx,[esi]
	test edx,edx
	jnz @bsf
	add ecx,32
	add esi,4
	cmp ecx,edi
	jb @1
	
	mov eax,edi
	ret
	
	align 16
@bsf:	
	bsf eax,edx
	add ecx,eax
	cmp ecx,edi
	cmovae eax,edi
	cmovb eax,ecx
	ret
	
zero_scan endp

The return value is the number of contiguous zero-bits starting from bs.
This procedure could be improved ...

qWord

Title: Re: CCITT fax G3 compression
Post by: MichaelW on February 17, 2012, 07:13:39 AM

Quote from: bashir1961 on February 16, 2012, 04:02:59 PM
well, I am not sure I understood correctly the BT instruction, It seemed interesting because it allows (but does it ?) to index an arbitrary bit in memory, i.e. in this example ecx can be greater than 32 and refer to successive memory positions.
So I wanted to avoid bit base / bit offset computations, etc.etc.

It does.

Edit: Added code to get an idea of how slow the BT instruction is.

Code Select


;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
    bitarray db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
             db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
             db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
             db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
             db 0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0
.code
;==============================================================================
start:

    mov esi, OFFSET bitarray
    xor ebx, ebx
    .WHILE ebx < 640
        test ebx, 7
        jnz @F
        printf("\n")
      @@:
        bt [esi], ebx
        mov eax, 0
        adc eax, 0
        printf("%d", eax)
        inc ebx
    .ENDW
    printf("\n\n")

    invoke GetCurrentProcess
    invoke SetProcessAffinityMask, eax, 1

    invoke Sleep, 3000

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            xor ebx, ebx
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 4
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 8
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 12
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            mov ebx, 1
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 5
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 9
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 13
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            xor ebx, ebx
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 128
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 256
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 512
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    REPEAT 3

        counter_begin 1000000, HIGH_PRIORITY_CLASS
            xor ebx, ebx
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 129
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 257
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
            mov ebx, 513
            bt [esi], ebx
            mov eax, 0
            adc eax, 0
        counter_end
        printf("%d cycles\n", eax)

    ENDM

    printf("\n")

    inkey
    exit
;==============================================================================
end start

Code Select

Typical cycle counts on a P3:

Code Select


21 cycles
21 cycles
21 cycles

21 cycles
21 cycles
21 cycles

21 cycles
21 cycles
21 cycles

21 cycles
22 cycles
21 cycles

Typical cycle counts on a P4 (Northwood AFAIK):

Code Select


46 cycles
44 cycles
43 cycles

54 cycles
47 cycles
47 cycles

44 cycles
43 cycles
44 cycles

44 cycles
43 cycles
45 cycles

Using BT would simplify the bit indexing code, but I have doubts that the code would be enough simpler to make up for the slowness of the instruction and the additional code necessary to get/test the value of the indexed bit.

In the tif_fax3.c source, the find0span and find1span functions contain complex code that involves lookup tables. Is your goal to duplicate the results of these functions?

BTW, if you're going to use inline assembly in a function that will be called frequently, you can save a few cycles by eliminating the prologue and epilogue code.

Code Select


//=============================================================================
#include <windows.h>
#include <conio.h>
#include <stdio.h>
// counter_c.c is available here:
//   http://www.masm32.com/board/index.php?topic=13100.msg101673#msg101673
#include "counter_c.c"
//=============================================================================

INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
    INT32 result;
    __asm
    {
        xor eax, eax
        mov ecx, bs
        mov edi, indice
inizioloop:
        bt [edi], ecx
        jc  uscitaloop
        add eax, 1
        add ecx, 1
        cmp ecx, be
        jne inizioloop
uscitaloop:
        mov result, eax
    }
    return result;
}

__declspec( naked ) INT32 find0spannaked(u_char* indice, INT32 bs, INT32 be)
{
    __asm
    {
        xor eax, eax
        mov ecx, [esp+8]
        mov edi, [esp+4]
inizioloop:
        bt [edi], ecx
        jc  uscitaloop
        add eax, 1
        add ecx, 1
        cmp ecx, [esp+12]
        jne inizioloop
uscitaloop:
        ret
    }
}

void main( void )
{
    INT32 r;
    u_char test = 0x10;

    printf("%d\n",find0spanasm(&test,0,7));

    printf("%d\n",find0spannaked(&test,0,7));

    Sleep(4000);

    CTR_BEGIN( 1, 1000, HIGH_PRIORITY_CLASS );
    CTR_END( 1 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 2, 1000, HIGH_PRIORITY_CLASS );
        r = find0spanasm(&test,0,7);
    CTR_END( 2 )
    printf( "%d cycles\n", ctr_cycles );

    CTR_BEGIN( 3, 1000, HIGH_PRIORITY_CLASS );
        r = find0spannaked(&test,0,7);
    CTR_END( 3 )
    printf( "%d cycles\n", ctr_cycles );

    getch();
}

Running on a P3:

Code Select


4
4
0 cycles
54 cycles
44 cycles

Title: Re: CCITT fax G3 compression
Post by: bashir1961 on February 18, 2012, 09:33:19 AM

Hello,
first of all I wanted to thank you all of this forum for the support, I'm really impressed, my understanding of x86 tricks and traps has improved and I have many examples to study, thank you again.
@MichaelW:
yes, the idea was to evaluate a reimplementation of find0span, find1span in assembler;
they are complex functions since they are heavily optimized toward very long scans, at least I suppose so.
The situation in which I need improvement is the opposite, like a sequence of 0xF0, 0XF0 etc, a few bits on, a few bits off and so on.

@qWord:
thank for the code, i will study your example, will keep you informed, thanks again

Title: Re: CCITT fax G3 compression
Post by: dedndave on February 18, 2012, 02:23:11 PM

QuoteThe situation in which I need improvement is the opposite, like a
sequence of 0xF0, 0XF0 etc, a few bits on, a few bits off and so on

that sounds like a much more likely case in an average fax :P
but, you can design a data-set that tests both short and long
also, MichaelW's timing macros are what many of us use to time code...
http://www.masm32.com/board/index.php?topic=770.0

attached is an example of using the timers...

Title: Re: CCITT fax G3 compression
Post by: MichaelW on February 18, 2012, 02:48:51 PM

I just noticed that this bit of code:

Code Select


add ecx, 1
cmp ecx, be
jne inizioloop

Will stop short of using index be. I think the JNE should be a JNA.

Text only | Text with Images

SMF 2.1.4 © 2023, Simple Machines