Hello everybody,
I am trying to optimize a C routine for compressing an image in TIFF fax G3 CCITT format.
The C routine is well optimized (it comes from www.libtiff.org), so I do not think there is much to do in C and so I am trying to code some "hot spot" routine in assembler.
The application essentially does a run length encoding at bit level, and profiling it it seems it spends most of its time in two routines, the first one counts the sequence of "0" (find0span for reference if you want to see the C source code, file is tif_fax3.c of the above mentioned libtiff), the other one counts the sequence of "1"s.
As I said before the routine is well optimized but the optimization is targeted at average images, i.e. with a long sequence of 0 / 1, and so tries to read 4 bytes at once etc.etc.
However we are experiencing troubles with images that seem designed to break the compressor, for instance two black bits, than a zero, two black bits again and so on.
So I started studying bit operations in x86; I discovered that the BT operation supports also memory operation, i.e. it can address a single bit in memory seen as a bit array, at least I understood so.
This is my first try, written with the help of __asm keyword of visual studio:
the routine take as first parameter the address of the memory array to scan, bit start and bit end.
However I must be missing something really obvious because it behaves as it tested the pointer, and not the memory pointed to.
INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
INT32 result;
__asm
{
xor eax,eax
mov ecx,bs
inizioloop:
bt DWORD PTR indice, ecx
jc uscitaloop
add eax, 1
add ecx, 1
cmp ecx, be
jne inizioloop
uscitaloop:
mov result, eax
}
return result;
}
Thany you and best regards
hi,
pointers must be loaded into a register. Also take a look at the BSF/BSR, POPCNT and LZCNT instruction (POPCNT and LZCNT are only available on newer CPUs).
mov edx,indice ; edx is the pointer
bsf eax,DWORD ptr [edx]
jz @nobitset
;eax = index of first set bit
Thanks qword
I changed to the version below and it seems to work, it seems very ssslooowww, I will make some more serious timing test.
I considered the bsf/bsr but I must count a sequence of bits starting from an arbitrary bit position and so they seemed to me not useful for this job.
INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
INT32 result;
__asm
{
xor eax,eax
mov ecx,bs
mov edi, indice
inizioloop:
bt [edi], ecx
jc uscitaloop
add eax, 1
add ecx, 1
cmp ecx, be
jne inizioloop
uscitaloop:
mov result, eax
}
return result;
}
there are a few ways to speed that up
one thing i see right away is...
for each pass of the loop, you access the same memory location, over and over
bt [edi], ecx
it would be best to load that value into EDX before the loop
mov edx,[edi]
then, inside the loop, you BT on EDX
bt edx,ecx
but - there are some tricks that can really speed this up
my understanding is that you want to count set bits in a given range
once a cleared bit or if the end of the range is encountered, you stop
you could do away with the loop, i think
let's play.....
mov ecx,bs ;4
mov edx,[edi] ;01110101
shr edx,cl ;0111
inc edx ;1000 - notice that all successive 1's are now 0, and the first 0 is now a 1
bsf eax,edx
you get the idea - should be much faster than a loop
Thanks dedndave,
well, I am not sure I understood correctly the BT instruction, It seemed interesting because it allows (but does it ?) to index an arbitrary bit in memory, i.e. in this example ecx can be greater than 32 and refer to successive memory positions.
So I wanted to avoid bit base / bit offset computations, etc.etc.
I will study your example, thanks a lot, if I get it right I should include it in a loop on all the memory locations, i.e. (bitend - bitstart) / 32 or something like that.
You are right, I need to count sequence of set bits in a given range but not limited to 0..31, the bit string can span theoretically several Mbytes.
well - you can test the dword to see if it is -1 (all 1's) at the beginning
if it is, you add 32 to the total count and go to the next dword
if it is not, you know you have a cleared bit
a simple way to test for -1 is to add 1 and test for 0
mov edx,[edi]
add edx,1
jz is_all_ones
dec edx
;scan for cleared bit
oh - another way to test for a cleared bit is to invert the dword, then scan for a set bit
If you're going for speed, BT probably isn't the best idea - it's quite slow.
As for avoiding bit-base/offsets, it's not really an issue. Any dword is either: all 1s, all 0s, or a mixture. A simple CMP can check for the first two, which results in your bit-count increasing by 32 and you continue onto the next dword; or it's a mixture and you count the individual bits as you would have to do in some form anyway.
For counting runs of bits, BSR would be nice but gives the first bit set (1), so you NOT the dword first, then BSR will effectively tell you how many bits were 1s; to continue for the rest of the dword you'll need to SHL by the number of bits, and also keep a count of the total bits counted in the dword, so you don't go over 32.
I've played a bit with it: the following proc scans for zero-bits in given bit range. The buffer's size must be a multiple of 4 and should be aligned to 4. The bit indexes are zero based:
invoke zero_scan,...,0,31 ; scans a whole DWORD
invoke zero_scan,...,63,63; test bit 63
zero_scan proc uses edi esi ebx pBits:PCHAR,bs:DWORD,be:DWORD
LOCAL ns:DWORD
mov esi,pBits
mov edx,bs
mov edi,be
sub edi,edx ; edi = number of bits to scan
inc edi
xor ecx,ecx
and edx,31 ; edx = bit index
mov ebx,edx
jz @1
mov ecx,31
sub ecx,edx ; ecx = reserved index
lea edx,[ecx+1]
mov ns,edx
mov eax,80000000h
sar eax,cl ; create bitmask
xor ecx,ecx
mov edx,[esi]
and edx,eax ; mask bits
lea esi,[esi+4] ; update offset
mov ecx,ns
jz @1
mov ecx,ebx
neg ecx
jnz @bsf
align 16
@1: mov edx,[esi]
test edx,edx
jnz @bsf
add ecx,32
add esi,4
cmp ecx,edi
jb @1
mov eax,edi
ret
align 16
@bsf:
bsf eax,edx
add ecx,eax
cmp ecx,edi
cmovae eax,edi
cmovb eax,ecx
ret
zero_scan endp
The return value is the number of contiguous zero-bits starting from bs.
This procedure could be improved ...
qWord
Quote from: bashir1961 on February 16, 2012, 04:02:59 PM
well, I am not sure I understood correctly the BT instruction, It seemed interesting because it allows (but does it ?) to index an arbitrary bit in memory, i.e. in this example ecx can be greater than 32 and refer to successive memory positions.
So I wanted to avoid bit base / bit offset computations, etc.etc.
It does.
Edit: Added code to get an idea of how slow the BT instruction is.
;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
bitarray db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
db 15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0
db 0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0,0ffh,0
.code
;==============================================================================
start:
mov esi, OFFSET bitarray
xor ebx, ebx
.WHILE ebx < 640
test ebx, 7
jnz @F
printf("\n")
@@:
bt [esi], ebx
mov eax, 0
adc eax, 0
printf("%d", eax)
inc ebx
.ENDW
printf("\n\n")
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
invoke Sleep, 3000
REPEAT 3
counter_begin 1000000, HIGH_PRIORITY_CLASS
xor ebx, ebx
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 4
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 8
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 12
bt [esi], ebx
mov eax, 0
adc eax, 0
counter_end
printf("%d cycles\n", eax)
ENDM
printf("\n")
REPEAT 3
counter_begin 1000000, HIGH_PRIORITY_CLASS
mov ebx, 1
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 5
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 9
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 13
bt [esi], ebx
mov eax, 0
adc eax, 0
counter_end
printf("%d cycles\n", eax)
ENDM
printf("\n")
REPEAT 3
counter_begin 1000000, HIGH_PRIORITY_CLASS
xor ebx, ebx
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 128
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 256
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 512
bt [esi], ebx
mov eax, 0
adc eax, 0
counter_end
printf("%d cycles\n", eax)
ENDM
printf("\n")
REPEAT 3
counter_begin 1000000, HIGH_PRIORITY_CLASS
xor ebx, ebx
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 129
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 257
bt [esi], ebx
mov eax, 0
adc eax, 0
mov ebx, 513
bt [esi], ebx
mov eax, 0
adc eax, 0
counter_end
printf("%d cycles\n", eax)
ENDM
printf("\n")
inkey
exit
;==============================================================================
end start
00000000
10000000
01000000
11000000
00100000
10100000
01100000
11100000
00010000
10010000
01010000
11010000
00110000
10110000
01110000
11110000
11110000
01110000
10110000
00110000
11010000
01010000
10010000
00010000
11100000
01100000
10100000
00100000
11000000
01000000
10000000
00000000
00000000
10000000
01000000
11000000
00100000
10100000
01100000
11100000
00010000
10010000
01010000
11010000
00110000
10110000
01110000
11110000
11110000
01110000
10110000
00110000
11010000
01010000
10010000
00010000
11100000
01100000
10100000
00100000
11000000
01000000
10000000
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
11111111
00000000
Typical cycle counts on a P3:
21 cycles
21 cycles
21 cycles
21 cycles
21 cycles
21 cycles
21 cycles
21 cycles
21 cycles
21 cycles
22 cycles
21 cycles
Typical cycle counts on a P4 (Northwood AFAIK):
46 cycles
44 cycles
43 cycles
54 cycles
47 cycles
47 cycles
44 cycles
43 cycles
44 cycles
44 cycles
43 cycles
45 cycles
Using BT would simplify the bit indexing code, but I have doubts that the code would be enough simpler to make up for the slowness of the instruction and the additional code necessary to get/test the value of the indexed bit.
In the tif_fax3.c source, the find0span and find1span functions contain complex code that involves lookup tables. Is your goal to duplicate the results of these functions?
BTW, if you're going to use inline assembly in a function that will be called frequently, you can save a few cycles by eliminating the prologue and epilogue code.
//=============================================================================
#include <windows.h>
#include <conio.h>
#include <stdio.h>
// counter_c.c is available here:
// http://www.masm32.com/board/index.php?topic=13100.msg101673#msg101673
#include "counter_c.c"
//=============================================================================
INT32 find0spanasm(u_char* indice, INT32 bs, INT32 be)
{
INT32 result;
__asm
{
xor eax, eax
mov ecx, bs
mov edi, indice
inizioloop:
bt [edi], ecx
jc uscitaloop
add eax, 1
add ecx, 1
cmp ecx, be
jne inizioloop
uscitaloop:
mov result, eax
}
return result;
}
__declspec( naked ) INT32 find0spannaked(u_char* indice, INT32 bs, INT32 be)
{
__asm
{
xor eax, eax
mov ecx, [esp+8]
mov edi, [esp+4]
inizioloop:
bt [edi], ecx
jc uscitaloop
add eax, 1
add ecx, 1
cmp ecx, [esp+12]
jne inizioloop
uscitaloop:
ret
}
}
void main( void )
{
INT32 r;
u_char test = 0x10;
printf("%d\n",find0spanasm(&test,0,7));
printf("%d\n",find0spannaked(&test,0,7));
Sleep(4000);
CTR_BEGIN( 1, 1000, HIGH_PRIORITY_CLASS );
CTR_END( 1 )
printf( "%d cycles\n", ctr_cycles );
CTR_BEGIN( 2, 1000, HIGH_PRIORITY_CLASS );
r = find0spanasm(&test,0,7);
CTR_END( 2 )
printf( "%d cycles\n", ctr_cycles );
CTR_BEGIN( 3, 1000, HIGH_PRIORITY_CLASS );
r = find0spannaked(&test,0,7);
CTR_END( 3 )
printf( "%d cycles\n", ctr_cycles );
getch();
}
Running on a P3:
4
4
0 cycles
54 cycles
44 cycles
Hello,
first of all I wanted to thank you all of this forum for the support, I'm really impressed, my understanding of x86 tricks and traps has improved and I have many examples to study, thank you again.
@MichaelW:
yes, the idea was to evaluate a reimplementation of find0span, find1span in assembler;
they are complex functions since they are heavily optimized toward very long scans, at least I suppose so.
The situation in which I need improvement is the opposite, like a sequence of 0xF0, 0XF0 etc, a few bits on, a few bits off and so on.
@qWord:
thank for the code, i will study your example, will keep you informed, thanks again
QuoteThe situation in which I need improvement is the opposite, like a
sequence of 0xF0, 0XF0 etc, a few bits on, a few bits off and so on
that sounds like a much more likely case in an average fax :P
but, you can design a data-set that tests both short and long
also, MichaelW's timing macros are what many of us use to time code...
http://www.masm32.com/board/index.php?topic=770.0
attached is an example of using the timers...
I just noticed that this bit of code:
add ecx, 1
cmp ecx, be
jne inizioloop
Will stop short of using index be. I think the JNE should be a JNA.