Hi everyone,
I am trying to speedup the following:
given a vector of values, I need to weed out values that are smaller than a low_limit
and higher than a hi_limit, and then get a new vector with only valid values (i.e ones
in the range). Basically, I am trying to re-implement the find() function of Matlab, for those
of you who are familiar with it. Consider my knowledge in assembly to be very minimal, but
aware :-) I use core duo Intel, that can support uptill SSE4.1.
I was wondering if someone may help me with some basic asm snippets / ideas that
will help me in writing a fast version of the above problem statement.
Thanks,
s.
a while back, qWord posted some sse code that you may be able to adapt to your needs
we were scanning a string for a few characters, like carriage return, line feed, tab, null
with a little effort, you could use something similar...
http://www.masm32.com/board/index.php?topic=15474.msg134886#msg134886
my knowledge of sse is very limited :P
SIDDYS,
When you say: "...given a vector of values,..."
What do you mean exactly ???
How is the data represented ???
"In physics, as well as mathematics, a vector is often identified with a tuple, or list of numbers, which depend on some auxiliary coordinate system or reference frame."
You could sort the values using some efficient sorting algorithm and then use some sort of binary search to find the target range.
By a vector of values, I mean that the data values are in a 1D array, and stores in byte
aligned contiguous memory locations.
Thanks a lot folks for the tips, I will try them, but please keep the
ideas flowing my way, as and when they come....
sid.
assuming the array contains signed values...
push edi
push esi
mov dh,Xmin
mov dl,Xmax
mov edi,offset VectorOut
mov esi,offset VectorIn
mov ecx,sizeof VectorIn
loop00: mov al,[esi]
inc esi
cmp al,dl
jg loop01
cmp al,dh
jl loop01
mov [edi],al
inc edi
loop01: sub ecx,1
jnz loop00
xchg eax,edi
pop esi
sub eax,offset VectorOut
pop edi
it's not SSE, and there are probably some speed-ups :P
for instance, if we knew more about the length of the array, you might be able to unroll the loop
Wow, thanks.... seems to be aligned with what I was looking for. Regarding the length
of the array, the maximum size can be 1024 x 1024 elements. The dimensions will always
be power of 2.
Thanks again guys... I should have registered on this group sooner.....
sid.
well, if it's a power of 2, you can unroll the loop nicely
the above code can be modified to load words, for example and do 2 bytes per pass
the more you unroll the loop, the faster it is likely to be, within certain limitations
it is good to keep the total length of the loop under 128 bytes so that the branch back to the top is SHORT rather than NEAR
also - important to know that memory reads and writes are faster on 32-bit machines if they are aligned dwords
anyways, here is the above loop modified to do 2 bytes per pass
push edi
push esi
mov dh,Xmin
mov dl,Xmax
mov edi,offset VectorOut
mov esi,offset VectorIn
mov ecx,sizeof VectorIn
loop00: mov ax,[esi]
add esi,2
cmp al,dl
jg loop01
cmp al,dh
jl loop01
mov [edi],al
inc edi
loop01: cmp ah,dl
jg loop02
cmp ah,dh
jl loop02
mov [edi],ah
inc edi
loop02: sub ecx,2
ja loop00
xchg eax,edi
pop esi
sub eax,offset VectorOut
pop edi
found a typo in the first loop...
was
xchg eax,esi
should be
xchg eax,edi