News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

An ALGO in my mind - #1

Started by frktons, June 16, 2010, 04:35:42 PM

Previous topic - Next topic

frktons

While I'm studying the C language, I'm thinking about
ways to accomplish fast operations with Assembly code.

Something I'm trying to figure it out is how to perform
the comparison between strings of different lenghts in a faster
way than the C standard library can.


As an example, I'm going to do a search and match for strings
of lenght between 2 and an unknown number of characters
over a text file of 10 MB more or less.

The program's target is to verify what strings appear more often
and how many times they appear in the text file.

So the general idea  is to start with the first 2 bytes of the text file
and compare them with all the 2 bytes strings I have in the text file,
except for the 2 bytes I've chosen,  byte 1 and 2 I mean,
incrementing a counter each time I find a match.

After that I do the same moving 1 byte forward and taking two bytes
to compare with all the others, and so on.

When I finish with the 2 bytes strings, I start over with 3 bytes strings,
until I don't find any match. If I don't find a 12 bytes match, I can't find
any 13 or more bytes match as well, I mean.

At the moment I'm only concerned with the comparison of the two
strings, trying to do it in an efficient and fast way.


Surely many of you have already gone through this kind of problem.

I'd like some ideas, not necessarily the ASM code, but the ALGO you
have experimented and what kind of efficiency you have got from it.

If you feel like to do so, please explain what you have done and why
you think it is faster, smaller, smarter, ....er
  :P

Thanks

Frank



Mind is like a parachute. You know what to do in order to use it :-)

clive

Are we talking about groups of bytes, or really strings? There are other threads about strcmp() algorithms.

Further, doing repetitive linear searches over the whole file makes little sense, it would make more sense to generate a hashing mechanism and hash everything in an initial pass, and perform subsequent searches down specific hash chains.

A simple hash key could be the first two characters of the string. With a reasonable distribution your search space would be a fraction of the whole file.
It could be a random act of randomness. Those happen a lot as well.

frktons

#2
I'm thinking about "repetitive patterns of bytes", from 2 to n bytes, inside
a text file, or better a binary file with text inside, without line feed or newline.
A single big string of any lenght.
For the 2 bytes pattern it would be useful, I think, to use them as index of
a 2**16 numeric array and add 1 to the indexed element, so in one pass I
could know how many  occurrences are there for every 2 group bytes.

I don't know if the hashing is something similar, I've never used this kind of
algos, so if you can explain how it works in simple words I'd be grateful.

For groups of  4 bytes I have some different idea to test.

For example it should be quite fast to compare 2 strings, when they are
4 bytes long, using 2 32 bit registers:


mov eax, offset str1
mov ebx, offset str2
mov eax, [eax]
mov ebx, [ebx]
xor eax, ebx
jz equal
jmp next_string


Well I actually don't know if that code works, I'm just dreaming about
ways to do the comparisons in an efficient way, I didn't test it so far.


Mind is like a parachute. You know what to do in order to use it :-)

clive

Yes indexing and hashing are similar concepts.

You could use the keys from the 2-byte scan, as a starting point for the 3 and 4 byte scans.

Typically if I'm compressing something I will try to find a repeating sequence, determine how long it repeats, and then look for longer repeating sequences. It usually doesn't make sense to do an exhaustive search, so once you reach a threshold the match is good enough. It is a speed/compression trade off.
It could be a random act of randomness. Those happen a lot as well.

frktons

Mind is like a parachute. You know what to do in order to use it :-)