While I'm studying the C language, I'm thinking about
ways to speed up operations with assembly code.
Something I'm trying to figure out is how to compare
strings of different lengths faster than the
C standard library can.
As an example, I'm going to search for and match strings
of length 2 up to an unknown number of characters
over a text file of roughly 10 MB.
The program's goal is to find out which strings appear most often
and how many times they appear in the text file.
So the general idea is to take the first 2 bytes of the text file
and compare them with every other 2-byte string in the file
(every one except the chosen bytes themselves, bytes 1 and 2 I mean),
incrementing a counter each time I find a match.
After that I do the same moving 1 byte forward and taking two bytes
to compare with all the others, and so on.
When I finish with the 2-byte strings, I start over with 3-byte strings,
and keep going until I find no match at all. If I don't find any 12-byte match,
I can't find any 13-byte or longer match either, I mean.
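In C, the inner step I have in mind would look something like this
naive, untested sketch (the names are just placeholders):

/* Naive sketch: count how often the len-byte window at pos
   reappears anywhere else in the buffer. */
#include <string.h>

static long count_matches(const unsigned char *buf, long size,
                          long pos, long len)
{
    long i, count = 0;
    for (i = 0; i + len <= size; i++) {
        if (i == pos)            /* skip the chosen window itself */
            continue;
        if (memcmp(buf + i, buf + pos, len) == 0)
            count++;
    }
    return count;
}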
At the moment I'm only concerned with comparing the two
strings, trying to do it in an efficient and fast way.
Surely many of you have already run into this kind of problem.
I'd like some ideas, not necessarily the ASM code, but the algorithms you
have experimented with and what kind of efficiency you got from them.
If you feel like doing so, please explain what you have done and why
you think it is faster, smaller, smarter, ...er :P
Thanks
Frank
Are we talking about groups of bytes, or really strings? There are other threads about strcmp() algorithms.
Further, doing repetitive linear searches over the whole file makes little sense; it would make more sense to build a hashing mechanism, hash everything in an initial pass, and perform subsequent searches down specific hash chains.
A simple hash key could be the first two characters of the string. With a reasonable distribution, your search space would be a fraction of the whole file.
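Here's a minimal, untested C sketch of that idea, assuming the whole file sits in memory and using the two bytes directly as the bucket index:

#include <stdlib.h>

#define NKEYS 65536              /* 2**16 possible two-byte keys */

static long head[NKEYS];         /* first position for each key */
static long *next;               /* next[i] = next position with the same key */

/* One pass: chain every position under its two-byte key, so a later
   search for a string starting "ab..." only walks the 'a','b' chain.
   No error checking, sketch only. */
static void build_chains(const unsigned char *buf, long size)
{
    long i;
    for (i = 0; i < NKEYS; i++)
        head[i] = -1;
    next = malloc(size * sizeof *next);
    for (i = size - 2; i >= 0; i--) {    /* reverse, so chains run front to back */
        unsigned key = (buf[i] << 8) | buf[i + 1];
        next[i] = head[key];
        head[key] = i;
    }
}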
I'm thinking about "repetitive patterns of bytes", from 2 to n bytes, inside
a text file, or rather a binary file with text inside, without line feeds or newlines.
A single big string of any length.
For the 2-byte patterns it would be useful, I think, to use the two bytes as the index into
a 2**16 numeric array and add 1 to the indexed element, so in one pass I
could know how many occurrences there are of every 2-byte group.
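In C it would be something like this (untested):

/* One pass over the buffer, counting every overlapping
   2-byte pair by indexing a 2**16 array of counters. */
static unsigned long pair_count[65536];

static void count_pairs(const unsigned char *buf, long size)
{
    long i;
    for (i = 0; i + 2 <= size; i++)
        pair_count[(buf[i] << 8) | buf[i + 1]]++;
}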
I don't know if hashing is something similar; I've never used this kind of
algo, so if you can explain how it works in simple words I'd be grateful.
For groups of 4 bytes I have a different idea to test.
For example, it should be quite fast to compare 2 strings, when they are
4 bytes long, using two 32-bit registers:
mov eax, offset str1      ; address of str1
mov ebx, offset str2      ; address of str2
mov eax, [eax]            ; load the first 4 bytes of str1
mov ebx, [ebx]            ; load the first 4 bytes of str2
cmp eax, ebx              ; compare all 4 bytes at once
je  equal                 ; all four bytes match
jmp next_string           ; mismatch, try the next string
Well, I actually don't know if that code works; I'm just dreaming about
ways to do the comparisons efficiently, and I haven't tested it so far.
Yes, indexing and hashing are similar concepts.
You could use the keys from the 2-byte scan as a starting point for the 3- and 4-byte scans.
Typically if I'm compressing something I will try to find a repeating sequence, determine how long it repeats, and then look for longer repeating sequences. It usually doesn't make sense to do an exhaustive search, so once you reach a threshold the match is good enough. It is a speed/compression trade-off.
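In rough C terms the inner step looks something like this (just a sketch; GOOD_ENOUGH is an arbitrary cutoff made up for illustration):

#define GOOD_ENOUGH 64    /* stop extending once the match is this long */

/* Extend a match between positions a and b until it breaks,
   runs off the end of the buffer, or hits the threshold. */
static long match_len(const unsigned char *buf, long size,
                      long a, long b)
{
    long n = 0;
    while (a + n < size && b + n < size &&
           buf[a + n] == buf[b + n] && n < GOOD_ENOUGH)
        n++;
    return n;
}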
From what I've tried so far, it looks like strcmp() can be optimized
quite a lot:
http://www.masm32.com/board/index.php?topic=13701.msg113018#msg113018