Title: text parsing
Post by: xtreme on June 16, 2007, 02:22:56 PM

Hello guys,

I'm new to this forum and would appreciate your help.

I would like to write a text parser and I am having trouble with it.

Can you help me?

Regards

[attachment deleted by admin]

Title: Re: text parsing
Post by: hutch-- on June 16, 2007, 02:35:16 PM

Hi xtreme,

Welcome on board.

I have had a look at the text file you want to parse, which has this data in it:

stringstart&str1=string1&str2=string2&stringend

You could design a parser for this format but it is much simpler and much faster to use a better design.

If you can live with text like,

word1 word2 word3 word4

There is already a very fast algo in the MASM32 project called wtok. It is an "in place" parser that zero terminates each word and delivers the result as an array of pointers to the words, so you can just use the pointers to access them.
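To make the "in place" idea concrete, here is a minimal sketch in C (chosen only so it can be compiled and checked anywhere). It is not the wtok source and does not claim to match its calling convention; the name split_words, its parameters and the choice of blank characters are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not the MASM32 wtok routine itself.
 * Splits `text` in place on spaces, tabs, CR and LF: each word is
 * zero terminated inside the buffer and its address is stored in
 * `words`. Returns the number of words stored (at most `max_words`). */
size_t split_words(char *text, char **words, size_t max_words)
{
    const char *blanks = " \t\r\n";
    size_t count = 0;
    char *p = text;

    assert(text != NULL && words != NULL);

    while (count < max_words) {
        p += strspn(p, blanks);          /* skip leading blanks */
        if (*p == '\0')
            break;
        words[count++] = p;              /* remember start of word */
        p += strcspn(p, blanks);         /* run to end of word */
        if (*p == '\0')
            break;
        *p++ = '\0';                     /* terminate word in place */
    }
    return count;
}
```

Because the buffer itself is modified, no word is copied: the pointer array is the whole result, and the original text has to live in writable memory.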

Title: Re: text parsing
Post by: xtreme on June 16, 2007, 02:46:29 PM

hi

I would like to keep this design. The goal is to print out string1 and string2 from:

stringstart&str1=string1&str2=string2&stringend

Thanks for your reply.

Title: Re: text parsing
Post by: hutch-- on June 16, 2007, 03:28:50 PM

The current string that you want to parse uses "=" as the lead delimiter and "&" as the trailing delimiter; the first and last words are redundant.
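As a rough illustration of that delimiter rule, the C sketch below collects whatever sits between an "=" and the following "&". The function name extract_values and its parameters are hypothetical, and treating the end of the string as a valid trailing delimiter is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. Collects every value that sits between a
 * leading '=' and the next '&' (or the end of the string). The text
 * is modified in place: each terminating '&' becomes a zero byte.
 * Returns the number of values stored (at most `max_values`). */
size_t extract_values(char *text, char **values, size_t max_values)
{
    size_t count = 0;
    char *p = text;

    assert(text != NULL && values != NULL);

    while (count < max_values && (p = strchr(p, '=')) != NULL) {
        char *end;
        ++p;                              /* value starts after '=' */
        values[count++] = p;
        end = strchr(p, '&');             /* trailing delimiter */
        if (end == NULL)
            break;                        /* last value runs to end */
        *end = '\0';
        p = end + 1;
    }
    return count;
}
```

Run on the sample string, this stores two pointers, to string1 and string2; stringstart and stringend are skipped because no "=" precedes them.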

Title: Re: text parsing
Post by: xtreme on June 16, 2007, 03:36:58 PM

thanks :U

Title: Re: text parsing
Post by: MichaelW on June 16, 2007, 04:13:39 PM

xtreme,

The way your string is formatted it will be difficult to parse. What do you expect the parsing to produce? For example:

stringstart
&
str1
=
string1
&
str2
=
string2
&
stringend

Or something else?

Assuming that "&" is a string concatenation operator and "=" is an assignment operator, as a programming language statement your string would be illegal in most common languages because it contains operators on the left side of assignments.
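If the split listed above is what is wanted, it could be produced along these lines. This is only a sketch in C: the struct token type and the split_tokens name are made up for the example, and it assumes "&" and "=" are the only operators.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One token: a pointer into the original text plus a length. */
struct token {
    const char *start;
    size_t      length;
};

/* Hypothetical helper. Splits `text` into tokens where '&' and '='
 * are one-character tokens and everything between them is a word.
 * The text is not modified. Returns the number of tokens stored
 * (at most `max_tokens`). */
size_t split_tokens(const char *text, struct token *out, size_t max_tokens)
{
    size_t count = 0;
    const char *p = text;

    assert(text != NULL && out != NULL);

    while (*p != '\0' && count < max_tokens) {
        size_t len = strcspn(p, "&=");    /* length of the next word */
        if (len == 0)
            len = 1;                      /* p is on an operator */
        out[count].start = p;
        out[count].length = len;
        ++count;
        p += len;
    }
    return count;
}
```

For the sample string this gives the same eleven tokens as the list, in the same order.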

Title: Re: text parsing
Post by: xtreme on June 16, 2007, 05:13:13 PM

Thanks to all.

I have partly solved the problem! The only thing is that it does not print string2 :'(

Here is another find-string example:

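Str_find PROC, sourcePtr:PTR BYTE, targetPtr:PTR BYTE

    mov eax, sourcePtr          ; eax = current search position in the source
    mov ebx, targetPtr          ; ebx = start of the target string
    mov esi, targetPtr
    xor ecx, ecx                ; ecx counts the target length
L0:
    cmp BYTE PTR [esi], 0       ; end of target?
    je L1
    inc esi
    inc ecx
    jmp L0
L1:
    mov edx, ecx                ; edx = target length
L2:
    mov esi, eax
    mov edi, ebx
    repe cmpsb                  ; compare up to ecx bytes at eax with the target
    jz FOUND                    ; every byte matched
    mov ecx, edx                ; reload the length for the next attempt
    cmp BYTE PTR [eax + ecx], 0 ; was that the last window of the source?
    jz NOT_FOUND
    inc eax                     ; slide one byte forward and retry
    jmp L2
NOT_FOUND:
    or eax, 1                   ; clears the zero flag
    ret
FOUND:
    mov tLength,edx             ; tLength is a variable defined outside this procedure
    sub esi,tLength             ; esi = start of the match
    ret
Str_find ENDP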
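For comparison, the same sliding search can be sketched in C. This is a hypothetical counterpart, not a translation of the routine above: the name str_find and the NULL return for "not found" are assumptions. The length check up front is there so that the search never looks past the terminator when the source is shorter than the target.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C counterpart of a Str_find style routine.
 * Returns a pointer to the first occurrence of `target` inside
 * `source`, or NULL when there is none. An empty target matches at
 * the start of the source. */
const char *str_find(const char *source, const char *target)
{
    size_t source_len, target_len, i;

    assert(source != NULL && target != NULL);

    source_len = strlen(source);
    target_len = strlen(target);
    if (target_len > source_len)
        return NULL;                      /* target cannot fit */

    /* Try every start position that leaves room for the target. */
    for (i = 0; i + target_len <= source_len; ++i) {
        if (memcmp(source + i, target, target_len) == 0)
            return source + i;
    }
    return NULL;
}
```

Returning the position directly, instead of signalling through a flag, keeps "found" and "not found" easy to tell apart in the caller.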

[attachment deleted by admin]

Title: Re: text parsing
Post by: MichaelW on June 16, 2007, 09:51:22 PM

I still am not sure what you are trying to do. This project would be easier if you would start with a console app, so you could concentrate on the string handling code instead of being distracted by the dialog code. You can take care of the dialog code after you get the string handling code working. The attachment contains the necessary parts of your code, cleaned up a bit, and converted to a console app. I also included code that displays a hex-ascii dump of the memory at the pointer returned by Str_find, so you can see exactly what the result is. The hex-ascii dump is displayed on the console, so to see it you must build the code as a console app.
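Since the attachment is gone, here is a minimal sketch in C of what a hex-ascii dump does. It is not the code from the attachment; the names hex_ascii_row and hex_ascii_dump, the 16 bytes per row and the output layout are all assumptions.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper. Formats up to 16 bytes as hex followed by
 * printable ASCII ('.' for anything else) into `line`, which must
 * hold at least 66 characters. Short rows are padded so the ASCII
 * column always starts at the same offset. */
void hex_ascii_row(const unsigned char *data, size_t count, char *line)
{
    size_t i;
    int pos = 0;

    assert(data != NULL && line != NULL && count <= 16);

    for (i = 0; i < 16; ++i) {
        if (i < count)
            pos += sprintf(line + pos, "%02X ", (unsigned)data[i]);
        else
            pos += sprintf(line + pos, "   ");   /* pad short rows */
    }
    line[pos++] = ' ';
    for (i = 0; i < count; ++i)
        line[pos++] = isprint(data[i]) ? (char)data[i] : '.';
    line[pos] = '\0';
}

/* Prints `length` bytes starting at `memory`, one row per 16 bytes,
 * each row prefixed with its offset from the start. */
void hex_ascii_dump(const void *memory, size_t length)
{
    const unsigned char *bytes = (const unsigned char *)memory;
    char line[80];
    size_t offset;

    for (offset = 0; offset < length; offset += 16) {
        size_t count = length - offset < 16 ? length - offset : 16;
        hex_ascii_row(bytes + offset, count, line);
        printf("%08lX  %s\n", (unsigned long)offset, line);
    }
}
```

Calling hex_ascii_dump on the pointer a search routine returns shows both the raw bytes and the text at that address, which makes an off-by-one or a missing terminator easy to spot.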

[attachment deleted by admin]

Title: Re: text parsing
Post by: Synfire on June 17, 2007, 12:53:13 AM

xtreme,

Sorry for popping in late on the thread, but from a quick glance at the string you are trying to parse, I would say you are working your way toward coding a CGI app in MASM. That is fine, and by all means go for it mate. I'm just wondering, if that's the case, why not take advantage of a language like PHP or Perl, which has CGI extensions that can handle even the modified format of stringstart;str1=string1;str2=string2;stringend with no modification at all? Have those scripts parse the QUERY_STRING into a manageable format such as environment variables (to be accessed through _getenv() in msvcrt.dll) and then use their built-in exec call (in Perl it's system()) to execute your application. This way you don't have to deal with any of the parsing at all: you just grab your variables right out of the environment, and you can use that same script for all of your CGI/ASM applications.

But if that's not the way you want to go, or maybe I just missed your target, then the best way to parse this type of string is backwards. You scan through to the end of the string, making sure you keep the address of the beginning of the string stored in a register somewhere (I prefer ESI for the start and EDI for the end). Then you start scanning backwards through the string until you reach the first ampersand (&) or your start address (ESI). When an ampersand is reached, save EDI+1 in a register (say EBX); if you reached the start address instead, save ESI. This is our current token. Using ECX as an index you can now scan forward until you reach an equals sign (=) or a NULL character (\0). If an equals sign is found, then EBX+ECX+1 is the value of your token, EBX+ECX should be set to a NULL character, and EBX is the key of your token. You've just broken up the Key/Value pair. If you reach a NULL character instead of an equals sign, then you can consider it a NULL Key and use it as is by accessing EBX. You continue the search by repeating the process until EDI is equal to ESI.

By doing this, you could construct a multi-dimensional array (matrix) of dwords which point to key/value pairs. The coolest part is, because you are substituting the ampersand (&) and equal (=) characters with NULL characters (\0) as you go, you don't have to remove the string from memory, you can actually use this string as the place where the dwords in the matrix point to, as when you add the NULL characters you are turning the string into an array of strings. This makes for efficient use of memory as well. ;)

Also, you only have to do it once. When you want to look up a token, you can use the matrix you've constructed as a table to help you find the value you are looking for. Just remember that the NULL Keys you come across during parsing also require some identification (maybe -1) in the second column of your matrix, to let you know that they are in fact NULL Keys.
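The backward scan and the lookup table described above might look roughly like this in C. Treat it as a sketch under assumptions: the struct pair layout and both function names are invented for the example, pointers stand in for the registers, and a NULL value pointer marks a NULL Key where the post suggests something like -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the table: both pointers refer into the parsed buffer.
 * `value` is NULL for a token without '=' (a NULL Key). */
struct pair {
    char *key;
    char *value;
};

/* Hypothetical helper. Parses `text` in place from the end towards
 * the start, replacing '&' and the first '=' of each token with zero
 * bytes. Rows are stored last token first. Returns the row count
 * (at most `max_pairs`). */
size_t parse_pairs_backwards(char *text, struct pair *table, size_t max_pairs)
{
    size_t count = 0;
    char *end;

    assert(text != NULL && table != NULL);

    end = text + strlen(text);            /* points at the final zero */
    while (count < max_pairs) {
        char *token = end;
        char *eq;

        /* Walk back to the character after the previous '&'. */
        while (token > text && token[-1] != '&')
            --token;

        eq = strchr(token, '=');          /* token is already terminated */
        if (eq != NULL)
            *eq++ = '\0';                 /* split key from value */
        table[count].key = token;
        table[count].value = eq;          /* NULL when there was no '=' */
        ++count;

        if (token == text)
            break;                        /* reached the start */
        end = token - 1;                  /* the '&' before this token */
        *end = '\0';                      /* terminate the previous token */
    }
    return count;
}

/* Looks a key up in the finished table; returns its value or NULL. */
const char *find_value(const struct pair *table, size_t count, const char *key)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        if (table[i].value != NULL && strcmp(table[i].key, key) == 0)
            return table[i].value;
    }
    return NULL;
}
```

On the sample string this yields four rows: stringend and stringstart as NULL Keys, and str2 and str1 paired with string2 and string1. The table is built once and every later lookup is just a walk over the rows.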

I hope all of this has helped in some way.

Regards,
Bryant Keller

Title: Re: text parsing
Post by: gabor on June 22, 2007, 11:35:24 AM

Hi!

I think there is no problem with the design. Actually, any sort of design should be supported by an adequate parser.
Secondly, my EFSM automaton can handle this (and hopefully any other) syntax. It uses rules to do the matching. Each rule also has an action associated with it, so it is quite easy to create an automaton that is controlled by its input.
Without any further explanation, this is my design for a parser:
1. Read an input char.
2. Match it against the valid separator characters.
3. If a match is found, step into the next state; if not, append the char to an auxiliary string and go to step 1. The state may be represented by the address of the separator array.
4. The auxiliary string (containing the chars between the preceding and following separators) is the word sought.

Whole words can be "separators" too: after a word has been eaten entirely (from separator to separator), it can be looked up in a table to get a code or action, etc.
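The four steps can be sketched in C as a small automaton whose state is a pointer to the current separator set. This is not the EFSM code; the two states, the WORD_MAX limit, the run_automaton name and the single built-in action (store a word that followed "=") are assumptions chosen to fit the sample string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WORD_MAX 32

/* Separator sets. The current state is simply a pointer to one of
 * them. These two states are an assumption made for the sample
 * string, not part of the EFSM design. */
static const char in_key[]   = "=&";   /* a key ends at '=' or '&' */
static const char in_value[] = "&";    /* a value ends at '&' */

/* Hypothetical helper. Feeds `input` through the automaton and
 * copies each completed value (a word that followed '=') into
 * `values`. Words longer than WORD_MAX - 1 are truncated.
 * Returns the number of values stored (at most `max_values`). */
size_t run_automaton(const char *input, char values[][WORD_MAX], size_t max_values)
{
    const char *state = in_key;
    char word[WORD_MAX];               /* auxiliary string */
    size_t length = 0, count = 0;

    assert(input != NULL && values != NULL);

    for (;; ++input) {
        char c = *input;                               /* step 1 */
        if (c != '\0' && strchr(state, c) == NULL) {   /* step 2 */
            if (length < WORD_MAX - 1)
                word[length++] = c;                    /* step 3: append */
            continue;
        }
        /* Separator or end of input: the word is complete (step 4). */
        word[length] = '\0';
        if (state == in_value && count < max_values)
            strcpy(values[count++], word);
        length = 0;
        /* Action: '=' switches to the value state, anything else
         * returns to the key state. */
        state = (c == '=') ? in_value : in_key;
        if (c == '\0')
            break;
    }
    return count;
}
```

With the sample string as input, the stored values are string1 and string2. A fuller version would replace the hard-coded action with a rule table, which is where the word lookup mentioned above would plug in.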

Good luck with your work!
Greets, Gábor

The MASM Forum Archive 2004 to 2012

General Forums => The Campus => Topic started by: xtreme on June 16, 2007, 02:22:56 PM