
text parsing

Started by xtreme, June 16, 2007, 02:22:56 PM


xtreme

Hello guys,

I'm new to this forum and would appreciate your help.

I would like to write a text parser, and I am having trouble with it.

Can you help me?


Regards


[attachment deleted by admin]

hutch--

Hi xtreme,

Welcome on board.

I have had a look at your text file to parse, which has this data in it:


stringstart&str1=string1&str2=string2&stringend


You could design a parser for this format but it is much simpler and much faster to use a better design.

If you can live with text like,


word1 word2 word3 word4


There is already a very fast algo in the MASM32 project called wtok. It is an "in place" parser that delivers the result as an array of pointers to each word, which it also zero terminates, so you can just use the pointers to access the words.
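To make the "in place" idea concrete, here is a minimal sketch in C of a tokenizer that works the same general way: it overwrites the separator after each word with a zero byte and stores a pointer to the start of each word. This is not the wtok source and does not reflect its calling convention; the name split_words and its parameters are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper, not the MASM32 wtok routine.
   Splits `text` in place on spaces, tabs, and line breaks.
   Stores up to `max` word pointers in `words` and returns the count.
   Each word is zero terminated by overwriting the separator after it. */
size_t split_words(char *text, char **words, size_t max)
{
    size_t count = 0;
    char *p = text;

    while (*p != '\0' && count < max) {
        /* skip any separators in front of the next word */
        while (*p == ' ' || *p == '\t' || *p == '\r' || *p == '\n')
            p++;
        if (*p == '\0')
            break;

        words[count++] = p;              /* start of the next word */

        /* advance to the end of the word */
        while (*p != '\0' && *p != ' ' && *p != '\t' &&
               *p != '\r' && *p != '\n')
            p++;
        if (*p != '\0')
            *p++ = '\0';                 /* terminate the word in place */
    }
    return count;
}
```

Because the pointers refer to the original buffer, nothing is copied; the caller only needs to keep the buffer alive for as long as the pointers are used.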



Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

xtreme

hi

I would like to use this design. It should print out string1 and string2:

stringstart&str1=string1&str2=string2&stringend


Thanks for your reply.

hutch--

The current string that you want to parse uses "=" as the lead delimiter and "&" as the trailing delimiter. The first and last words are redundant.
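A rough sketch in C of that delimiter rule: everything between an "=" and the next "&" is treated as a value, and everything else is ignored. The function name extract_values is hypothetical, and the sketch assumes the buffer may be modified in place.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: collects every value that sits between '=' and the
   next '&' (or the end of the string). The string is modified in place:
   each '&' that ends a value is overwritten with a zero terminator. */
size_t extract_values(char *text, char **values, size_t max)
{
    size_t count = 0;
    char *p = text;

    while (count < max && (p = strchr(p, '=')) != NULL) {
        char *value = p + 1;             /* value starts after '=' */
        char *end = strchr(value, '&');  /* value ends at the next '&' */

        values[count++] = value;
        if (end == NULL)
            break;                       /* last value runs to end of string */
        *end = '\0';
        p = end + 1;                     /* continue after the delimiter */
    }
    return count;
}
```

For the string in this thread, this leaves two pointers, one to string1 and one to string2; stringstart and stringend are never stored because no "=" precedes them.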

xtreme


MichaelW

xtreme,

The way your string is formatted it will be difficult to parse. What do you expect the parsing to produce? For example:

stringstart
&
str1
=
string1
&
str2
=
string2
&
stringend

Or something else?

Assuming that "&" is a string concatenation operator and "=" is an assignment operator, as a programming language statement your string would be illegal in most common languages because it contains operators on the left side of assignments.
eschew obfuscation

xtreme

Thanks to all.

I have partly solved the problem! The only thing is that it does not print string2 :'(

Can you give me another findstring example?

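; Returns EAX = address of the first occurrence of the target string
; inside the source string, or 0 if the target does not occur.
Str_find PROC USES ebx esi edi, sourcePtr:PTR BYTE, targetPtr:PTR BYTE

          mov eax, sourcePtr          ; eax = current start position in source
          mov ebx, targetPtr          ; ebx = start of target
          mov esi, targetPtr
          xor ecx, ecx
          cld                         ; cmpsb must scan forward
   L0:                                ; count the target length into ecx
          cmp BYTE PTR [esi], 0
          je L1
          inc esi
          inc ecx
          jmp L0
   L1:
          mov edx, ecx                ; edx = target length (ZF is still set here)
   L2:
          mov esi, eax
          mov edi, ebx
          repe cmpsb                  ; compare ecx bytes of source and target
          jz FOUND                    ; all bytes matched (or target is empty)
          cmp BYTE PTR [esi - 1], 0   ; mismatch on the source terminator?
          jz NOT_FOUND                ; yes: source ended before a full match
          mov ecx, edx                ; reload the target length
          inc eax                     ; try the next start position
          jmp L2
NOT_FOUND:
          xor eax, eax                ; 0 = not found
          ret
FOUND:
          ret                         ; eax already points at the match
Str_find ENDP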

[attachment deleted by admin]

MichaelW

I still am not sure what you are trying to do. This project would be easier if you would start with a console app, so you could concentrate on the string handling code instead of being distracted by the dialog code. You can take care of the dialog code after you get the string handling code working. The attachment contains the necessary parts of your code, cleaned up a bit, and converted to a console app. I also included code that displays a hex-ascii dump of the memory at the pointer returned by Str_find, so you can see exactly what the result is. The hex-ascii dump is displayed on the console, so to see it you must build the code as a console app.
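Since the attachment is no longer available, here is a minimal sketch in C of what a hex-ascii dump typically looks like. It is not the code from the attachment; the names format_dump_line and hex_ascii_dump, and the 16-bytes-per-line layout, are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

#define DUMP_WIDTH 16
/* 3 chars per hex byte + 1 gap + ASCII column + terminator */
#define DUMP_LINE_SIZE (DUMP_WIDTH * 3 + 1 + DUMP_WIDTH + 1)

/* Format up to DUMP_WIDTH bytes at `mem` into `line`: the hex values
   first, then the same bytes as printable ASCII ('.' for the rest). */
void format_dump_line(char *line, const unsigned char *mem, size_t len)
{
    char *p = line;
    size_t i;

    if (len > DUMP_WIDTH)
        len = DUMP_WIDTH;
    for (i = 0; i < DUMP_WIDTH; i++) {
        if (i < len)
            p += sprintf(p, "%02X ", (unsigned)mem[i]);
        else
            p += sprintf(p, "   ");      /* pad a short last line */
    }
    *p++ = ' ';
    for (i = 0; i < len; i++)
        *p++ = isprint(mem[i]) ? (char)mem[i] : '.';
    *p = '\0';
}

/* Print `len` bytes starting at `mem`, one formatted line per 16 bytes. */
void hex_ascii_dump(const void *mem, size_t len)
{
    const unsigned char *bytes = mem;
    char line[DUMP_LINE_SIZE];
    size_t off;

    for (off = 0; off < len; off += DUMP_WIDTH) {
        format_dump_line(line, bytes + off, len - off);
        printf("%08lX  %s\n", (unsigned long)off, line);
    }
}
```

Dumping a few dozen bytes at the returned pointer shows at a glance whether it lands on the expected substring and where the zero terminator sits.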


[attachment deleted by admin]

Synfire

xtreme,

Sorry for popping in late on the thread, but from a quick glance at the string you are trying to parse, I would say you are working your way toward coding a CGI app in MASM. That is fine, and by all means go for it, mate. I'm just wondering, if that's the case, why not take advantage of a language like PHP or PERL, which has CGI extensions that can handle even the modified format of stringstart;str1=string1;str2=string2;stringend with no modification at all? Have those scripts parse the QUERY_STRING into a manageable format such as environment variables (to be accessed through _getenv() in msvcrt.dll) and then use their built-in exec call (in PERL it's system()) to execute your application. This way you don't have to deal with any of the parsing at all: you just grab your variables right out of the environment, and you can use that same script to handle all of your CGI/ASM applications, so the parser never gets in your way during development.
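On the application side, picking a variable out of the environment comes down to one C runtime call. A minimal sketch in C using the standard getenv function; the helper name env_or_default is hypothetical, and the variable name str1 in the usage note assumes the wrapper script exported the keys from the example string under their own names.

```c
#include <stdlib.h>

/* Hypothetical helper: returns the value of environment variable `name`,
   or `fallback` if the variable is not set. */
const char *env_or_default(const char *name, const char *fallback)
{
    const char *value = getenv(name);    /* NULL when the variable is unset */
    return value != NULL ? value : fallback;
}
```

A call such as env_or_default("str1", "") would then yield string1 if the script had exported it, with no parsing code in the application itself.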

But if that's not the way you want to go, or maybe I just missed your target, then the best way to parse this type of string is backwards. Scan through to the end of the string, making sure you keep the address of the beginning of the string stored in a register (I prefer ESI for the start and EDI for the end). Then scan backwards through the string until you reach the first ampersand (&) or your start address (ESI). When that is reached, save EDI+1 in a register (say EBX), or EDI itself if you stopped at the start address; this is the current token. Using ECX as an index, you can now scan forward until you reach an equals sign (=) or a NULL character (\0). If an equals sign is found, then EBX+ECX+1 is the value of your token, EBX+ECX should be set to a NULL character, and EBX is the key of your token. You've just broken up the key/value pair. If you reach a NULL character instead of an equals sign, you can consider it a NULL key and use it as is by accessing EBX. Continue the search by repeating the process until EDI is equal to ESI.

By doing this, you could construct a multi-dimensional array (matrix) of dwords which point to key/value pairs. The coolest part is, because you are substituting the ampersand (&) and equal (=) characters with NULL characters (\0) as you go, you don't have to remove the string from memory, you can actually use this string as the place where the dwords in the matrix point to, as when you add the NULL characters you are turning the string into an array of strings. This makes for efficient use of memory as well. ;)

Also, you only have to do it once. When you want to look up a token, you can use the matrix you've constructed as a table to help you find the value you are looking for. Just remember that the NULL keys you come across during parsing also need some identification (maybe -1) in the second column of your matrix, to let you know that the entry is in fact a NULL key.
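The backwards scan described above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not a translation of any existing MASM routine: the names struct pair and parse_backwards are hypothetical, the two-column matrix is modelled as an array of structs, and a NULL value pointer stands in for the suggested -1 marker on tokens that have no "=".

```c
#include <stddef.h>
#include <string.h>

struct pair {
    char *key;     /* points into the original buffer */
    char *value;   /* NULL marks a token that had no '=' */
};

/* Parse "a&k1=v1&k2=v2&z" in place, scanning from the end toward the
   start. Every '&' and the first '=' of each token are overwritten with
   '\0', so the table entries point straight into `text`. Entries are
   stored last token first. Returns the number of entries written. */
size_t parse_backwards(char *text, struct pair *table, size_t max)
{
    size_t count = 0;
    char *end = text + strlen(text);     /* one past the last character */

    while (count < max) {
        char *token = end;
        char *eq;

        /* walk back to just after the previous '&', or to the start */
        while (token > text && token[-1] != '&')
            token--;

        /* the token is already zero terminated, so the search for '='
           cannot run into the tokens that were split off earlier */
        eq = strchr(token, '=');
        table[count].key = token;
        if (eq != NULL) {
            *eq = '\0';                  /* split key from value */
            table[count].value = eq + 1;
        } else {
            table[count].value = NULL;   /* key without a value */
        }
        count++;

        if (token == text)
            break;                       /* reached the start of the string */
        token[-1] = '\0';                /* replace the '&' before the token */
        end = token - 1;
    }
    return count;
}
```

After one pass over the example string the table holds four entries, and a lookup is just a loop comparing each key against the name you want.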

I hope all of this has helped in some way.

Regards,
Bryant Keller

gabor

Hi!

I think there is no problem with the design. Actually, any sort of design should be supported by an adequate parser.
Secondly, my EFSM automaton can handle this (and hopefully any other) syntax. It uses rules to do the matching. Each rule also has an action associated with it, so it is quite easy to create an automaton that is controlled by its input.
Without any further explanation, this is my design for a parser:
1. Read an input char.
2. Match it against the valid separator characters.
3. If a match is found, step to the next state; if not, append the char to an auxiliary string and go to step 1. The state may be represented by the address of the separator array.
4. The auxiliary string (containing the chars between the preceding and following separators) holds the word being sought.

Whole words can be "separators" too: after a word was eaten entirely (from separator to separator) the word can be looked up in a table to get a code or action, etc.
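The four steps above can be sketched in C as a two-state automaton. This is not the EFSM library mentioned in the post; the state names, the separators table, the WORD_MAX limit, and the function collect_values are all assumptions made for illustration. Each state has its own set of valid separator characters, and only the words collected in the value state are kept.

```c
#include <stddef.h>
#include <string.h>

enum state { IN_KEY, IN_VALUE };

/* Separator characters that are valid in each state. */
static const char *const separators[] = { "=&", "&" };

#define WORD_MAX 64

/* Run the automaton over `input` and copy each value (a word collected
   in state IN_VALUE) into `values`. Returns the number of values stored. */
size_t collect_values(const char *input, char values[][WORD_MAX], size_t max)
{
    enum state st = IN_KEY;
    char word[WORD_MAX];                 /* the auxiliary string */
    size_t len = 0, count = 0;
    const char *p;

    for (p = input; ; p++) {
        char c = *p;                                      /* step 1 */
        int is_sep = (c == '\0') ||
                     strchr(separators[st], c) != NULL;   /* step 2 */

        if (!is_sep) {                   /* step 3: no match, keep collecting */
            if (len < WORD_MAX - 1)
                word[len++] = c;
            continue;
        }

        word[len] = '\0';                /* step 4: the word is complete */
        if (st == IN_VALUE && count < max)
            strcpy(values[count++], word);
        len = 0;

        if (c == '\0')
            break;
        st = (c == '=') ? IN_VALUE : IN_KEY;   /* step 3: next state */
    }
    return count;
}
```

Because "=" is not a separator in the value state, a value that itself contains "=" is collected intact, which is the kind of behaviour that is awkward to get from a single fixed delimiter set.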


Good luck with your work!
Greets, Gábor