Identify a text file

Started by ToutEnMasm, September 11, 2011, 03:19:09 PM


ToutEnMasm


I need to identify text files without too many errors.
Option: check whether 0D0Ah is present ---> gives bad results.
Option: if it is not a PE file, perhaps it is a text file.
Option: there is nothing to identify an image (bmp, gif, ico ...) without testing each format, and there are numerous image formats.
Option: check for non-visible chars (< 32 and not 9, 13, 10): bad results (.h files have these chars ...)

Any ideas?

dedndave

there are different kinds of text files   :bg

that having been said, most "plain ascii" text files have very few characters above 127
you might do a percentage count - should be less than 1%, easily enough
of course, small text files might fail the test

below 32, you have 9, 10, 13
you might also see form feed (12)

if you see a 26, there should only be one at the end of the file
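
A minimal sketch of the control-character rules above, written in Python for brevity rather than MASM. The function name and the choice to tolerate exactly one trailing 26 are assumptions for illustration; bytes of 128 and above are deliberately not judged here.

```python
# Control codes that plain text files normally contain:
# tab (9), line feed (10), form feed (12), carriage return (13).
ALLOWED_CONTROLS = {9, 10, 12, 13}


def control_chars_look_like_text(data: bytes) -> bool:
    """Return True if every byte below 32 is one of the usual text controls.

    Hypothetical helper: a single 26 (Ctrl-Z, the old DOS end-of-file mark)
    is tolerated only as the very last byte of the file.
    """
    if data.endswith(b"\x1a"):
        data = data[:-1]
    return all(b >= 32 or b in ALLOWED_CONTROLS for b in data)
```

A 26 anywhere but the end, or any other code below 32, makes the check fail, so it is best used as one vote among several rather than a final verdict.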

dedndave

if there is a 255, it probably isn't a text file   :U
you are likely to see that value often in binary files, and not too often in text files
although, in the old days, i used to use that char as a white space in batch files
it worked with the early DOS version ECHO command   :P
you won't see too many files like that around - lol

the same is true for the value 0
text files have no use of null spaces, really - unless they are UNICODE   :P
binary files will have many

not all text files have 13/10 - many HTML files have none or very few (as well as js, css, php, xml, etc)

you could probably come up with some kind of formula
count 13's and 10's in one counter
count characters from 32 to 127 in another counter
(or better yet, count all the other characters)

when done....
adjusted file size = A = (file size) - (crs&lfs)
based on the adjusted file size, the percentage of characters from 32 to 127 should be very high

    A          minimum percentage
   10                      80
  100                      95
 1000                      99


and so on
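
The formula above could be sketched like this in Python. The step boundaries between the three table rows, the cap at 99 for larger files, and the treatment of an empty file as text are assumptions, since the post only gives three sample points and "and so on".

```python
def min_printable_percent(adjusted_size: int) -> int:
    """Minimum percentage of bytes in 32..127, taken from the table above.

    Assumption: each row applies from its size up to the next row,
    and 99 is used for everything from 1000 upward.
    """
    if adjusted_size >= 1000:
        return 99
    if adjusted_size >= 100:
        return 95
    return 80


def looks_like_text(data: bytes) -> bool:
    """Apply the size-dependent threshold to a whole file image."""
    line_breaks = sum(1 for b in data if b in (10, 13))
    printable = sum(1 for b in data if 32 <= b <= 127)
    adjusted = len(data) - line_breaks  # A = (file size) - (CRs and LFs)
    if adjusted == 0:
        return True  # assumption: empty or line-break-only files pass
    # Integer comparison avoids floating point rounding.
    return printable * 100 >= min_printable_percent(adjusted) * adjusted
```

Tabs count against the file here, exactly as in the formula, so very short tab-heavy files can fail, which matches the warning about small files.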

jj2007

In addition to Dave's good advice:
- there is Unicode (UTF-16 or UTF-8, and they usually have a BOM)
- Unix, Mac and Windows have different CrLf usage
Anyway, a statistical approach will work. Test it on a hundred exe's and images, and compare to the *.inc family.
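
As a rough illustration of the BOM point, here is a small Python sketch. The function name and the returned labels are made up for the example, and UTF-32 is ignored (a UTF-32 LE file would be reported as UTF-16 LE because its BOM starts with the same two bytes).

```python
import codecs

# BOM byte sequences from the standard library:
# EF BB BF (UTF-8), FF FE (UTF-16 LE), FE FF (UTF-16 BE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

A missing BOM proves nothing, since plenty of UTF-8 files are written without one, so the statistical test is still needed afterwards.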

FORTRANS

Hi,

   A DOS or Windows text file "should" have equal numbers of
carriage returns and line feeds.  Unix style and old Macintosh
files fail that test of course.

   A test for a file extension of *.TXT should be an easy first test.

Regards,

Steve N.
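
One possible way to turn the CR/LF comparison into a test, sketched in Python. The function name and the five labels are hypothetical; the idea is to count CRLF pairs first so that lone CRs and lone LFs can be told apart.

```python
def line_ending_style(data: bytes) -> str:
    """Classify line endings as 'dos', 'unix', 'mac', 'mixed' or 'none'."""
    crlf = data.count(b"\r\n")
    lone_cr = data.count(b"\r") - crlf  # CRs not followed by LF (old Mac)
    lone_lf = data.count(b"\n") - crlf  # LFs not preceded by CR (Unix)
    if not (crlf or lone_cr or lone_lf):
        return "none"
    if crlf and not lone_cr and not lone_lf:
        return "dos"
    if lone_lf and not lone_cr and not crlf:
        return "unix"
    if lone_cr and not lone_lf and not crlf:
        return "mac"
    return "mixed"
```

A result of "mixed" on a large file is a hint that the 13s and 10s are just binary data, while "none" says nothing either way (many HTML or XML files have no line breaks at all).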

fearless

I had a project a while back, a recursive directory walker that checked file types for a cheat scanner thing I was writing. I mainly compared against known magic header values along with the extension. I suppose it depends on what you're checking for, but these are the functions I used; not sure if they will help you (old code, some of which is not optimized).

;**************************************************************************
; Determines if the specified file is Unicode UTF16 Little Endian file:
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileUTF16LE PROC hFile:DWORD

LOCAL FileSigBuffer:WORD
LOCAL nBytesRead:DWORD

Invoke ReadFile, hFile, Addr FileSigBuffer, 2, Addr nBytesRead, NULL
.IF eax == 0
mov eax, FALSE ; failed, so return false anyhow
ret
.ENDIF
                             
; Check for signature: '..' = FF FE =  FEFFh                                                         
xor eax, eax
mov ax, FileSigBuffer
; PrintHex eax
.IF eax == 0FEFFh
mov eax, TRUE
.else
mov eax, FALSE
.endif                                                                                 

ret

IsFileUTF16LE endp

;**************************************************************************
; Determines if the specified file is Unicode UTF16 Big Endian file:
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileUTF16BE PROC hFile:DWORD

LOCAL FileSigBuffer:WORD
LOCAL nBytesRead:DWORD

Invoke ReadFile, hFile, Addr FileSigBuffer, 2, Addr nBytesRead, NULL
.IF eax == 0
mov eax, FALSE ; failed, so return false anyhow
ret
.ENDIF
                             
; Check for signature: '..' = FE FF  =  FFFEh                                                         
xor eax, eax
mov ax, FileSigBuffer
; PrintHex eax
.IF eax == 0FFFEh
mov eax, TRUE
.else
mov eax, FALSE
.endif                                                                                 

ret

IsFileUTF16BE endp

;**************************************************************************
; Determines if the specified file is Unicode UTF8 file:
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileUTF8 PROC hFile:DWORD

LOCAL FileSigBuffer:DWORD
LOCAL nBytesRead:DWORD

Invoke ReadFile, hFile, Addr FileSigBuffer, 4, Addr nBytesRead, NULL
.IF eax == 0
mov eax, FALSE ; failed, so return false anyhow
ret
.ENDIF
                             
; Check for signature: '....' = 0xEF 0xBB 0xBF  =  ??BFBBEFh                                                         

mov eax, FileSigBuffer
and eax, 00FFFFFFh ; mask off the fourth byte read (top byte of the DWORD), only the first three bytes matter
; PrintHex eax
.IF eax == 00BFBBEFh
mov eax, TRUE
.else
mov eax, FALSE
.endif                                                                                 

ret

IsFileUTF8 endp

;**************************************************************************
; Determines if the specified file is an ASCII encoded file
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileASCII PROC USES ebx esi edi hFile:DWORD ; preserve the registers the scan loop modifies

LOCAL FileSigBuffer[64]:BYTE ; could take a larger sample to check of course
LOCAL nBytesRead:DWORD

Invoke ReadFile, hFile, Addr FileSigBuffer, 64, Addr nBytesRead, NULL
.IF eax == 0
mov eax, FALSE ; failed, so return false anyhow
ret
.ENDIF
                             
; Loop through the bytes actually read, accepting character codes from space (32) to tilde (126), plus CR (13), LF (10) and tab (9)

lea esi, FileSigBuffer
lea edi, FileSigBuffer
add edi, nBytesRead ; end of the data read, which may be fewer than 64 bytes

scanasciiloop:
.if esi != edi
xor ebx, ebx
mov bl, byte ptr [esi]
.if bl >= 32d && bl <= 126d || bl == 13d || bl == 10d || bl == 09d ; ok then continue
inc esi
;PrintDec ebx
jmp scanasciiloop
.else
mov eax, FALSE
jmp endscanasciiloop
.endif
.else
mov eax, TRUE
; PrintText 'Ascii Found'
.endif                                                                                 

endscanasciiloop:

ret
IsFileASCII endp

ƒearless

ToutEnMasm


Thanks for the answers.
I have made a little progress:
Quote
The first four bytes of a file contain the file signatures or the  magic numbers that uniquely identify the file. For instance,
JPEG image file is always found to hold the value FF D8 FF E0 (Hex) in the first four bytes, GIF image file is identified by its first
three bytes as 47 49 46 and 42 4D as the first two bytes of the file indicates a Bitmap
Htm 3 first bytes EF BB BF,need confirmation.

fearless

I've more IsFile???? file type procs if you want; I made them a while ago, and they use magic header signature checks where possible. Basically my scanner type program checked for most known types and skipped the common ones (validating with IsFile???? whatever). I was only interested in files that were:

- not as advertised, like a .gif returning false for IsFileGif, or a .rar returning false for IsFileRar
- exe, dll, binary type files
- few other specific file types for game engine stuff
- and all other unknown formats

which were then more deeply searched/scanned. Let me know if you want them and I'll post them up or PM you, whichever you like - might save you some time if it is what you're looking for.

jj2007

Quote from: ToutEnMasm on September 12, 2011, 07:52:03 AM
Htm 3 first bytes EF BB BF,need confirmation.

That's not *.htm specific, it's just the UTF-8 BOM.

hutch--

Try this approach: the algo in the test piece counts all characters; then do your analysis on the counts of 13 and 10 and on which characters are above 127.



IF 0  ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
                      Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    include \masm32\include\masm32rt.inc

    char_count PROTO :DWORD,:DWORD

    .code

start:
   
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    call main
    inkey
    exit

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

main proc

    LOCAL carr[256]:DWORD       ; array to hold character counts
    LOCAL hMem  :DWORD          ; handle of text memory

    push ebx
    push esi
    push edi

    mov hMem, InputFile("\masm32\include\windows.inc")

    invoke memfill,ADDR carr,1024,0     ; zero fill array
    invoke char_count,hMem,ADDR carr    ; count characters in source

    lea esi, carr
    xor ebx, ebx

  lbl:
    mov edi, [esi+ebx*4]
    print ustr$(ebx)," --- "
    print ustr$(edi),13,10

    add ebx, 1
    cmp ebx, 255
    jle lbl

    free hMem

    pop edi
    pop esi
    pop ebx

    ret

main endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

char_count proc psrc:DWORD,parr:DWORD

    mov ecx, psrc
    mov edx, parr
    sub ecx, 1

  ; -----------
  ; unroll by 4
  ; -----------
  align 4
  lbl0:
    add ecx, 1
    movzx eax, BYTE PTR [ecx]         ; zero extend each byte into EAX
    add DWORD PTR [edx+eax*4], 1      ; increment the count for that character
    test eax, eax
    jz lbl1

    add ecx, 1
    movzx eax, BYTE PTR [ecx]
    add DWORD PTR [edx+eax*4], 1
    test eax, eax
    jz lbl1

    add ecx, 1
    movzx eax, BYTE PTR [ecx]
    add DWORD PTR [edx+eax*4], 1
    test eax, eax
    jz lbl1

    add ecx, 1
    movzx eax, BYTE PTR [ecx]
    add DWORD PTR [edx+eax*4], 1
    test eax, eax
    jnz lbl0

  lbl1:
    sub ecx, psrc                     ; calculate the length of the source
    mov eax, ecx                      ; return it to the caller

    ret

char_count endp

; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

end start

Rockphorr

Quote from: ToutEnMasm on September 11, 2011, 03:19:09 PM

I need to identify text files without too many errors.
Option: check whether 0D0Ah is present ---> gives bad results.
Option: if it is not a PE file, perhaps it is a text file.
Option: there is nothing to identify an image (bmp, gif, ico ...) without testing each format, and there are numerous image formats.
Option: check for non-visible chars (< 32 and not 9, 13, 10): bad results (.h files have these chars ...)

Any ideas?


Ask the user with a dialog window - let users read the data as they wish

YOU CANNOT IDENTIFY TEXT, YOU CAN ONLY CHECK THE FILE FORMAT

HOW DO YOU IDENTIFY DIFFERENT SOURCE CODES ??? - IS IT .C OR .CPP OR SMALLTALK ????
Strike while the iron is hot - Бей утюгом, пока он горячий

dedndave

Quote
Htm 3 first bytes EF BB BF,need confirmation.

there is no standard file marker for html or xml
the EF BB BF is for UTF-8   :P
however, you can scan the first several lines for "<html" or "<?xml" - the letters can be upper or lower case, possibly mixed

unicode (UTF-16 little endian): FF FE
unicode big endian (UTF-16 big endian): FE FF
UTF-8: EF BB BF

of course, these files should more or less meet the test for text files
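
The scan of the first several lines for a markup tag, mentioned above, might look like this in Python. The function name, the 10-line window, and the exact marker list (`<html`, `<!doctype html`, `<?xml`) are assumptions chosen for the sketch.

```python
# Markers are stored in lower case; the file header is lowered before
# the search so that any mix of upper and lower case matches.
MARKUP_MARKERS = (b"<html", b"<!doctype html", b"<?xml")


def looks_like_markup(data: bytes, max_lines: int = 10) -> bool:
    """Return True if one of the markers appears in the first max_lines lines."""
    head = b"\n".join(data.splitlines()[:max_lines]).lower()
    return any(marker in head for marker in MARKUP_MARKERS)
```

Because the search is a substring match, a leading UTF-8 BOM or leading whitespace does not get in the way.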

Quote
GIF image file is identified by its first three bytes as 47 49 46
42 4D as the first two bytes of the file indicates a Bitmap

gif files will start with either "GIF87" or "GIF89" - the year is the specification version
bmp files start with "BM", followed by the file size in binary

mpeg files start with 00 00 01 BA (program stream) or 00 00 01 B3 (video-only stream), not with a text marker
zip files start with "PK"

there are a million of em   :bg
you might make an INI file so you can easily edit or add markers in the list
not all files will be easy to add, as the marker may not always be in the first bytes
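
A sketch of the editable marker list idea, in Python with the table kept in code instead of an INI file. The table only holds markers mentioned in this thread, the type labels are arbitrary, and the function name is hypothetical. Only markers at offset 0 are handled.

```python
# (leading bytes, label) pairs, checked in order.
SIGNATURES = [
    (b"MZ", "exe"),              # Windows/DOS executable
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
    (b"PK", "zip"),
    (b"\xff\xd8\xff", "jpeg"),
]


def identify_by_signature(data: bytes):
    """Return the label of the first matching marker, or None."""
    for magic, kind in SIGNATURES:
        if data.startswith(magic):
            return kind
    return None
```

Two-byte markers such as "BM", "PK" and "MZ" are also perfectly legal starts for a text file, so a hit on one of them is worth confirming with a second check (for bmp, the file size stored after the marker) before ruling text out.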

ToutEnMasm


a little more progress


http://www.garykessler.net/library/file_sigs.html

Quote
filter out what is not text
   .if word ptr [esi] == "ZM"
      jmp   Extension_filtrer ;Windows/DOS executable file
   .elseif word ptr [esi] == 0D8FFh  ;jpeg probable
      jmp   Extension_filtrer
   .elseif word ptr [esi] == 4947h ; IG gif probable
      jmp   Extension_filtrer         
   .elseif word ptr [esi] == 0BBEFh ;UTF-8 BOM (shows up as bad chars in notepad)
      jmp   Extension_filtrer   
   .elseif word ptr [esi] == 4D42h ;MB bitmap probable
      jmp   Extension_filtrer   
   .elseif byte ptr [esi] < 32 && byte ptr [esi+1] < 32
      mov eax,0
      .if byte ptr [esi] == 13
         inc eax
      .elseif byte ptr [esi] == 10   
         inc eax      
      .elseif byte ptr [esi] == 9
         inc eax      
      .endif
      .if byte ptr [esi+1] == 13
         inc eax
      .elseif byte ptr [esi+1] == 10   
         inc eax      
      .elseif byte ptr [esi+1] == 9
         inc eax      
      .endif         
      .if eax != 2      ;not  text file                           
         jmp Extension_filtrer
      .endif
   .endif


ToutEnMasm


Here is the output of hutch's test piece on windows.inc
Quote
0 --- 1
;-------------------  entries with a count of 0 cut to shorten the list
9 --- 2210
10 --- 22274
11 --- 0
12 --- 0
13 --- 22274
;-----------------------
32 --- 304722
33 --- 1
34 --- 98
35 --- 1
36 --- 0
37 --- 0
38 --- 3
39 --- 13
40 --- 878
41 --- 878
42 --- 223
43 --- 1394
44 --- 121
45 --- 2725
46 --- 44
47 --- 10
48 --- 23384
49 --- 6054
50 --- 4092
51 --- 2287
52 --- 3287
53 --- 1388
54 --- 1413
55 --- 1142
56 --- 2318
57 --- 756
58 --- 66
59 --- 372
60 --- 717
61 --- 170
62 --- 717
63 --- 3773
64 --- 1
65 --- 19032
66 --- 6634
67 --- 15022
68 --- 18112
69 --- 34892
70 --- 8078
71 --- 6301
72 --- 5089
73 --- 19942
74 --- 681
75 --- 2313
76 --- 13266
77 --- 13631
78 --- 18654
79 --- 21242
80 --- 10732
81 --- 581
82 --- 26495
83 --- 21251
84 --- 27028
85 --- 7855
86 --- 4884
87 --- 7097
88 --- 2349
89 --- 3906
90 --- 669
91 --- 9
92 --- 0
93 --- 8
94 --- 0
95 --- 29162
96 --- 1
97 --- 2788
98 --- 784
99 --- 1751
100 --- 3368
101 --- 20213
102 --- 1217
103 --- 799
104 --- 8632
105 --- 2733
106 --- 54
107 --- 312
108 --- 1836
109 --- 1516
110 --- 2227
111 --- 2099
112 --- 1607
113 --- 14871
114 --- 2781
115 --- 2372
116 --- 3593
117 --- 16020
118 --- 386
119 --- 1027
120 --- 453
121 --- 767
122 --- 440
123 --- 9
124 --- 0
125 --- 9
126 --- 37
;----------------
171 --- 365
;----------------


dedndave

171 = 1/2   :P

i was thinking you might see the copyright symbol once in a while (169)

of course, in non-English text files, you may see other chars, like those with accents and rings above
more than likely, they would be unicode files, though