I need to identify text files without too many errors.
Test 1: does 0D0Ah (CR LF) occur in the file? ---> gives bad results.
Test 2: if it is not a PE file, perhaps it is a text file.
Test 3: there is nothing that identifies an image (bmp, gif, ico, ...) without testing each format, and there are numerous image formats.
Test 4: are there non-visible chars (below 32 and other than 9, 13, 10)? ---> bad results (.h files have these chars ...)
Any ideas?
there are different kinds of text files :bg
that having been said, most "plain ascii" text files have very few characters above 127
you might do a percentage count - should be less than 1%, easily enough
of course, small text files might fail the test
below 32, you have 9, 10, 13
you might also see form feed (12)
if you see a 26, there should only be one at the end of the file
if there is a 255, it probably isn't a text file :U
you are likely to see that value often in binary files, and not too often in text files
although, in the old days, i used to use that char as a white space in batch files
it worked with the early DOS version ECHO command :P
you won't see too many files like that around - lol
the same is true for the value 0
text files have no use of null spaces, really - unless they are UNICODE :P
binary files will have many
not all text files have 13/10 - many HTML files have none or very few (as well as js, css, php, xml, etc)
you could probably come up with some kind of formula
count 13's and 10's in one counter
count characters from 32 to 127 in another counter
(or better yet, count all the other characters)
when done....
adjusted file size = A = (file size) - (crs&lfs)
based on the adjusted file size, the percentage of characters from 32 to 127 should be very high
A        minimum percentage
10       80
100      95
1000     99
and so on
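
The formula above can be sketched in Python roughly as follows. This is only a sketch under assumptions: the function name `looks_like_text` is made up, the table is applied as a step function (80 below A = 100, 95 below A = 1000, 99 from there on), and an empty file is treated as text.

```python
def looks_like_text(data: bytes) -> bool:
    """Percentage test: ignore CRs and LFs, then require a high share of bytes 32..127."""
    crlf = data.count(b"\r") + data.count(b"\n")
    adjusted = len(data) - crlf              # A = (file size) - (CRs and LFs)
    if adjusted <= 0:
        return True                          # assumption: empty or line-break-only data counts as text
    printable = sum(1 for b in data if 32 <= b <= 127)
    percent = 100.0 * printable / adjusted
    # Step-function reading of the table (an assumption, the post only gives three points)
    if adjusted >= 1000:
        minimum = 99
    elif adjusted >= 100:
        minimum = 95
    else:
        minimum = 80
    return percent >= minimum
```

Tabs (9) count against the percentage here, so a very small file such as `b"a\tb"` fails, which matches the warning about small text files.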
In addition to Dave's good advice:
- there is Unicode (UTF-16 or UTF-8; UTF-16 files usually have a BOM, UTF-8 files only sometimes)
- Unix, Mac and Windows use different line endings (LF, CR on old Macs, and CR LF respectively)
Anyway, a statistical approach will work. Test it on a hundred exe's and images, and compare to the *.inc family.
Hi,
A DOS or Windows text file "should" have equal numbers of
carriage returns and line feeds. Unix style and old Macintosh
files fail that test of course.
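
A rough Python sketch of that carriage return / line feed comparison, extended to tell the three conventions apart. The function name `line_ending_style` and the returned labels are made up for illustration.

```python
def line_ending_style(data: bytes) -> str:
    """Classify line endings by counting CR LF pairs, bare CRs and bare LFs."""
    crlf = data.count(b"\r\n")
    bare_cr = data.count(b"\r") - crlf       # CRs not followed by LF
    bare_lf = data.count(b"\n") - crlf       # LFs not preceded by CR
    if not (crlf or bare_cr or bare_lf):
        return "none"                        # no line breaks at all
    if crlf and not bare_cr and not bare_lf:
        return "dos"                         # equal numbers of CRs and LFs, always paired
    if bare_lf and not crlf and not bare_cr:
        return "unix"
    if bare_cr and not crlf and not bare_lf:
        return "oldmac"
    return "mixed"
```

A result of "mixed" does not prove a file is binary, but together with the other tests it is one more hint.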
A test for a file extension of *.TXT should be an easy first test.
Regards,
Steve N.
I had a project a while back, a recursive directory walker that checked file types for a cheat scanner thing I was writing. I mainly compared against known magic header values along with the extension. I suppose it depends on what you're checking for, but these are the functions I used. Not sure if they will help you (old code, some of which is not optimized).
;**************************************************************************
; Determines if the specified file is Unicode UTF16 Little Endian file:
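; hFile is the handle of the opened file to check
; Note: ReadFile reads from the current file pointer, so rewind the file
; (e.g. with SetFilePointer) before each IsFile... call on the same handle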
;**************************************************************************
IsFileUTF16LE PROC hFile:DWORD
LOCAL FileSigBuffer:WORD
LOCAL nBytesRead:DWORD
Invoke ReadFile, hFile, Addr FileSigBuffer, 2, Addr nBytesRead, NULL
.IF eax == 0 || nBytesRead < 2
mov eax, FALSE ; read failed or file shorter than the signature, so return false
ret
.ENDIF
; Check for signature: '..' = FF FE = FEFFh
xor eax, eax
mov ax, FileSigBuffer
; PrintHex eax
.IF eax == 0FEFFh
mov eax, TRUE
.else
mov eax, FALSE
.endif
ret
IsFileUTF16LE endp
;**************************************************************************
; Determines if the specified file is Unicode UTF16 Big Endian file:
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileUTF16BE PROC hFile:DWORD
LOCAL FileSigBuffer:WORD
LOCAL nBytesRead:DWORD
Invoke ReadFile, hFile, Addr FileSigBuffer, 2, Addr nBytesRead, NULL
.IF eax == 0 || nBytesRead < 2
mov eax, FALSE ; read failed or file shorter than the signature, so return false
ret
.ENDIF
; Check for signature: '..' = FE FF = FFFEh
xor eax, eax
mov ax, FileSigBuffer
; PrintHex eax
.IF eax == 0FFFEh
mov eax, TRUE
.else
mov eax, FALSE
.endif
ret
IsFileUTF16BE endp
;**************************************************************************
; Determines if the specified file is Unicode UTF8 file:
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileUTF8 PROC hFile:DWORD
LOCAL FileSigBuffer:DWORD
LOCAL nBytesRead:DWORD
Invoke ReadFile, hFile, Addr FileSigBuffer, 4, Addr nBytesRead, NULL
.IF eax == 0 || nBytesRead < 3
mov eax, FALSE ; read failed or file shorter than the 3-byte signature, so return false
ret
.ENDIF
; Check for signature: '...' = EF BB BF, which reads as the little-endian DWORD ??BFBBEFh
mov eax, FileSigBuffer
and eax, 00FFFFFFh ; mask off the fourth (highest) byte as we don't care about that one
; PrintHex eax
.IF eax == 00BFBBEFh
mov eax, TRUE
.else
mov eax, FALSE
.endif
ret
IsFileUTF8 endp
;**************************************************************************
; Determines if the specified file is an ASCII encoded file
; hFile is the handle of the opened file to check
;**************************************************************************
IsFileASCII PROC USES ebx esi edi hFile:DWORD
LOCAL FileSigBuffer[64]:BYTE ; could take a larger sample to check of course
LOCAL nBytesRead:DWORD
Invoke ReadFile, hFile, Addr FileSigBuffer, 64, Addr nBytesRead, NULL
.IF eax == 0 || nBytesRead == 0
mov eax, FALSE ; read failed or empty file, so return false anyhow
ret
.ENDIF
; Loop through the bytes actually read, accepting space (32 = 20h) through tilde (126 = 7Eh),
; plus tab (9), line feed (10) and carriage return (13)
lea esi, FileSigBuffer
mov edi, esi
add edi, nBytesRead ; end of the valid data, which may be less than 64 bytes
scanasciiloop:
.if esi != edi
movzx ebx, byte ptr [esi]
.if bl >= 32 && bl <= 126 || bl == 13 || bl == 10 || bl == 9 ; ok then continue
inc esi
;PrintDec ebx
jmp scanasciiloop
.else
mov eax, FALSE
jmp endscanasciiloop
.endif
.else
mov eax, TRUE
; PrintText 'Ascii Found'
.endif
endscanasciiloop:
ret
IsFileASCII endp
Thanks for answers,
I have made a little progress:
Quote
The first four bytes of a file contain the file signatures or the magic numbers that uniquely identify the file. For instance,
JPEG image file is always found to hold the value FF D8 FF E0 (Hex) in the first four bytes, GIF image file is identified by its first
three bytes as 47 49 46 and 42 4D as the first two bytes of the file indicates a Bitmap
HTM: the first 3 bytes are EF BB BF, needs confirmation.
I've more IsFile???? file types procs if you want. I made them a while ago, and they use magic header signature checks where possible. Basically my scanner type program checked for most known types and skipped the common ones (validating with IsFile???? whatever). I was only interested in files that were:
- not as advertised, like a .gif returning false for IsFileGif, or a .rar returning false for IsFileRar
- exe, dll, binary type files
- few other specific file types for game engine stuff
- and all other unknown formats
which were then more deeply searched/scanned. Let me know if you want them and I'll post them up or PM you, whichever you like - might save you some time if it is what you're looking for.
Quote from: ToutEnMasm on September 12, 2011, 07:52:03 AM
HTM: the first 3 bytes are EF BB BF, needs confirmation.
That's not *.htm specific, it's just the UTF-8 BOM.
Try this approach: the algo in the test piece counts all characters, then you do your analysis on the counts of 13 and 10 and on which characters are above 127.
IF 0 ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
Build this template with "CONSOLE ASSEMBLE AND LINK"
ENDIF ; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
include \masm32\include\masm32rt.inc
char_count PROTO :DWORD,:DWORD
.code
start:
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
call main
inkey
exit
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
main proc
LOCAL carr[256]:DWORD ; array to hold character counts
LOCAL hMem :DWORD ; handle of text memory
push ebx
push esi
push edi
mov hMem, InputFile("\masm32\include\windows.inc")
invoke memfill,ADDR carr,1024,0 ; zero fill array
invoke char_count,hMem,ADDR carr ; count characters in source
lea esi, carr
xor ebx, ebx
lbl:
mov edi, [esi+ebx*4]
print ustr$(ebx)," --- "
print ustr$(edi),13,10
add ebx, 1
cmp ebx, 255
jle lbl
free hMem
pop edi
pop esi
pop ebx
ret
main endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
char_count proc psrc:DWORD,parr:DWORD
mov ecx, psrc
mov edx, parr
sub ecx, 1
; -----------
; unroll by 4
; -----------
align 4
lbl0:
add ecx, 1
movzx eax, BYTE PTR [ecx] ; zero extend each byte into EAX
add DWORD PTR [edx+eax*4], 1 ; increment the count for that character
test eax, eax
jz lbl1
add ecx, 1
movzx eax, BYTE PTR [ecx]
add DWORD PTR [edx+eax*4], 1
test eax, eax
jz lbl1
add ecx, 1
movzx eax, BYTE PTR [ecx]
add DWORD PTR [edx+eax*4], 1
test eax, eax
jz lbl1
add ecx, 1
movzx eax, BYTE PTR [ecx]
add DWORD PTR [edx+eax*4], 1
test eax, eax
jnz lbl0
lbl1:
sub ecx, psrc ; calculate the length of the source
mov eax, ecx ; return it to the caller
ret
char_count endp
; ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
end start
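
For comparison, the same counting idea as a short Python sketch. The names `byte_histogram` and `summarize` are made up. One deliberate difference: char_count above stops at the first zero byte, which suits text already loaded as a zero-terminated string, while this sketch counts every byte so it can also be run on binary files.

```python
from collections import Counter

def byte_histogram(data: bytes) -> list:
    """Return a 256-entry list where entry n is the number of bytes with value n."""
    counts = Counter(data)                   # iterating bytes yields ints 0..255
    return [counts[value] for value in range(256)]

def summarize(hist: list) -> dict:
    """Pull out the counts the thread suggests looking at."""
    return {
        "cr": hist[13],
        "lf": hist[10],
        "nul": hist[0],
        "high": sum(hist[128:]),             # everything above 127
        "total": sum(hist),
    }
```

Usage would be something like `summarize(byte_histogram(open(path, "rb").read()))`, with `path` being whatever file is under test.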
Quote from: ToutEnMasm on September 11, 2011, 03:19:09 PM
I need to identify text files without too many errors.
Test 1: does 0D0Ah (CR LF) occur in the file? ---> gives bad results.
Test 2: if it is not a PE file, perhaps it is a text file.
Test 3: there is nothing that identifies an image (bmp, gif, ico, ...) without testing each format, and there are numerous image formats.
Test 4: are there non-visible chars (below 32 and other than 9, 13, 10)? ---> bad results (.h files have these chars ...)
Any ideas?
Ask the user with a dialog window - let users read the data as they wish
YOU CANNOT IDENTIFY TEXT, YOU CAN ONLY CHECK THE FILE FORMAT
HOW DO YOU IDENTIFY DIFFERENT SOURCE CODES ??? - IS IT .C OR .CPP OR SMALLTALK ????
Quote
HTM: the first 3 bytes are EF BB BF, needs confirmation.
there is no standard file marker for html or xml
the EF BB BF is for UTF-8 :P
however, you can scan the first several lines for "<html" (the letters can be upper or lower case, possibly mixed) or "<?xml"
UTF-16 little endian (what Windows calls "unicode"): FF FE
UTF-16 big endian: FE FF
UTF-8: EF BB BF
of course, these files should more or less meet the test for text files
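
The three markers listed above can be checked in a few lines of Python. This is a minimal sketch: `detect_bom` is a made-up name, while the `codecs.BOM_*` constants are part of the standard library. Note that a UTF-32 little endian file also starts with FF FE (followed by 00 00), which this sketch does not try to distinguish.

```python
import codecs

# (marker bytes, label) pairs, checked in order
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),              # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),      # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),      # FE FF
]

def detect_bom(head: bytes):
    """Return the encoding label for a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name
    return None
```

A missing BOM says nothing either way, so a `None` result should fall through to the statistical tests.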
Quote
GIF image file is identified by its first three bytes as 47 49 46
42 4D as the first two bytes of the file indicates a Bitmap
gif files will start with either "GIF87a" or "GIF89a" - the year is the specification version
bmp files start with "BM", followed by the file size in binary
mpeg files have no ASCII marker - a program stream typically starts with the bytes 00 00 01 BA
zip files start with "PK"
there are a million of em :bg
you might make an INI file so you can easily edit or add markers in the list
not all files will be easy to add, as the marker may not always be in the first bytes
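
An editable marker list along those lines might look like this in Python. It is a sketch only: the list is kept in the source instead of an INI file, `identify` is a made-up name, and only markers found at offset 0 are covered.

```python
# (leading bytes, label) pairs; add or edit entries as needed
SIGNATURES = [
    (b"MZ", "exe"),                          # Windows/DOS executable
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
    (b"PK\x03\x04", "zip"),
    (b"\xff\xd8\xff", "jpeg"),
]

def identify(head: bytes):
    """Return the label of the first signature that matches the start of head, else None."""
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return kind
    return None
```

Short printable markers such as "BM" and "MZ" can also open an ordinary text file, so a match on those probably deserves a second check before the file is rejected as non-text.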
a little more progress
http://www.garykessler.net/library/file_sigs.html
Quote
;reject what is not text
.if word ptr [esi] == "ZM"
jmp Extension_filtrer ;Windows/DOS executable file
.elseif word ptr [esi] == 0D8FFh ;jpeg probable
jmp Extension_filtrer
.elseif word ptr [esi] == 4947h ; IG gif probable
jmp Extension_filtrer
.elseif word ptr [esi] == 0BBEFh ;UTF_8 (made bad char ù*¤ in notepad)
jmp Extension_filtrer
.elseif word ptr [esi] == 4D42h ;MB bitmap probable
jmp Extension_filtrer
.elseif byte ptr [esi] < 32 && byte ptr [esi+1] < 32
mov eax,0
.if byte ptr [esi] == 13
inc eax
.elseif byte ptr [esi] == 10
inc eax
.elseif byte ptr [esi] == 9
inc eax
.endif
.if byte ptr [esi+1] == 13
inc eax
.elseif byte ptr [esi+1] == 10
inc eax
.elseif byte ptr [esi+1] == 9
inc eax
.endif
.if eax != 2 ;not text file
jmp Extension_filtrer
.endif
.endif
Here is the output of hutch's test piece on windows.inc
Quote
0 --- 1
;------------------- zero counts cut to keep it shorter
9 --- 2210
10 --- 22274
11 --- 0
12 --- 0
13 --- 22274
;-----------------------
32 --- 304722
33 --- 1
34 --- 98
35 --- 1
36 --- 0
37 --- 0
38 --- 3
39 --- 13
40 --- 878
41 --- 878
42 --- 223
43 --- 1394
44 --- 121
45 --- 2725
46 --- 44
47 --- 10
48 --- 23384
49 --- 6054
50 --- 4092
51 --- 2287
52 --- 3287
53 --- 1388
54 --- 1413
55 --- 1142
56 --- 2318
57 --- 756
58 --- 66
59 --- 372
60 --- 717
61 --- 170
62 --- 717
63 --- 3773
64 --- 1
65 --- 19032
66 --- 6634
67 --- 15022
68 --- 18112
69 --- 34892
70 --- 8078
71 --- 6301
72 --- 5089
73 --- 19942
74 --- 681
75 --- 2313
76 --- 13266
77 --- 13631
78 --- 18654
79 --- 21242
80 --- 10732
81 --- 581
82 --- 26495
83 --- 21251
84 --- 27028
85 --- 7855
86 --- 4884
87 --- 7097
88 --- 2349
89 --- 3906
90 --- 669
91 --- 9
92 --- 0
93 --- 8
94 --- 0
95 --- 29162
96 --- 1
97 --- 2788
98 --- 784
99 --- 1751
100 --- 3368
101 --- 20213
102 --- 1217
103 --- 799
104 --- 8632
105 --- 2733
106 --- 54
107 --- 312
108 --- 1836
109 --- 1516
110 --- 2227
111 --- 2099
112 --- 1607
113 --- 14871
114 --- 2781
115 --- 2372
116 --- 3593
117 --- 16020
118 --- 386
119 --- 1027
120 --- 453
121 --- 767
122 --- 440
123 --- 9
124 --- 0
125 --- 9
126 --- 37
;----------------
171 --- 365
;----------------
171 = 1/2 in the OEM (DOS) code page, « in the Windows ANSI code page :P
i was thinking you might see the copyright symbol once in a while (169)
of course, in non-English text files, you may see other chars, like those with accents and rings above
more than likely, they would be unicode files, though
Here is something that gives a good result:
Quote
;------------------------ see if ASCII file ---------------------------
;Release\vc90.idb
;test 127 --> 168 over 1000 characters
mov ecx,ArchiveFile.Tfichier ;size of file
.if ecx > 1000
mov ecx,1000
.else
;don't include the zero end
dec ecx
.if ecx == 0 ;NULL size
mov retour,1
jmp Findeadd_to_archive ;Nothing to do
.endif
.endif
@@:
.if byte ptr [esi] != 0
.if byte ptr [esi] >= 127 && byte ptr [esi] <= 144 ;non-printable char
jmp Extension_filtrer ; NOT a text file
.endif
.if byte ptr [esi] >= 147 && byte ptr [esi] <= 159 ;non-printable char
jmp Extension_filtrer
.endif
;a good tool to see that is maketbl in the masm32 package
.else
jmp Extension_filtrer
.endif
dec ecx
.if ecx != 0
inc esi
jmp @B
.endif
mov esi,ArchiveFile.Pfichier
;pointer to ext
.if word ptr [esi] == "ZM"
jmp Extension_filtrer ;Windows/DOS executable file
.elseif word ptr [esi] == "iM" ;microsoft
invoke lstrcmpi,esi,SADR("Microsoft") ;text
.if eax == 0
jmp Extension_filtrer
.endif
.elseif word ptr [esi] == 0D8FFh ;jpeg probable
jmp Extension_filtrer
.elseif word ptr [esi] == 4947h ; IG gif probable
jmp Extension_filtrer
.elseif word ptr [esi] == 0BBEFh ;UTF_8 (made bad char ù*¤ in notepad)
jmp Extension_filtrer
.elseif word ptr [esi] == 4D42h ;MB bitmap probable
;need more test here
jmp Extension_filtrer
.elseif byte ptr [esi] < 32 && byte ptr [esi+1] < 32
mov eax,0
.if byte ptr [esi] == 13
inc eax
.elseif byte ptr [esi] == 10
inc eax
.elseif byte ptr [esi] == 9
inc eax
.endif
.if byte ptr [esi+1] == 13
inc eax
.elseif byte ptr [esi+1] == 10
inc eax
.elseif byte ptr [esi+1] == 9
inc eax
.endif
.if eax != 2 ;not text file
jmp Extension_filtrer
.endif
.endif
rejet_signature:
mov NbRetourLigne,1
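
The first part of that code (the scan over the first 1000 characters) can be summarized by this Python sketch. `passes_prefix_scan` and `limit` are made-up names, and the handling of the trailing zero for short files is left out for simplicity.

```python
def passes_prefix_scan(data: bytes, limit: int = 1000) -> bool:
    """Reject on a zero byte or on a byte in 127..144 or 147..159 within the first limit bytes."""
    for b in data[:limit]:
        if b == 0:
            return False                     # zero byte: not a text file
        if 127 <= b <= 144 or 147 <= b <= 159:
            return False                     # ranges treated as non-printable in the code above
    return True
```

Bytes 145 and 146 are let through, as in the code above, and anything past the first `limit` bytes is not looked at.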