News:

MASM32 SDK Description, downloads and other helpful links
MASM32.com New Forum Link
masmforum WebSite

Read a file then build a frequency table

Started by Scalemail Ted, March 16, 2010, 02:25:53 PM

Previous topic - Next topic

Scalemail Ted

Hello everyone,

I'm having massive trouble trying to get this to work...

I'd like  to build a procedure that will read a file, one character at a time, and then pass that character into an array that stores its value by frequency.

I've experimented, I have a proc that can build a freq table from a string.

I have a proc that can read in a file one character at a time.

My attempt to merge the two have failed.

This is a partial segment of my code.

I know, I'm doing something wrong... obviously, b/c its not working :p

Any feedback would be awesome.




donkey

Haven't test this but it should be about what your looking for.

CountFreq PROC uses edi esi pszFile:DWORD, pFreqBins:DWORD
LOCAL fsh:DWORD
LOCAL hFile:DWORD
LOCAL pFileData:DWORD
LOCAL cbFile:DWORD

invoke RtlZeroMemory,[pFreqBins],256
invoke CreateFile,[pszFile],GENERIC_READ,NULL,NULL,OPEN_EXISTING,NULL,NULL
mov [hFile],eax
invoke GetFileSize,[hFile],addr fsh
mov [cbFile], eax
invoke VirtualAlloc,NULL,eax,MEM_COMMIT,PAGE_READWRITE
mov [pFileData],eax
invoke ReadFile,[hFile],[pFileData],[cbFile],addr fsh,NULL
invoke CloseHandle,[hFile]

mov ecx,[fsh]
mov edi,[pFileData]
mov esi,[pFreqBins]
@@:
mov eax,[edi+ecx]
add DWORD PTR[esi+eax*4],1
dec ecx
jns @B

@@:
invoke VirtualFree,[pFileData],NULL,MEM_RELEASE
RET
CountFreq ENDP


It will fill a buffer of "bins" with the frequency that each one is encountered. For example A is 42h, so bin 42h will have the number of times character A appears in the file.

Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

Scalemail Ted

Its like it partially works.. something is off...

So I test it with a file that contains:

ABCDEEFDABDEDAE

And when I use this method in conjunction with my sort proc i get:

D
A
B
C
F

I should get:

D
E
A
B
C
F

donkey

Well, I'd check the sort proc to make sure it isn't the problem. try dumping the array to make sure that the data for E is present and if it is revisit the sort proc. Here's the same thing with mapping so you don't have to allocate so much memory:

CountFreq2 PROC uses edi esi pszFile:DWORD, pFreqBins:DWORD
LOCAL fsh:DWORD
LOCAL hFile:DWORD
LOCAL MapFileOffset:DWORD
LOCAL hMapFile:DWORD

invoke RtlZeroMemory,[pFreqBins],1024
invoke CreateFile,[pszFile],GENERIC_READ,NULL,NULL,OPEN_EXISTING,NULL,NULL
mov [hFile],eax
invoke CreateFileMapping,[hFile],NULL,PAGE_READONLY,0,0,NULL
mov [hMapFile],eax
invoke MapViewOfFile,[hMapFile],FILE_MAP_READ,0,0,0
mov [MapFileOffset],eax

invoke GetFileSize,[hFile],addr fsh

mov ecx,eax
mov edi,[MapFileOffset]
mov esi,[pFreqBins]
xor eax,eax
@@:
mov al,[edi+ecx]
add DWORD PTR[esi+eax*4],1
dec ecx
jns @B

invoke UnmapViewOfFile,[MapFileOffset]
invoke CloseHandle,[hMapFile]
invoke CloseHandle,[hFile]
RET
CountFreq2 ENDP
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

donkey

Well, I tried it and dumped a file containing ABCDEFHIJA, the result was correct:

00403BE8:  00 02 01 01-01 01 01 00-01 01 01 00-00 00 00 00   ................

I cut out most of the dump because it was just filled with 0's. I used byte sized bins instead of DWORD sized ones for the test but I tested it with DWORD sized ones and it works fine as well.
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

dedndave


@@:     mov     al,[edi]
        inc dword ptr [esi+eax*4]
        inc     edi
        dec     ecx
        jnz     @B

donkey

This was an interesting little project, thanks. I have tested a sort for the bins and found that it is easier to store the character in the bin as well, so I defined a structure for the bin and recycled a sort routine for it. A quick test shows it works though since I wrote it pretty quickly I can't vouch for the speed...

CHARBIN STRUCT
sum dd ?
char dd ?
CHARBIN ENDS

CountFreq2 PROC uses edi esi pszFile:DWORD, pFreqBins:DWORD
LOCAL fsh:DWORD
LOCAL hFile:DWORD
LOCAL MapFileOffset:DWORD
LOCAL hMapFile:DWORD

invoke RtlZeroMemory,[pFreqBins],2048
mov eax,255
mov edi,[pFreqBins]
@@:
; Load the array
mov [edi+eax*8+4],eax
dec eax
jns @B

invoke CreateFile,[pszFile],GENERIC_READ,NULL,NULL,OPEN_EXISTING,NULL,NULL
mov [hFile],eax
invoke CreateFileMapping,hFile,NULL,PAGE_READONLY,0,0,NULL
mov [hMapFile],eax
invoke SetLastError,0
invoke MapViewOfFile,[hMapFile],FILE_MAP_READ,0,0,0
mov [MapFileOffset],eax

invoke GetFileSize,[hFile],addr fsh

mov ecx,eax
dec ecx
mov edi,[MapFileOffset]
mov esi,[pFreqBins]
xor eax,eax
@@:
mov al,[edi]
add DWORD PTR[esi+eax*8],1
inc edi ; thanks Dave
dec ecx
jns @B

invoke UnmapViewOfFile,[MapFileOffset]
invoke CloseHandle,[hMapFile]
invoke CloseHandle,[hFile]
RET
CountFreq2 ENDP

SortBins proc uses esi edi ebx edx pData:DWORD
LOCAL done :DWORD
LOCAL Jump :DWORD
LOCAL nItems :DWORD
LOCAL i :SDWORD

mov [nItems],256
mov Jump,255
mov eax,pData

.WHILE Jump > 1
shr Jump,1
.REPEAT
mov edi,nItems
sub edi,Jump
mov i,edi
mov edx,1
mov done,TRUE
.WHILE SDWORD PTR edx <= i
mov esi,Jump
add esi,edx
mov ebx,[eax+esi*8-8]
mov ecx,[eax+edx*8-8]
.IF ecx < ebx
mov [eax+edx*8-8],ebx
mov [eax+esi*8-8],ecx
;Swap the char data
mov ebx,[eax+esi*8-4]
mov ecx,[eax+edx*8-4]
mov [eax+edx*8-4],ebx
mov [eax+esi*8-4],ecx
mov done,FALSE
.ENDIF
inc edx
.ENDW
.UNTIL done
.ENDW

ret

SortBins endp


You call it like this:

Bins CHARBIN 256 DUP (<?>)

invoke CountFreq2,offset FileName,offset Bins
invoke SortBins,offset Bins

I made one small change to the CountFreq2 proc, it decrements the file size by one so we don't get the NULL at the end.

Tested on the following:
AABBCDDEFHIBBJA

result:
00403050:  04 00 00 00-42 00 00 00-03 00 00 00-41 00 00 00   ....B.......A...
00403060:  02 00 00 00-44 00 00 00-01 00 00 00-43 00 00 00   ....D.......C...
00403070:  01 00 00 00-46 00 00 00-01 00 00 00-49 00 00 00   ....F.......I...
00403080:  01 00 00 00-45 00 00 00-01 00 00 00-4A 00 00 00   ....E.......J...
00403090:  01 00 00 00-48 00 00 00-00 00 00 00-01 00 00 00   ....H...........


Edgar
"Ahhh, what an awful dream. Ones and zeroes everywhere...[shudder] and I thought I saw a two." -- Bender
"It was just a dream, Bender. There's no such thing as two". -- Fry
-- Futurama

Donkey's Stable

KeepingRealBusy

Scalemail Ted,

The actual error is:

Quote   mov al, sourcefile_buffer

That simple.

The problem is that you load only the last byte of a 4 byte register (the other three upper bytes remain unchanged), then you use the entire register to index the location to update. You can fix either by:

xor eax,eax

to clear eax BEFORE you load al, or use:

movzx eax,BYTE PTR sourcefile_buffer

which clears it while loading the LoByte.

Dave.

Scalemail Ted

Thanks everyone,  I have it working now! :U


I do have another question pertaining to the register sizes and writing data to a file. 

What exactly is occurring when you write to a file.  Do you have to convert your data into byte sized char values to read them into the buffer so that you can write it out to a file?

So far I can read a file, index all of its characters, sort the characters from most common to most rare, and now I want to write that out to a new file.  but the newly created files contain inconsistent chars. 

Im assuming I need to to convert the data before I can write it to a file?

dedndave

Quotebut the newly created files contain inconsistent chars

the data you have created is binary, in this case dwords
what you'd probably like to see is a list of decimal numbers that represent the counts for each of the 256 ASCII characters
something like...

000 123
001 456
.
.
255 1234567890

those are decimal ASCII strings, seperated by a space, and followed by a carriage return/line feed sequence
the first column can just be generated - no need to convert - and can include the space character
the second column needs a little routine to convert binary dwords into decimal ASCII strings
a dword can hold unsigned values up to 4294967295 (10 decimal digits)
you can use a macro to do this conversion, but you don't really learn anything that way
use the forum search tool to see if you can find a dword to decimal ASCII routine
you can create the entire list in a buffer area, then write it to disk all at once

Scalemail Ted

Quote from: dedndave on March 17, 2010, 11:55:45 AM
Quotebut the newly created files contain inconsistent chars

the data you have created is binary, in this case dwords
what you'd probably like to see is a list of decimal numbers that represent the counts for each of the 256 ASCII characters
something like...

000 123
001 456
.
.
255 1234567890

those are decimal ASCII strings, seperated by a space, and followed by a carriage return/line feed sequence
the first column can just be generated - no need to convert - and can include the space character
the second column needs a little routine to convert binary dwords into decimal ASCII strings
a dword can hold unsigned values up to 4294967295 (10 decimal digits)
you can use a macro to do this conversion, but you don't really learn anything that way
use the forum search tool to see if you can find a dword to decimal ASCII routine
you can create the entire list in a buffer area, then write it to disk all at once

awesome.

Just to  make sure I understand properly.  If I have an integer value stored within a DWORD (say for example 9000) and if I wanted to write it to a file I would use this latter method and convert it into a new decimal ascii string && then if I wanted to read that file later I would run a similar routine to reconvert that string back into my original number?

dedndave

well - if you want the data in binary form, you may want to save it that way and skip the conversion steps
i figured you wanted a text file that you could view with notepad or something
but, you can store the raw binary data, then look at it with a hex editor or something to view it

hutch--

Unless you need to read the string in ASCII outside of your code with something else, just write the DWORD data 4 bytes at a time and read it back the same way. Saves conversions in both directions.
Download site for MASM32      New MASM Forum
https://masm32.com          https://masm32.com/board/index.php

dedndave


Scalemail Ted

Let me explain what I want to do...

It seems so simple its stupid to ask... but it has me stuck right now and I dont know how to deal with it yet.

I'm building a file compressor... and so now i want to start to write the zipped file.

the first thing i wanted to do was write a header containing the numerical count of total characters in the file and then my ascii/keycode index.

seems simple, but my counter which tallies up the character count is a dword... my test case was 9258 characters.  How do I wirte this to a file so that when i decompress it my program  will recogonize it?

Am i limited to just the char in write file...will i need to reanlate my keys in the header.... or will the computer know what the values are when it reads the file again with no additional actions from me?