Delimited File vs. Fixed-Field

Started by dick_grindler, February 28, 2006, 05:37:56 AM

dick_grindler

Hello all,

On an Intel x86-based system, which would be the more efficient file format -- in terms of space and parsing speed -- delimited or fixed-field?

In case this is unclear, a delimited text file looks something like this:

"field_1","field_2","field_3",...,"field_N"

A fixed-field file looks something like this:

field_1                              field_2               field_3

i.e. each field has a fixed size, with unused bytes padded with spaces.

The intended file will contain 1 million lines/records, with each line not exceeding 200 bytes.

My reasoning on the matter is that the fixed-field file will be space inefficient because in practice not every field will be fully populated, e.g. if we allocate 20 bytes for a first name, in most cases 10 bytes will be redundant.  In terms of parsing or tokenising efficiency I don't think there will be much difference, because (a) each record is small enough to fit in memory, so accessing its bytes won't trigger a page fault; after the disk read it will be like traversing an array; (b) the linear processing of the string -- seeking delimiters -- will be efficient because the string would be in the CPU cache; and (c) although with fixed-field we wouldn't need a character-by-character comparison to find the start and end of each field, most fields will have redundant trailing spaces, so we would still need a character-by-character comparison from the end of the field looking for the first non-whitespace character in order to trim them.
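Roughly, the two inner loops I'm comparing would look something like this in C (the helper names are my own invention and CSV quoting is ignored for brevity):

#include <stddef.h>
#include <string.h>

/* Delimited case: scan left to right for the next separator (a bare comma). */
const char *next_field(const char *p, const char **field_end)
{
    const char *q = strchr(p, ',');
    *field_end = q ? q : p + strlen(p);   /* current field is [p, *field_end) */
    return q ? q + 1 : NULL;              /* start of the next field, or NULL */
}

/* Fixed-field case: start and width are known, only the trailing
   space padding has to be walked back over. */
size_t trimmed_length(const char *field, size_t width)
{
    while (width > 0 && field[width - 1] == ' ')
        --width;
    return width;
}

That trailing-space trim in the fixed-field case is the extra work I mentioned in point (c).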

I'm predicting that fixed-field would be slightly faster to parse than delimited.  Your responses would settle an office debate.  I have considered actually testing the two formats, but my crappy coding won't do either file justice.

Regards
DG

hutch--

Dick,

It will have a lot to do with how you parse it. The top line looks like CSV format, which is best handled with a left-to-right word scanner. A file with a fixed record size and fixed offsets for its members can be read a block at a time of the known length, and it may be slightly faster depending on how you read and parse it.

The delimited version will probably be smaller, and if you parse it correctly it can be very fast.
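Something like the sketch below is what I mean by a left-to-right word scanner -- a single pass over the line that records where each quoted field starts and how long it is. The names are only illustrative and quoting/escaping is simplified:

#include <stddef.h>

typedef struct { const char *start; size_t len; } field_t;

size_t scan_line(const char *line, field_t *out, size_t max_fields)
{
    size_t n = 0;
    const char *p = line;

    while (*p && n < max_fields) {
        if (*p == '"') {                      /* opening quote of a field */
            const char *s = ++p;
            while (*p && *p != '"')           /* run to the closing quote */
                ++p;
            out[n].start = s;
            out[n].len   = (size_t)(p - s);
            ++n;
            if (*p) ++p;                      /* step past the closing quote */
        } else {
            ++p;                              /* separators and anything else */
        }
    }
    return n;                                 /* number of fields found */
}

Once you have the start/length pairs you never have to touch the line again.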

Mark Jones

How about a fixed length of, say, 16 bytes, then compressing the source file using LZMA or 7z or something? (Decompress the file, parse, discard.) Decompression is very fast for most compression algos, perhaps faster than byte-scanning? On second thought, maybe not.
"To deny our impulses... foolish; to revel in them, chaos." MCJ 2003.08

MichaelW

Assuming that in your fixed-field file the fields were organized into records, with each of the records having the same layout, finding a particular field in a particular record would require only a simple calculation. With a delimited file, to find a particular field you would need to do a search, and for large files this would be relatively slow. For a fixed-field file you could eliminate the need to search for the end of the field data by including a length field in front of each data field.

You could have the best features of both, the storage efficiency of a delimited file and the rapid access of a fixed-field file, by using a delimited file in combination with an index. Properly done, the index could also make insertions and deletions simpler and faster. To make insertions and deletions simpler and faster for a fixed-field file you could arrange the records as a linked list.

Today, with storage space so cheap, I would probably use a fixed-field file.
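To illustrate the simple calculation: with a fixed layout, the position of any field in any record is just the record number times the record size, plus the field's offset within the record. A rough sketch, with the sizes and offsets invented for the example:

#include <stdio.h>

#define RECORD_SIZE  200L   /* every record occupies exactly 200 bytes   */
#define NAME_OFFSET   20L   /* the field we want starts 20 bytes into it */
#define NAME_WIDTH    40    /* and is 40 bytes wide, space padded        */

int read_name(FILE *f, long record_index, char *buf /* NAME_WIDTH+1 bytes */)
{
    long pos = record_index * RECORD_SIZE + NAME_OFFSET;

    if (fseek(f, pos, SEEK_SET) != 0)
        return -1;
    if (fread(buf, 1, NAME_WIDTH, f) != NAME_WIDTH)
        return -1;
    buf[NAME_WIDTH] = '\0';
    return 0;
}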


eschew obfuscation

sluggy

I have used both styles, and it depends upon your requirements. Personally I would go with the fixed field for three reasons:

- it compresses nicely
- because the field sizes are known, you can parse it very quickly (see the sketch after this list)
- you can use XML (XSD) to describe the file, so if something needs to change you only change the XSD, you don't have to recompile the app.
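As a rough illustration of the second point -- the field names and widths here are invented:

#include <stdio.h>
#include <string.h>

#pragma pack(push, 1)
typedef struct {
    char first_name[20];
    char last_name[30];
    char phone[15];
    /* ... remaining fields, padding the record out to its fixed length ... */
} record_t;
#pragma pack(pop)

int main(void)
{
    char raw[200];
    record_t rec;

    memset(raw, ' ', sizeof raw);         /* fake a space-padded record */
    memcpy(raw, "Dick", 4);

    memcpy(&rec, raw, sizeof rec);        /* one copy, no scanning */
    printf("%.*s\n", (int)sizeof rec.first_name, rec.first_name);
    return 0;
}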

EduardoS

1 million records of between 10 and 200 bytes each?
With a delimited file you have the same problem: how do you find the 547285th record?
But with fixed fields every record must be 200 bytes, and 200 * 1 million = 200 MB, which is not a good idea.
I agree with MichaelW, an index is the best choice. In the index just put the start and length of each record, so it's fast to find the 547285th record, and the file will be only 1 or 2 MB bigger than the delimited version.
BTW, are you thinking of changing or deleting records?
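A rough sketch of what I mean by the index -- one start/length pair per record written as a flat binary array, so record N is found with a single seek (the file names and entry layout are just placeholders):

#include <stdio.h>
#include <string.h>

typedef struct { long start; long length; } index_entry;

int build_index(const char *data_path, const char *index_path)
{
    FILE *in  = fopen(data_path,  "rb");
    FILE *out = fopen(index_path, "wb");
    char line[512];
    long pos = 0;

    if (!in || !out) {
        if (in)  fclose(in);
        if (out) fclose(out);
        return -1;
    }
    while (fgets(line, sizeof line, in)) {
        index_entry e;
        e.start  = pos;
        e.length = (long)strlen(line);    /* includes the line terminator */
        fwrite(&e, sizeof e, 1, out);
        pos += e.length;
    }
    fclose(in);
    fclose(out);
    return 0;
}

Record N is then read by seeking to N * sizeof(index_entry) in the index, reading one entry, and seeking to entry.start in the data file.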