Duplicate File Finder proggy

Started by Shooter, January 10, 2011, 05:15:02 PM

Shooter

Has anyone had any luck building a decent program to find duplicate files? I wonder if it's possible to scan .MP3s for just the music content (ignoring the tags and such) to determine if a duplicate exists. Has anyone tried this?

-Shooter
Never use direct references to anything ever. Bury everything in
macros. Bury the macros in include files. Reference those include
files indirectly from other include files. Use macros to reference
those include files.

jj2007

- get all candidate file names (e.g. FindFirstFile *.mp3 for a given folder and its subfolders)
- sort them by file size
- check if there are "neighbour files" with identical size or very similar size
- verify if contents are identical
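The four steps above could be sketched roughly as follows. This is in Python rather than assembly for brevity, and it only handles the exact-duplicate case: `os.walk` stands in for the FindFirstFile recursion, grouping by identical size stands in for the sort, and the "very similar size" neighbours are not handled. The function names `file_digest` and `find_exact_duplicates` are made up for this example.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root, extension=".mp3"):
    """Return lists of paths under root whose contents are byte-for-byte identical."""
    # Step 1: collect candidate files from the folder and its subfolders.
    by_size = defaultdict(list)
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(extension):
                path = os.path.join(folder, name)
                # Steps 2 and 3: bucketing by size replaces "sort by size,
                # then look at neighbours of identical size".
                by_size[os.path.getsize(path)].append(path)

    duplicates = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have an exact duplicate
        # Step 4: verify contents, here by comparing digests.
        by_digest = defaultdict(list)
        for path in paths:
            by_digest[file_digest(path)].append(path)
        duplicates.extend(sorted(group) for group in by_digest.values() if len(group) > 1)
    return duplicates
```

Calling `find_exact_duplicates(some_folder)` returns one list per set of identical files. Files that differ only in their tags will not show up here.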

Shooter

Quote from: jj2007 on January 10, 2011, 06:04:10 PM
- verify if contents are identical

Yeah, um... contents aren't always going to be identical if one .MP3 file has a blank ID3 tag and another one has a partially completed ID3 tag, yet they carry the exact same song. It won't be the same file size or hash, and sometimes not even the same filename. Hence why I'm asking.

Background (I used to be a DJ, but not so much these days):
Over the last 15 or so years I've accumulated thousands of CDs (up to a certain point, then switched to buying MP3 albums and songs). About 10 or so years ago I started ripping the CDs to my hard drives, but found that with the limited size (at the time) I had to span my collection over multiple hard drives (over 20 for over 100,000 songs). Now that I've obtained eight 1-Terabyte drives, I'm trying to condense my collection and have found that I have multiple copies of some CDs, and even more of certain songs. The problem is that some songs contain no ID3 info or artwork, some have older versions of ID3 tags, and some have newer versions of ID3 tags as well as artwork. Adding to the headache, the various programs I've used changed their filename schemes at various times.

The one thing that I know is a constant is the music data itself (to some degree... one program changed the volume (amplitude) of the contents of a few of my hard drives, so no telling how that's going to play in all of this).

I plan on eventually getting back into the DJ game, but I want to do it using a program that I made (not one that costs hundreds or thousands of dollars), and I want a cleaner selection of music to choose from. So I've got my work cut out for me and I'm seeking a shortcut in any area I can to achieve this.

jj2007

You need to identify "neighbours", with criteria such as similar file size or partly identical file names. Sorting by size is just a first step - and bear in mind that bitrates may also differ, with much higher influence on file size than a tiny tag.

Shooter

Quote from: jj2007 on January 10, 2011, 06:51:28 PM
You need to identify "neighbours", with criteria such as similar file size or partly identical file names. Sorting by size is just a first step - and bear in mind that bitrates may also differ, with much higher influence on file size than a tiny tag.

Oof! I forgot about the bitrates. I wonder if it's possible to expand each song into a raw wave file, create a hash and store it, dispose of the wave file, and move on to the next one; then at the end compare the hashes? How difficult would that be do you think?
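A rough sketch of the hashing half of that idea, assuming each song has already been expanded to a .WAV file by some external decoder (not shown here). It uses Python's standard `wave` module and hashes only the samples, so differences in the WAV header do not matter. The names `pcm_digest` and `group_by_audio` are made up for this example. One caveat: two lossy encodes of the same song at different bitrates, or with the volume changed, decode to different samples, so an exact hash like this would only pair up files whose encoded audio is identical.

```python
import hashlib
import wave


def pcm_digest(wav_path):
    """Hash only the decoded samples of a WAV file, not its header or extra chunks."""
    digest = hashlib.sha256()
    with wave.open(wav_path, "rb") as wav:
        # Include the sample layout so mono/stereo or 8/16-bit data cannot collide.
        layout = (wav.getnchannels(), wav.getsampwidth(), wav.getframerate())
        digest.update(repr(layout).encode("ascii"))
        while True:
            frames = wav.readframes(65536)
            if not frames:
                break
            digest.update(frames)
    return digest.hexdigest()


def group_by_audio(wav_paths):
    """Return groups of paths whose sample data hashes to the same value."""
    groups = {}
    for path in wav_paths:
        groups.setdefault(pcm_digest(path), []).append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Only the digest needs to be stored per song, so the temporary wave file can be deleted as soon as it has been hashed.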

FORTRANS

Hi,

   Rather than hashes, a Fast Fourier Transform of the *.WAV
is more likely to act as an indication of the contents.  Break the
file into one second pieces and try to match by frequency and
magnitude of the pieces.
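
As a small illustration of the idea (a sketch only, not a tested matcher): the Python below cuts a list of samples into pieces, runs a plain recursive radix-2 FFT on each, and keeps the strongest frequency and its magnitude per piece. The piece length must be a power of two, so a real one-second piece at 44.1 kHz would have to be padded or resized. The function names and the tolerance value are assumptions made for this example.

```python
import cmath


def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequencies(samples, sample_rate, piece_len):
    """For each piece, return (frequency in Hz, magnitude) of the strongest bin."""
    peaks = []
    for start in range(0, len(samples) - piece_len + 1, piece_len):
        spectrum = fft(samples[start:start + piece_len])
        # Skip bin 0 (DC offset) and look only below the Nyquist frequency.
        half = [abs(c) for c in spectrum[1:piece_len // 2]]
        k = max(range(len(half)), key=half.__getitem__) + 1
        peaks.append((k * sample_rate / piece_len, half[k - 1]))
    return peaks


def pieces_match(peaks_a, peaks_b, hz_tolerance=5.0):
    """Fraction of aligned pieces whose strongest frequency agrees."""
    pairs = list(zip(peaks_a, peaks_b))
    if not pairs:
        return 0.0
    hits = sum(1 for (fa, _), (fb, _) in pairs if abs(fa - fb) <= hz_tolerance)
    return hits / len(pairs)
```

A single peak per piece is a very coarse description of music. Keeping several of the strongest bins per piece would be the obvious next step.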

Cheers,

Steve N.

jj2007

That sounds pretty advanced. But if you manage to extract pieces of melodies, i.e. sequences of sounds with different frequencies, there is another technique to find matches:
- Split the piece into tones, e.g. do re mi fa sol
- compare each tone against the following one
REPEAT 16
- shl eax, 2
- for every unchanged tone, leave the two new bits at zero
- for a rising tone, set the first bit
- for a falling tone, set the second bit
- get the next tone
ENDM
Now you have got a DWORD with a pattern that identifies a sequence of 16 tone changes. It is the DNA of a melody.
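
A small sketch of that packing in Python, assuming the tones have already been extracted as comparable numbers (note numbers or frequencies, for instance). It shifts before setting each pair of bits so that all 16 comparisons fit in 32 bits, which means 17 tones are needed. Which bit of the pair means "rising" is an arbitrary choice here, and `contour_code` is a made-up name.

```python
def contour_code(tones):
    """Pack the up/down/same pattern of 17 tones (16 steps) into a 32-bit integer."""
    if len(tones) != 17:
        raise ValueError("need 17 tones to describe 16 steps")
    code = 0
    for previous, current in zip(tones, tones[1:]):
        code <<= 2                 # make room for the next 2-bit step
        if current > previous:
            code |= 0b10           # rising tone: first bit of the pair
        elif current < previous:
            code |= 0b01           # falling tone: second bit of the pair
        # unchanged tone: both bits stay zero
    return code
```

Because only the direction of each step is stored, the same melody played in a different key gives the same code.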

Shooter

Steve and JJ,
That is way advanced... if I had some code to examine, I might be able to make sense of it.

-Shooter

FORTRANS

Hi,

   Most FFT routines are from existing code or called from a
library.  The FFT converts between a time sequence (waves
or amplitudes) and frequency values.  Representing the data
can be a bit tricky if you want to actually look at it.  Google
just failed me as I have forgotten the term.

  Okay look for "Images for voice spectrogram".  Some of
those show the frequencies in speech over time.  And you
would do something similar for your music.

   I have used FFTs for a number of differing jobs.  But
most of the useful stuff was for image processing trying
to remove or suppress artifacts.

Cheers,

Steve N.

dedndave

i think trying to identify the tune is a bit pie-in-the-sky
even if you used fast fourier transforms or any other method
you might be able to separate rock from country, based on the chords they use - lol
        xor     eax,edx
        test    eax,4050h
        jnz     Is_Rock

        test    eax,4160h
        jnz     Is_Country

        jmp     Exit_It_Must_Be_Rap

Neil

Have you tried MediaMonkey?

www.mediamonkey.com

This free program does everything you want. I use it to maintain my music collection of 200,000 files & without it that would be an almost impossible task.

FORTRANS

Quote from: dedndave on January 10, 2011, 11:00:01 PM
i think trying to identify the tune is a bit pie-in-the-sky
even if you used fast fourier transforms or any other method

Hi,

   True.  But you could separate those that start out loud from
those that are quieter, or those that have drums or some other
defining instrument.  Maybe you could tell if there was a vocal
part, though I think that would be pushing things a bit.

Regards,

Steve N.

redskull

not so pie-in-the-sky after all:

http://www.shazam.com/music/web/home.html

+1 for the FFT method, though it's probably more work than just going through your collection and retagging things manually.
Strange women, lying in ponds, distributing swords, is no basis for a system of government

Shooter

Quote from: Neil on January 11, 2011, 08:52:29 AM
Have you tried MediaMonkey?

www.mediamonkey.com

This free program does everything you want. I use it to maintain my music collection of 200,000 files & without it that would be an almost impossible task.

I'm taking a look at it now. Thanks.

Tedd

What you're looking to do is create a 'thumbprint' that captures the essence of the audio, rather than the contents of the file - which is entirely dependent on the format, bitrate, encoder, tags, and numerous other things.
So, what you'd need to do is decode the contents of the file into audio (wav is probably easiest and best supported), split that into sections of some length (e.g. 250ms), do a DCT (like FFT but simpler and quicker - you want essence, not detail) on each section, and then store a hash of the first few components (5 or 6 is usual) of each section (along with a reference to the source.)
Sections of audio with the same content will hash the same, and so you can lookup any file by hashing a small number of sections to check if they're already in the 'library.' If it is, you can print its twin(s), and if not you can hash the entire file and add it to the library.
Note: this isn't exact matching, so it could produce false matches - it only tests the contents for similarity. But, assuming the number of duplicated files will be small, they can easily be checked by hand.
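
To make the thumbprint idea a bit more concrete, here is a minimal Python sketch under several assumptions: the audio is already decoded to a list of samples in the range -1..1, sections do not overlap, the DCT is the naive DCT-II computed only for the first few components, and the components are coarsely quantized before hashing. The names `dct_head`, `section_hashes` and `Library`, and the values `count=6` and `step=0.05`, are invented for the example.

```python
import hashlib
import math


def dct_head(section, count):
    """First `count` DCT-II coefficients of one section (naive form)."""
    n = len(section)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(section))
        for k in range(count)
    ]


def section_hashes(samples, section_len, count=6, step=0.05):
    """Hash the coarsely quantized leading DCT coefficients of each section."""
    hashes = []
    for start in range(0, len(samples) - section_len + 1, section_len):
        coeffs = dct_head(samples[start:start + section_len], count)
        # Normalize by section length, then quantize so tiny differences vanish.
        quantized = tuple(round(c / section_len / step) for c in coeffs)
        hashes.append(hashlib.sha1(repr(quantized).encode("ascii")).hexdigest())
    return hashes


class Library:
    """Maps section hashes to the sources they were seen in."""

    def __init__(self):
        self.index = {}  # section hash -> set of source names

    def add(self, name, hashes):
        for h in hashes:
            self.index.setdefault(h, set()).add(name)

    def lookup(self, hashes):
        """Count, per known source, how many of the given sections match."""
        votes = {}
        for h in hashes:
            for name in self.index.get(h, ()):
                votes[name] = votes.get(name, 0) + 1
        return votes
```

A file whose sections collect many votes for one library entry is a candidate twin to check by hand. This sketch does not cope with audio that is shifted by a few samples or whose coefficients fall near a quantization boundary, so a real tool would need overlapping sections or a more tolerant comparison.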

Of course this isn't a simple binary compare, and you're not going to achieve this very effectively by any simple method. It would be worth doing a straightforward binary compare, taking care to skip the audio tags - it might catch some files at least.
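
A sketch of that tag-skipping compare in Python, assuming only the two most common tag layouts: an ID3v2 tag at the start of the file (a 10-byte header whose last four bytes hold a "synchsafe" size, 7 bits per byte) and a 128-byte ID3v1 tag at the end that begins with "TAG". Other tag types such as APEv2 or Lyrics3 are not handled, and the whole file is read into memory for simplicity. `audio_span` and `audio_digest` are made-up names.

```python
import hashlib


def audio_span(data):
    """Return (start, end) of the bytes left after skipping ID3v2 and ID3v1 tags."""
    start, end = 0, len(data)
    # ID3v2: "ID3", 2 version bytes, 1 flags byte, 4-byte synchsafe size.
    if len(data) >= 10 and data[:3] == b"ID3":
        size = 0
        for byte in data[6:10]:
            size = (size << 7) | (byte & 0x7F)
        start = 10 + size
        if data[5] & 0x10:  # footer-present flag (ID3v2.4) adds 10 more bytes
            start += 10
    # ID3v1: fixed 128-byte block at the end, starting with "TAG".
    if end - start >= 128 and data[end - 128:end - 125] == b"TAG":
        end -= 128
    return min(start, end), end


def audio_digest(path):
    """SHA-256 of the file contents with the recognized tags stripped."""
    with open(path, "rb") as handle:
        data = handle.read()
    start, end = audio_span(data)
    return hashlib.sha256(data[start:end]).hexdigest()
```

Two rips of the same encoded audio, one with a blank tag and one with a filled-in tag, should give the same `audio_digest`. Files re-encoded at another bitrate will still differ.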
Another method that could work, and is considerably simpler, is to decode the audio and match using cross-correlation (closely related to convolution). Though I wouldn't suggest this, as the cost grows quadratically: you'd need to compare every section of every file against every section of every other file.
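
For reference, a bare-bones Python sketch of the cross-correlation approach, assuming decoded samples in plain lists. It slides one section over a longer stretch of audio and scores each position with a normalized correlation, which has the side effect of ignoring a plain change in volume. The function names are invented, and the nested loops make the cost visible: every offset costs a full pass over the section, before even multiplying by the number of sections and files.

```python
import math


def normalized_correlation(a, b):
    """Similarity of two equal-length sections, in the range -1 to 1."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # at least one section is constant (e.g. silence)
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(needle, haystack):
    """Slide needle over haystack; return (best offset, best score)."""
    best = (0, -1.0)
    for offset in range(len(haystack) - len(needle) + 1):
        window = haystack[offset:offset + len(needle)]
        score = normalized_correlation(needle, window)
        if score > best[1]:
            best = (offset, score)
    return best
```

A score close to 1 at some offset suggests the section occurs there. What counts as "close" would have to be tuned on real files.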
No snowflake in an avalanche feels responsible.