Detecting folders with duplicated FLAC albums

Like it or not, over the years we end up copying and moving things around and making backups of backups, and before you know it you have two or more copies of the identical album (i.e. the audio streams are identical, regardless of file size and metadata) lying around in your library.

For my own use I wrote some SQLite queries to identify folders in my library whose FLAC files, taken together, have audio streams identical to those of the FLAC files in another folder. In doing the analysis the code generates three SQLite tables:

  • a list of distinct folders
  • a list of folders with the same FLAC content
  • a list of folders which can be deleted (leaving behind only one copy of the FLAC files)
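To give a feel for the idea, here is a minimal sketch rather than the actual scripts. It assumes a hypothetical table tracks(folder, md5) holding one row per FLAC file, and the function names are made up for illustration. Two folders count as duplicates when their sorted lists of audio MD5s match exactly.

```python
import sqlite3
from collections import defaultdict


def find_duplicate_folders(conn):
    """Return groups of folders whose FLAC audio MD5s match exactly.

    Assumes a hypothetical table tracks(folder TEXT, md5 TEXT).
    """
    md5s_by_folder = defaultdict(list)
    for folder, md5 in conn.execute("SELECT folder, md5 FROM tracks"):
        md5s_by_folder[folder].append(md5)

    # The sorted tuple of MD5s acts as a signature for the folder's content.
    folders_by_content = defaultdict(list)
    for folder, md5s in md5s_by_folder.items():
        folders_by_content[tuple(sorted(md5s))].append(folder)

    groups = [sorted(g) for g in folders_by_content.values() if len(g) > 1]
    return sorted(groups)


def folders_to_delete(groups):
    """Keep the first folder of each group and list the rest as deletable."""
    return [folder for group in groups for folder in group[1:]]


def build_demo_db():
    """Create an in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tracks (folder TEXT, md5 TEXT)")
    conn.executemany(
        "INSERT INTO tracks VALUES (?, ?)",
        [
            ("/music/album", "aaa"),
            ("/music/album", "bbb"),
            ("/music/backup/album", "bbb"),
            ("/music/backup/album", "aaa"),
            ("/music/other", "aaa"),
            ("/music/other", "ccc"),
        ],
    )
    return conn
```

Which copy to keep is an arbitrary choice in this sketch (the alphabetically first folder); nothing is deleted, the functions only report.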

The code will not modify or delete your music; it simply does analysis based on table contents. It’s been used and tested extensively by myself and a few friends and has no known issues.

I think it’d be a neat addition to beets. It would require only a few additional tables, created whenever a user wants to check their library for duplicate albums, and perhaps some code to act on the outcomes.

Beets’ import code would need to read and store the md5sum from the underlying FLAC files.

I’d be happy to share the code if there’s interest in incorporating into beets.

Sounds very cool! But just for comparison, have you tried out the duplicates plugin? I know it’s not the same because your thing uses actual content hashes instead of metadata, but it might be worth thinking about how it might be integrated.

I had a quick look at the duplicates plugin documentation and I think its scope is a lot broader than what I put together. It may be possible to incorporate my ideas or code into the existing plugin to deal with the specific task of finding folders in a collection that contain sets of FLAC files with identical audio content, using only the md5sum.

Looking at beets’ items table, it doesn’t look like you grab the md5sum from the FLAC metadata, which is a pity, because all that’d be needed beyond that point is to run the SQLite code from within Python to do the comparing. Without the md5sum stored, one would need to fetch it from each file first. My scripts do so using metaflac, but I know it can be retrieved using mutagen too.
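For illustration, the md5sum can also be read with nothing but the Python standard library, since the FLAC format stores it in the last 16 bytes of the STREAMINFO block at the start of the file. This is a sketch of that approach, not what my scripts do (they call metaflac), and read_flac_md5 is a name I made up. It assumes the file begins with the fLaC marker, so a file with an ID3 tag prepended would be rejected.

```python
def read_flac_md5(path):
    """Return the audio MD5 stored in a FLAC file's STREAMINFO block, as hex.

    A result of 32 zeros means the encoder did not store a checksum.
    """
    with open(path, "rb") as f:
        if f.read(4) != b"fLaC":
            raise ValueError("not a FLAC file (missing fLaC marker)")

        # Metadata block header: 1 flag bit + 7-bit type, then 24-bit length.
        header = f.read(4)
        if len(header) < 4:
            raise ValueError("truncated metadata block header")
        block_type = header[0] & 0x7F
        length = int.from_bytes(header[1:4], "big")

        # STREAMINFO (type 0) must be the first block and is 34 bytes long.
        if block_type != 0 or length < 34:
            raise ValueError("first metadata block is not STREAMINFO")

        streaminfo = f.read(length)
        if len(streaminfo) < 34:
            raise ValueError("truncated STREAMINFO block")

    # The MD5 of the unencoded audio occupies bytes 18..33 of STREAMINFO.
    return streaminfo[18:34].hex()
```

Because the checksum covers the decoded audio only, two files with different tags or compression levels still give the same value, which is what makes the folder comparison work.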

Anyhow, the scripts are here:

I’m not particularly familiar with Python, so incorporating it would be a challenge for me, but if the devs feel there’s benefit in adding it and are willing to humour me and assist, I’m happy to have a go in my spare time.