Basic script to dedupe identical FLACs

Not exactly beets, although it would be cool to have this functionality in beets in the import / dupe detecting step.

I have files in beets and files elsewhere. With freezetag I can have two symlinks that take up only one file’s worth of space on my HDD. So I needed a hash-based (not fingerprint-based, not approximate) matcher.

I surprisingly couldn’t find one that operated on FLAC’s native MD5 hashes, built right into the file’s STREAMINFO metadata.

I also learned that about 1% of my files have a fake hash: 00000000000000000000000000000000

So anyone implementing dupe detection needs to be aware of this.

My basic Python script is below. It took about an hour to go through 500 GB of FLACs, mostly CD-sized albums.

Once the hashes are in the DB, you need to run a basic SQL query to list all the dupes. I also wanted the src folder to take priority over the music folder, so your query needs to sort by that. Sadly I closed sqlite without saving the query, but there are examples online.
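Since I lost the original query, here’s a hedged sketch of the kind of thing it has to do, using a window function (needs SQLite 3.25+). The files table matches the script below; the keep_prefix argument (e.g. e:\src) is an assumption for whichever folder should win:

```python
import sqlite3

def find_dupes(con, keep_prefix):
    """For each hash seen more than once, keep one row (preferring dirs
    under keep_prefix) and return the redundant (dir, file) rows."""
    return con.execute("""
        SELECT dir, file FROM (
            SELECT dir, file,
                   ROW_NUMBER() OVER (
                       PARTITION BY hash
                       ORDER BY CASE WHEN dir LIKE ? || '%' THEN 0 ELSE 1 END,
                                dir, file
                   ) AS rn
            FROM files
        ) WHERE rn > 1
    """, (keep_prefix,)).fetchall()

# usage sketch:
# con = sqlite3.connect(r"c:/all/music.sqlite")
# for mydir, myfile in find_dupes(con, r"e:\src"):
#     print(mydir, myfile)
```

Every row it returns is a copy you can safely delete, because one row per hash (the one sorted first, i.e. under keep_prefix if any) is held back.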

Then I ran the matches through Python’s send2trash.

One thing I would change in the script below is the use of lowercase. Lowercasing the stored filenames broke the usual deletion programs and meant I needed Python for the deletions too.

I found metaflac frustrating because there’s no guarantee that it operates read-only. I instead relied on backing up my files and setting the folders to read-only. I also found mutagen frustrating because I couldn’t get FLAC MD5s/fingerprints out of it.

import os
import pathlib
import re
import sqlite3
import subprocess

from send2trash import send2trash

allowed_exts = [".flac", ".mp3", ".aac"]

# create database and initialize cursor
con = sqlite3.connect(r"c:/all/music.sqlite")
CUR = con.cursor()

def check_file_has_name(pathlib_obj):
    """Return False for extension-only filenames like `.nomedia`."""
    return not pathlib_obj.name.startswith(".")

def get_flac_md5(path):
    args = ["metaflac", "--show-md5sum", str(path)]
    try:
        md5 = subprocess.check_output(args, shell=False)
    except (subprocess.CalledProcessError, OSError):
        return False
    md5 = md5.decode("utf-8").rstrip()
    # metaflac prints the 32 hex chars of the STREAMINFO MD5
    if re.fullmatch(r"[a-fA-F\d]{32}", md5):
        return md5
    return False

def init_db(cur):
    cur.execute('CREATE TABLE IF NOT EXISTS "files" ("dir", "file", "hash");')

def delete_files(path_to_txt):
    with open(path_to_txt, "r", encoding="utf-8") as topo_file:
        for line in topo_file:
            fname = pathlib.Path(line.rstrip())
            if os.path.isfile(fname):  # skip blank lines and already-deleted paths
                send2trash(str(fname))
def save_hash(cur, mydir, myfile, md5):
    sql_insert = """INSERT INTO files(dir, file, hash) VALUES (?, ?, ?)"""
    cur.execute(sql_insert, (mydir, myfile, md5))

def handle_file(mydir, file, fullpath, ext):
    if ext == ".flac":
        md5 = get_flac_md5(fullpath)
        if md5:
            # some files (<1%) have obviously fake hash
            if md5 != r"00000000000000000000000000000000":
                save_hash(CUR, str(mydir), str(file), md5)

def mywalk(search_dir):
    jj = 0
    for dirpath, dirnames, filenames in os.walk(search_dir):
        for file in filenames:
            # keep the original case: lowercasing the stored names is
            # what broke the normal deletion tools later
            full_path = pathlib.Path(dirpath, file)
            my_ext = full_path.suffix.lower()
            if check_file_has_name(full_path) and my_ext in allowed_exts:
                handle_file(dirpath, file, full_path, my_ext)
            jj += 1
            if jj % 1000 == 0:
                print(f"{jj} files scanned")  # "progress bar"

if __name__ == "__main__":
    init_db(CUR)
    mywalk(pathlib.Path(r"e:\music"))
    mywalk(pathlib.Path(r"e:\src"))
    con.commit()  # without this, the inserts are rolled back on exit
    # after reviewing the dupes, list their paths in a text file and run:
    # delete_files(r"C:\all\delete_these_files.txt")