Basic script to dedupe identical FLACs

RollingStar · January 22, 2022, 7:47pm

Not exactly beets, although it would be cool to have this functionality in beets in the import / dupe detecting step.

I have files in beets and files elsewhere. With freezetag I can have two symlinks that only take up the space of one file on my HDD. So I needed a hash-based matcher (exact, not fingerprint-based or approximate).

Surprisingly, I couldn't find one that operated on FLAC's native MD5 hashes, which are built right into each file's STREAMINFO metadata.

I also learned that about 1% of my files have a fake hash: 00000000000000000000000000000000

So anyone implementing dupe detection needs to be aware of this.

My basic Python script is below. It took about an hour to go through 500 GB of FLACs, mostly CD-size.

Once you get the hashes into the DB, you need to run a basic SQL query to get all the dupes. I also wanted the src folder to take priority over the music folder, so the query needs to sort by that. Sadly I closed SQLite without saving the query, but there are examples online.
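Since the original query was lost, here is a minimal sketch of what such a query could look like against the `files(dir, file, hash)` table the script creates. It assumes "priority" means the copy under the preferred folder is the one that gets kept; `find_redundant` and the demo paths are made up for the example, and the prefix comparison is case-sensitive.

```python
import sqlite3
from itertools import groupby

# Rows for every hash that occurs more than once, with the preferred
# folder sorted first inside each hash group.
DUPES_SQL = """
SELECT hash, dir, file
FROM files
WHERE hash IN (SELECT hash FROM files GROUP BY hash HAVING COUNT(*) > 1)
ORDER BY hash,
         CASE WHEN substr(dir, 1, length(:keep)) = :keep THEN 0 ELSE 1 END,
         dir, file
"""


def find_redundant(con, keep_prefix):
    """Return (dir, file) for every copy except the first one per hash."""
    rows = con.execute(DUPES_SQL, {"keep": keep_prefix}).fetchall()
    redundant = []
    for _hash, group in groupby(rows, key=lambda row: row[0]):
        copies = list(group)
        # copies[0] is the keeper; everything after it is a deletion candidate
        redundant.extend((d, f) for _h, d, f in copies[1:])
    return redundant


# Demo with an in-memory database and hypothetical paths.
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE "files" ("dir", "file", "hash")')
demo.executemany(
    "INSERT INTO files(dir, file, hash) VALUES (?, ?, ?)",
    [
        ("e:/music/album", "01.flac", "a" * 32),
        ("e:/src/album", "01.flac", "a" * 32),
        ("e:/music/other", "02.flac", "b" * 32),
    ],
)
redundant = find_redundant(demo, "e:/src")
print(redundant)
```

The result can be written out one path per line and reviewed by hand before anything is trashed.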

Then I ran the matches through Python's send2trash.

One thing I would change in the script below is the use of lowercase. It broke the usual deletion programs and forced me to use Python for the deletions too.
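A rough sketch of one way to avoid the lowercase problem: lowercase only the extension for the comparison and keep the path exactly as the filesystem reports it. `iter_audio_files` and `ALLOWED_EXTS` are names invented for this example, not part of the script.

```python
import os
import pathlib
import tempfile

ALLOWED_EXTS = {".flac"}


def iter_audio_files(search_dir):
    """Yield matching paths with their original case preserved."""
    for dirpath, _dirnames, filenames in os.walk(search_dir):
        for name in filenames:
            # lowercase only the suffix used for the check, never the stored path
            if pathlib.Path(name).suffix.lower() in ALLOWED_EXTS:
                yield pathlib.Path(dirpath, name)


# Demo in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    pathlib.Path(tmp, "Track One.FLAC").touch()
    pathlib.Path(tmp, "cover.jpg").touch()
    found = [p.name for p in iter_audio_files(tmp)]
print(found)
```

Paths stored this way should match what other tools see on disk, so the deletion list would not need Python to be consumed.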

I found metaflac frustrating because there's no guarantee that it's operating read-only. I instead relied on backing up my files and setting the folders to read-only. I also found mutagen frustrating because I couldn't get FLAC MD5s / fingerprints out of it.
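For anyone who would rather not shell out to metaflac: the MD5 is the last 16 bytes of the 34-byte STREAMINFO block at the start of the file, so it can be read with a plain read-only `open()`. Below is a minimal sketch, assuming the file starts with the `fLaC` marker followed directly by STREAMINFO (a file with something else prepended would just return None here). `read_flac_md5` and `has_real_md5` are names made up for this example. An all-zero value is how the format marks an MD5 that was never computed.

```python
import os
import tempfile

ZERO_MD5 = "0" * 32


def read_flac_md5(path):
    """Return the STREAMINFO MD5 as hex, or None if the header is not as expected."""
    with open(path, "rb") as fh:  # opened read-only, nothing is written
        if fh.read(4) != b"fLaC":
            return None
        header = fh.read(4)
        if len(header) < 4:
            return None
        block_type = header[0] & 0x7F  # low 7 bits; 0 means STREAMINFO
        length = int.from_bytes(header[1:4], "big")
        if block_type != 0 or length < 34:
            return None
        body = fh.read(length)
        if len(body) < 34:
            return None
        # 18 bytes of stream parameters come first, then the 16-byte MD5
        return body[18:34].hex()


def has_real_md5(md5):
    """False for a missing hash or the all-zero placeholder."""
    return bool(md5) and md5 != ZERO_MD5


# Demo on a synthetic header (not a playable file).
fake_md5 = bytes(range(16))
header = b"fLaC" + bytes([0x80]) + (34).to_bytes(3, "big")
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.flac")
    with open(sample, "wb") as fh:
        fh.write(header + bytes(18) + fake_md5)
    demo_md5 = read_flac_md5(sample)
print(demo_md5, has_real_md5(demo_md5))
```

This only reads the stored value; it does not decode the audio to check that the hash is correct.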

import os
import pathlib
import re
import sqlite3
import subprocess

from send2trash import send2trash

# extensions the walker recognises; only .flac is actually hashed below
allowed_exts = [".flac", ".mp3", ".aac"]

# create database and initialize cursor
con = sqlite3.connect(r"c:/all/music.sqlite")
CUR = con.cursor()


def check_file_has_name(pathlib_obj):
    """Return False for names without a suffix, such as `.nomedia`."""
    return pathlib_obj.stem != pathlib_obj.name


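def get_flac_md5(path):
    """Return the MD5 reported by metaflac, or False if it can't be read."""
    args = ["metaflac", "--show-md5sum", str(path)]
    try:
        md5 = subprocess.check_output(args, shell=False)
        md5 = md5.decode("utf-8").rstrip()
    except (OSError, subprocess.CalledProcessError, UnicodeDecodeError):
        # metaflac missing, non-zero exit, or unreadable output: log and skip
        print(path)
        return False
    # accept only a bare 32-digit hex string
    if re.fullmatch(r"[a-fA-F\d]{32}", md5):
        return md5
    return False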


def init_db(cur):
    cur.execute('CREATE TABLE IF NOT EXISTS "files" ("dir", "file", "hash");')


def delete_files(path_to_txt):
    """Send every file listed in the text file (one path per line) to the trash."""
    with open(path_to_txt, "r", encoding="utf-8") as listing:
        for line in listing:
            fname = pathlib.Path(line.rstrip())
            if os.path.isfile(fname):  # skip blank lines and files already gone
                send2trash(fname)


def save_hash(cur, mydir, myfile, md5):
    sql_insert = """INSERT INTO files(dir, file, hash) VALUES (?, ?, ?)"""
    cur.execute(sql_insert, (mydir, myfile, md5))
    # commits on the module-level connection after every row
    con.commit()


def handle_file(mydir, file, fullpath, ext):
    """Hash FLAC files and store the result; other extensions are ignored."""
    if ext == ".flac":
        md5 = get_flac_md5(fullpath)
        # some files (<1%) have an obviously fake all-zero hash; skip them
        if md5 and md5 != "0" * 32:
            save_hash(CUR, str(mydir), str(file), md5)


def mywalk(search_dir):
    """Walk search_dir and hash every file found."""
    jj = 0
    for dirpath, dirnames, filenames in os.walk(search_dir):
        for file in filenames:
            # NOTE: the lowercased name is what gets hashed and stored, so
            # the DB paths only resolve on a case-insensitive filesystem
            low = file.lower()
            full_path = pathlib.Path(dirpath, low)
            my_ext = pathlib.Path(low).suffix.lower()
            if not check_file_has_name(full_path):
                # dotfile-style name such as ".flac": treat the name as the extension
                if full_path.name in allowed_exts:
                    my_ext = pathlib.Path(low).name.lower()
            handle_file(dirpath, low, full_path, my_ext)
            jj += 1
            if jj % 1000 == 0:
                # "progress bar"
                print(full_path)


if __name__ == "__main__":
    # init_db(CUR)
    # # mywalk(pathlib.Path(r"e:\music"))
    # mywalk(pathlib.Path(r"e:\src"))
    # delete_files(r"C:\all\delete_these_files.txt")
    pass
