Large import gripes

zrav · December 14, 2017, 9:17pm

Hi,

I’m in the process of importing a large music collection into Beets, just under 200k tracks. I’m about a third of the way in and I’d like to get some feedback on the problems I’m encountering and possible solutions. The collection has already mostly been tagged with Picard, renamed and organized into a folder structure. So I’m using Beets in metadata-only mode (copy: no, write: no, incremental: yes, but manually writing tags to files when a chunk is done), for the extra tagging provided by the Discogs and Beatport plugins, for finding duplicates, and for other features and scripting down the road.

Most of the issues are related to reimporting and duplicate handling:

Rescanning does not handle moved/renamed tracks; instead it finds a new duplicate (even though Beets reports errors for the missing files).
Rescanning an album that was manually tagged outside of Beets causes it to find a duplicate, and if one does select “remove old”, the single album on disk is deleted. Selecting “keep both” maintains a duplicate entry in the DB.
Rescanning in incremental mode does not handle deleted tracks, dead entries remain in the DB. Instead of skipping on already visited folders in incremental mode, I believe the scan should go deeper before deciding to skip. Beets should still check the folder contents and file timestamps and proceed with import if they don’t match the DB.
Rescanning in non-incremental mode requires you to reenter all manual choices, even for albums that were already imported and tagged. This should only be necessary for new albums. Maybe provide a parameter to let users choose between keeping previous choices and being asked again.
Currently tracks tagged via Discogs or Beatport are not automatically re-detected during re-import, unlike MB.
Beets should be able to detect multi-disc albums even when the folders are not labeled CD_ or Disc_. On the first disc it might complain about missing tracks, but as soon as the second disc is scanned, it should detect it as belonging to the same album instead of another incomplete duplicate. This could be limited to folders contained in the same parent folder.
Albums without “album” tags that also cannot be autotagged are all detected as duplicates of album " - ". By now I have a long list of those. For these, the duplicate detection should be disabled, or at least switched to something else, like based on the fingerprint.
Due to the issues with importing, there is no good workflow for dealing with albums falsely detected as duplicates, unless you manually tag them via Beets (not so fun, sorry).
Finally there are cases where I’m unable to determine WHY an album or track is detected as duplicate.
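
To make the suggestion about incremental rescans above more concrete, here is a rough sketch of the kind of check I have in mind. It is not how Beets works today: `known_mtimes` is a hypothetical mapping from file path to the modification time recorded at import, standing in for whatever the DB actually stores, and both function names are made up for illustration.

```python
import os


def scan_changes(folder, known_mtimes):
    """Compare the files under `folder` with previously recorded mtimes.

    `known_mtimes` is a hypothetical {path: mtime} mapping; it stands in
    for the data a library database would hold about imported files.
    """
    on_disk = {}
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            on_disk[path] = os.path.getmtime(path)

    return {
        # Files on disk that were never imported.
        "added": sorted(p for p in on_disk if p not in known_mtimes),
        # Recorded files that no longer exist (dead entries).
        "removed": sorted(p for p in known_mtimes if p not in on_disk),
        # Files whose timestamp no longer matches the recorded one.
        "modified": sorted(
            p for p in on_disk
            if p in known_mtimes and on_disk[p] != known_mtimes[p]
        ),
    }


def needs_reimport(folder, known_mtimes):
    """Skip the folder only when nothing was added, removed or modified."""
    return any(scan_changes(folder, known_mtimes).values())
```

With something like this, an already visited folder would only be skipped when `needs_reimport` returns `False`, and the "removed" list could be used to clean up dead entries.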

Other issues

Transliteration, or rather aliasing, of foreign characters (Cyrillic, Asian) isn’t working, while in Picard it works.
Occasional import hangs, both random and reproducible.
Feature Request: Setting distance threshold value for automatic “Use as is”
“beet write” changes files on disk even when writing is pointless. For example: the bpm field is stored as a float in the DB but as an int in the tags. This is detected as a change during every “beet write”, causing redundant disk writes, which can be inconvenient on copy-on-write filesystems, for instance.
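
A minimal sketch of the bpm problem and one possible fix, assuming the mismatch comes from a fractional DB value such as 128.4 that ends up as 128 in the tag. The helper names are hypothetical and this is not Beets code; the idea is just to convert the DB value to its tag representation before comparing.

```python
def to_tag_value(field, value):
    """Return the value as it would end up in the file's tags.

    Assumption for this sketch: bpm is written to tags as a whole number.
    """
    if field == "bpm" and value is not None:
        return int(round(value))
    return value


def pending_writes(db_fields, file_tags):
    """Fields that would really change on disk after normalization."""
    changes = {}
    for field, value in db_fields.items():
        normalized = to_tag_value(field, value)
        if normalized != file_tags.get(field):
            changes[field] = normalized
    return changes


db_fields = {"title": "Track", "bpm": 128.4}
file_tags = {"title": "Track", "bpm": 128}

# Comparing raw values flags bpm as changed on every run.
naive = {f: v for f, v in db_fields.items() if v != file_tags.get(f)}

# Comparing normalized values finds nothing to write.
normalized = pending_writes(db_fields, file_tags)
```

Here `naive` contains the bpm field while `normalized` is empty, so the file would not need to be touched.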

I’m even considering looking into some of those issues, but I’m no Python coder. I have some experience with PHP but any tips for setting up the dev environment would be welcome. A doc page for devs about the recommended IDE setup, testing procedures, required packages, other recommended tools etc. would be ideal.

Cheers

adrian · December 15, 2017, 9:24pm

Indeed, incremental mode is not really meant for use with copy: no—it sort of expects you to use the standard setup, with a separate “incoming” directory from your main library directory. This assumption actually applies to a handful of things on your list: in general, we like to recommend that people stick with the in- and out-directory setup if possible.

FWIW, the update command is the thing that’s supposed to handle deleted (missing) files by cleaning up DB entries.

Have you seen the discussion in the FAQ about multi-disc albums? This is a big, somewhat complicated set of heuristics. If you’re interested in helping expand how these heuristics work, please do consider creating a detailed design. It’s really important, though, that it works with the existing heuristics and that it is very unlikely to “accidentally” group together unrelated albums. That’s a hard problem!

For albums, a duplicate happens when the artist and album fields are identical; for tracks, it’s the artist and title fields.
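
The rule above can be sketched roughly like this. It is an illustration using plain dicts, not the actual Beets implementation; `find_duplicates` and the `skip_empty` option are hypothetical. It also shows why untagged albums all collide: their key fields are all empty, so they share one key.

```python
from collections import defaultdict

ALBUM_KEYS = ("artist", "album")
TRACK_KEYS = ("artist", "title")


def find_duplicates(records, keys, skip_empty=False):
    """Group records by the given key fields and return groups of size > 1.

    `skip_empty` is a hypothetical option: when set, records with any
    empty key field are left out of duplicate detection entirely.
    """
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(k, "") for k in keys)
        if skip_empty and not all(key):
            continue
        groups[key].append(rec["path"])
    return {key: paths for key, paths in groups.items() if len(paths) > 1}


albums = [
    {"artist": "A", "album": "X", "path": "/m/a1"},
    {"artist": "A", "album": "X", "path": "/m/a2"},
    {"artist": "", "album": "", "path": "/m/u1"},
    {"artist": "", "album": "", "path": "/m/u2"},
]

# The two untagged albums share the key ("", "") and are reported too.
all_dupes = find_duplicates(albums, ALBUM_KEYS)

# With skip_empty, only the genuinely matching pair remains.
tagged_dupes = find_duplicates(albums, ALBUM_KEYS, skip_empty=True)
```

The same function applies to tracks by passing `TRACK_KEYS` instead.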

For any of these, please consider filing bugs with complete details!

Thanks for being willing to contribute! We have a page like that on the wiki:

zrav · December 16, 2017, 7:47am

Thanks.

I don’t think you can universally assume that Beets will be the only tool changing the files on disk. Once you concede that, you’ll have to assume users can run into the problems I’ve described. With some moderate changes these use cases could work much better with Beets.

adrian · December 16, 2017, 3:46pm

Indeed! Syncing with “out-of-band” changes by other tools is what the update command is supposed to do, but it certainly doesn’t cover every use case. In particular, it gets confused if files get moved around—but that’s a pretty tricky problem to solve reliably.
