Large import gripes

Hi,

I’m in the process of importing a large music collection into Beets, just under 200k tracks. I’m about a third of the way in and I’d like to get some feedback on the problems I’m encountering and possible solutions. The collection has already mostly been tagged with Picard, renamed and organized into a folder structure. So I’m using Beets in metadata-only mode (copy: no, write: no, incremental: yes, manually writing tags to files when a chunk is done) for the extra tagging provided by the Discogs and Beatport plugins, for finding duplicates, and for other features and scripting down the road.

Most of the issues are related to reimporting and duplicate handling:

  • Rescanning does not handle moved/renamed tracks; instead it finds a new duplicate (despite Beets reporting errors for the missing files).
  • Rescanning an album that was manually tagged outside of Beets causes it to find a duplicate, and if one does select “remove old”, the single album on disk is deleted. Selecting “keep both” maintains a duplicate entry in the DB.
  • Rescanning in incremental mode does not handle deleted tracks; dead entries remain in the DB. Instead of simply skipping already-visited folders in incremental mode, I believe the scan should go deeper before deciding to skip. Beets should still check the folder contents and file timestamps and proceed with the import if they don’t match the DB.
  • Rescanning in non-incremental mode requires you to re-enter all manual choices, even for albums that were already imported and tagged. This should only be necessary for new albums. Maybe provide a parameter to let users choose between keeping previous choices and being asked again.
  • Currently tracks tagged via Discogs or Beatport are not automatically re-detected during re-import, unlike MB.
  • Beets should be able to detect multi-disc albums even when the folders are not labeled CD_ or Disc_. On the first disc it might complain about missing tracks, but as soon as the second disc is scanned, it should detect it as belonging to the same album instead of another incomplete duplicate. This could be limited to folders contained in the same parent folder.
  • Albums without “album” tags that also cannot be autotagged are all detected as duplicates of album " - ". By now I have a long list of those. For these, the duplicate detection should be disabled, or at least switched to something else, like based on the fingerprint.
  • Due to the issues with importing, there is no good workflow for dealing with albums falsely detected as duplicates, unless you manually tag them via Beets (not so fun, sorry).
  • Finally there are cases where I’m unable to determine WHY an album or track is detected as duplicate.
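
To illustrate the folder contents and timestamp check suggested in the incremental-mode item above, here is a rough sketch using only the Python standard library. It is not how Beets works today; `folder_snapshot` and `needs_rescan` are hypothetical names, and it assumes a snapshot of file names and modification times was stored at import time.

```python
import os
import tempfile

def folder_snapshot(folder):
    """Map each file name directly inside `folder` to its modification time."""
    with os.scandir(folder) as entries:
        return {e.name: e.stat().st_mtime_ns for e in entries if e.is_file()}

def needs_rescan(folder, stored_snapshot):
    """True if files were added, removed, or modified since the snapshot was taken."""
    return folder_snapshot(folder) != stored_snapshot

# Small demonstration in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as album_dir:
    track = os.path.join(album_dir, "01.mp3")
    open(track, "w").close()
    snapshot = folder_snapshot(album_dir)           # what would be stored at import time
    unchanged = needs_rescan(album_dir, snapshot)   # nothing touched yet
    os.remove(track)
    after_delete = needs_rescan(album_dir, snapshot)  # a track was deleted
```

With something like this, an already-visited folder would only be skipped while its snapshot still matches.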

Other issues

  • Transliteration, or rather aliasing, of foreign characters (Cyrillic, Asian) isn’t working, while in Picard it is.
  • Occasional import hangs, both random and reproducible.
  • Feature Request: Setting distance threshold value for automatic “Use as is”
  • “beet write” changes files on disk even when writing is pointless. For example: the bpm field is stored as a float in the DB but as an int in the tags. This is detected as a change during every Beets write, causing redundant disk writes, which can be inconvenient on copy-on-write filesystems, for instance.
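
The bpm case above could be avoided by comparing values in the form they take once written to the file. A minimal sketch of that idea, assuming the tag format stores bpm as an integer; `bpm_needs_write` is a hypothetical helper, not a Beets function:

```python
def bpm_needs_write(db_value, file_value):
    """Return True only if the file's bpm tag would actually change."""
    # Assumption: the tag stores bpm as an int, so round the DB float first.
    return int(round(db_value)) != int(file_value)

same = bpm_needs_write(128.0, 128)       # float in DB, same int in file
different = bpm_needs_write(127.6, 127)  # rounds to 128, so the tag would change
```

A write would then be skipped whenever every field compares equal after this kind of normalization.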

I’m even considering looking into some of those issues, but I’m no Python coder. I have some experience with PHP but any tips for setting up the dev environment would be welcome. A doc page for devs about the recommended IDE setup, testing procedures, required packages, other recommended tools etc. would be ideal.

Cheers

Indeed, incremental mode is not really meant for use with copy: no—it sort of expects you to use the standard setup, with a separate “incoming” directory from your main library directory. This assumption actually applies to a handful of things on your list: in general, we like to recommend that people stick with the in- and out-directory setup if possible.

FWIW, the update command is the thing that’s supposed to handle deleted (missing) files by cleaning up DB entries.

Have you seen the discussion in the FAQ about multi-disc albums? This is a big, somewhat complicated set of heuristics. If you’re interested in helping expand how these heuristics work, please do consider creating a detailed design, but it’s really important that it works with the existing heuristics and that it is very unlikely to “accidentally” group together unrelated albums. That’s a hard problem!

For albums, a duplicate happens when the artist and album fields are identical; for tracks, it’s the artist and title fields.
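
As a rough illustration of that rule (plain dicts and a hypothetical `find_duplicate_albums` helper, not the actual Beets code), grouping albums by their (artist, album) pair shows why albums with empty tags all collide with each other:

```python
from collections import defaultdict

def find_duplicate_albums(albums):
    """Group albums by (artist, album); return the groups with more than one entry."""
    groups = defaultdict(list)
    for album in albums:
        key = (album.get("artist", ""), album.get("album", ""))
        groups[key].append(album["path"])
    return {key: paths for key, paths in groups.items() if len(paths) > 1}

# Hypothetical library: two untagged albums and one tagged album.
albums = [
    {"artist": "", "album": "", "path": "/music/untagged-a"},
    {"artist": "", "album": "", "path": "/music/untagged-b"},
    {"artist": "Artist X", "album": "Album Y", "path": "/music/x-y"},
]
duplicates = find_duplicate_albums(albums)
```

Here the two untagged albums share the key `("", "")`, so they are reported as duplicates of each other even though they are unrelated.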

For any of these, please consider filing bugs with complete details!

Thanks for being willing to contribute! We have a page like that on the wiki:

Thanks.

I don’t think you can universally assume that Beets will be the only tool changing the files on disk. Once you concede that, you’ll have to assume users can run into the problems I’ve described. With some moderate changes these use cases could work much better with Beets.

Indeed! Syncing with “out-of-band” changes by other tools is what the update command is supposed to do, but it certainly doesn’t cover every use case. In particular, it gets confused if files get moved around—but that’s a pretty tricky problem to solve reliably.