Providing hints to autotagger

Hi there. I’m just getting started with beets, and I’m trying to figure out how to improve autotagger matching.

I’m importing an Album from this release group. The specific release is here.

The first suggestion beets gives me is a vinyl from the year 1978, whereas this is a CD from 2010. Since the FLAC files don’t have any of this metadata embedded yet, I suppose this is reasonable as beets doesn’t have much to work with.

Even if I hit M to show more matches, beets only shows another 5 albums out of the 43 from that group, and none of the 5 are the exact one I’m looking for (side questions: how does it choose these 5? why not show all 43?).

Is it possible to give hints to beets so that it has more information to improve the matches? In this case, providing either the year or the catalog number would immediately narrow the results to a single release, since each is unique within the group. I experimented with the --set flag on import (e.g., --set year=2010 or --set catalognum="AFZ 095"), but this had no effect.

Edit: Also, I’m aware that you can supply the exact MusicBrainz ID, but that takes more effort as it requires looking up the release on MusicBrainz first. I have all of the other information on hand already, and would love to be able to automate these hints in my import scripts with no user intervention required.

It needs to choose a limit so it doesn’t spend an infinite amount of time downloading metadata from MusicBrainz.

One thing you might consider trying is importing with the hints (either with --set or after the fact with the beet modify command) and then re-importing the same album again, once you have put more contextual information in place.

Other than that, if there are general rules that seem to characterize your music, you might try the match.preferred config option or some other similar options. But there is nothing that specifically lets you tell the importer which catalog number to search for when importing a specific album.

I see. I was expecting beets to do a single query for the group that returns all of the albums at once, then beets would download further metadata once it picked the album with the shortest distance. I guess it needs specific track information for each album to calculate that distance in the first place, though.

Yeah, that was going to be my backup plan, though I think that means it’ll require an intermediate import path since I don’t want to touch the source music files. Are there any plugin hooks that would let me inject the metadata before beets queries MusicBrainz?

Sure! You can see the “fromfilename” plugin for an example that does that.


Thanks for the reference. Unfortunately, beets is still unable to match the release. Here’s a test plugin I created that sets the media, year, label, and catalog number:

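from beets.plugins import BeetsPlugin
import six

class OriginTagger(BeetsPlugin):
    def __init__(self):
        super(OriginTagger, self).__init__()
        # Run origin_task at the start of every import task,
        # before beets looks for candidate matches.
        self.register_listener('import_task_start', origin_task)

def origin_task(task, session):
    # Album tasks carry a list of items; singleton tasks carry one item.
    items = task.items if task.is_album else [task.item]

    for item in items:
        # Hard-coded test values for the release being imported.
        item.media = six.text_type('HDCD')
        item.catalognum = six.text_type('AFZ 095')
        item.year = 2010
        item.label = six.text_type('Audio Fidelity')
        # Prints the repr of the bound method, which includes all of the item's fields.
        print(item.keys)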

After running beet -d ~/downloads/music-out import ~/downloads/music-in/, here’s a dump of the first item, showing that the keys were in fact modified:

<bound method Item.keys of Item(mb_releasetrackid=u'', lyrics=u'', album_id=None, albumstatus=u'', disctitle=u'', lyricist=u'', month=0, channels=2, genre=u'Rock', original_day=0, albumartist=u'Billy Joel', mb_trackid=u'', composer=u'', year=2010, albumdisambig=u'', samplerate=44100, albumartist_sort=u'', id=None, album=u'52nd Street', mb_artistid=u'64b94289-9474-4d43-8c93-918ccc1920d1', bitdepth=16, disctotal=1, title=u'Big Shot', media=u'HDCD', artist_sort=u'', mb_albumid=u'', arranger=u'', comments=u'Audio Fidelity AFZ 095 HDCD', tracktotal=9, rg_track_peak=0.962982, mb_releasegroupid=u'54e6b4a1-7a6a-3ead-b032-0912cfd49a1e', mtime=0, acoustid_id=u'', mb_albumartistid=u'64b94289-9474-4d43-8c93-918ccc1920d1', rg_album_peak=0.975006, albumartist_credit=u'', catalognum=u'AFZ 095', added=0.0, original_month=0, asin=u'', track=1, comp=False, encoder=u'', composer_sort=u'', initial_key=None, rg_track_gain=-2.77, path='/home/user/downloads/music-in/Billy Joel - 1978 - 52nd Street/01 - Big Shot.flac', bitrate=812358, day=0, original_year=0, language=u'', r128_album_gain=None, artist=u'Billy Joel', releasegroupdisambig=u'', country=u'', script=u'', bpm=0, label=u'Audio Fidelity', r128_track_gain=None, rg_album_gain=-0.99, length=243.78666666666666, disc=1, albumtype=u'', artist_credit=u'', acoustid_fingerprint=u'', format='FLAC', grouping=u'')>

However, this is the first result:

Tagging:
Billy Joel - 52nd Street
URL:
https://musicbrainz.org/release/b4a99044-144a-4d82-a2c7-5d307fff78e9
(Similarity: 94.0%) (media, catalognum, label, year) (Hybrid SACD, 2012, US, Mobile Fidelity Sound Lab, UDSAD 2090)

And the next 5 after that aren’t the right release, either. Why can’t beets find it?

Huh! I’m sorry, but I don’t know how to tell why beets doesn’t find it without doing a deep debugging dive. Maybe you’d be interested in checking exactly what search terms beets ends up using, and looking directly at the MusicBrainz Web service response to see why it might not be returning the results you expect?

Well, the cause is simple enough: beets doesn’t use any of the other metadata in the search! The release name, artist(s), and track count are the only criteria used in the MusicBrainz query: https://github.com/beetbox/beets/blob/master/beets/autotag/mb.py#L423

So the query ends up looking like this: http://musicbrainz.org/ws/2/release/?limit=5&query=release%3A(52nd+street)+artist%3A(billy+joel)+tracks%3A(9)

As expected, adding any (or all) of the following to this query returns the expected release as the first match:
* +format%3AHDCD
* +date%3A2010
* +catno%3AAFZ095
* +label%3AAudio%20Fidelity
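
To make the comparison easier to reproduce, here is a minimal sketch of how such a search URL can be assembled. The function name build_search_url and its hints argument are made up for illustration and are not part of beets; the field names (release, artist, tracks, catno, date) are the ones used in the queries above. Unlike the hand-written additions in the list, this sketch wraps every value in parentheses.

```python
from urllib.parse import urlencode

BASE_URL = "http://musicbrainz.org/ws/2/release/"


def build_search_url(criteria, hints=None, limit=5):
    """Build a MusicBrainz release search URL from field/value pairs.

    `criteria` holds the base fields; `hints` (a hypothetical argument)
    holds any extra fields to append to the query.
    """
    fields = dict(criteria)
    fields.update(hints or {})
    # Each field becomes a Lucene-style `name:(value)` term.
    query = " ".join("{}:({})".format(name, value) for name, value in fields.items())
    # Keep parentheses literal so the URL has the same shape as the one above.
    return BASE_URL + "?" + urlencode({"limit": limit, "query": query}, safe="()")


base = {"release": "52nd street", "artist": "billy joel", "tracks": "9"}
print(build_search_url(base))
print(build_search_url(base, hints={"catno": "AFZ095", "date": "2010"}))
```

The first call reproduces the query beets currently sends; the second shows where extra fields would go if they were included.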

Is there any reason in particular that beets would want to exclude this metadata in the search? If so, would you accept a PR that adds a submit_criteria event so that plugins can inject other fields into the criteria? If not, how about a PR that adds all of the available fields listed here to mb.py?

Got it! Thanks for the detective work.

The reason we don’t include all the relevant fields in the search criteria is to avoid narrowing the search too much. That is, if a field like the catalog number is wrong, we would still like to be able to find relevant matches. I don’t know how eagerly including a bad catalog number would exclude otherwise-good matches; perhaps this would be worth experimentation.

The idea is that we should get several relevant matches back and then filter them with our own similarity matching to find the best. That usually works but, as you saw, fails if there are lots of similar releases.
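
As a rough illustration of that two-step idea (this is not the actual beets distance code, and the weights and candidate ids below are invented for the example), local ranking can be sketched like this:

```python
def distance(local, candidate, weights):
    """Weighted fraction of fields on which the candidate disagrees with the local tags."""
    total = sum(weights.values())
    penalty = sum(
        weight
        for field, weight in weights.items()
        if local.get(field) != candidate.get(field)
    )
    return penalty / total


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"album": 3.0, "artist": 3.0, "year": 1.0,
           "media": 1.0, "catalognum": 1.0, "label": 1.0}

local = {"album": "52nd Street", "artist": "Billy Joel", "year": 2010,
         "media": "HDCD", "catalognum": "AFZ 095", "label": "Audio Fidelity"}

# Candidates as they might come back from the search; the ids are placeholders.
candidates = [
    {"id": "sacd-2012", "album": "52nd Street", "artist": "Billy Joel", "year": 2012,
     "media": "Hybrid SACD", "catalognum": "UDSAD 2090", "label": "Mobile Fidelity Sound Lab"},
    {"id": "hdcd-2010", "album": "52nd Street", "artist": "Billy Joel", "year": 2010,
     "media": "HDCD", "catalognum": "AFZ 095", "label": "Audio Fidelity"},
]

ranked = sorted(candidates, key=lambda c: distance(local, c, WEIGHTS))
print([c["id"] for c in ranked])
```

The local ranking only helps if the right release is among the candidates the search returned in the first place, which is exactly what goes wrong when the result limit is smaller than the number of similar releases.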

I like the idea of a plugin event, or even just a single-purpose config option that decides what to include! Even a config option that just increases the number of search results could work for this particular case…

I would hope that bad tags would be the exception rather than the rule, in which case I’d argue that it’s better to fail and require manual intervention for badly tagged music than to fail for music that is tagged correctly!

As it stands now, it’s just a matter of luck whether a result set of N items will contain a given release if there are >N items with that artist/album/track count in the database. It seems like other fields should count for something.

From some cursory tests, the answer seems to be “not very eager at all”, which would be good news.

Here’s a query from beets with the limit set to 100. Compare that to a query that appends date, catno, label, and format values that are all wrong (the invalid parameters were all taken from this Dark Side of the Moon release). The result set is still overwhelmingly Billy Joel/52nd Street; in fact, Dark Side of the Moon appears only near the end of the result set, with a weak score of 43, despite 4 of the 7 search parameters being exact matches! It seems, then, that MusicBrainz weights these fields much less heavily, treating them as disambiguators rather than primary criteria.

That might be the best option here as it would give the community the chance to test these results beyond my tiny anecdotal sample, and it would be straightforward to transition to more fields later if the outcome is favorable.

That’s cool! Thanks for investigating!! Yeah, maybe a good way to proceed would be an experimental config flag that enables several other criteria when issuing the search. Then, it will be possible to compare the match results “in vivo” on real-world music. I suspect you’re right and results will be uniformly better, even for badly tagged music, and we can flip the option to be on by default.
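
A flag like that could gate the extra criteria roughly as follows. This is only a sketch: search_criteria and extra_fields_enabled are hypothetical names, not existing beets code or config options, and the mapping from beets fields to MusicBrainz search fields is assumed from the queries earlier in this thread.

```python
# Assumed mapping from beets item fields to MusicBrainz search fields.
EXTRA_FIELDS = {
    "catalognum": "catno",
    "year": "date",
    "media": "format",
    "label": "label",
}


def search_criteria(album, artist, track_count, tags, extra_fields_enabled=False):
    """Return the search criteria, optionally extended with extra tag fields."""
    criteria = {
        "release": album.lower(),
        "artist": artist.lower(),
        "tracks": str(track_count),
    }
    if extra_fields_enabled:
        for tag, field in EXTRA_FIELDS.items():
            value = tags.get(tag)
            if value:  # skip empty strings and zero-valued years
                criteria[field] = str(value)
    return criteria


tags = {"catalognum": "AFZ 095", "year": 2010, "media": "HDCD", "label": ""}
print(search_criteria("52nd Street", "Billy Joel", 9, tags))
print(search_criteria("52nd Street", "Billy Joel", 9, tags, extra_fields_enabled=True))
```

With the flag off, the criteria match today’s behavior, so results can be compared side by side on the same library.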

Any takers interested in implementing such a flag?