Chroma match information is dropped, not available for MusicBrainz?

i’m importing my library, and digging in on those that aren’t matching to see why.

i have albums that successfully match acoustic fingerprints (AFP), but these hints then seem to be ignored by the subsequent MusicBrainz database searches?! my debug tracing through the various pipeline/coroutine layers of the plugin architecture makes it seem as if the _matches and related dictionaries (correctly) produced by chroma.acoustid_match() are NOT being returned as the value of the event handler in plugins.send(); it gets None as a return value.

i’ve documented the LAST state of _matches after all items in the album have been processed, showing all the AFP match info below

but as soon as the fingerprint task finishes, and returns to send(), the result is None:

Anyone have clues as to why _matches, etc. are not being passed back up with the task?

  • Rik

Huh! No, nothing obviously jumps out here… any chance you could include a verbose log from an affected run?

it seems discourse doesn’t support attachments? so i made a zip file containing both a minimal test config.yaml and the full log, and put it on dropbox:

the data file of MP3s is ~200 MB, but i can drop that on dropbox, too, if you’d find it helpful?

Hmm… I might be a little confused. According to this log, it does actually seem like fingerprint-based album candidates are being considered. This line:

chroma: acoustid album candidates: 4

is where the plugin indicates that it has produced 4 album matches. You can see them listed above:

Requesting MusicBrainz release f2f91c39-085e-4c3a-b542-e0e4f7f9c32f
primary MB release type: album
secondary MB release type(s): compilation
Requesting MusicBrainz release ec7e7dde-55ff-33cc-b2df-20276f7371eb
primary MB release type: album
Requesting MusicBrainz release 45e86248-739b-4f7d-8a55-5e9f735d6565
primary MB release type: other
secondary MB release type(s): compilation
Requesting MusicBrainz release 5fb3cd81-c939-4e41-a5df-14b51dc3ae62
primary MB release type: album
secondary MB release type(s): compilation

Anyway, below that, you can also see beets performing the matching logic on all four of those albums. So they seem to have made their way to the full autotagger pipeline.

Any chance you could expand on why you have concluded that this isn’t working?

Thanks for looking into this, Adrian; almost certainly it's me that's confused. I'll lay out my thinking below and scatter a few specific questions as i do. For background, note that i'm focusing on this CD's tracks because I know the MB album ID it should match: ec7e7dde-55ff-33cc-b2df-20276f7371eb; i'll call it ec7e7 for short. also, i have set threaded: no to make the debugging more sequential.

my original post focused on the (first) handoff from chroma to the handler, because I could not find any place where the carefully constructed _matches (and related) dictionaries were being maintained. can you help me with that: where is this state kept across pipeline stages?

the next stage does the “Looking up:” with the minimal info available from the album directory name: {'release': u'brahms: ein deutsches requiem', 'tracks': u'7', 'artist': u'barbara hendricks, jose van dam; herbert von karajan: vienna philharmonic orchestra, vienna singverein'}. here i would have expected chroma’s identified album candidates to be used to narrow this search, but it seems they are not?

nevertheless, ec7e7 is indeed one of the retrieved candidates, but with a pretty bad distance=0.17 score. and you are totally correct that chroma does chime in again with the line you quote; that gets generated at line 202, in chroma.AcoustidPlugin.candidates().

so now the issue appears to me to be the “Duplicate” flag that
dismisses the correct candidate ec7e7.

this is generated by match._add_candidate(), when called by match.tag_album() from line 467, because the search_ids argument is empty. that means the ImportTask object’s self.search_ids is empty in ImportTask.lookup_candidates(). popping up to importer.lookup_candidates(), search_ids is supposed to come from session.config['search_ids'], which the comment (line 1373) makes it seem comes only from user interaction?!

thanks again for reading if you’ve made it this far! i’m grateful for
any insights you might have.

It is _matches itself where this is maintained—it’s a global variable that’s shared by all stages in the plugin. (That is not an awesome design decision, by the way—I wish this were associated with the import session rather than being a global variable…)
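A minimal sketch of that pattern (this is not the real chroma.py; the dict layout and the stage helpers here are invented for illustration):

_matches = {}  # module-level: file path -> fingerprint lookup result


def fingerprint_stage(paths):
    # Early pipeline stage: stash whatever fingerprinting found for each file.
    for path in paths:
        _matches[path] = {'album_ids': [], 'recording_ids': []}  # placeholder


def candidates_stage(paths):
    # Later stage: read the shared module-level state back out.
    return [_matches[p] for p in paths if p in _matches]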

Correct. The “normal” matching logic uses metadata rather than fingerprints, and that’s what you’re seeing here. The chroma plugin doesn’t disable that normal lookup logic—it only adds to it by mixing in additional releases found by fingerprinting.

Interesting… keep in mind that that’s an 83% similarity (not a 17% similarity), so it’s not that bad. But it would be worth trying to nail down why the accuracy is lower than you’d expect… maybe some of the titles are mismatched?

Yes, those come from the user-interactive “enter Id” prompt. The Chroma matches come from a separate query, not represented in lookup_candidates. The candidates method in the plugin is where the action happens—the hooks.album_for_id call within that looks up the album for the fingerprinted ID.

umm, ok. but why then should this be flagged as a duplicate, and dropped?

but then how is state exposed to later stages of the pipeline?

Take a look at the lookup_candidates function. That gets called later in the pipeline, when it’s time to look up album information. Core beets asks the plugin, “hey, what albums are a good match for these audio files?” And the plugin says, “let me see if there’s anything in _matches for those audio files. If so, I’ll try looking up a corresponding album or two for them and send those back to beets core.”
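Very roughly, the shape of that plugin method is something like the sketch below. hooks.album_for_id is the real beets helper; the plugin class and the _matches bookkeeping are paraphrased, so treat the details loosely:

from beets import plugins
from beets.autotag import hooks

_matches = {}  # path -> fingerprint results, as in the earlier sketch


class SketchAcoustidPlugin(plugins.BeetsPlugin):
    def candidates(self, items, artist, album, va_likely):
        # Collect the MB release IDs that fingerprinting recorded for these
        # files in the shared module-level state.
        album_ids = set()
        for item in items:
            match = _matches.get(item.path)
            if match:
                album_ids.update(match['album_ids'])

        # Look up full album info for each fingerprinted ID and hand the
        # resulting AlbumInfo objects back to the autotagger as candidates.
        albums = []
        for album_id in album_ids:
            info = hooks.album_for_id(album_id)
            if info:
                albums.append(info)
        return albums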

That’s because the autotagger found that it already had that album from some other source, probably the default metadata-based search logic. Maybe if you import this album in “timid” mode and ask to see all candidates, you’ll be able to see it?
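(For reference, timid mode is just the standard import option: either

beet import -t /path/to/album

on the command line, or

import:
    timid: yes

in the config file; the path is whatever you’re importing.)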

now we’re getting to it!

it seems to me the chroma matches would be great features to feed into that “metadata-based search logic”? how/where do i modify that?

Hmm, I’m not actually sure what that would look like! Can you explain a little more about exactly how you’d want to change the way things work, relative to the status quo?

the change should be that the correct MBID be identified, rather than SKIPPED.

i have traced through this pretty carefully: some info on potential matches comes from chroma, and some from MB. but rather than COMBINING this information, it seems to be simply DROPPED as a duplicate (in match._add_candidate)? do i have that right?

if so, i’m asking where to put in more clever logic that ACCUMULATES clues ACROSS plugins, prior to the calculation of distance?

That’s the thing—the only reason it’s being skipped (AFAICT) is that the metadata-based matcher already found that album. So it should be available as an option, with or without fingerprinting. Is it in the list of candidates (using the “timid mode” tip I mentioned above)?

Aha, I think this is an important point! The key is that things happen in two phases (although they are not strictly time-ordered):

  1. Collect potential matches. (The beets core does this with metadata-based searching. Plugins can also add matches.)
  2. Score those matches. (The beets core does this by comparing metadata. Plugins can also add scoring here.)

But they’re actually separate phases. So the Chroma plugin finds additional matches, in stage #1, that augment the metadata-based matches. Then, all of those matches, regardless of where they came from, go through the same process in stage #2. The Chroma plugin also provides match-quality evidence based on IDs in this stage. But it does so without knowing where the matches were originally found.
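For phase #2, a plugin’s contribution looks roughly like the sketch below. The album_distance hook and Distance.add_expr are the beets APIs as I recall them; fingerprinted_album_ids is a made-up stand-in for the plugin’s actual bookkeeping:

from beets import plugins
from beets.autotag import hooks


def fingerprinted_album_ids(items):
    # Stand-in: the real plugin would consult its fingerprint state here.
    return set()


class SketchFingerprintScoring(plugins.BeetsPlugin):
    def album_distance(self, items, album_info, mapping):
        dist = hooks.Distance()
        # Add a penalty under the 'album_id' key when the candidate's MB
        # release ID is not one that fingerprinting turned up for these files;
        # add_expr contributes 1.0 when the expression is true, 0.0 otherwise.
        dist.add_expr('album_id',
                      album_info.album_id not in fingerprinted_album_ids(items))
        return dist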

sorry for the delay in my response; your last comments made me appreciate much better how beets’ matching occurs. in particular, all my chatter above in this thread about merging chroma attributes with musicbrainz attributes was bogus: everything is already marshalled and ready. the details of matching all seem to lead back to the construction and use of the autotag.hooks.Distance object. and before going any further, i want to thank you again for your patient hand-holding as i come to better appreciate all that is in beets.

i wrote a little Distance.pprint() method (below) to enumerate all the features that go into the distance calculation. using the same ec7e7dde-55ff-33cc-b2df-20276f7371eb target as above (which i’ve been calling ec7e7), with the default weightings i get this:

	album 0.0327272727273
	artist 0.0914085914086
	mediums 0.0
	year 0.012539184953
	* dist.tracks
	0 0.0491803278689
	1 0.0353535353535
	2 0.125541125541
	3 0.0374331550802
	4 0.0522243713733
	5 0.0369799691834
	6 0.0376175548589
	* tracks
	track 1 2279052b-27b9-48bd-bc3b-56f9bb9ec6f8 1 Ein deutsches Requiem, op. 45: I. Selig sind, die da Leid tragen
	track_id [0.0]
	track_index [0.0]
	track_length [0.0]
	track_title [0.12962962962962962]
	...
	track 7 b29dc974-8a28-45d6-9cf3-ef239180252f 7 Ein deutsches Requiem, op. 45: VII. Selig sind die Toten, die in dem Herrn sterben
	track_id [0.0]
	track_index [0.0]
	track_length [0.0]
	track_title [0.4603174603174603]
	...
	choice: /Data/tmp/music-minTest/Barbara Hendricks, Jose Van Dam; Herbert Von Karajan_ Vienna Philharmonic Orchestra, Vienna Singverein/Brahms_ Ein Deutsches Requiem	action.SKIP	Brahms: Ein Deutsches Requiem	Barbara Hendricks, Jose Van Dam; Herbert Von Karajan: Vienna Philharmonic Orchestra, Vienna Singverein	0.170705052658	ec7e7dde-55ff-33cc-b2df-20276f7371eb	"album:[0.24]; tracks:[0.049180327868852465, 0.03535353535353535, 0.12554112554112554, 0.03743315508021391, 0.05222437137330755, 0.03697996918335902, 0.03761755485893417]; mediums:[0.0]; year:[0.27586206896551724]; artist:[0.6703296703296703]"

in the case of my target album, i can get it to match (i.e. produce a distance good enough relative to strong_rec_thresh) by reducing the match.distance_weights.track_title weight:

	album 0.0327272727273
	artist 0.0914085914086
	mediums 0.0
	year 0.012539184953
	* dist.tracks
	0 0.0128805620609
	1 0.00925925925926
	2 0.0328798185941
	3 0.00980392156863
	4 0.0136778115502
	5 0.00968523002421
	6 0.00985221674877
	* tracks
	track 1 2279052b-27b9-48bd-bc3b-56f9bb9ec6f8 1 Ein deutsches Requiem, op. 45: I. Selig sind, die da Leid tragen
	track_id [0.0]
	track_index [0.0]
	track_length [0.0]
	track_title [0.12962962962962962]
	...
	track 7 b29dc974-8a28-45d6-9cf3-ef239180252f 7 Ein deutsches Requiem, op. 45: VII. Selig sind die Toten, die in dem Herrn sterben
	track_id [0.0]
	track_index [0.0]
	track_length [0.0]
	track_title [0.4603174603174603]
	...
            dist: Barbara Hendricks, Jose Van Dam; Herbert Von Karajan: Vienna Philharmonic Orchestra, Vienna Singverein - Brahms: Ein Deutsches Requiem / ec7e7dde-55ff-33cc-b2df-20276f7371eb: 0.15 "{'album': [0.24], 'tracks': [0.01288056206088993, 0.009259259259259259, 0.032879818594104306, 0.009803921568627453, 0.013677811550151976, 0.009685230024213076, 0.009852216748768473], 'mediums': [0.0], 'year': [0.27586206896551724], 'artist': [0.6703296703296703]}"
        Success. Distance: 0.15
	choice: /Data/tmp/music-minTest/Barbara Hendricks, Jose Van Dam; Herbert Von Karajan_ Vienna Philharmonic Orchestra, Vienna Singverein/Brahms_ Ein Deutsches Requiem	action.APPLY	Brahms: Ein Deutsches Requiem	Barbara Hendricks, Jose Van Dam; Herbert Von Karajan: Vienna Philharmonic Orchestra, Vienna Singverein	0.15	ec7e7dde-55ff-33cc-b2df-20276f7371eb	7	7	0	0

to my mind, if i have an album where every track has an identified track_id (acoustic fingerprint ID) attribute, that should trump (you should excuse the expression :) minor mismatches in things like track_title; do you agree, or can there be mitigating issues?

further (and i suggest this with all humility), there seems to me to be a general bug in beets’ DISTANCE logic with respect to EXACT logical matches like an identical track_id: there seems to be no way to up-weight this feature, because it will have distance=0 on matches!? is there another mechanism whereby plugins could/should manipulate the distance calculation in logical (vs. weighted-sum-of-distances) situations like this?

def pprint(self, lbl):
    """Print every penalty that feeds into this Distance object."""
    print("** " + lbl)
    # Album-level penalties (everything except the per-track list).
    for k in sorted(self._penalties.keys()):
        if k == 'tracks':
            continue
        print(k, self[k])
    # The per-track total distances.
    print('* dist.tracks')
    for ti, tp in enumerate(self._penalties['tracks']):
        print(ti, tp)
    # The per-track penalty breakdown, ordered by track index.
    print('* tracks')
    for track in sorted(self.tracks.keys(), key=lambda t: t.index):
        tinfo = self.tracks[track]
        print("track", track.index, track.track_id)
        for tk in sorted(tinfo._penalties):
            print(tk, tinfo._penalties[tk])

Cool! I like the idea of a pretty-printer for distance objects to get more detail about what went into a match.

It is true that beets tends to steer away from “logical,” rule-based measurement of match quality. That is, you could imagine building in a fixed-function rule that says “if the track ID matches, then nothing else matters.” (One pseudo-exception is the match.ignored setting, which does kind of override the distance-based matching.) There are two reasons we have not done this:

  • Modularity. Keeping all factors for deciding match quality under a single, uniform representation has a lot of advantages in simplicity. There is a single source of truth to look at to understand what beets “thinks” about a match, for example. If we had a long list of exceptions and rules, it would be even harder than it is now to see when something went wrong. And it would be even harder to change the relative importance of one factor over another.
  • Contingency. It’s actually not true that an MBID or fingerprint match should override differences in track titles and artists in all cases. First of all, fingerprints can be wrong! It’s not an exact science, especially for different recordings of the same song. Second, if we’re changing the titles of tracks based on fingerprint matches, that’s a big change—it’s probably something the user will want to know about. So applying a match penalty is our way of having the UI alert the user that something big is going to change.
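(As an aside, the match.ignored escape hatch mentioned above is just a config-level list of penalty names; as I recall, candidates carrying those penalties get skipped outright. Something like:

match:
    ignored: missing_tracks unmatched_tracks

is the usual example.)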

So the “normal beets way” to do what you want is to just increase the weight of the ID match. (FWIW, “correct” matches still count when computing the distance, so distance=0 does mean something—and it means more for fields with greater weight.) You can actually do that with an undocumented config option that lets you manually set match weights. See this example:
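Roughly, it looks like this (a sketch only; the keys are the standard distance_weights names and the numbers are just illustrative):

match:
    distance_weights:
        track_id: 10.0
        track_title: 0.5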

Of course, I can see the argument for Chroma in particular overriding the normal matching logic and “forcing” matches when IDs match. That could be worth exploring, but it would not be an easy change.

right, i’m using those match.distance_weights now. and i can certainly see modularity reasons for doing dot-product-style weight*feature combination. but unless i’m missing something, even match.distance_weights.track_id = \infinity contributes zero if you multiply it by a feature distance of 0.0?

so with all those caveats… if one wanted to graft a plugin to the existing match process…?

No, a 0.0 score still means something, because of normalization. Here’s how it works: imagine that there are only two factors, A and B, and that A has weight 1000 and B has weight 2. If A has score 0.0 and B has score 1.0, the raw total is 0.0 * 1000 + 1.0 * 2. But then we normalize by the total weight, so the final normalized distance is (0.0 * 1000 + 1.0 * 2) / (1000 + 2), which is about 0.002. That means the match is seen as extremely good, i.e., the normalized distance is very small.
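In code form, that toy calculation is just this (nothing beets-specific, simply the weighted average described above):

def normalized_distance(weights, scores):
    # Weighted average: a heavily weighted 0.0 score pulls the result down.
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

# The example above: A (weight 1000, score 0.0) and B (weight 2, score 1.0).
print(normalized_distance([1000, 2], [0.0, 1.0]))  # ~0.002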

Unfortunately, I don’t really know what to recommend if you wanted to change how all this works. It’s very much baked into the matching process. Adding penalties is one thing, but changing the entire way this stuff is calculated is another, and there is no specific plugin hook built for that.

ok, thanks, i get the normalization bit. i’ll hack some more, see if i can build something that suits my needs, and get back to you.

thanks again for the friendly beets playground you’ve driven.


Coda: testing against a set of known albums with matching fingerprints, i can only get them to be accepted with massive up-weighting of track_id, down-weighting of other key attributes, and strong_rec_thresh: 0.1:

match:
    strong_rec_thresh: 0.1
    distance_weights:
        artist: 1.0
        album: 1.0
        album_id: 5.0
        track_title: 1.0
        track_artist: 1.0
        track_id: 10000.0