Making song matching more robust

Stavros · March 3, 2017, 3:07pm

Hey everyone,
Like many of you, I’ve had my music, and my playlists, for decades now. A few years ago I uploaded the songs from my Favorites playlist to Google Music, and now I want to migrate back to local again. I have all the songs that are on Google Music locally, but Google Music won’t export the playlist at all, and I don’t want to go looking for all the songs one by one, so I figured beets can help with this.

I managed to export a list of Artist - Title from Google Music, and I figure I can write a plugin to query beets’ database for the name and show me a few file paths so I can pick. That should be easily doable, yes? (I’m a professional Python developer, so I’m handy with Python).

Secondly, I wrote a new playlist spec, which I call Universal Playlist (UPL), because of exactly this problem. This playlist format includes MusicBrainz IDs, hashes, etc., as well as artist name and title. I have written a small utility to convert from PLS to UPL, but the opposite is harder, because I need a database that’s indexed by both the metadata that already exists in ID3 tags and things like the SHA/MD5 hash of the song.

My second question is: Does beets allow arbitrary, indexable metadata into its database? I would like to write a plugin that would, on song import, read all the metadata and add the MBID, SHA/MD5/etc hash to the database, so I could quickly get a song by its SHA hash later on. Is this possible without changing the beets core?

Thanks!

adrian · March 3, 2017, 5:03pm

Interesting problem!

Yes, the plugin you describe shouldn’t be too hard (I think). It’s pretty straightforward to construct queries of the kind you’re describing. Take a look at the dbcore.query module for the options; it’s as easy as something like AndQuery([MatchQuery('artist', artist), MatchQuery('title', title)]).
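A minimal sketch of that lookup, assuming beets is installed and `lib` is an open beets Library. The helper name `find_paths` is hypothetical. Note that MatchQuery matches a field exactly, so differences in spelling or case between the exported list and the tags will produce no hits.

```python
def find_paths(lib, artist, title):
    """Return the paths of library items whose artist and title match exactly.

    `lib` is assumed to be a beets Library. The import is deferred so this
    module can be loaded even where beets is not installed.
    """
    from beets.dbcore.query import AndQuery, MatchQuery

    query = AndQuery([
        MatchQuery('artist', artist),
        MatchQuery('title', title),
    ])
    # beets stores Item.path as bytes; decode it before displaying it.
    return [item.path for item in lib.items(query)]
```

An empty result means no exact match, and more than one result means the library holds duplicates that need to be picked by hand.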

I’m really interested in the UPL idea, which looks like a great alternative to plain old m3u or XSPF playlists. It would be cool to explore direct integration with beets for something like this.

Finally, yes—storing extra data attached to your music is a “core competency” for beets. We have a thing called flexible attributes that lets you associate new, arbitrary fields with tracks and albums.

Stavros · March 3, 2017, 6:50pm

It’s pretty straightforward to construct queries of the kind you’re describing.

Great, I should be able to create something that outputs a PLS file quite easily, then, thanks!

It would be cool to explore direct integration with beets for something like this.

I would love that. I found out about beets while I was writing the UPL spec, and found it quite exciting, so I installed it and am playing with it. Am I correct that it seems more of a plugin platform than an actual manager, though? All the functionality seems to be handled by plugins, except querying, which just tells you what songs exist, but doesn’t give you a filename.

Regardless, it’d be great if we could add UPL support to beets, and perhaps playlist management. I know many people who don’t make playlists because they consider it futile (they break as soon as you move a file), but UPL playlists wouldn’t break, and beets could make working with them a breeze.

We have a thing called flexible attributes

That looks to be exactly what I need, thanks!

adrian · March 3, 2017, 7:55pm

We do rely a lot on plugins, but you can indeed do most of the basic things you want without any. For example, to get the filenames for your music, type beet ls -p.

Using a representation like yours in our SQLite database, that can round-trip with actual UPL files, does indeed seem like a great fit. Here’s our (very old) tracking ticket for playlist management in beets, for what it’s worth: Playlists · Issue #123 · beetbox/beets · GitHub

Stavros · March 4, 2017, 11:09pm

Great, thank you! Here’s my current plan, then, about what my plugin should do and how:

When a song is being imported, it should write as much of the relevant metadata as possible to the database (from both the tags and the file itself, for the hashes).
There should be a function to return the path of a song in the database from any metadata that is passed as input. This will allow the plugin to read any supported kind of playlist (PLS/M3U/UPL) and output a UPL or PLS playlist with valid filenames.
The plugin should implement a command to do the above, i.e. accept a playlist as input, which may be a broken PLS file, a UPL file, or anything else, and write a UPL (for working with) or PLS (for compatibility) with all the proper paths (perhaps after disambiguating duplicate songs) to disk.
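The lookup step in this plan could be sketched roughly as follows, in plain Python without beets. The field names (`sha256`, `mbid`) and the helper names are hypothetical stand-ins, not taken from the UPL spec; the idea is only to try the strongest identifiers first and fall back to artist and title.

```python
# Hypothetical identifier fields, ordered from most to least reliable.
STRONG_FIELDS = ("sha256", "mbid")

def build_index(songs):
    """Map (field, value) pairs to the list of paths that carry them."""
    index = {}
    for song in songs:
        for field in STRONG_FIELDS:
            if song.get(field):
                index.setdefault((field, song[field]), []).append(song["path"])
        name_key = ("name", (song["artist"].casefold(), song["title"].casefold()))
        index.setdefault(name_key, []).append(song["path"])
    return index

def candidate_paths(entry, index):
    """Return possible paths for a playlist entry, strong identifiers first."""
    for field in STRONG_FIELDS:
        paths = index.get((field, entry.get(field)))
        if paths:
            return paths
    name_key = ("name", (entry.get("artist", "").casefold(),
                         entry.get("title", "").casefold()))
    # More than one path here means the caller has to disambiguate duplicates.
    return index.get(name_key, [])
```

A broken PLS entry would typically only have artist and title to go on, while a UPL entry could be resolved by hash or MBID even after the file has moved.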

How does that sound? Is it doable, or do you foresee any potential issues with it?

adrian · March 5, 2017, 12:07am

Yeah, that sounds reasonable! I’d be happy to take a look if you get the project started on GitHub.

Stavros · March 5, 2017, 11:03am

Great, thank you! Reading through the flexible fields documentation, I see a potential problem. Does beets really do a sequential scan in Python for each query? For a list of a thousand songs, and for six attributes per song, that’s six thousand scans and loads and unloads of a database of potentially hundreds of thousands of songs from memory. I think that’s going to be a rather large performance problem. Why are the fields in the database not indexed?

adrian · March 5, 2017, 1:34pm

Yes, queries on flexible attributes are currently implemented in a pretty naive way. There’s no fundamental reason for this: we just need someone to take a close look at indexing and clever joins to avoid the linear scan. So if you’re curious, it would be awesome to have help addressing that.

I’d also urge you to put together a small test—often, the unoptimized queries aren’t quite as bad as they seem.

Stavros · March 5, 2017, 2:18pm

Admittedly I haven’t looked at the schema, but is it much harder than a table of (foreign key to track, key, value), an index on (key, value), and then something like select * from attributes join songs on songs.id = attributes.track_id where key = ? and value = ?

adrian · March 5, 2017, 2:46pm

Yes, that’s exactly what the schema looks like: there’s a tracks table (called items) with the “fixed” built-in attributes and an item_attributes table that indeed consists of the foreign key (id), the key (string), and the value.

So an index like you’re describing is exactly the right thing. The only complication is that users can add arbitrary new fields to the database—so we’d need to decide on some sort of policy for when to create the index. For example:

It could be created automatically when the user first adds an attribute. (But do we actually want to pay for an index for every attribute? And how do we know when to remove an index?)
There could be an explicit option where the user asks for an attribute to be indexed. (But that sounds a little unnecessarily complex. And it also doesn’t solve the deletion problem.)
Perhaps indices should be managed by plugin code.

Et cetera.

Stavros · March 5, 2017, 3:35pm

Honestly, it would just be easier to always create the index. The database isn’t that insert-heavy anyway, the majority of operations are reads. Besides, how will you do per-plugin indexes on a single table? You either have an index or you don’t, no?

adrian · March 5, 2017, 3:52pm

Right; if we did the plugin-managed thing, the plugins would be in control of the creation of shared indices in the central database—they wouldn’t be creating an index for their own exclusive use.

Yes, always making the index is probably the right thing to do—I don’t currently see any other alternative that seems better. Of course, the devil is in the details: it will be somewhat annoying to constantly check whether indices exist yet before creating them. An alternative, I suppose, would be to defer creating indices from scratch until a periodic point where they can be batched up.

Stavros · March 5, 2017, 4:11pm

I must be missing something, isn’t there a single item_attributes table? If so, the index would just be created on table creation. If not, what do the tables look like?

Also, even if indices are per-app, you don’t have to check if they exist before creating them. You can just do CREATE INDEX IF NOT EXISTS every time you create the tables.

adrian · March 5, 2017, 4:15pm

Oh oh! Forgive me, I totally misunderstood—I thought you were proposing a separate index for every key. Of course, a single big index on (key, value) in the item_attributes table makes way more sense and gets rid of all of these problems. Sorry about that; I feel dumb for not catching on sooner.

Anyway, yes, we should add that index. We still need a way to craft the SQL query to actually use the index, rather than just loading the data and matching in Python, but I don’t think that will be very hard.
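To make the idea concrete, here is a small sketch using Python’s sqlite3 module. The schema is a simplified assumption based on the description in this thread, not a copy of the real beets schema, and the index name and the `paths_for_attribute` helper are made up for illustration.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a simplified items/attributes schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS items (
            id INTEGER PRIMARY KEY,
            path TEXT
        );
        CREATE TABLE IF NOT EXISTS item_attributes (
            entity_id INTEGER REFERENCES items(id),
            key TEXT,
            value TEXT
        );
        -- Safe to run on every startup: a no-op once the index exists.
        CREATE INDEX IF NOT EXISTS item_attributes_key_value
            ON item_attributes (key, value);
    """)
    return conn

LOOKUP = """
    SELECT items.path
    FROM item_attributes
    JOIN items ON items.id = item_attributes.entity_id
    WHERE item_attributes.key = ? AND item_attributes.value = ?
"""

def paths_for_attribute(conn, key, value):
    """Return paths of items whose flexible attribute `key` equals `value`."""
    return [row[0] for row in conn.execute(LOOKUP, (key, value))]

conn = make_db()
conn.execute("INSERT INTO items (id, path) VALUES (1, '/music/a.mp3')")
conn.execute("INSERT INTO item_attributes VALUES (1, 'sha256', 'abc123')")
print(paths_for_attribute(conn, "sha256", "abc123"))

# Ask SQLite how it plans to run the lookup; the wording varies by version.
for row in conn.execute("EXPLAIN QUERY PLAN " + LOOKUP, ("sha256", "abc123")):
    print(row)
```

The point of the sketch is that the equality test on key and value happens inside the SQL statement, which is what gives SQLite the chance to use the index at all.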

Stavros · March 5, 2017, 4:34pm

Yes, it shouldn’t be too hard (and creating the index is trivial). Generally, the database is going to be much more efficient than Python at things. Just make sure you give it hints properly, so if the user intends a “starts with” query, don’t do something like “LIKE ‘%foo%’” and then filter in Python, or the database won’t be able to optimize properly. It’s just basic considerations like those, and queries will be many times faster.
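As a rough illustration of pushing a “starts with” match into the database, the sketch below swaps in a range comparison instead of LIKE, because whether SQLite can use an index for LIKE 'foo%' depends on collation and case-sensitivity settings. The table layout and helper names are assumptions for the example, and the match is case-sensitive.

```python
import sqlite3

def prefix_bounds(prefix):
    """Return (low, high) so that low <= value < high selects values starting with prefix."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    # Incrementing the last character gives the first string past the prefix range.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

def values_with_prefix(conn, key, prefix):
    """Run the prefix match inside SQLite instead of filtering rows in Python."""
    low, high = prefix_bounds(prefix)
    rows = conn.execute(
        "SELECT value FROM item_attributes "
        "WHERE key = ? AND value >= ? AND value < ? ORDER BY value",
        (key, low, high),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item_attributes (entity_id INTEGER, key TEXT, value TEXT)")
conn.execute("CREATE INDEX item_attributes_key_value ON item_attributes (key, value)")
names = ["Foo Fighters", "Foals", "Food", "Bar Foo"]
conn.executemany(
    "INSERT INTO item_attributes VALUES (?, 'artist', ?)",
    list(enumerate(names, start=1)),
)
print(values_with_prefix(conn, "artist", "Foo"))
```

Only the rows that actually start with the prefix leave the database, so there is nothing left to filter in Python afterwards.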

Stavros · March 8, 2017, 3:00am

I am trying to write a plugin for this, but I’m running into a few issues:

It seems that beets cannot search for non-English characters? Am I doing something wrong?
The documentation doesn’t detail exactly what sort of parameters library.items() accepts as a query, and I’m having trouble writing a simple artist/title query. I want to find the path of a song knowing its artist and title, but beets returns either too many songs or too few. How should I structure my query so that it works?

Thanks!

adrian · March 8, 2017, 3:58am

Hmm, Unicode strings should work fine in queries. What sort of trouble are you running into?

About the query objects to pass to library.items, I mentioned something about this above:

Take a look at the dbcore.query module for the options; it’s as easy as something like AndQuery([MatchQuery('artist', artist), MatchQuery('title', title)]).

Is that close to what you’re trying? Perhaps it would be helpful to take a look at your in-progress code.

Stavros · March 8, 2017, 9:36am

Hmm, Unicode strings should work fine in queries. What sort of trouble are you running into?

It’s not detecting a song I know I have, but I’ll test some more and see, because it’s detecting others.

About the query, sorry, by the time I reached that place in the plugin I forgot that you’d mentioned it, and was trying something like “artist:%s title:%s”. I’ll fix it and report back, thanks!

Stavros · May 12, 2017, 11:12pm

Hello again! I’ve gotten quite a bit of the way there, but I need a few details that I can’t find in the docs:

How can I add information to the database on import? I want to add the hash of the file when it’s imported.
How can I update the above data when the file is changed? I.e., how can I detect the change?
How can I get the metadata for a file given the file name?

Thank you!

adrian · May 13, 2017, 2:41am

Hi!

You just need to set fields on the Item or Album object and then call store. For example, item.hash = 'foo' ; item.store().
Perhaps it would suffice to listen for the write event.
You probably want to issue a query that matches on the path. Here’s an example from the web plugin.
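For the first point, a sketch of what “set a field and call store” might look like for a file hash. It is written against anything that has a `path` attribute and a `store()` method, as beets Item objects do; the helper names and the field name `sha256` are arbitrary choices for the example.

```python
import hashlib

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def remember_hash(item):
    """Set a flexible `sha256` field on a beets-style item and persist it.

    `item` only needs a `path` attribute and a `store()` method.
    """
    item.sha256 = file_sha256(item.path)
    item.store()
```

In a plugin, a function like this could be called from an event listener, once when an item is imported and again whenever the file is rewritten.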
