Any suggestions for improving import/update throughput?

Hi,

I am new to beets. I have been using Picard for many years and all my files are properly tagged, using MB IDs as much as possible (I've also been adding lots of albums to MB whenever they weren't available there). I was interested in some of the extra functionality that beets can offer: advanced database queries, mass tagging, and plugins like LastImport to bring my 15 years of listening history into my own database. So at this stage I am not doing any tagging in beets; I just wanted to import my library, which is accessible over a network share.

However, my library is fairly large (over 350K tracks) and I found the beets import rather slow. It started well, but as the DB grew I started seeing more than 5 seconds between albums, which would have meant about 2 days for the full import to complete.
After a bit of investigation I found that the import was CPU bound (it was using a full CPU core on the computer running the import): it was spending its time doing full scans of the ever-growing items table, which kept slowing it down. I also found that creating an index on the path column helps a lot in my case, e.g.:

CREATE INDEX items_path_idx ON items(path);

This helped increase the throughput of the import from 1 album every 5 seconds to about 3 albums per second on average, so the import would complete in less than 4 hours.
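In case anyone wants to verify this on their own library, SQLite can show the query plan directly. A minimal check, assuming the default database location (adjust the path if your library.db lives elsewhere):

# beets stores path as a BLOB, so compare against a blob literal;
# x'2f6d75736963' is just hex for "/music", used here as a placeholder.
sqlite3 ~/.config/beets/library.db \
  "EXPLAIN QUERY PLAN SELECT * FROM items WHERE path = x'2f6d75736963';"
# With the index it reports: SEARCH items USING INDEX items_path_idx (path=?)
# Without it:                SCAN items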

This is not bad for an initial import, but every time I rerun the import after adding more files to the library, it takes about the same time, and an update takes about 6 hours…
So if I have to run update + import every time I update existing files and add new ones (about once a week), that's 10 hours in total, which is a bit of a showstopper for me.

At this stage I am running out of ideas about how to make it faster, so any help would be appreciated. In comparison, an incremental scan in LMS completes in 10-20 minutes depending on the amount of changes, which is perfectly acceptable to me. If I could get beets down to no more than an hour, that would also be fine.

Thanks for any input, all help appreciated! :wink:

Can you post what commands you're using? It sounds like you're doing a full import every time you get a new album. That is, you're importing 350k tracks, getting another 10, then importing 350k+10 tracks. If so, proper usage looks like:

beet import /bigfolder/
# days later, big folder is done
beet import /tinynewalbum/
# no need to reimport existing albums, this import will be quick
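There's also the incremental option, which records finished directories in beets' state file and skips them on later runs. If you really do want to point beets at the whole folder every time, something like this should only touch what's new:

# -i / --incremental: skip directories that were already imported
beet import -i /bigfolder/
# or turn it on permanently in your config:
#   import:
#     incremental: yes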

If you have a new path format, use move; see the FAQ (beets 1.6.1 documentation).

Still, Beets has been known to perform at a “meh” level in many areas for a while. Your problems wouldn’t terribly surprise me even if you’re doing everything right.

Yes, indeed, I am importing the full folder, but not after each album; only once every 1-2 weeks on average. In the meantime I will have added anywhere between 50 and 100 new albums and may also have updated tags on a few dozen more (mainly after adding them to MB because they were unknown there). It's impossible to keep track of each and every added and modified album, and manually running 50 to 100 separate commands to add them to beets is not very practical either.
In LMS I just use "look for new and changed music" and it figures out on its own which files it has to scan. I assumed an import command in beets would do the same, but apparently I was wrong… In beets I have to run update to find all modified albums, then import to add all new ones. As mentioned, I am not using beets to tag or move/reorganize files; all I want is for them to be added to the database so I can run my custom reports and other plugins.
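For reference, the weekly refresh I have in mind boils down to something like this, with /music standing in for my share (the -A flag skips autotagging, since everything is already tagged in Picard):

beet update               # pick up tag changes on files already in the DB
beet import -A -i /music  # add new files as-is (-A: no autotag, -i: incremental)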

Oh. That sounds like you’re doing it right then.

Well, actually it turns out I've been doing some things wrong after all… Apparently paths are case sensitive in beets, so when importing files from my network share, beet import \\server\share is not at all the same thing as \\SERVER\share, for example… With a different upper/lower case combination, all files get imported again, so what I thought was an incremental import turned out to be a full import all over again. After a few imports I ended up with the same files imported 3 or 4 times, and my database had grown to over 1 million items…

I have now restarted from scratch with a new database and will always use the same spelling for my network share, i.e. always \\server\share and nothing else.

I have also created a unique index in my database, which will prevent the creation of duplicates:

CREATE UNIQUE INDEX enforce_unique_paths_idx ON items(lower(CAST(path AS TEXT)));

So if I first import using \\server\share, then I'm forced to use the same spelling every time; otherwise the import will simply fail with a unique key violation, which prevents such inconsistencies from being created again.
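Note that creating the unique index itself fails if case-duplicates already exist, so it doubles as a consistency check. You can also list any offenders first with a grouping query (again assuming the default database path):

# List paths that differ only in case; these would block the unique index.
sqlite3 ~/.config/beets/library.db \
  "SELECT lower(CAST(path AS TEXT)) AS p, count(*) FROM items GROUP BY p HAVING count(*) > 1;"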

Now if I run a full import again without adding any new albums, it completes in just over an hour, which is already much better. The update command still takes as long as before (around 6 hours), no improvement there, so I will probably schedule it less frequently than the weekly import.