When I ran an import against my library, it ate up all 256 GB of RAM and crashed after two hours. Importing per directory works, though. Is there a way to throttle the import into chunks, or would a for-loop script over each directory accomplish the same thing?
There is already a certain amount of throttling, since the number of pending tasks for each stage of the importer is limited (to a fairly large value). I don’t see how that number of tasks could use this much RAM, so additional throttling probably won’t solve the problem. More likely something (a plugin?) is keeping references to the tasks beyond the import process. Anyway, I’m just speculating without further details. You might want to try disabling plugins and see whether the issue persists. It is generally good practice to post at least your beets config when reporting such issues.
The for-loop would work too, of course.
beet config
directory: /cloud/music
library: ~/data/musiclibrary.db
threaded: yes
import:
    copy: no
    write: yes
    move: yes
    autotag: yes
    log: ~/beetslog.txt
    incremental: yes
    quiet: yes
original_date: yes
per_disc_numbering: yes
embedart:
    auto: yes
art_filename: albumart
plugins: mbcollection inline fetchart lastgenre rewrite fromfilename bucket
mbcollection:
    auto: yes
    collection: library
    remove: no
pluginpath: ~/data/
ui:
    color: yes
paths:
    default: $albumartist/$album/$track $title
    singleton: $albumartist/$artist - $title
    comp: $albumartist/$album/$track $title
    albumtype:soundtrack: Soundtracks/$album/$track $title
duplicate_action: keep
musicbrainz:
    user: nnn
    pass: REDACTED
    auto: yes
    collection: library
fetchart:
    auto: yes
    cautious: yes
    sources: filesystem coverart itunes amazon albumart
    minwidth: 0
    maxwidth: 0
    quality: 0
    enforce_ratio: no
    cover_names:
        - cover
        - front
        - art
        - album
        - folder
    google_key: REDACTED
    google_engine: 001442825323518660753:hrh5ch1gjzm
    fanarttv_key: REDACTED
    lastfm_key: REDACTED
    store_source: no
    high_resolution: no
lastgenre:
    auto: yes
    source: album
    whitelist: yes
    min_weight: 10
    count: 1
    fallback:
    canonical: no
    force: yes
    separator: ', '
    prefer_specific: no
    title_case: yes
replace:
    '[\\/]': _
    ^\.: _
    '[\x00-\x1f]': _
    '[<>:"\?\*\|]': _
    \.$: _
edit:
    itemfields:
        - album
        - albumartist
        - artist
        - track
        - title
        - year
    albumfields:
        - albumartist
        - album
        - year
        - albumtype
match:
    strong_rec_thresh: 0.04
    medium_rec_thresh: 0.25
    rec_gap_thresh: 0.25
    max_rec:
        source: strong
        artist: strong
        album: strong
        media: strong
        mediums: strong
        year: strong
        country: strong
        label: strong
        catalognum: strong
        albumdisambig: strong
        album_id: strong
        tracks: strong
        missing_tracks: medium
        unmatched_tracks: medium
        track_title: strong
        track_artist: strong
        track_index: strong
        track_length: strong
        track_id: strong
chroma:
    auto: no
bucket:
    bucket_alpha:
        - _
        - A
        - B
        - C
        - D
        - E
        - F
        - G
        - H
        - I
        - J
        - K
        - L
        - M
        - N
        - O
        - P
        - Q
        - R
        - S
        - T
        - U
        - V
        - W
        - X
        - Y
        - Z
    bucket_alpha_regex:
        _: ^[^A-Z]
    bucket_year: []
    extrapolate: no
pathfields: {}
item_fields: {}
album_fields: {}
rewrite: {}
I had a very brief look at the code of those plugins and didn’t see any obvious way they would leak memory. If you want to debug this further, I’d nevertheless suggest attempting a large import with all plugins disabled in order to narrow down the culprit.
I guess that by far the easiest way forward for you would be the for-loop.
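Something like this minimal Python sketch would do it (a plain shell for-loop over the subdirectories works just as well); the assumption that each top-level directory under /cloud/music is a sensible unit to import is mine, so adjust it to your layout:

# Hypothetical helper: run "beet import" once per top-level directory so each
# session stays small. With "incremental: yes" in the config, directories that
# were already imported are skipped, so re-running the loop is safe.
import subprocess
from pathlib import Path

MUSIC_ROOT = Path("/cloud/music")  # assumption: the same root as in the config above

for directory in sorted(p for p in MUSIC_ROOT.iterdir() if p.is_dir()):
    subprocess.run(["beet", "import", str(directory)], check=False)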
Some more thoughts on debugging: if it’s not one of the plugins, I think the next step would be to build a test case that generates such huge import sessions (maybe stub out importer.read_tasks instead of actually generating thousands or millions of files?) and then hook up Python’s tracemalloc in the album_imported event to detect the site where the leaking memory is allocated. Such a test case would be nice to have in general, so I have put it on my todo list, but I don’t think I’ll look into it anytime soon.
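For illustration, here is a rough sketch of what that hook could look like as a throwaway debugging plugin (the module name memdebug and the snapshot-diff details are made up for this example, not an existing beets plugin): it takes a tracemalloc snapshot on every album_imported event and logs the allocation sites that grew the most since the previous album.

# Hypothetical debugging plugin, e.g. saved as ~/data/memdebug.py (the
# configured pluginpath) and enabled by adding "memdebug" to the plugins list.
import tracemalloc

from beets.plugins import BeetsPlugin


class MemDebugPlugin(BeetsPlugin):
    def __init__(self):
        super().__init__()
        tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
        self._last_snapshot = tracemalloc.take_snapshot()
        self.register_listener('album_imported', self.album_imported)

    def album_imported(self, lib, album):
        snapshot = tracemalloc.take_snapshot()
        # Compare against the snapshot from the previous album to see what grew.
        for stat in snapshot.compare_to(self._last_snapshot, 'lineno')[:10]:
            self._log.info('{}', stat)
        self._last_snapshot = snapshot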