Import 2.3 million files, or split them up?

When I ran an import against my library, it ate up all 256 GB of RAM and crashed after 2 hours. Importing per directory works, though. Is there a way to throttle the import into chunks, or will a for-loop script over each directory accomplish this?

There is already a certain amount of throttling, since the number of pending tasks for each stage of the importer is limited (to a fairly large value). I don’t see how that many tasks could use that much RAM, so additional throttling probably won’t solve the problem. More likely something (a plugin?) is keeping references to the tasks beyond the import process. Anyway, I’m just speculating without further details. You might want to try disabling plugins and see whether the issue persists. It is generally good practice to post at least your beets config when reporting such issues.

The for-loop would work too, of course.
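
For reference, here is a minimal sketch of such a per-directory loop, written in Python around the beet CLI rather than as a shell one-liner. The source path is a placeholder, not something taken from this thread, and it assumes one import session per top-level directory is what you want:

#!/usr/bin/env python3
# Sketch: run one `beet import` per top-level directory instead of a
# single huge session. SOURCE_ROOT is a placeholder path.
import subprocess
from pathlib import Path

SOURCE_ROOT = Path("/path/to/unimported/music")  # adjust to your layout

for entry in sorted(SOURCE_ROOT.iterdir()):
    if not entry.is_dir():
        continue
    # Each call is a separate import session, so memory is released between
    # directories. With incremental: yes in the config, directories that were
    # already imported on a previous run are skipped, so the loop can be
    # re-run safely after an interruption.
    print(f"importing {entry} ...")
    subprocess.run(["beet", "import", str(entry)], check=False)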


beet config

directory: /cloud/music
library: ~/data/musiclibrary.db
threaded: yes

import:
    copy: no
    write: yes
    move: yes
    autotag: yes
    log: ~/beetslog.txt
    incremental: yes
    quiet: yes
original_date: yes
per_disc_numbering: yes
embedart:
    auto: yes
art_filename: albumart

plugins: mbcollection inline fetchart lastgenre rewrite fromfilename bucket
mbcollection:
    auto: yes
    collection: library
    remove: no
pluginpath: ~/data/

ui:
    color: yes

paths:
    default: $albumartist/$album/$track $title
    singleton: $albumartist/$artist - $title
    comp: $albumartist/$album/$track $title
    albumtype:soundtrack: Soundtracks/$album/$track $title
duplicate_action: keep

musicbrainz:
    user: nnn
    pass: REDACTED
    auto: yes
    collection: library
fetchart:
    auto: yes
    cautious: yes
    sources: filesystem coverart itunes amazon albumart
    minwidth: 0
    maxwidth: 0
    quality: 0
    enforce_ratio: no
    cover_names:
    - cover
    - front
    - art
    - album
    - folder
    google_key: REDACTED
    google_engine: 001442825323518660753:hrh5ch1gjzm
    fanarttv_key: REDACTED
    lastfm_key: REDACTED
    store_source: no
    high_resolution: no
lastgenre:
    auto: yes
    source: album
    whitelist: yes
    min_weight: 10
    count: 1
    fallback:
    canonical: no
    force: yes
    separator: ', '
    prefer_specific: no
    title_case: yes

replace:
    '[\\/]': _
    ^\.: _
    '[\x00-\x1f]': _
    '[<>:"\?\*\|]': _
    \.$: _
edit:
    itemfields:
    - album
    - albumartist
    - artist
    - track
    - title
    - year
    albumfields:
    - albumartist
    - album
    - year
    - albumtype

match:
    strong_rec_thresh: 0.04
    medium_rec_thresh: 0.25
    rec_gap_thresh: 0.25
    max_rec:
        source: strong
        artist: strong
        album: strong
        media: strong
        mediums: strong
        year: strong
        country: strong
        label: strong
        catalognum: strong
        albumdisambig: strong
        album_id: strong
        tracks: strong
        missing_tracks: medium
        unmatched_tracks: medium
        track_title: strong
        track_artist: strong
        track_index: strong
        track_length: strong
        track_id: strong
chroma:
    auto: no
bucket:
    bucket_alpha:
    - _
    - A
    - B
    - C
    - D
    - E
    - F
    - G
    - H
    - I
    - J
    - K
    - L
    - M
    - N
    - O
    - P
    - Q
    - R
    - S
    - T
    - U
    - V
    - W
    - X
    - Y
    - Z
    bucket_alpha_regex:
        _: ^[^A-Z]
    bucket_year: []
    extrapolate: no
pathfields: {}
item_fields: {}
album_fields: {}
rewrite: {}

I had a very brief look at the code of those plugins and didn’t see any obvious way they could leak memory. If you want to debug this further, I’d nevertheless suggest attempting a large import with all plugins disabled to narrow down the culprit.

I guess that by far the easiest way forward for you would be the for-loop.

Some more thoughts on debugging: if it’s not one of the plugins, I think the next step would be to build a test case that generates such huge import sessions (maybe by stubbing out importer.read_tasks instead of actually generating thousands or millions of files?) and then hook up Python’s tracemalloc in the album_imported event to detect the site where the leaking memory is allocated. Such a test case would be nice to have in general, so I’ve put it on my todo list, but I don’t think I’ll get to it anytime soon.
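
To make that last idea a bit more concrete, here is a rough sketch of such a tracemalloc hook as a tiny plugin. The plugin name and the choice to print the top ten allocation diffs are my own, not anything from beets itself; dropped as e.g. memtrace.py into the configured pluginpath and added to the plugins: list, it would report which allocation sites keep growing from one imported album to the next:

import tracemalloc

from beets.plugins import BeetsPlugin


class MemTracePlugin(BeetsPlugin):
    """Sketch: print the allocation sites that grew the most since the
    previously imported album, so a leak shows up as steadily growing lines."""

    def __init__(self):
        super().__init__()
        tracemalloc.start()
        self._last_snapshot = tracemalloc.take_snapshot()
        self.register_listener('album_imported', self.album_imported)

    def album_imported(self, lib, album):
        snapshot = tracemalloc.take_snapshot()
        # Compare against the snapshot taken after the previous album.
        stats = snapshot.compare_to(self._last_snapshot, 'lineno')
        self._last_snapshot = snapshot
        print(f"top allocation growth after importing {album.album}:")
        for stat in stats[:10]:
            print(stat)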
