Felix Crux

Technology & Miscellanea

Many years ago I laboriously ripped a collection of CDs (and a few cassettes) into digital files. It was already clear at the time that I wouldn’t want to lug discs around and that these newfangled “MP3 players” were on to something good.

Unfortunately, the hard drive space required to store my collection in a lossless format was far out of reach for me at the time, so I resigned myself to keeping everything as lossy low-bitrate MP3s. Later, a new free/open-source and patent-unencumbered format called Ogg Vorbis turned up on the scene, and my enthusiasm for it overpowered my horror at re-encoding from one lossy format to another, so I gritted my teeth at the quality degradation and converted everything.

Now in the present, Vorbis has been surpassed by Opus, FLAC exists, hard drives are cheap, and it’s time to start over.

My objective is to end up with a high-quality lossless “archival” version of all my physical and digital album purchases on my home storage server, comprehensively tagged with a full set of accurate metadata. Along with that, I want at least two sets of derived Ogg Opus re-encodings with differing levels of compression/quality: a lower-quality one for keeping on my cellphone, and a better one for desktop and laptop use.

I want all of the files to be accurately and consistently tagged with a basic set of metadata — enough to populate a music player’s database and UI fully, but not necessarily every possible piece of information that could be associated with a track. Additionally, I want the collection to be reasonably navigable through just the filesystem. That means a logical folder hierarchy, informative file names, and cover art stored alongside the files (as well as within them).

I’m writing down every step of my process here for two reasons: first, to remind myself of what to do when I get a new album; and second, in the hopes that being wrong on the internet will cause helpful souls to chime in with ways to do it better (or at least avoid terrible mistakes). Please let me know!

Ripping, encoding, and storage

I start by ripping the CD using Asunder, which stores the files in FLAC format at compression level 8. All FLAC compression levels are lossless; higher levels just produce smaller files at the cost of longer encoding time, and I don’t mind waiting. I didn’t have a great reason for choosing Asunder, but it’s simple and it works without too much hassle or ceremony. It does a good job of filling in basic metadata such as album names, but we’ll be fleshing that out by hand later anyway.

I start with a “music/original/” directory where these original rips go. Within that, I organize the files by artist, and within each artist’s folder, by album. Within each album directory, I name the files in the form “Track # - Title.flac” with track numbers padded with leading zeroes to be two digits long (e.g. “02 - Wellspring.flac”). I add leading zeroes to ensure UIs that do simple alphabetical sorting still put the files in the correct order; that is to say they will show up as 01, 02, 03, … 10, 11, etc. rather than the lexicographic 1, 10, 11, …, 2, 20, 21, and so on. I don’t think I’ve encountered an album yet with more than 99 tracks, but I guess I’d pad those track numbers out to three digits!
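To make the sorting argument concrete, here is a minimal Python sketch of the naming rule. The function name `track_filename` and the placeholder title are made up for illustration; they are not part of any tool mentioned in this post.

```python
def track_filename(number: int, title: str, width: int = 2, ext: str = "flac") -> str:
    """Build a 'NN - Title.ext' file name with a zero-padded track number."""
    return f"{number:0{width}d} - {title}.{ext}"


# Plain alphabetical sorting misorders unpadded numbers...
unpadded = sorted(f"{n} - x.flac" for n in (1, 2, 10))
# ...but keeps padded ones in track order.
padded = sorted(track_filename(n, "x") for n in (1, 2, 10))

print(unpadded)  # track 10 sorts before track 2
print(padded)    # 01, 02, 10
```

For an album with more than 99 tracks, passing `width=3` would give the three-digit padding.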

One choice I’m uncertain about is how to represent multi-disc sets. Conceptually they’re not different albums, so I don’t think I’d want to put them as separate directories within the artist directory. However, they do have separate track listings and track numbers, and most metadata I’ve looked at online does seem to preserve that distinction (so you have things like “Disc 2 Track 2” rather than simply “Track 18”). My compromise has been to introduce a second, leading number in the filename, representing the disc. Putting it all together, you get something like: “music/original/The Beatles/Live at the BBC/02-18 - I Feel Fine.flac”. I zero-pad these disc numbers for consistency with the track numbers, but in hindsight that’s kind of pointless. I’m still not entirely happy with this because it feels more like preserving an arbitrary limitation of the physical format (how many tracks fit on a disc) than something essential about the album. Would I tag things “side A” and “side B” if they were from a cassette or a record? Should albums that have been released in multiple formats really be structured differently depending on which format was digitized?
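The full layout, including the optional disc prefix, could be sketched like this. The helper `track_path` is hypothetical, and it assumes artist, album, and title contain no path separators.

```python
from pathlib import PurePosixPath
from typing import Optional


def track_path(root: str, artist: str, album: str, track: int, title: str,
               disc: Optional[int] = None, ext: str = "flac") -> PurePosixPath:
    """Build root/Artist/Album/[DD-]TT - Title.ext as described above."""
    # Single-disc albums get "TT"; multi-disc sets get "DD-TT".
    prefix = f"{track:02d}" if disc is None else f"{disc:02d}-{track:02d}"
    return PurePosixPath(root) / artist / album / f"{prefix} - {title}.{ext}"


print(track_path("music/original", "The Beatles", "Live at the BBC",
                 18, "I Feel Fine", disc=2))
```

Leaving `disc` unset gives the plain two-digit form used for single-disc albums.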

After ripping, I search the MusicBrainz database for the album, and grab the most accurate-looking cover art I can find to put in the folder with the songs. There seem to be many quasi-standard filenames that music players will recognize, but I choose to name the files “cover.jpg” (or .png). I couldn’t find any reason to prefer any of the names over the others, so I picked this one as being, in my opinion, the most descriptive, and just stuck with it. I’ve been aiming for images between 1000×1000 and 1600×1600 pixels, again for no particularly strong reason: it just sounded like a reasonable ballpark, though now I wonder if it’s good enough on high-DPI displays. I keep the MusicBrainz page open for the next part, where I’ll be adding metadata to the tracks.

From there, I have a simple shell script that recursively walks the directory tree and re-encodes things as needed with various settings. It’s a little more complex than that, because it also handles scenarios such as online purchases of albums that I only downloaded in Vorbis or even MP3 format and don’t have the FLACs for. In those cases, it looks at the existing file and either hard-links it into the target directory, or re-encodes with lower quality/bitrate settings if needed.
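The script itself isn't shown here, so the following is only a rough Python sketch of the decision it is described as making, not the actual script. The function names, the set of lossy extensions, and the idea of passing in the source bitrate are all assumptions; the one real external detail is that `opusenc` accepts a `--bitrate` option in kbit/s followed by input and output files.

```python
from pathlib import PurePosixPath
from typing import List, Optional


def plan_action(source: str, source_kbps: Optional[int], target_kbps: int) -> str:
    """Decide what to do with one source file for a given target quality."""
    suffix = PurePosixPath(source).suffix.lower()
    if suffix == ".flac":
        return "encode"  # lossless original: always encode to Opus
    if suffix in {".ogg", ".mp3", ".opus"}:
        # Already lossy: reuse as-is unless it exceeds the target bitrate.
        if source_kbps is not None and source_kbps > target_kbps:
            return "reencode"
        return "hardlink"
    return "skip"  # cover art, par2 files, and anything else


def opusenc_command(source: str, dest: str, kbps: int) -> List[str]:
    """Argument list for encoding a FLAC file with opusenc."""
    return ["opusenc", "--bitrate", str(kbps), source, dest]


print(plan_action("album/01 - x.flac", None, 96))
print(opusenc_command("in.flac", "out.opus", 96))
```

A real script would walk the tree (for example with `os.walk`), call `os.link` for the "hardlink" case, and hand the command list to `subprocess.run`. Re-encoding a lossy source would also need a decoder that understands MP3 or Vorbis, which this sketch leaves out.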

Another thing I’m experimenting with is creating par2 “parchives” of each album rip to store alongside the originals and protect them against bitrot. I don’t yet know enough to judge whether this is entirely necessary or whether the way I’m doing it makes sense. I may write more about that later.
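For reference, one possible way to build such a command from Python is sketched below. It assumes the par2cmdline tool, whose `create` mode takes a redundancy percentage as `-r<n>`; the 10% figure and the helper name are placeholders, not a recommendation or a description of what is actually used here.

```python
from typing import List


def par2_create_command(archive: str, files: List[str], redundancy_pct: int = 10) -> List[str]:
    """Argument list for creating a parchive covering the given files."""
    if not files:
        raise ValueError("need at least one file to protect")
    # Sorted so the command is the same regardless of directory listing order.
    return ["par2", "create", f"-r{redundancy_pct}", archive, *sorted(files)]


print(par2_create_command("album.par2", ["02 - b.flac", "01 - a.flac", "cover.jpg"]))
```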


The rips may take a long time, but I don’t have to pay attention to them while they run in the background. Filling in metadata is the most actively time-consuming step.

For this, I primarily use Kid3. I previously used EasyTAG and sometimes still pop into it because I’m more familiar with its utilities for mass-renaming files (though I believe Kid3 can do most of the same things).

Within Kid3, I set several fields:

  • Album Artist: Usually just the same as Artist, but more on that below.
  • Date: Currently just the year the album was released.
  • Genre: In text form, chosen from a short list I use for consistency.
  • Track Number: Using the “Tools > Number Tracks” utility to give them all leading zeroes.
  • Total Tracks: If a multi-CD set, the number of tracks on that disc only.
  • Disc Number and Total Discs: Only if there are multiple discs.
  • Publisher: The record label. Turns out to be less useful than I hoped; may drop it.
  • Compilation: Set to 1 only if the album collects tracks from multiple artists; otherwise left unset (so effectively True/False).
  • Picture: Album front cover, making sure to remove the default comment containing the file name.
  • Composer, Work, soloist/featured instrumentalist: Only on classical albums.

Anything that’s not in the list above — like additional metadata added by the digital store or encoding tools — gets deleted in the name of consistency and uniformity, which I’m prioritizing over completeness.
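The "delete everything not on the list" rule is easy to express as an allowlist filter. This sketch works on a plain dictionary rather than a real tagging library, and the field names are assumed Vorbis-comment-style spellings (tools differ on names such as the track total and publisher fields); Title, Artist, and Album are included on the assumption that they are kept as well.

```python
from typing import Dict, List

# Assumed field names; adjust to whatever your tagger actually writes.
ALLOWED = {
    "TITLE", "ARTIST", "ALBUM", "ALBUMARTIST", "DATE", "GENRE",
    "TRACKNUMBER", "TRACKTOTAL", "DISCNUMBER", "DISCTOTAL",
    "PUBLISHER", "COMPILATION", "COMPOSER", "WORK",
}


def strip_unlisted(tags: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return a copy of tags containing only allowlisted fields."""
    # Vorbis comment field names are case-insensitive, so compare in upper case.
    return {name: values for name, values in tags.items() if name.upper() in ALLOWED}


example = {"Title": ["Wellspring"], "ENCODER": ["some encoder"], "date": ["2002"]}
print(strip_unlisted(example))
```

The same function could double as a consistency check: any file where `strip_unlisted(tags) != tags` still has stray fields.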

I’m really interested in figuring out how to use beets for scripting/automating some of the above, or at the very least quality-checking my manual work to confirm I didn’t miss anything or introduce an inconsistency. I don’t want to completely hand over my library and full control over all metadata to the tool — just to use it to augment what I’m already doing. Another tool I need to explore more is MusicBrainz Picard. I’m already relying heavily on the MusicBrainz database, so maybe I should be using a more comprehensively integrated tagging tool.

Lots of tiny decisions

Despite limiting myself to the relatively short list of tags above, there’s an annoying number of small choices to make. Even something as straightforward as “track number” can be expressed as either a single number plus a “total tracks” field, or as a composite “n out of m” field that covers both (e.g. “2/17”). What difference does it make? I don’t know, but I fear I’ll find out someday and have to go back and change everything. There’s still more, just in this field: I originally numbered all the tracks with leading zeroes to make it easier to use tagging programs’ auto-file-renaming feature to get the files on disk named correctly (as per the rules from above to get correct sorting order)… but that’s kind of illogical in a metadata field that doesn’t have that sorting problem.
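Whichever form ends up in the files, the two representations are easy to convert between, which takes some of the sting out of choosing wrong. A minimal sketch (the function name is made up):

```python
from typing import Optional, Tuple


def split_track(value: str) -> Tuple[int, Optional[int]]:
    """Parse '2', '02', or '2/17' into (track number, total or None)."""
    number, _, total = value.partition("/")
    return int(number), (int(total) if total else None)


print(split_track("2/17"))
print(split_track("02"))
```

Because `int` discards leading zeroes, this also shows that padding in the tag carries no extra information beyond how the number is displayed.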

Another one: Should the date field contain the date of the original release of the album in any form, or the date of this specific format? I don’t like “throwing away” useful information like which particular release a recording came from, but since I primarily use the date field to sort through a particular artist’s work (and not to disambiguate between multiple versions of an album, since I usually don’t have multiple versions), I’ve been going with “year of original release in any form”… except that’s somewhat fuzzy for classical music where one usually cares more about the specific recording, so there I use the year that particular CD came out. And for compilation albums, should each track have the release date of that specific song, or the release date of the record? With these messy compromises, I end up with things like “Blowin’ in the Wind (1962)” apparently being on a CD — a format that didn’t exist at that time — but also Chopin’s “Nocturne in C♯ minor (2002)”.

Who is the artist on a classical music album? In practical terms when searching through my collection, I’m usually looking for, say, Rachmaninoff, not the London Symphony Orchestra. This is where the differences between the “Artist”, “Album Artist”, and “Composer” fields come in, and where extra fields such as “Conductor”, “Orchestra”, and “Piano” can be used. I don’t actually know how many music players handle these, and whether my current scheme works well. What I’ve settled on is to put the composer as the “Artist” and “Composer” (e.g. “Sergei Rachmaninoff”), whatever is printed on the CD cover as the “Album Artist” (e.g. “Vladimir Ashkenazy and the London Symphony Orchestra”), and any featured conductors, soloists, etc. in the closest-matching field I can find for them (e.g. “Conductor: André Previn”, “Piano: Vladimir Ashkenazy”).

What about on non-classical compilation albums? That’s a bit easier: “Album Artist” is always “Various Artists”, and “Artist” is whoever is performing each specific track. The “Compilation” flag should also be set to “True” for these so that music players catalogue them correctly. Compilations from a single artist, however, like “Greatest Hits” type albums… I just don’t treat as compilations at all, and pretend they are standard albums of all new material.
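The non-classical rule can be written down as a tiny function. This is only a sketch of the convention described above; the field names are assumed, and it deliberately ignores the classical case, where "Artist" holds the composer.

```python
from typing import Dict, List


def album_level_tags(track_artists: List[str]) -> Dict[str, str]:
    """Derive Album Artist and the Compilation flag from per-track artists."""
    if not track_artists:
        raise ValueError("album has no tracks")
    if len(set(track_artists)) > 1:
        # Several performers: a true compilation.
        return {"ALBUMARTIST": "Various Artists", "COMPILATION": "1"}
    # One performer, including "Greatest Hits" albums: flag stays unset.
    return {"ALBUMARTIST": track_artists[0]}


print(album_level_tags(["Artist A", "Artist B"]))
print(album_level_tags(["Artist A", "Artist A"]))
```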

Artists’ names aren’t always easy, either. Would you recognize “Пётр Ильич Чайковский” in your collection? I’ve settled for going with whatever name and spelling/romanization Wikipedia prefers — in this case, “Pyotr Ilyich Tchaikovsky”.

One final wrinkle for classical albums is that individual tracks are often not standalone “songs”. I think this is where the “Work” tag fits in. We can use it to indicate that, for example, the track with title “Adagio, ma non troppo” is part of the work “Concerto no. 2 in B minor for Cello and Orchestra, op. 104”. That said, I’m not 100% sure I’m using this right.

Future work

There’s a whole slew of things I haven’t yet figured out, but would like to experiment with. Whether or not they are practical and how exactly they should be implemented is likely to come down to how well these features are supported by the music players I use. They include:

Multiple artists per track. I dislike having metadata like “(feat. So-and-So)” in the track name. I’m currently just deleting that info altogether, but I believe it’s possible to represent it in metadata by having multiple “Artist” tags. I’m just not sure how that shows up in music players. Do they use it at all? Will they show the track under the contributing artist’s file-browser entry as well?

Sort Artist. The listing of band names that start with “The” is overwhelming and unhelpful. I think it’s possible to use “Sort Artist” and “Sort Album Artist” tags to refile them under more findable names, for example “Beatles, The”. I haven’t done this yet because some music players do it automatically.
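If it turns out to be worth doing by hand, generating the sort name is simple. The sketch below handles only a leading English "The"; anything else (other articles, other languages, names that shouldn't be reordered) is left unchanged and would need manual attention.

```python
def sort_name(name: str) -> str:
    """Move a leading 'The' to the end: 'The Beatles' -> 'Beatles, The'."""
    article, sep, rest = name.partition(" ")
    if sep and rest and article.lower() == "the":
        return f"{rest}, {article}"
    return name


print(sort_name("The Beatles"))
print(sort_name("Theory of Something"))
```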

ReplayGain. It would be great to avoid wide disparities in volume across different tracks/albums. I just need to learn more about it; for example whether it should be calculated and set across the entire collection, or on an album-by-album basis, and which tool is best to use.
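As far as the arithmetic goes, the track-versus-album question is less mysterious than it sounds. ReplayGain 2.0 targets a reference loudness of -18 LUFS, and the stored gain is just the difference between that target and the measured loudness. The sketch below assumes the loudness values have already been measured by some other tool (the numbers are invented), and only shows how the two modes differ.

```python
from typing import Dict

REFERENCE_LUFS = -18.0  # ReplayGain 2.0 reference level


def gain_db(measured_lufs: float) -> float:
    """Gain in dB needed to bring a measured loudness to the reference."""
    return REFERENCE_LUFS - measured_lufs


def track_gains(track_lufs: Dict[str, float]) -> Dict[str, float]:
    """Track mode: every track is normalized independently."""
    return {title: gain_db(lufs) for title, lufs in track_lufs.items()}


def album_gains(track_lufs: Dict[str, float], album_lufs: float) -> Dict[str, float]:
    """Album mode: one shared gain, so quiet tracks stay quiet relative to loud ones."""
    shared = gain_db(album_lufs)
    return {title: shared for title in track_lufs}


measured = {"loud track": -10.0, "quiet interlude": -23.0}  # invented values
print(track_gains(measured))
print(album_gains(measured, album_lufs=-12.0))
```

Both values are normally stored per album rather than across the whole collection, and the player picks which one to apply.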

Optimizing padding. Metadata is conceptually “tacked on” to the real music file, but in practice it’s sometimes placed at the start of the file, not at the end. To avoid having to “move over” the entire file if even a single character of metadata is added, some (most?) tools add extra “padding” space in the metadata section, allowing it to grow or shrink a bit before there’s a need to rewrite the layout of the full file. Once the metadata for a file is finalized and unlikely to ever change again, though, it should be possible to optimize storage space slightly by shrinking that padding down to near zero. Just not sure if it’s worth it.
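To judge whether it's worth it for the FLAC originals, one could simply measure how much padding is there. A FLAC stream starts with the marker `fLaC`, followed by metadata blocks that each have a 4-byte header: one "last block" bit, a 7-bit block type (type 1 is padding), and a 24-bit big-endian length. The sketch below sums the padding; it assumes the data begins with the FLAC marker, so a file with some other tag prepended would be rejected.

```python
def flac_padding_bytes(data: bytes) -> int:
    """Total size of PADDING metadata blocks at the start of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    pos, total = 4, 0
    while pos + 4 <= len(data):
        is_last = bool(data[pos] & 0x80)   # top bit: last metadata block
        block_type = data[pos] & 0x7F      # low 7 bits: block type
        length = int.from_bytes(data[pos + 1:pos + 4], "big")
        if block_type == 1:                # 1 = PADDING
            total += length
        pos += 4 + length
        if is_last:
            break
    return total


# Synthetic example: a 34-byte STREAMINFO block, then 8192 bytes of padding.
fake = (b"fLaC" + bytes([0x00]) + (34).to_bytes(3, "big") + bytes(34)
        + bytes([0x81]) + (8192).to_bytes(3, "big") + bytes(8192) + b"audio")
print(flac_padding_bytes(fake))
```

In practice only the first part of each file needs to be read, since the metadata blocks always come before the audio frames.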

Unique identifiers. It’s possible to embed a unique ID such as MusicBrainz Identifiers, International Standard Recording Codes, or, heck — even just the barcode printed on the CD case — in track metadata so that tools can automatically disambiguate the file and potentially update and improve the metadata over time. I haven’t settled on which format to use, so I don’t have that data today.
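The three kinds of identifier are at least easy to tell apart by shape, which might help if several end up in the same collection. A MusicBrainz Identifier is a UUID, an ISRC is twelve characters (two letters, three alphanumerics, seven digits, often printed with hyphens), and a CD barcode is typically 12 or 13 digits. The function name and sample values below are made up for illustration, and the checks are purely structural: they do not confirm that an identifier actually exists.

```python
import re
import uuid

ISRC_PATTERN = re.compile(r"[A-Z]{2}[A-Z0-9]{3}[0-9]{7}")


def identifier_kind(value: str) -> str:
    """Classify an identifier as 'isrc', 'mbid', 'barcode', or 'unknown'."""
    compact = value.replace("-", "").upper()
    if ISRC_PATTERN.fullmatch(compact):
        return "isrc"
    try:
        uuid.UUID(value)
        return "mbid"
    except ValueError:
        pass
    if value.isdigit() and len(value) in (12, 13):
        return "barcode"
    return "unknown"


print(identifier_kind("AA-BCD-24-00001"))
print(identifier_kind("12345678-1234-4234-8234-123456789abc"))
print(identifier_kind("0123456789012"))
```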

Third time’s the charm…

It wasn’t until after I started this project that I learned that CD rips aren’t necessarily perfect bit-for-bit copies, but can have glitches and errors. I don’t know how my cheap external USB CD/DVD reader stacks up in that regard. Likewise, I hadn’t heard of accuracy-oriented rippers like EAC and morituri/whipper, or about the practice of ripping whole albums into one file and using “CUE files” alongside them to denote track boundaries. Am I missing out? Do I need to start over again?
