Some unusual results in lame 3.99.5
Reply #22 – 2013-05-01 21:27:26
I'm sorry to have not been more specific. I was looking to see: when developers are programming LAME, do they have a target bitrate in mind when they are writing the code for VBR mode?

In what follows, 'transparent' means audibly indistinguishable from the original lossless audio.

No. As a crude rule of thumb, LAME is optimized to be usually transparent at around -V2 or -V3 (-V2 was once called the 'standard' preset for that reason) and uses only just as much accuracy as it needs and hardly any more (which means it allows almost as much distortion or noise as it can get away with). The data is then packed into as few bits as possible, so the bitrate simply falls where it may rather than being targeted. Over the years, problem samples or 'codec killers' have been found, revealing errors in the psychoacoustic model that decides how much of each type of distortion is allowed, leading to corrections that 'fix' the problem samples without wasting too many bits on non-problematic samples.

For the most part, settings such as -V1 and -V0 add a little extra margin of safety by requiring slightly higher accuracy than was calculated by the psychoacoustic model. This accounts for a lot of the extra bitrate they consume, and the amount of margin allowed might have been chosen by the developers with the resulting bitrate increase over a large corpus of music in mind, but with no idea of the bitrate from moment to moment. The extra accuracy in -V1 or -V0 is spread thinly over reducing distortion across the frequency range and increasing temporal accuracy somewhat, so it might marginally reduce the audibility of subtle flaws in the psychoacoustic model compared to the -V3 or -V2 version, and might make a subtle flaw at -V2 become just inaudible and thus transparent at -V0, for example.

Any major flaws in the psychoacoustic model will not have the extra bits applied only to the problematic part (because the model doesn't know where it should have applied them), so major flaws will often be only marginally improved by higher quality settings. For example, flaw A might go from 'very annoying' at -V2 to 'somewhat annoying' at -V0, while flaw B might go from 'somewhat annoying' at -V2 to 'perceptibly different but not annoying' at -V0, and a different flaw C might go from 'somewhat annoying' at -V2 to 'transparent' at -V0. It varies from case to case, with different classes of problem sample.

The aim of the LAME developers is typically to fix flaws in -V2 or -V3 by improving the psychoacoustic model so that it allocates extra accuracy (and, as a consequence, momentarily higher bitrate) to just the right places in those problem samples, without increasing the accuracy and thus the bitrate for the majority of non-problem samples. The developers have naturally concentrated their efforts on most of the annoying problem samples in the early years, and now that LAME is mature, few annoying problem samples remain.
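To make the 'only just enough accuracy, and the bitrate falls where it may' idea concrete, here is a deliberately crude Python fragment. Everything in it is invented for illustration - the band values, the roughly-6-dB-per-bit rule of thumb, and the mapping from a -V-like quality knob to an extra safety margin - so treat it as a cartoon of the principle, not LAME's actual code or numbers.

```python
import math

def bits_for_chunk(band_level_db, masking_db, margin_db=0.0):
    """Give each frequency band just enough precision that its quantization
    noise stays below the masking threshold (minus any safety margin),
    then simply count what that cost. Nothing here aims at a bitrate."""
    total_bits = 0
    for level, mask in zip(band_level_db, masking_db):
        allowed_noise_db = mask - margin_db        # noise must stay below this
        needed_snr_db = level - allowed_noise_db   # precision this band really needs
        # Rule of thumb: each extra bit buys roughly 6 dB of signal-to-noise ratio.
        total_bits += max(0, math.ceil(needed_snr_db / 6.0))
    return total_bits

# A higher 'quality' setting only widens the safety margin; the bit count per
# chunk (and hence the average bitrate over a whole track) just emerges.
band_level = [60.0, 45.0, 20.0]    # invented per-band levels, in dB
masking    = [40.0, 38.0, 25.0]    # invented masking thresholds, in dB
for name, margin in [("-V5-ish", -3.0), ("-V2-ish", 0.0), ("-V0-ish", 6.0)]:
    print(name, bits_for_chunk(band_level, masking, margin), "bits for this chunk")
```

Nothing in that loop knows (or cares) what the bitrate will turn out to be over a whole track; raising the margin just makes every chunk a little more expensive, which is roughly why -V0 averages more kbps than -V2 without ever targeting a number.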
Below is a crude picture of how an 'only just transparent' idealized Constant Quality (i.e. Variable Bit Rate) audio encoder might work. There are complexities I'm omitting for simplicity, and I'll use the word 'chunk' instead of words like 'frame' or 'granule' to avoid implying that I'm referring to specific details of an encoder such as LAME. A rough code sketch of this loop appears a little further down.

1. Divide the audio into 'chunks' of a few tens of milliseconds.
2. Analyze the current 'chunk' of audio, deriving its frequency spectrum (the loudness of each 'pitch' in the sound).
3. Detect whether a fast 'transient' sound, highly localized in time, occurs within the current chunk or not.
4. If a fast 'transient' was detected, subdivide the chunk into a number of short blocks. Short blocks allow the transient to be encoded with high time resolution using relatively few bits, at the expense of reduced frequency resolution (for technical reasons I will omit), and frequency resolution is usually less important during transients.
5. If no 'transient' was detected, use long blocks to encode high frequency resolution with fewer bits, at the expense of lower time resolution.
6. Apply a Modified Discrete Cosine Transform (MDCT) to each block in the current 'chunk' to convert it from a loudness-versus-time representation to a loudness-versus-frequency representation, divided into 'subdivisions' roughly comparable to certain aspects of human hearing.
7. Calculate the Masking Threshold of the signal as a function of frequency (and, if following a transient, calculate masking as a function of time too - called Temporal Masking). This uses the fact that the ear becomes less sensitive to subtle differences in sounds that are close in pitch to a louder sound, allowing their loudness to be encoded with less precision without sounding any different. (Temporal masking uses the fact that the ear is less sensitive to sounds that occur immediately after a brief loud sound.)
8. Apply an offset to the calculated Masking Threshold if you want to provide more margin of safety (a bit like LAME -V0) or less margin of safety (a bit like LAME -V5). For example, you might add 10 decibels of safety margin at all frequencies.
9. For each 'subdivision' of the MDCT data from step 6, calculate what bit depth will ensure that the quantization noise it causes remains just below the calculated and offset Masking Threshold from step 8, and quantize to that bit depth. This quantization is the main LOSSY step in the process.
10. Losslessly pack the data in an efficient manner, using similarity between left and right channels and so on to improve efficiency. The number of bits this takes up (which equates to the bitrate) is not known before this stage. Store it according to the format specification so that a decoder can reconstruct the lossy audio later.
11. Move on to the next 'chunk' of audio and repeat from step 2.

You'll note that this idealized constant-quality encoder does not decide on the bitrate; it just emerges naturally from the decisions made, the analysis of the audio, and how well the lossless part of the packing happens to work.

In reality, the choices available to the developers in tuning a quality scale such as LAME's -V n scale are more varied than suggested by step 8 above. One aspect is the frequency of the low-pass filter, which is raised at higher quality settings like -V0 and lowered at lower settings like -V5. A few semitones of extra bandwidth at the top end costs a lot of bits, and reducing the lowpass to around 16 kHz can save a lot of bits to spend where they matter more at -V5. Likewise, at lower quality settings a developer may choose to sacrifice a little accuracy at the extreme frequencies to allow more accuracy to be retained at the most important frequencies for typical music, and at higher settings the extremes may receive less of the extra margin of safety than the most important frequencies.
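Expanding on the fragment above, here is a rough Python sketch of the whole chunk loop from the numbered list. It follows the eleven steps in spirit only: the transient detector, the 'masking model', the 6-dB-per-bit rule and the lowpass handling are stand-ins I've invented, a plain FFT magnitude spectrum stands in for the windowed MDCT, and real bitstream packing (step 10) is waved away entirely. It is emphatically not how LAME is implemented - just a picture of where the bits come from.

```python
import numpy as np

SAMPLE_RATE = 44100
CHUNK_LEN = int(SAMPLE_RATE * 0.026)   # step 1: 'chunks' of a few tens of ms

def is_transient(chunk):
    """Step 3: crude transient detector - does the energy jump sharply
    between the two halves of the chunk? (Invented heuristic.)"""
    half = len(chunk) // 2
    e1 = np.sum(chunk[:half] ** 2) + 1e-12
    e2 = np.sum(chunk[half:] ** 2) + 1e-12
    return max(e1, e2) / min(e1, e2) > 8.0

def spectrum_db(block):
    """Steps 2/6 stand-in: a loudness-versus-frequency view of one block.
    A real encoder uses a windowed MDCT; an FFT magnitude spectrum is
    close enough to show the data flow."""
    mag = np.abs(np.fft.rfft(block * np.hanning(len(block))))
    return 20.0 * np.log10(mag + 1e-12)

def masking_threshold_db(spec_db):
    """Step 7 stand-in: pretend the ear tolerates noise about 20 dB below
    the local (smoothed) spectral level. A real psychoacoustic model is
    far more elaborate, and adds temporal masking after transients."""
    smoothed = np.convolve(spec_db, np.ones(9) / 9.0, mode="same")
    return smoothed - 20.0

def encode_block(block, margin_db, lowpass_hz):
    spec = spectrum_db(block)
    mask = masking_threshold_db(spec) - margin_db             # step 8: safety margin
    freqs = np.fft.rfftfreq(len(block), 1.0 / SAMPLE_RATE)
    bits = 0
    for level, thresh, f in zip(spec, mask, freqs):
        if f > lowpass_hz:                                    # lowpass: spend nothing up here
            continue
        needed_snr_db = level - thresh
        bits += max(0, int(np.ceil(needed_snr_db / 6.0)))     # step 9: just enough precision
    return bits                                               # step 10 (lossless packing) waved away

def encode(audio, margin_db=0.0, lowpass_hz=19000.0):
    """Steps 1-11: chop into chunks, pick long or short blocks, quantize each
    block, and simply add up whatever number of bits that turned out to need."""
    total_bits = 0
    for start in range(0, len(audio) - CHUNK_LEN + 1, CHUNK_LEN):
        chunk = audio[start:start + CHUNK_LEN]
        if is_transient(chunk):                               # steps 4/5: block switching
            blocks = np.array_split(chunk, 4)                 # short blocks: time resolution
        else:
            blocks = [chunk]                                  # one long block: frequency resolution
        for block in blocks:
            total_bits += encode_block(block, margin_db, lowpass_hz)
    return total_bits / (len(audio) / SAMPLE_RATE)            # bitrate is an output, not an input
```

Run on the same audio, a larger margin_db or a higher lowpass_hz makes the returned figure bigger, and quieter or more tonal material makes it smaller - which is the sense in which a constant-quality encoder never decides the bitrate; it just reports what the audio happened to cost.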
So, even once the 'problem samples' have been 'solved' in the early stages of encoder tuning, the developers can still choose how to scale the adjustments away from the supposed ideal Masking Threshold and the standard low-pass filter frequency as the quality number changes. The developers do have a range of choices, which will affect the bitrate and, especially for the lower-quality settings, the quality. They also occasionally come up with more efficient ways of working around limitations of the format (e.g. 'sfb21 bloat' was mentioned elsewhere), which allow a lower bitrate for the same quality but may also allow a greater margin of safety at the same typical bitrate for higher settings like -V0.

Some formats more advanced than MP3 include special tools that can provide pretty good accuracy using very few bits. Sometimes these achieve high accuracy with fewer bits than older encoders. Sometimes they allow reduced accuracy in the less important aspects of the audio so that the low bitrate can be concentrated on the most vital aspects. An example is low-bitrate AAC: the HE-AAC version approximates the high-frequency spectrum from the low-frequency spectrum plus relatively few numerical parameters (this is called Spectral Band Replication), and it performs better than LC-AAC at around 64 kbps. At even lower bitrates, the precise stereo image is replaced with a method of steering portions of the monophonic audio using a few numerical parameters. This is called Parametric Stereo (in HE-AAC v2, where it works in conjunction with Spectral Band Replication), and it performs better than HE-AAC at around 32 kbps. Both of these methods allow what little bitrate is available to be spent on encoding the most important parts of the signal better. Similarly, Ogg Vorbis and Opus/CELT have some special tricks up their sleeves that allow the spectral envelope to be encoded quite efficiently, eliminating certain nasty artifacts and improving the bitrate efficiency of transparent encoding.
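To give a flavour of how a parametric tool trades bits for a handful of numbers, here is a toy sketch in the spirit of Parametric Stereo. The band split, keeping only one level-difference number per band, and ignoring the phase/de-correlation parameters the real tool also transmits are all simplifications I've chosen for illustration - this is not the HE-AAC v2 algorithm, just the underlying idea.

```python
import numpy as np

def ps_encode(left, right, n_bands=8):
    """Downmix to mono and keep only a few per-band panning parameters
    instead of a full second channel. (Toy illustration, not HE-AAC v2.)"""
    mono = 0.5 * (left + right)
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    bands = np.array_split(np.arange(len(L)), n_bands)
    pan = []                                    # one number per band:
    for idx in bands:                           # fraction of energy in the left channel
        el = np.sum(np.abs(L[idx]) ** 2) + 1e-12
        er = np.sum(np.abs(R[idx]) ** 2) + 1e-12
        pan.append(el / (el + er))
    return mono, np.array(pan)                  # a mono signal plus n_bands numbers

def ps_decode(mono, pan, n_bands=8):
    """Rebuild an approximate stereo image by steering the mono signal
    band-by-band according to the stored panning parameters."""
    M = np.fft.rfft(mono)
    bands = np.array_split(np.arange(len(M)), n_bands)
    L = np.zeros_like(M)
    R = np.zeros_like(M)
    for idx, p in zip(bands, pan):
        L[idx] = M[idx] * np.sqrt(2.0 * p)          # split the energy according to p
        R[idx] = M[idx] * np.sqrt(2.0 * (1.0 - p))
    n = len(mono)
    return np.fft.irfft(L, n), np.fft.irfft(R, n)
```

The point is simply that one channel of audio plus eight numbers is far cheaper to transmit than two full channels, so at something like 32 kbps the saved bits can go towards encoding the mono core more accurately; the real tool also carries phase and de-correlation information so the result doesn't collapse into the hard band-by-band panning this toy produces.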