Why are 128kbit/s MP3s usually 44,100 Hz?
eamon123
post Oct 14 2012, 01:44
Since 128kbit/s MP3s usually have a low-pass filter at 16,000 Hz, and the Nyquist-Shannon theorem states that all frequencies below f/2 Hz can be fully described at a sampling rate of f Hz, why not encode at 32,000 Hz? Wouldn't that save more space?
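The Nyquist arithmetic behind the question can be checked in a few lines. This is a minimal sketch: the rate list is the MPEG-1 Layer 3 set quoted later in the thread, and the helper names are made up for illustration. In practice a real low-pass filter needs a transition band, so a rate of exactly twice the cutoff leaves no margin.

```python
# MPEG-1 Layer 3 sampling rates mentioned later in the thread.
MPEG1_RATES_HZ = (32000, 44100, 48000)


def min_sample_rate(cutoff_hz):
    """Nyquist limit: content up to cutoff_hz needs a rate of (just over) 2 * cutoff_hz."""
    return 2 * cutoff_hz


def rates_that_fit(cutoff_hz, rates=MPEG1_RATES_HZ):
    """Rates whose Nyquist frequency (rate / 2) reaches the cutoff."""
    return [rate for rate in rates if rate / 2 >= cutoff_hz]


print(min_sample_rate(16000))   # lowest rate for a 16 kHz low-pass
print(rates_that_fit(16000))    # all three MPEG-1 rates qualify
print(rates_that_fit(20000))    # 32 kHz no longer does
```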
saratoga
post Oct 14 2012, 04:03
QUOTE (eamon123 @ Oct 13 2012, 20:44)
Since 128kbit/s MP3s usually have a low-pass filter at 16,000 Hz, and the Nyquist-Shannon theorem states that all frequencies below f/2 Hz can be fully described at a sampling rate of f Hz, why not encode at 32,000 Hz? Wouldn't that save more space?


Since encoding happens in the frequency domain anyway, there isn't much to be saved. The encoder simply won't code those frequencies, which is pretty close to downsampling, but much easier to implement.

That said, at very low bitrates LAME does downsample.
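To put a rough number on "it'll just not encode those frequencies", here is a sketch that counts how many long-block frequency lines sit entirely above a 16 kHz cutoff. It assumes an idealized filterbank with 576 equally spaced lines per granule covering 0 to half the sampling rate (MP3's long-block layout); the real hybrid filterbank has overlapping bands, so treat the counts as approximate. The function names are hypothetical.

```python
LONG_BLOCK_LINES = 576  # frequency lines per MP3 long-block granule


def first_line_above(cutoff_hz, rate_hz, lines=LONG_BLOCK_LINES):
    """Index of the first line whose lower edge is at or above cutoff_hz.

    Idealized: the lines split 0 .. rate_hz / 2 evenly, each (rate_hz / 2) / lines wide.
    Integer arithmetic for ceil(cutoff_hz * lines / (rate_hz / 2)).
    """
    return -(-cutoff_hz * lines * 2 // rate_hz)


def lines_left_empty(cutoff_hz, rate_hz, lines=LONG_BLOCK_LINES):
    """Lines lying entirely above the cutoff, which the encoder can leave at zero."""
    return lines - first_line_above(cutoff_hz, rate_hz, lines)


for rate in (32000, 44100, 48000):
    print(rate, lines_left_empty(16000, rate))
```

At 44.1 kHz this gives 158 of 576 lines, so roughly a quarter of each granule is simply left at zero, while at 32 kHz the 16 kHz cutoff coincides with the top of the spectrum and no lines are left over.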
eamon123
post Oct 15 2012, 05:02
QUOTE (saratoga @ Oct 14 2012, 04:03)
Since encoding happens in the frequency domain anyway, there isn't much savings.


Listen to these test files and tell me what you think (24 vs 48 kHz, both with 11.5 kHz lpf):


https://rapidshare.com/files/666351554/Baba...y24.11.5lpf.mp3
https://rapidshare.com/files/2623776446/Bab...y48.11.5lpf.mp3

QUOTE (halb27)
pre-echo issues get worse


If you ask me, the attack sounds as good in the 24 kHz file as in the 48 kHz one, and everything else sounds better. What do you make of it?


Dynamic
post Oct 16 2012, 18:07
QUOTE (eamon123 @ Oct 15 2012, 05:02)
Listen to these test files and tell me what you think (24 vs 48 kHz, both with 11.5 kHz lpf):
QUOTE (halb27)
pre-echo issues get worse


If you ask me, the attack sounds as good in the 24 kHz file as in the 48 kHz one, and everything else sounds better. What do you make of it?


Counterintuitively, that's a different situation from comparing 32 kHz with either 44.1 kHz or 48 kHz, and the attack should be expected to sound as good!

MPEG-1 Layer 3 uses 1152-sample frames at 32, 44.1 or 48 kHz sampling rates
MPEG-2 Layer 3 uses 576-sample frames at 16, 22.05 or 24 kHz sampling rates

Thus, the short-block duration (in milliseconds) to handle pre-echo and transients is the same for both 24 kHz and 48 kHz.

The full frame durations are:
36 ms for 16 or 32 kHz sampling rates
26 ms for 22.05 or 44.1 kHz sampling rates
24 ms for 24 or 48 kHz sampling rates

and this duration can be divided into three short blocks of 192 samples each (and a third of the duration) to handle transients. That gives greater time resolution for the same bitrate, at the expense of worse frequency resolution for the same bitrate (a very high bitrate overcomes this, but you're talking about CBR).

So, to get the maximum 50% difference in short-block length, compare 32 kHz against 48 kHz, or 16 kHz against 24 kHz (or indeed 32 kHz, with poorer short blocks, against 24 kHz, with better short blocks). Conversely, as halb27 says, tonal problem samples are helped by the longer frame durations.
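The frame durations listed above follow from dividing the samples per frame by the sampling rate. A small sketch of that arithmetic (the dictionary and function name are illustrative, not from any library):

```python
# Samples per Layer 3 frame, as given in the post.
FRAME_SAMPLES = {
    32000: 1152, 44100: 1152, 48000: 1152,  # MPEG-1 Layer 3
    16000: 576, 22050: 576, 24000: 576,     # MPEG-2 Layer 3
}


def frame_ms(rate_hz):
    """Duration of one frame in milliseconds: samples per frame / sampling rate."""
    return 1000 * FRAME_SAMPLES[rate_hz] / rate_hz


for rate in sorted(FRAME_SAMPLES):
    print(f"{rate:>6} Hz: {frame_ms(rate):6.2f} ms")
```

The 22.05 kHz and 44.1 kHz cases come out at about 26.12 ms, which the list above rounds to 26 ms.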
eamon123
post Oct 17 2012, 19:31
QUOTE (Dynamic @ Oct 16 2012, 18:07)
tonal problems


I'm not really sure what those are. Could you explain that for me please?
Dynamic
post Oct 17 2012, 21:48
QUOTE (eamon123 @ Oct 17 2012, 19:31)
QUOTE (Dynamic @ Oct 16 2012, 18:07)
tonal problems


I'm not really sure what those are. Could you explain that for me please?


OK, let's do this rather thoroughly...

You can consider sounds to be of, I guess, three types: tonal, transient and continuous noise.

Tonal is like a whistle - a pure note with a sharply peaked spectrum, quite often with overtones (harmonics) at multiples of the base frequency, which give the instrument's note its timbre or character. Vocally, vowel sounds are tonal and can be sung.
A chord - multiple notes played at once - is also tonal.

A transient is like a click, a cymbal or hi-hat hit, or the breathy or plucked onset of a note from an instrument. (As an aside: the type of onset often gives a listener as much information about the instrument as the timbre does, which is why, although first-generation synthesizers mainly tried to reproduce timbre and overtones, later generations improved onset transients for more realism.) In the frequency domain, most transients are spread over a wide spectrum (like noise), but in the time domain they are of short duration. Vocally, plosive consonants and similar sounds, like p, b, f, k and t, are transient in nature.

Continuous noise is largely uncorrelated with previous samples; it's an essentially random signal with components over a broad frequency spectrum. Transients are also noiselike in the frequency domain (a broad spectrum with little in the way of frequency peaks), but they last only a short time. Brushed snare drums and tape hiss are good examples of continuous noise. Vocally, breath sounds such as sh, ss, ff and dh/th are continuous noiselike sounds.
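One common way to put a number on "tonal" versus "noiselike" is spectral flatness: the geometric mean of the power spectrum divided by its arithmetic mean, close to 0 for a pure tone and close to 1 for a flat spectrum. This measure is not mentioned in the post and is used here only as an illustration; the sketch uses a naive DFT so it needs nothing beyond the standard library, and the test signals are made up.

```python
import cmath
import math
import random


def power_spectrum(x):
    """Naive DFT power for bins 1 .. N/2 - 1 (skipping DC and Nyquist)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(1, n // 2)]


def spectral_flatness(x, eps=1e-12):
    """Geometric mean / arithmetic mean of the power spectrum (0 = tonal, 1 = flat).

    eps keeps log() finite for bins that are numerically zero.
    """
    p = [v + eps for v in power_spectrum(x)]
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    return geometric / (sum(p) / len(p))


N = 256
rng = random.Random(0)
tone = [math.sin(2 * math.pi * 16 * t / N) for t in range(N)]  # sits exactly on bin 16
click = [1.0 if t == 100 else 0.0 for t in range(N)]             # single-sample transient
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]                  # white noise

for name, signal in (("tone", tone), ("click", click), ("noise", noise)):
    print(f"{name:>5}: flatness {spectral_flatness(signal):.3f}")
```

The click scores as flat as possible, even flatter than the noise, which matches the point above that transients are broadband; what sets a transient apart from noise is its short duration, which a whole-block spectrum cannot see.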

So take the word COSINE, for example:
starts with a sharp transient C with a clicking noise
the long O is a tonal, singable vowel
the S is noiselike and lasts longer than the transient C without a particularly sharp onset
the I is a tonal, singable vowel
and the N is mostly tonal with a fairly abrupt but not really transient ending.
(the E is silent and modifies the vowel sound represented by letter I)

To accurately match the frequency or pitch of a slowly varying tonal signal, a long block in a transform codec provides more points in the frequency domain, each representing a narrower frequency band. The frequency resolution can be said to be good. Because of the long duration, however, the time resolution is poor.
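For MP3 specifically, each 576-sample granule is transformed either as one long block of 576 frequency lines or as three short blocks of 192 lines each. Assuming the lines evenly split 0 to half the sampling rate, a sketch of the resulting resolutions at 44.1 kHz (names are illustrative):

```python
LONG_LINES = 576   # lines in one MP3 long block
SHORT_LINES = 192  # lines in each of the three short blocks


def line_width_hz(rate_hz, lines):
    """Bandwidth of each line, assuming the lines evenly cover 0 .. rate_hz / 2."""
    return (rate_hz / 2) / lines


def block_ms(rate_hz, lines):
    """Time advanced per block: one new input sample per output line."""
    return 1000 * lines / rate_hz


RATE = 44100
for label, lines in (("long", LONG_LINES), ("short", SHORT_LINES)):
    print(f"{label:>5}: {line_width_hz(RATE, lines):7.2f} Hz per line, "
          f"{block_ms(RATE, lines):5.2f} ms per block")
```

The product of line width and block duration is 0.5 in both cases, so narrowing the frequency bins by a factor of three lengthens each block by the same factor. The durations are the hop between blocks; the actual transform windows overlap by half and are twice as long.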

Imagine a grossly oversimplified example pretty much plucked from thin air:

Imagine a bunch of frequency components in a tonal signal, with their amplitudes represented as decimal integers. In a long block lasting 12 ms, a small selection of them (12, by coincidence only) might be:
CODE
160Hz 240Hz 320Hz 400Hz 480Hz 560Hz 640Hz 720Hz 800Hz 880Hz 960Hz 1040Hz
   30   119   475   879  3049 10234  4520   960   214    53   178   422


If you need to encode them only to the nearest 100, say, you might then get
CODE
160Hz 240Hz 320Hz 400Hz 480Hz 560Hz 640Hz 720Hz 800Hz 880Hz 960Hz 1040Hz
    0   100   500   900  3000 10200  4500  1000   200   100   200   400


and, since the units and tens digits are now all zero, we only need to send the higher digits (hundreds, thousands, tens of thousands, etc.). This is similar to how lossy encoding saves bitrate compared to lossless.

This gives a pretty good match for the frequency and amplitude of that tone when reconstructed, which our psychoacoustic model tells us is indistinguishable from the original on this occasion.
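The "round to the nearest 100" step above is a uniform quantizer. A minimal sketch that reproduces the second table from the first, assuming non-negative integer amplitudes as in the example (real MP3 quantization is nonuniform and uses scalefactors, so this shows only the idea; the function names are hypothetical):

```python
STEP = 100  # "nearest 100" rounding step from the example


def quantize(values, step=STEP):
    """Round non-negative integers to the nearest multiple of step; return the indices."""
    return [(v + step // 2) // step for v in values]


def dequantize(indices, step=STEP):
    """What the decoder rebuilds from the transmitted indices."""
    return [i * step for i in indices]


long_block = [30, 119, 475, 879, 3049, 10234, 4520, 960, 214, 53, 178, 422]
sent = quantize(long_block)      # the "higher digits" actually transmitted
rebuilt = dequantize(sent)       # matches the rounded table above
worst_error = max(abs(a - b) for a, b in zip(long_block, rebuilt))
print(sent, rebuilt, worst_error)
```

The transmitted indices are the "higher digits" (0, 1, 5, 9, 30, 102, ...), and no reconstructed value is off by more than half a step, i.e. 50.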

In a short block, there might be only a third of the number of samples and a third of the number of frequency bins, each representing a frequency band three times wider but over a shorter time (e.g. 4 ms). So while the frequency resolution is poor, the time resolution for representing rapid changes in the signal is good. The same bunch of frequencies over the same 12 ms is now divided into three short blocks: instead of 12 frequency components in one long block, there are just 4 frequency components, each three times broader in bandwidth, in each of three 4 ms short blocks.

CODE
        first 4ms       |        second 4ms       |        third 4ms
240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz
  375  3849  2022   111 |   208  4721  1898   218 |   142  5431  1468   102


If the psychoacoustic model has detected that there's a transient and requested a short block, it might well assume that the frequency spectrum is fairly broad, which is true for purely transient noiselike signals like hi-hat cymbals, and might calculate that rounding to the nearest 100, say, is enough:

CODE
        first 4ms       |        second 4ms       |        third 4ms
240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz | 240Hz 480Hz 720Hz 960Hz
  400  3800  2000   100 |   200  4700  1900   200 |   100  5400  1500   100
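On the time side, a short block can tell which third of the long block a click landed in, while a long block only knows it happened somewhere within it. A toy sketch of that, using plain block energies rather than a real transform; the tone, click position and amplitudes are made up:

```python
import math


def block_energies(signal, block_len):
    """Sum of squared samples in each consecutive block of block_len samples."""
    return [sum(s * s for s in signal[i:i + block_len])
            for i in range(0, len(signal), block_len)]


N = 576           # samples in one hypothetical long block
SHORT = N // 3    # 192 samples per short block
CLICK_AT = 450    # hypothetical click position, inside the third short block

# A quiet steady tone (period 48 samples) plus one loud single-sample click.
tone = [0.1 * math.sin(2 * math.pi * n / 48) for n in range(N)]
signal = [t + (5.0 if n == CLICK_AT else 0.0) for n, t in enumerate(tone)]

long_view = block_energies(signal, N)       # one number for the whole block
short_view = block_energies(signal, SHORT)  # three numbers, one per short block
loudest = max(range(len(short_view)), key=lambda i: short_view[i])
print(long_view, short_view, loudest)
```

short_view has three entries and the third is by far the largest, whereas long_view is a single number that says nothing about when the click happened.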


However, there are cases where there is both a strong transient component and a strong tonal component. One example I've tested a few times is the problem sample Angels Fall First. This has a close-microphone on the right-channel guitarist's pick, producing strong clicking sounds (transients) as each string is picked. The string's notes are strongly tonal and the first string continues to sound as the next string is picked.

My guess is that the click of the pick triggers a short block to capture the short-duration sound. However, the bandwidth of each frequency bin in these three short blocks is now a good deal broader, and if the same rounding accuracy is used (e.g. to the nearest 100 in the above example), the continuous tones of the strings, which sound throughout, seem to waver slightly in frequency or amplitude.

To encode both the short time of the transient and the sharp frequency spectrum of the tonal part of the signal over the whole time, a very high bitrate is required: these broad-bandwidth frequency bins need finer-than-usual rounding accuracy to still give fine frequency precision. This is a large part of what halb27's lame3.99.5z version does in the -Vn+ and -V0+eco settings when a short block is triggered, and I think it's why it solves the Angels Fall First problem sample.
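A very rough sketch of why finer rounding in the broad short-block bins costs bitrate: quantize the short-block values from the example with the usual step of 100 and with a hypothetical finer step of 25, then count bits as if each index were sent as a plain binary number of its own length. This ignores Huffman coding, side information and the nonuniform quantizer MP3 actually uses, so only the trend is meaningful; the names and the step of 25 are assumptions for illustration.

```python
STEP_USUAL = 100  # rounding step from the post's example
STEP_FINE = 25    # hypothetical finer step, for comparison only

# Short-block values from the example above (three blocks of four bins).
SHORT_BLOCKS = [375, 3849, 2022, 111, 208, 4721, 1898, 218, 142, 5431, 1468, 102]


def quantize(values, step):
    """Round non-negative integers to the nearest multiple of step; return the indices."""
    return [(v + step // 2) // step for v in values]


def naive_bits(indices):
    """Bits if each index were sent as a plain binary number of its own length (min 1)."""
    return sum(max(i.bit_length(), 1) for i in indices)


coarse = naive_bits(quantize(SHORT_BLOCKS, STEP_USUAL))
fine = naive_bits(quantize(SHORT_BLOCKS, STEP_FINE))
print(coarse, fine)
```

With these twelve values the naive count rises from 42 bits to 65 bits, while the guaranteed worst-case rounding error drops from 50 to 12.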

(The fact that there's a trade-off between fine rounding accuracy and high time and frequency precision is a subtle mathematical point in the field of windowed, overlapping Fourier transforms that's too advanced to explain in this context. There's some hope that the new Opus codec's band-by-band time/frequency preference will allow some frequency ranges to encode tonal components at low bitrate with poor time resolution, while simultaneously providing good time resolution at low bitrate in other frequency bands.)