Theoretical limits of mp3 quality
RD
post Oct 25 2001, 15:28
Post #1

Group: Members
Posts: 151
Joined: 29-September 01
Member No.: 31

Recently, there has been a lot of discussion about the merits and deficiencies of many audio compression formats (ogg, mpc, mp3, etc.).

There has also been discussion of the Lame project, its possible demise, and what needs to be improved in future lame encoders (e.g., better tonality estimation, use of mixed blocks, improved noise shaping and block switching, non-linear psycho acoustics, HQ filters, IS, better ath adjust, improved m/s decision, etc.).

My question here is about the THEORETICAL limits of the mp3 format.

In other words, suppose someone, and remember this is hypothetical, came up with a *perfect* psymodel, a *perfect* tonality estimation, a *perfect* block switching criterion, a *perfect* joint-stereo m/s decision-making criterion, and so on...
WOULD there still be deficiencies in the mp3 format?

From what I have read, I believe there would still be some music that would give even this "super lame encoder" difficulties.

And what I want to know is: (A) what deficiencies would this "super lame encoder" have, (B) what causes these deficiencies (i.e., what part of the mp3 specification), and (C) what types of sound would still cause problems for it?

I'm working from the position (and I think I'm correct) that there are just certain limitations in the mpeg1 layer 3 specification that limit it in some ways--even if it were developed as perfectly as could be.

So here are my guesses, and if you can add or correct, etc. to this I would be very appreciative! :-)

(A) Deficiencies of the "super lame encoder":

1. Time Domain Resolution

I believe Tangent is correct when he says: "[mp3] use[s] filterbanks which divides the sample into 32 frequency bands. MP3 then uses iMDCT to further divide into a total of 1152 frequency bands, however this further division completely removes the time domain resolution of the sample and this causes the biggest problem in transform coders - pre/post echos."

I understand that the use of short blocks and long blocks can help reduce pre-echos but Buschmann once told me that those are "tricks" which help but do not really address the problem fully...

So I think the super lame encoder would still occasionally produce pre-echoes... but then again, perhaps they would be too small to notice?
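For intuition, the time/frequency trade-off described above can be put in rough numbers. This is just arithmetic from the MPEG-1 Layer III block structure, assuming 44.1 kHz audio:

```python
fs = 44100      # sample rate (Hz)
subbands = 32   # polyphase filterbank bands

# Long block: each subband's 18 samples are MDCT'd with a 36-sample window,
# giving 32 * 18 = 576 spectral lines per granule.
long_lines = subbands * 18
long_window_ms = 36 * subbands / fs * 1000   # PCM span of the long window

# Short block: three 12-sample windows per subband, 6 lines each.
short_lines = subbands * 6
short_window_ms = 12 * subbands / fs * 1000

print(long_lines, short_lines)   # 576 192
print(round(long_window_ms, 1), round(short_window_ms, 1))
```

So a long-block window spans about 26 ms of audio while a short-block window spans under 9 ms; quantization noise from a sharp attack can be smeared across that whole span, which is the pre-echo problem in a nutshell.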

2. ??? Help me out with more....

(B) What causes these Deficiencies

1. That it is a transformation encoder and subject to ____ (fill in), which limits it by _____, unlike sub-band encoders and wavelet encoders and _____ encoders... (I know all have their advantages and disadvantages, but I would like to focus on the disadvantages of transformation encoders...)

2. That unlike AAC it has inefficient sizes for long and short blocks (AAC = 1024 long, 128 short... hope I got that right) whereas mp3 has 576 long and 192 short? My memory fails...

3. That it has a practical maximum of 320kbps for a given frame and some music needs more than 320 (I know there is a free format, but hardly anything supports it, so it's almost as if it doesn't exist...)

4. _______ anything else I should have mentioned?

(C) Types of sound that would still cause problems for the super lame encoder:

1. Transient-filled electronica
2. _____ fill in....

Thanks to everybody for reading this and for all help.

RD
Dibrom
post Oct 25 2001, 16:04
Post #2

Founder
Group: Admin
Posts: 2958
Joined: 26-August 02
From: Nottingham, UK
Member No.: 1

Maybe this can answer some of your questions.. (hopefully someone else will give some more in depth technical information.. Ivan, Robert, etc?):

- Mp3, in a perfect world, would probably be pretty damn good for most music. The music it would have the most problems with would of course be electronic music, stuff like what you see in many of the test samples I've used in the past.

- Mp3 will never be great on pre-echo. It can be good for most natural transients perhaps except for cases where you can clearly hear attacks without much else going on in the background (listen to the first couple attacks of castanets... very obvious). Part of this is due to inefficient block sizes, part of it is due to other issues. AAC actually uses a multitude of techniques to mostly overcome this problem. Another thing which I believe is actually related to pre-echo, though not often noted, is joint stereo issues. It appears that joint stereo can sometimes make pre-echo worse. This has been the case in mp3, and in some early versions of MPC this was found to be the case, and of course fixed. Since AAC (as well as MPC) can use joint stereo per frequencies instead of simply per frame, they have another advantage here.

- Mp3 doesn't have the most efficient joint stereo model (read the end of the above point).

- Mp3 lacks a scalefactor for the last scalefactor band. This means that you can never get nearly 100% dropout free encoding over 16khz. For many people this may not be much of a problem really, but besides the fact that it is a possible quality concern (even at 320kbps), it also causes very large bitrate increases as you attempt to approach the point of increasingly diminishing returns of getting that extra little bit above 16khz encoded.

- Mp3 does allow mixed blocks, but as I understand it, they may not be implemented in the most optimal manner, due to possible restrictions on when they can be used (problems with transitioning between different block types I think).

About block switching.. yes, fundamentally that is a "hack" to help reduce pre/post-echo, and it actually doesn't really "solve" the problem. Something like AAC though, when used correctly, can come very close to eliminating pre-echo.. but it is still a fundamental issue. Subband encoders naturally don't have this problem because they encode in the time domain instead of the frequency domain. As you probably know, the disadvantage is that theoretically this requires more bits to sound as good.

I believe a good wavelet implementation may be the most optimal approach as it allows you to decide between higher frequency resolution or higher time resolution per band. That way you can get the best of both worlds. Admittedly, I'm hardly an expert on wavelets, so if I'm wrong I'm sure someone will correct me smile.gif

About 320kbps frame limit. Maximum frame size is one problem, but more than that fixed frame sizes at lower bitrates are also a problem. I think a totally freeform vbr would probably be much more efficient. At a similar bitrate, this may actually allow a quality increase due to more efficient "packing" of bits per frame. At first glance, it appears this may not be an issue because if a frame size which is larger than needed is used, you can just add the extra to the bit reservoir right? Unfortunately there is a limit to how large this can be, and also there is a limit to the effective distance this can be used over. As I understand it, you can't actually store this for the entire file, it only lasts for a set amount of frames and this limit is actually quite low I think.
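The reservoir limit mentioned above follows directly from the frame side information: `main_data_begin` is a 9-bit backward pointer counted in bytes, so the reservoir can never reach back more than 511 bytes. A quick sketch (frame sizes assume 44.1 kHz, no padding):

```python
# main_data_begin is a 9-bit backward pointer, counted in bytes,
# so the bit reservoir can hold at most 2^9 - 1 = 511 bytes.
reservoir_max_bytes = 2**9 - 1

def frame_bytes(bitrate_bps, fs=44100, padding=0):
    # MPEG-1 Layer III: 1152 samples per frame -> 144 * bitrate / fs bytes
    return 144 * bitrate_bps // fs + padding

print(reservoir_max_bytes)    # 511
print(frame_bytes(320_000))   # 1044: the reservoir holds < half a 320 kbps frame
print(frame_bytes(128_000))   # 417
```

So even a full reservoir cannot carry much extra data relative to a high-bitrate frame, which is why it only helps over a short stretch of frames.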

This is pretty much completely arbitrary, but I think I'd probably guess that in a perfect world... LAME could probably be increased in quality (at a similar bitrate) about another 20% or so.. it's hard to say for sure though. That 20% might only be on difficult samples too.
RD
post Oct 25 2001, 17:39
Post #3

Thanks Dibrom, it's always a pleasure to read your responses...

I guess your new compile of lame is mostly aimed at fixing transients/pre-echo then or will it have any benefits for rock and metal music too?

Oh btw, how is the tweaking coming along? (Sorry--dying to know when the new compile might be ready...)

Finally, what is this "better tonality estimation" thing about? What is tonality? Does it play a part in determining whether or not the encoder uses short or long blocks? I know at least a little something about most aspects of lossy compression, but I know nothing about this tonality thingy....

Thanks again,
RD
tangent
post Oct 25 2001, 17:40
Post #4

Group: Members
Posts: 674
Joined: 29-September 01
Member No.: 63

To maintain backward compatibility with MP2, MP3 does its MDCT on the 32 subbands produced by the MP2-style polyphase filterbank. The subbanding stage itself is an extra lossy stage which is really not required for the transform coder. That's a quality limitation in MP3 itself.

AAC and Vorbis do not subband before applying the MDCT, so assuming perfect encoding of all the transform coefficients, AAC and Vorbis can provide completely lossless encoding, while that is not possible for MP3.
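The claim that a pure MDCT path is invertible rests on time-domain alias cancellation (TDAC). Here is a minimal numpy sketch using the textbook MDCT definition and a sine (Princen-Bradley) window; this is a generic illustration, not LAME or MP3 code:

```python
import numpy as np

def mdct(x):
    """Forward MDCT: 2N time samples -> N coefficients."""
    N = len(x) // 2
    n, k = np.arange(2 * N), np.arange(N)
    basis = np.cos(np.pi / N * (n[None, :] + 0.5 + N / 2) * (k[:, None] + 0.5))
    return basis @ x

def imdct(X):
    """Inverse MDCT: N coefficients -> 2N samples (still time-aliased)."""
    N = len(X)
    n, k = np.arange(2 * N), np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5 + N / 2) * (k[None, :] + 0.5))
    return (2.0 / N) * (basis @ X)

N = 64
w = np.sin(np.pi / (2 * N) * (np.arange(2 * N) + 0.5))  # Princen-Bradley window
x = np.random.default_rng(0).standard_normal(8 * N)
y = np.zeros_like(x)
# 50%-overlapped analysis/synthesis: aliasing cancels between adjacent blocks
for start in range(0, len(x) - 2 * N + 1, N):
    y[start:start + 2 * N] += w * imdct(mdct(w * x[start:start + 2 * N]))

assert np.allclose(x[N:-N], y[N:-N])  # interior samples reconstructed exactly
```

With untouched coefficients the overlap-add is exact; lossy coding comes only from quantizing the coefficients, which is the point being made about AAC and Vorbis versus MP3's extra hybrid stage.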
Dibrom
post Oct 25 2001, 23:01
Post #5

QUOTE
Originally posted by RD
Thanks Dibrom, its always a pleasure to read your responses...


Np. smile.gif

QUOTE
I guess your new compile of lame is mostly aimed at fixing transients/pre-echo then or will it have any benefits for rock and metal music too?


Actually, what I'm working on more than anything else is noise measuring, which is probably more related to tonality issues. Basically it should be a more fundamental fix to issues like dropouts, without requiring as many bits as something like -X1... at least that's the idea. It should also help situations like fatboy and cases with those kinds of impulses. More classic pre-echo such as in castanets I'm also going to try to work on some, but this will come after the other stuff.

QUOTE
Oh btw, how is the tweaking coming along? (Sorry--dying to know when the new compile might be ready...)


Well, I spoke a little too soon last time unfortunately.. biggrin.gif Things are coming along, but I've been trying to work on a more efficient solution than what I did the first time around. I think that I am making progress though. I don't want to give an exact estimate of when I'll be ready this time, just in case, but I'm hoping "soon".

QUOTE
Finally, what is this "better tonality estimation" thing about? What is tonality? Does it play a part in determining whether or not the encoder uses short or long blocks?  I know at least a little something about most aspects of lossy compression, but I know nothing about this tonality thingy....


Well, I don't know all of what it encompasses myself actually, but basically it is what helps determine how much masking is needed for a particular signal. The idea is that if a signal is more tonal it probably requires more masking but if it is less tonal and more noiselike, you can be more aggressive with the quantization. I think that is how you would describe it basically. So since block switching and mid side stereo selection (as well as many other things) rely on masking calculations to make decisions, this does have an effect on many of those things.
JohnV
post Oct 26 2001, 01:42
Post #6

Group: Developer
Posts: 2797
Joined: 22-September 01
Member No.: 6

QUOTE
Originally posted by Dibrom
The idea is that if a signal is more tonal it probably requires more masking but if it is less tonal and more noiselike, you can be more aggressive with the quantization.

Umm, I always mix up more masking and less masking smile.gif. I would say when a signal is more tonal, it has lower masking capability, thus higher quantization resolution (higher bitrate) must be used. When a signal is more noise-like, it has higher masking capability, which allows lower quantization resolution (lower bitrate) because quantization noise can be masked better.

For example, if tonality estimation estimates a noise-like signal to be even noisier than it really is (so the psychoacoustic model thinks there is more masking capability than there actually is), the model may use lower quantization resolution (lower bitrate) than it really should, thus causing audible quantization noise.
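One common, if simplistic, proxy for how tonal versus noise-like a signal is, is the spectral flatness measure. This is a generic illustration of the idea being discussed, not LAME's actual tonality estimator:

```python
import numpy as np

def spectral_flatness(x):
    """Geometric / arithmetic mean of the power spectrum.
    Near 0 for tonal signals, near 1 for noise-like signals."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

fs, n = 44100, 4096
t = np.arange(n) / fs
tone = np.sin(2 * np.pi * 1000 * t)                  # strongly tonal
noise = np.random.default_rng(1).standard_normal(n)  # noise-like

# A tonal signal masks quantization noise poorly -> spend more bits on it;
# a noise-like signal hides it well -> quantize more aggressively.
print(spectral_flatness(tone))    # close to 0
print(spectral_flatness(noise))   # roughly 0.5
```

If an estimator like this reports a signal as flatter (noisier) than it really is, the encoder will allow more quantization noise than the signal can actually mask, which is exactly the failure mode described above.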


--------------------
Juha Laaksonheimo
Dibrom
post Oct 26 2001, 03:15
Post #7

Yeah, I think that's pretty much right. Heh.. I always start mixing that up when describing it.. like you said smile.gif But anyway, what counts is basically how aggressive the encoder is at injecting or allowing noise in one particular area (because it believes it will be masked) compared to others. Quantization noise in tonal areas is usually more audible than in non-tonal areas.

This is part of the problem you see with clips like fatboy actually.. too much noise is allowed during the impulses, and as a result you hear these nasty artifacts. If you look at fatboy encoded with mp3 from a spectral view and zoom in between the pulses, you can sometimes see noise smeared throughout.. and if you listen, these are usually the areas where you are hearing the worst artifacts.
RD
post Oct 26 2001, 14:30
Post #8

QUOTE
- Mp3 lacks a scalefactor for the last scalefactor band.  This means that you can never get nearly 100% dropout free encoding over 16khz.  For many people this may not be much of a problem really, but besides the fact that it is a possible quality concern (even at 320kbps).


Hmm... so I guess that means that above 16 kHz Lame's noise shaping is either technically non-existent, or at least considerably impaired (i.e. we can still use a noise measuring criterion, e.g., -X3, but it is really inefficient without a scalefactor for the last band....)

Do other mp3 encoders have a scalefactor for the last band? For example, Fraunhofer? Or is it only absent in Lame? If only absent in Lame, how could we create a scalefactor for the last band--through listening tests? Is it feasible? Or should we just forget about it...?

QUOTE
It also causes very large bitrate increases as you attempt to approach the point of increasingly diminishing returns of getting that extra little bit above 16khz encoded.


I know... just recently I encoded Joe Stump's excellent metal-virtuoso instrumental album "Night of the Living Shred", first with:
(1) --dm-preset standard -Z
(2) --dm-preset standard --lowpass 16
and the results were dramatic...

(1) = 86.1 megabytes
(2) = 59.7 megabytes

A whopping 44% increase for frequencies above 16 kHz!
Actually, I should not have added -Z to (1) above because that will inflate bitrate, but still what an INCREASE!!
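For the record, the 44% figure checks out from the two file sizes quoted above:

```python
full_mb = 86.1    # (1) --dm-preset standard -Z (no lowpass)
lp16_mb = 59.7    # (2) --dm-preset standard --lowpass 16

increase_pct = (full_mb - lp16_mb) / lp16_mb * 100
print(round(increase_pct))   # 44
```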

If we had a scalefactor band for the last band, would this increase be a lot less, or only a little less?

I end with a technical question about high frequencies... My understanding is that the wavelength (the distance between crests/peaks) shortens as you go from 20 Hertz to 20 Kilohertz...

Thus it would seem that there is a lot more information to store in the higher frequencies than the lower ones... is this ALSO one of the reasons for the:

QUOTE
very large bitrate increases as you attempt to ... [get] that extra little bit above 16khz encoded.


Or is this not one the reasons at all...?

Take care,
RD
Dibrom
post Oct 26 2001, 14:46
Post #9

QUOTE
Originally posted by RD
Hmm... so I guess that means that above 16 kHz Lame's noise shaping is either technically non-existent, or at least considerably impaired (i.e. we can still use a noise measuring criterion, e.g., -X3, but it is really inefficient without a scalefactor for the last band....)


Not exactly. Since there is no scalefactor, noise shaping is the only thing that we can use to actually encode a good portion of this content at all. I believe the end result of the scalefactors is that they dictate how bits are distributed efficiently for a particular band. Since we don't have this for sfb21 (long blocks) or sfb12 (short blocks), we have to rely on noise shaping in the last band to determine what to encode. Then, the only way to encode this is to change the quantization resolution on a global scale. Thus the frequencies over 16khz aren't what is causing the massive bitrate increases; it is all the other frequencies below 16khz which are now encoded with increased resolution (more bits) because we needed to encode above 16khz.
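Where the 16 kHz figure comes from: at 44.1 kHz, 576 spectral lines cover 0 to 22.05 kHz, and the region above the last scaled band (sfb21) starts at line 418. The band edges below are quoted from the ISO 11172-3 tables (as found in LAME's sources), not derived here:

```python
# Long-block scalefactor band edges at 44.1 kHz (spectral-line indices, 0..576)
edges = [0, 4, 8, 12, 16, 20, 24, 30, 36, 44, 52, 62, 74, 90, 110,
         134, 162, 196, 238, 288, 342, 418, 576]

fs = 44100
hz_per_line = (fs / 2) / 576       # about 38.3 Hz per spectral line

sfb21_start_hz = edges[21] * hz_per_line
print(round(sfb21_start_hz))       # ~16 kHz: sfb21 covers roughly 16-22 kHz
```

Everything from line 418 up shares no scalefactor of its own, so the only lever left for that region is the global quantizer, which drags every band below 16 kHz along with it.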

QUOTE
Do other mp3 encoders have a scalefactor for the last band?


No, this is a limitation of the mp3 spec.

QUOTE
A whopping 44% increase for frequencies above 16 kHz!


Well as I explain above, it isn't so much what is over 16khz that causes this increase, instead it is more of a result of what happens to the other frequencies when we try to do this.

QUOTE
If we had a scalefactor band for the last band, would this increase be a lot less, or only a little less?


It would be significantly less. In the cases of AAC, Vorbis, and MPC, none of them have this problem, and most of them are capable of full or nearly full frequency reproduction at around 200kbps or even less. Even at 320kbps, this is impossible 100% of the time with mp3.

QUOTE
I end with a technical question about high frequencies... My understanding is that the wavelength (the distance between crests/peaks) shortens as you go from 20 Hertz to 20 Kilohertz...

Thus it would seem that there is a lot more information to store in the higher frequencies than the lower ones... is this ALSO one of the reasons for the:

Or is this not one the reasons at all...?


No, this doesn't really have anything to do with the scalefactor issue.
Garf
post Oct 26 2001, 15:45
Post #10

Server Admin
Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13

QUOTE
I end with a technical question about high frequencies... My understanding is that the wavelength (distance between the crests/peaks shortens as you go from 20 Hertz to 20 Kilohertz... 

Thus it would seem that there is a lot more information to store in the higher frequencies than the lower ones...is this ALSO one of the reasons

No, this doesn't really have anything to do with the scalefactor issue.


The reason for this is that the encoder encodes in the frequency domain, not the time domain. The same thing that causes pre-echoes helps out nicely here smile.gif

--
GCP
RD
post Oct 26 2001, 20:21
Post #11

Thanks so much to everybody esp. Dibrom, Tangent, John V, and Garf, for their great comments.

How could we create a scalefactor for the last band--through listening tests?

Is it feasible? or should we just forget about it...?

Dibrom said it's not in the specification, but will adding one affect decoding? Obviously we want mp3 files that work with all decoders....

Because if we can "fix" this problem.... lame will greatly benefit...

RD
Garf
post Oct 26 2001, 20:51
Post #12

QUOTE
Originally posted by RD
Thanks so much to everybody esp. Dibrom, Tangent, John V, and Garf, for their great comments.

How could we create a scalefactor for the last band--through listening tests?

Is it feasible? or should we just forget about it...?

Dibrom said it's not in the specification, but will adding one affect decoding? Obviously we want mp3 files that work with all decoders....

Because if we can "fix" this problem.... lame will greatly benefit...

RD


The scalefactor is basically a field that is lacking in the mp3 format specification. There is no way to fix this and have it stay compatible (really compatible, not like Mp3pro/Plusv) with MP3.

--
GCP