Uniform frequency resolution used in lossy coding: why?
knutinh
post Apr 20 2012, 10:05
Post #1





Group: Members
Posts: 569
Joined: 1-November 06
Member No.: 37047



Why is it that lossy audio encoders, which try to model the human auditory system in order to avoid encoding perceptually irrelevant information, tend (AFAIK) to use uniform frequency resolution, when models of the human auditory system are highly non-uniform?

Is it only because current designs are CPU-limited and the FFT (and its derivatives) is the only feasible transform? Or is there some underlying limitation of our perception that ensures uniform transforms are as good as it gets (limited maximum neuron firing rate? half-wave rectification? signal stationarity?)

regards
k
Garf
post Apr 20 2012, 10:56
Post #2


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (knutinh @ Apr 20 2012, 11:05) *
Why is it that lossy audio encoders, which try to model the human auditory system in order to avoid encoding perceptually irrelevant information, tend (AFAIK) to use uniform frequency resolution


No, they don't. They all use Bark bands, or similar things like ERB, which are explicitly modelled on the non-uniform human hearing system. Similarly, encoding formats like MP3 or AAC use scalefactor bands for encoding decisions, whose widths correlate with those Bark bands.
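The non-uniformity is easy to see from the Bark scale itself. A minimal sketch using the Zwicker-Terhardt approximation (the test frequencies below are arbitrary):

```python
import numpy as np

def hz_to_bark(f):
    # Zwicker & Terhardt approximation of the Bark scale
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

# Equal steps on the Bark axis cover ever-wider ranges in Hz:
# 1 kHz sits near Bark 8.5, but the whole 9-10 kHz octave fragment
# is only about half a Bark wide.
for f in (100.0, 1000.0, 2000.0, 10000.0):
    print(f"{f:7.0f} Hz -> {hz_to_bark(f):5.2f} Bark")
```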

If you're wondering why they use a uniform transform to eventually make those non-uniform encoding decisions, it's a simple matter of efficiency. The MDCT gets you aliasing cancellation, good frequency resolution, decent time resolution (with block switching), good energy compaction and very high speed. We have nothing better so far.
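The energy-compaction point can be illustrated with a toy orthonormal DCT-II, written as a plain matrix product rather than the fast algorithm an encoder would use (the tone frequency is an arbitrary choice):

```python
import numpy as np

N = 1024
n = np.arange(N)
# orthonormal DCT-II basis (rows are basis vectors)
C = np.sqrt(2.0 / N) * np.cos(np.pi / N * np.outer(n, n + 0.5))
C[0] /= np.sqrt(2.0)

# a tonal input: energy spread over all 1024 time samples...
x = np.sin(2 * np.pi * 441.0 * n / 48000.0)
X = C @ x                       # ...but concentrated in a few DCT bins
energy = np.sort(X ** 2)[::-1]
print(f"energy in 16 largest of {N} coefficients: "
      f"{energy[:16].sum() / energy.sum():.3f}")
```

Because the basis is orthonormal, total energy is preserved; the point is that almost all of it lands in a handful of coefficients, which is exactly what makes coarse quantization of the rest cheap.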
knutinh
post Apr 20 2012, 11:13
Post #3








If you base your representation on a uniform block transform and combine coefficients to selectively widen the higher-frequency bands, you ought to be losing temporal resolution compared to a full-band filterbank, wavelet or whatever.

Are changes in the upper octaves that happen on smaller timescales (e.g. 1 ms) perceptually less important, or is it somehow not possible to exploit that information to improve perceptual quality vs bitrate?

Would not e.g. a gammatone filterbank be a better mapping to our hearing, and therefore a better starting-point for deciding what information to throw away (if cpu cycles were of no concern, and possible redundancy in the representation could be efficiently compressed)?



e.g.:
"WIDEBAND SPEECH AND AUDIO CODING USING GAMMATONE FILTER BANKS", Eliathamby Ambikairajah, Julien Epps, Lee Lin

I seem to remember that the neuron firing rate is limited to something like an effective bandwidth of 2 kHz. If critical bands of higher bandwidth are encoded only as a half-wave rectified envelope, this would put a limit on the temporal resolution needed to model the system, and perhaps this is some "lucky coincidence" that ensures the success of (perceptually modified) uniform block transforms?

Another (only somewhat related) question: when 50% overlapped block transforms are used, this would seem to introduce some redundancy into the parameters (coefficients). How (if at all) is this redundancy removed, unless there is some parameter prediction across frames (which would tend to make starting at a random point of the stream impossible)?

-k

Garf
post Apr 20 2012, 14:23
Post #4





QUOTE (knutinh @ Apr 20 2012, 12:13) *
If you base your representation on a uniform block transform and combine coefficients to selectively widen the higher-frequency bands, you ought to be losing temporal resolution compared to a full-band filterbank, wavelet or whatever.


Not sure how the "combine coefficients to selectively widen the higher-frequency bands" fits into your question, but indeed if the transform worked directly into larger bands at higher frequencies, it would have better temporal resolution. Note that I said "decent time resolution (with block switching)". MDCT based codecs generally allow switching to a coarser frequency resolution (and hence better time resolution) if they determine that is more appropriate to the material at hand. This mostly sidesteps the problem, though it certainly is a bit of a hack and adds considerable implementation complexity.

QUOTE
Would not e.g. a gammatone filterbank be a better mapping to our hearing, and therefore a better starting-point for deciding what information to throw away (if cpu cycles were of no concern, and possible redundancy in the representation could be efficiently compressed)?


The latter two are an issue, of course. The MDCT is both fast and has good energy compaction. In my experience, less uniform filterbanks tend to fail at both and hence perform worse overall.

QUOTE
I seem to remember that the neuron firing rate is limited to something like an effective bandwidth of 2kHz. If critical bands of higher bandwidth are encoded only as half-wave rectified envelope, this would put a limit on the temporal resolution relevant needed to model the system, and perhaps this is some "lucky coincidence" that ensure the success of (perceptually modified) uniform block transforms?


I'm not familiar with this research nor am I sure I understand what you're saying.

QUOTE
Another (only somewhat related) question: When 50% overlapped block transforms are used, this would seem to introduce some redundancy into the parameters (coefficients).


MDCT is 2N coefficients in, N coefficients out, with 50% overlap. So there's no redundancy - it's critically sampled. (Another reason why it's so popular!)
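That critical-sampling property can be verified directly. A minimal numpy sketch (slow direct-form MDCT with the sine window; real codecs use a fast algorithm): each 2N-sample frame yields only N coefficients, each frame's inverse contains time-domain aliasing, and the aliasing cancels exactly between neighbouring frames on overlap-add.

```python
import numpy as np

def mdct(frame):
    # direct-form MDCT: 2N windowed samples in, N coefficients out
    N = len(frame) // 2
    n, k = np.arange(2 * N), np.arange(N)
    w = np.sin(np.pi / (2 * N) * (n + 0.5))   # sine window (Princen-Bradley)
    basis = np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))
    return (frame * w) @ basis

def imdct(X):
    N = len(X)
    n, k = np.arange(2 * N), np.arange(N)
    w = np.sin(np.pi / (2 * N) * (n + 0.5))
    basis = np.cos(np.pi / N * np.outer(n + 0.5 + N / 2, k + 0.5))
    return (2.0 / N) * w * (basis @ X)

N = 64
rng = np.random.default_rng(0)
x = rng.standard_normal(3 * N)
# two 50%-overlapped frames; each alone reconstructs with aliasing...
y0 = imdct(mdct(x[0:2 * N]))
y1 = imdct(mdct(x[N:3 * N]))
# ...but overlap-adding the halves cancels it exactly (TDAC)
middle = y0[N:] + y1[:N]
print(np.allclose(middle, x[N:2 * N]))   # True
```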
saratoga
post Apr 20 2012, 15:22
Post #5





Group: Members
Posts: 4963
Joined: 2-September 02
Member No.: 3264



The ATRAC family of codecs does not use entirely uniform frequency resolution. In the first few flavors, they use a QMF to split the audio into uneven subbands, which are then processed with different-sized MDCTs. The newer variants use an even more complex scheme:

http://wiki.multimedia.cx/index.php?title=...ding_techniques
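The tree-structured splitting is straightforward to sketch. A toy version with the 2-tap Haar QMF pair (real codecs use much longer filters; Haar just makes the perfect-reconstruction property easy to verify):

```python
import numpy as np

def haar_analysis(x):
    # 2-tap Haar QMF pair: trivially alias-cancelling and orthogonal
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return lo, hi

def haar_synthesis(lo, hi):
    x = np.empty(2 * len(lo))
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x

# Uneven split in the ATRAC style: split the full band, then split the
# low band again -> three bands covering 1/4, 1/4 and 1/2 of the spectrum.
rng = np.random.default_rng(1)
x = rng.standard_normal(64)
lo, hi = haar_analysis(x)
lolo, lohi = haar_analysis(lo)
y = haar_synthesis(haar_synthesis(lolo, lohi), hi)
print(np.allclose(y, x))   # True: the tree reconstructs perfectly
```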
Woodinville
post Apr 21 2012, 17:32
Post #6





Group: Members
Posts: 1402
Joined: 9-January 05
From: JJ's office.
Member No.: 18957



Knutinh, you are missing one of the issues in coding.

There are two:

1) We must implement the masking thresholds well.

2) We must squeeze as much rate/coding gain out of the codec as the latency we can afford allows, without 1) falling out from under the time window.

The reason uniform banks are used is that we get a lot of coding gain.

There have been several attempts at codecs that use critical bands (Sinha, the original MUSICAM from IRT, Johnston-Brandenburg hybrid coding), and all of them give you a gain of about 4x in the rate for transients, which is about 5% of the signal, and cost you about 2-4x the rate for the other 95%, give or take. I may have those numbers wrong, I'm typing at you from a resort in Dalaman, Turkey, where SIU just ended, but the point is simple: you lose much more than you gain.

Every one of the attempts has foundered on exactly that.

This is a different problem from the codecs that switch between uniform and wavelet transforms. Each works well, but the transition windows utterly murder your buffer management.
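Taking those (self-confessedly rough) numbers at face value, the arithmetic is easy to run:

```python
# Plugging in the rough numbers from the post above: ~4x rate gain on
# transients (~5% of the signal), 2-4x rate cost on the remaining ~95%.
transient_frac = 0.05
for steady_cost in (2.0, 4.0):
    overall = transient_frac * (1.0 / 4.0) + (1.0 - transient_frac) * steady_cost
    print(f"steady-state cost {steady_cost:.0f}x -> overall rate {overall:.2f}x")
# 1.91x to 3.81x overall: the loss on the 95% swamps the gain on the 5%.
```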


--------------------
-----
J. D. (jj) Johnston
splice
post Apr 22 2012, 00:37
Post #7





Group: Members
Posts: 125
Joined: 23-July 03
Member No.: 7935



You're at a resort in Turkey and you're reading HA? This is obviously some definition of "relaxation" that I'm not familiar with.


--------------------
Regards,
Don Hills
Woodinville
post Apr 22 2012, 15:18
Post #8








QUOTE (splice @ Apr 21 2012, 16:37) *
You're at a resort in Turkey and you're reading HA? This is obviously some definition of "relaxation" that I'm not familiar with.


I'm at a resort in Turkey under the IEEE DL banner.

Giving a talk in Athens tomorrow, Patras the day after, Erlangen Thursday, then Edinburgh Sunday, Tuesday and Wednesday.


knutinh
post Apr 23 2012, 09:02
Post #9








QUOTE (Garf @ Apr 20 2012, 15:23) *
QUOTE

I seem to remember that the neuron firing rate is limited to something like an effective bandwidth of 2 kHz. If critical bands of higher bandwidth are encoded only as a half-wave rectified envelope, this would put a limit on the temporal resolution needed to model the system, and perhaps this is some "lucky coincidence" that ensures the success of (perceptually modified) uniform block transforms?

I'm not familiar with this research nor am I sure I understand what you're saying.

It was suggested to me some years ago that communication from the ears to the brain operates at an effective sampling rate of a couple of kHz. It was also suggested that the reason why we can still "hear" stuff at higher frequencies is that (at some point) the signalling starts to detect the envelope instead, sampled at the same rate.

If this is true, then it would seem that the potential temporal resolution suggested by a relatively wide bandpass filter close to 10 kHz or 20 kHz is not something that humans can exploit, and therefore not something that a lossy codec needs to care about.
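Whether or not the physiological numbers are right, the envelope idea itself is easy to sketch (all constants below are illustrative, not physiological):

```python
import numpy as np

# Half-wave rectify a modulated 10 kHz tone, then smooth with a ~1 ms
# window standing in for a slow neural channel: what survives is the
# (slow) envelope, not the 10 kHz fine structure.
fs = 48000
t = np.arange(fs // 10) / fs
env = 0.5 * (1 + np.sin(2 * np.pi * 50 * t))       # 50 Hz modulation
x = env * np.sin(2 * np.pi * 10000 * t)            # 10 kHz carrier
rect = np.maximum(x, 0)                            # half-wave rectifier
kernel = np.ones(48) / 48                          # ~1 ms moving average
smooth = np.convolve(rect, kernel, mode='same')
# 'smooth' now tracks env (up to a scale factor), not the carrier
print(f"correlation with envelope: {np.corrcoef(smooth, env)[0, 1]:.3f}")
```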
QUOTE
QUOTE
Another (only somewhat related) question: When 50% overlapped block transforms are used, this would seem to introduce some redundancy into the parameters (coefficients).


MDCT is 2N coefficients in, N coefficients out, with 50% overlap. So there's no redundancy - it's critically sampled. (Another reason why it's so popular!)

Ahh, thank you for enlightening me.

-k
knutinh
post Apr 23 2012, 09:14
Post #10








QUOTE (Woodinville @ Apr 21 2012, 18:32) *
Knutinh, you are missing one of the issues in coding.

There are two:

1) We must implement the masking thresholds well.

This is the "lossy" part, right? Introducing error/inaccuracy in such a way as to decrease bitstream bandwidth without affecting audible distortion too much. ( irrelevancy removal).
QUOTE
2) We must squeeze as much rate/coding gain out of the codec as the latency we can afford allows, without 1) falling out from under the time window.

The reason uniform banks are used is that we get a lot of coding gain.

Is this the "lossless" part, i.e. removing redundancy in the original representation?

I had just assumed that, since lossless codecs (such as FLAC) typically operate at 1/2 the rate of LPCM, this is a good estimate of the redundancy in typical PCM, and that the remainder has to be removed through removal of irrelevancy.

Would not (any orthogonal transform + ideal vector quantization) do as well as (an ideal orthogonal transform)? If PCA or KLT gives you the most energy compaction (hopefully something close to the DCT), could not (in principle, and at unreasonable computational cost) a raw time-sample representation (identity transform) + vector quantization do the same thing?

It strikes me as unfortunate that the transform has to be chosen to satisfy both energy compaction and masking thresholds at the same time, given that those two are somewhat different goals.

-h

dhromed
post Apr 23 2012, 10:57
Post #11





Group: Members
Posts: 1314
Joined: 16-February 08
From: NL
Member No.: 51347



QUOTE (knutinh @ Apr 23 2012, 09:02) *
It was suggested to me some years ago that communication from ears to brain operate at an effective sampling rate of a couple of kHz. It was also suggested that the reason why we still can "hear" stuff at higher frequencies was that (at some point), the signalling started to detect the envelope instead, sampled at the same sampling rate.


The ear-brain connection is not analog, and the operating frequency of the neurons is not directly related to the frequency of the input. Rather, a certain hair in the inner ear at a certain frequency position is stimulated by mechanical input of a certain frequency, and thus sends a signal that means "Hey! This is frequency X!".

The firing rate of neurons might limit the ability of a person to detect two consecutive clicks, though a hearing expert will be better able to comment on that.
saratoga
post Apr 23 2012, 15:14
Post #12








QUOTE (knutinh @ Apr 23 2012, 04:14) *
QUOTE (Woodinville @ Apr 21 2012, 18:32) *
Knutinh, you are missing one of the issues in coding.

There are two:

1) We must implement the masking thresholds well.


This is the "lossy" part, right? Introducing error/inaccuracy in such a way as to decrease bitstream bandwidth without affecting audible distortion too much. ( irrelevancy removal).


Technically quantization is the lossy part, but the decision about how to quantize is made using the masking thresholds, so essentially yes. Basically, in a modern transform coder, you do:

MDCT > Quantization > Lossless coding

The MDCT gets you your time/frequency distribution, the quantization does the lossy part by throwing away bits of information that masking says won't be needed, and then entropy coding does the lossless bit (basically taking all those zeros you've added by throwing out information and packing them efficiently).
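A toy numpy version of that chain (the Laplacian coefficients and the flat "masking threshold" are made-up stand-ins; a real encoder derives per-scalefactor-band thresholds from a psychoacoustic model, and uses a proper entropy coder rather than run-length pairs):

```python
import numpy as np

rng = np.random.default_rng(0)
coeffs = rng.laplace(scale=1.0, size=1024)     # stand-in for MDCT output
threshold = 2.0                                # hypothetical masking floor

q = np.round(coeffs / threshold).astype(int)   # the lossy step
print(f"{np.mean(q == 0):.0%} of coefficients quantized to zero")

def rle(values):
    # minimal run-length coder, the flavour of the lossless step:
    # long runs of zeros pack into a handful of (value, run) pairs
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

print(len(rle(q.tolist())), "run-length pairs for", len(q), "coefficients")
```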

QUOTE (knutinh @ Apr 23 2012, 04:14) *
I had just assumed that since lossless codecs (such as flac) operate at typically 1/2 the rate of LPCM, that this is a good estimate of the redundancy in typical PCM, and that the remainder have to be removed through removal of irrelevancy.


FWIW, that 1/2 for FLAC/APE/etc is largely due to joint stereo. For mono files, FLAC compresses much less. Of course, your lossy codec can also use joint stereo (either before or after the MDCT) in order to get the same gain.
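A quick illustration of that joint-stereo gain, with synthetic strongly correlated channels (the 0.1 decorrelation level is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
common = rng.standard_normal(48000)            # shared content
left = common + 0.1 * rng.standard_normal(48000)
right = common + 0.1 * rng.standard_normal(48000)

# mid/side rotation: an invertible, lossless step
mid, side = (left + right) / 2.0, (left - right) / 2.0
print(f"side/mid energy ratio: {np.var(side) / np.var(mid):.4f}")
# the side channel is nearly empty, so it costs almost nothing to code
```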

QUOTE (knutinh @ Apr 23 2012, 04:14) *
Would not (any orthogonal transform + ideal vector quantization) do as well as (ideal orthogonal transform)? If PCA or KLT gives you the most energy compaction (hopefully something close to DCT), could not (in principle, and at unreasonable computational cost) a raw time-sample representation (identity transform) + vector quantization do the same thing?


This is where I step in over my head, but my guess is no. I think the DCT is somewhat special in that its decomposition lines up fairly neatly with the actual time/frequency response of the human ear (except that it is linear in frequency). I think if you went with something like PCA you would get better energy compaction, but have a lot more trouble computing the masking thresholds than with the DCT.

However, I admit I am much more familiar with decoders than with actual encoding.
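For what it's worth, that guess can be checked numerically on a toy signal model; here an AR(1) process stands in for "typical" correlated audio (the order-1 model and ρ = 0.95 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N, frames = 32, 4000

# synthesize a long, strongly correlated AR(1) signal, cut into frames
s = np.zeros(N * frames)
e = rng.standard_normal(N * frames)
for i in range(1, len(s)):
    s[i] = 0.95 * s[i - 1] + e[i]
X = s.reshape(frames, N)

# KLT: eigenvectors of the empirical covariance (optimal compaction)
cov = X.T @ X / frames
evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
klt_top4 = evals[:4].sum() / evals.sum()

# fixed orthonormal DCT-II: no signal statistics to transmit
n = np.arange(N)
C = np.sqrt(2.0 / N) * np.cos(np.pi / N * np.outer(n, n + 0.5))
C[0] /= np.sqrt(2.0)
d_energy = np.sort(((X @ C.T) ** 2).mean(axis=0))[::-1]
dct_top4 = d_energy[:4].sum() / d_energy.sum()

print(f"top-4 energy fraction: KLT {klt_top4:.3f}, DCT {dct_top4:.3f}")
# the fixed DCT comes within a few percent of the signal-adaptive KLT
```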
knutinh
post Apr 24 2012, 07:47
Post #13








QUOTE (saratoga @ Apr 23 2012, 16:14) *
This is where I step in over my head, but my guess is no. I think the DCT is somewhat special in that its decomposition lines up fairly neatly with the actual time/frequency response of the human ear (except that it is linear in frequency). I think if you went with something like PCA you would get better energy compaction, but have a lot more trouble computing the masking thresholds than with the DCT.

However, I admit I am much more familiar with decoders than with actual encoding.

I think that the DCT (or something similar) is special in that it offers a) a representation suitable for perceptually guided quantization, b) energy compaction, and c) low implementation cost. I agree that PCA would probably give a (small) benefit w.r.t. energy compaction, but a big hit w.r.t. perceptually sensible quantization.

But if there is some transform that does the perceptual mapping better than the DCT (probably some non-uniform (non-linear?) wavelet/filterbank), and another operation that can squeeze out the redundancy of any finite-length correlated, transformed signal (vector quantization?), then those two together would seem to be able both to a) convert all irrelevancy into redundancy, and b) convert all redundancy into fewer transmitted bits?

-k
Alexey Lukin
post Apr 28 2012, 15:50
Post #14





Group: Members
Posts: 191
Joined: 31-July 08
Member No.: 56508



QUOTE (knutinh @ Apr 24 2012, 02:47) *
I think that the DCT (or something similar) is special in that it offers a) a representation suitable for perceptually guided quantization, b) energy compaction, and c) low implementation cost. I agree that PCA would probably give a (small) benefit w.r.t. energy compaction, but a big hit w.r.t. perceptually sensible quantization.

Don't forget that with PCA you'd have to transmit the dictionary along with transform coefficients. And if you want better compaction than DCT, your dictionary should be adaptive over time.


QUOTE (knutinh @ Apr 24 2012, 02:47) *
But if there is some transform that does the perceptual mapping better than the DCT (probably some non-uniform (non-linear?) wavelet/filterbank), and another operation that can squeeze out the redundancy of any finite-length correlated, transformed signal (vector quantization?), then those two together would seem to be able both to a) convert all irrelevancy into redundancy, and b) convert all redundancy into fewer transmitted bits?

You could use DCT over a few wavelet bands. The ATRAC codec mentioned above implements this strategy and thus achieves nonuniform time-frequency resolution.
Woodinville
post Apr 28 2012, 18:07
Post #15








QUOTE (knutinh @ Apr 23 2012, 01:14) *
It strikes me as unfortunate that the transform has to be chosen to satisfy both energy compaction and masking thresholds at the same time, given that those two are somewhat different goals.

-h


That's not quite it. There are several goals, which are
1) Satisfy the needs (both frequency and time) demonstrated by the perceptual model
2) keep as much signal processing gain as possible.

This is a very difficult optimization, to say the very least.

The problem with KLT or approximations is that you now have to find a way to code the basis vectors. Lots of luck with that one. You can, of course, but then your basis vector transmission costs what you just saved, and 100 times more.

Needless to say, it is possible to take a lot of audio blocks and figure out the KLT for them (um, consider, with a 1024-point transform, how many blocks do you need to satisfy the statistics that make the KLT valid? Not so simple, eh? A wee bit of latency, perchance?). You find out that you lose about 1-5% of the transform gain by using an MDCT (NOT DCT, please, and the MDCT filterbank IS NOT A TRANSFORM, IT IS A FILTER BANK, so you need to formulate your KLT the same way, good luck with that too), and then you spend a huge number of bits sending the basis vectors at high accuracy, which you must do in order to have good transform gain.

You lose, overall, by a lot. There is implicit "information" in the MDCT bank that does not have to be transmitted (not real information, but by using the fixed filterbank, you do not have to signal the actual signal statistics). This is a feature, not a bug, in the codec design, as we are trying to be "optimum" not in terms of signal processing, but rather in bit rate.

This of course avoids an even more difficult problem with KLT: that of determining how to map the psychoacoustic model (which is in terms of Fourier frequency, give or take) to your KLT bases. This is a (bleeper) of a problem, and I am being kind to say that.
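Back-of-envelope, with made-up but charitable numbers, the basis-transmission cost looks like this:

```python
# Shipping an adaptive 1024x1024 basis at, say, 16 bits per entry
# (both numbers hypothetical):
N, bits_per_entry = 1024, 16
basis_bits = N * N * bits_per_entry
print(basis_bits / 1e6, "Mbit per basis update")
# A 128 kbps stream carries 0.128 Mbit per second, so a single basis
# refresh costs over two minutes of audio at that rate.
```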

So, live from Edinburgh, your answer.


Woodinville
post Apr 28 2012, 18:09
Post #16








QUOTE (Alexey Lukin @ Apr 28 2012, 07:50) *
You could use DCT over a few wavelet bands. The ATRAC codec mentioned above implements this strategy and thus achieves nonuniform time-frequency resolution.


And the worse performance that goes along with that. Sad, but true.


Alexey Lukin
post Apr 28 2012, 19:27
Post #17








JJ, are you referring to the "overall performance" of the ATRAC codec or to the energy compaction property of its filter bank? I would be surprised if its energy compaction were worse than a single-resolution MDCT's (not considering block switching here).
Woodinville
post Apr 28 2012, 21:45
Post #18








QUOTE (Alexey Lukin @ Apr 28 2012, 11:27) *
JJ, are you referring to "overall performance" of ATRAC codec or the energy compaction property of its filter bank? I would be surprised if its energy compaction was worse than a single-resolution MDCT (not considering block switching here).


The aliasing in the tree structure hurts the energy compaction. Just like in MP3.


Alexey Lukin
post Apr 29 2012, 06:40
Post #19








Interesting. Their QMF seems to be of reasonably high order (48); I thought the aliasing would be limited in frequency.

Woodinville
post Apr 29 2012, 16:46
Post #20








It is not too bad in this case, but it does exist.

At least it's better than the MP3 filterbank, where the 32-band bank creates gobs of aliasing to deal with (yes, the aliasing control works to some extent) at the band edges.

There is a basic mathematical issue here, though: you lose something in a hybrid bank. What you lose may vary, but it is usually "a bit of everything", things being rate gain, frequency diagonalization, time response, or power-complementary performance. You don't have to have all the problems...

Alexey Lukin
post Apr 29 2012, 17:07
Post #21








You seem to imply that aliasing in mp3 has a strong effect on the coding efficiency. Is it really true for real-world signals? I thought that sparsity of transform coefficients is not strongly affected by aliasing.
Woodinville
post Apr 29 2012, 21:55
Post #22








QUOTE (Alexey Lukin @ Apr 29 2012, 09:07) *
You seem to imply that aliasing in mp3 has a strong effect on the coding efficiency. Is it really true for real-world signals? I thought that sparsity of transform coefficients is not strongly affected by aliasing.


In the aliased region, it doubles the number of coefficients, and it also places very annoying constraints on how one must interpret the psychoacoustic threshold.

It's not easy to handle properly.


