Can audio encoders target quality w/o caring about bit rate/file size?, [OP = softrunner / split from “IETF Opus codec now ready for testing”]
IgorC
post Feb 17 2013, 20:01
Post #26





Group: Members
Posts: 1572
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



The emphasis is mine.
QUOTE (jensend @ Feb 17 2013, 03:30) *
But if you look at the Google tests you'll see that subjects rated 64kbps stereo music to have a much much greater quality difference from the original than 32kbps mono speech, even though you'd expect channel coupling to have a very major benefit.

First, those are two completely different Google tests. http://www.opus-codec.org/comparison/GoogleTest1.pdf

32 kbps speech test - 17 listeners.
64 kbps test - 9 listeners. No wonder they got considerably fewer participants for the higher-bitrate test.

Second, it's not "much much" greater quality at all.
MUSHRA scores:
32 kbps speech (mono) - Opus - 97.2
64 kbps music - Opus - 90.7
64 kbps music - LC-AAC - 90.7 (oh, please!)

Once a MUSHRA score is >90 (>4.5 in our world), all cats are lions.

This post has been edited by IgorC: Feb 17 2013, 20:12
jensend
post Feb 18 2013, 00:10
Post #27





Group: Members
Posts: 145
Joined: 21-May 05
Member No.: 22191



QUOTE (Nessuno @ Feb 17 2013, 10:18) *
I do think you are simply mistaking transparency for intelligibility: the fact that you can perfectly understand someone talking on the phone doesn't mean the phone is transparent to speech, not the way this term is used in perceptual codec evaluation, at least.
You must not have understood a single thing I was saying. I'm perfectly aware that codecs with vastly better than telephone quality may still not be transparent. Try reading again.
QUOTE (IgorC @ Feb 17 2013, 12:01) *
First, those are completely different Google's tests.
What's your point here? I already said the tests are distinct and gave a disclaimer about the limits of comparability. Yes, I didn't spend a bunch of time and money to set up a professional-quality large-scale direct comparison. I'll happily do so once you wire me $10K. (Since no test protocol can make cross-sample quality comparisons blind, the usefulness of my or any single individual's listening tests and preferences is sharply limited; an aggregate test of normal people neutral to this debate would be needed.) In the meantime this data does support my point even though it's not a rigorous proof.
QUOTE
Second, it's not "much much" greater quality at all.
I never said much greater quality, I said much greater quality difference. Listeners gave the 64kbps stereo music a score 9/100 points lower than the reference. That's a much larger difference than giving the 32kbps mono speech a score 2/100 points lower than the reference. That's despite having the advantage that, thanks to channel coupling, coding these normal stereo samples at 64kbps is considerably easier than coding a mono version at 32kbps would have been. This indicates that 32kbps mono music would likely be rated well below 32kbps mono speech.
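The arithmetic behind this comparison, as a minimal sketch. It assumes the hidden reference sits at 100, the top of the MUSHRA scale (an assumption; the reference ratings themselves are not quoted in this thread), and it uses the Opus scores listed earlier. Note that the two scores come from two separate tests, which is exactly the objection raised against comparing them.

```python
# MUSHRA scores quoted earlier in the thread (Opus, two separate Google tests).
scores = {
    "32 kbps mono speech": 97.2,
    "64 kbps stereo music": 90.7,
}

# Assumption: the hidden reference is rated 100, the top of the MUSHRA scale.
REFERENCE_SCORE = 100.0


def gap_to_reference(score, reference=REFERENCE_SCORE):
    """Points by which a codec's score falls short of the reference."""
    return round(reference - score, 1)


gaps = {name: gap_to_reference(s) for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: {gap} points below the reference")
```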

Some of you seem to be saying "maybe it's just that the speech is equally degraded but people don't find that as unacceptable as they do for music." Since people's preferences are what define quality, this makes zero sense. A VBR encoder that encodes speech at the same bitrate as music when listeners find the degradation of music at that bitrate to be annoying but would not be annoyed with speech at a marginally lower bitrate is simply not managing to maintain constant quality.

Some of you are saying "well, since the VBR encoders don't drop the bitrate for speech and the VBR encoders are absolute perfection handed down to us from Olympus by the gods, obviously speech is hard to code. The rate allocation scheme of LAME is perfect, enlightening the eyes; the bitrate->lowpass map of LAME is true and righteous altogether. Holy, holy, holy. To say otherwise is blasphemous." The authors of LAME and Vorbis, mere mortal men like ourselves, would happily tell you that their encoders' decisions are not tuned for mono speech and that their encoders have no capability to detect speech and adjust their decisions accordingly. The Vorbis devs straight up tell you in the FAQ that even though it's decent for speech they've given speech little thought and you should consider other codecs. Long ago the LAME devs added a --speech option which uses a low bitrate, forces ABR since their normal bitrate allocation is suboptimal, and forces a lower lowpass than normal (any of that sound familiar from what I've been saying?).
IgorC
post Feb 18 2013, 05:27
Post #28





Group: Members
Posts: 1572
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



jensend,

In short, interpolating between two different listening tests run under different conditions (as in this case) isn't merely imprecise, it is completely wrong. We have been through this many times.

This post has been edited by IgorC: Feb 18 2013, 05:27
Nessuno
post Feb 18 2013, 08:14
Post #29





Group: Members
Posts: 422
Joined: 16-December 10
From: Palermo
Member No.: 86562



QUOTE (jensend @ Feb 18 2013, 00:10) *
QUOTE (Nessuno @ Feb 17 2013, 10:18) *
I do think you are simply mistaking transparency for intelligibility: the fact that you can perfectly understand someone talking on the phone doesn't mean the phone is transparent to speech, not the way this term is used in perceptual codec evaluation, at least.
You must not have understood a single thing I was saying. I'm perfectly aware that codecs with vastly better than telephone quality may still not be transparent. Try reading again.

Yes, it's clearly me not understanding. But since the figures I gave in my previous post (which you did fully read, right?) clearly demonstrate that an audio encoder targeting quality doesn't care about bitrate, as per this thread's subject, could you please help me understand what exactly your point in this discussion is? Are you still speaking of signal quality evaluation, or of something completely different and unrelated, like subjective speech recognition (which, in my opinion, is completely outside the realm of this forum)?


--------------------
... I live by long distance.
softrunner
post Feb 18 2013, 13:21
Post #30





Group: Members
Posts: 48
Joined: 19-July 12
Member No.: 101579



Phew, I've finally read the whole thread. What can I say? Most of the misunderstandings in this thread revolve around a single word: "quality" (or the so-called "quality level"). What is quality when we are talking about sound? Quality here means audibility, i.e. closeness of the encoded sound to the source sound. It is not some virtual, abstract quality calculated by the encoder. It is something that human ears recognize as "inaudible difference", "very close to source", "audible, but good quality of sound", "audible, and acceptable quality, not annoying" or "unacceptable quality, annoying sound". Do you understand? I am talking about real audio listening, not about some abstract quality in the encoder's terms.
And all audio encoders produce different quality (closeness to source) for different sources at the same so-called "quality preset". It does not matter whether it is one source file with mixed content or a set of files. Suppose we have 1000 files, 500 of which are audiobooks and 500 are music like Fighter Beat by The Prodigy (very complex), and all these files are randomly mixed in one folder. Which "quality preset" would you use to encode all of this and get REAL high quality on output without using excessive bitrate? You really cannot answer. The audiobooks would need a ~64kbps VBR quality preset, and the music something like a ~208kbps VBR quality preset (if it is Opus); in both cases the output will be close to the source. You have to listen to every file, and only then can you get an approximate idea (if you are experienced enough, of course) of which preset to use for each one. This listening is a HUGE pain for a user who is really interested in a high quality-to-file-size ratio.
I am a user, and I want a tool intelligent enough to do this work for me. Nobody has invented such a tool (let's call it an encoder) so far. This tool should also operate not on audio files but on audio frames, because one file can contain frames of varying complexity, each needing a different bitrate to stay close to the source. Check my samples in the Uploads section. They are basically small looped pieces of other samples, because usually you can hear the difference only in some part of a sample, where the bitrate provided by the encoder is not high enough for the difference to be inaudible; or, put another way, the other parts do not need the high bitrate they get, because they would still sound identical to the user at a lower bitrate.

p.s. I usually post rarely. If I do not reply fast, it's normal.

This post has been edited by softrunner: Feb 18 2013, 13:25
ggf31416
post Feb 18 2013, 16:09
Post #31





Group: Members
Posts: 34
Joined: 1-June 06
Member No.: 31342



QUOTE (softrunner @ Feb 18 2013, 09:21) *
Phew, I've finally read the whole thread. What can I say? Most of the misunderstandings in this thread revolve around a single word: "quality" (or the so-called "quality level"). What is quality when we are talking about sound? Quality here means audibility, i.e. closeness of the encoded sound to the source sound. It is not some virtual, abstract quality calculated by the encoder. It is something that human ears recognize as "inaudible difference", "very close to source", "audible, but good quality of sound", "audible, and acceptable quality, not annoying" or "unacceptable quality, annoying sound". Do you understand? I am talking about real audio listening, not about some abstract quality in the encoder's terms.....


All audio codecs in VBR mode try to reach some level of audible quality as perceived by humans; however:
1) There isn't a good metric that takes the input and output and gives a result that correlates well with perceived quality, as we don't have a good model of how humans perceive quality. Video compression has good metrics like SSIM, but even they are far from perfect.
2) Some codecs may be too conservative. Suppose the psycho-model predicts that a part needs, say, 32 kbps to reach the target quality, when average audio takes 96. It may be that the part really only needs 32, in which case the target quality is achieved; or it may be that the psycho-model underestimated the bitrate, which will result in audible artifacts. So if the developers don't trust their psycho-model, they may give easy parts a higher bitrate than needed (better safe than sorry).
3) Unlike Opus, not all formats are optimized to encode audio at low bitrates.
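
Point 2 can be illustrated with a toy sketch. Nothing here is taken from a real encoder: `allocate_kbps`, the `trust` parameter and the blending rule are all hypothetical, chosen only to show how distrust in the psychoacoustic model turns into extra bits for "easy" passages.

```python
def allocate_kbps(predicted_kbps, typical_kbps=96.0, trust=1.0):
    """Blend the model's prediction with the typical bitrate (hypothetical rule).

    trust=1.0: believe the psychoacoustic model completely.
    trust=0.0: ignore it and spend the typical bitrate on easy parts.
    Only predictions below the typical bitrate are padded upward.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be between 0 and 1")
    if predicted_kbps >= typical_kbps:
        return predicted_kbps
    return trust * predicted_kbps + (1.0 - trust) * typical_kbps


# The 32-vs-96 kbps case from point 2, with full and with partial trust.
print(allocate_kbps(32.0, trust=1.0))
print(allocate_kbps(32.0, trust=0.5))
```

With partial trust the easy passage gets more bits than the model asked for: wasteful if the model was right, a safety net if it was not.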
db1989
post Feb 18 2013, 18:53
Post #32





Group: Super Moderator
Posts: 5275
Joined: 23-June 06
Member No.: 32180



QUOTE (softrunner @ Feb 18 2013, 12:21) *
Most of the misunderstandings in this thread revolve around a single word: "quality" (or the so-called "quality level"). What is quality when we are talking about sound? Quality here means audibility, i.e. closeness of the encoded sound to the source sound. It is not some virtual, abstract quality calculated by the encoder. It is something that human ears recognize as "inaudible difference", "very close to source", "audible, but good quality of sound", "audible, and acceptable quality, not annoying" or "unacceptable quality, annoying sound". Do you understand? I am talking about real audio listening, not about some abstract quality in the encoder's terms.
That’s strange, seeing as you wrote this:
QUOTE (softrunner @ Feb 14 2013, 02:33) *
The x264 video encoder has an encoding mode called Constant Rate Factor. In this mode a number (16, 17, etc.) defines the desired quality (lower means better quality and higher bitrate), and the encoder does not care about bitrate, only about keeping the rate factor constant. The question is why nobody has invented something similar for audio encoding (except lossyWAV, which needs too much bitrate for acceptable quality).
I think every encoder with real vbr (not abr) does that? Lame has V(0-9), QT AAC has --tvbr (0-127), Vorbis has -q((-2)-10). The bitrate may vary a lot with these settings between different songs/genres.

Either you still don’t understand what people have been saying, or you expect perfection from perceptual encoders. Neither will make this thread useful.
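
As a small illustration of the quality scales quoted above, here is a sketch that builds only the quality-mode arguments for each encoder. The ranges are the ones given in the quoted post; the front-end names (`lame`, `qaac`, `oggenc`) are an assumption about which tools are meant, and the helper `quality_args` is hypothetical.

```python
# Quality-scale ranges as quoted in the post above; the executable names
# (lame, qaac, oggenc) are an assumption about which front ends are meant.
QUALITY_MODES = {
    "lame":   {"flag": "-V",     "lo": 0,  "hi": 9},    # lower = better
    "qaac":   {"flag": "--tvbr", "lo": 0,  "hi": 127},  # higher = better
    "oggenc": {"flag": "-q",     "lo": -2, "hi": 10},   # higher = better
}


def quality_args(encoder, level):
    """Return the quality-mode arguments for an encoder, e.g. ['-V', '5']."""
    mode = QUALITY_MODES[encoder]
    if not mode["lo"] <= level <= mode["hi"]:
        raise ValueError(
            f"{encoder}: level {level} outside {mode['lo']}..{mode['hi']}"
        )
    return [mode["flag"], str(level)]


print(quality_args("lame", 5))
```

No bitrate appears anywhere in these arguments: the user picks a point on a quality scale and the resulting bitrate depends on the material.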
splice
post Feb 18 2013, 21:08
Post #33





Group: Members
Posts: 125
Joined: 23-July 03
Member No.: 7935



QUOTE (softrunner @ Feb 14 2013, 11:33) *
Well, if you mix an audiobook and complex electronic music in one file, which bitrate will you use for this file? Opus at 64 kbps will give good quality for the part that contains the audiobook, but the quality of the musical part will be very low. And 176 kbps will give good quality for the music, but that bitrate will be excessive for the audiobook. I would like to have an encoder that takes "good quality" as its input option and gives ~64 kbps for the audiobook part of the file and ~176 kbps for the musical part. None of the modern audio encoders can do this. :(


Your problem hinges on the point that we usually accept a lower level of quality for the spoken word than we do for music. Likewise, we can accept poor quality printing of text, so long as it is legible, but we prefer images to be high quality.

You have two choices - either develop (or have developed for you) an encoder that recognises sections containing the spoken word and adopts a different quality metric for them, or do it manually - encode the spoken words separately from the music and join (edit) the sections together afterwards. I assume that you already generate the speech and music separately and join them afterwards, so it should not be too much of a change in workflow. I know you can do this with MP3. I don't know about Opus.



--------------------
Regards,
Don Hills
jensend
post Feb 18 2013, 23:34
Post #34





Group: Members
Posts: 145
Joined: 21-May 05
Member No.: 22191



QUOTE (splice @ Feb 18 2013, 13:08) *
Your problem hinges on the point that we usually accept a lower level of quality for the spoken word than we do for music. Likewise, we can accept poor quality printing of text, so long as it is legible, but we prefer images to be high quality.
*sigh* No. Quality!=PSNR. Long ago, a wise man once said,
QUOTE (jensend @ Feb 17 2013, 16:10) *
Some of you seem to be saying "maybe it's just that the speech is equally degraded but people don't find that as unacceptable as they do for music." Since people's preferences are what define quality, this makes zero sense. A VBR encoder that encodes speech at the same bitrate as music when listeners find the degradation of music at that bitrate to be annoying but would not be annoyed with speech at a marginally lower bitrate is simply not managing to maintain constant quality.


Nessuno: no, this isn't about recognizability either. (Recognizability could be considered for music too- e.g. "regardless of how awful it sounds, I can tell- just barely- this is Beethoven's 9th.") It's about quality. This can't be reduced to a binary distinction, but if you must have a binary distinction to start with and you want something more descriptive than "good vs bad" perhaps the best one is "annoying vs not annoying" (c.f. MUSHRA, ABC/HR). It may be distinguishably different under ideal conditions- so what? Is it any worse, or would you be perfectly fine with listening to this instead of that? On the opposite end of the quality spectrum, it may be recognizable- so what? Is it any good, or would you tear your hair out if you had to listen to it for any substantial amount of time?

Softrunner: It appears you were substantially more confused than I thought you were. Others esp. ggf31416 and db1989 are doing a good job of explaining why.

IgorC: The test was done with the same low anchors and quite likely a subset of the same listeners using the same equipment. You think that the differences significantly biased the results in one coherent direction? Whatever. My opinion is of course not based on these tests but on my own 12-64kbps listening comparisons. Feel free to try your own. Of course, as I already said, no test protocol can make cross-sample quality comparisons blind, so whenever you can ABX the speech there's nothing preventing you from saying "gee, I'm going to rate this a 2 and the encoded music a 4.9, just 'cause I wanna show jensend is wrong."

Please note that despite what it may look like from the pile-on in this thread, my view appears to be in the majority. Just about everybody recommends bitrates for speech they would not recommend for mono music (or recommend bitrates for speech less than half what they recommend for stereo music despite the savings of channel coupling). This is not because they expect people to just put up with being more annoyed.

This post has been edited by jensend: Feb 18 2013, 23:35
Nessuno
post Feb 19 2013, 11:36
Post #35





Group: Members
Posts: 422
Joined: 16-December 10
From: Palermo
Member No.: 86562



QUOTE (jensend @ Feb 18 2013, 23:34) *
Nessuno: no, this isn't about recognizability either. (Recognizability could be considered for music too- e.g. "regardless of how awful it sounds, I can tell- just barely- this is Beethoven's 9th.") It's about quality. This can't be reduced to a binary distinction, but if you must have a binary distinction to start with and you want something more descriptive than "good vs bad" perhaps the best one is "annoying vs not annoying" (c.f. MUSHRA, ABC/HR). It may be distinguishably different under ideal conditions- so what? Is it any worse, or would you be perfectly fine with listening to this instead of that? On the opposite end of the quality spectrum, it may be recognizable- so what? Is it any good, or would you tear your hair out if you had to listen to it for any substantial amount of time?

Do you know what the quality parameter in all true VBR modes of every modern encoder stands for? Do you know that, for example, AAC accepts 128 different quality level values in VBR mode? Do you know that this quality parameter is a dimensionless number and in fact is a (guess what?!?) qualitative property of the desired output?

What misleads you is that you still think that you always set a desired bitrate (which is wrong, as I showed you before with numbers). In fact you still say:
QUOTE
Just about everybody recommends bitrates for speech they would not recommend for mono music (or recommend bitrates for speech less than half what they recommend for stereo music despite the savings of channel coupling).

This remark is completely out of context when you set a VBR mode.

In the end, what you want is an encoding mode which takes no parameter at all and selects the right (for what?!?!?) output quality level by understanding whether its input is speech or music or anything in between, because it knows how much you'll be annoyed by artifacts in each case.

So it must be smart enough to understand that a musical piece (target: transparency) could contain a speech segment (opera, anyone?), that a speech (target: not annoying) could contain background music, that even if the input is music, this time the user would accept a lower quality output (target: a few artifacts above the audible threshold) because he's planning to use it for listening on the go, that even if the input is speech, the user would like to have a higher quality output (target: better than just enough, but not that much anyway) because it is a lecture in a foreign language and the speaker also has a strong regional accent, so it is harder to comprehend...

All of the above can be easily accomplished just by selecting a VBR mode and an appropriate quality level among the ones that the given encoder accepts. Then the encoder will choose the lowest bitrate possible depending on that quality level, on its psychoacoustic model and on the instantaneous properties of the input signal. It's just self-evident that the desired quality level must be a user choice, not an encoder one!


--------------------
... I live by long distance.
zerowalker
post Feb 19 2013, 13:17
Post #36





Group: Members
Posts: 268
Joined: 6-August 11
Member No.: 92828



I am pretty sure that Vorbis is Bitrate based though.

But I would really like this mode in Opus, as I always use it in x264; it's easier to have a favorite quality level than a bitrate, as the bitrate needed to produce that quality can differ so much.
Garf
post Feb 19 2013, 17:37
Post #37


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (zerowalker @ Feb 19 2013, 13:17) *
I am pretty sure that Vorbis is Bitrate based though.


Vorbis is quality based. If you enable bitrate management, it basically shifts quality around to hit the bitrate you asked for. This is why it's slower with managed bitrate than with quality mode.

QUOTE
But I would really like this mode in Opus, as I always use it in x264; it's easier to have a favorite quality level than a bitrate, as the bitrate needed to produce that quality can differ so much.


A quality mode == a bitrate mode that reaches an average bitrate over a large corpus of music.

Opus has an average bitrate mode that is simultaneously a pure quality based mode. This is not a contradiction. Encoding music at a given quality with a given codec will eventually even out to a certain bitrate. The average bitrate. You can map that back to the quality.
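
A minimal sketch of that mapping. The per-track figures and the quality-to-bitrate table are invented for illustration, and `nominal_bitrate` and `pick_quality` are hypothetical helpers, not part of any encoder.

```python
from statistics import mean

# Hypothetical per-track average bitrates (kbps) produced by ONE fixed
# quality setting. The numbers are invented for illustration.
corpus_kbps = {
    "audiobook chapter": 62.0,
    "solo piano": 98.0,
    "pop song": 131.0,
    "dense electronic track": 189.0,
}


def nominal_bitrate(per_track_kbps):
    """Average bitrate the quality setting evens out to over the corpus."""
    return mean(per_track_kbps.values())


def pick_quality(target_kbps, nominal_by_quality):
    """Map a requested average bitrate back to the closest quality level."""
    return min(
        nominal_by_quality,
        key=lambda q: abs(nominal_by_quality[q] - target_kbps),
    )


print(nominal_bitrate(corpus_kbps))
# Hypothetical table of quality level -> corpus-average bitrate.
print(pick_quality(128, {3: 96.0, 4: 120.0, 5: 160.0}))
```

Individual tracks land far from the average, yet the average itself is a stable property of the quality setting, so asking for "about 128 kbps" and asking for the matching quality level amount to the same request.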
Silversight
post Feb 19 2013, 17:50
Post #38





Group: Members
Posts: 310
Joined: 5-April 06
From: Aachen, Germany
Member No.: 29203



QUOTE (zerowalker @ Feb 19 2013, 13:17) *
I am pretty sure that Vorbis is Bitrate based though.

I am pretty sure it is not. Vorbis can be forced to work in ABR or faux-CBR mode, but the standard -qx levels are VBR. The "nominal bitrate" written to the Vorbis header is the statistical average bitrate achieved with the given set of internal parameters, but it does not have to have a specific correlation to the actual file content. I have files encoded with -q7 that have the bitrate_nominal field set to 224 while the actual average bitrate is ~190 kbit/s. Like with all VBR codecs, it depends on the material.
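
To check this on your own files, the actual average bitrate follows from size and duration alone. A minimal sketch: the file size and duration below are made up, and container overhead is ignored.

```python
def average_kbps(size_bytes, duration_seconds):
    """Actual average bitrate in kbit/s (1 kbit = 1000 bits)."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return size_bytes * 8 / duration_seconds / 1000


# Hypothetical file: 5,700,000 bytes of audio data lasting 240 seconds.
actual = average_kbps(5_700_000, 240)
nominal = 224  # bitrate_nominal header value mentioned in the post above
print(f"nominal {nominal} kbit/s, actual {actual:.0f} kbit/s")
```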


--------------------
Nothing is impossible if you don't need to do it yourself.
jensend
post Feb 19 2013, 17:53
Post #39





Group: Members
Posts: 145
Joined: 21-May 05
Member No.: 22191



Nessuno, if I wasn't aware that VBR encoders try to do constant-quality, we wouldn't be having this discussion. Yes, they vary their bitrate for different kinds of input. That doesn't mean their rate allocation scheme is perfect.
QUOTE (Nessuno @ Feb 19 2013, 03:36) *
What misleads you is that you still think that you always set a desired bitrate (which is wrong, as I showed you before with numbers). In fact you still say:
QUOTE
Just about everybody recommends bitrates for speech they would not recommend for mono music (or recommend bitrates for speech less than half what they recommend for stereo music despite the savings of channel coupling).
This remark is completely out of context when you set a VBR mode.
No, it isn't "out of context" at all. A quality setting will not give consistent bitrate results for individual samples, but it will give a fairly consistent average bitrate over diverse collections of samples, as Garf just said. For instance, for LAME, people routinely recommend using -V4/V5 for stereo music, which generally average 165/130 kbps respectively on such input, while for speech they recommend either -V7/V8 (average something like 56/48kbps for mono respectively) or ABR modes (e.g. the --preset voice setting, which downmixes to mono, resamples to 32kHz, and uses 56kbps ABR).

I'm not asking for something which will target transparency for music and non-annoyance for speech. That would not be constant quality. Nor am I asking for the encoder to magically discern what quality level the user wants. I have no idea how you're coming up with these absurd straw men.

QUOTE
All of the above can be easily accomplished just selecting a VBR mode and an appropriate quality level
No, constant quality is only truly accomplished if the VBR encoder's rate allocation is flawless. I'm not asking for perfection, and in most cases the rate allocation in modern encoders does a pretty decent job. The distinction between speech and music is the one area I'm aware of where current encoders seem to deviate substantially from the constant-quality ideal and where substantial improvement therefore seems possible. For many VBR encoders, using the codec for speech is somewhat rare, and mixed-content files where people can't just tweak the parameters to better suit speech are much rarer, so the codec developers understandably have little interest in trying to solve the problem of speech/music classification and doing additional tuning for speech. But this discussion started by talking about Opus, which is better optimized for speech and already has code to classify input as speech or music. People are interested in seeing that classification used to improve bitrate allocation.
Nessuno
post Feb 19 2013, 22:35
Post #40





Group: Members
Posts: 422
Joined: 16-December 10
From: Palermo
Member No.: 86562



jensend, let's stop using the terms "bitrate" and "quality" for a moment. What you expect from an ideal encoder is that it automatically switches to mono and allows a higher amount of noise and distortion between input and output when it recognizes the input as speech, right?

BTW: how should it consider an opus like Glenn Gould's radio documentaries?


--------------------
... I live by long distance.
m0rbidini
post Feb 20 2013, 03:46
Post #41





Group: Members
Posts: 212
Joined: 1-October 01
From: Lisbon, Portugal
Member No.: 127



I think encoders in true VBR mode could in theory detect "pure speech" sections (although this may not be very simple to define) and end up outputting lower average bitrates for those sections while maintaining a quality level that results in higher average bitrates for non speech sections, ie with both sections having the same average results in listening tests. That may be a future improvement.
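
A rough sketch of what that would mean for a mixed file. Everything here is hypothetical: the section labels stand in for the output of some speech/music classifier, and the two bitrates simply echo the ~64 / ~176 kbps figures mentioned earlier in the thread as the rates at which each content type would score the same in a listening test.

```python
from dataclasses import dataclass


@dataclass
class Section:
    label: str      # "speech" or "music", from a hypothetical classifier
    seconds: float  # duration of the section


# Hypothetical bitrates at which each content type reaches the SAME
# listening-test score (figures borrowed from earlier in the thread).
KBPS_FOR_EQUAL_QUALITY = {"speech": 64.0, "music": 176.0}


def file_average_kbps(sections):
    """Duration-weighted average bitrate of a mixed-content file."""
    total_seconds = sum(s.seconds for s in sections)
    total_kbits = sum(
        KBPS_FOR_EQUAL_QUALITY[s.label] * s.seconds for s in sections
    )
    return total_kbits / total_seconds


# Five minutes of speech followed by 100 seconds of music.
mixed = [Section("speech", 300.0), Section("music", 100.0)]
print(file_average_kbps(mixed))
```

The file as a whole ends up well below the music bitrate, while each section still gets what it needs for the common quality target.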

If that's not so simple maybe it's just the case that devs prefer to err on the safe side.

Are there any lossy codec devs or researchers that can share any insight on this?
Garf
post Feb 20 2013, 11:19
Post #42


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (m0rbidini @ Feb 20 2013, 03:46) *
I think encoders in true VBR mode could in theory detect "pure speech" sections (although this may not be very simple to define) and end up outputting lower average bitrates for those sections while maintaining a quality level that results in higher average bitrates for non speech sections, ie with both sections having the same average results in listening tests. That may be a future improvement.

If that's not so simple maybe it's just the case that devs prefer to err on the safe side.

Are there any lossy codec devs or researchers that can share any insight on this?


That's basically correct, and Opus 1.1+ works that way. It doesn't only fiddle with the bitrate, it can change the underlying encoder mode even.
Nessuno
post Feb 20 2013, 18:19
Post #43





Group: Members
Posts: 422
Joined: 16-December 10
From: Palermo
Member No.: 86562



QUOTE (Garf @ Feb 20 2013, 11:19) *
That's basically correct, and Opus 1.1+ works that way. It doesn't only fiddle with the bitrate, it can change the underlying encoder mode even.

I hope that when this feature reaches production, it will be user-selectable.


--------------------
... I live by long distance.