Can audio encoders target quality w/o caring about bit rate/file size?, [OP = softrunner / split from “IETF Opus codec now ready for testing”]
softrunner
post Feb 14 2013, 02:33
Post #1





Group: Members
Posts: 48
Joined: 19-July 12
Member No.: 101579



QUOTE (Martel @ Jan 1 2013, 14:46) *
QUOTE (softrunner @ Dec 29 2012, 04:50) *
I don't know whether it is possible for an encoder to do such an analysis of the source audio, but it would be great if so.
It's only a matter of finding the right formula/algorithm.

The x264 video encoder has an encoding mode called Constant Rate Factor. In this mode a number (16, 17, etc.) defines the desired quality (lower means better quality and higher bitrate), and the encoder does not care about bitrate, only about keeping the rate factor constant. The question is why nobody has invented something similar for audio encoding (except lossyWAV, which needs too much bitrate for acceptable quality).
----------------
Opus 1.1 Alpha has some bugs, which can be found using samples from the thread "High Frequency Listening Test Samples". For example, at 16-24 kbps Opus gives this:

and for 32-40 kbps it gives this:

For samples 1_12kHz, 1_20kHz, 2_8kHz, 2_12kHz and 2_20kHz, Opus sounds wrong even at 512 kbps.
The full set of files is here (problematic samples are marked with an exclamation mark). I hope the developers will use these samples in their work.

This post has been edited by softrunner: Feb 14 2013, 02:34
Big_Berny
post Feb 14 2013, 12:11
Post #2





Group: Members
Posts: 242
Joined: 9-February 03
Member No.: 4921



QUOTE (softrunner @ Feb 14 2013, 02:33) *
The x264 video encoder has an encoding mode called Constant Rate Factor. In this mode a number (16, 17, etc.) defines the desired quality (lower means better quality and higher bitrate), and the encoder does not care about bitrate, only about keeping the rate factor constant. The question is why nobody has invented something similar for audio encoding (except lossyWAV, which needs too much bitrate for acceptable quality).

I think every encoder with real VBR (not ABR) does that? LAME has -V (0 to 9), QT AAC has --tvbr (0 to 127), Vorbis has -q (-2 to 10). With these settings the bitrate may vary a lot between different songs/genres.
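To make the comparison with x264's CRF concrete, here is a minimal Python sketch that builds quality-targeted command lines for the three encoders mentioned above. The helper name `quality_mode_command` and the file names are made up for illustration, and `qaac` is assumed as the QuickTime AAC front end; only the quality flags (-V, -q, --tvbr) come from the post. In each case you choose a quality level and the resulting bitrate is whatever the encoder decides the material needs.

```python
def quality_mode_command(encoder, quality, src, dst):
    """Build an argv list for a quality-targeted (true VBR) encode.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if encoder == "lame":
        # lame -V <n>: lower number means higher quality
        return ["lame", "-V", str(quality), src, dst]
    if encoder == "oggenc":
        # oggenc -q <n>: higher number means higher quality
        return ["oggenc", "-q", str(quality), src, "-o", dst]
    if encoder == "qaac":
        # qaac --tvbr <n>: higher number means higher quality (assumed front end)
        return ["qaac", "--tvbr", str(quality), src, "-o", dst]
    raise ValueError("unknown encoder: " + encoder)


if __name__ == "__main__":
    # No bitrate appears anywhere in these commands, only a quality level.
    for enc, q, out in [("lame", 2, "out.mp3"),
                        ("oggenc", 5, "out.ogg"),
                        ("qaac", 91, "out.m4a")]:
        print(" ".join(quality_mode_command(enc, q, "in.wav", out)))
```

The resulting list can be handed to `subprocess.run` if the encoder is installed. The quality values 2, 5 and 91 are arbitrary examples, not recommendations.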
IgorC
post Feb 17 2013, 02:22
Post #3





Group: Members
Posts: 1556
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



Speech isn't that easy to code. http://research.nokia.com/files/public/%5B..._Opus_Codec.pdf

Opus uses hybrid mode only at very low bitrates. Speech requires a bitrate comparable to that of music for (near-)transparent or high quality. There is no such thing as a smart encoder that does "64 kbps for speech and 128 kbps for music".
Suffice it to say that Opus 1.1 alpha (--bitrate 64) produces bitrates considerably above 64 kbps on speech. It doesn't go any lower.
db1989
post Feb 17 2013, 03:15
Post #4





Group: Super Moderator
Posts: 5275
Joined: 23-June 06
Member No.: 32180



QUOTE (IgorC @ Feb 17 2013, 01:22) *
Speech isn't that easy to code. http://research.nokia.com/files/public/%5B..._Opus_Codec.pdf […] Speech requires a bitrate comparable to that of music for (near-)transparent or high quality.
Thanks for this! It supports earlier suppositions that bitrates for speech that are similar to music point to speech being more complex than we estimate, not to any failing in VBR modes.

I guess we’re conditioned to think of speech as requiring low bitrates, when in fact it’s often just a case of people forcing low bitrates due to constraints upon bandwidth or capacity, or even just habit. I can appreciate that actually encoding speech at a level that matches music may be more of a challenge than is assumed. That was the case for me, anyway. :)
jensend
post Feb 17 2013, 06:30
Post #5





Group: Members
Posts: 143
Joined: 21-May 05
Member No.: 22191



QUOTE (saratoga @ Feb 16 2013, 17:28) *
The fact that codecs do not decrease the bitrate as much as you expect them to suggests that transparent speech is harder to encode than you think.

QUOTE (IgorC @ Feb 16 2013, 18:22) *
Speech isn't that easy to code. Opus uses hybrid mode only at very low bitrates. Speech requires a bitrate comparable to that of music for (near-)transparent or high quality.

If you think of audio quality only in terms of the binary transparent vs non-transparent distinction, you divorce yourselves from the realities of non-trained everyday listening by the non-golden-eared public, you exclude a whole host of uses for which people would prefer not to pay the extra bitrate costs for very rapidly diminishing quality returns, and you will be forever chasing corner cases and ephemeral differences.

24-32kbps Opus speech is quite good. While trained listeners can frequently distinguish 32kbps hybrid mode mono Opus speech from the original in careful repeated listening in controlled environments, both the Google and Nokia 2011 Opus listening tests showed that in MOS, MUSHRA, or ABC/HR testing, people rate 32kbps Opus practically on a level with the originals. It's true that those tests showed 32kbps had nonoverlapping error bars with the originals, but that's not true for 40kbps, and remember that's with a two-year-old Opus encoder and there's been plenty of improvement since then. If for mono speech current 32kbps Opus doesn't qualify as "high quality" then I don't care in the slightest what high quality is.

On the other hand, while low-bitrate Opus doesn't totally mangle music like most speech codecs, the difference is considerably more clear. I don't see any large-scale mono music tests readily available to back up my personal listening tests and observations, and if there were such tests, their setup wouldn't be designed for making cross-sample quality comparisons with speech. But if you look at the Google tests you'll see that subjects rated 64kbps stereo music to have a much much greater quality difference from the original than 32kbps mono speech, even though you'd expect channel coupling to have a very major benefit.

Though the difference is less dramatic in other codecs which don't have speech-oriented technologies, it's still there. Part of this is because a lowpass that butchers the sound of many music samples will not have objectionable - or, often, readily-noticed - effects on speech. (The bitrate->lowpass cutoff maps in LAME and Vorbis were designed for music content - in fact, the one in LAME isn't even well tuned for mono, basically just naively scaling the target bitrate by the arbitrary factor of 3/2 before plugging it into a table which is tuned for stereo - and overriding the lowpass can enable them to do considerably better with speech at <56kbps bitrates.) There are many other factors.
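The naive mono handling described above can be sketched roughly as follows. This is not LAME's actual code: the table values, the function names and the nearest-lower-entry lookup are all hypothetical, and only the 3/2 scaling and the idea of a manual lowpass override come from the post.

```python
import bisect

# HYPOTHETICAL stereo-tuned table of (bitrate in kbps, lowpass cutoff in Hz).
# Illustrative numbers only; these are not LAME's or Vorbis's real values.
STEREO_LOWPASS_TABLE = [(32, 7000), (64, 11000), (96, 15000), (128, 17000)]


def stereo_lowpass(bitrate_kbps):
    """Cutoff of the highest table entry whose bitrate is <= the request.

    Requests below the first entry fall back to the first entry.
    """
    rates = [rate for rate, _ in STEREO_LOWPASS_TABLE]
    index = bisect.bisect_right(rates, bitrate_kbps) - 1
    return STEREO_LOWPASS_TABLE[max(index, 0)][1]


def mono_lowpass_naive(bitrate_kbps):
    """Mono case as described in the post: scale by 3/2, reuse the stereo table."""
    return stereo_lowpass(bitrate_kbps * 3 / 2)


def choose_lowpass(bitrate_kbps, mono, override_hz=None):
    """Pick a cutoff, letting the user override the table (e.g. for speech)."""
    if override_hz is not None:
        return override_hz
    return mono_lowpass_naive(bitrate_kbps) if mono else stereo_lowpass(bitrate_kbps)


if __name__ == "__main__":
    print(choose_lowpass(64, mono=False))
    print(choose_lowpass(64, mono=True))
    print(choose_lowpass(32, mono=True, override_hz=5000))
```

The point of the sketch is only that nothing in such a lookup knows whether the input is speech or music, so a manual override is the only way to tell it that a lower cutoff would be acceptable.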

On top of that, recorded music is more likely to have important stereo separation, while for speech we're generally listening to a single source at a time and so most recorded material is either mono, "stereo" with both channels practically identical (e.g. identical except for dithering), or easily representable by intensity stereo. Any decent VBR encoder will manage to reduce its bitrate substantially when stereo separation is practically nil but if such content were in a separate file you'd be well-advised to explicitly tell it to downmix, saving a little bitrate and avoiding the possibility of some nonoptimal encoder decisions. Opus has the fairly unique capacity to switch to a true mono mode and back within the same stream, but opusenc doesn't use it, and at low bitrates it doesn't seem to reduce its bitrate as much for such content as one might anticipate.

QUOTE (db1989 @ Feb 16 2013, 19:15) *
Thanks for this! It supports earlier suppositions that bitrates for speech that are similar to music point to speech being more complex than we estimate, not to any failing in VBR modes.
It does no such thing. It has no tests relating to music quality, and most definitely no tests where people were asked to directly compare the quality of encoded speech samples to that of encoded music. It tells us that 40kbps hybrid Opus with a two-year-old still-under-heavy-development encoder was statistically tied with the fullband original speech.
IgorC
post Feb 17 2013, 20:01
Post #6





Group: Members
Posts: 1556
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



The emphasis is mine.
QUOTE (jensend @ Feb 17 2013, 03:30) *
But if you look at the Google tests you'll see that subjects rated 64kbps stereo music to have a much much greater quality difference from the original than 32kbps mono speech, even though you'd expect channel coupling to have a very major benefit.

First, those are two completely different Google tests. http://www.opus-codec.org/comparison/GoogleTest1.pdf

32 kbps speech test - 17 listeners.
64 kbps test - 9 listeners. No wonder they got considerably fewer participants for the higher-bitrate test.

Second, it's not "much much" greater quality at all.
MUSHRA scores:
32 kbps speech (mono) - Opus - 97.2
64 kbps music - Opus - 90.7
64 kbps music - LC-AAC - 90.7 (oh, please!)

Once the MUSHRA score is >90 (>4.5 in our world), all cats are lions.

This post has been edited by IgorC: Feb 17 2013, 20:12
jensend
post Feb 18 2013, 00:10
Post #7





Group: Members
Posts: 143
Joined: 21-May 05
Member No.: 22191



QUOTE (Nessuno @ Feb 17 2013, 10:18) *
I do think you are simply mistaking transparency for intelligibility: the fact that you can perfectly understand someone talking on the phone doesn't mean the phone is transparent to speech, not the way this term is used in perceptual codec evaluation, at least.
You must not have understood a single thing I was saying. I'm perfectly aware that codecs with vastly better than telephone quality may still not be transparent. Try reading again.
QUOTE (IgorC @ Feb 17 2013, 12:01) *
First, those are completely different Google's tests.
What's your point here? I already said the tests are distinct and gave a disclaimer about the limits of comparability. Yes, I didn't spend a bunch of time and money to set up a professional-quality large-scale direct comparison. I'll happily do so once you wire me $10K. (Since no test protocol can make cross-sample quality comparisons blind, the usefulness of my or any single individual's listening tests and preferences is sharply limited; an aggregate test of normal people neutral to this debate would be needed.) In the meantime this data does support my point even though it's not a rigorous proof.
QUOTE
Second, it's not "much much" greater quality at all.
I never said much greater quality, I said much greater quality difference. Listeners gave the 64kbps stereo music a score 9/100 points lower than the reference. That's a much larger difference than giving the 32kbps mono speech a score 2/100 points lower than the reference. That's despite having the advantage that, thanks to channel coupling, coding these normal stereo samples at 64kbps is considerably easier than coding a mono version at 32kbps would have been. This indicates that 32kbps mono music would likely be rated well below 32kbps mono speech.
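The arithmetic behind "much greater quality difference" can be written out as a small Python sketch. The gaps of 9 and 2 points are the rounded figures stated in this post and the Opus scores 90.7 and 97.2 are the ones IgorC quoted; the function names are invented for illustration, and the implied reference scores are only back-calculated from those rounded numbers, not read from the test report.

```python
# Shortfall from the reference on the 0-100 MUSHRA scale, as stated in the post.
MUSIC_GAP_64K_STEREO = 9
SPEECH_GAP_32K_MONO = 2


def gap_ratio(larger_gap, smaller_gap):
    """How many times larger one shortfall from the reference is than another."""
    return larger_gap / smaller_gap


def implied_reference(codec_score, gap):
    """Reference score implied by a codec score and its stated shortfall."""
    return codec_score + gap


if __name__ == "__main__":
    print(gap_ratio(MUSIC_GAP_64K_STEREO, SPEECH_GAP_32K_MONO))
    print(implied_reference(90.7, MUSIC_GAP_64K_STEREO))
    print(implied_reference(97.2, SPEECH_GAP_32K_MONO))
```

On these figures the music shortfall is 4.5 times the speech shortfall, which is the comparison being made here. It says nothing about statistical significance, since the two tests had different listeners and samples.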

Some of you seem to be saying "maybe it's just that the speech is equally degraded but people don't find that as unacceptable as they do for music." Since people's preferences are what define quality, this makes zero sense. A VBR encoder that encodes speech at the same bitrate as music when listeners find the degradation of music at that bitrate to be annoying but would not be annoyed with speech at a marginally lower bitrate is simply not managing to maintain constant quality.

Some of you are saying "well, since the VBR encoders don't drop the bitrate for speech and the VBR encoders are absolute perfection handed down to us from Olympus by the gods, obviously speech is hard to code. The rate allocation scheme of LAME is perfect, enlightening the eyes; the bitrate->lowpass map of LAME is true and righteous altogether. Holy, holy, holy. To say otherwise is blasphemous." The authors of LAME and Vorbis, mere mortal men like ourselves, would happily tell you that their encoders' decisions are not tuned for mono speech and that their encoders have no capability to detect speech and adjust their decisions accordingly. The Vorbis devs straight up tell you in the FAQ that even though it's decent for speech they've given speech little thought and you should consider other codecs. Long ago the LAME devs added a --speech option which uses a low bitrate, forces ABR since their normal bitrate allocation is suboptimal, and forces a lower lowpass than normal (any of that sound familiar from what I've been saying?).
Nessuno
post Feb 18 2013, 08:14
Post #8





Group: Members
Posts: 422
Joined: 16-December 10
From: Palermo
Member No.: 86562



QUOTE (jensend @ Feb 18 2013, 00:10) *
QUOTE (Nessuno @ Feb 17 2013, 10:18) *
I do think you are simply mistaking transparency for intelligibility: the fact that you can perfectly understand someone talking on the phone doesn't mean the phone is transparent to speech, not the way this term is used in perceptual codec evaluation, at least.
You must not have understood a single thing I was saying. I'm perfectly aware that codecs with vastly better than telephone quality may still not be transparent. Try reading again.

Yes, it's clearly me not understanding. But since the figures I gave in my previous post (which you did fully read, right?) clearly demonstrate that an audio encoder targeting quality doesn't care about bitrate, as per this thread's subject, could you please help me understand better what exactly your point in this discussion is? Are you still speaking of signal quality evaluation or something completely different and uncorrelated, like subjective speech recognition (which, in my opinion, is something completely out of the realm of this forum)?


--------------------
... I live by long distance.
