Topic: Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

After a long period of preparation, discussion, and realization of the test, the results are finally here.

http://listening-tests.hydrogenaudio.org/i...-a/results.html

Summary: Apple won, FhG is second, Coding Technologies is third, and Nero is last.

I'm grateful to everyone who supported the test and participated in it.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #1
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.
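
For the curious, here is a rough sketch of what such a pairwise rank-based comparison could look like with SciPy; the encoder labels and ratings below are invented for illustration and are not taken from the actual test data.

```python
# Illustrative only: pairwise rank-based comparison of two encoders'
# scores. The ratings are made up, not from the listening test.
from scipy.stats import mannwhitneyu, wilcoxon

apple_tvbr = [4.5, 4.0, 3.8, 4.2, 3.9, 4.4]
fhg        = [3.9, 3.5, 3.6, 4.0, 3.7, 4.1]

# Rank-sum (Mann-Whitney U), treating the two score sets as independent samples:
print(mannwhitneyu(apple_tvbr, fhg, alternative="two-sided"))

# Signed-rank test, treating scores as paired per listener/sample block:
print(wilcoxon(apple_tvbr, fhg))
```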

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #2
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.


Completely and utterly false. We're asking listeners to grade on a reference scale, compare to a low anchor, and judge the severity of distortions, not just whether one codec is better than another.

If you're going to claim the scores only "seem like legitimate" statistical data, you'd better back up that statement. Specifically, explain why the interval scale used here (and in every previous test) suddenly has to be abandoned for an ordinal scale, or why we should drop the ITU-R BS.1116-1 methodology that these tests generally follow. Are you saying the ITU methodology only "seems" legitimate?

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #3
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Interesting results. I guess I have to add Sample20 to my standard test set at work...

Chris
If I don't reply to your reply, it means I agree with you.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #4
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.


Yes (aggregated over all listeners). Note that the graphics are simplified plots and don't show the correct confidence intervals for the bootstrap (because the tool doesn't support generating them) or for the ANOVA (IIRC, the plots don't account for the blocking).

This is why you'll see overlap in the graphics but not in the bootstrap or blocked-ANOVA results.
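
For readers wondering what this means in practice, a percentile-bootstrap confidence interval for one codec's mean score can be sketched as below; this is only an illustration of the general idea, not the exact resampling scheme used in the published analysis.

```python
# Percentile-bootstrap CI for the mean of one codec's ratings (sketch).
import numpy as np

def bootstrap_ci(ratings, n_boot=10000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings, dtype=float)
    # Resample the scores with replacement and record each resample's mean.
    means = np.array([
        rng.choice(ratings, size=ratings.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lo, hi

print(bootstrap_ci([4.6, 4.2, 4.4, 4.1, 3.9, 4.5, 4.3]))
```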

Basically, the graphics suck, but they look cute 

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #5
... whether or not a listener ranked one encoder higher or lower than another.

I thought that (Garf, correct me if necessary) this information is reflected in the p-value tables and whether or not the confidence intervals of two coders overlap.

Interesting results. I guess I have to add Sample20 to my standard test set at work...

Chris


Yes, the ANOVA test uses Friedman, which ranks the codecs. The graphs seem to be built from a parametric analysis of the test results, as if they were normally distributed data.
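
As a rough illustration of a Friedman test over the codecs (the scores below are invented; in the real analysis each position would correspond to one listener/sample block):

```python
# Friedman test: one list per codec, entries aligned by block
# (listener x sample). All numbers here are made up.
from scipy.stats import friedmanchisquare

cvbr = [4.6, 4.2, 4.4, 4.1, 4.5]
tvbr = [4.5, 4.3, 4.2, 4.0, 4.6]
fhg  = [4.0, 3.8, 4.1, 3.7, 4.2]
nero = [3.6, 3.5, 3.9, 3.4, 3.8]

stat, p = friedmanchisquare(cvbr, tvbr, fhg, nero)
print(stat, p)
```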

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #6
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.


Completely and utterly false. We're asking listeners to grade on a reference scale, compare to a low anchor, and judge the severity of distortions, not just whether one codec is better than another.

If you're going to claim the scores only "seem like legitimate" statistical data, you'd better back up that statement. Specifically, explain why the interval scale used here (and in every previous test) suddenly has to be abandoned for an ordinal scale, or why we should drop the ITU-R BS.1116-1 methodology that these tests generally follow. Are you saying the ITU methodology only "seems" legitimate?


Sorry, I only now read the caveat on the results page: "The graphs are a simple ANOVA analysis over all submitted and valid results. This is compatible with the graphs of previous listening tests, but should only be considered as a visual support for the real analysis." My initial reaction was to the box-plot graphs, not to the analysis at the bottom of the page.

The Friedman ANOVA analyses (bootstrap or not) use rank-based testing.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #7
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.

A rank-sum analysis would be even more unfavorable for the FhG encoder.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #8
It would be interesting to do a rank-sum analysis comparing each pair of encoders.  Although the numeric values assigned by the listener seem like legitimate statistical data, the only real value is whether or not a listener ranked one encoder higher or lower than another.

A rank-sum analysis would be even more unfavorable for the FhG encoder.


Actually, it gives the same order: CVBR > TVBR > FhG > CT > Nero.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #9
Actually, it gives the same order: CVBR > TVBR > FhG > CT > Nero.


Yes, but TVBR > FhG with p = 0.00 for the rank-sum test, though not for the bootstrap.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #10
The Friedman ANOVA analyses (bootstrap or not) use rank-based testing.


(Blocked) ANOVA is a parametric, means-based test. FRIEDMAN is the name of the utility (which, unsurprisingly, also supports Friedman analysis). The result posted is means-based, not rank-based. It's there mostly to allow comparison with older tests and with other statistical packages, which are more likely to support normal blocked ANOVA than the nonparametric variants. Friedman wasn't developed further because it doesn't allow p-value step-down without losing a significant amount of power when there are many comparisons, and because for high-bitrate tests it is no longer clear the results are normally distributed. That's exactly what led to the bootstrap.
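
For reference, a blocked (two-way, no-interaction) ANOVA of this kind can be sketched with statsmodels as below; the column names, listeners, and scores are assumptions for illustration and have nothing to do with the FRIEDMAN utility's actual input format or output.

```python
# Blocked ANOVA sketch: codec is the treatment, listener is the block.
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.DataFrame({
    "listener": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "codec":    ["CVBR", "FhG", "Nero"] * 3,
    "score":    [4.6, 4.0, 3.6, 4.3, 3.9, 3.5, 4.4, 4.1, 3.8],
})

# Means-based model with the listener as a blocking factor.
model = ols("score ~ C(codec) + C(listener)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```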

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #11
I should also mention that I participated in this test too. Steve Forte Rio made the ABC/HR sessions and a new key for me, and he checked my results. (You can find this key in results.zip.)
A big thank you to him for that.

If somebody is interested in analysing the results:

SampleXX - original
SampleXX_1 - Nero
SampleXX_2 - Apple CVBR
SampleXX_3 - Apple TVBR
SampleXX_4 - FhG (Winamp 5.62)
SampleXX_5 - Coding Technologies (Winamp 5.61)
SampleXX_6 - ffmpeg AAC (low anchor)
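
(A hypothetical helper for anyone scripting the analysis; only the suffix-to-encoder mapping above is from the test, the function itself is illustrative.)

```python
# Map a result-file name like "Sample07_3" to the encoder it came from.
SUFFIX_TO_ENCODER = {
    "1": "Nero",
    "2": "Apple CVBR",
    "3": "Apple TVBR",
    "4": "FhG (Winamp 5.62)",
    "5": "Coding Technologies (Winamp 5.61)",
    "6": "ffmpeg AAC (low anchor)",
}

def encoder_for(filename):
    return SUFFIX_TO_ENCODER[filename.rsplit("_", 1)[1]]

print(encoder_for("Sample07_3"))  # -> Apple TVBR
```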

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #12
Maybe there could be a legend for the X-axis abbreviations, at least under the first graph?

FhG, low_anchor* and Nero are clear enough (*though "wait, what was it again?" ;p ), but making sense of CT, CVBR, and TVBR might require going back to the test page, which I think should be unnecessary.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #13
It is interesting that the QT TVBR- and CVBR-encoded files are identical for samples 7, 10, 13, and 14 (foobar2000 comparator: "No differences in decoded data found").

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #14
zima,

Will fix it later.

It is interesting that the QT TVBR- and CVBR-encoded files are identical for samples 7, 10, 13, and 14 (foobar2000 comparator: "No differences in decoded data found").

Yes, I've noticed that too. It's interesting that listeners still rated them differently (even though they are bit-exact). Though that's normal.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #15
Thanks to all who participated in this test and to those who made this test possible, especially to IgorC.

My findings are in line with the general results, and I am actually surprised by Nero ending up rather low. It's curious that the CVBR mean is a bit higher than TVBR's, but I suppose that doesn't mean much, as each falls within the other's confidence interval.

In some personal testing about a year ago with Apple CVBR at around 128 kbps, I found it stunningly good, but I never really compared it to Nero (which I have been using for two years now). Is it safe to conclude that if a codec is better at about 100 kbps, it is also better at 128 kbps? Or might the quality of tuning differ between quality settings (and therefore bitrates)?

Many thanks again!


Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #16
Is it safe to conclude that if a codec is better at about 100 kbps, it is also better at 128 kbps? Or might the quality of tuning differ between quality settings (and therefore bitrates)?


This is a tough question. The quality of the tuning can make a difference. But barring any further information, I'd bet that the codec which is better tuned and better performing at 100 kbps will perform better at 128 kbps, too.

You could say that a codec's performance at 100 kbps is a hint, but not proof, of how it will do at 128 kbps.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #17
Many thanks for the test!

Interesting to see Nero's weaker performance. Even though I'm encoding at much higher bitrates, I should definitely have a look at that qtaacenc thing (huh, Apple stuff).

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #18
I've noticed that the previous Nero version, 1.0.7.0, produces much better quality than the latest, 1.5.4.0, at 96-100 kbps (I did blind tests, though, so don't hold TOS8 against me).

The only explanation that comes to mind is that tuning for some bitrates can probably produce regressions at others.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #19
To be perfectly honest, I am surprised that FhG did so well against Coding Technologies. Since Winamp introduced it, some of my songs seemed to retain more quality when encoded with the Coding Technologies encoder rather than FhG.
Too bad I found out about the test two days after it had already closed. I've been anxious to see the results; interesting how Nero did the worst. Great information for future reference.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #20
Interesting results. I've got to see whether QuickTime's pre-echo handling on sharp attacks has improved, though.

Sadly, I am not surprised that Nero lost. It still has trouble with hi-hats and with certain regressions that were introduced after 1.0.0.7.
"I never thought I'd see this much candy in one mission!"

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #21
If it is fair to say that many of the samples were "killer" samples, CVBR's performance is quite good. I will still continue with TVBR, as it saves substantial bits on "easier" samples.

Thanks again for this learning experience, and for the hard work of all concerned.

Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #22
It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?


Public AAC Listening Test @ ~96 kbps [July 2011]: Results

Reply #24
It appears to me that the low anchor was way too bad. Shouldn't the low anchor be at around the same quality as the contenders, but "slightly" worse than all of them?


I'm not sure about this one; I thought it should "calibrate the scale". (Because the overall quality is so high, it's less needed at the upper end.)

If you don't use an anchor, what happens is that users tend to slam the slider down for a minor distortion. The anchor serves as a reminder of what "really bad" really is.

It would be more useful if the anchor stayed the same throughout the tests, I guess. The opportunity to test ffmpeg in the same swoop was probably appealing. No idea whether it was understood that it is *this* bad.

FWIW, this is a somewhat relevant and interesting paper I hadn't seen before:
http://www.acourate.com/Download/BiasesInM...teningTests.pdf