AAC @ 128kbps listening test discussion
rjamorim
post Mar 1 2004, 18:18
Post #301


Rarewares admin


Group: Members
Posts: 7515
Joined: 30-September 01
From: Brazil
Member No.: 81



QUOTE (rjamorim @ Feb 29 2004, 11:44 PM)
Can someone enlighten me on the origins of Velvet?
http://lame.sourceforge.net/download/samples/velvet.wav

All I know is that it was submitted by Roel (r3mix).

Does anybody know the artist (Velvet Underground?), title, and album of this song? Also, what would be the style? (There's no way to tell from just the introduction.)

ff123 already enlightened me about it. Thank you very much.

Details are available at the listening test results page.


--------------------
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org
bond
post Mar 1 2004, 18:18
Post #302





Group: Members
Posts: 881
Joined: 11-October 02
Member No.: 3523



QUOTE (rjamorim @ Mar 1 2004, 06:12 PM)
Nope. I couldn't decrypt your sample 09 results. It's the only result file that gave me problems in the entire test. I sent it to schnofler so that he can investigate. Sorry about that.

Damn, I shouldn't have tried to manipulate the result files wink.gif


--------------------
I know, that I know nothing (Socrates)
rjamorim
post Mar 5 2004, 07:49
Post #303


Rarewares admin


Group: Members
Posts: 7515
Joined: 30-September 01
From: Brazil
Member No.: 81



A VERY IMPORTANT STATEMENT

OK. It seems I f-ed up very badly this time.

First, let me specify what ISN'T wrong: the ranking values are absolutely correct, as are the screening methodology and the statistical calculations.

What is wrong: The error bars.

I didn't check how the error bars were being drawn in the Excel spreadsheet I got from ff123. I thought the plots were getting their values from a certain cell, but actually the values were hard-coded in the plot-building routines.

So the error bars are, to this day, the same ones used in his 64kbps listening test. This affects all my listening tests: both the overall plots and the individual ones.

I can't express how sorry I am.

Tomorrow I'll start fixing all the test results pages. Until I announce that the results have been fixed, please disregard them.

In case someone is in a hurry to check the corrected zoomed result plot for the AAC test:
http://pessoal.onda.com.br/rjamorim/screen2.png
The only thing that changed is that iTunes is now clearly first place and Nero is second place.

Again, I'm terribly sorry. I can already feel my credibility going down the drain. sad.gif

Kind regards;

Roberto Amorim.


--------------------
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org
ff123
post Mar 5 2004, 08:00
Post #304


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (rjamorim @ Mar 4 2004, 10:49 PM)
What is wrong: The error bars.

I didn't check how the error bars were being drawn in the excel spreadsheet I got from ff123. I thought the plots were getting values from a certain cell, but actually the values were hard-coded in the plot building routines.

The fault is also mine for not making it perfectly clear how I was drawing the error bars. Plus, I violated a basic spreadsheet rule: instead of letting the spreadsheet compute the error bar values, I hard-coded them in.

QUOTE
Again, I'm terribly sorry. I can already feel my credibility going down the drain. sad.gif


Your integrity is intact. Credibility is a matter of trust. If you own up to your mistakes, correct them, and prevent future ones, that goes a long way towards enhancing your credibility.

I suggest keeping both the old (incorrect) overall graphs and showing the new, corrected overall graphs side by side, to show the before and after. I think the individual sample graphs can just be replaced.

ff123

Edit: You should probably rename the old overall graph and then use the original name of the graph for the corrected one. That way, websites which link to your overall graphs will be automatically updated.

This post has been edited by ff123: Mar 5 2004, 08:23
rpop
post Mar 5 2004, 08:12
Post #305





Group: Super Moderator
Posts: 332
Joined: 20-May 03
From: Pittsburgh, USA
Member No.: 6718



QUOTE (ff123 @ Mar 5 2004, 03:00 AM)
Your integrity is intact.  Credibility is a matter of trust.  If you own up to your mistakes, correct them, and prevent future ones, that goes a long way towards enhancing your credibility.

Your integrity is, indeed, intact. I've seen a few other listening tests online, and discussion of their results always stops soon after the tests, with the page receding into internet history. Updating these tests now goes a long way toward proving their reliability will be maintained in the future smile.gif.


--------------------
[url=http://noveo.net/ph34r.htm]Happiness[/url] - The agreeable sensation of contemplating the misery of others.
Garf
post Mar 5 2004, 09:01
Post #306


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (rjamorim @ Mar 5 2004, 08:49 AM)
In case someone is in a hurry to check the corrected zoomed result plot for the AAC test:
http://pessoal.onda.com.br/rjamorim/screen2.png
The only thing that changed is that iTunes is now clearly first place and Nero is second place.

Aaaaaah, this explains my previous complaint that the graph didn't seem to align with your written statement about the test significance smile.gif

Now it does. iTunes indeed almost beats Nero by a significant margin.

As far as the moral winner is concerned, though: sad.gif
Continuum
post Mar 5 2004, 09:20
Post #307





Group: Members
Posts: 473
Joined: 7-June 02
Member No.: 2244



QUOTE (Garf @ Mar 5 2004, 09:01 AM)
As far as the moral winner is concerned, though: sad.gif

huh.gif
"Moral winner"?
rjamorim
post Mar 5 2004, 09:27
Post #308


Rarewares admin


Group: Members
Posts: 7515
Joined: 30-September 01
From: Brazil
Member No.: 81



QUOTE (Garf @ Mar 5 2004, 05:01 AM)
Now it does. iTunes indeed almost beats Nero by a significant margin.

Erm.. I use Darryl's method to evaluate ranking positions.

Check, for instance, thear1 in his 64kbps test results
http://ff123.net/64test/results.html

Oggs are ranked second, according to him, although they overlap a little with MP3pro.

To put it briefly, I (and ff123, it seems) only consider codecs tied when one codec's confidence margin overlaps the other codec's actual ranking score. Or, to make things simpler, when more than half of the two margins overlap.
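
As a minimal sketch of that rule (the numbers are made up for illustration, not taken from the test):

CODE
# rjamorim's tie rule, as described above: two codecs are tied when
# one codec's confidence interval contains the other codec's mean score.
def is_tied(mean_a, mean_b, half_margin):
    return abs(mean_a - mean_b) < half_margin

# Hypothetical means and margin, loosely in the spirit of the zoomed plot:
itunes, nero, half_margin = 4.40, 4.22, 0.15

print("tied" if is_tied(itunes, nero, half_margin) else "iTunes ranked above Nero")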


--------------------
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org
Gabriel
post Mar 5 2004, 09:28
Post #309


LAME developer


Group: Developer
Posts: 2950
Joined: 1-October 01
From: Nanterre, France
Member No.: 138



QUOTE
I can already feel my credibility going down the drain


Finding, admitting, and correcting your own errors only increases credibility, I think.
guruboolez
post Mar 5 2004, 13:29
Post #310





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



Your credibility, your honesty and your honor are now stronger. Thank you.
ScorLibran
post Mar 5 2004, 16:12
Post #311





Group: Banned
Posts: 769
Joined: 1-July 03
Member No.: 7495



You have nothing to worry about, Roberto... your credibility is quite secure. Anyone who conducts tests like these will occasionally make a mistake. It's inevitable. You took the best approach in resolving it. Our trust in you is only higher now. smile.gif

QUOTE (rjamorim @ Mar 5 2004, 03:27 AM)
QUOTE (Garf @ Mar 5 2004, 05:01 AM)
Now it does. iTunes indeed almost beats Nero by a significant margin.

...To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.

That's what I had always thought was the case, but it was just an assumption on my part (that I never communicated). Glad to know it was correct.
ff123
post Mar 5 2004, 16:34
Post #312


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (ScorLibran @ Mar 5 2004, 07:12 AM)
QUOTE (rjamorim @ Mar 5 2004, 09:27)
...To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.

That's what I had always thought was the case, but it was just an assumption on my part (that I never communicated). Glad to know it was correct.

To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap. Or to put it another way, 19 times out of 20, those results would not occur by chance. Any overlap reduces that confidence. If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance. A reasonable way to describe this situation would be to say that the results are suggestive (if not significant). Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy.

If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant.

Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more. I personally don't think it's a big deal if the Type I errors in this sort of test (falsely identifying a codec as better than another) are higher than they would be in a more conservative analysis. But others, for example on Slashdot, can (and do) complain about this sort of thing.
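
For illustration, a minimal sketch of how one common correction (Bonferroni; the design numbers below are hypothetical, and the post doesn't commit to a specific correction) widens the LSD:

CODE
# Sketch: how correcting for multiple comparisons widens the error bars.
# All design numbers below are assumptions, not taken from the test.
from math import comb, sqrt
from scipy import stats

n_codecs, n_listeners = 6, 40
df_error = (n_codecs - 1) * (n_listeners - 1)  # blocked-design error df
mse = 0.30                                     # assumed mean squared error

# Plain Fisher LSD at 95% confidence:
t_plain = stats.t.ppf(1 - 0.05 / 2, df_error)
lsd_plain = t_plain * sqrt(2 * mse / n_listeners)

# Bonferroni-adjusted for all pairwise codec comparisons:
pairs = comb(n_codecs, 2)                      # 15 pairs for 6 codecs
t_bonf = stats.t.ppf(1 - 0.05 / (2 * pairs), df_error)
lsd_bonf = t_bonf * sqrt(2 * mse / n_listeners)

print(f"LSD: {lsd_plain:.3f} uncorrected, {lsd_bonf:.3f} Bonferroni-corrected")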

ff123
Garf
post Mar 5 2004, 16:47
Post #313


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



I take it from rjamorim's previous comment that 'overlap' should be interpreted as overlap between one codec's error bar and the other codec's mean score marker, and not as overlap between the two error bars themselves?
ff123
post Mar 5 2004, 17:05
Post #314


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (rjamorim @ Mar 5 2004, 12:27 AM)
Check, for instance, thear1 in his 64kbps test results
http://ff123.net/64test/results.html

Oggs are ranked second, according to him, although they overlap a little with MP3pro.

In that test I used an "eyeball" method to rank the codecs when trying to determine an appropriate overall ranking. People (including me) didn't like the subjectivity involved in that method, so I changed to the method used now, which is to perform another ANOVA/Fisher LSD once the means for each music sample are determined. The assumption this method makes is that each sample is equally important to the final overall results. This may not actually be true if, for example, there are lots of people listening to some samples and only a few listening to others. Also, the choice of samples greatly affects the overall results.

But at least it seems to produce reasonable results, and it removes the subjectivity involved in the earlier method.
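
For illustration, a minimal Python sketch of that two-stage approach with fabricated scores (a simplified one-way version; the codec names, group sizes, and quality figures are assumptions, and the real analysis may additionally block on samples):

CODE
# Stage 1: collapse listener scores to one mean per (sample, codec).
# Stage 2: one-way ANOVA across those per-sample means, then Fisher LSD.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
codecs = ["iTunes", "Nero", "FAAC", "Real"]           # hypothetical subset
n_samples, n_listeners = 12, 20                       # hypothetical sizes
true_quality = {"iTunes": 4.5, "Nero": 4.4, "FAAC": 4.1, "Real": 3.8}

# Stage 1: per-sample mean score for each codec (fabricated listener data).
groups = [rng.normal(true_quality[c], 0.3, (n_samples, n_listeners)).mean(axis=1)
          for c in codecs]

# Stage 2: one-way ANOVA on the per-sample means.
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# Fisher LSD (only interpreted when the ANOVA is significant).
k, total = len(groups), len(groups) * n_samples
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (total - k)
lsd = stats.t.ppf(0.975, total - k) * np.sqrt(2 * mse / n_samples)
for i in range(k):
    for j in range(i + 1, k):
        diff = abs(groups[i].mean() - groups[j].mean())
        print(f"{codecs[i]} vs {codecs[j]}: "
              f"{'differ' if diff > lsd else 'tied'} (diff={diff:.2f}, LSD={lsd:.2f})")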

QUOTE
I take it from the previous comment by rjamorim that 'bars' should be interpreted as 'error bars' and 'mean score marker' and not 2x 'error bars'?


The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.
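
In other words, each bar spans its mean plus or minus LSD/2, so two bars fail to overlap exactly when the two means differ by more than one full LSD. A minimal sketch with hypothetical numbers:

CODE
# Each error bar spans mean +/- lsd/2 (total length = Fisher LSD),
# so bar overlap is equivalent to |difference of means| < LSD.
def bars_overlap(mean_a, mean_b, lsd):
    return abs(mean_a - mean_b) < lsd

print(bars_overlap(4.40, 4.22, lsd=0.20))  # True: not significant at 95%
print(bars_overlap(4.40, 4.10, lsd=0.20))  # False: significant at 95%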

ff123
JohnV
post Mar 5 2004, 17:16
Post #315





Group: Developer
Posts: 2797
Joined: 22-September 01
Member No.: 6



QUOTE (ff123 @ Mar 5 2004, 05:34 PM)
To be absolutely correct, a codec wins with 95% confidence, for that group of listeners and set of samples, when the bars do not overlap.  Or to put it another way, 19 times out of 20, those results would not occur by chance.  Any overlap reduces that confidence.  If the bars just barely overlap, there is still quite a high likelihood that that result did not occur by chance.  A reasonable way to describe this situation would be to say that the results are suggestive (if not significant).  Actually, in an ideal world, the graphs would speak for themselves, and there would be no "interpretation" to cause controversy.

If this were a drug test or something else where there is a lot at stake for making the right decision, everything below 95% confidence (or whatever threshold is chosen) would not be considered to be significant.

Also, the test would be corrected for comparing multiple samples, which would make the error bars overlap more.  I personally don't think it's a real big deal if the type I errors in this sort of test (falsely identifying a codec as being better than another) are higher than they would be in a more conservative analysis.  But others, for example on slashdot, can (and do) complain about this sort of thing.

ff123

Right, well, with 95% confidence for the tested 12 samples:
iTunes is better than Real, FAAC, and Compaact
Nero is better than Real and Compaact

With lower confidence for the tested 12 samples:
Nero is better than FAAC (small overlap)

With even lower confidence for the tested 12 samples:
iTunes is better than Nero (a bit bigger overlap than with Nero-FAAC)

Correct?


--------------------
Juha Laaksonheimo
Garf
post Mar 5 2004, 17:36
Post #316


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (ff123 @ Mar 5 2004, 06:05 PM)
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.

So, if I get that correctly, there shouldn't be any overlap between the error bars at all, since "no overlap between one error bar and the other's mean" only covers half the error length. (And hence my original comment was right.)
Zed
post Mar 5 2004, 17:53
Post #317





Group: Banned
Posts: 6
Joined: 5-March 04
Member No.: 12484



QUOTE (rjamorim @ Mar 5 2004, 12:27 AM)
QUOTE (Garf @ Mar 5 2004, 05:01 AM)
Now it does. iTunes indeed almost beats Nero by a significant margin.

Erm.. I use Darryl's method to evaluate ranking positions.

Check, for instance, thear1 in his 64kbps test results
http://ff123.net/64test/results.html

Oggs are ranked second, according to him, although they overlap a little with MP3pro.

To put it short, I (and ff123, it seems) only consider codecs tied when one's confidence margin overlaps with the other's actual ranking. Or, to make things simpler, when more than half of the entire margins overlap.

How about this one?

Where is the truth?
ff123
post Mar 5 2004, 18:05
Post #318


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (Garf @ Mar 5 2004, 08:36 AM)
QUOTE (ff123 @ Mar 5 2004, 06:05 PM)
The length of each error bar from top to bottom (mean in the middle) is equal to the Fisher LSD.

So there shouldn't be any overlap between error bars at all, if I get that correctly, since no overlap between error bar and mean is only half the error length. (And hence my original comment was right).

Yes. If the error bars do not overlap, that is a difference at 95% confidence. And yes, iTunes almost beats Nero with 95% confidence.
eagleray
post Mar 5 2004, 18:12
Post #319





Group: Members
Posts: 265
Joined: 15-December 03
Member No.: 10452



Is there anything in the testing methodology to ensure that iTunes does not sound "better" than the original CD through the addition of some audio "sugar"?

I hope the experts around here do not think this is too off the wall. For that matter, I don't know if there is a way to make any recording sound "better" than the original.
ff123
post Mar 5 2004, 18:16
Post #320


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (Zed @ Mar 5 2004, 08:53 AM)
this one?

where is the truth?

The biggest weakness of this test IMO is that there were only 3 samples tested, and they made it even worse by combining them into one medley. Other problems: IIRC, people were asked to rank the codecs from best to worst, not to compare and rate against a known reference. I believe the reference was hidden as one of the samples to be ranked.

But the 3 sample medley is really the killer. They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.

ff123
rjamorim
post Mar 5 2004, 18:17
Post #321


Rarewares admin


Group: Members
Posts: 7515
Joined: 30-September 01
From: Brazil
Member No.: 81



Hello.

Thank you very much for your support smile.gif

I have been correcting the plots (will upload them later) and so far, it seems very few will change:

-At the first AAC@128kbps test, it only becomes clearer that QuickTime is the winner.
-At the Extension test, it seems Vorbis and WMAPro are no longer tied with AAC and MPC, and now share second place. I'll leave it to others to discuss.
-The 64kbps test results stay the same: Lame wins, followed by HE AAC, then MP3pro, then Vorbis. LC AAC, Real and WMA are still tied at fifth place, and FhG MP3 is still way down the graph.
-The MP3 test stays the same as well.

Regards;

Roberto.

This post has been edited by rjamorim: Mar 5 2004, 18:24


--------------------
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org
ff123
post Mar 5 2004, 18:17
Post #322


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (eagleray @ Mar 5 2004, 09:12 AM)
Is there anything in the testig methodology to assure that iTunes does not sound "better" than the original CD through the addition of some audio "sugar"?

I hope the experts around here do not think this is too off the wall.  For that matter I don't know if there is a way to make any recording sound "better" than the original.

Yes, the listener is asked to rate the sample against the reference. The reference is 5.0 by default, so any difference, even if it "sounds better" than the reference, must be rated less than 5.0.
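
A minimal sketch of that rating rule (illustrative only, not ABC/HR's actual code; the 1.0-5.0 scale is the usual five-grade impairment scale):

CODE
# ABC/HR-style rule: the reference is pegged at 5.0, so any audible
# difference -- even a "pleasant" one -- must score below 5.0.
REFERENCE_SCORE = 5.0

def validate_rating(score, heard_difference):
    if not 1.0 <= score <= 5.0:
        raise ValueError("scores run from 1.0 to 5.0")
    if heard_difference and score >= REFERENCE_SCORE:
        raise ValueError("an audible difference must be rated below 5.0")
    return score

validate_rating(4.2, heard_difference=True)    # a codec with audible artifacts
validate_rating(5.0, heard_difference=False)   # transparent: equal to reference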

ff123
Zed
post Mar 5 2004, 18:28
Post #323





Group: Banned
Posts: 6
Joined: 5-March 04
Member No.: 12484



QUOTE (ff123 @ Mar 5 2004, 09:16 AM)
But the 3 sample medley is really the killer.  They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.

But a small number of ears is also a killer, I guess...
ff123
post Mar 5 2004, 18:57
Post #324


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (Zed @ Mar 5 2004, 09:28 AM)
QUOTE (ff123 @ Mar 5 2004, 09:16 AM)
But the 3 sample medley is really the killer.  They would have been much better off distributing lots of different samples (with that amount of listeners they could have distributed 50 different samples with ease) to determine which codec is better overall.

but small number of the ears is also the killer i guess...

They had about 3000 listeners for both the 64 kbit/s and 128 kbit/s tests. If they had distributed 50 separate samples instead of the one medley, they could have gotten more than 50 listeners per sample. That's more than enough to make a statistical inference. In fact, one can do quite well with far fewer.
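
As a back-of-envelope check (the score standard deviation of 1.0 below is an assumption, not a figure from that test):

CODE
# With ~3000 listeners spread over 50 samples, each sample still gets
# about 60 ratings; assuming a score standard deviation of 1.0, the
# 95% confidence half-width per sample mean stays usefully small.
from math import sqrt
from scipy import stats

listeners, samples = 3000, 50
n = listeners // samples                              # 60 ratings per sample
half_width = stats.t.ppf(0.975, n - 1) * 1.0 / sqrt(n)
print(n, round(half_width, 2))                        # 60, ~0.26 on a 1-5 scale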

ff123
Garf
post Mar 5 2004, 19:07
Post #325


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



The test also seems to be at least 1.5 years old. A lot has happened with AAC in that time.

This post has been edited by Garf: Mar 5 2004, 19:08
