MPC vs OGG VORBIS vs MP3 at 175 kbps, listening test on non-killer samples
Pio2001
post Jul 24 2004, 23:56
Post #76


Moderator





QUOTE (ff123 @ Jul 24 2004, 09:43 PM)
You're talking about multiple trials of rating a codec in the abc/hr module. For example, rate a certain number of codecs for trial 1, then reshuffle them and rate them again for trial 2.


Yes, exactly.

QUOTE (ff123 @ Jul 24 2004, 09:43 PM)
Imagine testing just two, but very different quality codecs.  Then it doesn't make much sense to repeat the ratings:  they will be rated exactly the same every time.


But in this case, it would provide both the ratings and the ABX results at once, with hardly any more work than two ABX tests; we just have to identify the reference in addition.
Identifying them correctly 8 times out of 8, with the reference left unranked, would replace ABXing the first codec against the reference and ABXing the second against the reference; and since the ranking is consistent, it would also replace ABXing the two codecs against each other, without us having to do it!

I tried this data in your analyzer:

CODE
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00



I chose 0.01 as the p limit and ran the ANOVA analysis (by the way, what's the difference with the Friedman / non-parametric one?).
The results were:

Reference is better than Codec2, Codec1
Codec2 is better than Codec1


All this for p < 0.01

Thus the analyzer recognized that leaving Codec 1 ranked below the reference 8 times out of 8 means that the Reference is better than Codec 1 with p < 0.01.
It recognized that the Reference is better than Codec 2 with p < 0.01; so far we have the same information as with two ABX tests.
And it also says that Codec 2 is better than Codec 1 with p < 0.01. This is right, since the listener obviously distinguished the codecs (rating Codec 1 3.00 and Codec 2 4.00) 8 times out of 8 without a mistake.
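To put a number on this reasoning, here is a minimal Python sketch of the binomial logic (my own illustration, assuming a 50/50 guess per trial; this is not the analyzer's code):

CODE
# Chance of ranking two codecs in the same consistent order 8 times out of 8
# when purely guessing (one-sided binomial tail).
from scipy.stats import binom

trials = 8
p = binom.sf(trials - 1, trials, 0.5)   # P(at least 8 successes out of 8)
print(p)                                # 0.00390625, already below 0.01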

By the way, your analyzer is bugged: it doesn't work if the first rating for Codec 1 is 3.00. I had to set 3.01 instead.


I also tested one mistake in the codec choice (which stands for a 7/8 ABX between the codecs, but still 8/8 for each codec against the reference):

CODE
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      4.00   3.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00


The ANOVA analysis still tells me that Codec 2 is better than Codec 1 (with p < 0.001!). This is strange.
However, the Friedman / non-parametric analysis detects the problem and says that only the reference was recognized as superior to the codecs, with p < 0.01.


Hey! What's the problem with the ANOVA analysis??

CODE
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      3.00   4.00


From the above data, it says that the Reference is better than Codec2 and Codec1, and that Codec2 is better than Codec1, all with p < 0.001! That is plainly wrong: results like these can easily happen by chance!


The Friedman analysis seems to work well (it says that the above data is not significant).
So I ran Guruboolez's data through the analyzer again, but with the Friedman analysis this time, in case of an ANOVA computation failure:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis

Number of listeners: 10
Critical significance:  0.05
Significance of data: 1.44E-05 (highly significant)
Fisher's protected LSD for rank sums:  16.398

Ranksums:

MPC-q5   MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
55.00    49.00    31.50    30.50    26.50    17.50  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
MPC-q5   0.473    0.005*   0.003*   0.001*   0.000*  
MGX-q6            0.036*   0.027*   0.007*   0.000*  
MP3-V2                     0.905    0.550    0.094    
MGX-q5.9                            0.633    0.120    
MGX-q5.5                                     0.282    
-----------------------------------------------------------------------

MPC-q5 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3


Fortunately, it still says that MPC and Megamix q6 are the winners. However, MPC doesn't win over Megamix q6 anymore. This time, it says there is nearly one chance in two of getting this result by chance!

guruboolez
post Jul 25 2004, 01:43
Post #77








QUOTE (Pio2001 @ Jul 24 2004, 11:56 PM)
So I ran Guruboolez's data through the analyzer again, but with the Friedman analysis this time, in case of an ANOVA computation failure :( (...)
However, MPC doesn't win over Megamix q6 anymore. This time, it says there is nearly one chance in two of getting this result by chance!


I have simulated adding more results (i.e. samples) by simply duplicating the scores obtained for the first 10 samples.
With 70 results (= 10 x 7), the Friedman conclusion is:
CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis

Number of listeners: 70
Critical significance:  0.05
Significance of data: 0.00E+00 (highly significant)
Fisher's protected LSD for rank sums:  43.386

Ranksums:

MPC-q5   MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
385.00   343.00   220.50   213.50   185.50   122.50  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
MPC-q5   0.058    0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.000*   0.000*   0.000*   0.000*  
MP3-V2                     0.752    0.114    0.000*  
MGX-q5.9                            0.206    0.000*  
MGX-q5.5                                     0.004*  
-----------------------------------------------------------------------

MPC-q5 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MP3-V2 is better than MP3-V3
MGX-q5.99 is better than MP3-V3
MGX-q5.5 is better than MP3-V3


With 7 copies of the same bunch of results, MPC still can't be said to be better than Vorbis -q6 with confidence, even though 56 samples favor MPC and only 14 favor Vorbis... Weird.

Only with 8 copies of the same results is significance reached:
CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Friedman Analysis

Number of listeners: 80
Critical significance:  0.05
Significance of data: 0.00E+00 (highly significant)
Fisher's protected LSD for rank sums:  46.381

Ranksums:

MPC-q5   MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
440.00   392.00   252.00   244.00   212.00   140.00  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MP3-V2   MGX-q5.9 MGX-q5.5 MP3-V3  
MPC-q5   0.043*   0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.000*   0.000*   0.000*   0.000*  
MP3-V2                     0.735    0.091    0.000*  
MGX-q5.9                            0.176    0.000*  
MGX-q5.5                                     0.002*  
-----------------------------------------------------------------------

MPC-q5 is better than MGX-q6, MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MGX-q6 is better than MP3-V2, MGX-q5.99, MGX-q5.5, MP3-V3
MP3-V2 is better than MP3-V3
MGX-q5.99 is better than MP3-V3
MGX-q5.5 is better than MP3-V3


Now, if I suppose that the scores I initially planned to add to the first bunch of 10 results will not really differ from the first 10, I would need to find and test about 70 additional samples to claim that MPC is superior to vorbis "megamix" -q 6,00 without risking banishment. Forget guruboolez's test: I have other things to do in my life ;)


With the ANOVA analysis, the situation is less pathetic:
CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 20
Critical significance:  0.05
Significance of data: 0.00E+00 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              119          90.75
Testers (blocks)    19           7.35
Codecs eval'd        5          52.03   10.41   31.50  0.00E+00
Error               95          31.38    0.33
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.361

Means:

MPC-q5   MGX-q6   MGX-q5.9 MP3-V2   MGX-q5.5 MP3-V3  
 3.82     3.15     2.34     2.30     2.23     1.88  

---------------------------- p-value Matrix ---------------------------

        MGX-q6   MGX-q5.9 MP3-V2   MGX-q5.5 MP3-V3  
MPC-q5   0.000*   0.000*   0.000*   0.000*   0.000*  
MGX-q6            0.000*   0.000*   0.000*   0.000*  
MGX-q5.9                   0.826    0.546    0.013*  
MP3-V2                              0.701    0.023*  
MGX-q5.5                                     0.057    
-----------------------------------------------------------------------

MPC-q5 is better than MGX-q6, MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q6 is better than MGX-q5.99, MP3-V2, MGX-q5.5, MP3-V3
MGX-q5.99 is better than MP3-V3
MP3-V2 is better than MP3-V3


If the next 10 samples I test get the same ratings as the first 10, then I could conclude that MPC is superior.


May I suggest forgetting the "Friedman/non-parametric Fisher" analysis for analysing ABC/HR scores? That could be helpful for testers...

ff123
post Jul 25 2004, 06:52
Post #78


ABC/HR developer, ff123.net admin





QUOTE (Pio2001 @ Jul 24 2004, 02:56 PM)
I chose 0.01 as the p limit and ran the ANOVA analysis (by the way, what's the difference with the Friedman / non-parametric one?).


Non-parametric means that you're giving each codec a ranking (i.e., first, second, third, etc.) instead of a rating on a scale from 1.0 to 5.0. Ranking can be more robust than rating, but also less sensitive.
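To illustrate the ranking step (a minimal sketch, assuming one row of ratings per listener; this is not friedman.exe's actual code):

CODE
# The non-parametric analysis reduces each listener's row of ratings to ranks,
# so only the ordering of the codecs survives, not the distances between them.
from scipy.stats import rankdata

ratings = [5.00, 3.00, 4.00]    # Reference, Codec1, Codec2
print(rankdata(ratings))        # [3. 1. 2.]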

QUOTE
By the way, your analyzer is bugged: it doesn't work if the first rating for Codec 1 is 3.00. I had to set 3.01 instead.


You're running into a divide by 0 problem. If you set any number to be different (not just codec 1 in the first row) it will sidestep this problem. It's not a bug in the program -- that's the way the calculations work. If you use real data, you should never see this kind of behavior.

QUOTE
I also tested one mistake in the codec choice (which stands for a 7/8 ABX between the codecs, but still 8/8 for each codec against the reference):

CODE
Reference Codec1 Codec2
5.00      3.01   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00
5.00      3.00   4.00


The ANOVA analysis still tells me that Codec 2 is better than Codec 1 (with p < 0.001!). This is strange.


Also not a bug. Set another row to be like row 2 and you'll see the p-value start to creep up.

ff123

This post has been edited by ff123: Jul 25 2004, 06:56

ff123
post Jul 25 2004, 07:05
Post #79


ABC/HR developer, ff123.net admin





QUOTE (guruboolez @ Jul 24 2004, 04:43 PM)
May I suggest forgetting the "Friedman/non-parametric Fisher" analysis for analysing ABC/HR scores? That could be helpful for testers...


The Friedman non-parametric analysis makes fewer assumptions about the data, and is therefore more robust, but can also be less powerful than ANOVA. If one wanted to be ultra-conservative, he would do a non-parametric Tukey's analysis, which corrects for the fact that there are multiple codecs being ranked. But for abc/hr, there's little reason to use Friedman. I should probably change the default.

ff123

Edit: I should also probably add the Tukey's analyses back into the web page. They're in the actual command-line program.

This post has been edited by ff123: Jul 25 2004, 07:09

Pio2001
post Jul 25 2004, 16:45
Post #80


Moderator





QUOTE (ff123 @ Jul 25 2004, 06:52 AM)
QUOTE
By the way, your analyzer is bugged: it doesn't work if the first rating for Codec 1 is 3.00. I had to set 3.01 instead.


You're running into a divide by 0 problem. If you set any number to be different (not just codec 1 in the first row) it will sidestep this problem. It's not a bug in the program -- that's the way the calculations work. If you use real data, you should never see this kind of behavior.



Why?
Is it forbidden to find one codec consistently rated 3 and the other always rated 4? I didn't set 0 anywhere, just 3.00 and 4.00.

EDIT: and I still don't understand how two people rating the codecs 3 and 4 can lead to a confidence above 99.9%!
I guess the analyzer finds the coincidence very big: 3.00 and then 3.00 again, when it could have been 3.05 or 2.99, so it decides that can't be a coincidence!
This should absolutely be avoided! Real people will never rate a codec 2.99. Will the analyzer drop this spurious accuracy if we enter 3 without the dot and digits? Or would it have to be rewritten to work with integer precision instead of hundredths?

bleh
post Jul 25 2004, 17:10
Post #81








With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence. That's how the test works. However, such an assumption is terrible with a sample size that low, so trying to run the test with only two people ranking codecs is a bad idea.

Also, the division by zero came from the fact that all scores were the same (mean - each score = 0). Again, this is either a symptom of the sample size being too low or the probability of a real difference being staggeringly high.
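A minimal sketch of where it blows up (my own reconstruction of the randomized-block arithmetic, not the analyzer's actual code):

CODE
# With identical rows, removing the codec and listener effects leaves a residual
# sum of squares of exactly zero, and the F ratio then divides by it.
import numpy as np

scores = np.array([[5.00, 3.00, 4.00]] * 8)   # every listener gives the same row
resid = scores - scores.mean(axis=0)          # remove codec means
resid -= resid.mean(axis=1, keepdims=True)    # remove listener (block) means
print((resid ** 2).sum())                     # 0.0  ->  F = MS_codecs / 0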

Pio2001
post Jul 25 2004, 18:38
Post #82


Moderator





QUOTE (bleh @ Jul 25 2004, 05:10 PM)
With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence.  That's how the test works.


In this case, it should never be applied to ABC/HR tests. We ask people to choose between 1.00, 2.00, 3.00, 4.00, or 5.00. The analyzer will find that people always giving an integer answer can't be a coincidence, and will return insanely high levels of confidence because of this.

QUOTE (bleh @ Jul 25 2004, 05:10 PM)
However, such an assumption is terrible with a sample size that low, so trying to run the test with only two people ranking codecs is a bad idea.


What's the meaning of the p-values, then?

ff123
post Jul 25 2004, 21:58
Post #83


ABC/HR developer, ff123.net admin





QUOTE (Pio2001 @ Jul 25 2004, 09:38 AM)
QUOTE (bleh @ Jul 25 2004, 05:10 PM)
With 3.00 and 3.01 for one codec and two scores of 4.00 for another, there's an incredibly low amount of variation, so, assuming that the data constitutes a representative sample of all scores people could give for each codec, there's practically no chance that the difference is a coincidence.  That's how the test works.


In this case, it should never be applied to ABC/HR tests. We ask people to choose between 1.00, 2.00, 3.00, 4.00, or 5.00. The analyzer will find that people always giving an integer answer can't be a coincidence, and will return insanely high levels of confidence because of this.


No, not true. One of the assumptions that ANOVA makes is that the scale is continuous. ABC/HR's scale is not continuous, but it is close enough, since it has many intervals in between the major divisions. As I said, in real-world data, you are not likely to see a table of scores like the one you posted.

ff123

ff123
post Jul 25 2004, 22:20
Post #84


ABC/HR developer, ff123.net admin





For anybody interested in seeing exactly where in the calculations this thing blows up with the sort of data Pio supplied, download this spreadsheet, which shows how things are computed:

http://ff123.net/export/anova.zip

ff123

Pio2001
post Jul 26 2004, 00:53
Post #85


Moderator





In the meantime, I found some info on the web:

Anova : http://www.psychstat.smsu.edu/introbook/sbk27.htm
Friedman : http://www.graphpad.com/articles/interpret...A/friedmans.htm

In short, they say that the Friedman analysis only cares about the ranking of the codecs in each line.
If, in one line, Codec 1 is rated 5.00 and Codec 2 is rated 4.99, then for the Friedman analysis it is exactly the same as if they were rated 2000 and 1, as long as Codec 1 comes first and Codec 2 second. It doesn't care about the scores at all.
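A quick way to see this (a sketch using scipy's plain Friedman test; ff123's tool adds Fisher's LSD on top, but the core statistic behaves the same way):

CODE
# The Friedman statistic depends only on the within-row ordering: rescaling
# the scores changes nothing as long as the ranks stay the same.
from scipy.stats import friedmanchisquare

a = friedmanchisquare([5.00] * 8, [4.99] * 8, [1.00] * 8)
b = friedmanchisquare([2000] * 8, [1000] * 8, [1] * 8)
print(a.pvalue == b.pvalue)     # True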

The ANOVA analysis computes the variance of the results that each codec got, then the variation between the codecs. If it finds the variation between codecs abnormally high compared to the variance within each codec's ratings, it declares the codec superior or inferior.
If it finds that the difference between the codecs is similar to the differences between individual people or samples, it says that the variation was to be expected, and rates the codecs equal.
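That computation can be sketched as follows (my own reconstruction of a randomized-block ANOVA matching the tables friedman.exe prints; the real tool's details may differ):

CODE
import numpy as np
from scipy.stats import f

def blocked_anova(x):
    # x: one row per listener/sample, one column per codec
    n, k = x.shape
    grand = x.mean()
    ss_codecs = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between codecs
    ss_blocks = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between listeners
    ss_error = ((x - grand) ** 2).sum() - ss_codecs - ss_blocks
    df_c, df_e = k - 1, (n - 1) * (k - 1)
    F = (ss_codecs / df_c) / (ss_error / df_e)
    return F, f.sf(F, df_c, df_e)   # F ratio and its p-value

If the between-codec mean square is large compared to the error mean square, F is large and the p-value is small, exactly as in the ANOVA tables quoted earlier in the thread.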

deaf
post Jul 26 2004, 02:38
Post #86








It is interesting to see that applying scientific methods makes some fine detail disappear from the results and wastes the effort that was put into the test. Looking at charts of test results, like the latest low-bitrate one, even if the confidence intervals overlap we still rate one codec better than the other, even though the probability of being wrong increases. Without violating rule #8, we do make comments on it.
It has been discussed several times how to deal with differences in the bitrates of the samples. I have not made much effort to research it, and it is controversial because it is subjective, but may I suggest an XY chart of bitrate vs. rating for this result? A "how much bang for the money" style. Maybe others are aware of some scientific way of calculating mean/std circles for each codec, given that quality is not linear or even proportional to size/bitrate. Could that give another perspective on the results?

ff123
post Jul 26 2004, 03:10
Post #87


ABC/HR developer, ff123.net admin





QUOTE (deaf @ Jul 25 2004, 05:38 PM)
It is interesting to see that applying scientific methods makes some fine detail disappear from the results and wastes the effort that was put into the test.


For a group test, to get more sensitive results, decrease the number of codecs being tested. If you only compare 2 codecs, for example, you can get very fine detail.

ff123

ff123
post Jul 26 2004, 03:40
Post #88


ABC/HR developer, ff123.net admin





QUOTE (ff123 @ Jul 24 2004, 10:05 PM)
Edit: I should also probably add the Tukey's analyses back into the web page. They're in the actual command-line program.


Added to the web page analyzer. I also made the Parametric Tukey's HSD the default, which is the conservative option, but the most statistically correct, especially with large numbers of codecs being compared.

ff123

Pio2001
post Aug 4 2004, 23:23
Post #89


Moderator





FF123, could you explain to us in plain language why, when two codecs are analyzed the Friedman way, the confidence that a difference exists matches the binomial table, whereas adding completely independent columns, standing for other codecs, makes the exact same data between our first two codecs non-significant?

Is it because the probability of finding a low guessing probability among all the possible pairs of codecs is taken into account?

ff123
post Aug 5 2004, 01:04
Post #90


ABC/HR developer, ff123.net admin





QUOTE (Pio2001 @ Aug 4 2004, 02:23 PM)
FF123, could you explain to us in plain language why, when two codecs are analyzed the Friedman way, the confidence that a difference exists matches the binomial table, whereas adding completely independent columns, standing for other codecs, makes the exact same data between our first two codecs non-significant?

Is it because the probability of finding a low guessing probability among all the possible pairs of codecs is taken into account?


The answer to the latter question: the Friedman (non-parametric) method does not do a Bonferroni-type correction for multiple comparisons (like the Tukey methods do).

I don't really know the answer to the first, but I can guess: there would have to be a separate LSD number for each comparison (for 2 codecs there can only be one comparison, for 3 codecs there are 3 comparisons, for 4 codecs 6 comparisons, etc.). Since there is only one LSD number, all of the comparisons would have to be exactly alike to match the binomial table. But that would almost never happen.
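The size of such a correction is easy to sketch (a Bonferroni-style illustration of my own; the Tukey methods compute something subtler, but the idea is the same):

CODE
# With k codecs there are k*(k-1)/2 pairwise comparisons; a Bonferroni-style
# correction divides the significance threshold among them.
from math import comb

k = 6
m = comb(k, 2)        # 15 comparisons among 6 codecs
print(m, 0.05 / m)    # each pair must reach p < 0.0033 to keep 0.05 overall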

The way to get a better match to the binomial table would be to do a comparison like the resampling method used by the bootstrap program here:

http://ff123.net/bootstrap/

This method essentially performs many simulations, and produces a separate confidence interval for each comparison. The downside to using this type of method is that you can't really use the nice graphs any more (which we can draw because there is only one size error bar which applies to all comparisons), and have to stick to showing the results in tabular format.
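For one codec pair, the heart of such a resampling method might look like this (a generic illustration, not ff123's bootstrap code):

CODE
# Resample listeners with replacement and recompute the mean rating difference
# for one pair of codecs; the quantiles give that pair its own confidence interval.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(a, b, n_boot=10000, alpha=0.05):
    diff = np.asarray(a) - np.asarray(b)    # paired per-listener differences
    idx = rng.integers(0, len(diff), (n_boot, len(diff)))
    means = diff[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

If the interval excludes zero, that particular comparison is significant at the chosen level.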

ff123

guruboolez
post Aug 22 2004, 13:53
Post #91








...::: 8 additional results :::...



I. TESTING CONDITIONS

Few changes since the last bunch of tests: same hardware, same kind of music (classical), same software. I have nevertheless drawn a conclusion from the past discussion with Pio2001 and fixed the number of trials for all ABX tests: 12 trials, no more, no less. This drastic condition demands a lot of concentration and many rests, and is therefore very time-consuming. Tests are less enjoyable in my opinion (motivation is harder to find). Another consequence: there are now 5.0 [transparent] ratings. If I failed [EDIT: "completely failed"] to ABX something, I cancelled my ABC/HR rating and gave a nice 5.0 as the final note. I nevertheless kept a trace of my initial impression in the "general comment".



II. SAMPLES

I tried to vary the samples as much as possible (kinds of instruments/signal). There are no known killers. All samples should be 'normal', with no correspondence to typical lossy/perceptual problem cases (such as sharp attacks and micro-attack signals, for example).

Eight more samples. Two are from harashin:
- Liebestod: opera (soprano voice with orchestra)
- LadyMacbeth: festive orchestra, with predominant brass and cymbals

Six others are mine:
- Trumpet Voluntar: trumpet with organ (noisy recording)
- Vivaldi RV93: baroque strings, i.e. period instruments (small ensemble)
- Troisième Ballet: cousin of bagpipes, playing with a baroque ensemble
- Vivaldi – Bassoon [13]: solo bassoon, with light accompaniment
- Seminarist: male voice (baritone) with a lot of sibilant consonants and piano accompaniment
- ButterflyLovers: solo violin playing alternately with full string orchestra



III. RESULTS

3.1. eight new results



3.2 cumulative results



3.3. comments about results

No big differences between the two parts of the test:
CODE
TEST    MP3_V2  MP3_V3  MPC_Q5  MGX5,5  MGX5,99 MGX6,00
NO.1    12,3    11,9    13,8    12,2    12,3    13,2
NO.2    12,7    11,9    13,9    12,1    12,3    13,4

The average rating is very stable, except maybe for lame --preset standard, which progresses slightly on these eight new samples. The hierarchy is identical. The conclusions are therefore the same as those posted in my first post :)


IV. STATISTICAL ANALYSIS


I fed ff123’s friedman.exe application with the following table:
CODE

LAME_V2   LAME_V3   MPC_Q5    OGG5.5    OGG5.99   OGG6.00  
2.00      1.50      3.00      2.00      2.00      3.20      
1.50      1.00      4.00      2.90      2.90      3.50      
3.00      2.50      2.80      3.00      3.30      4.00      
3.00      2.00      4.00      2.00      2.00      2.30      
1.50      1.00      4.90      2.50      2.50      3.30      
3.00      1.80      3.80      2.20      2.40      3.00      
1.50      1.20      3.50      1.80      2.30      3.40      
1.50      2.70      4.00      2.00      2.00      2.30      
3.00      2.80      4.20      1.60      1.50      3.00      
3.00      2.30      4.00      2.30      2.50      3.50      
2.00      2.00      4.00      2.50      2.50      3.50      
3.50      2.50      5.00      1.50      1.50      4.00      
1.50      1.00      4.00      2.00      2.50      3.00      
1.40      1.20      3.50      1.70      2.00      2.20      
4.00      3.00      5.00      4.00      4.00      4.50      
2.50      1.30      3.50      1.70      1.70      2.70      
3.00      1.20      3.00      1.40      2.00      2.20      
3.50      3.00      3.00      2.00      2.00      5.00      

[Interesting to note: the conclusions and values computed by the tool are exactly the same if I keep the original notation, e.g. 12.3 rather than 2.30.]

The ANOVA analysis conclusion is:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Blocked ANOVA analysis

Number of listeners: 18
Critical significance:  0.05
Significance of data: 0.00E+000 (highly significant)
---------------------------------------------------------------
ANOVA Table for Randomized Block Designs Using Ratings

Source of         Degrees     Sum of    Mean
variation         of Freedom  squares   Square    F      p

Total              107         102.73
Testers (blocks)    17          23.75
Codecs eval'd        5          49.48    9.90   28.53  0.00E+000
Error               85          29.49    0.35
---------------------------------------------------------------
Fisher's protected LSD for ANOVA:   0.390

Means:

MPC_Q5   OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
 3.84     3.26     2.47     2.31     2.17     1.89  

---------------------------- p-value Matrix ---------------------------

        OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
MPC_Q5   0.004*   0.000*   0.000*   0.000*   0.000*  
OGG6.00           0.000*   0.000*   0.000*   0.000*  
LAME_V2                    0.430    0.137    0.004*  
OGG5.99                             0.481    0.034*  
OGG5.5                                       0.153    
-----------------------------------------------------------------------

MPC_Q5 is better than OGG6.00, LAME_V2, OGG5.99, OGG5.5, LAME_V3
OGG6.00 is better than LAME_V2, OGG5.99, OGG5.5, LAME_V3
LAME_V2 is better than LAME_V3
OGG5.99 is better than LAME_V3


And now the "most statistically correct" parametric Tukey analysis:

CODE
FRIEDMAN version 1.24 (Jan 17, 2002) http://ff123.net/
Tukey HSD analysis

Number of listeners: 18
Critical significance:  0.05
Tukey's HSD:   0.574

Means:

MPC_Q5   OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
 3.84     3.26     2.47     2.31     2.17     1.89  

-------------------------- Difference Matrix --------------------------

        OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3  
MPC_Q5     0.589*   1.378*   1.533*   1.672*   1.956*
OGG6.00             0.789*   0.944*   1.083*   1.367*
LAME_V2                      0.156    0.294    0.578*
OGG5.99                               0.139    0.422  
OGG5.5                                         0.283  
-----------------------------------------------------------------------

MPC_Q5 is better than OGG6.00, LAME_V2, OGG5.99, OGG5.5, LAME_V3
OGG6.00 is better than LAME_V2, OGG5.99, OGG5.5, LAME_V3
LAME_V2 is better than LAME_V3


According to the last analysis, lame -V3 and vorbis megamix1 -q 5,50/5,99 offer comparable performance (they are tied). In other words, I can't say that megamix at -q 5,99 is superior to lame -V3, even though 13 samples (72%) are favorable to megamix 5,99, one is identical (6%) and only four (22%) are favorable to lame V3. If I understand correctly, for me and this set of 18 tested samples, I should admit that lame is tied with vorbis even though the latter is superior on 72% of the tested samples! That's totally insane in my opinion... There's maybe a problem somewhere, or are 18 samples still not enough?
The ANOVA analysis is slightly more acceptable: it concludes that megamix 5,99 is superior over the 18 samples, but still not megamix 5,50 (66% of favorable samples).

But both analyses conclude:
1/ full MPC -Q5 superiority (even against Vorbis megamix1 -Q6)
2/ megamix1 Q6 superiority over lame -V2 and -V3, and over megamix Q5,50 and Q5,99
3/ LAME V2 > LAME V3

More schematically:
• ANOVA: MPC_Q5 > OGG_Q6 > OGG_Q5,99/Q5,50/MP3_V2/MP3_V3
• ANOVA: OGG_Q5,99 > LAME V3
• ANOVA: LAME_V2 > LAME V3

• TUKEY_PARAMETRIC: MPC_Q5 > OGG_Q6 > OGG_Q5,99/Q5,50/MP3_V2/MP3_V3
• TUKEY_PARAMETRIC: LAME_V2 > LAME V3


In other words, it means that for me, after double-blind tests on non-critical material:
- musepack --standard superiority is not a legend, and isn't overturned by the recent progress made by the lame developers and the vorbis people.
- lame --preset standard is still competitive against vorbis, at least up to q5,99, which still suffers from audible and sometimes irritating coarseness. Nevertheless, the quality of lame MP3 drops quickly below this standard preset. This is worth noting for hardware playback.
- vorbis aoTuV/CVS 1.1 begins to be suitable for high quality at q 6,00, but absolutely not below this floor.


APPENDIX. SAMPLE LOCATION AND ABX LOGS

ABX logs are available here:
http://audiotests.free.fr/tests/2004.07/hq1/log/
The eight new log files are merged into one single archive.

Samples are not uploaded, but I could upload them if someone is interested.

This post has been edited by guruboolez: Dec 29 2005, 21:51

eagleray
post Aug 22 2004, 14:31
Post #92








So many numbers... ouch.

Thank you, Guruboolez, for all your work. There is definitely a shortage of encoder comparison tests at relatively high bitrates, and no shortage of opinions.

One thing continues to bug me:

Someone with really good hearing, including the training to listen for artifacts, can do a valid abx comparison and produce results at a good confidence level.

Someone like me can not.

Am I better off using the encoder that the person with good hearing rates higher? In other words, even if I cannot objectively identify the differences in ABX testing, is there some subjective additional level of enjoyment of the music, other than a possible placebo effect? Is there any way to verify this?

There is the final unfortunate truth: MP3 hardware support is universal, Ogg Vorbis hardware support is relatively limited along with the battery life issue, and MPC is confined to playback on computers.

ff123
post Aug 22 2004, 15:16
Post #93


ABC/HR developer, ff123.net admin





QUOTE (guruboolez @ Aug 22 2004, 04:53 AM)
According to the last analysis, lame -V3 and vorbis megamix1 -q 5,50/5,99 offer comparable performance (they are tied). In other words, I can't say that megamix at -q 5,99 is superior to lame -V3, even though 13 samples (72%) are favorable to megamix 5,99, one is identical (6%) and only four (22%) are favorable to lame V3. If I understand correctly, for me and this set of 18 tested samples, I should admit that lame is tied with vorbis even though the latter is superior on 72% of the tested samples! That's totally insane in my opinion... There's maybe a problem somewhere, or are 18 samples still not enough?


I verified with the bootstrap program:

http://ff123.net/bootstrap/

that statistically speaking, if you adjust for the fact that there are actually 15 comparisons with 6 codecs, then ogg5.99 must be considered tied to lamev3. The bootstrap (simulation) program is almost as good as one can do for adjusted p-values.

Nice comparison, guru.

CODE
                            Adjusted p-values
        OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3
MPC_Q5   0.021*   0.000*   0.000*   0.000*   0.000*
OGG6.00    -      0.001*   0.000*   0.000*   0.000*
LAME_V2    -        -      0.633    0.367    0.021*
OGG5.99    -        -        -      0.633    0.128
OGG5.5     -        -        -        -      0.367

                            Means
MPC_Q5   OGG6.00  LAME_V2  OGG5.99  OGG5.5   LAME_V3
3.844    3.256    2.467    2.311    2.172    1.889


ff123

kuniklo
post Aug 22 2004, 17:28
Post #94








Thanks very much for taking all the time to do these comparisons, Guru. So much has changed since all the original high-bitrate comparisons were made that it's very useful to get new data. I guess I'll continue using mpc myself. :D

Pio2001
post Aug 23 2004, 00:44
Post #95


Moderator





Thank you very much for your work and analyses!


QUOTE (guruboolez @ Aug 22 2004, 02:53 PM)
In other words, I can't say that megamix at -q 5,99 is superior to lame -V3, even though 13 samples (72%) are favorable to megamix 5,99, one is identical (6%) and only four (22%) are favorable to lame V3. [...] That's totally insane in my opinion...


I don't see what's wrong with it. If you interpret it as an ABX test, you got Megamix superior to Lame with a score of 13/18. The p-value is 0.048, which is already very borderline for a valid result.
But here, 6 codecs are compared, which gives a total of 15 possible pairwise comparisons. If you are answering at random, it is perfectly expected that among those 15 possible one-to-one comparisons, one comes out positive with p = 1/15. That would still be complete chance, with a real p clearly higher than 0.5, not 1/15.
In the same way, the 13/18 result that you got does not have a guessing probability of 0.048, but a much higher one: the analysis says it is 0.633. And if this can happen more than one time out of two, I should be able to reproduce it easily with random results.

First try, with completely random numbers generated by my calculator:

CODE
Joke1   Joke2   Joke3  Joke4  Joke5   Joke6
3.40    1.70    4.70   1.30   2.10    1.10
4.70    4.70    1.60   3.60   2.30    1.20
3.70    1.90    2.50   1.10   2.50    4.30
2.50    3.10    3.90   3.40   2.40    4.00
1.30    4.60    4.40   3.40   1.50    2.50
4.00    1.20    2.40   4.90   4.30    1.50
3.40    2.50    4.50   1.40   3.10    2.00
1.20    3.30    4.50   4.10   2.50    1.90
4.50    4.30    4.70   4.70   5.00    4.30
4.50    4.10    3.10   4.50   2.60    3.40
2.60    2.30    1.80   4.80   3.00    1.90
2.40    2.20    2.10   4.00   2.60    2.80
1.20    1.80    1.10   1.10   3.90    3.30
4.90    1.30    2.40   4.60   4.20    2.20
2.50    2.10    4.70   4.00   4.80    1.50
1.90    3.10    3.80   1.50   3.90    2.80
1.30    4.70    3.40   3.10   3.20    2.70
4.30    2.30    3.70   1.80   1.30    4.10


No score as good as 13/18.

Second try:

CODE
Joke1 Joke2 Joke3 Joke4 Joke5 Joke6
14    31    50    17    34    29
13    22    21    36    23    23
50    48    17    31    14    11
28    49    24    50    43    50
12    48    23    33    22    43
40    28    25    15    47    33
23    13    37    29    38    30
41    40    19    25    33    18
28    48    40    12    13    44
32    25    40    26    49    17
11    29    43    15    36    47
41    18    22    22    24    44
15    13    25    13    39    48
16    17    17    40    37    24
30    29    49    29    12    43
33    40    14    49    42    48
19    47    11    47    40    31
42    34    41    24    25    21


Here you can see that Joke6 is better than Joke1 13 times out of 18, with random numbers! This is not an insane result; two tries were enough for it to happen.
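The same experiment can be automated (a sketch along the lines of the calculator trials above, assuming uniformly random scores):

CODE
# How often does SOME pair among 6 random "codecs" reach 13/18 or better,
# purely by chance? (For a single fixed pair, the chance would be about 0.048.)
import numpy as np

rng = np.random.default_rng(1)
trials, hits = 2000, 0
for _ in range(trials):
    scores = rng.random((18, 6))    # 18 samples x 6 joke codecs
    wins = (scores[:, :, None] > scores[:, None, :]).sum(axis=0)
    if (wins >= 13).any():          # some pair wins on 13 samples or more
        hits += 1
print(hits / trials)                # a large fraction of the trials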

guruboolez
post Aug 23 2004, 09:42
Post #96








I don't understand what these random numbers are supposed to prove.
I've tested some codecs with 18 samples. Comparing two of these encoders, I saw that one is inferior to the other on 78% of the tested samples, and 'identical' on 6%. It should be very obvious that one is ABSOLUTELY inferior to the other, at least on the 18 tested samples.

Pio2001
post Aug 23 2004, 11:58
Post #97


Moderator





If someone runs an ABX test with me as the listener, plays A 18 times, and I say 78% of the time that it is B, is it obvious that B was absolutely played 78% of the time?

But again, this discussion only matters for the interpretation of the ANOVA and Tukey analyses. Here, you also got some ABX results, whose meaning goes far beyond what ANOVA and Tukey say. We must consider the ABX results separately from the ABC/HR analyses in order to draw a general conclusion.

ff123
post Aug 23 2004, 16:03
Post #98


ABC/HR developer, ff123.net admin





QUOTE (guruboolez @ Aug 23 2004, 12:42 AM)
I don't understand what these random numbers are supposed to prove.
I've tested some codecs with 18 samples. Comparing two of these encoders, I saw that one is inferior to the other on 78% of the tested samples, and 'identical' on 6%. It should be very obvious that one is ABSOLUTELY inferior to the other, at least on the 18 tested samples.


The adjustment for multiple comparisons can be harsh. That's why it's good to keep the number of comparisons down to a minimum. If you had just compared Ogg5.99 against lameV3, it's likely you would have come up with a significant difference. But with so many comparisons, the statistical "noise" gets larger.

ff123

guruboolez
post Aug 23 2004, 19:58
Post #99








QUOTE (ff123 @ Aug 23 2004, 04:03 PM)
(...) That's why it's good to keep the number of comparisons down to a minimum. (...)


But is it really up to the tester to adapt his test to the analysis tool? Isn't it more logical to ask the analysis tool to deal with the conditions of the test?

It sounds like the methodological problems introduced by VBR tests at a target bitrate: there's sometimes a big temptation to select specific samples (not too high, not too low) in order to match the target bitrate, rather than choosing the samples we really want to test, which could be more interesting. If a tester chooses to avoid some samples for this reason, the risk is to limit the impact (and maybe the significance) of the test.

Same thing here. It's probably better to limit the number of comparisons, for many reasons. But on the other hand, it will be harder to form solid ideas about the relative performance of different encoders.
With my test, for example, I now have solid ideas about:
- the big difference between vorbis -q6 and the lower profiles, including 5,99
- the very limited difference between vorbis 5,50 and 5,99 (therefore, there's little to expect from increasing the bitrate by a 0.2...0.5 step)
- the serious differences between lame --preset standard and -V2

If I had removed three contenders, keeping one lame setting and one vorbis quality level, the three previous conclusions wouldn't be possible. And if I had tested vorbis and lame separately in two different sessions, I couldn't seriously compare the results with each other (such comparisons need at least the same low and high anchors, which makes two separate tests with three contenders each + 2 anchors much longer than one single test with 6 contenders).


In other words, I don't think it would be a good idea to adapt the conditions of a test to the conditions of the analysis tool. The analysis must be passive, with no influence on the subject of the analysis.

ff123
post Aug 23 2004, 21:35
Post #100


ABC/HR developer, ff123.net admin





QUOTE (guruboolez @ Aug 23 2004, 10:58 AM)
In other words, I don't think it would be a good idea to adapt the conditions of a test to the conditions of the analysis tool. The analysis must be passive, with no influence on the subject of the analysis.


That's fine. A tester can set up any test he likes, but the fact is that the test conditions affect the subsequent analysis. So you've got to be aware of this when you set up your test. In this particular case, if you really wanted to be certain that ogg5.99 is better than lameV3 (for your ears and samples), then you should run another test with just the two codecs to confirm it.

That's the way statistics works. You go into a test with your criteria for significance set prior to running the test (meaning you should choose which analysis you're going to run prior to the test as well; i.e., ANOVA or Tukey's, etc.). And then live with the results.

ff123