ABX infallible?, What about type I/II errors?
kjoonlee
post Apr 11 2006, 04:30
Post #1





Group: Members
Posts: 2526
Joined: 25-July 02
From: South Korea
Member No.: 2782



Hi, I have a question.

From http://www.hydrogenaudio.org/forums/index....showtopic=31945 :

QUOTE (AtaqueEG @ Apr 11 2006, 12:15 PM) *
QUOTE (woody_woodward @ Apr 10 2006, 01:13 PM) *
QUOTE (Jebus @ Apr 10 2006, 08:59 AM) *
You're on the wrong forum, buddy. ABX tests are infallible....
Infallible?? Well, if you say so. I stand corrected.
They are, and they are reproducible. And they are the most scientific way of proving such a statement.

One is entitled to his opinion, but once he decides to share it, it must be backed up by evidence.

I don't have a real firm grasp of statistics, but I remember hearing that there's a tradeoff between the possibility of type I errors and the possibility of type II errors. In other words, I thought there was a tradeoff between false positives and false negatives.

If it's true, how does the double-blind ABX method account for this? Or is that unnecessary?

Thank you in advance.


--------------------
http://blacksun.ivyro.net/vorbis/vorbisfaq.htm
Shade[ST]
post Apr 11 2006, 05:01
Post #2





Group: Members
Posts: 1189
Joined: 19-May 05
From: Montreal, Canada
Member No.: 22144



ABX tests can only prove there is a difference between two tested materials. Since it cannot prove there is no difference, the number of false negatives is infinite, and thus the number of false positives is null.
Pio2001
post Apr 11 2006, 16:43
Post #3


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



You divide by zero!

Seriously, of course ABX tests are not infallible. This is not directly related to type I or type II errors, as their relationship relies on the assumption that listeners always give a given proportion of wrong answers.

Someone gave an excellent summary of the drawbacks of ABX testing in a French forum: http://chaud7.forum-gratuit.com/viewtopic....r=asc&start=450
However, since the text is almost incomprehensible even for native French speakers, I'll have to make a summary.

Most often, an event whose probability of occurring by chance is smaller than 1/20 is considered "statistically significant". This p value involves no interpretation: it is the result of a mathematical calculation relying only on what has been observed. Previous results from similar tests, the quality of the test, and other statistical considerations are not taken into account, yet these factors do influence the probability that the observed difference is real.
  • Number of testers: studies made with a small number of listeners are more sensitive to mistakes occurring in the test setup (wrong stimulus presented, mistakes made while copying the results, etc.). For this reason, when the result depends on one or two people, conclusions must be cautious.
  • Repetition of tests: there is a greater chance of having obtained a success after N tests have been performed than after performing only one test. For example, if we test something that has no effect, the result we get is decided by chance alone. Imagine that 20 people run independent tests. On average, one of them should get a "false positive" result, since a positive result is by definition something that occurs no more than one time out of 20. The p calculation of each individual test does not take this into account (see the numerical sketch after this list).
  • Multiple comparisons: if we compare two groups of the population using one criterion, and there is no real difference, there is less than 1 chance out of 20 of getting a "statistically significant difference" between the two. However, if we consider 20 independent criteria, the probability of getting a significant difference on at least one of them is much higher than 1/20.
    For example, if people are asked to rate the "dynamics", "soundstage", and "coloration" of an encoder, the probability of getting a false positive is about three times as high as with one criterion only, since there are three opportunities for the event to occur. Once again, the p value associated with each comparison understates the real probability of getting a false positive.
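
For illustration, here is a rough Python sketch of these points. The 16-trial run with 12 correct answers, the 20 independent listeners, and the three rating criteria are assumed figures chosen to match the examples above, not data from any actual test.

CODE
from math import comb

def binomial_p(successes, trials, p_null=0.5):
    """One-sided p value: the chance of getting at least `successes` correct
    answers out of `trials` if the listener is purely guessing (p = 0.5)."""
    return sum(comb(trials, k) * p_null**k * (1 - p_null)**(trials - k)
               for k in range(successes, trials + 1))

# A single ABX run: 12 correct out of 16 (assumed figures)
p_single = binomial_p(12, 16)        # ~0.038, "significant" at the 1/20 level

# Repetition of tests: chance that at least one of 20 guessing listeners
# reaches p < 0.05 purely by luck
p_any_of_20 = 1 - (1 - 0.05) ** 20   # ~0.64

# Multiple comparisons: chance of at least one false positive when three
# criteria (dynamics, soundstage, coloration) are each tested at the 0.05 level
p_any_of_3 = 1 - (1 - 0.05) ** 3     # ~0.14, roughly three times 0.05

print(p_single, p_any_of_20, p_any_of_3)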


The original text is much longer, with some repetition and other ideas that I didn't translate because they are not directly related to the reliability of ABX tests.

I would, however, like to add an important point: the interpretation of the p value.
By convention, p < 5% is considered an interesting result, and p < 1% a very significant one. This does not take into account the tested hypothesis itself.

If we are testing the existence of Superman and get a positive answer, that is, "Superman really exists because the probability of getting this result by chance is less than 5%", must we accept the existence of Superman? Is it an infallible, scientific proof of his existence?
No, it's just chance. Getting an event whose probability is less than 5% is not uncommon.
However, when a listening test of MP3 at 96 kbps gives a similarly significant result, we accept the opposite conclusion: that it was not chance. Why?
Why should the same scientific result be interpreted in two opposite ways? It is because we always keep the most probable hypothesis. The conclusion of an ABX test is not the p value alone, it is its comparison with the subjective (prior) probability of the tested hypothesis.

Testing MP3 at 96 kbps, what do we expect? Anything. We start with the assumption that the odds of success are 1/2. The ABX result then tells us that the odds of failure are less than 1/20. Conclusion: success is the more probable hypothesis.
Testing the existence of Superman, what do we expect? That he does not exist. We start with the assumption that the odds of success are less than one in a million. The ABX result then tells us that the odds of failure are less than 1/20. Conclusion: failure is still the more probable hypothesis.
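
A minimal sketch of that comparison, assuming a 5% false-positive rate and an invented 80% chance that a real difference produces a positive ABX result (the original post gives only the 1/2 and one-in-a-million priors and the 1/20 threshold):

CODE
def probability_real(prior, alpha=0.05, power=0.8):
    """Chance that the tested difference is real, given one positive (p < alpha)
    result.  `power` (the chance of a positive result when the effect is real)
    is an assumed figure; the p value alone does not provide it."""
    true_positive = prior * power
    false_positive = (1 - prior) * alpha
    return true_positive / (true_positive + false_positive)

# MP3 at 96 kbps: no strong expectation either way, prior 1/2
print(probability_real(prior=0.5))    # ~0.94: the difference is probably real

# Superman: prior of one in a million
print(probability_real(prior=1e-6))   # ~0.000016: chance remains the better explanation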

That's why, in addition to all the statistical biases already mentioned above, we should not always take 1/20 or 1/100 as the target final p value. This is fine for tests where we don't expect one result more than another, but for tests where scientific knowledge already gives some information, smaller values can be necessary.
Personally, in order to test the existence of Superman, I'd rather target p < 1/100,000,000.
krabapple
post Apr 14 2006, 00:22
Post #4





Group: Members
Posts: 2519
Joined: 18-December 03
Member No.: 10538



Excellent post, Pio. May I suggest that it be added to the "What is a blind ABX test?" pinned thread in General Audio?

(actually it seems to me that that pinned thread should be in *this* forum too)
cabbagerat
post Apr 14 2006, 07:34
Post #5





Group: Members
Posts: 1018
Joined: 27-September 03
From: Cape Town
Member No.: 9042



Pio's post should also be linked to in the FAQ, in my opinion.


--------------------
Simulate your radar: http://www.brooker.co.za/fers/
Pio2001
post Apr 14 2006, 08:46
Post #6


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



I added this text to the ABX topic: http://www.hydrogenaudio.org/forums/index....showtopic=16295

This way, I don't think it is necessary to add a link in the FAQ.
mnhnhyouh
post May 7 2006, 11:32
Post #7





Group: Members
Posts: 17
Joined: 30-April 06
Member No.: 30215



QUOTE (kjoonlee @ Apr 11 2006, 13:30) *
Hi, I have a question.

From http://www.hydrogenaudio.org/forums/index....showtopic=31945 :


I don't have a real firm grasp of statistics, but I remember hearing that there's a tradeoff between the possibility of type I errors and the possibility of type II errors. In other words, I thought there was a tradeoff between false positives and false negatives.

If it's true, how does the double-blind ABX method account for this? Or is that unnecessary?

Thank you in advance.


A Type I error is to reject the null hypothesis when the null hypothesis is true.

A Type II error is to accept the null hypothesis when it is false. (I say "accept" when really it is a failure to reject.)

More clearly:

You are listening to two tracks that are identical, yet you manage to identify track A "correctly" a statistically significant number of times. You are then able to say they are different, but they are not. This is a Type I error.

You are listening to two different tracks, but you cannot tell the difference in a statistically significant manner. This is a Type II error.

So before somebody claims that a given encoding rate is transparent, they should calculate the power of their test; this will give them some indication of how often, given the variability of their results, they will commit a Type II error. If they have insufficient replication, they may be claiming transparency with only a (making up numbers here) 1/16 chance of detecting a difference.
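
As an illustration, here is a rough Python sketch of such a power calculation for a simple ABX run, assuming a listener who actually hears the difference 70% of the time (that figure, and the trial counts, are invented for the example):

CODE
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def abx_power(n_trials, p_true, alpha=0.05):
    """Power of an n-trial ABX test: the chance of reaching significance at
    level `alpha` if the listener's real probability of a correct answer is p_true."""
    # smallest number of correct answers that is significant under pure guessing
    k_crit = next(k for k in range(n_trials + 1) if tail(k, n_trials, 0.5) <= alpha)
    return tail(k_crit, n_trials, p_true)

print(abx_power(8,  0.7))   # ~0.26: a short run will usually miss the difference
print(abx_power(16, 0.7))   # ~0.45
print(abx_power(40, 0.7))   # ~0.81: Type II errors become much rarer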

QUOTE (Shade[ST] @ Apr 11 2006, 14:01) *
ABX tests can only prove there is a difference between two tested materials. Since it cannot prove there is no difference, the number of false negatives is infinite, and thus the number of false positives is null.


I agree we cannot prove that two things are the same, but we can calculate our chance of making an error when we do not reject the null.

We use a 1/20 (5%) chance of making a Type I error as our base rate, but generally fail to determine the Type II error rate. We should.
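
A short sketch of that tradeoff, using the same kind of binomial calculation: with a fixed number of trials, lowering the Type I threshold raises the Type II rate. The 16 trials and the 70% true detection probability are assumed figures, not from any real test.

CODE
from math import comb

def tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p_true = 16, 0.7   # 16 trials; listener assumed to hear the difference 70% of the time
for alpha in (0.05, 0.01):
    k_crit = next(k for k in range(n + 1) if tail(k, n, 0.5) <= alpha)
    beta = 1 - tail(k_crit, n, p_true)    # Type II error rate at this criterion
    print(alpha, k_crit, round(beta, 2))  # 0.05 -> need 12/16, beta ~0.55; 0.01 -> need 14/16, beta ~0.90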

h
