Probability of passing a sequential ABX test, Split from "Mad Challenge - My Results"
KikeG
post Nov 10 2003, 14:39
Post #1


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



Edit : this discussion was started there.

I like viewing my ABX results during the test, and stopping when I have reached a low enough p. That's what some people call a "sequential" ABX test. As Pio2001 says, for this kind of test, getting p = 5% is not enough to pass the test. It seems (someone correct me if I'm wrong, please) that, for this kind of sequential test, getting p < 1% is enough to say that the test has been passed.

This post has been edited by tigre: Nov 12 2003, 16:48
guruboolez
post Nov 10 2003, 14:55
Post #2





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (KikeG @ Nov 10 2003, 02:39 PM)
I like viewing my ABX results during the test, and stopping when I have reached a low enough p. That's what some people call a "sequential" ABX test. As Pio2001 says, for this kind of test, getting p = 5% is not enough to pass the test. It seems (someone correct me if I'm wrong, please) that, for this kind of sequential test, getting p < 1% is enough to say that the test has been passed.

Does it mean that 5% is a completely useless value? Or does it mean that with 5-15%, there is still a (serious) presumption of an audible difference?
I'm really interested in this.
I generally perform ABX tests quickly, making some mistakes that I could avoid by being more meticulous (listening carefully to A, listening carefully to B, etc., then validating my choice). I'm sure that I can avoid most of them and obtain very good ABX scores, because I sometimes take the time to perform a precise, long and boring test. Quick tests gave me a pval of 5...15%, and meticulous ones are < 1%.
KikeG
post Nov 10 2003, 15:18
Post #3


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



5% is valid only if either you have fixed in advance the total number of trials to perform, or you don't look at the results until the whole test is finished (or both, obviously).

I wouldn't consider a p of 10% a very reliable indication of audible difference.

This post has been edited by KikeG: Nov 10 2003, 15:24
tigre
post Nov 10 2003, 15:22
Post #4


Moderator


Group: Members
Posts: 1434
Joined: 26-November 02
Member No.: 3890



Is there a rule of thumb for tests without a fixed number of trials to get a more realistic result?

Something like: "If the last trial was successful, it doesn't count."


--------------------
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello
guruboolez
post Nov 10 2003, 15:25
Post #5





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (KikeG @ Nov 10 2003, 03:18 PM)
5% is valid only if either you have fixed in advance the number of trials to perform, or you don't look at the results until the test is finished (or both, obviously)

I wouldn't consider a p of 10% a very reliable indication of audible difference, either.

Should I understand that in your opinion, pval = 10% and pval = 95% mean the same thing? Or is there some nuance between "valid" statements and "invalid" results?

I'm not a statistician, just an average user. Common sense says that a test concluding there is a difference, with only a 10% chance of guessing, is something to take into serious consideration (especially for high-bitrate encodings, and for very short samples).
KikeG
post Nov 10 2003, 15:51
Post #6


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (guruboolez @ Nov 10 2003, 03:25 PM)
Should I understand that in your opinion, pval = 10% and pval = 95% mean the same thing? Or is there some nuance between "valid" statements and "invalid" results?

I don't understand what you mean, sorry. pval=5% (or 0.05) means a 95% confidence level, and pval=10% (or 0.1) means a 90% confidence level.

As a rule, a test is considered to be passed only if you achieve p<5% on a non-sequential test. It seems that p<1% is enough for sequential tests, in order to compensate for the effects Pio2001 talked about.

This post has been edited by KikeG: Nov 10 2003, 15:57
KikeG
post Nov 10 2003, 15:53
Post #7


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (tigre @ Nov 10 2003, 03:22 PM)
Is there a rule of thumb for tests without a fixed number of trials to get a more realistic result?

Something like: "If the last trial was successful, it doesn't count."

I don't understand very well what you mean in this last sentence. I think I explained it in my previous posts. 5% is valid only under certain conditions. If they are not met, you must go for at least 1%.

This post has been edited by KikeG: Nov 10 2003, 15:53
tigre
post Nov 10 2003, 16:10
Post #8


Moderator


Group: Members
Posts: 1434
Joined: 26-November 02
Member No.: 3890



QUOTE (KikeG @ Nov 10 2003, 04:53 PM)
I don't understand very well what you mean in this last sentence.

I'll try to explain better:

IMO, especially for difficult samples, it's unpredictable how many trials are needed, because on one hand performance often becomes better after a few trials (training effect), and on the other hand at some point it starts to become worse because of fatigue / boredom / impatience.

Because of this I would prefer to perform tests this way:

I aim to reach a certain pval score and finish as soon as I've reached it. Example:

I want to reach p = 0.1 or lower. My results:

0 of 1, p = 1.000
1 of 2, p = 0.750
2 of 3, p = 0.500
3 of 4, p = 0.313
4 of 5, p = 0.188
4 of 6, p = 0.344
5 of 7, p = 0.227
6 of 8, p = 0.145
7 of 9, p = 0.090
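
For reference, the running p-values in this list are consistent with a one-sided binomial tail: the chance of getting at least that many correct answers by pure guessing (50% per trial). The sketch below assumes that is what the ABX tools display; the function name `abx_p_value` is made up for illustration and is not taken from WinABX or ABC/HR.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of at least `correct` hits in `trials` fair 50/50 guesses."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


# Running scores from the example, as (correct, trials) pairs.
scores = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5),
          (4, 6), (5, 7), (6, 8), (7, 9)]

for correct, trials in scores:
    print(f"{correct} of {trials}, p = {abx_p_value(correct, trials):.3f}")
```

The last printed digit may differ from a given tool's display depending on how it rounds. Note that this number is only the fixed-length p-value for each line taken on its own; it says nothing yet about the effect of stopping at the first line that looks good.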

Since I haven't fixed the number of trials beforehand, the result isn't really 0.090, as you explained. My question is:

How would it be possible to get a valid result (p = 0.1 or lower) without fixing the number of trials beforehand?

"If the last trial was successful, it doesn't count." would mean that one more trial needs to be done; if it is successful (-> 8/10), the result p = 0.090 is correct, otherwise the test will continue.


--------------------
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello
ScorLibran
post Nov 10 2003, 16:18
Post #9





Group: Banned
Posts: 769
Joined: 1-July 03
Member No.: 7495



QUOTE (tigre @ Nov 10 2003, 10:10 AM)
I want to reach p = 0.1 or lower...

p = 0.1 = 10%.

I think you mean p = 0.01 = 1% for a sequential test, right?

(Forgive me if I'm not reading this correctly.)
Lev
post Nov 10 2003, 16:19
Post #10





Group: Members
Posts: 524
Joined: 7-November 02
From: Gloucester, UK
Member No.: 3716



QUOTE
Usually, a p <= 0.05 is considered a significant result. This is pretty close though, which to me indicates that there is a very good chance that more testing would result in a statistically significant result.

So, Gabriel scored p = 0.084, or, 17 out of 26.

To me, that is nowhere near significant. On any given day, if I flip a coin 26 times, I could get 17 heads. Yet Gabriel's confidence level is ~92%, which seems excessive. I guess it's just the English I'm having trouble with, i.e. the word 'confidence'...

How is the P Value calculated, just out of interest?


--------------------
http://www.megalev.co.uk
tigre
post Nov 10 2003, 16:43
Post #11


Moderator


Group: Members
Posts: 1434
Joined: 26-November 02
Member No.: 3890



QUOTE (ScorLibran @ Nov 10 2003, 05:18 PM)
QUOTE (tigre @ Nov 10 2003, 10:10 AM)
I want to reach p = 0.1 or lower...

p = 0.1 = 10%.

I think you mean p = 0.01 = 1% for a sequential test, right?

It was just an example and I chose 0.1, not 0.01 to save space.


--------------------
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello
tigre
post Nov 10 2003, 16:44
Post #12


Moderator


Group: Members
Posts: 1434
Joined: 26-November 02
Member No.: 3890



QUOTE (Lev @ Nov 10 2003, 05:19 PM)
How is the P Value calculated, just out of interest?

B) <- click!


--------------------
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello
guruboolez
post Nov 10 2003, 17:33
Post #13





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (KikeG @ Nov 10 2003, 03:51 PM)
QUOTE (guruboolez @ Nov 10 2003, 03:25 PM)
Should I understand that in your opinion, pval = 10% and pval = 95% mean the same thing? Or is there some nuance between "valid" statements and "invalid" results?

I don't understand what you mean, sorry. pval=5% (or 0.05) means a 95% confidence level, and pval=10% (or 0.1) means a 90% confidence level.

As a rule, a test is considered to be passed only if you achieve p<5% on a non-sequential test. It seems that p<1% is enough for sequential tests, in order to compensate for the effects Pio2001 talked about.

What I mean is: if a test finishes with a score of xx/16 and pval = 0.10, how will you consider this test? You're asking for pval = 0.05 to consider a test as "passed", but what about 0.10, or 0.15? Will you consider this score bad? Useless? Without real significance? In other words, does pval = 0.15 (some errors during ABX) have the same meaning as pval = 0.95 (a lot of errors during ABX)?
AstralStorm
post Nov 10 2003, 17:45
Post #14





Group: Members
Posts: 745
Joined: 22-April 03
From: /dev/null
Member No.: 6130



I'd very much appreciate an option (in ABC/HR and its Java counterpart) to clear ABX results after changing the selected time,
since I like to use ABX to find differences, and misses make the score go bad before I find the part I feel I'm able to ABX.
Maybe an option to clear the results? That would help to reduce the warm-up effect.
(You could do the test any number of times before recording the results.)


--------------------
ruxvilti'a
Gabriel
post Nov 10 2003, 17:54
Post #15


LAME developer


Group: Developer
Posts: 2950
Joined: 1-October 01
From: Nanterre, France
Member No.: 138



Based on the current discussion, I tried an experiment:
Completely random guessing with abc/hr (not even listening to the files).
Result:
18 of 26, p = 0.038

I am wondering if there could be something wrong with the computations...
Gabriel
post Nov 10 2003, 17:58
Post #16


LAME developer


Group: Developer
Posts: 2950
Joined: 1-October 01
From: Nanterre, France
Member No.: 138



Still continuing the experiment:
54 of 85, p = 0.008
(still random)
....

72 of 114, p = 0.003
...

78 of 127, p = 0.006

That is incredible: I can randomly generate meaningful results.

2 possibilities:
*I am gifted and am able to do some divination
*we can not trust the current results of abc/hr

This post has been edited by Gabriel: Nov 10 2003, 18:01
guruboolez
post Nov 10 2003, 18:01
Post #17





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (Gabriel @ Nov 10 2003, 05:54 PM)
Based on the current discussion, I tried an experiment:
Completely random guessing with abc/hr (not even listening to the files).
Result:
18 of  26, p = 0.038

I am wondering if there could be something wrong with the computations...

I sometimes had the same results.
Like AstralStorm, I find the lack of a RESET function annoying. Therefore, I artificially restart a test by going up to 100 trials, then performing another 16 trials. It's sometimes amusing to see the pval score at 100 trials. With a good but not perfect basis (something like 18/26), the final pval of xx/100 is sometimes below 0.05!

There must be something wrong with the pvalue, especially when many trials have been performed.
Gabriel
post Nov 10 2003, 18:04
Post #18


LAME developer


Group: Developer
Posts: 2950
Joined: 1-October 01
From: Nanterre, France
Member No.: 138



....
119 of 200, p = 0.004

...

(am I god?)
ErikS
post Nov 10 2003, 18:15
Post #19





Group: Members
Posts: 757
Joined: 8-October 01
Member No.: 247



QUOTE (Gabriel @ Nov 10 2003, 06:04 PM)
....
119 of 200, p = 0.004

...

(am I god?)

Go and buy a lottery ticket while your luck still holds! smile.gif
guruboolez
post Nov 10 2003, 18:25
Post #20





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (Gabriel @ Nov 10 2003, 06:04 PM)
....
119 of 200, p = 0.004

...

(am I god?)

No, you missed the right answer 81 times, but the pval tells you that there is little chance that you were guessing. 7/8 is less significant than 60/100... In other words, if you're planning to perform a difficult ABX test, it's easier to obtain significant results by targeting 100 trials than 8.

Note that foobar2000 ABX component didn't compute pvalue after 20 trials.
guruboolez
post Nov 10 2003, 18:49
Post #21





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



On ABX tests, if you make one error in every three trials, then you will have:

6/9, p = 0.254
10/15, p = 0.151 (15%)
20/30, p = 0.049 (<5%)
30/45, p = 0.018 (<2%)

more trials = more significant results.


It's good to know that. If you try to ABX something difficult and want to prove that you're right, 50 trials are better than 16 ;-)
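
To illustrate the point about a constant hit rate, here is a rough sketch that searches for the first test length at which "two correct out of every three" drops below p = 0.05. It assumes the usual one-sided binomial p-value and a test length fixed in advance; `abx_p_value` and `trials_needed` are hypothetical helper names, not functions from any ABX tool.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of at least `correct` hits in `trials` fair 50/50 guesses."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def trials_needed(hits: int, per: int, alpha: float = 0.05, limit: int = 300):
    """Smallest test length (a multiple of `per`) where a steady rate of
    `hits` correct per `per` trials gives p < alpha, or None if the
    limit is reached first."""
    for blocks in range(1, limit // per + 1):
        trials = blocks * per
        if abx_p_value(blocks * hits, trials) < alpha:
            return trials
    return None


# A listener who is right two times out of three.
print(trials_needed(2, 3))
```

This only holds when the number of trials is chosen before the test starts. Picking the length after watching the running score is exactly the sequential problem discussed in this thread.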
Moneo
post Nov 10 2003, 19:00
Post #22





Group: Developer
Posts: 501
Joined: 22-January 03
From: Netherlands
Member No.: 4684



QUOTE
119 of 200, p = 0.004

I wonder what WinABX uses to generate random numbers. This might mean that there is a deficiency in it.
Continuum
post Nov 10 2003, 19:17
Post #23





Group: Members
Posts: 473
Joined: 7-June 02
Member No.: 2244



QUOTE (Gabriel @ Nov 10 2003, 07:04 PM)
....
119 of 200, p = 0.004

...

(am I god?)

What version of ABCHR are you using? There was some doubt whether the random number generator used in older versions is reliable or not.

The result does indeed mean: the probability of scoring 119 or better out of 200 by guessing is 0.0043.
Continuum
post Nov 10 2003, 19:28
Post #24





Group: Members
Posts: 473
Joined: 7-June 02
Member No.: 2244



QUOTE (guruboolez @ Nov 10 2003, 03:55 PM)
Does it mean that 5% is a complete useless value? Or does it mean that with 5-15%, there are still some (serious) presumptions about an audible difference?
I'm really interested about it.

Then you could read the Statistics For Abx-thread (long!).

But to give you an idea of how much the results are affected: think of a guessing tester who stops the test as soon as he reaches 0.95 confidence or the maximal length ( =: m) of the test. The probability of him passing the test is:

m=10 => p-val = 0.0508
m=20 => p-val = 0.0987
m=30 => p-val = 0.1295
m=50 => p-val = 0.1579
m=100 => p-val = 0.2021

See this excel sheet for reference.
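
As a rough cross-check of figures like these, the chance that a pure guesser "passes" can be computed exactly by following every possible running score and removing the ones that have already stopped. This is only a sketch under an assumed stopping rule (stop as soon as the one-sided binomial p is at or below 0.05); it is not the linked spreadsheet, and `pass_probability` is a made-up name. A slightly different rule would shift the numbers.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Chance of at least `correct` hits in `trials` fair 50/50 guesses."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def pass_probability(max_trials: int, alpha: float = 0.05) -> float:
    """Chance that a guesser reaches p <= alpha at some point within
    `max_trials` trials when checking the score after every trial."""
    # Maps "number correct so far" to the probability of being at that
    # score without having stopped yet.
    alive = {0: 1.0}
    passed = 0.0
    for n in range(1, max_trials + 1):
        spread = {}
        for correct, prob in alive.items():
            # Each trial is a coin flip: miss or hit.
            for new_correct in (correct, correct + 1):
                spread[new_correct] = spread.get(new_correct, 0.0) + prob / 2
        alive = {}
        for correct, prob in spread.items():
            if abx_p_value(correct, n) <= alpha:
                passed += prob  # the tester stops here and claims a pass
            else:
                alive[correct] = prob
    return passed


for m in (10, 20, 30, 50, 100):
    print(f"m={m} => {pass_probability(m):.4f}")
```

For m=10 the only ways to stop early under this rule are 5 of 5 and 7 of 8, which gives 1/32 + 5/256 = 13/256, or about 0.0508.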
ff123
post Nov 10 2003, 19:31
Post #25


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (Gabriel @ Nov 10 2003, 08:58 AM)
Still continuing the experiment:
54 of  85, p = 0.008
(still random)

Verified that this is the correct p using simulation:

http://ff123.net/abx/abx.php

If you were to repeat this 85 trial test many, many times, you would find that you can get a score of 54/85 or better (by guessing randomly), with a probability of 0.008.
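
The idea of checking a p-value by simulation can be sketched in a few lines. This is not the script behind the link above, only an illustration under stated assumptions: each trial is an independent fair coin flip, and the names `simulate_guessing` and `exact_tail` are invented for the example.

```python
import random
from math import comb


def exact_tail(correct: int, trials: int) -> float:
    """Exact chance of at least `correct` hits in `trials` fair guesses."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


def simulate_guessing(trials: int, threshold: int, runs: int, seed: int = 0) -> float:
    """Fraction of simulated guessing tests scoring at least `threshold`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = 0
    for _ in range(runs):
        # Each of the `trials` random bits stands for one guessed trial.
        correct = bin(rng.getrandbits(trials)).count("1")
        if correct >= threshold:
            successes += 1
    return successes / runs


estimate = simulate_guessing(trials=85, threshold=54, runs=100_000)
print(f"simulated: {estimate:.4f}, exact: {exact_tail(54, 85):.4f}")
```

With enough runs the simulated fraction should settle close to the exact tail, which is the sense in which the displayed p is "correct" for a test whose length was fixed at 85 in advance.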

I have put the output of ABC/HR several times through a random number generator "runs test" and it has passed.

During one of the previous versions, Hans Heijden thought the random number generation was suspicious (it showed moderate evidence against randomness in a runs test for his particular sequence), but when I tried it myself, it passed. I changed the random number generator for good measure, though, so as not to rely on random(). The built-in random function, at least with Visual C++ 6.0, did not appear to give completely random initial numbers when using time values which were fairly close together (on older versions of ABC/HR, I kludged this by initializing random() twice).
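
For readers unfamiliar with a runs test, here is a minimal sketch of the Wald-Wolfowitz version with a normal approximation. It is an assumption that this resembles the check described above; it is not the code actually used on ABC/HR output, and `runs_test_z` is a hypothetical name. The input is the sequence of X assignments coded as 0 and 1.

```python
from math import sqrt


def runs_test_z(bits: list) -> float:
    """Z-score of the number of runs in a 0/1 sequence.

    Values far from zero (say beyond about +/-2) suggest too few or too
    many switches between 0 and 1 for a random sequence.
    """
    ones = sum(1 for b in bits if b)
    zeros = len(bits) - ones
    if ones == 0 or zeros == 0:
        raise ValueError("sequence must contain both outcomes")
    # A new run starts wherever two neighbouring values differ.
    runs = 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)
    n = ones + zeros
    expected = 2 * ones * zeros / n + 1
    variance = 2 * ones * zeros * (2 * ones * zeros - n) / (n * n * (n - 1))
    return (runs - expected) / sqrt(variance)


# A strictly alternating sequence has far too many runs to look random.
print(runs_test_z([0, 1] * 50))
```

A sequence that passes this test is not proven random; the test only looks at how often the value switches.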

I purposely force the cumulative calculation of p in ABC/HR to prevent cherry picking, but one improvement that it really could use is the addition of p-value "profiles," to allow for statistically valid sequential testing to occur. A typical profile which Continuum and I analyzed would allow for a maximum of 28 trials.

ff123

Edit: ABC/HR also uses the Mersenne Twister to generate random numbers

This post has been edited by ff123: Nov 10 2003, 19:57