Probability of passing a sequencial ABX test, Split from "Mad Challenge - My Results"
schnofler (Java ABC/HR developer)
post Nov 10 2003, 19:34, Post #26
Just a few thoughts on this whole discussion: in general, I think there's no need to be all that dogmatic about the question "when can a test be considered meaningful?". That's why all those programs compute the p-val instead of just saying "test passed!" or "test failed!". The p-val tells you exactly one thing: the chance of achieving this result (or an even better one) by simply guessing. Now, if someone writes "I did an ABX test and received a p-val of 0.1", it's up to the reader to decide whether that's good enough.
You might very well say "Damn, if I hand headphones to 10 deaf monkeys, there's a nice 65% chance that at least one of them gets that result, so how the hell can this be meaningful?", or you might just say "Well, if I really had been guessing, there's a 90% chance of doing worse; that should be enough."
Of course, it depends on the circumstances. If you're just doing some quick private tests (this might apply to guruboolez's question), you're pretty sure you hear something, and you don't need perfect results anyway, then even a p-val of 0.1 might be enough for you. On the other hand, if you're trying to prove in public that flac sounds worse than wav (or something like that), you'd better make sure you can support that with a strong p-val if you expect people to believe your claims.
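For anyone who wants to check these numbers, the p-val the programs report is just a binomial tail probability. A minimal sketch in Python (my own illustration, not the code any of the ABX tools actually use):

```python
from math import comb

def abx_pval(correct, trials):
    """Chance of getting at least `correct` right out of `trials`
    by pure guessing (each trial is a fair coin flip)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

print(abx_pval(8, 8))   # a perfect 8/8: 1/256, about 0.0039

# A p-val of 0.1 means a guesser does this well (or better) 10% of the
# time, so with 10 independent guessers (the "deaf monkeys") the chance
# that at least one of them reaches it is:
at_least_one = 1 - (1 - 0.1) ** 10
print(round(at_least_one, 2))  # 0.65, the 65% mentioned above
```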

Now to some of the more recent posts:
QUOTE (Gabriel)
Still continuing the experiment:
54 of 85, p = 0.008
(still random)
....

72 of 114, p = 0.003
...

78 of 127, p = 0.006

That is incredible: I can randomly generate meaningful results.

2 possibilities:
*I am gifted and able to do some divination
*we cannot trust the current results of abc/hr

QUOTE (Gabriel)
....
119 of 200, p = 0.004

...

Well, I'd have to put my money on the first possibility. All the p-vals are correct. If you did this test systematically (like always saying "X is A"), then there might of course be the third possibility, that the random number generator favors A over B.
QUOTE (guruboolez)
In ABX tests, if you make one error every three trials, you will have:

6/9 = 0.250
10/15 = 0.150 (15%)
20/30 = 0.049 (<5%)
30/45 = 0.018 (<2%)

more trials = more significant results.

This is correct.
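Under the assumption that these are the usual one-sided binomial tails, the quoted table is easy to reproduce (an illustrative sketch, not guruboolez's actual tool):

```python
from math import comb

def abx_pval(correct, trials):
    # probability of scoring at least this well by guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# the same 2:1 ratio of right to wrong answers, at increasing lengths
for correct, trials in [(6, 9), (10, 15), (20, 30), (30, 45)]:
    print(f"{correct}/{trials}: p = {abx_pval(correct, trials):.3f}")
```

These reproduce the quoted values (up to rounding of the 6/9 case, which comes out at 0.254): the same error rate becomes more and more convincing as the test gets longer.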
QUOTE (guruboolez)
It's good to know that. If you're trying to ABX something difficult and to prove that you're right, better 50 trials than 16 ;-)

This is not. Actually these results make perfect sense. By guessing, you might very well guess two thirds of the trials correctly if you only do a few. But it's extremely improbable that you can maintain this two-thirds-streak for like 100 trials, if you really are only guessing. Conversely, if someone really manages to get two thirds right for 100 trials, you can be pretty sure he heard a difference.
So, if you can't hear any difference but you really want a great ABX result, you should do quite the opposite of what guruboolez suggests. If your result is already good at 16 trials, then by no means continue to 50: you'll only mess it up ;)
QUOTE (AstralStorm)
I'd very much appreciate an option (in ABC/HR and its Java counterpart) to clear ABX results after changing the selected time, as I like to use ABX to find differences, and misses make the score go bad before I find the part I feel I'm able to ABX.
Maybe an option to clear the results? That would help reduce the warm-up effect.
(You can do the test any number of times before recording the results.)

I don't think that's a good idea. It would make the results much less meaningful. If someone gets a p-val of 0.05 with one test, this is a pretty reliable result. But if he restarts the test 15 times, chances are he will get a p-val of 0.05 at least once (supposing he fixed the number of trials).
Also, I don't think the lack of a restart function poses much of a problem. You don't need to get a "perfect" score of 8/8 every time. If you messed up some trials in the beginning, but after that you can hear the difference reliably, you can just do some more trials and the p-val will decrease rapidly. A short example: you start your test and can't hear a difference in the beginning, and on top of that you have some serious bad luck, so you get only 2/8 correct (p-val of 0.96). But after that you can hear the difference very reliably, so you do some more trials (which will probably go much quicker than in the beginning) and manage to get 15/16 correct. Summed up, that's 17/24, with a respectable p-val of 0.03.
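The arithmetic of that example checks out (a quick sketch, assuming the usual one-sided binomial p-val):

```python
from math import comb

def abx_pval(correct, trials):
    # chance of at least `correct` right out of `trials` when guessing
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

bad_start = abx_pval(2, 8)     # the unlucky 2/8 start
combined = abx_pval(17, 24)    # after adding a reliable 15/16
print(round(bad_start, 2))     # 0.96
print(round(combined, 2))      # 0.03
```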

This post has been edited by schnofler: Nov 10 2003, 19:41
Moneo
post Nov 10 2003, 19:36, Post #27
The randomizer in WinABX does seem to be deficient. By repeatedly choosing a-b-a-b-a-b-... I got 114/202 (pval=0.039). With foobar2000's ABX component (which uses the Mersenne Twister to generate random numbers) I only got 106/202 with that strategy, which corresponds to a pval of ~0.25.

Edit: One might wonder why I did 202 trials and not 200... well, I simply didn't stop in time ;)

This post has been edited by Moneo: Nov 10 2003, 19:38
guruboolez
post Nov 10 2003, 20:14, Post #28
QUOTE (schnofler @ Nov 10 2003, 07:34 PM)
This is not. Actually these results make perfect sense. By guessing, you might very well guess two thirds of the trials correctly if you only do a few. But it's extremely improbable that you can maintain this two-thirds-streak for like 100 trials, if you really are only guessing. Conversely, if someone really manages to get two thirds right for 100 trials, you can be pretty sure he heard a difference.

Good point. But it clearly means that we have to be careful with the p-val. For example, when KikeG said that he wouldn't trust (too much) a p-val > 0.05, it means that if people want to convince him, it's better to send him a 30/45 than a 10/15. Or, put differently: if you have trouble maintaining concentration and achieving a good ABX score over 16 trials, then rather than starting another test, you should resume the first one and carry on to 45-50 trials. That of course supposes the tester is able to maintain two thirds right over 50 trials. I'm sure I could do it with some difficult samples: when 16/16 is strictly impossible, 30/45 isn't too difficult (not for ABXing FLAC against PCM, of course ;-)). I have often "failed" ABX tests: I did three, four or five different sessions of 16 trials, and all were 11/16 or 12/16. If I had decided to merge those small tests into one big test of around 60 trials, the conclusion would have changed from "failed" to "passed".
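To illustrate the merging point with hypothetical numbers close to those described (four sessions of 11/16; note that deciding to merge only after seeing the scores inflates the significance a little, for the same stop-when-it-looks-good reason discussed elsewhere in this thread):

```python
from math import comb

def abx_pval(correct, trials):
    # one-sided binomial tail: chance of guessing this well or better
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

sessions = [(11, 16)] * 4          # hypothetical scores, as in the text
for c, n in sessions:
    print(f"{c}/{n}: p = {abx_pval(c, n):.3f}")   # each alone fails at 0.05

pooled_c = sum(c for c, n in sessions)
pooled_n = sum(n for c, n in sessions)
print(f"pooled {pooled_c}/{pooled_n}: p = {abx_pval(pooled_c, pooled_n):.4f}")
```

Each 11/16 session alone gives a p-val of about 0.105, a "fail"; pooled into 44/64, the p-val drops well below 0.01.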

I agree with your first comment ("there's no need to be all that dogmatic about the issue of 'when can a test be considered meaningful?'"). ABX scores are nothing without precise comments about the conditions of the test. For example, I often got 10/12 on anchor-like encodings, but 12/12 on high-quality lossy encodings. The first is so easy that I need 30 seconds for 12 trials (making stupid mistakes along the way, sometimes with keyboard shortcuts), and the second is so hard that I need 15 minutes to perform it, taking breaks in order to keep fresh ears.

This post has been edited by guruboolez: Nov 10 2003, 20:15
tigre (Moderator)
post Nov 10 2003, 20:56, Post #29
QUOTE (Continuum @ Nov 10 2003, 08:28 PM)
QUOTE (guruboolez @ Nov 10 2003, 03:55 PM)
Does it mean that 5% is a completely useless value? Or does it mean that with 5-15% there are still some (serious) presumptions of an audible difference?
I'm really interested in this.

Then you could read the "Statistics for ABX" thread (it's long!).

But to give you an idea of how much the results are affected: think of a guessing tester who stops the test as soon as he reaches 0.95 confidence or the maximum length (=: m) of the test. The probability for him to pass the test is:

m=10 => p-val = 0.0508
m=20 => p-val = 0.0987
m=30 => p-val = 0.1295
m=50 => p-val = 0.1579
m=100 => p-val = 0.2021

See this excel sheet for reference.

Thanks. This is exactly the answer to my question. IMO this should be integrated into ABX utilities: you would have to enter beforehand the confidence you want to reach, and whether you want to perform a fixed number of trials or to stop once a certain confidence or a maximum number of trials is reached.

Do you know how these "corrected" values are calculated?
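The quoted numbers can be reproduced with a small dynamic program: track the probability of every (trials, correct) state a guesser can reach, and whenever a state's p-val drops to 0.05 or below, move its probability mass to "passed". A sketch of my reconstruction of what the sheet presumably does (assuming the one-sided binomial p-val):

```python
from math import comb

def tail(correct, trials):
    # one-sided binomial p-val for a guesser
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def pass_probability(max_trials, alpha=0.05):
    """Chance that a pure guesser who stops as soon as the p-val
    reaches alpha or less 'passes' within max_trials trials."""
    states = {0: 1.0}   # surviving runs: correct count -> probability
    passed = 0.0
    for n in range(1, max_trials + 1):
        nxt = {}
        for c, p in states.items():
            for c2 in (c, c + 1):          # wrong / right guess, 1/2 each
                nxt[c2] = nxt.get(c2, 0.0) + p / 2
        states = {}
        for c, p in nxt.items():
            if tail(c, n) <= alpha:
                passed += p                # stops here, claims success
            else:
                states[c] = p              # keeps guessing
    return passed

for m in (10, 20, 30, 50, 100):
    print(f"m={m}: pass probability {pass_probability(m):.4f}")
```

For m=10 this comes out at exactly 13/256 = 0.0508, and the longer maximum lengths come out at the quoted values too.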


Continuum
post Nov 10 2003, 21:06, Post #30
QUOTE (tigre @ Nov 10 2003, 09:56 PM)
Do you know how these "corrected" values are calculated?

I wrote the sheet, so I hope I know! :D

You can read the source (it comes with macros) and try to figure out what means what.

Or look at this post for a detailed (and hopefully more understandable) explanation.
Mac
post Nov 10 2003, 21:18, Post #31
QUOTE (Moneo @ Nov 10 2003, 06:36 PM)
The randomizer in WinABX does seem to be deficient, I've got 114/202. With foobar2000's ABX component I only got 106/202 with that strategy.

So with foobar2000 you got 52.5% correct guesses, and with WinABX 56.4%?

Unless my one-minute Google search was wrong, both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 trials.

I think your claims about the deficiency of WinABX's randomness are unfounded.


AstralStorm
post Nov 10 2003, 21:28, Post #32
Heh, 116/200 is nearly a 35% chance of missing according to my calculator. :P

The p-val calculator is certainly wrong.

This post has been edited by AstralStorm: Nov 10 2003, 21:31


Pio2001 (Moderator)
post Nov 10 2003, 21:37, Post #33
The problem is obvious: in his random test, Gabriel is always more right than wrong! There is no way for this to happen by chance. If the generated sequence is truly random, you should sometimes get more right than wrong, and sometimes more wrong than right.

People seem to consider high confidence to be common once the number of trials rises. No way! High confidence is high confidence, and by definition a common result has low confidence: that is what "common" and "confidence" mean.
The example assumes that a good result is obtained two times out of three. It is nearly impossible to maintain that just by chance; sooner or later you'll get two wrong results out of three, and the confidence will collapse.

Maybe it would be interesting to see the logs with every choice of the program and the user. One hypothesis is that the random generator is bad and there is a correlation between the user's choices and the program's choices. Note that even if the user chooses random answers there can be a correlation, because people have a very bad idea of randomness: when asked to guess randomly, they usually produce a uniform distribution of answers rather than a random one. A human list doesn't fluctuate in the long term; a random list does. But note also that as long as the program is really random, any correlation must disappear, because comparing a random list with a non-random one must lead to another random list.
Another hypothesis is that the total of successes recorded by the program is wrong: maybe if we check each answer we'll find 50 right answers out of 100 while the program counts 70. The last hypothesis would be that in the final results the program records a different list than it actually generated. Example: X is A, the user says "X is B", and the program records "Program: B, user: B, right answer".

Mac, my probability courses are far behind me, but if I'm not mistaken, the probability of being outside the standard deviation bound is about 2%, which fits: here we got a 4% probability for 114 out of 202. Can someone confirm this?
Continuum
post Nov 10 2003, 21:39, Post #34
QUOTE (Mac @ Nov 10 2003, 10:18 PM)
QUOTE (Moneo @ Nov 10 2003, 06:36 PM)
The randomizer in WinABX does seem to be deficient, I've got 114/202. With foobar2000's ABX component I only got 106/202 with that strategy.

So with Foobar you had 52.5% correct guesses, and with WinABX you got 56.4% correct?

Unless my 1 minute google search was wrong, both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests.

I think your claims about the deficiency of WinABXs randomness are unfounded.

I'm not sure exactly which link you are referring to. Anyway, the confidence of 0.9608 for 114/202 is an exact value (approximating it with the normal distribution gives 0.954).

Maybe your "+/-" interval treats a 2/10 result and an 8/10 result as equally important? That, however, is not how it's done in our case.
Continuum
post Nov 10 2003, 21:42, Post #35
QUOTE (AstralStorm @ Nov 10 2003, 10:28 PM)
Heh, 116/200 is nearly a 35% chance of missing according to my calculator. :P

The p-val calculator is certainly wrong.

???

The correct confidence value is 0.98593!

What are you calculating?
Pio2001 (Moderator)
post Nov 10 2003, 21:47, Post #36
QUOTE (Continuum @ Nov 10 2003, 09:42 PM)
QUOTE (AstralStorm @ Nov 10 2003, 10:28 PM)
Heh, 116/200 is nearly a 35% chance of missing according to my calculator. :P

The p-val calculator is certainly wrong.

???

The correct confidence value is 0.98593!

My calculator agrees:

Mac
post Nov 10 2003, 21:57, Post #37
I was going on the standard deviation of a 202-trial binomial distribution being 7.1, meaning any number of correct guesses between 94 and 108 is dead on target, and anything between 87 and 115 isn't completely unexpected. As both 106 (foobar2000) and 114 (WinABX) fell into this range, I saw no problem with it. I admit I forgot all my statistics the day after the exam on it, so I could be wrong :)


schnofler (Java ABC/HR developer)
post Nov 10 2003, 22:00, Post #38
I think one little suggestion is necessary here: please don't jump to conclusions. Two or three examples are not enough to conclude that some random number generator is faulty, especially if you don't do your tests carefully. Continuum's comments show that it's all too easy to "prove" that some program is faulty: just press the buttons long enough, and it's pretty likely you'll dip below pval=0.05 at least once.

QUOTE (pio2001)
The problem is obvious : in his random test, Gabriel is always more right than wrong ! There is no way for this to happen by chance.

How did you conclude that? Maybe I'm missing something, but if I understand correctly, Gabriel posted 4 intermediate results out of 200! We certainly can't conclude from that that he was always more right than wrong. And it's overwhelmingly probable that a guesser will be more right than wrong at 4 checkpoints of a 200-trial test.
Pio2001 (Moderator)
post Nov 10 2003, 22:12, Post #39
You're right.

I'd like to see a graph of p (probability) versus n (number of trials). Does p decrease? Does it constantly fluctuate and sometimes (not often) reach low values?

...I forgot one thing about confidence: the confidence level that is required must depend on the number of tests someone performs. For example, suppose I perform one test per day, accept a 5% result as valid, and pass one test out of two. After 40 days I have 20 successes, but 5% is one chance in 20, so it is very probable that one of my 20 "correct" results is flawed!
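That last point is the standard multiple-testing effect, easy to quantify (a sketch assuming independent tests):

```python
# Chance that at least one of n independent guessing tests
# passes at the alpha = 0.05 level.
alpha = 0.05
for n_tests in (1, 5, 20, 40):
    p_any = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests} tests: {p_any:.2f}")
```

With 20 tests the chance of at least one spurious pass is about 0.64, so a false "success" somewhere in the collection is more likely than not.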
Moneo
post Nov 10 2003, 22:13, Post #40
QUOTE (Mac @ Nov 10 2003, 09:18 PM)
Unless my 1 minute google search was wrong, both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests. 

Standard deviation alone does not answer the question of whether the behaviour I encountered was abnormal. Instead, you need to perform a proper statistical test.

The basis of my statement that there seems to be a deficiency (note the 'seems': formally evaluated, my test is only valid at ~92% confidence, which isn't generally considered high enough) is the following.

The probability of getting 113 or fewer trials correct by guessing is 0.960839995. Thus, the probability of getting 114 or more correct is less than 0.04.

Now, since I didn't expect beforehand that the number would be higher rather than lower than the mean value of 101, I must also include the mirror event of getting 88 or fewer correct answers in the critical interval, making my statement valid only at 92% confidence.
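By the symmetry of the fair-coin binomial, the mirror event of "114 or more" out of 202 is "88 or fewer", and the two-sided p-val is just twice the one-sided one. A quick check (my own sketch, not Moneo's actual computation):

```python
from math import comb

def upper_tail(correct, trials):
    # P(X >= correct) for a pure guesser
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

one_sided = upper_tail(114, 202)   # the event "114 or more correct"
two_sided = 2 * one_sided          # plus its mirror, "88 or fewer correct"
print(f"one-sided p = {one_sided:.4f}")   # about 0.039
print(f"two-sided p = {two_sided:.4f}")   # about 0.078, i.e. ~92% confidence
```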
QUOTE
I think your claims about the deficiency of WinABXs randomness are unfounded.

Well, you could help debunk them by performing a simple test.

Following the a-b-a-b-a-b-... strategy, do ~200 trials and post your results.

If you still doubt my methodology, I can write up a formal mathematical description of my test.

This post has been edited by Moneo: Nov 10 2003, 22:19
Continuum
post Nov 10 2003, 22:14, Post #41
QUOTE (Mac @ Nov 10 2003, 10:18 PM)
both of these are within the +/- 7.1% standard deviation you would expect in a correct/wrong scenario with 202 tests. 

I think I finally understand your calculation (blame my poor knowledge of continuous probability :P). I believe there are two things wrong with it:
1. The "+/-" part is uninteresting: we only care about good results.
2. 7.1 (= sqrt(0.5*0.5*202)) is not a percentage but an absolute number of correct answers, so 114 is well outside it.
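In numbers (a sketch comparing the exact binomial tail against the normal approximation, with a continuity correction):

```python
from math import comb, erf, sqrt

n = 202
sd = sqrt(0.5 * 0.5 * n)                 # about 7.1 correct answers, not 7.1%
exact = sum(comb(n, k) for k in range(114, n + 1)) / 2 ** n

# normal approximation with continuity correction
z = (113.5 - n / 2) / sd
approx = 0.5 * (1 - erf(z / sqrt(2)))    # upper-tail area

print(f"sd = {sd:.2f} answers")
print(f"exact  P(X >= 114) = {exact:.4f}")
print(f"normal P(X >= 114) = {approx:.4f}")
```

The standard deviation is about 7.1 correct answers out of 202, so 114 sits about 1.8 standard deviations above the mean of 101, and both the exact and the approximate tail come out near 0.039.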
AstralStorm
post Nov 10 2003, 22:30, Post #42
Check with what probability you can get this result from a random generator (a PRNG will probably suffice). If a result around half correct guesses (in this case 101) comes up about 95% of the time, then the generator is random (p=~0.5).

It's not that the test gets harder at the 100th trial than at the 20th (given, of course, the results so far, e.g. 10/20 or 50/100).

This post has been edited by AstralStorm: Nov 10 2003, 22:42


schnofler (Java ABC/HR developer)
post Nov 10 2003, 22:54, Post #43
QUOTE (Moneo)
Well, you could help debunking them by performing a simple test.

Following the a-b-a-b-a-b-... strategy, do ~200 trials and post your results.

OK, I tried it once, decided it was too much work (having to click 400 times for each result), and wrote a little program which remote-controls WinABX. Here are a few results:

1. 88/200, pval=96.1%
2. 110/200, pval=8.9%
3. 92/200, pval=88.5%
4. 77/200, pval=99.9%
5. 120/200, pval=0.3%
6. 92/200, pval=88.5%
7. 104/200, pval=31%
8. 91/200, pval=91%
9. 99/200, pval=58.4%
10. 104/200, pval=31%

edit: I did a few more tests using different strategies (always choosing A, or always B), and they seem to indicate that there's no problem with WinABX's RNG.
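Against a fair generator, any fixed strategy (a-b-a-b..., all A, all B) is right on each trial with probability exactly 1/2, so such runs can be simulated directly to see how often p <= 0.05 shows up (a sketch with Python's own PRNG standing in for WinABX's):

```python
import random
from math import comb

def abx_pval(correct, trials):
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Precompute the p-val for every possible score out of 200 trials.
N = 200
pvals = [abx_pval(c, N) for c in range(N + 1)]

rng = random.Random(1)
runs = 2000
significant = 0
for _ in range(runs):
    # a fixed strategy against a fair generator is a fair coin per trial
    correct = sum(rng.random() < 0.5 for _ in range(N))
    if pvals[correct] <= 0.05:
        significant += 1

print(f"{significant}/{runs} runs reached p <= 0.05")
```

At 200 trials, p <= 0.05 actually requires 114 or more correct, which a guesser hits only about 4% of the time, so one significant run in a batch like the ten above is entirely normal.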

This post has been edited by schnofler: Nov 10 2003, 23:01
Mac
post Nov 10 2003, 23:18, Post #44
Erg, I mixed myself up a little :) By +/- I meant that a result of 80 out of 200 is equivalent to a result of 120 out of 200, since the likelihood of success and failure is equal. By 7.1% I meant two standard deviations away from the mean, i.e. 14.2 out of 200, or 7.1% :)

It seems schnofler beat me to the test, but here are my 2 results from WinABX:

Choosing all A: 98/200, p=63.8
Choosing all B: 101/200 p=47.2

The p-value may be totally screwed, but I see no problems with the randomness of it, hence I stick with saying your claim is unfounded :)


schnofler (Java ABC/HR developer)
post Nov 10 2003, 23:20, Post #45
QUOTE (Mac)
Choosing all A: 98/200, p=63.8
Choosing all B: 101/200 p=47.2

The P value may be totally screwed

No, it's not.
Pio2001 (Moderator)
post Nov 10 2003, 23:20, Post #46
Here's the graph of the p-val for my 200 answers:

Pio2001 (Moderator)
post Nov 10 2003, 23:23, Post #47
Schnofler, could you post your program, or the log file for 2,000 trials (if the probabilities don't take too long to compute, or don't overflow)? I'd like to plot a larger graph...

Edit: 2,000 should be enough

This post has been edited by Pio2001: Nov 10 2003, 23:24
Gabriel (LAME developer)
post Nov 10 2003, 23:34, Post #48
another test:
15 of 24, p = 0.154
16 of 25, p = 0.115
17 of 26, p = 0.084
18 of 27, p = 0.061
19 of 28, p = 0.044
20 of 29, p = 0.031
21 of 30, p = 0.021

another one:
27 of 44, p = 0.087


I tried a set of 140 choices, and only 25 times during the test was my p-value .5 or higher. If it was random, shouldn't it be moving around .5?
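Not necessarily. For a pure guesser, the score follows a random walk, and random walks spend surprisingly long stretches on one side of their starting point (the arcsine law), so the p-val does not hover around .5; staying below .5 for most of a 140-trial run is fairly common. A simulation sketch:

```python
import random

# Precompute, with Pascal's triangle, whether the running p-val is >= 0.5
# for every (trials, correct) pair up to 140 trials.
TRIALS = 140
high_pval = []           # high_pval[n-1][c] == True iff P(X >= c | n) >= 0.5
row = [1]                # binomial coefficients for the current n
for n in range(1, TRIALS + 1):
    row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    total = 2 ** n
    tail = 0
    flags = [False] * (n + 1)
    for c in range(n, -1, -1):       # accumulate the upper tail exactly
        tail += row[c]
        flags[c] = 2 * tail >= total
    high_pval.append(flags)

def steps_with_high_pval(rng):
    """One guessing run: how many of the 140 steps have p-val >= 0.5."""
    correct = high = 0
    for n in range(1, TRIALS + 1):
        correct += rng.random() < 0.5
        high += high_pval[n - 1][correct]
    return high

rng = random.Random(1)
runs = 1000
as_extreme = sum(steps_with_high_pval(rng) <= 25 for _ in range(runs))
print(f"{as_extreme} of {runs} guessing runs spent 25 or fewer of their "
      f"140 steps at p >= 0.5")
```

The simulation shows that spending 25 or fewer of 140 steps at p >= .5 is common for guessers rather than exceptional, so that observation on its own is weak evidence against the generator.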
schnofler (Java ABC/HR developer)
post Nov 10 2003, 23:42, Post #49
QUOTE (pio2001)
Schnofler, could you post your program, of the log file for 2,000 trials (if the probabilities are not too long to be computed, or don't overflow) ? I'd like to plot a larger graph...


I don't really understand exactly what you mean, so here's a log file with 10 tests of 200 trials each, and this and this one are log files of two giant tests of 2000 trials each.

edit: I'm not so keen on posting the program itself, because that would mean I'd have to make it usable for someone other than myself ;) (It's an absolutely awful hack. I wrote another program years ago which does something similar, and I just replaced some parts to make it control WinABX.)

This post has been edited by schnofler: Nov 10 2003, 23:49
Moneo
post Nov 10 2003, 23:46, Post #50
QUOTE (schnofler @ Nov 10 2003, 10:54 PM)
OK, I tried it once, decided that it's too much work (having to click 400 times for each result), wrote a little program which remote-controls WinABX, and here's a few results:

Nice work!

QUOTE
4. 77/200, pval=99.9%
5. 120/200, pval=0.3%


I'd say these two are a good indication that something is wrong with the PRNG.

We wanted to test for an abnormal probability of a test returning a pval below 1% or above 99%. At a 98% confidence level, it can be claimed that this probability is bigger than 2% (which is what it should be equal to).

However, for the results to be statistically valid, this criterion should have been set before the test...
