Probability of passing a sequential ABX test, Split from "Mad Challenge - My Results"
Moneo
post Nov 11 2003, 00:00
Post #51





Group: Developer
Posts: 501
Joined: 22-January 03
From: Netherlands
Member No.: 4684



Anyway, I'd like to know what KikeG used to generate random numbers in his program.

My guess is that it was rand(), and in that case I'd suggest replacing it with something better.
Pio2001
post Nov 11 2003, 00:24
Post #52


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



Thanks, Schnofler, this is exactly what I wanted. I've plotted your data:

[image: Pval, in %, vs trial number, of Schnofler's data]

The second graph behaves exactly as I would expect (random), but the first one looks strange and needs analysis. Maybe there is a mathematical explanation.

Actually, the % of right answers varies randomly, but the variations get slower as the number of trials grows. We must also take into account that the p value is very sensitive to the % of right answers. Once in some hundreds of trials, it can fall below 1% or rise above 99%.
Defining the interval of % of right answers that leads to 1% < pval < 99%, we should study the probability that the % of right answers stays outside this interval once it has left it.
Pio2001
post Nov 11 2003, 00:28
Post #53


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



Note in the first graph that if we only had the first 500 trials, we would have concluded that the Pval either decreases to 0 or increases to 100% as the trials go on, and pushing the simulation to 1000 trials would have confirmed it. We would have deduced that, for sure, once the Pval rises or falls, nothing can bring it back to 50%. Pushing to 1500 trials would only have confirmed this again, but... at 2000, it decreases again :)

(That's what is called "listening fatigue", you know: after 1500 trials, I got tired and didn't hear the difference anymore :D But I'm sure I can reproduce the result for 2000 trials if I get one or two more coffees...) ... that's what an audiophile could say, if the curve is inverted (0% instead of 100%)

Ah! Random numbers! :rolleyes:

This post has been edited by Pio2001: Nov 11 2003, 00:33
schnofler
post Nov 11 2003, 00:36
Post #54


Java ABC/HR developer


Group: Developer
Posts: 175
Joined: 17-September 03
Member No.: 8879



QUOTE (Pio2001)
The second graph behaves exactly how I would expect it to do (random), but the first one looks strange and needs an analysis. Maybe there is a mathematic explanation.

I sure hope there are some people on this board with a profound knowledge of stochastics ;). The p-value is a pretty wicked function of the (hopefully) random 0-1 sequence the program should produce. It might be quite hard to phrase the properties that graph should have. Well, I'm too tired now, anyway. Good night everyone.
phong
post Nov 11 2003, 02:27
Post #55





Group: Members
Posts: 346
Joined: 7-July 03
From: 15 & Ryan
Member No.: 7619



There are many "weak" pseudorandom number generators that show patterns when you look at a subset of the bits they produce. In the simplest case, the lowest bit simply toggles, resulting in a pattern of even-odd-even-odd. In a somewhat more sophisticated algorithm, perhaps there is a 60% chance of the number being even on an even trial and a 60% chance of an odd number on an odd trial. The pattern is not obvious at first, but becomes significant over a large number of trials in a situation such as this:
CODE
if (rand() % 2) {
   x = a;
} else {
   x = b;
}
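If one distrusts the low bit, a common workaround is to derive the coin from the magnitude of the value instead. A sketch assuming only the standard C rand():

```c
#include <assert.h>
#include <stdlib.h>

/* Decide 0 or 1 from the magnitude of rand(), not from its lowest bit.
   RAND_MAX is odd on common implementations, so the two halves of the
   range [0, RAND_MAX] contain exactly the same number of values and
   the split is fair. */
int coin(void) {
    return rand() > RAND_MAX / 2;
}
```

In the ABX setting, 0 could select a and 1 could select b, replacing the `rand() % 2` test.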

Making a good pseudorandom number generator is a Hard Problem™; there are certainly many PhD or at least Master's theses on the subject. Intel was nice enough to include a true random number generating widget on their more recent chipsets, which generates true random numbers from thermal noise on a resistor. AMD also makes one available in their 76x series of chipsets. I do not know if an equivalent/compatible implementation is or will be available on other chipsets/platforms.

In Linux, the kernel keeps track of certain quasi-random events from the "real world", such as interrupt times, network traffic, and time between keystrokes/mouse movements, and stores them in an "entropy pool" which programs can draw from by reading /dev/random. Entropy is a limited resource, though, so occasionally a program requiring secure random numbers will stall waiting for more entropy (gnupg, for example, will ask you to use the computer if it can't get enough entropy to generate a private key). If a hardware RNG is present and support is enabled in the kernel, it will draw entropy from that on a regular basis. I don't know the Windows equivalent (I assume the CryptoAPI has something like that).

So, getting good random numbers in a reliable (let alone portable) way is Hard.
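On Linux, the blocking problem can be sidestepped by reading /dev/urandom instead, which stretches the entropy pool through a PRNG and never stalls; a minimal sketch:

```c
#include <assert.h>
#include <stdio.h>

/* Fill buf with n bytes from the kernel's PRNG (Linux).
   Unlike /dev/random, /dev/urandom never blocks waiting for entropy.
   Returns 1 on success, 0 on failure. */
int urandom_bytes(unsigned char *buf, size_t n) {
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f) return 0;
    size_t got = fread(buf, 1, n, f);
    fclose(f);
    return got == n;
}
```

For an ABX test, which needs unpredictability rather than cryptographic secrecy, this quality of randomness is more than enough.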


ff123
post Nov 11 2003, 02:58
Post #56


ABC/HR developer, ff123.net admin


Group: Developer (Donating)
Posts: 1396
Joined: 24-September 01
Member No.: 12



QUOTE (phong @ Nov 10 2003, 05:27 PM)
There are many "weak" pseudorandom number generators that show patterns when you look at a subset of the bits they produce. In the simplest case, the lowest bit simply toggles, resulting in a pattern of even-odd-even-odd. In a somewhat more sophisticated algorithm, perhaps there is a 60% chance of the number being even on an even trial and a 60% chance of an odd number on an odd trial. The pattern is not obvious at first, but becomes significant over a large number of trials in a situation such as this:
CODE
if (rand() % 2) {
   x = a;
} else {
   x = b;
}

I noticed that the rand() function in Microsoft Visual C++ 6.0 has a bias if the code is written this way (even-odd algorithm). But if the integer outputs are binned into 10 ranges (representing digits 0-9), the bias disappears. So in addition to the double-initialization kludge, I had to be careful not to use the even-odd algorithm. All in all, I can't say I'm impressed by the rand() function.

The Mersenne twister now used in ABC/HR doesn't have this problem, but I still don't rely on an even-odd algorithm.
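The binning described above can be done by scaling rather than by modulo, so the digit is driven by the high-order bits of the generator. A sketch using the standard rand() (my own, not ff123's actual test code):

```c
#include <assert.h>
#include <stdlib.h>

/* Map rand() onto a digit 0-9 using its magnitude.
   rand() / (RAND_MAX + 1.0) lies in [0, 1), so the result is
   always in 0..9, and weak low-order bits no longer drive
   the decision on their own. */
int rand_digit(void) {
    return (int)(rand() / (RAND_MAX + 1.0) * 10);
}
```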

ff123
Pio2001
post Nov 11 2003, 05:05
Post #57


Moderator


Group: Super Moderator
Posts: 3936
Joined: 29-September 01
Member No.: 73



Here's how the p value behaves, according to the binomial table:

http://perso.numericable.fr/laguill2/pictu...s/binomial5.png

The yellow area on the left is the area where pval < 5%, and the blue area on the right is the one where pval > 95%. Each intermediate zone is 5% wide.
The number of successes in a sequential random ABX test starts from the bottom of the graph, and at each step up, it randomly goes one step to the right or one step to the left.
At each point, pval represents the probability of being to the left of that point by chance. Thus at any time, there is an equal chance of being in any zone. Since all zones gather at the center of the graph, the probability of being there is high.

Here's the same graph, but with 1%-wide bands. Left area: pval < 1%; right area: pval > 99%.

http://perso.numericable.fr/laguill2/pictu...s/binomial1.png

Here, I added a third graph on top of it.

http://perso.numericable.fr/laguill2/pictu...mial1scotch.png

It represents the same thing, but with 10%-wide bands. All central bands are white. We can see that if, at the 50th trial, pval falls below 1%, there is one chance out of 5 for it to stay below 1% for the next 50 trials, because the 1% line of the "1%" graph stays to the right of the 20% line of the "10%" graph until trial number 100.

The pval table is too small to simulate this for 2000 trials.
KikeG
post Nov 11 2003, 09:33
Post #58


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (Continuum @ Nov 10 2003, 07:28 PM)
But to give you an idea how much the results are affected: Think of a guessing tester who stops the test as soon as he reaches 0.95 confidence or the maximal length ( =: m) of the test. The probability for him to pass the test are:

m=10 => p-val = 0.0508
m=20 => p-val = 0.0987
m=30 => p-val = 0.1295
m=50 => p-val = 0.1579
m=100 => p-val = 0.2021

See this excel sheet for reference.

So, according to this table, going for 0.99 confidence (0.01 or 1% p-value), one would have a 0.0327 (3.27%) probability of passing by chance on 40 trials, wouldn't one? So, am I right in thinking that a person who passed this test on 40 trials would also have an over-95% confidence of hearing a true difference?

I know little about Excel macros, sorry, so what would be the value for a 100-trial test?
EDIT: I think I figured it out in the VBA code, and re-ran the calculations. It would be 0.05162, so I guess this would not pass the required 95% confidence. The max. number of trials allowed to pass this test would be 93, with a 0.04989 probability of passing it by chance. 94 trials would give 0.05039, over 5%.

So, for a 16-trial max. sequential test, it would be enough to get a "calculated" p < 3%.

I don't really know much about statistics, I'm afraid.

This post has been edited by KikeG: Nov 11 2003, 10:16
KikeG
post Nov 11 2003, 09:36
Post #59


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



About the WinABX PRNG: from v0.4 it also uses the Mersenne Twister generator. However, it uses an even-odd algorithm. Maybe I should change this? What version did you test?

This post has been edited by KikeG: Nov 11 2003, 09:59
Moneo
post Nov 11 2003, 11:11
Post #60





Group: Developer
Posts: 501
Joined: 22-January 03
From: Netherlands
Member No.: 4684



QUOTE (KikeG @ Nov 11 2003, 09:36 AM)
About the WinABX PRNG: from v0.4 it also uses the Mersenne Twister generator. However, it uses an even-odd algorithm. Maybe I should change this? What version did you test?

I have tested the latest version, but I only did one test.

Could you try to reproduce this behaviour (abnormally high or low results when an a-b-a-b-... pattern is followed) with the current implementation, and with 0 or 1 chosen by comparing the pseudorandom number to half of its maximum value?

Maybe it's a good idea to implement a cryptographically secure PRNG for WinABX?
KikeG
post Nov 11 2003, 11:58
Post #61


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (Moneo @ Nov 11 2003, 11:11 AM)
Could you try to reproduce this behaviour (abnormally high or low results when an a-b-a-b-... pattern is followed) with the current implementation, and with 0 or 1 chosen by comparing the pseudorandom number to half of its maximum value?

I tried with v0.41, using the a-b-a-b strategy, and got 95/200, p=78%. But now I've updated the algorithm, dropping the even-odd approach and using the same algorithm as ABC/HR, and tried again. The funny thing is that I got 108/200, p=14%, and during this test I got a score as good as 78/134, p=3.5%. But as Continuum and Pio2001 have explained, this is not significant in this kind of test. I tried again using an "A"-always strategy, and got 104/200, p=31%.

Edit: tried again with an a-b-a-b strategy on the latest version, and got 99/200, p=58.4%. I guess that's what happens with random numbers: they're random, and a score more "perfect" than another (foobar vs WinABX) doesn't mean anything unless many tests or many trials are averaged...

Another edit: one passed test (going for p<5%) out of 20 random tests is just what statistics predict; even 1 passed test out of 10 is possible if you are a bit lucky. What is quite unlikely is passing when you run just a single random test. However, it's possible if you are very lucky.

The updated version (v0.42) is available at http://www.kikeg.arrakis.es/winabx/winabx.zip

It would be good if you tested it same way you did with the old version.

QUOTE
Maybe it's a good idea to implement a cryptographically secure PRNG for WinABX?

I don't know if this would have any advantage for the issues under discussion.

This post has been edited by KikeG: Nov 11 2003, 12:40
Moneo
post Nov 11 2003, 12:49
Post #62





Group: Developer
Posts: 501
Joined: 22-January 03
From: Netherlands
Member No.: 4684



QUOTE (KikeG @ Nov 11 2003, 11:58 AM)
I tried with v0.41, using the a-b-a-b strategy, and got 95/200, p=78%. But now I've updated the algorithm, dropping the even-odd approach and using the same algorithm as ABC/HR, and tried again. The funny thing is that I got 108/200, p=14%, and during this test I got a score as good as 78/134, p=3.5%. But as Continuum and Pio2001 have explained, this is not significant in this kind of test. I tried again using an "A"-always strategy, and got 104/200, p=31%.

Yes, neither of these are statistically significant.
QUOTE
It would be good if you tested it same way you did with the old version.

Maybe you could post the sources for both the old and new random number generation routines, including the initial seeding? It isn't exactly fun to click the mouse 400 times, and I don't know Windows programming well enough to write an application that would control WinABX :)
QUOTE
QUOTE
Maybe it's a good idea to implement a cryptographically secure PRNG for WinABX?

I don't know if this would have any advantage for the issues under discussion.

It would essentially rule out the possibility of "cheating" by noticing patterns in the PRNG.
KikeG
post Nov 11 2003, 15:37
Post #63


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (Moneo @ Nov 11 2003, 12:49 PM)
Maybe you could post the sources for both old and new random number generation routines, including the initial seeding?

The PRNG used is publicly available; it's a Mersenne Twister PRNG that implements MT19937. Extracted from the source code:

"This is a Mersenne Twister pseudorandom number generator
with period 2^19937-1 with improved initialization scheme,
modified on 2002/1/26 by Takuji Nishimura and Makoto Matsumoto."

Initialization in all versions happens at program startup, and, in the newest versions, every time the test files are reloaded (I guess the latter isn't necessary):

CODE
void InitSeed(void)
{
   init_genrand((unsigned long)time(NULL));
}



RN generation in v0.40 and v0.41 was:

CODE
int Rand(int base)
{
   return genrand_int32()%base;
}


and in new v0.42 is:

CODE
#define MAX_GENRAND_REAL 0xffffffff

int Rand(int base)
{
   return (int)((genrand_int32()/(MAX_GENRAND_REAL+1.0))*base);
}
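Both versions still carry a (tiny) bias whenever 2^32 is not a multiple of base; rejection sampling removes it entirely. A sketch of the idea (my own, using the C library's rand() to stand in for genrand_int32()):

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform integer in [0, base) with no modulo/scaling bias:
   discard draws from the incomplete final block of the
   generator's range and re-draw. Assumes 0 < base <= RAND_MAX. */
int rand_below(int base) {
    unsigned long range = (unsigned long)RAND_MAX + 1UL;
    unsigned long limit = range - (range % (unsigned long)base);
    unsigned long r;
    do {
        r = (unsigned long)rand();
    } while (r >= limit);      /* rejected draws are simply re-drawn */
    return (int)(r % (unsigned long)base);
}
```

The rejection probability is below base/range, so for small base the loop almost never iterates more than once.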



QUOTE
It isn't exactly fun to click the mouse 400 times, and I don't know windows programming well enough to write an application that would control WinABX


Ok, I was talking in general, not only to you. However, I guess the results won't be very different; in simulations over thousands of trials they give very similar results.

QUOTE
It would essentially rule out a possibility of "cheating" by noticing any patterns in the prng.


I think this would be really difficult to notice, if there were any patterns at all.

This post has been edited by KikeG: Nov 11 2003, 17:35
schnofler
post Nov 11 2003, 16:10
Post #64


Java ABC/HR developer


Group: Developer
Posts: 175
Joined: 17-September 03
Member No.: 8879



Ok, I tested the new version, here you go:

1. 100/200, pval=52.8%
2. 90/200, pval=93.1%
3. 90/200, pval=93.1%
4. 103/200, pval=36.2%
5. 83/200, pval=99.3%
6. 107/200, pval=17.9%
7. 101/200, pval=47.2%
8. 89/200, pval=94.8%
9. 102/200, pval=41.6%
10. 85/200, pval=98.6%

And some longer tests:

1. 985/2000, pval=75.5%
2. 1012/2000, pval=30.3%
3. 1001/2000, pval=49%
tigre
post Nov 12 2003, 15:18
Post #65


Moderator


Group: Members
Posts: 1434
Joined: 26-November 02
Member No.: 3890



The 3 following posts have been moved from Upsampling output, Any theoretical advantages?
__________________________________________________________

QUOTE (KikeG @ Nov 12 2003, 03:58 PM)
I'm not so sure about that, it could be significant due to the p=3.3% reached during the test on just 11 trials.

I'm quite sure that Continuum's thoughts, which he described here and here, are correct, so without fixing the trial number before starting, reaching 9/11 doesn't mean the "probability you are guessing" is 3.3%, but rather something higher, like > 5%.


--------------------
Let's suppose that rain washes out a picnic. Who is feeling negative? The rain? Or YOU? What's causing the negative feeling? The rain or your reaction? - Anthony De Mello
KikeG
post Nov 12 2003, 15:35
Post #66


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (tigre @ Nov 12 2003, 03:18 PM)
I'm quite sure that Continuum's thoughts, which he described here and here, are correct, so without fixing the trial number before starting, reaching 9/11 doesn't mean the "probability you are guessing" is 3.3%, but rather something higher, like > 5%.

Yes, I took this into account. According to Continuum's corrected p-values table, the probability of passing the test going for a p below around 3.5% on just 11 trials would be very close to 5%, so this result would have some significance. That's why I say that if lucpes repeats the test and gets a similarly good score, it would be a quite reliable indication that he heard a difference. For example, if he got 9/11 again, that would be 18/22, an "uncorrected" p=0.2%. That would be quite below the required 5% even if corrected (please someone correct me if I'm wrong; I'm still not totally confident in my interpretation and extrapolation of the table data).

Edit: At first sight, these conclusions seem to agree with ff123's "decision" table in the long ABX statistics thread: http://www.hydrogenaudio.org/forums/index....indpost&p=30785

However, I should spend more time trying to understand and verify all these things, time that I lack right now.

This post has been edited by KikeG: Nov 12 2003, 15:50
tigre
post Nov 12 2003, 16:42
Post #67


Moderator


Group: Members
Posts: 1434
Joined: 26-November 02
Member No.: 3890



Continuum's corrected p-values are only correct with the following preconditions:

- before the test, a maximum number of trials (N) is fixed.
- A p-value (P) is set (like 0.05).
-- If P is reached, the test stops immediately as "passed", the "probability you're guessing" is the "corrected p-value" from Continuum's table
-- If after N trials P isn't reached, the test is "failed".

If the precondition is something like "I'll try to reach p = 0.01. I don't know how many runs I'll perform, but if I get frustrated I'll give up", it's hard to impossible to tell the true "probability you're guessing".

If Lucpes had said before the test "I want to get p=0.04 or better and perform not more than 13 trials", he would have stopped after reaching 9/11 = 0.033. According to Continuum's table this would mean a corrected p-value of 0.063 => "Probability you're guessing" = 6.3%.

If he had said "I want to get p=0.05 or better and perform not more than 40 trials", he would have stopped after reaching 9/11 = 0.033. => corrected p-value: 0.145 => "Probability you're guessing" = 14.5%

From these two examples (best case vs. worst case covered by the table), we see that without defining preconditions it's hard to interpret the result.


KikeG
post Nov 12 2003, 17:15
Post #68


WinABX developer


Group: Developer
Posts: 1578
Joined: 1-October 01
Member No.: 137



QUOTE (tigre @ Nov 12 2003, 04:42 PM)
Continuum's corrected p-values are only correct with the following preconditions:

- before the test, a maximum number of trials (N) is fixed.
- A p-value (P) is set (like 0.05).
-- If P is reached, the test stops immediately as "passed", the "probability you're guessing" is the "corrected p-value" from Continuum's table
-- If after N trials P isn't reached, the test is "failed".

...

If Lucpes had said before the test "I want to get p=0.04 or better and perform not more than 13 trials", he would have stopped after reaching 9/11 = 0.033. According to Continuum's table this would mean a corrected p-value of 0.063 => "Probability you're guessing" = 6.3%.

But, supposing he had stopped when his p=3.3%, would it really matter what p he wanted to reach before the test started? And would it matter what maximum number of trials he was planning to perform? These things were just in his mind, and I'd say the final results don't depend on them. I'd like Continuum to explain this, because I'm not really confident in my interpretation; I'm a little bit confused right now and maybe I'm missing something.

Anyway, I'm not saying he passed the test, I guess because he didn't stop at the p=3.3% point, and even if he had, the corrected p could be slightly over 5%. But if he repeated the test and got similarly good results, I think one could say he passed without much doubt.

This post has been edited by KikeG: Nov 12 2003, 17:21
Continuum
post Nov 12 2003, 18:51
Post #69





Group: Members
Posts: 473
Joined: 7-June 02
Member No.: 2244



I'm not a statistics guru, but here is what I think on this topic (I can try to ask someone more experienced in this area next week):

It all depends on the question you ask. Keep in mind that we cannot calculate the probability that the listener was guessing.

First, consider a fixed test like the one tigre describes. We can exactly calculate the probability that a guessing tester would pass this test (-> "corrected p-val"). In statistics, results are usually considered significant when this value is below 0.05 (or, for stricter tests, 0.01).

In the first situation, however, we lose information about when the test is stopped. For example, we could say that a 6/6 result is better than a 7/8, and one could rephrase the question as: "What is the probability for a guessing tester to achieve a 'better' score (in our particular ordering)?"

Think of the following situation: A listener decides on a "classic p-val" (as displayed in the program) he wants to achieve, say 0.05, and stops the test as soon as this value is reached. The maximal trial number is not fixed at this moment -- but it would not change the strategy of a guessing listener anyway!
Let's say he reached 0.04 at 20 trials. The probability for a guessing tester to score a better result, that is, to reach 0.05 with at most 20 trials, is 0.098.

How far this is a sensible thing to do, I don't know.
