Topic: Statistics For Abx

Statistics For Abx

Reply #100
Damn, I hope this is the last time I summarize what's going to happen!

1. The test will automatically stop if the following points are reached:

6 of 6
10 of 11
10 of 12
14 of 17
14 of 18
17 of 22
17 of 23
20 of 27
20 of 28

2. The program will display overall alpha values after each of the above stop points has been achieved.  The overall alpha values will also be displayed at the following (look) points, regardless of whether the test stops there:  trials 6, 12, 18, 23, and 28.

(The earlier the test is terminated when the listener passes, the lower the overall alpha is.)

3. The program will display the number correct after each trial is completed.

4. The test will automatically stop once 9 incorrect responses have been reached.
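For what it's worth, here is a minimal sketch (Python; the names and structure are mine, not taken from the actual ABX program) of the stop/look bookkeeping described in points 1 through 4 above:

```python
# Stop table from point 1: trial number -> minimum correct needed to stop and pass
STOP_POINTS = {6: 6, 11: 10, 12: 10, 17: 14, 18: 14, 22: 17, 23: 17, 27: 20, 28: 20}
LOOK_POINTS = {6, 12, 18, 23, 28}   # point 2: overall alpha is displayed here in any case
MAX_INCORRECT = 9                   # point 4: automatic stop at 9 incorrect

def after_trial(trial, correct):
    """Hypothetical helper: decide what happens once trial number `trial`
    has been scored with `correct` right answers so far."""
    show_alpha = trial in LOOK_POINTS
    if trial - correct >= MAX_INCORRECT:
        return "stop (9 incorrect)", show_alpha
    if trial in STOP_POINTS and correct >= STOP_POINTS[trial]:
        return "stop (passed)", True    # alpha is also shown whenever a stop point is hit
    return "continue", show_alpha
```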

Have I missed anything?  This has sure been one humbling exercise.

ff123

Statistics For Abx

Reply #101
Quote
I still don't see the problem with calculating and displaying an overall alpha.

10 of 12 is 0.0295
11 of 12 is 0.0171
12 of 12 is impossible

Given that a listener must have terminated if he achieved 6 of 6, the procedure (and therefore the exact odds of getting to any particular point) is now completely prescribed.

Ahh! I think I finally understand your total alpha calculation.
Example: alpha(10/12) = P(6/6) + P(10/12 and not 6/6) + P(11/12 and not 6/6) + P(12/12 and not 6/6)
Is this correct?

Well, then it is possible to calculate a total alpha at each step; still, I'd be very careful with conclusions drawn from this value.

Statistics For Abx

Reply #102
Quote
Ahh! I think I finally understand your total alpha calculation.
Example: alpha(10/12) = P(6/6) + P(10/12 and not 6/6) + P(11/12 and not 6/6) + P(12/12 and not 6/6)
Is this correct?

Well, then it is possible to calculate a total alpha at each step; still, I'd be very careful with conclusions drawn from this value.

Here is how I think of it (in simulation terms):

alpha (10/12):  the probability that a listener will end up with a score of 6/6, 10/12, 11/12, or 12/12 (impossible), given that the listener must stop after achieving a score of 6/6.  Clearly, to get to 10/12 or 11/12, the listener cannot have scored 6/6, and 12/12 is not achievable at all under this scheme.  So yes, your formula matches how the simulation works.
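As a sanity check, here is a small sketch (Python; the helper is mine, not from the program or the spreadsheet) that reproduces the numbers quoted above by direct counting, conditioning only on the 6/6 early stop as in the formula:

```python
from math import comb

def p_score_without_6_of_6(k, n):
    """P(exactly k of n correct under guessing, with at least one miss in the first 6 trials)."""
    misses = n - k
    # sequences whose misses all fall after trial 6 would already have stopped at 6/6
    already_stopped = comb(n - 6, misses) if misses <= n - 6 else 0
    return (comb(n, misses) - already_stopped) / 2 ** n

p_6_of_6 = 1 / 2 ** 6   # 0.015625
alpha_10_of_12 = p_6_of_6 + sum(p_score_without_6_of_6(k, 12) for k in (10, 11, 12))
alpha_11_of_12 = p_6_of_6 + sum(p_score_without_6_of_6(k, 12) for k in (11, 12))

print(round(alpha_10_of_12, 4))          # 0.0295, as quoted above
print(round(alpha_11_of_12, 4))          # 0.0171, as quoted above
print(p_score_without_6_of_6(12, 12))    # 0.0 -- 12 of 12 is indeed impossible
```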

You can also use your spreadsheet to calculate these numbers.  If you do, you'll see that as the number of trials goes up, the total alpha increases at each stopping point until it reaches a value just under 0.05 at 20/28.  This is consistent with the idea of allowing stopping points:  basically, the listener gets multiple chances to pass the test, with his chances of passing getting better as the test progresses.

ff123

Statistics For Abx

Reply #103
OK, so let's say someone passes the test with a score of 10/11. He would have two pieces of objective information from which to assess the "confidence" that he really heard a difference:

1) the a priori confidence of the test as a whole, ~95% (this would be true for any passing score)

2) the overall alpha (maybe better called the p-value in this context?), ~0.018

Clearly one may be tempted to declare they had passed with 98.2% confidence... but, as Continuum showed in his previous thread, this would not be an accurate statement. As long as this misinterpretation is not made, the overall alpha should help to differentiate passing scores, if one wishes to do so.
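(A quick check of that ~0.018 figure, assuming pure guessing at p = 0.5 and remembering that the only earlier stop point is 6/6: alpha(10/11) = P(6/6) + P(10/11 with the single miss in the first six trials) = 1/64 + 6/2048 ≈ 0.0156 + 0.0029 ≈ 0.0186.)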

Statistics For Abx

Reply #104
Quote
OK, so let's say someone passes the test with a score of 10/11. He would have two pieces of objective information from which to assess the "confidence" that he really heard a difference:

1) the a priori confidence of the test as a whole, ~95% (this would be true for any passing score)

2) the overall alpha (maybe better called the p-value in this context?), ~0.018

Clearly one may be tempted to declare they had passed with 98.2% confidence... but, as Continuum showed in his previous thread, this would not be an accurate statement. As long as this misinterpretation is not made, the overall alpha should help to differentiate passing scores, if one wishes to do so.

Wait, I don't think it's a misinterpretation to say that someone has passed with 98.2% confidence if he scores 10 of 11.  For the same reason I don't think it's a misinterpretation to say that someone who scores 6/6 passes with 98.4% confidence.

Near the beginning of the test the confidence is higher because there are fewer ways to achieve a passing score; as you attempt more trials you get more chances to pass, so the confidence decreases, until at the end, if you score 20 of 28, you have 95.1% confidence.

This is just another way of saying:  how probable is it that one has achieved a passing score if he gets 10 of 11, given all the possible ways of getting to this score (including passing with a score of 6/6)?

ff123

Statistics For Abx

Reply #105
Perhaps this is just a matter of semantics. But it is important to note the following (using the notation from that previous thread):

P(G | 10/11) != P(10/11 or better | G) = 0.018

The 6/6 score may be a special case because there is no "or better" part.

Anyhow, I'm really not sure what the "right" wording would be for the interpretation of the p-val. From my stats book it looks like you could safely say that the result was statistically significant "at the 0.018 level of probability". (Intentionally vague I think).

Maybe you could also say that you had passed the 98.2% confidence test... but somehow that doesn't seem right (you just happened to get a result right on the edge of the corresponding "rejection region" for the null hypothesis).

I am, however, quite sure you would be perfectly accurate in saying that you had passed the 95% confidence test.

Statistics For Abx

Reply #106
Quote
Maybe you could also say that you had passed the 98.2% confidence test... but somehow that doesn't seem right (you just happened to get a result right on the edge of the corresponding "rejection region" for the null hypothesis).

I don't think of 10/11 as the edge of the rejection region.  The fact that one is allowed to continue the test if 10/12 is not achieved pushes the true edge of rejection out to trial 28.

The high confidence required of the early results is what makes it possible to allow the option of stopping early while still retaining 95% confidence if trials continue out to 28.

ff123

Statistics For Abx

Reply #107
Quote
Quote
Maybe you could also say that you had passed the 98.2% confidence test... but somehow that doesn't seem right (you just happened to get a result right on the edge of the corresponding "rejection region" for the null hypothesis).

I don't think of 10/11 as the edge of the rejection region.  The fact that one is allowed to continue the test if 10/12 is not achieved pushes the true edge of rejection out to trial 28.

10/11 is on the edge of the rejection region for 98.2% confidence, clearly not for 95% confidence (it's well inside).

btw, I agree that passing in the earlier trials gives a higher confidence that a type-1 error hasn't occurred. I just think the exact interpretation of the p-value is not trivial. In particular, I would say that a score of 10/11 does not mean the probability that the listener was guessing is exactly 0.018. This is what the equation above says also... although the validity of that hasn't actually been proved or disproved anywhere here, yet (but I'm quite certain it's true) 

Statistics For Abx

Reply #108
I may have missed something somewhere in this thread, but what was the reason for not using the sequential data analysis methods discussed at the beginning of the thread?  It would seem that since they're explicitly designed for sequential data analysis that they'd avoid most of the problems with look windows and such, and allow termination at any point with a robust calculation of confidence levels (or at least robust to the extent that the authors of the methods proved them to be so).

Statistics For Abx

Reply #109
Quote
btw, I agree that passing in the earlier trials gives a higher confidence that a type-1 error hasn't occurred. I just think the exact interpretation of the p-value is not trivial. In particular, I would say that a score of 10/11 does not mean the probability that the listener was guessing is exactly 0.018. This is what the equation above says also... although the validity of that hasn't actually been proved or disproved anywhere here, yet (but I'm quite certain it's true)

Suppose that the probability of getting a trial correct is 0.6 instead of 0.5.  Then the probability of getting 10 of 11 is 0.061 instead of 0.018.

So yes, I would say that given the way the test is performed, the probability that the listener was guessing with a score of 10/11 is exactly 0.018.
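A small sketch of that comparison (Python; the function name is mine), giving the chance of ending the test at 6/6 or 10 of 11 for a given per-trial probability of a correct answer:

```python
def p_pass_by_10_of_11(p):
    """Probability of stopping with 6/6 or with 10 of 11,
    given per-trial probability p of answering correctly."""
    p_6_of_6 = p ** 6
    # to reach trial 11 at all, the single miss must fall within the first 6 trials
    p_10_of_11 = 6 * (1 - p) * p ** 10
    return p_6_of_6 + p_10_of_11

print(round(p_pass_by_10_of_11(0.5), 4))   # 0.0186, i.e. the ~0.018 discussed above
print(round(p_pass_by_10_of_11(0.6), 4))   # 0.0612, i.e. the 0.061 quoted above
```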

ff123

Statistics For Abx

Reply #110
Quote
I may have missed something somewhere in this thread, but what was the reason for not using the sequential data analysis methods discussed at the beginning of the thread?  It would seem that since they're explicitly designed for sequential data analysis that they'd avoid most of the problems with look windows and such, and allow termination at any point with a robust calculation of confidence levels (or at least robust to the extent that the authors of the methods proved them to be so).

I haven't looked at it extensively, but after a bit of fiddling with the formulas, I found I couldn't get the minimum number of trials below 9 before the test can be declared passed.  This is a big disadvantage compared with the 28-trial profile, which allows one to stop at trial 6.

ff123

Edit:  But maybe by properly choosing beta and p1 values, I could make the Wald test more palatable as far as minimum trials go.  I probably need to read up on this more.
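For reference, the standard Wald SPRT acceptance boundary gives the minimum trial count directly. A sketch below (Python); the alpha, beta and p1 values are purely illustrative assumptions, not settings anyone here has agreed on:

```python
from math import log, ceil

def min_trials_to_accept(alpha=0.05, beta=0.20, p0=0.5, p1=0.9):
    """Smallest number of all-correct trials at which Wald's SPRT accepts
    H1 (p = p1) over H0 (p = p0). Parameter values are illustrative only."""
    log_A = log((1 - beta) / alpha)      # upper decision boundary, Wald's approximation
    return ceil(log_A / log(p1 / p0))    # solve n * log(p1/p0) >= log_A for all-correct runs

print(min_trials_to_accept(p1=0.7))   # 9  (consistent with the observation above)
print(min_trials_to_accept(p1=0.9))   # 5  (a more aggressive p1 lowers the minimum)
```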


Statistics For Abx

Reply #111
Quote
I may have missed something somewhere in this thread, but what was the reason for not using the sequential data analysis methods discussed at the beginning of the thread?  It would seem that since they're explicitly designed for sequential data analysis that they'd avoid most of the problems with look windows and such, and allow termination at any point with a robust calculation of confidence levels (or at least robust to the extent that the authors of the methods proved them to be so).

If you want the option to stop after every trial, you have to increase the minimum number of trials (as ff123 points out) or the required "traditional" confidence at each point.

Example: the probability of passing a "traditional" 0.95 test by guessing, when one is allowed to stop at every point up to trial 30, is 0.129! (You can check this with my Excel sheet from above.)

That's why the look profiles (like the 28-trial test) are a good compromise between information, early termination, and high confidence.
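Here is a rough sketch (Python; my own helper, not the Excel sheet) of that calculation: it tracks a guesser's score distribution trial by trial and removes whatever probability mass crosses the conventional one-sided 5% boundary at any point up to trial 30.

```python
from math import comb

def overall_alpha(max_trials=30, per_look_alpha=0.05):
    """Chance that a pure guesser (p = 0.5) crosses the conventional one-sided
    boundary at least once, if he may stop after every single trial."""
    def passes(correct, n):
        # conventional one-sided p-value: P(X >= correct) for X ~ Binomial(n, 0.5)
        return sum(comb(n, i) for i in range(correct, n + 1)) / 2 ** n <= per_look_alpha

    state = {0: 1.0}   # correct-so-far -> probability, among walks that have not yet passed
    p_passed = 0.0
    for n in range(1, max_trials + 1):
        stepped = {}
        for correct, prob in state.items():
            for nxt in (correct, correct + 1):   # wrong / right answer, each with prob 0.5
                stepped[nxt] = stepped.get(nxt, 0.0) + prob * 0.5
        state = {}
        for correct, prob in stepped.items():
            if passes(correct, n):
                p_passed += prob                  # the guesser stops here and "passes"
            else:
                state[correct] = prob
    return p_passed

print(round(overall_alpha(), 3))   # should land near the 0.129 quoted above
```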

Statistics For Abx

Reply #112
Quote
Quote
btw, I agree that passing in the earlier trials gives a higher confidence that a type-1 error hasn't occurred. I just think the exact interpretation of the p-value is not trivial. In particular, I would say that a score of 10/11 does not mean the probability that the listener was guessing is exactly 0.018. This is what the equation above says also... although the validity of that hasn't actually been proved or disproved anywhere here, yet (but I'm quite certain it's true)

Suppose that the probability of getting a trial correct is 0.6 instead of 0.5.  Then the probability of getting 10 of 11 is 0.061 instead of 0.018.

So yes, I would say that given the way the test is performed, the probability that the listener was guessing with a score of 10/11 is exactly 0.018.

I think shday is correct here. We have no proof whatsoever that the probability that the listener was guessing is the same as the calculated p-val.
Example: If you compare two identical files, the probability that the listener is guessing is 1, while his p-val (probability of scoring the same or a better result) will be lower in most cases. The p-val gives only an indication.

But this is purely semantics and interpretation.

BTW: this is exactly what the previous thread was about.

Statistics For Abx

Reply #113
Ok, I think I finally understand the distinction you guys are making:

It's the difference between asking:

"What is the probability that the listener was guessing, given his score?" vs. "What is the probability that a listener gets a certain score, given that he is guessing?"

The value I plan to pop out for the stop points and the look points is the answer to the latter question.  Probably I should change the text in the ABX dialog box to be semantically correct.

ff123

Statistics For Abx

Reply #114
Yes, that's it! 

Statistics For Abx

Reply #115
I think what this thread has really taught me is to pay close attention to what the other guy is saying because he has something valuable to say.

ff123