Understanding ABX Test Confidence Statistics
Reply #33 – 2015-02-02 13:17:45
> You've many times got a perfect 20 out of 20 correct trials by guessing? I believe you should immediately buy a lotto ticket as you are one of the luckiest people I have ever met. Can you tell us the probability of doing what you claim you did?

Nope, but nice trolling.

> I've done enough trials to push the probability of guessing below 5%.

If the probability is just below 5%, then a pure guesser needs only about 20 attempts before one of them comes out - according to you - as a "true" positive, which is of course nonsense. For a perfect score you obviously need more luck, and much more luck if you want 20/20. Or you simply cheat: 100% success probability, for any number of trials.

> Yep, I guess we have seen an example of Arny cheating with his ABX results - just random guesses - but it takes a lot more effort to cheat & get an overall positive score. I'm sure there might be some bothered to go to this trouble but I doubt there are many.

I have no idea what you are talking about; that sentence does not make any sense, and it smells like trolling.

> Yea, yea, it's a confidence level - so what?

Yeah - so "i.e. your results are not false positives" is wrong.

> A "real" difference is an agreed audible difference that is recognised - such as are already established in JNDs.

You mean like ringing at 21+ kHz, which an old guy with severe high-frequency hearing loss can "hear" on his laptop with in-ears that additionally roll off below ~18 kHz? More seriously, I see what you mean, but everyone hears differently. You cannot expect everyone (or even many, if we are talking about audiophiles with shot ears here) to hear something at just-noticeable levels. At least not without some additional "help".

> What low anchors are included inside an ABX test?

For example, adding a low-bitrate MP3 to a lossless-vs-MP3 comparison.

> Could you express that better - it's cryptic.

In an online test targeted at a typical audience, I think it is fair to assume that people who deliberately send in unsuccessful results will be far outnumbered by people who send in successful results that were not arrived at honestly. It is pretty easy to find clues if you are dealing with a single person (e.g. watching amir evade trivial questions for several pages - which he filled with noise - until finally admitting it or making really lame excuses), but not if many people send in their results.

As for the statistics: hypothesis testing. The main interest lies in the positive results that could confirm the alternative hypothesis.

> In order to find the truth you have to ensure that there are no cheaters and no false positives.

That is why the scientific method has reproducibility built into it. When scientists measured neutrinos apparently moving faster than light, they did not accept the result; even after they had replicated it, they kept searching for sources of false positives for months. A few months before they found the error, an independent replication had already failed, effectively refuting the earlier results anyway. It is extremely easy to introduce something that causes false positives, even with a team of highly skilled and honest people. Worse, such an error can stay undetected, and it gets worse still when conclusions drawn from it are accepted by gullible people. We need either plausible explanations or many independent replications of an experiment before drawing solid conclusions.
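To put concrete numbers on the 5% point above, here is a quick sketch in Python (standard library only; the trial counts are purely illustrative, not anyone's actual test). It shows that a perfect 20/20 is roughly a one-in-a-million guess, the conventional 5% criterion is already cleared by 15/20, and a guesser who quietly repeats the test and reports only the best run passes that criterion surprisingly often:

```python
from math import comb

def p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value of
    scoring k or more correct out of n ABX trials by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(20, 20))  # 20/20 by guessing: ~9.5e-07, about 1 in a million
print(p_at_least(15, 20))  # 15/20 already clears the 5% bar: ~0.021

# A guesser who repeats the 20-trial test and keeps only the best
# run clears the 5% criterion far more often than 5% suggests:
p_pass = p_at_least(15, 20)
print(1 - (1 - p_pass) ** 20)  # ~0.34 after 20 hidden attempts
```

Which is exactly the asymmetry in the thread: a single just-below-5% result means little without replication, while an honest perfect 20/20 should be vanishingly rare.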
> > How would an online test look like where you can calculate specificity for each participant? I would really be interested in your answer.
>
> Easy enough - you would have known audible differences randomly added to one of the files during the listening trials, & the ABX software would record that trial as a control result. So if this difference was not identified, a false negative is recorded. The analysis of such stats for many tests would be interesting.

I tried to do something similar with amir: he first ignored it for a couple of days, then finally refused, with an excuse. Done smartly, this could also detect cheaters - at least the not-so-smart ones. The problem is that this would no longer be an ABX test as we know it: you would not have just two files (original, modified) but four (original, modified, low anchor, fake modified). But again, the false negatives are far less interesting than the false positives. To reduce false negatives we have listener training, low-anchor test files, and so on.

> That's the usual answer given - "Huh, our tests might be bad & unreliable, but look at the sighted test, it's worse". Pretty scientific attitude. Again, if you guys aren't interested in the sensitivity of your tests then why should anybody pay any credence to the results?

Not at all. Sighted tests are not merely worse; they are completely worthless for determining small audible differences. They are not a worse alternative - they are not an alternative at all. And I do not think that ABX, for example, is bad or unreliable - provided you have honest participants. Your test above does not help much either, because a cheater would always identify correctly. Well, almost always: introducing the occasional error or external excuse ("dog barked here", "wife interrupted here", ...) makes it look more real.
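For what it's worth, here is a minimal sketch of how the bookkeeping for such interleaved control trials could look. Everything in it is hypothetical - the trial kinds, the log format, and the function names are made up for illustration. 'anchor' trials carry a known, clearly audible difference (failures indicate low sensitivity), and 'null' trials stand in for the "fake modified" file above: two identical files, where only a cheating exploit can score above chance.

```python
from math import comb

def binom_p(k, n, p=0.5):
    """One-sided p-value: P(X >= k) for X ~ Binomial(n, p),
    i.e. the chance of k or more correct by pure guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def screen_participant(trials, alpha=0.05):
    """Screen one participant's log of interleaved trials.

    `trials` is a hypothetical list of (kind, correct) pairs:
      'real'   - original vs. modified (the comparison under test)
      'anchor' - control with a known, clearly audible difference
      'null'   - control with two identical files; an honest
                 listener can only score ~50% on these
    """
    def results(kind):
        return [c for k, c in trials if k == kind]

    real, anchor, null = results('real'), results('anchor'), results('null')
    report = {}
    # Scoring significantly above chance on identical files is
    # evidence of cheating (e.g. exploiting metadata or the player).
    if null:
        report['suspect_cheater'] = binom_p(sum(null), len(null)) < alpha
    # Missing known audible differences means low sensitivity or
    # inattention, so negatives on the real trials mean little.
    if anchor:
        report['sensitivity'] = sum(anchor) / len(anchor)
    # The usual ABX p-value, computed on the real trials only.
    if real:
        report['p_value'] = binom_p(sum(real), len(real))
    return report

# Example log: 12 real trials (10 correct), 4 anchors (all heard),
# 5 null controls - all five "correct", which is suspicious.
log = ([('real', True)] * 10 + [('real', False)] * 2
       + [('anchor', True)] * 4 + [('null', True)] * 5)
print(screen_participant(log))
# {'suspect_cheater': True, 'sensitivity': 1.0, 'p_value': ~0.019}
```

As noted above, a deliberate cheater simply throws a few null trials to stay under the radar, so a screen like this only catches the careless ones.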