Topic: Understanding ABX Test Confidence Statistics

Understanding ABX Test Confidence Statistics

Reply #25
Just a trivial answer to point out the stupidity of such a claim: simply not properly entering the intended choice can result in a false negative (or a false positive).

Regardless, CC's post was more thought provoking.

Understanding ABX Test Confidence Statistics

Reply #26
I do not understand this at all. If you have a sighted test where you believe the two items under test ought to sound identical (but in fact they don't) - surely that could quite easily generate false negatives?


He has probably never seen an honest person without a financial interest post a review saying that he or she could not hear a difference.
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #27
Nope. I've multiple times made a perfect score by just randomly clicking. Now imagine what a spectrum analyzer does...
You've many times got a perfect 20 out of 20 correct trials by guessing? I believe you should immediately buy a lotto ticket as you are one of the luckiest people I have ever met. Can you tell us the probability of doing what you claim you did?

Quote
An online ABX test only works if you have honest participants that will not only not cheat but also point out and accept problems with the test files (like the time offset in the AVS AIX test files).
Yep, I guess we have seen an example of Arny cheating with his ABX results - just random guesses - but it takes a lot more effort to cheat & get an overall positive score. I'm sure there may be some who would go to this trouble but I doubt there are many.

Quote
Nope. You really should read up on statistics again.
Yea, yea, it's a confidence level - so what?


Quote
What is a "real" difference? A measurable difference certainly does not mean that there's an audible difference anyway.
A "real" difference is an agreed audible difference that is recognised - such as are already established in JNDs

Quote
And nope, there is concern given to false negatives, for example by including low anchors in test files.
what low anchors are included inside an ABX test?
Quote
But again, in an online test you can only assume that people try their best and also list the equipment they actually used.

If this is not the case then it still matters less than false positives, because we do not accept the null hypothesis anyway. Again, read up on statistics.
Could you express that better - it's cryptic


Quote
It certainly takes faith to accept (positive) results of demonstrably dishonest people.

But that aside, it seems you are trivializing this. Black/white kinda thinking, as you did above.
What would an online test look like where you can calculate specificity for each participant? I would really be interested in your answer.

Easy enough - you would have known audible differences randomly added to one of the files during the listening trials & the ABX software would record this trial result as a control result. So if this difference was not identified then a false negative is recorded. The analysis of such stats for many tests would be interesting.
Quote
It's hard enough (= impossible) to get honest people doing the required number of trials and sending in their results regardless of success.


Nope.
No faith required.

Where you need faith, or let's better call it gullibility, is with the (most of the time) positive dishonest sighted listening tests.

That's the usual answer given - "Huh, our tests might be bad & unreliable but look at the sighted test, it's worse". Pretty scientific attitude
Again, if you guys aren't interested in the sensitivity of your tests then why should anybody pay any credence to the results?

Understanding ABX Test Confidence Statistics

Reply #28
This is as ridiculous a statement as I have heard & shows your lack of knowledge & understanding - there's no such thing as a false negative in sighted tests

I do not understand this at all. If you have a sighted test where you believe the two items under test ought to sound identical (but in fact they don't) - surely that could quite easily generate false negatives?

Again, a complete fail - if you "believe" that there's no audible difference but you hear a difference (because one exists) how could that possibly be a false negative, doh? Do you guys think logically or ......?

Understanding ABX Test Confidence Statistics

Reply #29
Just a trivial answer to point out the stupidity of such a claim: simply not properly entering the intended choice can result in a false negative (or a false positive).

Regardless, CC's post was more thought provoking.

What are you talking about? Do you even know what a false negative means?

Understanding ABX Test Confidence Statistics

Reply #30
A false negative is a true positive called as a negative. What's your definition?

And how do you reconcile *demanding* adherence to e.g., ITU standards from everyone else, with your championing of your and Amir's decidedly non-compliant methods?

Understanding ABX Test Confidence Statistics

Reply #31
A false negative is a true positive called as a negative. What's your definition?
Yes, it's a known audible difference which fails to be noticed.

Quote
And how do you reconcile *demanding* adherence to e.g., ITU standards from everyone else, with your championing of your and Amir's decidedly non-compliant methods?
Because an overall positive result in an ABX test signifies a confidence level of 95% or greater that the result is not random, i.e. that the tester correctly identified differences. What significance do false negatives have in this scenario? OK, it would still be nice to know the specificity of the test but would that reduce the 95% confidence level result?

If, on the other hand, a null result is returned - what might be the adjustment on the results of knowing the level of false negatives for that test?

Understanding ABX Test Confidence Statistics

Reply #32
Doing a quick search, I see that this topic has arisen before & still the denial continues:
Quote
ABX removes bias that could produce false positives but it does not remove bias towards false negatives. Subjects biased towards not hearing a difference can throw off your results by not listening and guessing randomly.


Followed by this doozy of a comment from ArnyK: "While the potential for false negatives is real, the actual incidence of this seems small." He has no evidence, no basis for such a statement, as he simply doesn't know the incidence of false negatives in ABX tests. Very scientific, and the rest of the thread shows the same typical, unscientific thinking.

Understanding ABX Test Confidence Statistics

Reply #33
You've many times got a perfect 20 out of 20 correct trials by guessing? I believe you should immediately buy a lotto ticket as you are one of the luckiest people I have ever met. Can you tell us the probability of doing what you claim you did?

Nope, but nice trolling.
I've done enough trials to push the probability of guessing below 5%. If the probability is just below 5% then you only need about 20 attempts at the whole test to get - according to you - a "true positive" result, which is of course nonsense. For a perfect score you obviously need more luck, and much more luck if you want to do 20/20. Or you simply cheat: 100% probability for any number of trials.
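
To put numbers on this exchange: the chance of getting k or more correct out of n trials by pure guessing follows the binomial distribution. A minimal Python sketch (not from the original posts; the 12/16 score is purely illustrative) of that arithmetic:

Code
from math import comb

def p_at_least(k, n, p=0.5):
    # One-sided binomial probability: k or more correct out of n trials by guessing.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(12, 16))   # ~0.038 -> an illustrative 12/16 score already pushes guessing below 5%
print(p_at_least(20, 20))   # ~9.5e-07 -> a perfect 20/20 by guessing is roughly a one-in-a-million event
# With a 5% threshold, roughly 1 in 20 complete runs by a pure guesser will still "pass".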


Yep, I guess we have seen an example of Arny cheating with his ABX results - just random guesses - but it takes a lot more effort to cheat & get an overall positive score. I'm sure there may be some who would go to this trouble but I doubt there are many.

I have no idea what you're talking about, your sentence does not make any sense, and it smells like trolling.


Yea, yea, it's a confidence level - so what?

Yeah, so "i.e your results are not false positives." is wrong.


A "real" difference is an agreed audible difference that is recognised - such as are already established in JNDs

You mean like ringing at 21+ kHz which an old guy with severe high frequency hearing loss can "hear" on his laptop with some in-ears that additionally roll off before ~18 kHz?
More seriously, I see what you mean, but everyone hears differently. You cannot expect everyone (or even many if we're talking about audiophiles with shot ears here) to hear something at just noticeable levels. At least not without some additional "help".

what low anchors are included inside an ABX test?

For example, adding a low-bitrate MP3 in a lossless vs. MP3 comparison.


Could you express that better - it's cryptic

In an online test targeted at a typical audience I think it is fair to assume that people who deliberately send in unsuccessful test results will be far outnumbered by people sending in successful results that were not arrived at honestly.
It is pretty easy to find clues if you're dealing with individual people (e.g. seeing amir evade trivial questions for several pages - which he filled with noise - until finally admitting or making really lame excuses), but not if many people send in their results.

As for the statistics: hypothesis testing.
The main interest lies in the positive results that could confirm the alternative hypothesis. In order to find the truth you have to ensure that there are no cheaters and no false positives.
That's why the scientific method has reproducibility built into it. When scientists found that neutrinos seemed to move faster than light they did not accept it; even after they had replicated the results they kept on searching for sources of false positives for months. A few months before they found the error, an independent replication had already failed, effectively refuting the earlier results anyway.

It is extremely simple to introduce something that causes false positives, even when you have a team of highly skilled and honest people... Even worse, the error could potentially stay undetected, and the situation deteriorates further when conclusions are drawn from it and accepted by gullible people. We either need plausible explanations or many independent replications of an experiment to make solid conclusions.


Quote
How would an online test look like where you can calculate specificity for each participant? I would really be interested in your answer.

Easy enough - you would have known audible differences randomly added to one of the files during the listening trials & the ABX software would record this trial result as a control result. So if this difference was not identified then a false negative is recorded. The analysis of such stats for many tests would be interesting.

I tried to do something similar with amir - he first ignored it for a couple of days, then finally refused with an excuse.
Because if you do this smartly, then you could also detect cheaters - at least the not so smart ones.

The problem is that this would not be an ABX test as we know it. You wouldn't just have 2 files (original, modified) but 4 (original, modified, low anchor, fake modified). But again, the false negatives are far less interesting than the false positives. To reduce false negatives we have listener training, low anchor test files ...


That's the usual answer given - "Huh, our tests might be bad & unreliable but look at the sighted test, it's worse". Pretty scientific attitude
Again, if you guys aren't interested in the sensitivity of your tests then why should anybody pay any credence to the results?

Not at all.
Sighted tests are not worse, they are completely worthless for determining small audible differences. They are not a worse alternative, they are not even an alternative.

I do not think that e.g. ABX is bad or unreliable at all - if you have honest participants.

Your above test doesn't help much at all, because a cheater would always identify it correctly. Well, almost always, because introducing some error or external influence ("dog barked here", "wife interrupted here" ...) makes it look more real.
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #34
I know you dismiss any ABX test lacking positive controls/ hidden reference anchors,


I agree that given their propensity for false negatives (IME a small price to pay for the much-needed control over false positives), listening tests need positive controls.



This is typical of the skewed approach to audio DBTs - you are happy that they are a small price to pay but you have no idea how skewed the results are towards false negatives.


When a post includes such clear claims of knowledge of my personal state of mind with this degree of precision and certainty, the only logical response is to recognize the obvious debilitating difficulties inherent in that position and move on.

Understanding ABX Test Confidence Statistics

Reply #35
That's the usual answer given - "Huh, our tests might be bad & unreliable but look at the sighted test, it's worse". Pretty scientific attitude
Again, if you guys aren't interested in the sensitivity of your tests then why should anybody pay any credence to the results?


That's a false statement of the real situation. The real situation is that science knows how to make DBTs the gold standard for reliability and has done so many times.

There is a very common, far worse answer: "Our tests are known to be biased and unreliable but we base just about everything we do on them and advise others to do the same".

That is the de facto position of most people who write for or publish high end ragazines, sell high end audio gear, and in particular those who sell high end DACs.

Understanding ABX Test Confidence Statistics

Reply #36
Sighted tests are not worse, they are completely worthless for determining small audible differences. They are not a worse alternative, they are not even an alternative.


Totally agreed.


Quote
I do not think that e.g. ABX is bad or unreliable at all - if you have honest participants.


Honesty is a prerequisite for any scientific endeavor, but the role of adequate knowledge cannot be overstated.

In the beginning we underestimated the difficulty of doing good listening tests simply because we gave too much (any at all) credibility to sighted evaluations which were the standard up until that point. 


Understanding ABX Test Confidence Statistics

Reply #37
In the beginning we underestimated the difficulty of doing good listening tests simply because we gave too much (any at all) credibility to sighted evaluations which were the standard up until that point.

Yes, this seems to be in general a major problem for (less rational) audiophiles.

The line of reasoning is like this: xyz people have heard a difference. So many people cannot err. Therefore, there must be an audible difference, and any test that does not come up with the same results is flawed.
They do not seem to realize that the same way of thinking can be applied to all kinds of nonsense ranging from supernatural abilities to alien abductions. Well... I'm afraid to say that some are consistent in their thinking and actually do believe in these things.
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #38
I know you dismiss any ABX test lacking positive controls/ hidden reference anchors,


I agree that given their propensity for false negatives (IME a small price to pay for the much-needed control over false positives), listening tests need positive controls.
This is typical of the skewed approach to audio DBTs - you are happy that they are a small price to pay but you have no idea how skewed the results are towards false negatives. So what if 90% of audio DBTs were found to be suffering from an unacceptable level of false negatives - would this be a small price?

Quote
I have problems with people who rant and rave about this issue as an ABX-only problem when it is inherent in any listening test.  The only reason why nobody says much about the false negatives in sighted evaluation is that they are washed out by the very many false positives.
This is as ridiculous a statement as I have heard & shows your lack of knowledge & understanding - there's no such thing as a false negative in sighted tests 

Quote
I have more problems with people who won't recognize that false negatives are a problem that is easy enough to manage.
yes, it's easy to manage by including hidden controls within the test, but have you ever done so & can you give us the stats on false negatives for any such test you've administered? I don't think I have ever seen you give a practical, sensible approach to how these controls could be included in a test.

ok so you want some tests inside the abx to control that the examinee isn't failing on purpose or out of boredom, bad faith... on principle why not. it wouldn't really be detrimental as long as it doesn't complicate stuff too much for the layperson to still use the test.
but how do you suggest to do that? personally I don't know how to do it. let's say the ABX makes 3 files instead of 2, one with a slight EQ or a delay or added noise or whatever it is that people should be able to hear. then the guy of bad faith (Arny apparently is your best friend) would hear it, recognize it for what it is and answer correctly before going back to randomly clicking on the rest of the tracks just to make fun of you. I don't see how to create an automated system to foolproof the test against false negatives, conscious or subconscious.

I have no problem thinking that abx isn't perfect, and in fact, most of the weaknesses of abx I've learned from reading Arny himself warning against them. that's how much bad faith the guy puts on the internet to blindly promote ABX and make billions out of the abx plugin on foobar. Arny, the only guy on the net with a clear money agenda (sorry, my sarcasm leaks out when I read too much BS).
anyway, at the moment I believe ABX is still one of the most effective and easily accessible ways to do many subjective hearing tests. if you have a more reliable, free, and worldwide accessible method, please pretty please, tell me all about it. I'm all for the latest good stuff. as long as it's not some sighted-listening middle-ages nonsense, I'm all for it.

 

Understanding ABX Test Confidence Statistics

Reply #39
Btw, here is another example of amir statistics™:

---

You only got 6 out of 10 trials correct? (P = 37.7%)
No problem!
Just pick similarly bad results from 9 other persons and you get:

60/100 trials, P = 2.8%

2.8% < 5%



PS: You can improve the results by including better random results. Just 2 more randomly correct trials push P down to about 1%!!!

---
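
For what it's worth, the figures in the example above check out; a short Python sketch (reusing a simple one-sided binomial helper, purely for illustration) reproduces them:

Code
from math import comb

def p_at_least(k, n, p=0.5):
    # One-sided binomial p-value: chance of k or more correct out of n trials by guessing.
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(6, 10))     # ~0.377 -> 6/10 proves nothing on its own
print(p_at_least(60, 100))   # ~0.028 -> pool ten such results and the total slips under 5%
print(p_at_least(62, 100))   # ~0.010 -> two more lucky trials push it to about 1%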
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #40
You've many times got a perfect 20 out of 20 correct  trials by guessing? I believe you should immediately buy a lotto ticket as you are one of the luckiest people I have ever met. Can you tell us the probability of doing what you claim you did?

Nope, but nice trolling.
I've done enough trials to push the probability of guessing below 5%. If the probability is just below 5% then you only need about 20 attempts at the whole test to get - according to you - a "true positive" result, which is of course nonsense. For a perfect score you obviously need more luck, and much more luck if you want to do 20/20. Or you simply cheat: 100% probability for any number of trials.
So now you are going back on what you said "I've multiple times made a perfect score by just randomly clicking" - now your claim has changed - please make up your mind.

Understanding ABX Test Confidence Statistics

Reply #41
That's the usual answer given - "Huh, our tests might be bad & unreliable but look at the sighted test, it's worse". Pretty scientific attitude
Again, if you guys aren't interested in the sensitivity of your tests then why should anybody pay any credence to the results?

Not at all.
Sighted tests are not worse, they are completely worthless for determining small audible differences. They are not a worse alternative, they are not even an alternative.

I do not think that e.g. ABX is bad or unreliable at all - if you have honest participants.

Your above test doesn't help much at all, because a cheater would always identify it correctly. Well, almost always, because introducing some error or external influence ("dog barked here", "wife interrupted here" ...) makes it look more real.

You are so hung up on the cheating to get positive results - why does ArnyK's cheating to get a null result not also apply - he didn't listen, just randomly hit keys?


Understanding ABX Test Confidence Statistics

Reply #42
That's the usual answer given - "Huh, our tests might be bad & unreliable but look at the sighted test, it's worse". Pretty scientific attitude
Again, if you guys aren't interested in the sensitivity of your tests then why should anybody pay any credence to the results?


That's a false statement of the real situation. The real situation is that science knows how to make DBTs the gold standard for reliability and has done so many times.

There is a very common, far worse answer: "Our tests are known to be biased and unreliable but we base just about everything we do on them and advise others to do the same".

That is the de facto position of most people who write for or publish high end ragazines, sell high end audio gear, and in particular those who sell high end DACs.

You have no clue of the type II errors in your results - no clue how to run a test & yet you claim it's reliable & a gold standard. Give me a break!

Understanding ABX Test Confidence Statistics

Reply #43
ok so you want some tests inside the abx to control that the examinee isn't failing on purpose or out of boredom, bad faith... on principle why not. it wouldn't really be detrimental as long as it doesn't complicate stuff too much for the layperson to still use the test.
Good - a rational voice, at last!
Quote
but how do you suggest to do that? personally I don't know how to do it. let's say the ABX makes 3 files instead of 2, one with a slight EQ or a delay or added noise or whatever it is that people should be able to hear. then the guy of bad faith (Arny apparently is your best friend) would hear it, recognize it for what it is and answer correctly before going back to randomly clicking on the rest of the tracks just to make fun of you. I don't see how to create an automated system to foolproof the test against false negatives, conscious or subconscious.
But you miss the point that this control file has no other audible feature in it (other than the inserted artefact that is agreed to be audible, if the tester is paying attention) that distinguishes it from the files under test. So the tester has no way of knowing if a file is a control or not except by listening to it. This will catch ArnyK's cheating but it will also catch any fatigue, loss of focus, inappropriate test procedure, expectation bias, etc - the myriad of reasons why false negatives arise.

In other words, the control file (which is inserted into the test at random intervals) sounds exactly like the two files under test except that it has some audible artefact added to it. To the test subject this is a random event & there is no other warning that the control file is being used on this trial.
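
To make this proposal concrete, here is a minimal, hypothetical Python sketch (names and structure are illustrative only, not taken from any existing ABX tool) of how randomly inserted control trials might be logged and used to estimate a per-subject miss rate:

Code
import random

def run_trials(n_trials, listener, control_rate=0.2):
    # Occasionally substitute a control X containing a known, agreed-audible artefact.
    # Misses on control trials are counted as false negatives for this subject.
    correct = real_total = control_misses = control_total = 0
    for _ in range(n_trials):
        if random.random() < control_rate:
            control_total += 1
            if not listener(is_control=True):    # failed to spot the known artefact
                control_misses += 1
        else:
            real_total += 1
            if listener(is_control=False):
                correct += 1
    miss_rate = control_misses / control_total if control_total else float("nan")
    return correct, real_total, miss_rate

# A subject who never listens and clicks randomly misses about half the controls,
# which flags any "null" result from them as untrustworthy.
guesser = lambda is_control: random.random() < 0.5
print(run_trials(40, guesser))

Collected over many submissions, the per-subject miss rates would be exactly the "analysis of such stats" asked for earlier.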

Quote
I have no problem thinking that abx isn't perfect, and in fact, most of the weaknesses of abx I've learned from reading Arny himself warning against them. that's how much bad faith the guy puts on the internet to blindly promote ABX and make billions out of the abx plugin on foobar. Arny, the only guy on the net with a clear money agenda (sorry, my sarcasm leaks out when I read too much BS).
anyway, at the moment I believe ABX is still one of the most effective and easily accessible ways to do many subjective hearing tests. if you have a more reliable, free, and worldwide accessible method, please pretty please, tell me all about it. I'm all for the latest good stuff. as long as it's not some sighted-listening middle-ages nonsense, I'm all for it.
All I'm asking for is the rate of false negatives in the test results. It shouldn't be seen as a threat to anybody if the test is reliable. So why is it seen this way?

Understanding ABX Test Confidence Statistics

Reply #44
There is no harm in failing to reject the null hypothesis.

Nice bale of straw, BTW. Does it help lubricate the constant belittling and insistence that bad principles are good?

Understanding ABX Test Confidence Statistics

Reply #45
All I'm asking for is the rate of false negatives in the test results. It shouldn't be seen as a threat to anybody if the test is reliable. So why is it seen this way?

I don't think anyone is seeing this as a threat. However, you do not seem to be taking into account what it would mean if the tests accounted for false positives and false negatives in the same way. If you had three of those tests, and (hypothetically) one would yield an overall confirmation, while two yield a rejection, then you would have to go with the majority.

If, on the other hand, the three tests didn't account for false negatives, and only attempted to control false positives, as was discussed here, then one successful test out of three would suggest that the null hypothesis can be rejected, albeit with perhaps a rather low confidence.

So my suspicion is that you want to have your cake and eat it. That's often the case when people with vested interests play the statistics interpretation game. There simply are a lot of possibilities to fool oneself without it being blatantly obvious.
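
One way to quantify "a rather low confidence": assuming the three tests are independent and each controls false positives at the 5% level, the chance that at least one comes up positive purely by luck is well above 5%. A tiny Python sketch of that arithmetic (the three-test scenario is the hypothetical one above):

Code
alpha = 0.05                          # per-test false positive rate
n_tests = 3
p_any_chance_hit = 1 - (1 - alpha) ** n_tests
print(p_any_chance_hit)               # ~0.143 -> one "hit" in three tests carries roughly 86% confidence, not 95%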

Understanding ABX Test Confidence Statistics

Reply #46
So now you are going back on what you said "I've multiple times made a perfect score by just randomly clicking" - now your claim has changed - please make up your mind.

You should go back and read again what I said and what I did not.

Also, I wrote this not to demonstrate a problem with ABX protocol, but your lack of understanding of statistics.


You are so hung up on the cheating to get positive results - why does ArnyK's cheating to get a null result not also apply - he didn't listen, just randomly hit keys?

Please go back and read my previous replies including the suggestion to read up on hypothesis testing.

And the whole point of ABX is to reduce false positives, which in a perfect world would include all cheating attempts.


Good - a rational voice, at last!

What was irrational in the previous posts, excluding yours?
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #47
All I'm asking for is the rate of false negatives in the test results. It shouldn't be seen as a threat to anybody if the test is reliable. So why is it seen this way?

I don't think anyone is seeing this as a threat. However, you do not seem to be taking into account what it would mean if the tests accounted for false positives and false negatives in the same way. If you had three of those tests, and (hypothetically) one would yield an overall confirmation, while two yield a rejection, then you would have to go with the majority.

If, on the other hand, the three tests didn't account for false negatives, and only attempted to control false positives, as was discussed here, then one successful test out of three would suggest that the null hypothesis can be rejected, albeit with perhaps a rather low confidence.

So my suspicion is that you want to have your cake and eat it. That's often the case when people with vested interests play the statistics interpretation game. There simply are a lot of possibilities to fool oneself without it being blatantly obvious.

If the test results showed that the control file difference was not heard in the majority of cases then the test is suffering from type II errors & its results are very marginal.

Let's take some concrete examples -
If a subject doesn't bother to listen but hits random selections - he will most likely produce a null result (given 20 trials or more). This isn't a valid null result, but how do we know this from the results? The only possibility is to have controls embedded in the test that test for false negatives.
Similarly, if a subject has decided beforehand that what he is about to test cannot possibly show any audible differences, then his expectation bias can easily prevent him from hearing differences that really exist. He will produce a null result.
If he has become tired & stopped listening or lost focus, etc - all these produce random results & we don't know from the overall results what percentage of them are due to actually listening & hearing no difference as opposed to not listening for one of many reasons. The only way we can get a feel for this is to include controls & extrapolate from their results.

Is this not something of concern to objective scientists, looking for the truth?

Understanding ABX Test Confidence Statistics

Reply #48
You have no clue of the type II errors in your results - no clue how to run a test & yet you claim it's reliable & a gold standard. Give me a break!


In the face of such omniscience, I must move on.

Understanding ABX Test Confidence Statistics

Reply #49
So now you are going back on what you said "I've multiple times made a perfect score by just randomly clicking" - now your claim has changed - please make up your mind.

You should go back and read again what I said and what I did not.

Also, I wrote this not to demonstrate a problem with ABX protocol, but your lack of understanding of statistics.
I quoted exactly what you said - a perfect score - not that you guessed correctly a couple of times but that you got a perfect score. It's you who needs to read up on stats, not me.

Quote
You are so hung up on the cheating to get positive results - why does ArnyK's cheating to get a null result not also apply - he didn't listen, just randomly hit keys?

Please go back and read my previous replies including the suggestion to read up on hypothesis testing.

And the whole point of ABX is to reduce false positives, which in a perfect world would include all cheating attempts.
No, it wouldn't include the type of cheating ArnyK did - he didn't listen, just chose randomly.

It doesn't include any testing for type II errors. You are only worried about cheating which will give positive overall results, not cheating (or lack of attention, loss of focus, etc.) which will give a null overall result. You have no objectivity about this test whatsoever. As long as the test keeps giving null results, you're unquestioningly happy - it's only when the odd positive result arises that there's a flurry of activity to examine the test, & then only focussed on further eliminating any positive results.