Topic: Understanding ABX Test Confidence Statistics

Understanding ABX Test Confidence Statistics

Reply #50
Let's take some concrete examples -
If a subject doesn't bother to listen but hits random selections - he will most likely produce a null result (given 20 trials or more). This isn't a valid null result, but how do we know this from the results?

For the 3rd time, a null result does not prove the null hypothesis.
When you fail to pass a test, whether genuinely or not, it does not prove that no difference can be heard.

Btw, you can combine such results into significant results by using a special kind of statistics.
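For concreteness, a minimal sketch with invented scores: three hypothetical runs of 13/20 are each non-significant on their own, but pooling the trials into a single binomial test (or combining the per-run p-values with Fisher's method) changes the picture.

Code:
from scipy import stats

# Hypothetical (invented) scores: three ABX runs of 13 correct out of 20 trials.
runs = [(13, 20), (13, 20), (13, 20)]

# Each run on its own, tested one-sided against pure guessing (p = 0.5).
pvals = [stats.binomtest(c, n, 0.5, alternative='greater').pvalue for c, n in runs]
print(pvals)                       # ~0.13 each: none is significant alone

# Option 1: pool all trials into a single binomial test (39/60 correct).
correct = sum(c for c, _ in runs)
trials = sum(n for _, n in runs)
pooled = stats.binomtest(correct, trials, 0.5, alternative='greater').pvalue
print(pooled)                      # ~0.014: the pooled result clears the 0.05 bar

# Option 2: combine the per-run p-values with Fisher's method.
print(stats.combine_pvalues(pvals, method='fisher'))   # ~0.06 here; pooling the
                                                        # trials is more powerful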


The only possibility is to have controls embedded in the test that test for false negatives.

Much more importantly you should test for false positives.


Similarly, if a subject has decided beforehand that what he is about to test cannot possibly show any audible differences, then his expectation bias can easily prevent him from hearing differences that really exist. He will produce a null result.
If he has become tired & stopped listening or lost focus, etc - all these produce random results & we don't know from the overall results what percentage of them are due to actually listening & hearing no difference as opposed to not listening for one of many reasons. The only way we can get a feel for this is to include controls & extrapolate from their results.

Is this not something of concern to objective scientists, looking for the truth?

This has been done here @HA in the multiformat listening tests for years. Failure to correctly identify the low anchors results in dismissal of the submitted results.

The tools used are freely available.
Again, the problem is not with the tests but with the clueless people.
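A rough sketch of that kind of low-anchor screening, with a made-up data layout and threshold (illustrative only, not the actual HA tooling):

Code:
# Illustrative sketch only, not the actual HA tooling: dismiss any submission
# whose listener failed to rank the hidden low anchor clearly worst.
def screen_submissions(submissions, anchor_key="low_anchor", margin=0.5):
    """Keep only submissions where the low anchor scored lower than every
    real codec by at least `margin` points on the rating scale."""
    kept, dismissed = [], []
    for sub in submissions:
        anchor_score = sub["ratings"][anchor_key]
        others = [s for k, s in sub["ratings"].items() if k != anchor_key]
        if all(anchor_score + margin <= s for s in others):
            kept.append(sub)
        else:
            dismissed.append(sub)   # listener missed the control -> result not trusted
    return kept, dismissed

# Hypothetical example: listener B rates the low anchor as high as a contender,
# so their whole submission is dropped.
subs = [
    {"listener": "A", "ratings": {"low_anchor": 1.5, "codec_x": 4.2, "codec_y": 4.0}},
    {"listener": "B", "ratings": {"low_anchor": 4.0, "codec_x": 4.1, "codec_y": 3.9}},
]
kept, dismissed = screen_submissions(subs)
print([s["listener"] for s in kept])       # ['A']
print([s["listener"] for s in dismissed])  # ['B']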
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #51
You have no clue of the type II errors in your results - no clue how to run a test & yet you claim its reliability & a gold standard. Give me a break!


In the face of such omniscience, I must move on.

I doubt your lack of contribution will be noticed - so far it hasn't.

Understanding ABX Test Confidence Statistics

Reply #52
All I'm asking for is the rate of false negatives in the test results. It shouldn't be seen as a threat to anybody if the test is reliable. So why is it seen this way?


The above is a completely false claim. In fact, its author just said:

"You have no clue of the type II errors in your results - no clue how to run a test & yet you claim it's reliability & a gold standard. Give me a break!"

It's a false claim because its author is doing far more than he just claimed. For openers, there is the slight matter of a raft of public personal insults that have been posted under his name.

For example he just posted this insult: "I doubt your lack of contribution will be noticed - so far it hasn't."

Mr. Keny, haven't you figured out that it is irrational to ask for an estimate of the rate of false negatives from someone who you truly believe does not know what it is?

In fact, I have a pretty good estimate of both the false negatives and false positives in most experiments. For example, the percentage of false positives in any sighted evaluation of credible DACs is very close to 100%.

Either there are two people posting here with the same name, or the one person who posts with that name is unaware of the posts that he just made.

My take - we're dealing with posts made in anger (fear of financial loss) and desire for financial gain (he owns a high-end audio business) that are designed to be read and not responded to.

Understanding ABX Test Confidence Statistics

Reply #53
Let's take some concrete examples -

If a subject doesn't bother to listen but hits random selections - he will most likely produce a null result (given 20 trials or more). This isn't a valid null result, but how do we know this from the results? The only possibility is to have controls embedded in the test that test for false negatives.


Agreed. There are many ways to detect this kind of problem. 

Quote
Similarly, if a subject has decided beforehand that what he is about to test cannot possibly show any audible differences, then his expectation bias can easily prevent him from hearing differences that really exist. He will produce a null result.


Agreed. There are many ways to detect this kind of problem. 

Quote
If he has become tired & stopped listening or lost focus, etc - all these produce random results & we don't know from the overall results what percentage of them are due to actually listening & hearing no difference as opposed to not listening for one of many reasons. The only way we can get a feel for this is to include controls & extrapolate from their results.
 

Agreed. There are many ways to detect this kind of problem.  Perhaps more to the point, there are ways to keep it from happening in the first place.

Quote
Is this not something of concern to objective scientists, looking for the truth?


So much so that any number of effective means for detecting and even avoiding these problems have been put into practice for almost half a century.

Listing them out like this suggests a lack of background in the literature of reliable subjective audio testing.

Understanding ABX Test Confidence Statistics

Reply #54
Is this not something of concern to objective scientists, looking for the truth?

I don't know why you are replying to my post, since your reply shows that you haven't read or understood it, and you are not actually referring to its content. You are merely rambling on about a point that is fairly basic.

Most people here appear to have long understood that a test that doesn't control false negatives can't be used to prove the negative. @xnor has just repeated this point: A test that failed to reject the null hypothesis is not automatically a proof of the null hypothesis. Such a test is basically a failure, nothing more. That's a perfectly normal and acceptable outcome, and does not at all speak against the test design. The only thing that needs to be borne in mind is that such a test result does not prove inaudibility. That's all.
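To put a number on "basically a failure, nothing more": a hypothetical 12-out-of-20 run gives a one-sided p-value of about 0.25, so the null hypothesis is not rejected, yet nothing in that figure distinguishes "inaudible" from "audible but missed this time". A quick check with an invented score:

Code:
from scipy import stats

# Invented ABX score: 12 correct out of 20 trials, tested against guessing (p = 0.5).
result = stats.binomtest(12, 20, p=0.5, alternative='greater')
print(result.pvalue)   # ~0.25: not significant, but it proves nothing about inaudibility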

What I wanted to point out is that this is actually an advantage for you audiophiles who are convinced that all sorts of things are audible. It leaves the possibility that audibility can be shown with just one good test.

If on the other hand all the tests had included control of false negatives, as you demand, you would have to conduct enough good tests to tip the majority in your favor, because tests with a negative outcome would carry the same weight as the ones with a positive outcome. Something tells me that you wouldn't want to go along with that. I believe you would still claim that a single positive test would constitute a sort of proof against all earlier negative tests. I may be wrong in my suspicion, but you haven't shown yourself to be particularly even-handed so far.

Understanding ABX Test Confidence Statistics

Reply #55
You are so hung up on the cheating to get positive results - why does ArnyK's cheating to get a null result not also apply - he didn't listen, just randomly hit keys?


There was no cheating or hitting of random keys.  There was an objection to the rate at which the unknowns were listened to, but when I can hear differences I often obtain positive results while working at that speed.

Furthermore, as I said, I had previously run the same unknowns a little while earlier and worked very slowly - with negative results. I would have posted those results except that I inadvertently threw them away, partially because I was working with beta test software and was not yet effective at managing test logs.

I've done thousands of ABX tests over the past 40 or so years, and now I'm being judged, and every ABX test ever done is being judged because someone doesn't like the outcome of the last one I did.

Furthermore, I had been complaining publicly for weeks before that test that I don't want to do ABX tests right now because I know (from obtaining too many of what I perceive to be false negatives in recent tests) that my hearing is not very good right now.

That one fact falsifies the many claims that false negatives aren't being properly managed. If there was improper management of false negatives, Mr. Keny's public postings rise to near the top of the list of malefactors.

I was publicly repeatedly bullied until I did this test against my better judgement.  Then its outcome was misinterpreted in order to publicly humiliate me.  Talk about a lack of good faith!

Understanding ABX Test Confidence Statistics

Reply #56
Let's take some concrete examples -
If a subject doesn't bother to listen but hits random selections - he will most likely produce a null result (given 20 trials or more). This isn't a valid null result, but how do we know this from the results?

For the 3rd time, a null result does not prove the null hypothesis.
When you fail to pass a test, whether genuinely or not, it does not prove that no difference can be heard.
So what - that's not what we are talking about. The sensitivity of the test is what is being discussed &, as a result, the reliability of the test.

Quote
Btw, you can combine such results into significant results by using a special kind of statistics.


The only possibility is to have controls embedded in the test that test for false negatives.

Much more importantly you should test for false positives.
False positives are already catered for in the test. So you want to ignore type II errors, then.


Quote
Similarly, if a subject has decided beforehand that what he is about to test cannot possibly show any audible differences, then his expectation bias can easily prevent him from hearing differences that really exist. He will produce a null result.
If he has become tired & stopped listening or lost focus, etc - all these produce random results & we don't know from the overall results what percentage of them are due to actually listening & hearing no difference as opposed to not listening for one of many reasons. The only way we can get a feel for this is to include controls & extrapolate from their results.

Is this not something of concern to objective scientists, looking for the truth?

This has been done here @HA in the multiformat listening tests for years. Failure to correctly identify the low anchors results in dismissal of the submitted results.

The tools used are freely available.
Again, the problem is not with the tests but with the clueless people.

Are you talking about pre-testing using anchors or hidden anchors within the test procedure?

Understanding ABX Test Confidence Statistics

Reply #57
I was publicly repeatedly bullied until I did this test against my better judgement. Then its outcome was misinterpreted in order to publicly humiliate me.

If you could manage to get over it, you might find this is completely irrelevant to the point you should be driving home:

A failed test does not prove the null hypothesis.

Instead, your overblown ego is allowing you to be hooked with bait that even the most indiscriminate dog would refuse to take while starving.

Understanding ABX Test Confidence Statistics

Reply #58
All I'm asking for is the rate of false negatives in the test results. It shouldn't be seen as a threat to anybody if the test is reliable. So why is it seen this way?


The above is a completely false claim. In fact, its author just said:

"You have no clue of the type II errors in your results - no clue how to run a test & yet you claim it's reliability & a gold standard. Give me a break!"

It's a false claim because its author is doing far more than he just claimed. For openers, there is the slight matter of a raft of public personal insults that have been posted under his name.

For example he just posted this insult: "I doubt your lack of contribution will be noticed - so far it hasn't."

Mr. Keny, haven't you figured out that it is irrational to ask for an estimate of the rate of false negatives from someone who you truly believe does not know what it is?

In fact, I have a pretty good estimate of both the false negatives and false positives in most experiments. For example, the percentage of false positives in any sighted evaluation of credible DACs is very close to 100%.
And are these Type II errors? Please explain!


Understanding ABX Test Confidence Statistics

Reply #59
So what - that's not what we are talking about.

That is precisely what we are talking about.  You either don't understand or are trying to brush it under the carpet because it undermines just about everything coming from your keyboard.

Understanding ABX Test Confidence Statistics

Reply #60
Sighted tests are not worse, they are completely worthless for determining small audible differences. They are not a worse alternative, they are not even an alternative.


But wait... I thought the difference between hi-rez and Redbook, between different DACs, between amps, is NIGHT AND DAY? I thought veils were lifted?

That's what most of the print and online commenters/bloggers/reviewers keep saying, have been saying, for years and years and years. It's what Neil Young & Co say. It's what Meridian really, really wants to say.

So if all that Amir, jKeny, et al. are detecting (assuming no cheating) is *small differences*, what am I to think? What's the big deal?

So confused! 

Understanding ABX Test Confidence Statistics

Reply #61
Is this not something of concern to objective scientists, looking for the truth?

I don't know why you are replying to my post, since your reply shows that you haven't read or understood it, and you are not actually referring to its content. You are merely rambling on about a point that is fairly basic.

Most people here appear to have long understood that a test that doesn't control false negatives can't be used to prove the negative. @xnor has just repeated this point: A test that failed to reject the null hypothesis is not automatically a proof of the null hypothesis. Such a test is basically a failure, nothing more. That's a perfectly normal and acceptable outcome, and does not at all speak against the test design. The only thing that needs to be borne in mind is that such a test result does not prove inaudibility. That's all.

What I wanted to point out is that this is actually an advantage for you audiophiles who are convinced that all sorts of things are audible. It leaves the possibility that audibility can be shown with just one good test.

If on the other hand all the tests had included control of false negatives, as you demand, you would have to conduct enough good tests to tip the majority in your favor, because tests with a negative outcome would carry the same weight as the ones with a positive outcome. Something tells me that you wouldn't want to go along with that. I believe you would still claim that a single positive test would constitute a sort of proof against all earlier negative tests. I may be wrong in my suspicion, but you haven't shown yourself to be particularly even-handed so far.

I'm not talking about advantage or disadvantage to either test.
If we included false positive controls in sighted listening tests then they would also give us more information about the specificity of that listening test - how prone it was to type I errors. So it does work both ways & would be useful in both tests.
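A false-positive control of that kind is easy to state concretely: insert catch trials in which the two presentations are actually identical and count how often a difference is still reported. A toy tally with invented numbers:

Code:
# Toy sketch of a catch-trial tally for a sighted comparison (numbers invented).
# Each trial records whether the two presentations really differed and whether
# the listener reported hearing a difference.
trials = [
    {"really_different": False, "reported_difference": True},
    {"really_different": False, "reported_difference": True},
    {"really_different": False, "reported_difference": False},
    {"really_different": True,  "reported_difference": True},
]

catch = [t for t in trials if not t["really_different"]]
false_positive_rate = sum(t["reported_difference"] for t in catch) / len(catch)
print(f"false positive rate on catch trials: {false_positive_rate:.0%}")  # 67% here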

However, you are happy to shout that sighted tests produce false positives & are not to be trusted.
I'm saying the same could be true about blind tests - they produce false negatives - we simply don't know.

It's not about whether a null result proves inaudibility or not - it's about why anyone would bother to run a test that is flawed in this way.
This is the scientific discussion section, right? Let's have some scientific discussion then - is a blind test of any value? How do we know? Why bother with one if it's prone to type II errors?

Understanding ABX Test Confidence Statistics

Reply #62
So what - that's not what we are talking about.

That is precisely what we are talking about.  You either don't understand or are trying to brush it under the carpet because it undermines just about everything coming from your keyboard.

No, what's being discussed is whether it is worth anybody wasting their time doing a test that most likely contains many type II errors, which basically invalidate the null results usually returned.
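For what it's worth, a type II error rate can only be attached to such a test once a true detection ability is assumed; it is not a property of the protocol alone. A sketch under an assumed per-trial ability of 0.7 (the figure is purely illustrative):

Code:
from scipy import stats

n = 16
k_crit = 12            # smallest score whose chance under pure guessing is <= 0.05
print(stats.binom.sf(k_crit - 1, n, 0.5))    # type I check: ~0.038

p_true = 0.7           # assumed probability of a correct answer when a difference is real
power = stats.binom.sf(k_crit - 1, n, p_true)
print(power, 1 - power)   # ~0.45 power, so a ~0.55 false-negative rate under this assumption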

Understanding ABX Test Confidence Statistics

Reply #63
I was publicly repeatedly bullied until I did this test against my better judgement. Then its outcome was misinterpreted in order to publicly humiliate me.

If you could manage to get over it, you might find this is completely irrelevant to the point you should be driving home:

A failed test does not prove the null hypothesis.


The obviousness of that point seems to be so great that repeating it should be beneath any of us.


Understanding ABX Test Confidence Statistics

Reply #64
You are so hung up on the cheating to get positive results - why does ArnyK's cheating to get a null result not also apply - he didn't listen, just randomly hit keys?


There was no cheating or hitting of random keys.  There was an objection to the rate at which the unknowns were listened to, but when I can hear differences I often obtain positive results while working at that speed.
Rubbish - nobody can listen, decide & use a mouse to select a button in 1 second, which is what you took for how many of your trials? Or 2 seconds - how many trials did you do at 2 seconds each? 3 seconds? 4 seconds? You just didn't listen & you admitted as much on AVS.

Quote
Furthermore, as I said, I had previously run the same unknowns a little while earlier and worked very slowly - with negative results. I would have posted those results except that I inadvertently threw them away, partially because I was working with beta test software and was not yet effective at managing test logs.

I've done thousands of ABX tests over the past 40 or so years, and now I'm being judged, and every ABX test ever done is being judged because someone doesn't like the outcome of the last one I did.
Yes, that is the excuse you used on AVS - you got confused with the logs - after thousands of ABX tests you would have thought you might be well used to a log file being recorded on disk. But don't be so apologetic; I think it is one of the most useful ABX results to date - it shows exactly how someone randomly pressing keys & not listening gives a guaranteed null result, rather than a disqualified test.

Quote
Furthermore, I had been complaining publicly for weeks before that test that I don't want to do ABX tests right now because I know (from obtaining too many of what I perceive to be false negatives in recent tests) that my hearing is not very good right now.

That one fact falsifies the many claims that false negatives aren't being properly managed. If there was improper management of false negatives, Mr. Keny's public postings rise to near the top of the list of malefactors.
There is no management of false negatives in these tests - if you had been smart enough to spend a bit more time on each trial (but still not listened), you could have presented your results & no one would have known you cheated. But you didn't, & you were rightly exposed as a result.

Quote
I was publicly repeatedly bullied until I did this test against my better judgement.  Then its outcome was misinterpreted in order to publicly humiliate me.  Talk about a lack of good faith!
Don't be so hard on yourself - consider it taking one for the team - you've provided a great example of just what's wrong with ABX testing.

Understanding ABX Test Confidence Statistics

Reply #65
most likely contains many type II errors

I'm under the strong impression that this insistence from you is born of nothing more than convenience.

You clearly have no means to substantiate it, so what's the point in making it?  The proper/honest/intelligent/respectable position is to admit that you don't know the likelihood.

Understanding ABX Test Confidence Statistics

Reply #66
No, what's being discussed is whether it is worth anybody wasting their time doing a test that most likely contains many type II errors, which basically invalidate the null results usually returned.


I agree that discussing such a test, which is exactly a figment of someone's overheated, greed-fueled imagination, should be beneath us all.

Let's get back to the almost universally employed non-test (non-level-matched, non-time-synched sighted evaluation) that is the basis for the commercial proliferation of trivial components like DACs as high-end audio products. What do you propose to do about that travesty?


Understanding ABX Test Confidence Statistics

Reply #68
You are so hung up on the cheating to get positive results - why does ArnyK's cheating to get a null result not also apply - he didn't listen, just randomly hit keys?


There was no cheating or hitting of random keys.  There was an objection to the rate at which the unknowns were listened to, but when I can hear differences I often obtain positive results while working at that speed.
Rubbish - nobody can listen, decide & use a mouse to select a button in 1 second, which is what you took for how many of your trials? Or 2 seconds - how many trials did you do at 2 seconds each? 3 seconds? 4 seconds? You just didn't listen & you admitted as much on AVS.


Just a second. You now claim it is impossible to conduct a trial in 1 second, and then claim that I posted an ABX log showing that I actually did such a thing?  There's an obvious inconsistency here!

If there is an ABX log showing that I did such a thing, then it is obviously possible!

Given the very many false claims I've seen here lately, I'd like to see evidence of that claim. I'm blocked from AVS so I can't get the evidence myself.

Understanding ABX Test Confidence Statistics

Reply #69
Hi John,

Looks like you've had a lot pent up for those 10 years! 

you've provided a great example of just what's wrong with ABX testing

John, this includes your and Amir's non-ITU ABX hobbyist online tests, yes?
Out of curiosity, were those audible effects you describe here ascertained with ITU-type ABX tests? No mention of confidence statistics. Amir's $50k amp "tests" clearly were not. Actually they weren't even blind.

cheers,

AJ
Loudspeaker manufacturer

Understanding ABX Test Confidence Statistics

Reply #70
It's not about whether a null result proves inaudibility or not - it's about why anyone would bother to run a test that is flawed in this way.
This is the scientific discussion section, right? Let's have some scientific discussion then - is a blind test of any value? How do we know? Why bother with one if it's prone to type II errors?

Of course a blind test is of value. If the chance of type I errors is controlled, and the test comes out positive (the null hypothesis could be rejected), the test can be regarded as strong evidence for the claim that was tested. That's useful by anyone's definition, isn't it? In fact, the science you are insisting on is full of examples of just such tests.

A blind test that may suffer from type II errors and didn't yield a positive result basically doesn't say anything at all. That is what is meant when one says "the null hypothesis couldn't be rejected". It means that we haven't made an advance. That's the only case when you are right: There was no value in the test. But you can only say that after the fact, when you know the result. Before the test there was at least a chance it could result in something useful.

Isn't that enough to show that tests suffering from type II errors can still be of value? Heck, they've proven that numerous times!

You only need to control type II errors if you intend to interpret a negative result in any other way except to ignore it.

A test that controls neither type I nor type II errors can't be used for any kind of interpretation, it can only be ignored regardless of the outcome. That's what sighted tests are.
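For a forced-choice ABX run, controlling the type I error rate simply means setting the pass threshold so that a pure guesser clears it no more than 5% of the time. The thresholds below are standard binomial arithmetic, not specific to any one test discussed here:

Code:
from scipy import stats

alpha = 0.05
for n in (10, 16, 20, 30):
    # smallest k with P(at least k correct out of n by pure guessing) <= alpha
    k = next(k for k in range(n + 1) if stats.binom.sf(k - 1, n, 0.5) <= alpha)
    chance = stats.binom.sf(k - 1, n, 0.5)
    print(f"{n} trials: need at least {k} correct (guesser passes with p = {chance:.3f})")
    # -> 9/10, 12/16, 15/20, 20/30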

Understanding ABX Test Confidence Statistics

Reply #71
most likely contains many type II errors

I'm under the strong impression that this insistence from you is born of nothing more than convenience.

You clearly have no means to substantiate it, so what's the point in making it?  The proper/honest/intelligent/respectable position is to admit that you don't know the likelihood.

What do you think "most likely" means?
And you have no evidence to refute this suspicion!!

What I'm saying is - why not improve the test & then I (& others) wouldn't have such suspicions & we would be better able to believe that these tests might be worthwhile.

A recent flurry of activity happened when Amir produced his positive ABX results - all directed towards further tightening of the elimination of false positives - inclusion of test signals (controls) to expose IMD that might be caused in some audio equipment when handling signals >20 kHz. Redoing of the test to correct some differences between the tracks. Foobar ABX was even re-written with an eye towards tighter control of false positives.

In all this activity/analysis, not once were false-negative controls considered.

Understanding ABX Test Confidence Statistics

Reply #72
It's not about whether a null result proves inaudibility or not - it's about why anyone would bother to run a test that is flawed in this way.
This is the scientific discussion section, right? Let's have some scientific discussion then - is a blind test of any value? How do we know? Why bother with one if it's prone to type II errors?

Of course a blind test is of value. If the chance of type I errors is controlled, and the test comes out positive (the null hypothesis could be rejected), the test can be regarded as strong evidence for the claim that was tested. That's useful by anyone's definition, isn't it? In fact, the science you are insisting on is full of examples of just such tests.
You mean if it overcomes the massive skew towards producing false negatives & actually succeeds in discriminating two files/devices? 

Quote
A blind test that may suffer from type II errors and didn't yield a positive result basically doesn't say anything at all. That is what is meant when one says "the null hypothesis couldn't be rejected". It means that we haven't made an advance. That's the only case when you are right: There was no value in the test. But you can only say that after the fact, when you know the result. Before the test there was at least a chance it could result in something useful.
There's a difference between an invalid test & a null result - you seem not to be aware of this.

Quote
Isn't that enough to show that tests suffering from type II errors can still be of value? Heck, they've proven that numerous times!

You only need to control type II errors if you intend to interpret a negative result in any other way except to ignore it.

A test that controls neither type I nor type II errors can't be used for any kind of interpretation, it can only be ignored regardless of the outcome. That's what sighted tests are.
Again, it's not a contest between sighted & blind listening - can you not get that into your head? It's a discussion of how invalid blind testing is - it's greatly skewed towards false negatives & hence towards a null result. A test that is biased & only controls for type I errors is not worth bothering with - it's of no real value.

Understanding ABX Test Confidence Statistics

Reply #73
A recent flurry of activity happened when Amir produced his positive ABX results - all directed towards further tightening of the elimination of false positives


Get out of here!
"I hear it when I see it."

Understanding ABX Test Confidence Statistics

Reply #74
What I'm saying is - why not improve the test & then I (& others) wouldn't have such suspicions & we would be better able to believe that these tests might be worthwhile.


What I'm saying is that it is clear that some have obvious strong financial interests in discrediting the decades of hard and effective work that have been done by professionals and amateurs to address all extant reasonable concerns about DBTs.

Whether it is greed or lack of proper education that has made them use personal attacks and false claims to further their interests is not clear, but there is plenty of evidence relating to both influences.