
What's the problem with double-blind testing?

I happened to pick up an issue of Stereophile at a record store I visited, and I was pretty shocked to see a seemingly intelligent person in the correspondence section bashing double-blind testing as unreliable. I'm afraid I don't understand his angle of attack. I don't see how anything could be a more reliable test of sound quality differences than a properly conducted double-blind listening test.

I'm almost afraid to read the rest of the magazine if this is the kind of letter they think is worth publishing.  Is there an audio magazine that isn't filled with this kind of thinking?

What's the problem with double-blind testing?

Reply #1
Some people can actually learn to detect specific codecs by their sound, which defeats the purpose, but there is no problem with the method itself; it's as scientific as you can get.

Does this person also believe the earth was created in six days?

What's the problem with double-blind testing?

Reply #2
Quote
I happened to pick up an issue of Stereophile at a record store I visited, and I was pretty shocked to see a seemingly intelligent person in the correspondence section bashing double-blind testing as unreliable. [...]


Keep in mind that if Stereophile's audience realizes that double-blind tests are totally valid, and acknowledges the importance of that realization, all of Stereophile's ad revenue will evaporate as the companies that make massively overpriced cables (I bet you can find several full-page cable ads in whatever issue you happen to be reading) and other hocus-pocus products go out of business like medieval humans dying of the Black Plague.

It is therefore in the best economic interest of Stereophile and all its writers to encourage the entirely wrong-headed notion that double-blind testing is unreliable, and, being apparently unscrupulous people, they do so and will continue to do so.

Edit: Hello goon
Edit2: I too would like to know if there is an audio magazine that is not in love with products of dubious worth.

What's the problem with double-blind testing?

Reply #3
The closest thing to a legitimate criticism I have seen in discussions of double-blind tests concerned how far the results were interpreted. Of course, that didn't address the validity of the testing method at all, only the misinterpretation or overinterpretation of the results. And even that doesn't seem to be a problem very often.

What's the problem with double-blind testing?

Reply #4
Quote
Keep in mind that if Stereophile's audience realizes that double-blind tests are totally valid, and acknowledges the importance of that realization, all of Stereophile's ad revenue will evaporate [...]



That's the answer. Stereophile is not unique among magazines in relying on advertising revenue. Since all magazines rely on advertising, and the advertisers in this scenario rely on selling things like cables whose claimed differences don't survive a double-blind test, there are unlikely to be any "high end" audio mags willing to promote double-blind testing as a valid evaluation procedure. The mags and the advertisers have a sort of symbiotic relationship; they need each other. DBTs are bad for business all round.

Maybe there are some more modern computer/digital audio oriented magazines out there that have a different segment of advertisers who are not threatened by DBT, but I don't know of any.

What's the problem with double-blind testing?

Reply #5
The Audio Critic is notably pro-DBT.

What's the problem with double-blind testing?

Reply #6
Interesting publication. I like it; too bad it's not in full production anymore. Thanks for the link. Does anybody have more quality sources?

What's the problem with double-blind testing?

Reply #7
The writer of the letter in Stereophile is incorrect in saying double-blind tests are unreliable, but he has a point. I do ABX tests in foobar a lot to compare different files, but there are just too many songs to ABX to be sure all of one format's files are as good as another's (e.g. MP3 vs. AAC vs. WAV). The thing I always notice, though, is that when using the analytical side of the brain (ABX testing), only certain parts of the music are being focused on, and the emotional side of the music isn't being analyzed; the goosebump factor, the way the music delivers emotion, cannot be analyzed! It's very hard to have the left and right sides of the brain working at the same time.

My example: when doing an ABX of one particular song, I passed the test in foobar with flying colors (100%) but thought the differences were not significant enough to make me keep the WAV files from the CDs, so I kept only the MP3s. After a few weeks of listening to the MP3s and really getting to know and love the songs, I tried just popping the CD in for a listen... wow, goosebumps... The parts I loved in the songs gave me goosebumps for the first time. Now that's music! The feeling is lost on some parts of the MP3. And that was on my PC with Sennheisers, not even my high-end home rig.

It's not totally reliable to trust ABX testing for determining the enjoyment one gets from music, because human memory can only hold a certain number of seconds at a time, as in an ABX test. It's good for developing codecs and eliminating artifacts, but ABXing a few seconds of one song does not determine the accuracy of the dynamics of the whole piece. All the buildup of sound, the emotion! Bash me now, but I have been an audiophile for more than 15 years and I am only 31, and I have been a music lover since I was 5 years old. I know what I'm talking about. On a bad stereo system I can hear differences between files, for example, but I don't really care whether I am listening to the MP3 or the WAV version. But with a very involving music system, the MP3 can sometimes put a damper on the fun factor of the song.

Also, Stereophile loves publishing "gray area" letters. I love that magazine. They exaggerate sometimes, but overall they are very accurate in the way they subjectively describe the audio quality of equipment.

What's the problem with double-blind testing?

Reply #8
The "goosebump", emotional factor can be caused by the placebo effect; you can't be sure that isn't the cause. People have used the same principle (emotional response and the like) to claim that cables sound different, for example.

You can do long-term blind tests too, not just the typical quick-switch, short-snippet ABX test.

What's the problem with double-blind testing?

Reply #9
Quote
The "goosebump", emotional factor can be caused by the placebo effect; you can't be sure that isn't the cause. [...]

You can do long-term blind tests too, not just the typical quick-switch, short-snippet ABX test.

What do you mean by long-term ABX? How do I do this? Play entire songs? Actually, can you give me a link that explains placebo? Thanks.

What's the problem with double-blind testing?

Reply #10
Quote
The Audio Critic is notably pro-DBT.

Thanks. I really like their sample article "The Ten Biggest Lies in Audio" (http://www.theaudiocritic.com/downloads/article_1.pdf).

What's the problem with double-blind testing?

Reply #11
Quote
Actually, can you give me a link that explains placebo? Thanks.

The word placebo, when used in audio circles, is a little different from the medical term, where people can actually benefit a little from dummy treatments.

According to ff123, it's closer to expectation bias, where a person's expectation can influence his perception, if I understand correctly.

What's the problem with double-blind testing?

Reply #12
Well, I think expectation effects in listening tests actually make people perceive the sound as truly different, so they cause an effect equivalent to the placebo effect in medicine. I mean, both of them cause a real effect: patients think they are taking a medicine, and that helps cure them. Listeners think their equipment improves or degrades the sound, and that makes them truly perceive those effects.

What's the problem with double-blind testing?

Reply #13
Quote
I happened to pick up an issue of Stereophile at a record store I visited, and I was pretty shocked to see a seemingly intelligent person in the correspondence section bashing double-blind testing as unreliable. [...] Is there an audio magazine that isn't filled with this kind of thinking?


By all means read it... it's always good for a laugh. I have the current issue at home and am waiting for a free hour or two to enjoy it.

I've interacted with editor John Atkinson online and, very briefly, in person -- he's mastered the art of balancing on a rhetorical tightrope between the scientific evaluation of claims (he has a background in physics) and audiofoolery. This manifests itself in his magazine as a schizophrenic devotion to engineering jargon and bench measurements -- sometimes extremely detailed -- accompanying sighted 'reviews' that are literally whimsical. Atkinson knows who his subscriber base is, and I doubt he's about to alienate them by embracing valid scientific methods for subjective comparison... ever. To that crowd, the purely 'subjective' experience will ALWAYS trump ANY measurement or scientific method.

But Stereophile is a model of restraint compared to The Absolute Sound, which is absolutely off the deep end, and has been forever.

Peter Aczel's 'The Audio Critic' (now online only) is the closest thing to a purely objective audio magazine you'll find (www.theaudiocritic.com). Lesser lights include The Sensible Sound (about half of the people there seem to drink the hi-end Kool-Aid) and Sound & Vision (David Ranada is a DBT advocate, Ian Masters is a no-nonsense objectivist, Ken Pohlmann is a god of digital audio, but the mag as a whole seems ever more tilted towards marketing hype -- I haven't seen much by Ranada in there recently).

What's the problem with double-blind testing?

Reply #14
Quote
Well, I think expectation effects in listening tests actually make people perceive the sound as truly different, so they cause an effect equivalent to the placebo effect in medicine. [...]


The audio placebo effect also makes people 'feel' differently, and that doubtless has some neurophysiological correlate -- some measurable brain activity correlated with the feeling -- so it's 'real' in that sense. But the point is that it's entirely self-generated -- it is not due to any real difference between the sounds. (Just as the beneficial medical placebo effect isn't 'due', in any meaningful way, to the sugar pill itself -- it's due to the idea that the sugar pill was medicine.)

What's the problem with double-blind testing?

Reply #15
Quote
The writer of the letter in Stereophile is incorrect in saying double-blind tests are unreliable, but he has a point. [...] The feeling is lost on some parts of the MP3. [...]


The point of double-blind testing is to make sure that you're getting goosebumps over something you actually hear, and not something you expect to hear.  The whole point is to try to evaluate what you're hearing without knowing whether it's the original or the lossy encode.  That shouldn't remove the "emotional" part of it, just make sure that you're reacting based upon something that is really there, and not what you want to hear or what you expect to hear.

What's the problem with double-blind testing?

Reply #16
There is nothing inherently bad about double blind.

However, 'double blind' is not a test method per se; it's a test methodology requirement (or superclass) meant to rule out bias.

ABC/HR is a rough outline of a test methodology in sensory evaluation.

So is ABX.

However, even these can be set up (i.e. implemented) improperly and used to produce useless tests that prove nothing.

For example, if you run an ABX test with persons who are CONVINCED they will hear NO difference, you have a strong tester bias that a simple ABX test method (even when double blind) does not protect against.

There are many inherently problematic issues in the sensory evaluation of stimuli at the threshold of detection.

Currently, most setups crudely ignore this and often err on the side of ruling out type II errors, which increases the likelihood of type I errors in the test setup. This is especially true of most statistical "analysis" (it's not really analysis, it's crude mechanistic calculation) of the test data, when you look at most of the published studies.

The problem is that many people fail to make a conceptual distinction between the following:

- Double blind test
- ABX test
- Sensory evaluation test

They are not all interchangeable, and a correct procedure in one does not necessarily translate into a correct procedure in another.

What's the problem with double-blind testing?

Reply #17
Tests that try to distinguish very small effects *and* aim for small type I and type II errors require either lots of listeners or lots of trials. See my test sensitivity calculator:

http://ff123.net/export/TestSensitivityAnalyzer.xls

Column "D" shows the Proportion of Distinguishers, that is, the proportion of listeners in a randomly sampled group who would be expected to hear a difference in an audio comparison. The smaller the proportion of distinguishers, the smaller the effect. If the proportion of distinguishers is 30% (a pretty small effect), you'd need 119 listeners, of whom 69 would have to respond correctly, to limit both the type I and type II errors to 0.05. I assume this is the criterion that would satisfy both the "objective" and the "subjective" crowds.

But obviously there is already a huge problem in trying to assemble that many listeners or in attempting such a large number of trials.

ff123
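
For anyone who wants to reproduce numbers like these, here is a minimal sketch of the kind of binomial calculation involved (my own reconstruction, not ff123's actual spreadsheet formulas). It assumes the usual simplification that a "distinguisher" always answers correctly while everyone else guesses at 50%, that each listener contributes one trial, and that Python with scipy is available.

[code]
# Sketch of the sensitivity calculation (a reconstruction, not the actual
# spreadsheet). Assumes distinguishers always answer correctly and
# non-distinguishers guess at 50%, one trial per listener.
from scipy.stats import binom

def abx_sensitivity(p_distinguisher, alpha=0.05, beta=0.05, max_n=2000):
    """Smallest listener count n and pass criterion c keeping both
    the type I error <= alpha and the type II error <= beta."""
    p_correct = p_distinguisher + (1 - p_distinguisher) * 0.5
    for n in range(2, max_n + 1):
        # smallest criterion c with P(X >= c | pure guessing) <= alpha
        c = int(binom.ppf(1 - alpha, n, 0.5)) + 1
        if binom.cdf(c - 1, n, p_correct) <= beta:  # type II small enough
            return n, c
    return None

print(abx_sensitivity(0.30))  # roughly (119, 69), as quoted above
[/code]

Because the binomial distribution is discrete, the exact minimum can wobble by a few listeners, but with a 30% proportion of distinguishers it lands in the neighborhood of 119 listeners with a 69-correct criterion.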

What's the problem with double-blind testing?

Reply #18
There is another problem, similar to one in blind wine tastings: it is very hard to believe that wine experts can't recognize the origin of the samples by taste. Although those tastings are not ABX tests, the same kind of bias is present in audio ABX tests, because the testers are, in general, experts in codecs.

What's the problem with double-blind testing?

Reply #19
I don't see how that is a problem. The goal of an ABX test is to determine whether there is an audible difference; it is not intended to determine the quality of a track. The answer is "yes" or "no", not "good" or "bad" as in wine tastings.

What's the problem with double-blind testing?

Reply #20
Quote
Edit: Hello goon



How in the world of intuition did you figure that bit out? I don't see no stairs!

What's the problem with double-blind testing?

Reply #21
Quote
Tests that try to distinguish very small effects *and* aim for small type I and type II errors require either lots of listeners or lots of trials. [...]

ff123



Interesting. Check out this AES preprint comparing SACD and DVD-A, which seems to be derived from a German student's master's thesis:

http://www.hfm-detmold.de/eti/projekte/diplomarbeiten/dsdvspcm/aes_paper_6086.pdf

Four out of 45 listeners comparing stereo recordings passed an ABX test. None of 100 listeners (including the four mentioned above) passed an ABX using surround material. The material each of the four 'successful' listeners heard was different. Also, they all heard the difference only when using headphones. There may have been some switching noise affecting the results.

Can you make any evaluation of the sensitivity of these tests?

What's the problem with double-blind testing?

Reply #22
Quote
The point of double-blind testing is to make sure that you're getting goosebumps over something you actually hear, and not something you expect to hear. [...]


Well, with MP3, even at 320, I have passed foobar ABX tests with flying colors on many samples from my own CDs compared against the originals. What I am saying is that I cannot get goosebumps while I am in analyzing mode, because during the ABX test I am not really listening to the music but to different aspects of the sound. I do hear the differences, and although they are not that big from an analysis standpoint, they become bigger when listening for enjoyment. Some songs just lose some life when encoded to MP3.

What's the problem with double-blind testing?

Reply #23
Quote
Four out of 45 listeners comparing stereo recordings passed an ABX test. None of 100 listeners (including the four mentioned above) passed an ABX using surround material. [...]

Can you make any evaluation of the sensitivity of these tests?


Proponent of one format vs the other:  "Aha, 4 people definitely heard a difference!" and "That low-level crackling sound when switching didn't affect the test's validity."

Someone in the no-difference camp:  "Aha, 141 out of 145 20-run ABX tests did not show a difference!"

Choose your interpretation.

ff123
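
To put rough numbers on the second interpretation: assuming each of the 145 runs was a 20-trial ABX counted as a "pass" at the usual p < 0.05 criterion (at least 15 of 20 correct; the exact criterion is my assumption, the paper may differ), pure guessing would be expected to produce about three passes anyway. A quick check with scipy:

[code]
# Expected number of chance "passes" among 145 independent 20-trial ABX
# runs, assuming a pass means at least 15/20 correct (p < 0.05).
from scipy.stats import binom

n_trials, n_runs = 20, 145
crit = int(binom.ppf(0.95, n_trials, 0.5)) + 1  # smallest passing score: 15
p_pass = binom.sf(crit - 1, n_trials, 0.5)      # ~0.021 per run by guessing
print(crit, p_pass, n_runs * p_pass)            # ~3 expected chance passes
[/code]

Four observed passes is barely above what guessing alone would produce, which is exactly why the interpretation is up for grabs.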

What's the problem with double-blind testing?

Reply #24
Quote
Quote
You can do long-term blind tests too, not just the typical quick-switch, short-snippet ABX test.

What do you mean by long-term ABX? How do I do this? Play entire songs?

Well, this is a double-edged sword: in theory, long-term DBTs are no problem at all, but in practice most ABX tools aren't designed for this purpose, which makes them a bit more difficult to set up.

It basically goes like this: an ABX test may take as long as you want. There is no limit on the timespan between trials, no limit on how long the entire test takes, and no limit on the number of trials.

What this means is that you could, for example, do a six-month-long DBT where you just pick one of your favorite albums and go about your normal daily schedule... BUT the application you use for playback should hide from you whether the original or the encoded version is currently playing. Then, during those six months, whenever you listen normally to that favorite album and think "this is the encoded version!", you just click a button and afterwards carry on with your day. Over the course of the whole six months you repeat that a dozen times... and you end up with a long-term DBT with so many trials that its accuracy is very high.

The only thing to keep in mind is that there must be no way for you to see the intermediate results during those six months. Another way would be not to limit the test by time, but instead to define the number of trials beforehand... i.e. say "this test will end after 100 trials." If you go with the predefined-number-of-trials approach, then you are not allowed to end the test before completing all 100 trials, or else the results will be invalid.
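
To make the protocol concrete, here is a hypothetical sketch of what such a long-term blind-test logger could look like. Everything in it is a placeholder: the file names, the "your-player" command, and the log path; a real tool would also have to keep file names and tags out of the player's display.

[code]
# Hypothetical long-term blind-test logger (a sketch of the protocol
# described above; file names and the player command are placeholders).
import json
import random
import subprocess

LOG = "longterm_abx.log"
FILES = {"original": "original.wav", "encoded": "encoded.mp3"}

def play_trial():
    """Pick one version at random, play it blind, and log the guess."""
    actual = random.choice(list(FILES))
    subprocess.run(["your-player", FILES[actual]])  # placeholder player
    guess = input("Which did you hear? (original/encoded): ").strip()
    with open(LOG, "a") as f:
        f.write(json.dumps({"actual": actual, "guess": guess}) + "\n")
    # Deliberately prints no feedback: intermediate results stay hidden.

def score():
    """Run this only once the predefined number of trials is complete."""
    with open(LOG) as f:
        trials = [json.loads(line) for line in f]
    correct = sum(t["actual"] == t["guess"] for t in trials)
    print(f"{correct}/{len(trials)} correct")
[/code]

Score it at the end with a binomial test, exactly as for a short-snippet ABX.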