
Topic: Issues with Blind-Testing Headphones and Speakers

Issues with Blind-Testing Headphones and Speakers

New topic. Things become a bit more complex and controversial if you are attempting to level match things that fundamentally have markedly different frequency responses, such as speakers and headphones. Say, for example, one speaker has a stronger level of bass than the other one being tested. Do you level match them using pink noise? Do you level match them at 1 kHz? Do you instead level match them at 500 Hz? Do you use a narrow band of noise centered where the ear is most sensitive, around 3.5 to 4 kHz? Do you use A-weighting when you conduct any of these level matches? You will get very different results depending on which exact method you use when the products have certain frequency response deviations between them.


I always use a 1 kHz sine. Now I have doubts whether that is the best method. I should mention that I don't ABX speakers or amplifiers, just CD players, MD players, and other sources. Should I use, I don't know, ReplayGain?
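For what it's worth, here is a minimal sketch (using invented response curves, not measurements of any real product) of how far apart two common references can land when the devices don't share a frequency response:

```python
# A rough sketch of why the choice of level-matching reference matters.
# The two response curves below are invented for illustration, not measurements.
import numpy as np

freqs = np.array([31.5, 63, 125, 250, 500, 1000, 2000, 4000, 8000, 16000])  # octave bands, Hz

# Hypothetical on-axis responses, in dB relative to an arbitrary reference
resp_a = np.array([-2, -1, 0, 0, 0, 0, 0, 0, -1, -3], float)   # fairly flat
resp_b = np.array([ 6,  5, 3, 1, 0, 0, 1, 4,  2, -2], float)   # bass lift plus a 4 kHz bump

def pink_noise_level(resp_db):
    """Overall level (dB) of pink noise reproduced through a device with the
    given response. Pink noise carries equal energy per octave, and the bands
    above are octave-spaced, so each band contributes equally before shaping."""
    return 10 * np.log10(np.mean(10 ** (resp_db / 10)))

# Method 1: match at 1 kHz only
gain_1khz = resp_a[freqs == 1000][0] - resp_b[freqs == 1000][0]

# Method 2: match broadband pink-noise level
gain_pink = pink_noise_level(resp_a) - pink_noise_level(resp_b)

print(f"Gain to apply to device B, matched at 1 kHz:      {gain_1khz:+.1f} dB")
print(f"Gain to apply to device B, matched on pink noise: {gain_pink:+.1f} dB")
# A-weighting, a 500 Hz tone, or a narrow band around 3.5-4 kHz would each
# give yet another answer.
```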
marlene-d.blogspot.com

Issues with Blind-Testing Headphones and Speakers

Reply #1
New topic. Things become a bit more complex and controversial if you are attempting to level match things that fundamentally have markedly different frequency responses, such as speakers and headphones. Say, for example, one speaker has a stronger level of bass than the other one being tested. Do you level match them using pink noise? Do you level match them at 1 kHz? Do you instead level match them at 500 Hz? Do you use a narrow band of noise centered where the ear is most sensitive, around 3.5 to 4 kHz? Do you use A-weighting when you conduct any of these level matches? You will get very different results depending on which exact method you use when the products have certain frequency response deviations between them.


I always use a 1 kHz sine. Now I have doubts whether that is the best method. I should mention that I don't ABX speakers or amplifiers, just CD players, MD players, and other sources. Should I use, I don't know, ReplayGain?

Since you seem to only A/B things that have fairly flat frequency responses, I don't think you have much to worry about. It is when they vary greatly, as speakers and headphones do, that any method becomes questionable in my mind. This doesn't sit well with some people, I've noticed (AES-published scientists, I mean), so they talk themselves into thinking their method is perfectly valid and beyond reproach, regardless of the specific comparison.

Issues with Blind-Testing Headphones and Speakers

Reply #2
Since you seem to only A/B things that have fairly flat frequency responses, I don't think you have much to worry about. It is when they vary greatly, as speakers and headphones do, that any method becomes questionable in my mind. This doesn't sit well with some people, I've noticed (AES-published scientists, I mean), so they talk themselves into thinking their method is perfectly valid and beyond reproach, regardless of the specific comparison.


I've always assumed that using a sine at 1 kHz would be sufficient for my needs, using exactly the reasoning you provided (flat frequency response). So I can sleep safe again.

I've got no idea about loudspeakers, but I have a pretty good idea about headphones. Judging from the wildly differing frequency responses of several models, I assume it's close to impossible to level-match them using sines or noise. Wouldn't something akin to ReplayGain be a better idea there? Something that takes into account how we perceive music? Or does that defeat the purpose of a DBT because it already kind of "pre-selects" certain differences?
marlene-d.blogspot.com

Issues with Blind-Testing Headphones and Speakers

Reply #3
I've got no idea about loudspeakers, but I have a pretty good idea about headphones. Judging from the wildly differing frequency responses of several models, I assume it's close to impossible to level-match them using sines or noise. Wouldn't something akin to ReplayGain be a better idea there? Something that takes into account how we perceive music?

Whose weighting curve do you use, though? There are many besides just A-weighting. Do you also install an eardrum probe mic under the headphone cushion to monitor the actual level at the eardrum, so the weighting curve can vary depending on the volume the listener happens to choose? That varies too.


Notice that even when using just one standard, ISO 226, the curve changes depending on the selected level at the ear.
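To put rough numbers on "whose weighting curve," here is a small sketch using the textbook IEC 61672 A- and C-weighting formulas (Z-weighting is simply flat); the ISO 226 contours mentioned above would add yet another family that also shifts with playback level:

```python
# How much the common weighting curves disagree, especially in the bass.
# A- and C-weighting per the standard IEC 61672 approximation formulas.
import numpy as np

def a_weight_db(f):
    f = np.asarray(f, float)
    ra = (12194.0**2 * f**4) / ((f**2 + 20.6**2)
         * np.sqrt((f**2 + 107.7**2) * (f**2 + 737.9**2))
         * (f**2 + 12194.0**2))
    return 20 * np.log10(ra) + 2.00   # normalized to ~0 dB at 1 kHz

def c_weight_db(f):
    f = np.asarray(f, float)
    rc = (12194.0**2 * f**2) / ((f**2 + 20.6**2) * (f**2 + 12194.0**2))
    return 20 * np.log10(rc) + 0.06   # normalized to ~0 dB at 1 kHz

for f in (31.5, 63, 125, 1000, 4000):
    print(f"{f:7.1f} Hz   A: {a_weight_db(f):+6.1f} dB   "
          f"C: {c_weight_db(f):+6.1f} dB   Z: +0.0 dB")

# At 31.5 Hz the A and C curves differ by roughly 36 dB, so a speaker with
# strong bass looks "louder" or "quieter" depending purely on the curve chosen.
```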

Issues with Blind-Testing Headphones and Speakers

Reply #4
I see. So there are just too many variables. Which, IMO, also means that one could influence a result towards a preferred outcome? Example: Harman might use this to discredit competitors during their loudspeaker tests by using weighting models better suited to their own products. Wrong?
marlene-d.blogspot.com

Issues with Blind-Testing Headphones and Speakers

Reply #5
I'm of the mind that one should accept that level matching things like speakers and headphones is so tricky, problematic, and questionable that "fair, level matched comparisons" simply can't be done. This doesn't stop researchers at Harman, etc. from publishing papers where they say "and we level matched," and nobody but me seems to blink an eye. "Our trained listeners preferred speaker A over B due to tonal balance. We level matched the two so that couldn't have influenced their selection." Um, now wait a minute. If you level matched them using A-weighting and speaker A has a broad peak at 4 kHz that speaker B doesn't, I'd hardly describe the two as being "level matched" using that method, because the very concept of what is "correct in all situations" is itself nebulous and controversial.


The original Toole & Olive work used B-weighting and pink noise, always monophonic. I don't recall them ever writing "We level matched the two so that couldn't have influenced their selection." For recent headphone preference work, Welti and Olive report that relative loudness was normalized as per the ITU BS.1770 "Algorithms to measure audio programme loudness and true-peak audio level" recommendations. Sean Olive posts to HA, as well as to his own blog, so people might want to pose questions to him directly rather than insinuate and/or come to premature conclusions here.
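For readers curious what a BS.1770-style normalization looks like in practice, here is a sketch using the third-party pyloudnorm package and two stand-in sine signals; it illustrates the recommendation in general terms, not Harman's actual tooling or procedure:

```python
# Normalizing two signals to the same ITU-R BS.1770 integrated loudness.
# pyloudnorm is a third-party implementation of the BS.1770 meter; the
# signals here are placeholders, not the programme material used in the study.
import numpy as np
import pyloudnorm as pyln

rate = 48000
t = np.arange(rate * 5) / rate
sig_a = 0.1 * np.sin(2 * np.pi * 1000 * t)   # quieter stand-in signal
sig_b = 0.3 * np.sin(2 * np.pi * 1000 * t)   # louder stand-in signal

meter = pyln.Meter(rate)                      # K-weighting + gating per BS.1770
loud_a = meter.integrated_loudness(sig_a)     # in LUFS
loud_b = meter.integrated_loudness(sig_b)

target = -23.0                                # e.g. the EBU R128 target, in LUFS
sig_a_norm = sig_a * 10 ** ((target - loud_a) / 20)
sig_b_norm = sig_b * 10 ** ((target - loud_b) / 20)

print(f"A: {loud_a:.1f} LUFS -> {meter.integrated_loudness(sig_a_norm):.1f} LUFS")
print(f"B: {loud_b:.1f} LUFS -> {meter.integrated_loudness(sig_b_norm):.1f} LUFS")
```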

Issues with Blind-Testing Headphones and Speakers

Reply #6
I've never thought of it that way, Cavaille, up until now at least, but I guess that is technically possible.

I have other problems with some of the Harman studies. Take, for instance, their, *ahem*, "blind" comparison of headphones. Although as I understand it they attached plastic handles to the headphones so listeners could adjust them for position and comfort without touching the headphones' main body, so as not to disclose the identity by finger touch alone [Good!], do people honestly think they wouldn't be able to tell by the feel on the head/ears when, for example, they were using the massive, roughly three-times-heavier, circular-ear-cushioned Audeze (600 g) compared to the markedly smaller, lighter (193 g), oval-ear-cushioned Bose they tested?

I suspect you could ask any 10-year-old, "Based on their appearance alone, which of these 6 headphones do you suspect costs the most?" and they'd go with the giant, round-eared Audeze nine times out of ten.

The headphones were obscured from view during the actual testing, true, but how they feel pressing against the head and ears was not. Just because that is difficult/impossible to account for doesn't prove it isn't a potential problem which needs to be controlled.*

Sources for weight and shape:
http://www.bose.com/controller?url=/shop_o...rt_15/index.jsp
http://www.audeze.com/products/headphones/lcd-2

I'm confident Dr. Olive will see this momentarily and correct me if I'm wrong.

*[If I were testing people's headphone response curve preferences, I would have simulated them electrically via outboard EQ and had the test subjects wear exactly the same pair of headphones for all comparisons. Not to say that the concept of whatever is "proper" level matching isn't still debatable and up for grabs using my unorthodox method.]
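Roughly what I have in mind, sketched with made-up response figures rather than real measurements: take the dB difference between the two headphones' responses and impose it as an EQ filter, so every trial happens on the same physical headphone.

```python
# Simulating headphone B's tonal balance on headphone A via a linear-phase FIR EQ.
# The response numbers are invented for illustration; only tonal balance is
# simulated (distortion, leakage, weight and comfort obviously are not).
import numpy as np
from scipy.signal import firwin2, fftconvolve

fs = 48000
freqs = np.array([0, 31.5, 63, 125, 250, 500, 1000, 2000, 4000, 8000, 16000, fs / 2])

# Hypothetical magnitude responses of the two headphones, in dB
resp_a_db = np.array([0, -2, -1, 0, 0, 0, 0, 0, 0, -1, -3, -6.0])
resp_b_db = np.array([6,  6,  5, 3, 1, 0, 0, 1, 4,  2, -2, -8.0])

eq_db = resp_b_db - resp_a_db            # EQ that makes A mimic B
eq_gain = 10 ** (eq_db / 20)             # linear magnitude targets
taps = firwin2(1023, freqs, eq_gain, fs=fs)

def simulate_b_on_a(audio):
    """Filter the programme so that, played on headphone A, it approximates
    headphone B's frequency response."""
    return fftconvolve(audio, taps, mode="same")

audio = np.random.randn(fs)              # 1 s of noise as a stand-in programme
eq_audio = simulate_b_on_a(audio)
```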

Issues with Blind-Testing Headphones and Speakers

Reply #7
I'm of the mind that one should accept that level matching things like speakers and headphones is so tricky, problematic, and questionable that "fair, level matched comparisons" simply can't be done. This doesn't stop researchers at Harman, etc. from publishing papers where they say "and we level matched," and nobody but me seems to blink an eye. "Our trained listeners preferred speaker A over B due to tonal balance. We level matched the two so that couldn't have influenced their selection." Um, now wait a minute. If you level matched them using A-weighting and speaker A has a broad peak at 4 kHz that speaker B doesn't, I'd hardly describe the two as being "level matched" using that method, because the very concept of what is "correct in all situations" is itself nebulous and controversial.


The original Toole & Olive work used B-weighting and pink noise, always monophonic. I don't recall them ever writing "We level matched the two so that couldn't have influenced their selection."

I was paraphrasing, of course, but the concept is implied by the use of the word "normalized" in one of their bullet points, from your first link:

"Relative loudness differences normalized (ITU-R 1770 )", at least for that study it is referring to. No?


Everyone agrees level matching is important. Where people differ is how to go about doing it or if it even can be done.

Issues with Blind-Testing Headphones and Speakers

Reply #8
The original Toole & Olive work used B-weighting and pink noise, always monophonic. I don't recall them ever writing "We level matched the two so that couldn't have influenced their selection." For recent headphone preference work, Welti and Olive report that relative loudness was normalized as per the ITU BS.1770 "Algorithms to measure audio programme loudness and true-peak audio level" recommendations. Sean Olive posts to HA, as well as to his own blog, so people might want to pose questions to him directly rather than insinuate and/or come to premature conclusions here.


I find a lot of the research Sean Olive has been doing extremely fascinating (in part because the results feel similar to my own experiences/preferences). However, he works for a company whose intent it is to sell products. Therefore, I don't think that careful questions regarding ethics are uncalled for. I don't think that their sole reason for doing this is the good of all mankind.

I have other problems with some of the Harman studies. Take, for instance, their, *ahem*, "blind" comparison of headphones. Although as I understand it they attached plastic handles to the headphones so listeners could adjust them for position and comfort without touching the headphones' main body, so as not to disclose the identity by finger touch alone [Good!], do people honestly think they wouldn't be able to tell by the feel on the head/ears when, for example, they were using the massive, roughly three-times-heavier, circular-ear-cushioned Audeze (600 g) compared to the markedly smaller, lighter (193 g), oval-ear-cushioned Bose they tested?


If memory serves me right (and please correct me in case I'm wrong)... wasn't there a study where the frequency responses of certain headphones were applied to just one model in order to mimic their sonic signatures? If so, that would exclude the need to change headphones.
marlene-d.blogspot.com

Issues with Blind-Testing Headphones and Speakers

Reply #9
I've never thought of it that way, Cavaille, up until now at least, but I guess that is technically possible.

I have other problems with some of the Harman studies. Take, for instance, their, *ahem*, "blind" comparison of headphones. Although as I understand it they attached plastic handles to the headphones so listeners could adjust them for position and comfort without touching the headphones' main body, so as not to disclose the identity by finger touch alone [Good!], do people honestly think they wouldn't be able to tell by the feel on the head/ears when, for example, they were using the massive, roughly three-times-heavier, circular-ear-cushioned Audeze (600 g) compared to the markedly smaller, lighter (193 g), oval-ear-cushioned Bose they tested?

I suspect you could ask any 10-year-old, "Based on their appearance alone, which of these 6 headphones do you suspect costs the most?" and they'd go with the giant, round-eared Audeze nine times out of ten.

The headphones were obscured from view during the actual testing, true, but how they feel pressing against the head and ears was not. Just because that is difficult/impossible to account for doesn't prove it isn't a potential problem which needs to be controlled.*

Sources for weight and shape:
http://www.bose.com/controller?url=/shop_o...rt_15/index.jsp
http://www.audeze.com/products/headphones/lcd-2



You're saying that the feel of the headphones could cause bias -- in which direction? IOW, which phones do you, or your 10-year-old, predict will perform best and worst in the listener self-report? Would you be able to predict their ranking (from best to worst)? And how would you square that with Olive's finding that an objectively 'neutral' performance correlated best with subjective preference?

Issues with Blind-Testing Headphones and Speakers

Reply #10
I find a lot of the research Sean Olive has been doing extremely fascinating (in part because the results feel similar to my own experiences/preferences). However, he works for a company whose intent it is to sell products.



He does now (he's also the president of the AES).  But the loudspeaker preference work with Toole was first undertaken when they worked at the National Research Council of Canada. 

And the 'winner and loser' brands/models are never identified in the published research.  Hard to see how this could be gaming the system when the reader doesn't know how well Harman/JBL's product (if any) did. 

And it would be churlish to dun Olive/Toole for *applying* the knowledge gained from such research.

Issues with Blind-Testing Headphones and Speakers

Reply #11
I find a lot of the research Sean Olive has been doing extremely fascinating (in part because the results feel similar to my own experiences/preferences). However, he works for a company whose intent it is to sell products. Therefore, I don't think that careful questions regarding ethics are uncalled for. I don't think that their sole reason for doing this is the good of all mankind.


They do research so that they can develop, sell/offer better products, and make that research publicly available. How is that unethical?
"I hear it when I see it."

Issues with Blind-Testing Headphones and Speakers

Reply #12
You're saying that the feel of the headphones could cause bias -- in which direction? IOW, which phones do you, or your 10-year-old, predict will perform best and worst in the listener self-report?

I'm not saying a headphone's size and weight are the only things which would weigh in a person's decision, albeit not consciously, but the basic prejudice would flow like this:

Big and heavy electronic gizmos are the expensive ones = best quality ones.

Correct me if I'm wrong but weren't the Audeze, which are the heaviest and largest in the study, IIRC, also ranked as "the best/most accurate"?

I can give you an anecdotal example where a company used this well known bias.

In the late '80s/early '90s I was a Denon dealer, and they were one of the top names in CD players back then. For some reason there was a top-of-the-line unit we were throwing in the trash [because it had fallen from a shelf and had a bashed-in corner, making it toast, for example]. Out of curiosity I dismantled the remains and discovered, hidden away from view and completely obscured by the main internal circuit board, a very thick slab of steel, a metal plate heavier than almost all other brands' entire CD players, heck, it was even heavier than some of their own receivers. It wasn't attached to anything and served no electrical or heat-dissipation purpose. I'm confident its sole purpose was to make the entire product's heft and "build quality" seem greater, and it served no other function.

[I think it was the DCD-3300 or DCD-3000, IIRC, but I'm not 100% sure at this point.]

Plus, think of how many forum reviews of amps and receivers we've all read where the authors felt it useful to describe the unit's heft, which they note with some pride, as if that should convey to us something about its quality.

Issues with Blind-Testing Headphones and Speakers

Reply #13
And the 'winner and loser' brands/models are never identified in the published research.  Hard to see how this could be gaming the system when the reader doesn't know how well Harman/JBL's product (if any) did.


Considering Harman uses the data in their marketing and advertising, at least sometimes, I'd hardly call it a secret though, even if it wasn't technically spelled out in the original paper:

Harman Marketing page

Issues with Blind-Testing Headphones and Speakers

Reply #14
You're saying that the feel of the headphones could cause bias -- in which direction? IOW, which phones do you, or your 10-year-old, predict will perform best and worst in the listener self-report?

I'm not saying a headphone's size and weight are the only things which would weigh in a person's decision, albeit not consciously, but the basic prejudice would flow like this:

Big and heavy electronic gizmos are the expensive ones = best quality ones.

Correct me if I'm wrong but weren't the Audeze, which are the heaviest and largest in the study, IIRC, also ranked as "the best/most accurate"?



I don't know, and neither do you. That's my point. But it's not just which is ranked best, it's also how they are ranked. By your hypothesis, they should rank from best to worst according to weight or 'feel'. Are they?



Issues with Blind-Testing Headphones and Speakers

Reply #15
And the 'winner and loser' brands/models are never identified in the published research.  Hard to see how this could be gaming the system when the reader doesn't know how well Harman/JBL's product (if any) did.


Considering Harman uses the data in their marketing and advertising, at least sometimes, I'd hardly call it a secret though, even if it wasn't technically spelled out in the original paper:

Harman Marketing page


Again, why would they NOT use their research in *marketing* if the results favored their product? 

A correlation between good 'sound power' metrics and listener preference is what the 1980s CNRC and subsequent Harman 'academic' work reported; Harman incorporates this into their loudspeaker design, so that by 2004 one of their comparatively low-cost Infinity loudspeakers bests the other three designs in a DBT involving both trained and untrained listeners. Is this evidence of biased research? Or is it proof of concept?

Other speaker mfrs are free to design their products according to the CNRC guidelines.  And some have.

Issues with Blind-Testing Headphones and Speakers

Reply #16
By your hypothesis, they should rank from best to worst according to weight or 'feel'.

False. I never said that. I said it could partly influence a decision just as easily as doing a sighted study where the listeners were free to see what headphone they were wearing.

I'm not saying a headphone's size and weight are the only things which would weigh in a person's decision, albeit not consciously....

Issues with Blind-Testing Headphones and Speakers

Reply #17
By your hypothesis, they should rank from best to worst according to weight or 'feel'.

False. I never said that. I said it could influence a decision just as easily as doing a sighted study where the listeners were free to see what headphone they were wearing.



That's not unreasonable, but it's stipulated (especially the 'as easily' part) rather than demonstrated. Toole and Olive have studied, and published on, factors influencing loudspeaker preference. I'm not aware that Olive et al. quantified the relative impact of headphone feel, though I haven't read all their papers on this subject. Have you?

If it *has* had an effect as powerful as, say, price, then you really should be able to predict what the likely preference ranking trend is when this variable is not controlled for.


I'm not saying a headphone's size and weight are the only things which would weigh in a person's decision, albeit not consciously....


Which is why I wrote weight or 'feel'. (I really don't know what else you could be referring to, beyond that, in a visually 'blinded' protocol.)

Issues with Blind-Testing Headphones and Speakers

Reply #18
If it *has* had an effect as powerful as, say, price, then you really should be able to predict what the likely preference ranking trend is when this variable is not controlled for.

False. When you do sighted comparisons, or ones where the test subjects might be able to identify the headphones by, say, touch or feel, you never know if their (possibly subconscious) expectation bias plays almost no part in the decision process, a minor part, or a major part; all you know is that it may have had some impact.

Olive/Welti seem to agree with me that touch/feel might be an influence, hence their decision to add small external handles to the phones so that the users' positioning for comfort and seal could be accomplished without their fingers identifying the brand they were wearing. They just overlooked how the different shapes of the ear cushions and the products' overall weight might tip the listeners off to the actual IDs, or at least allowed them to sense, "Wow, this one is heavier and larger than the rest, I wonder if that means it's pricey, too."

Issues with Blind-Testing Headphones and Speakers

Reply #19
You're saying that the feel of the headphones could cause bias -- in which direction? IOW, which phones do you, or your 10-year-old, predict will perform best and worst in the listener self-report?

I'm not saying a headphone's size and weight are the only things which would weigh in a person's decision, albeit not consciously, but the basic prejudice would flow like this:

Big and heavy electronic gizmos are the expensive ones = best quality ones.

Correct me if I'm wrong but weren't the Audeze, which are the heaviest and largest in the study, IIRC, also ranked as "the best/most accurate"?



I don't know, and neither do you. 

"Olive and Welti didn't reveal the scores of the individual headphones, but having measured all of the headphones myself (except in the case of the LCD2, but I've measured the similar LCD3), it looks to me like the clearly preferred headphone was the Audeze LCD2" - Brent Butterworth, Sound and Vision Magazine http://www.soundandvision.com/content/bigg...udio-story-2012


"We admitted up front that comfort/tactile factors were not eliminated from the test, and in that sense the test wasn't blind. However, our listeners didn't know which brands and models headphones were being tested so unless they could recognize the specific brand/model by its weight/comfort alone, their judgments weren't influenced by brand, price,etc" -  S.Olive

http://www.hydrogenaud.io/forums/index.php?showtopic=100538 post #8

They were influenced by size and weight. Consumers, even audio-naïve ones like ten-year-olds, equate large and heavy with pricey/high quality, and that potentially influenced them, even if it wasn't the only factor. I also suspect the listeners were shown what headphones they were going to compare beforehand, and could then later correlate that to the feel they experienced, although I don't know for sure.

Perhaps they also thought, "Gee, I bet that big one with the fancy wood trim and the big, round cushions will be a standout!" Just a thought.

From Jurassic Park:

Donald Gennaro: [Tim pops up wearing a pair of night vision goggles] Hey, where'd you find that?

Tim: In a box under my seat.

Donald Gennaro: Are they heavy?

Tim: Yeah.

Donald Gennaro: Then they're expensive, put 'em back.

Issues with Blind-Testing Headphones and Speakers

Reply #20
They were influenced by size and weight. Consumers, even audio-naïve ones like ten-year-olds, equate large and heavy with pricey/high quality, and that potentially influenced them, even if it wasn't the only factor. I also suspect the listeners were shown what headphones they were going to compare beforehand, and could then later correlate that to the feel they experienced, although I don't know for sure.

Perhaps they also thought, "Gee, I bet that big one with the fancy wood trim and the big round cushions is a standout!" Just a thought.

Sure, you cannot eliminate the ear pads, so comfort is always going to have some influence, but so will the freaking sound. If you tried to add weight to the headphones you'd probably get problems with seal, fit, etc.

I don't think they were shown the headphones beforehand. If they had known which headphone they were listening to, then why didn't they rank the AKG (Harman) K550 better? I mean, why would they rank a Bose higher?
Also, the K550 is a pretty heavy headphone too, yet it was ranked below lighter ones... like the Bose one (which is about 100 g lighter, plus the weight of the K550's cable).


... and there's the correlation with the measurements. You would have a (stronger) point if the Audeze headphone measured worse than the other headphones, but it doesn't.
The Audeze has the lowest distortion, the most consistent performance across different heads, good frequency response...
"I hear it when I see it."

Issues with Blind-Testing Headphones and Speakers

Reply #21
They were influenced by size and weight. Consumers, even audio-naïve ones like ten-year-olds, equate large and heavy with pricey/high quality, and that potentially influenced them, even if it wasn't the only factor. I also suspect the listeners were shown what headphones they were going to compare beforehand, and could then later correlate that to the feel they experienced, although I don't know for sure.

Perhaps they also thought, "Gee, I bet that big one with the fancy wood trim and the big round cushions is a standout!" Just a thought.

If they had known which headphone they were listening to, then why didn't they rank the AKG (Harman) K550 better?

Please link to this ranking you speak of.

Quote
You would have a (stronger) point if the Audeze headphone measured worse than the other headphones, but it doesn't.
I have no idea how you figure that.

Issues with Blind-Testing Headphones and Speakers

Reply #22
I have no idea how you figure that.

If weight had a strong influence, then we would see a ranking that correlates with the weight, but instead we see a ranking that correlates with the measurements.
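A toy version of that check, using invented numbers only (the per-model scores weren't published): compute the rank correlation of the preference ratings against weight and against a measurement-based score.

```python
# If weight drove preference, preference rank should track weight rank;
# if sound quality drove it, preference should track the measurement score.
# All values below are invented placeholders, not the actual Harman data.
from scipy.stats import spearmanr

weights_g     = [600, 450, 330, 280, 193]    # hypothetical headphone weights
measure_score = [9.0, 5.5, 6.5, 4.0, 7.0]    # hypothetical closeness-to-target score
preference    = [8.5, 5.0, 6.0, 4.5, 7.5]    # hypothetical mean listener rating

rho_w, p_w = spearmanr(weights_g, preference)
rho_m, p_m = spearmanr(measure_score, preference)

print(f"preference vs weight:        rho = {rho_w:+.2f} (p = {p_w:.2f})")
print(f"preference vs measurements:  rho = {rho_m:+.2f} (p = {p_m:.2f})")
```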

Also, they did much more research than just "blind"-testing 5 headphones for preference. They tested several equalization curves, equalization based on measurements of loudspeakers in their reference listening room, bass-treble balance, compared trained listeners' preferences to kids' preferences, and compared their new target curves with the HD800, LCD2...

Given that for some of these tests a single headphone with different EQ curves applied was used, and those test results mirrored the preference of the initial 5 headphone test, you simply cannot argue that weight or pads had a significant effect.
"I hear it when I see it."

Issues with Blind-Testing Headphones and Speakers

Reply #23
If weight had a strong influence...

Never said "strong". I said, "it may have had some impact." [bold emphasis in original post I'm quoting]

Quote
If they had known which headphone they were listening to, then why didn't they rank the AKG (Harman) K550 better?

Still waiting for that ranking by brand you referred to, by the way.

Issues with Blind-Testing Headphones and Speakers

Reply #24
Never said "strong".

So? You implied how weight may have skewed the results. I explained what we should have seen if weight had had a significant influence.

The thing is that the mood of the listeners also had an influence, as did the air temperature, humidity ... but those are (also) insignificant points.


Quote
Still waiting for that ranking by brand you referred to, by the way.

Can be found if you follow your own link.
"I hear it when I see it."