Other Listening Test Methodologies? (Methods other than ABX)
UltimateMusicSnob
post Sep 20 2013, 14:13
Post #1





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



Has anyone come across other listening test methodologies they thought were highly revealing, useful, and rigorous? In a few papers I've seen (mostly AES) there are methodological details that vary here and there, but mostly researchers are doing double-blind tests of direct comparisons between two files.

There could be a role, for example, for tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take many people, assign them to different groups, and collect Likert-scale data on their response to one class of stimuli. You'd probably want 100-300 people per group, still exposed double-blind, but each group hearing only one format.

This would present a calibration problem, of course, since who's to say what each number on the Likert scale represents. Still, with careful random sampling and a large enough sample, the variations in individual judgment might wash out (or they might add so much noise that no significant result is obtained). Analysis would look for systematic, significant differences between groups.
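A quick simulation can illustrate the hope here. Everything below is hypothetical: the per-subject offsets, the noise levels, and the "true quality" numbers are invented purely for the sketch.

```python
import random

def simulate_group(n, true_quality):
    """Each subject applies a private offset to the 1-5 Likert scale
    (the calibration problem) plus trial-to-trial noise; ratings are
    clamped to the ends of the scale."""
    ratings = []
    for _ in range(n):
        offset = random.gauss(0, 1.0)   # subject's personal calibration
        noise = random.gauss(0, 0.5)    # momentary judgment noise
        ratings.append(min(5.0, max(1.0, true_quality + offset + noise)))
    return ratings

random.seed(1)
a = simulate_group(300, 4.0)  # hears only the reference format
b = simulate_group(300, 3.6)  # hears only a (hypothetically) degraded format

# With n = 300 per group, the difference in group means still reflects
# the underlying 0.4-point gap despite wildly varying calibrations.
diff = sum(a) / len(a) - sum(b) / len(b)
```

Clamping at the ends of the scale compresses the observed difference somewhat, which is one more reason large groups would be needed.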

As a variation, a long-term experiment might ask a subject to listen to a single exposure of at least 10 minutes or so each day, and rate it on a Likert scale, then come back on successive days until at least 30 and preferably over 100 data points are obtained. Still lots of statistical noise here, but that's one of the points of large sample sets.

In the past such tests would have had huge practical barriers, but I'll bet it could be done now on the Internet.

This post has been edited by UltimateMusicSnob: Sep 20 2013, 14:13
2Bdecided
post Sep 20 2013, 14:41
Post #2


ReplayGain developer


Group: Developer
Posts: 5286
Joined: 5-November 01
From: Yorkshire, UK
Member No.: 409



Not quite what you're saying (because there are still comparisons), but closer...
http://soundexpert.org/home

Please see past HA discussions about soundexpert.


The classic BS 1116 test isn't just ABX, it's ABC...

QUOTE
In the preferred and most sensitive form of this method, one subject at a time is involved and the selection of one of three stimuli ("A", "B", "C") is at the discretion of this subject. The known reference is always available as stimulus "A". The hidden reference and the object are simultaneously available but are "randomly" assigned to "B" and "C", depending on the trial.

The subject is asked to assess the impairments on “B” compared to “A”, and “C” compared to “A”, according to the continuous five-grade impairment scale. One of the stimuli, “B” or “C”, should be indiscernible from stimulus “A”; the other one may reveal impairments. Any perceived differences between the reference and the other stimuli must be interpreted as an impairment.


from...
http://www.itu.int/dms_pubrec/itu-r/rec/bs...;!PDF-E.pdf


Note those first words "In the preferred and most sensitive form of this method..." - people have tried a lot of ways of doing this, and we end up with ABX or very similar because it's the most sensitive method.


Another sensitive method is the Three-Alternative Forced Choice test (3-AFC). Present A, B and C, where two of those are the original and one is the coded version. Pick the odd one out. If the user picks correctly, move to a higher-quality version. If the user picks incorrectly, move to a lower-quality version. Great for training and finding thresholds of audibility, but fraught with problems in terms of moving up and down the "quality" scale in a useful way. Easy for simple masking experiments. Harder and less reliable for finding the transparency threshold of a codec, though it's one possible tool.
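The up/down rule described above can be sketched as a simple staircase. This is a toy model: the number of levels, the trial count, and the simulated listener's detection function are all invented for illustration.

```python
import random

def staircase_3afc(detect_probability, levels=10, trials=40, start=0):
    """1-up/1-down staircase over discrete quality levels for a 3-AFC
    task. detect_probability(level) is the chance the simulated
    listener picks the odd one out at that level (never below the
    1/3 guessing floor). Correct -> higher-quality (harder) level,
    incorrect -> lower-quality (easier) level."""
    level = start
    history = []
    for _ in range(trials):
        history.append(level)
        correct = random.random() < detect_probability(level)
        level = min(levels - 1, level + 1) if correct else max(0, level - 1)
    return history

random.seed(7)
# Toy listener whose detection rate falls off as quality improves,
# bottoming out at the 1/3 chance rate.
run = staircase_3afc(lambda lv: max(1 / 3, 1.0 - 0.1 * lv))
# The visited levels oscillate around the listener's threshold.
```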

Cheers,
David.
UltimateMusicSnob
post Sep 20 2013, 19:49
Post #3





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



Great info, thank you. Depending on how confident I am, I have conducted a de facto ABC round from time to time in the foobar ABX interface. Nothing forces the user to check both A and B before deciding, so one way I've proceeded is to check A alone, and then decide which of X or Y is actually A.

I'm also thinking of a case where a control group with a large sample n gets *only* a referent file, like a CD-mastered copy, and then rates that on a variety of quality-oriented Likert scales. There *is* an implicit comparison between that and "everything else I listen to besides this test file", but subjects would not A/B. Then other treatment groups get *only* a manipulated file derived from the referent, and respond on the same survey measures. The hope would be to reject the null hypothesis on a t-test of no difference between the aggregated measures for each group.
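The group-vs-group comparison at analysis time could be as simple as Welch's two-sample t statistic on the ratings (a stdlib sketch; the example groups below are placeholders, and for large samples |t| > 1.96 corresponds roughly to the two-sided 5% level):

```python
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic for two independent samples with possibly
    unequal variances, e.g. Likert ratings from the referent group
    vs. a treatment group."""
    se2 = variance(group_a) / len(group_a) + variance(group_b) / len(group_b)
    return (mean(group_a) - mean(group_b)) / se2 ** 0.5

# Placeholder ratings: identical groups give t = 0; a downward-shifted
# treatment group gives a positive t.
t = welch_t([4, 5, 4, 3, 4], [3, 4, 3, 2, 3])
```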

This post has been edited by db1989: Sep 20 2013, 19:54
Reason for edit: deleting pointless full quote
testyou
post Sep 21 2013, 07:19
Post #4





Group: Members
Posts: 99
Joined: 24-September 10
Member No.: 84113



But aren't you comparing a qualitative judgment between two different groups using two different samples? Do you think it likely to obtain any meaningful result?

With an ABX test you can either reject the null hypothesis by identifying X correctly with significance, or fail to reject it and imply transparency. This gives you a strong result.
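For reference, the ABX significance figure is just a one-sided binomial test against guessing (p = 0.5 per trial); a minimal stdlib sketch:

```python
from math import comb

def abx_p_value(correct, trials):
    """Probability of getting at least `correct` out of `trials`
    ABX trials right by pure guessing (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

p = abx_p_value(12, 16)  # 12/16 correct -> p ≈ 0.038, under the usual 0.05
```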

This post has been edited by testyou: Sep 21 2013, 07:20
C.R.Helmrich
post Sep 21 2013, 09:30
Post #5





Group: Developer
Posts: 692
Joined: 6-December 08
From: Erlangen Germany
Member No.: 64012



QUOTE (UltimateMusicSnob @ Sep 20 2013, 20:49) *
I'm also thinking of a case where a control group with a large sample n gets *only* a referent file, like a CD-mastered copy, and then rates that on a variety of quality-oriented Likert scales. There *is* an implicit comparison between that and "everything else I listen to besides this test file", but subjects would not A/B. Then other treatment groups get *only* a manipulated file derived from the referent, and respond on the same survey measures.

I think what you're describing here is provided by the Absolute Category Rating (ACR) configuration of a P.800 test: https://www.itu.int/rec/dologin_pub.asp?lan...!!PDF-E

By the way, there is also the MUSHRA methodology, which is very similar to the BS.1116: http://www.itu.int/dms_pubrec/itu-r/rec/bs...;!PDF-E.pdf

Chris


--------------------
If I don't reply to your reply, it means I agree with you.
UltimateMusicSnob
post Sep 21 2013, 14:46
Post #6





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (testyou @ Sep 21 2013, 01:19) *
But you are comparing a qualitative judgment between two different groups using two different samples? Would you think it likely to obtain any meaningful result?

With ABX test: you can reject null hypothesis by correctly identifying x with significance, or fail to reject and imply transparency. This gives you a strong result.
Possibly just two different samples, but I was thinking of classes of samples. One group does only, say, MP3 128, the other only AAC 128; given those settings, what I'd do then is encode each subject's own personal listening library. Collect the Likert scales plus demographics and personal listening information (dollars spent, type of equipment, genres, hours/day, etc.). Then, with many hundreds of subjects over a period of weeks, collect a lot of data. Potentially you'd get results like "Punk rockers, the elderly and those who listened primarily in the car showed no difference between the treatment groups, while young listeners playing pop on headphones preferred AAC."
I actually filled out an online survey solicited by Sony with questions about my listening habits and preferences, and it did ask about different formats as well, just without any actual listening tests. This information would be of great interest to Sony, but getting the methodology together is difficult.

I'm trying to get away from the unrealistic spot listening encouraged by foobar-type ABXing. I never listen to 30 seconds of music, and I certainly don't play pieces over and over. I listen to one piece, and then move on. My appreciation of the audio quality is cumulative over time, not locked to a variety of spots.

[Caveat: perhaps this goes to successful ABXing: I ***do*** listen to the same spot over and over when I'm practicing a new piece on a live instrument. Getting the septuplet flourish in the Chopin fingered right and phrased beautifully is precisely the exercise of ABXing: short excerpt, repeated listening, listening as a perfectionist for every detail.]

I listened to YouTube music videos with my children for a good two hours the other evening, and my ears acclimated to it. Then I switched to one of my own Redbook-ripped-to-HD tracks--the soundstage leaped out of the speakers, a dramatic contrast. This could conceivably be expanded as a methodology: "Listen all your usual ways all day. Then stop for ten minutes and listen to the prescribed track in the prescribed way. Then rate the quality on these dimensions. No re-listening."

The ABX result IS a strong result, the best there is, but for a highly artificial listening situation. To put it another way, ABX is a lab experiment, well-controlled and rigorous, but with artificial conditions and poor generalizability. Something closer to a field experiment would also be useful. The data would be noisier for sure, but the potential for results directly applicable to development and product offerings would make it worth it to find out if significant results are possible.

This post has been edited by UltimateMusicSnob: Sep 21 2013, 14:50
UltimateMusicSnob
post Sep 21 2013, 14:59
Post #7





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (C.R.Helmrich @ Sep 21 2013, 03:30) *
QUOTE (UltimateMusicSnob @ Sep 20 2013, 20:49) *
I'm also thinking of a case where a control group with a large sample n gets *only* a referent file, like a CD-mastered copy, and then rates that on a variety of quality-oriented Likert scales. There *is* an implicit comparison between that and "everything else I listen to besides this test file", but subjects would not A/B. Then other treatment groups get *only* a manipulated file derived from the referent, and respond on the same survey measures.

I think what you're describing here is provided by the Absolute Category Rating (ACR) configuration of a P.800 test: https://www.itu.int/rec/dologin_pub.asp?lan...!!PDF-E

By the way, there is also the MUSHRA methodology, which is very similar to the BS.1116: http://www.itu.int/dms_pubrec/itu-r/rec/bs...;!PDF-E.pdf

Chris
Excellent, that's what I was looking for. The unit of treatment exposure in the telephone tests is the "conversation", which is a highly realistic representation of the normal mode of use for that population. One long stimulus, followed by subject responses on a number of quality dimensions. Good procedure.
Arnold B. Krueger
post Sep 23 2013, 13:29
Post #8





Group: Members
Posts: 4324
Joined: 29-October 08
From: USA, 48236
Member No.: 61311



QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13) *
Has anyone come across other listening test methodologies they thought were highly revealing, useful, and rigorous? In a few papers I've seen (mostly AES) there are methodological details that vary here and there, but mostly researchers are doing double-blind tests of direct comparisons between two files.


I'm trying to figure out where you are going here. Your statement "mostly researchers are doing double-blind tests of direct comparisons between two files" is extremely general and seems to describe a very large family of procedures that is hard to criticize. If not double-blind, then what, sighted? If not direct comparisons, then what, indirect comparisons?

If you want to see a general treatment of comparison methodologies, look at the classic:

http://www.amazon.com/Sensory-Evaluation-T...n/dp/0849338395

Sensory Evaluation Techniques, Fourth Edition [Hardcover]
Morten C. Meilgaard (Author), B. Thomas Carr (Author), Gail Vance Civille (Author)

The current favorite for perceptual coding testing seems to be MUSHRA.

http://en.wikipedia.org/wiki/MUSHRA

QUOTE
There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.


Been there, done that.

It has its moments. One key issue to bear in mind is that the test needs to be tailored and optimized for the question at hand. For example: "Do these two files sound different at all?" is thought by many to be a prerequisite for, and a very different question than, "Which of these files do I prefer?"

QUOTE
You'd probably want 100-300 persons per group, and of course expose them still double-blind, but only to one format.


Lotsa luck with finding 300 qualified listeners.

Seems to me you need to do more study of subjective testing technology to date. The study of hearing developed blind testing for a decade or more before modern testing methodologies were popularized for audio in the middle 1970s. The Journal of the Acoustical Society is a good resource, and just to keep you on your toes, they have an ABX test which is substantially different from the one we use in audio, and for good reasons.

UltimateMusicSnob
post Sep 23 2013, 17:55
Post #9





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
If not double blind, then what, sighted?
No, not sighted. Here I was just listing the normal methodological procedures.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
If not direct comparisons, then what, indirect comparisons?
No, as described earlier I'm thinking about tests in which the subjects themselves do not compare side-by-side, they only rate, as is commonly the case in drug tests. The point of comparison occurs during analysis of results, not data collection.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13) *
There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.


Been there, done that.
Excellent, do you have any citations?

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13) *
You'd probably want 100-300 persons per group, and of course expose them still double-blind, but only to one format.
Lotsa luck with finding 300 qualified listeners.
Obviously a tough problem, but that's why I mention the possibility of leveraging the Internet. It *would* be expensive, but conceivably a service like Zoomerang could provide me a sample population.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
Seems to me you need to do more study of subjective testing technology to date.
Yes.....that's why I posted the thread....

This post has been edited by db1989: Sep 23 2013, 19:30
Reason for edit: replacing full quote with inlined replies with a properly formatted version.
saratoga
post Sep 23 2013, 18:24
Post #10





Group: Members
Posts: 5120
Joined: 2-September 02
Member No.: 3264



Actually, in clinical trials, direct comparisons are used whenever feasible. The reason is that if you don't do direct comparisons, your sample sizes need to be enormous. So you could get together huge numbers of listeners and spend months or years doing a test, but it's probably easier to just design a better test that doesn't cost millions of dollars to run :)
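The textbook normal-approximation formulas make the sample-size gap concrete (the effect size and spreads below are purely illustrative, not measured values):

```python
Z = 1.96 + 0.84  # two-sided 5% significance level, 80% power

def n_per_group(sigma, delta):
    """Approximate subjects needed per group for an unpaired
    two-group comparison: n = 2 * (Z * sigma / delta)^2."""
    return 2 * (Z * sigma / delta) ** 2

def n_paired(sigma_d, delta):
    """Same target with a paired design (each subject compares both
    versions): n = (Z * sigma_d / delta)^2, where sigma_d is the
    spread of the within-subject differences."""
    return (Z * sigma_d / delta) ** 2

# A 0.3-point Likert effect, between-subject spread 1.2,
# within-subject spread 0.4:
print(round(n_per_group(1.2, 0.3)))  # -> 251 per group
print(round(n_paired(0.4, 0.3)))     # -> 14 subjects total
```

The unpaired design pays twice over: every subject's idiosyncratic calibration lands in the error term, and each subject contributes to only one group.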
UltimateMusicSnob
post Sep 23 2013, 19:28
Post #11





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (saratoga @ Sep 23 2013, 12:24) *
Actually, in clinical trials, direct comparisons are used whenever feasible. The reason is that if you don't do direct comparisons, your sample sizes need to be enormous. So you could get together huge numbers of listeners and spend months or years doing a test, but it's probably easier to just design a better test that doesn't cost millions of dollars to run :)

Yes, the cost/benefit ratio is probably never going to work out. It's the artificiality of ABX that works against generalizability for the research questions in audio, though. The procedure is rigorous and the data are useful, absolutely. It's just a big departure from how people actually listen.
pdq
post Sep 23 2013, 19:45
Post #12





Group: Members
Posts: 3443
Joined: 1-September 05
From: SE Pennsylvania
Member No.: 24233



QUOTE (UltimateMusicSnob @ Sep 23 2013, 14:28) *
It's just a big departure from how people actually listen.

I'm not sure what you mean by this. True, to get the most sensitivity to small differences one listens to short segments, switching rapidly between them, but there is absolutely no reason that one could not do ABX testing by listening to the entire piece from beginning to end each time. It's all a matter of what you want to accomplish and how much time you are willing to put into it.
saratoga
post Sep 23 2013, 19:48
Post #13





Group: Members
Posts: 5120
Joined: 2-September 02
Member No.: 3264



QUOTE (UltimateMusicSnob @ Sep 23 2013, 14:28) *
It's just a big departure from how people actually listen.


How so?
UltimateMusicSnob
post Sep 23 2013, 20:04
Post #14





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (pdq @ Sep 23 2013, 13:45) *
QUOTE (UltimateMusicSnob @ Sep 23 2013, 14:28) *
It's just a big departure from how people actually listen.

I'm not sure what you mean by this. True, to get the most sensitivity to small differences one listens to short segments, switching rapidly between them, but there is absolutely no reason that one could not do ABX testing by listening to the entire piece from beginning to end each time. It's all a matter of what you want to accomplish and how much time you are willing to put into it.
The main thing is repetition. If the segments are short, that's also unnatural in terms of the length of a listening segment. True, one could listen to entire pieces ("Let's A/B Tosca! See you in a month!") :lol:
But ears acclimate to the current sound environment. ABX depends on listening memory, which is unlikely to be effective for sessions that resemble "normal" listening. I tend to listen to an album all the way through, so my typical realistic session would be in the neighborhood of 30-60 minutes. I could provide Likert responses with some confidence, but compared to an album I heard an hour ago? It doesn't seem feasible. Not because it takes too long, but because aural memory will not function effectively across such long spans. I could be wrong; I'd be interested in published data if anyone has done it.

This post has been edited by UltimateMusicSnob: Sep 23 2013, 20:05
saratoga
post Sep 23 2013, 20:10
Post #15





Group: Members
Posts: 5120
Joined: 2-September 02
Member No.: 3264



QUOTE (UltimateMusicSnob @ Sep 23 2013, 15:04) *
QUOTE (pdq @ Sep 23 2013, 13:45) *
QUOTE (UltimateMusicSnob @ Sep 23 2013, 14:28) *
It's just a big departure from how people actually listen.

I'm not sure what you mean by this. True, to get the most sensitivity to small differences one listens to short segments, switching rapidly between them, but there is absolutely no reason that one could not do ABX testing by listening to the entire piece from beginning to end each time. It's all a matter of what you want to accomplish and how much time you are willing to put into it.


The main thing is repetition.


You don't have to actually repeat the test. You could use all sorts of methodologies where one does an ABX comparison a single time and then uses multiple samples. I suspect you'll find that it's just a more complex way to arrive at the same answer, though.

QUOTE (UltimateMusicSnob @ Sep 23 2013, 15:04) *
I tend to listen to an album all the way through, so my typical realistic session would be in the neighborhood of 30-60 minutes. I could provide Likert responses with some confidence, but compared to an album I heard an hour ago? It doesn't seem feasible. Not because it takes too long, but because aural memory will not function effectively across such long spans. I could be wrong, I'd be interested in published data if anyone has done it.


If all you care about is how things sound compared to your long-term memory, then accuracy is probably not too important. Even relatively large differences will not be apparent over such time periods. Or to put this another way, differences that matter over such long time periods are generally so obvious when A/B'ed that ABX is unnecessary.
pdq
post Sep 23 2013, 20:15
Post #16





Group: Members
Posts: 3443
Joined: 1-September 05
From: SE Pennsylvania
Member No.: 24233



One of the excuses given when someone is unable to back up a claim of audibility using ABX testing is that the usual protocol of switching back and forth rapidly between two versions makes it more difficult, rather than less, to tell the difference, because it is so different from how one usually listens to music. They talk of things like "fatigue factor".

The counter-argument is that if they are able to hear the difference only when listening to much longer segments separated by much longer times, there is no reason ABX cannot be performed in that way. Of course, those people never take up this challenge, or if they do, they are unwilling to report the results. :D
UltimateMusicSnob
post Sep 23 2013, 22:18
Post #17





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (pdq @ Sep 23 2013, 14:15) *
One of the excuses given when someone is unable to back up a claim of audibility using ABX testing is that the usual protocol of switching back and forth rapidly between two versions makes it more difficult, rather than less, to tell the difference, because it is so different from how one usually listens to music. They talk of things like "fatigue factor".

The counter-argument is that if they are able to hear the difference only when listening to much longer segments separated by much longer times, there is no reason ABX cannot be performed in that way. Of course, those people never take up this challenge, or if they do, they are unwilling to report the results. :D

One of the benefits of Likert-scale data is that a researcher who obtained significant results would have more than just "could they tell the difference" to work with.
"How *much* better is A than B?", for example, requires at least ordinal data. Detection is just the first step. Some of the protocols cited above get into this area, which strikes me as useful.
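Since Likert responses are ordinal, a rank-based statistic such as the Mann-Whitney U is arguably a better fit than a t-test on the raw scale values; a minimal stdlib sketch:

```python
def mann_whitney_u(x, y):
    """U statistic for two independent samples of ordinal ratings:
    over all cross-group pairs, count how often an x rating beats a
    y rating, with ties counting half."""
    u = 0.0
    for xi in x:
        for yi in y:
            if xi > yi:
                u += 1.0
            elif xi == yi:
                u += 0.5
    return u

# u / (len(x) * len(y)) estimates P(a rating from x exceeds one from y);
# 0.5 means no systematic difference between the groups.
u = mann_whitney_u([3, 4, 5], [1, 2, 3])  # 8.5 of 9 possible pairs
```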
MichaelW
post Sep 24 2013, 00:32
Post #18





Group: Members
Posts: 631
Joined: 15-March 07
Member No.: 41501



QUOTE (UltimateMusicSnob @ Sep 24 2013, 10:18) *
One of the benefits of Likert-scale data is that a researcher who obtained significant results would have more than just "could they tell the difference" to work with.
"How *much* better is A than B?", for example, requires at least ordinal data. Detection is just the first step. Some of the protocols cited above get into this area, which strikes me as useful.


Perhaps one of the reasons ABX and MUSHRA rule is that more elaborate data would only be useful in areas where preferences among different kinds of imperfections matter. With digital encoding, it is pretty trivial to get an (audibly) perfect representation of the source, using lossless if necessary at only a minor cost in file size. So it is easy to get to a point where preferences could not be relevant. Same with electronics, as I understand it.

So large-scale clinical trials would only be of interest, I think, to makers of loudspeakers and devisers of multichannel systems.
Arnold B. Krueger
post Sep 24 2013, 12:44
Post #19





Group: Members
Posts: 4324
Joined: 29-October 08
From: USA, 48236
Member No.: 61311



QUOTE (UltimateMusicSnob @ Sep 23 2013, 12:55) *
QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
If not double blind, then what, sighted?
No, not sighted. Here I was just listing the normal methodological procedures.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
If not direct comparisons, then what, indirect comparisons?
No, as described earlier I'm thinking about tests in which the subjects themselves do not compare side-by-side, they only rate, as is commonly the case in drug tests. The point of comparison occurs during analysis of results, not data collection.

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13) *
There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.


Been there, done that.
Excellent, do you have any citations?


Bailar, John C. III, Mosteller, Frederick, "Guidelines for Statistical Reporting in Articles for Medical Journals", Annals of Internal Medicine, 108:266-273, (1988).
Buchlein, R., "The Audibility of Frequency Response Irregularities" (1962), reprinted in English in Journal of the Audio Engineering Society, Vol. 29, pp. 126-131 (1981)
Burstein, Herman, "Approximation Formulas for Error Risk and Sample Size in ABX Testing", Journal of the Audio Engineering Society, Vol. 36, p. 879 (1988)
Burstein, Herman, "Transformed Binomial Confidence Limits for Listening Tests", Journal of the Audio Engineering Society, Vol. 37, p. 363 (1989)
Carlstrom, David, Greenhill, Laurence, Krueger, Arnold, "Some Amplifiers Do Sound Different", The Audio Amateur, 3/82, p. 30, 31, also reprinted in Hi-Fi News & Record Review, Link House Magazines, United Kingdom, Dec 1982, p. 37.
CBC Enterprises, "Science and Deception, Parts I-IV", Ideas, October 17, 1982, CBC Transcripts, P. O. Box 500, Station A, Toronto, Ontario, Canada M5W 1E6
Clark, D. L., Krueger, A. B., Muller, B. F., Carlstrom, D., "Lipshitz/Jung Forum", Audio Amateur, Vol. 10 No. 4, pp. 56-57 (Oct 1979)
Clark, D. L., "Is It Live Or Is It Digital? A Listening Workshop", Journal of the Audio Engineering Society, Vol.33 No.9, pp.740-1 (September 1985)
Clark, David L., "A/B/Xing DCC", Audio, APR 01 1992 v 76 n 4, p. 32
Clark, David L., "High-Resolution Subjective Testing Using a Double-Blind Comparator", Journal of the Audio Engineering Society, Vol. 30 No. 5, May 1982, pp. 330-338.
Diamond, George A., Forrester, James S., "Clinical Trials and Statistical Verdicts: Probable Grounds for Appeal", Annals of Internal Medicine, 98:385-394, (1983).
Downs, Hugh, "The High-Fidelity Trap", Modern HI-FI & Stereo Guide, Vol. 2 No. 5, pp. 66-67, Maco Publishing Co., New York (December 1972)
Frick, Robert, "Accepting the Null Hypothesis", Memory and Cognition, Journal of the Psychonomic Society, Inc., 23(1), 132-138, (1995).
Fryer, P.A. "Loudspeaker Distortions: Can We Hear Them?", Hi-Fi News and Record Review, Vol. 22, pp 51-56 (1977 June)
Gabrielsson and Sjogren, "Perceived Sound Quality of Sound Reproducing Systems", Journal of the Acoustical Society of America, Vol. 65, pp. 1019-1033 (1979 April)
Gabrielsson, "Dimension Analyses of Perceived Sound Quality of Sound Reproducing Systems", Scand. J. Psychology, Vol. 20, pp. 159-169 (1979)
Greenhill, Laurence , "Speaker Cables: Can you Hear the Difference?" Stereo Review, ( Aug 1983)
Greenhill, L. L. and Clark, D. L., "Equipment Profile", Audio, (April 1985)
Grusec, Ted, Thibault, Louis, Beaton, Richard, "Sensitive Methodologies for the Subjective Evaluation of High Quality Audio Coding Systems", presented at the Audio Engineering Society UK DSP Conference, 14-15 September 1992; available from Government of Canada Communications Research Center, 3701 Carling Ave., Ottawa, Ontario, Canada K1Y 3Y7.
Hirsch, Julian, "Audio 101: Physical Laws and Subjective Responses", Stereo Review, April 1996
Hudspeth, A. J., and Markin, Vladislav S., "The Ear's Gears: Mechanoelectrical Transduction By Hair Cells", Physics Today, 47:22-8, Feb 1994.
ITU-R BS.1116, "Methods for the Subjective Assessment of Small Impairment in Audio Systems Including Multichannel Sound Systems", Geneva, Switzerland (1994).
Lipshitz, Stanley P., and Vanderkooy, John, "The Great Debate: Subjective Evaluation", Journal of the Audio Engineering Society, Vol. 29 No. 7/8, Jul/Aug 1981, pp. 482-491.
Masters, I. G. and Clark, D. L., "Do All Amplifiers Sound the Same?", Stereo Review, pp. 78-84 (January 1987)
Masters, Ian G. and Clark, D. L., "Do All CD Players Sound the Same?", Stereo Review, pp.50-57 (January 1986)
Masters, Ian G. and Clark, D. L., "The Audibility of Distortion", Stereo Review, pp.72-78 (January 1989)
Meyer, E. Brad, "The Amp-Speaker Interface (Tube vs. solid-state)", Stereo Review, pp.53-56 (June 1991)
Nousaine, Thomas, "Wired Wisdom: The Great Chicago Cable Caper", Sound and Vision, Vol. 11 No. 3 (1995)
Nousaine, Thomas, "Flying Blind: The Case Against Long Term Testing", Audio, pp. 26-30, Vol. 81 No. 3 (March 1997)
Nousaine, Thomas, "Can You Trust Your Ears?", Stereo Review, pp. 53-55, Vol. 62 No. 8 (August 1997)
Olive, Sean E., et al, "The Perception of Resonances at Low Frequencies", Journal of the Audio Engineering Society, Vol. 40, p. 1038 (Dec 1992)
Olive, Sean E., Schuck, Peter L., Ryan, James G., Sally, Sharon L., Bonneville, Marc E., "The Detection Thresholds of Resonances at Low Frequencies", Journal of the Audio Engineering Society, Vol. 45, p. 116-128, (March 1997)
Pease, Bob, "What's All This Splicing Stuff, Anyhow?", Electronic Design, (December 27, 1990) Recent Columns, http://www.national.com/rap/
Pohlmann, Ken C., "6 Top CD Players: Can You Hear the Difference?", Stereo Review, pp.76-84 (December 1988)
Pohlmann, Ken C., "The New CD Players, Can You Hear the Difference?", Stereo Review, pp.60-67 (October 1990)
Schatzoff, Martin, "Design of Experiments in Computer Performance Evaluation", IBM Journal of Research and Development, Vol. 25 No. 6, November 1981
Shanefield, Daniel, "The Great Ego Crunchers: Equalized, Double-Blind Tests", High Fidelity, March 1980, pp. 57-61
Simon, Richard, "Confidence Intervals for Reporting Results of Clinical Trials", Annals of Internal Medicine, 105:429-435, (1986).
Spiegel, D., "A Defense of Switchbox Testing", Boston Audio Society Speaker, Vol. 7 no. 9 (June 1979)
Stallings, William M., "Mind Your p's and Alphas", Educational Researcher, November 1995, pp. 19-20
Toole, Floyd E., "Listening Tests - Turning Opinion Into Fact", Journal of the Audio Engineering Society, Vol. 30, No. 6, June 1982, pp. 431-445.
Toole, Floyd E., "The Subjective Measurements of Loudspeaker Sound Quality & Listener Performance", Journal of the Audio Engineering Society, Vol. 33, pp. 2-32 (1985 Jan/Feb)
Toole, Floyd E., and Olive, Sean E., "The Detection of Reflections in Typical Rooms", Journal of the Audio Engineering Society, Vol. 39, pp. 539-553 (1989 July/Aug)
Toole, Floyd E., and Olive, Sean E., "Hearing is Believing vs. Believing is Hearing: Blind vs. Sighted Tests, and Other Interesting Things", 97th AES Convention (San Francisco, Nov. 10-13, 1994), [3893 (H-5)], 20 pages.
Toole, Floyd E., and Olive, Sean E., "The Modification of Timbre By Resonances: Perception & Measurement", Journal of the Audio Engineering Society, Vol 36, pp. 122-142 (1988 March).
Warren, Richard M., "Auditory Illusions and their Relation to Mechanisms Enhancing Accuracy of Perception", Journal of the Audio Engineering Society, Vol. 31 No. 9 (1983 September).

Acoustical Society of America, Hearing: Its Psychology and Physiology, American Institute of Physics
Andersen, Hans Christian, "The Emperor's New Clothes" Andersen's Fairy Tales, with biographical sketch of Hans Christian Andersen by Thomas W. Handford. Illustrated by True Williams and others., Chicago, Belford, Clarke (1889)
Armitage, Statistical Methods in Medicine, Wiley (1971)
Burington, R., and May, D. Jr., Handbook of Probability and Statistics with Tables, Second Edition, McGraw Hill NY (1970)
Fisher, Ronald Aylmer, Sir, Statistical Methods and Scientific Inference, 3d ed., rev. and enl., New York Hafner Press (1973)
Frazier, Kendrick, ed., Paranormal Borderlands of Science, Prometheus Books (1981)
Grinnell, Frederick, The Scientific Attitude, Boulder, Westview Press (1987)
Hanushek, E., and Jackson, J., Statistical Methods for Social Scientists, Academic Press NY (1977)
Kockelmans, Joseph J., Phenomenology and Physical Science - An Introduction to the Philosophy of Physical Science, Duquesne Press, Pittsburgh PA (1966)
Lakatos, Imre, The Methodology of Scientific Research Programmes, Vol. 1 , Cambridge University Press (1978)
McBurney, Donald H., Collings, Virginia B., Introduction to Sensation/Perception, Prentice Hall, Inc., Englewood Cliffs, NJ 07632 (1977)
Moore, Brian C. J., An Introduction to the Psychology of Hearing, 3rd Edition , Academic Press, London ; New York (1989)
Mosteller and Tukey, "Quantitative Methods", chapter in Handbook of Social Psychology, Lindzey G., and Aronson, Eds., Addison-Wesley (1964)
Neave, H. R., Statistical Tables, Allen & Unwin, London (1978)
Norman, Geoffrey R., PDQ Statistics, B. C. Decker, Toronto; C. V. Mosby, St. Louis (1986)
Rock, Irwin, An Introduction to Perception, Macmillan Publishing Company, New York NY (1975)
Scharf, Bertram, and Reynolds, George S., Experimental Sensory Psychology, Scott, Foresman and Company, Glenview IL (1975)

UltimateMusicSno...
post Sep 24 2013, 21:12
Post #20





Group: Members
Posts: 55
Joined: 17-September 13
Member No.: 110129



QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13) *
There could be a role, for example, in tests of the form usually seen in medicine. Instead of taking one person and exposing them to both stimuli as a single data point, take a lot of persons, put them in different groups, and then collect Likert-scale data on their response to one class of stimuli.


Been there, done that.
QUOTE (UltimateMusicSnob @ Sep 20 2013, 09:13) *
Excellent, do you have any citations?

QUOTE (Arnold B. Krueger @ Sep 23 2013, 07:29) *
Bailar, John C. III, Mosteller, Frederick, "Guidelines for Statistical Reporting in Articles for Medical Journals", Annals of Internal Medicine, 108:266-273, (1988).
Buchlein, R., "The Audibility of Frequency Response Irregularities" (1962), reprinted in English in Journal of the Audio Engineering Society, Vol. 29, pp. 126-131 (1981)
Burstein, Herman, "Approximation Formulas for Error Risk and Sample Size in ABX Testing", Journal of the Audio Engineering Society, Vol. 36, p. 879 (1988)
Burstein, Herman, "Transformed Binomial Confidence Limits for Listening Tests", Journal of the Audio Engineering Society, Vol. 37, p. 363 (1989)

etc
etc
etc

Actually, this report would have been just fine, but thanks for the full list anyway!
esldude
post Nov 17 2013, 02:44
Post #21





Group: Members
Posts: 8
Joined: 6-November 03
Member No.: 9687



One that makes sense to me is a variation used in the food industry: two-alternative forced choice (2AFC). Also, generally you have a reliably perceived difference if the testee scores 75% correct choices, regardless of the number of trials.

You present two choices, A and B. The testee must choose one. The parameter in the food industry is something like: choose the sweeter of the pair. In audio you could ask a person to choose the sample with the most bass, or simply the version you prefer, or the one that sounds most real.

I think it is nicer than ABX, as it is closer to how people listen for differences when not doing blind tests. They listen to a couple of things and pick the one they prefer. Also, you are not straining to hear whether something is different or whether it matches some reference. You know for certain that the two tracks presented are in fact different. You just pick the one you prefer, or the one with whatever quality is being tested for. So you hear two versions, know they are different, and pick a preference. Of course, which version is presented first varies randomly. If you prefer the same version 75% or more of the time, then it is audible.
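For reference, the guessing probabilities behind that 75% figure are easy to check with a one-sided exact binomial test. A quick sketch in Python (the function name is mine):

```python
from math import comb

def binomial_p_value(correct, trials, p_chance=0.5):
    """One-sided exact binomial test: the probability of getting
    at least `correct` out of `trials` right by guessing alone."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(correct, trials + 1))

# 75% correct on 8 trials (6 of 8) could happen by chance about 14.5% of the time,
print(binomial_p_value(6, 8))    # ~0.145
# while 75% on 40 trials (30 of 40) is very unlikely to be pure luck.
print(binomial_p_value(30, 40))  # ~0.0011
```

So a fixed 75% criterion only becomes meaningful once the number of trials is reasonably large.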
saratoga
post Nov 17 2013, 03:00
Post #22





Group: Members
Posts: 5120
Joined: 2-September 02
Member No.: 3264



QUOTE (esldude @ Nov 16 2013, 21:44) *
You present two choices, A and B. The testee must choose one. The parameter in the food industry is something like: choose the sweeter of the pair. In audio you could ask a person to choose the sample with the most bass, or simply the version you prefer, or the one that sounds most real.

I think it is nicer than ABX


The two are not really comparable because they test two different things. ABX is a test of transparency, hence you must have the reference. What you're describing above is a test of preference, hence no reference is necessary.

Basically, you're asking two different questions, which will not in general give you the same answer. Neither is nicer; it just depends on what you want to know.

greynol
post Nov 17 2013, 03:24
Post #23





Group: Super Moderator
Posts: 10257
Joined: 1-April 04
From: San Francisco
Member No.: 13167



QUOTE (esldude @ Nov 16 2013, 17:44) *
If you prefer the same version 75% or more of the time, then it is audible.

If I guess right 75% of the time on a set of 8 coin flips, does that make me clairvoyant?

I did it yesterday, so I guess I am! Thanks for confirming my suspicion.

Seriously though, were you not satisfied with the answers you got when you raised this point nearly ten years ago?
http://www.hydrogenaudio.org/forums/index....st&p=163040

FWIW, I've done these types of tests and think they are much more fun than MUSHRA, but they were given in order to establish a preference between two things without the presence of a reference.

This post has been edited by greynol: Nov 17 2013, 06:08


--------------------
Your eyes cannot hear.
Arnold B. Kruege...
post Nov 18 2013, 04:28
Post #24





Group: Members
Posts: 4324
Joined: 29-October 08
From: USA, 48236
Member No.: 61311



QUOTE (UltimateMusicSnob @ Sep 23 2013, 14:04) *
But ears acclimate to the current sound environment.


So let people acclimate to the current sound environment before you start the test!

QUOTE
ABX depends on listening memory


Good case of demonizing ABX for a property of all comparative listening evaluations.

Here's your challenge - how do you do comparative listening without depending on listening memory?

QUOTE
-unlikely to be effective for listening sessions which resembled "normal" listening.


Here's a news flash - normal listening is horrifically unreliable once you figure out how to determine how reliable it actually is.

We live in a world where self-deceit is very common. People do sighted listening evaluations and they think they hear all sorts of things. But there is no way to know how reliable sighted evaluations by themselves really are, since sighted listening involves senses other than hearing, which easily substitute their influence for listening alone.

QUOTE
I tend to listen to an album all the way through, so my typical realistic session would be in the neighborhood of 30-60 minutes. I could provide Likert responses with some confidence,


You may have confidence but silly boy that I am, I notice that you know which alternative you are listening to by other means than listening, and I rightfully cry foul!


The history of ABX is that first we started doing blind tests, and we quickly encountered the problems with memory for small differences. We then devised ABX to maximize the sensitivity of our blind testing. The reason why we never encountered these problems before is that our listening tests weren't just listening tests, they were also tests that involved knowing what we listened to by other means than listening. Guess what, tests are harder if the right answers aren't posted on the blackboard during the test!

QUOTE
but compared to an album I heard an hour ago? It doesn't seem feasible. Not because it takes too long, but because aural memory will not function effectively across such long spans.


There you go! The most sensitive form of aural memory is all over with in about 2 seconds. There is actually a cascade of different kinds of aural memory, but they last for different amounts of time. Generally they become less sensitive to small differences the longer the amount of time involved.

QUOTE
I could be wrong, I'd be interested in published data if anyone has done it.


The best book I've found that describes how we remember sounds is "This Is Your Brain on Music" by Daniel Levitin. It is full of citations of proper scientific research. It is readily available for about $15.
Arnold B. Kruege...
post Nov 18 2013, 04:31
Post #25





Group: Members
Posts: 4324
Joined: 29-October 08
From: USA, 48236
Member No.: 61311



QUOTE (esldude @ Nov 16 2013, 20:44) *
One that makes sense to me is a variation used in the food industry: two-alternative forced choice (2AFC). Also, generally you have a reliably perceived difference if the testee scores 75% correct choices, regardless of the number of trials.

You present two choices, A and B. The testee must choose one. The parameter in the food industry is something like: choose the sweeter of the pair. In audio you could ask a person to choose the sample with the most bass, or simply the version you prefer, or the one that sounds most real.

I think it is nicer than ABX, as it is closer to how people listen for differences when not doing blind tests. They listen to a couple of things and pick the one they prefer. Also, you are not straining to hear whether something is different or whether it matches some reference. You know for certain that the two tracks presented are in fact different. You just pick the one you prefer, or the one with whatever quality is being tested for. So you hear two versions, know they are different, and pick a preference. Of course, which version is presented first varies randomly. If you prefer the same version 75% or more of the time, then it is audible.


Proving once again that tests are a lot more fun if they aren't real tests. One of the characteristics of a real test is that it must provide a means for people to fail. Sorry about that!

You seem to be sort of dancing around tests that are more like MUSHRA or ABC/HR. They aren't preference tests, but they are more like preference tests than ABX.

It's really about the right tool for the job. ABX seems to still be king if you want to know if there is an audible difference, but it is horrible for preference testing.

If you want to look at proper blind preference testing, please see http://www.sensorysociety.org/ssp/wiki/Category:Methodology/ The Triangle test is pretty close to ABX.

This post has been edited by Arnold B. Krueger: Nov 18 2013, 04:40
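For the curious, the pass/fail thresholds in published triangle-test tables (where the chance of guessing the odd sample is 1/3) can be reproduced with a short exact-binomial sketch. Python, with function names of my own choosing:

```python
from math import comb

def tail_prob(correct, trials, p_chance):
    """Probability of at least `correct` successes in `trials` by guessing."""
    return sum(comb(trials, k) * p_chance**k * (1 - p_chance)**(trials - k)
               for k in range(correct, trials + 1))

def min_correct(trials, alpha=0.05, p_chance=1/3):
    """Smallest number of correct identifications for which the
    chance-guessing probability falls to `alpha` or below."""
    for k in range(trials + 1):
        if tail_prob(k, trials, p_chance) <= alpha:
            return k

# With 12 triangle-test trials, 8 correct answers are needed at the 5% level;
# the same 12 trials in an ABX test (chance = 1/2) need 10 correct.
print(min_correct(12))                # 8
print(min_correct(12, p_chance=0.5))  # 10
```

The 1/3 guessing probability is what makes the triangle test a bit more statistically efficient per trial than ABX.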
