
Should HA promote a more rigorous listening test protocol?, was: "HA -- guilty as charged?" (TOS #6)
post Nov 23 2012, 19:01
Post #1

Group: Members
Posts: 2811
Joined: 18-December 03
Member No.: 10538

I was taken aback today to read this exchange on gearslutz, from earlier this year:

QUOTE ("Bob Ohlsson")
It's important to understand that what JJ considers a listening test and what the ABX/Hydrogen Audio skeptics crowd considers a listening test are two very very different things.

QUOTE ("Kees de Visser")
Perhaps JJ can explain what he considers a listening test and how it's different from the Hydrogenaudio standpoint.
I was somehow under the impression they were not that different.

QUOTE ("j_j")
Including positive and negative controls, lots of training for the test as well as familiarity with the equipment and music, and equipment validation are the biggies.

Test evaluation might be an issue, too. Many tests, including some of the MPEG tests and 1116 make assumptions that the entire population reacts the same to impairments. While basic masking is universal, what people dislike when they can hear something is NOT universal.


Now, I agree with Kees -- I don't think the HA community 'take' on listening tests is that different from what JJ mentions. Few here, I suspect, would dismiss the real utility of training, or of positive controls, or familiarity etc., in making a listening test maximally sensitive. (As for the rest, I confess I'm not really clear whether JJ's criticism of test evaluation is directed at HA.)

What I think is happening is a difference in what listening tests are used for. Most individual HA reports of ABX tests are from users wanting to know if file X sounds different from file Y to them, as they are now, using the equipment they have, not as they would be after training to hear artifacts, on the most revealing equipment. They aren't doing basic research into a difference's audibility, as JJ did, for example, when developing lossy codecs. For that purpose, trained listeners, positive & negative controls, familiarity and 'validated' equipment are necessities.

Still, HA *does* host mass listening tests from time to time -- which are more akin to 'basic research' -- and its few 'official' guidelines on setting up listening tests -- the HA wiki, and Pio's sticky threads -- make no mention of training, +/- controls, etc. as factors in such tests.

Time to change this?

This post has been edited by krabapple: Nov 23 2012, 19:05
post Nov 28 2012, 16:51
Post #2

Group: Super Moderator
Posts: 11377
Joined: 1-April 04
From: Northern California
Member No.: 13167

If the contenders are statistically tied, changing the anchors isn't going to magically untie them. Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.

Unlike ABX, where you rely on continued trials to demonstrate that you can consistently distinguish between two things, MUSHRA tests rely on many samples and well-chosen controls to help weed out bad data. When working with contenders that are near-transparent, a hidden reference makes sense; otherwise it is a poor control that is too easy to identify. The same goes for low anchors if they are too low.
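
To make the ABX half of that comparison concrete: the reason repeated trials matter is that the chance of reaching a given score by pure guessing shrinks as the trial count grows. Below is a minimal sketch of that calculation using the binomial distribution. The function name `abx_p_value` and the 12-of-16 example are made up for illustration and are not taken from any HA tool or guideline.

```python
from math import comb


def abx_p_value(correct: int, trials: int) -> float:
    """Probability of scoring at least `correct` out of `trials` by guessing.

    Assumes each trial is an independent coin flip (p = 0.5), which is the
    usual null hypothesis for an ABX run.
    """
    if not 0 <= correct <= trials:
        raise ValueError("correct must be between 0 and trials")
    # Sum the upper tail of the binomial distribution.
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials


if __name__ == "__main__":
    # Hypothetical run: 12 correct answers out of 16 trials.
    print(f"p = {abx_p_value(12, 16):.4f}")
```

A small p-value only says that guessing is an unlikely explanation for that one listener on that one sample; it says nothing about how the difference would be rated, which is what the MUSHRA-style tests are trying to get at.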

When the anchors are too close, low anchors may get ranked better than the contenders. High anchors may get ranked worse. This is not exactly unreasonable. What needs to be taken seriously is that judging is subjective; not everyone ranks different artifacts the same way. It could be that the low anchors actually do sound better or the high anchor actually does sound worse. It is also not unreasonable to get differing rankings between all stimuli based on the specific clips being auditioned. What may be unreasonable is to dismiss discrepancies like these from the "expected" results as "wrong".

With this in mind, I only take seriously the clear trends in very large tests (many participants and many worthwhile, typical real-life sample clips). I somewhat reject the idea that all participants must be trained when there are large numbers of them, however. While the testers should be able to distinguish and categorize artifacts, they should not be steered into thinking one kind is less desirable than another.

Lastly, all too often people treat the results of small tests posted here as definitive. They really aren't.

This post has been edited by greynol: Nov 28 2012, 18:34

Your eyes cannot hear.
post Nov 28 2012, 20:08
Post #3

Group: Members
Posts: 835
Joined: 17-September 06
Member No.: 35307

QUOTE (greynol @ Nov 28 2012, 15:51)
If the contenders are statistically tied, changing the anchors isn't going to magically untie them. Also, having only a few listeners and a few samples doesn't make for very compelling results, especially when the listeners are untrained.

My point was echoing David's about the potential to compress the range of ratings given to codecs vastly superior to the low anchor in order to score the low anchor sufficiently low. This may introduce more rounding error into the ratings and widen the error bars. No magical effect, just a reduction in statistical noise that might improve discrimination at the margin (or at least should make it no worse).
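
A toy illustration of the rounding concern, under loudly stated assumptions: a listener maps some "true" quality linearly onto a 1-5 scale with the low anchor pinned to the bottom, and ratings snap to a fixed step. The linear mapping, the 0.5 step and the quality figures are all invented for this sketch; real rating sliders and real listeners are not this tidy.

```python
def rate(quality: float, worst: float, best: float = 1.0,
         scale_min: float = 1.0, scale_max: float = 5.0,
         step: float = 0.5) -> float:
    """Map a hypothetical 'true' quality onto the rating scale.

    `worst` is the quality of the low anchor, which the listener pins to
    `scale_min`; the result is rounded to the nearest rating step.
    """
    fraction = (quality - worst) / (best - worst)
    raw = scale_min + fraction * (scale_max - scale_min)
    return round(raw / step) * step


if __name__ == "__main__":
    codec_a, codec_b = 0.90, 0.93  # made-up quality values
    # A very low anchor squeezes both codecs into the top of the scale.
    print("far anchor:  ", rate(codec_a, worst=0.0), rate(codec_b, worst=0.0))
    # A closer anchor spreads them over more of the scale.
    print("close anchor:", rate(codec_a, worst=0.8), rate(codec_b, worst=0.8))
```

In this contrived case the two codecs round to the same rating with the far anchor and to different ratings with the close one. That is the whole of the effect being suggested: less quantisation loss, nothing more.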

I was suggesting that before the main test (which still has a lot of testers and a lot of samples), appropriately close anchors could be chosen by a short test on only a few samples which rules out anchors that are vastly superior or vastly inferior to the codecs under test. I don't think Woodinville believes it is essential that the anchors be outside the range of the codecs under test (i.e. consistently lower and higher); they could instead sit fairly consistently towards the low end and fairly consistently towards the high end, assuming we used two anchors.

I think Woodinville mentioned the trickiest thing to get right. If we presume that the nature of low-pass filter quality degradation is too different to the nature of typical codec flaws (warbling, sparklies, tonal problems, transient smear, pre-echo, stereo image problems etc.) then we'd be looking for anchors instead among other encoders and settings not under test, or from consistent distortions of a similar nature. For example, we might choose a prior generation codec, even at a slightly higher bitrate as a lowish anchor. Maybe Lame -V7, for example, or l3enc at 160 kbps -hq or 192 kbps, or toolame at 128 kbps, perhaps, or FhG fastenc MP3 at a setting with Intensity Stereo rather than safe joint stereo. Perhaps a high anchor could be a previous test-winner at a slightly higher bitrate where some flaws are still evident (so that it still acts as a Positive Control - i.e. distinguishable from the original audio). There are certain encoders so badly flawed that some testers will immediately identify them, so I suppose Xing (old) with no short blocks or BLADEenc would not be good choices.

It also partly depends on the intention we have in using these close anchors. If it's to compare one listening test quality scale to another, yet to avoid simple low-pass filters, we might wish to use a consistent set of anchors (same codec version and settings) over a number of years, even if one is a high anchor in one test and a low anchor in the next. This can be especially helpful if at least some of the test samples feature in every listening test.

Another potential use of the anchors would be to calibrate and normalize the quality scales used by different listeners, though the validity of this is questionable as some people find pre-echo more annoying than tonal problems, or find stereo collapse less objectionable than high-frequency sparklies for example, while others have the reverse preferences. The preferences here are part of the reason that results can be intransitive.
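
For what the calibration idea might look like in practice, here is a minimal sketch that linearly rescales each listener's ratings so that their own low anchor lands on 0 and their own high anchor on 1. The function name, the dictionary keys and the scores are hypothetical, and the caveat above still applies: a linear rescale cannot account for listeners who simply dislike different artifacts.

```python
from typing import Dict


def normalize(ratings: Dict[str, float], low_key: str, high_key: str) -> Dict[str, float]:
    """Rescale one listener's ratings so low anchor -> 0.0 and high anchor -> 1.0."""
    low, high = ratings[low_key], ratings[high_key]
    if high == low:
        raise ValueError("anchors received the same rating; cannot rescale")
    return {name: (score - low) / (high - low) for name, score in ratings.items()}


if __name__ == "__main__":
    # Two hypothetical listeners who use the 1-5 scale differently.
    listener_a = {"low_anchor": 1.0, "codec_x": 3.0, "high_anchor": 5.0}
    listener_b = {"low_anchor": 2.0, "codec_x": 3.0, "high_anchor": 4.0}
    for ratings in (listener_a, listener_b):
        print(normalize(ratings, "low_anchor", "high_anchor"))
```

Note that a listener who rates the "low" anchor above the "high" one gets a negative scale factor here, which is one symptom of the intransitivity problem rather than something the rescale can fix.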

Once or twice, anchors have also been used to address a common claim or myth (e.g. that WMA at 64 kbps is as good as MP3 at 128 kbps). For guruboolez, some 80-96 kbps tests used lame at about 128 kbps as an anchor to assess where the truth lay at the time to his ears, for example.

I would say, however, that I think the methods of all the recent public tests are pretty darned good and provide useful information about the state of the art at the time.

These discussions might enable some more nuanced conclusions to be drawn and some comparison between the results of one test and the results of another where the same anchor on the same samples has a different rating. However, given the statistical error, there are still limits on what we can conclude.

We need to weigh up whether we'll gain enough by changing methods to be worth the additional effort. That might be an individual matter for the test organiser to choose, given how much valuable work they put in already and how they weigh up the number of codecs under test against other parameters.

