Interpretation of results of blind tests, When it's possible to say if a codec is transparent or has good quality

IgorC (Post #1, Nov 28 2008, 02:31)

I'm curious about how to correctly interpret the results of blind tests.

For example, the average scores of two competitors are A = 4.5 and B = 4.7.
First of all, statistically speaking, they are tied. But let's suppose there were enough samples to discard this statistical interval.
I know this is not so much about mathematics as about a practical approach, whose philosophy is quite different, but here are my thoughts:

Mathematical interpretation
On one hand, A: 4.5/5 = 90% of "quality", and B: 4.7/5 = 94%.
1st type of interpretation: 94/90 = 1.0444...
Codec B is better than A by 4.444...%.

2nd type of interpretation: codec A has 90% of quality (full transparency being 100%), so it has 10% of perceptible difference (artifacts), while codec B has 94% of quality and 6% of perceptible difference.
Then Artifacts(A)/Artifacts(B) = 10%/6% = 1.666...
Codec B is better than A by a factor of 1.666...
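
If it helps, here is the same arithmetic as a small Python sketch (the 5.0 scale maximum and the two scores are just the example values from above):

CODE
# Two ways to compare the average scores A = 4.5 and B = 4.7 on a 5-point scale.
MAX_SCORE = 5.0
a, b = 4.5, 4.7

# 1st interpretation: ratio of the "quality" fractions.
quality_ratio = (b / MAX_SCORE) / (a / MAX_SCORE)       # 0.94 / 0.90
print(f"B better than A by {100 * (quality_ratio - 1):.3f}%")        # ~4.444%

# 2nd interpretation: ratio of the remaining "artifact" fractions.
artifacts_a = 1 - a / MAX_SCORE                         # 0.10
artifacts_b = 1 - b / MAX_SCORE                         # 0.06
print(f"A has {artifacts_a / artifacts_b:.3f}x the artifacts of B")  # ~1.667x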

In my opinion, the 2nd type of interpretation more closely reflects the real listening experience, as people mostly remember artifacts and don't notice when a codec does a good job.

Psychological interpretation
When can I say the quality is transparent? At 5.0? Or lower: 4.9, 4.8, ...?
I can only speak for myself here. Here is how I would grade one particular sample:
1. 5.0 - indistinguishable from the original.
2. 4.9 - I doubt I will ever hear the difference again, or that I actually heard it at all.
3. 4.8 - Well, I hear the difference, but it was very hard and required a lot of concentration.
4. 4.7 - The sample isn't transparent at all, but the quality is high. This is the point where I say "not transparent".

So a codec is transparent (or extremely close to transparent) for me if the average score is at least 4.8 (psychological interpretation).

Meanwhile, for the mathematical (and/or statistical) interpretation, 95% is a good approximation, so 5 × 95% = 4.75 is the minimal score for nominal transparency.
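
And a toy sketch of those two cutoffs (the 4.8 and 4.75 thresholds are only my personal choices from above, not standard values):

CODE
def transparency_verdict(avg_score, max_score=5.0):
    """Classify an average score by the two cutoffs discussed above."""
    psychological = avg_score >= 4.8                  # my personal listening threshold
    mathematical = avg_score >= 0.95 * max_score      # 95% of the scale = 4.75
    return psychological, mathematical

print(transparency_verdict(4.7))    # (False, False): high quality, not transparent
print(transparency_verdict(4.77))   # (False, True): the two cutoffs can disagree
print(transparency_verdict(4.8))    # (True, True): transparent by both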

Any comments and thoughts about your personal experience are welcome.

If this has already been discussed, please post the link.

MichaelW (Post #2, Nov 28 2008, 05:32)

I've got no practical experience with ABC/HR, nor am I competent in statistics, but I spent a working life making subjective evaluations (and comparing results and worrying about their validity).

From that, I'd wonder if we can really read this scale in a simple linear fashion. I would be surprised if people could, reliably and accurately, use more than a 10-point scale on a task like this. Perhaps even the 5 integer points are the only ones that really count, for any individual making a judgment.

Further, are our perceptions of "quality" based on a simple arithmetic scale of number of artifacts, or whatever technical measure of goodness of compression might be appropriate? Most of life seems to be logarithmic.

I, therefore, tend to read a score of 4.5 as meaning "Half the time, the testers couldn't tell this from the reference," and 4.1 as meaning "Didn't annoy people, and transparent for a few." This, too, isn't quite right, as it ignores variability of scores, but it seems a bit closer to what the tests mean.
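
For instance (illustrative numbers of my own, not from any real test), two very different rating distributions can share the same mean:

CODE
from statistics import mean

# Half the listeners found the sample transparent, half heard a clear difference...
split = [5.0] * 10 + [4.0] * 10
# ...versus every listener agreeing on a mild, uniform degradation.
uniform = [4.5] * 20

print(mean(split), mean(uniform))   # both 4.5, yet the experiences differ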

I'm really grateful to the testers, and especially Sebastian who has organised a lot of tests. I'm conscious that the results are being read in two ways. Some people take them as a guide to usage (as, for instance, in the latest case, we can say that a number of modern MP3 encoders are very good indeed at 128 kbps; if you have specific, critical needs, you need to do your own tests). Others are interested in absolute rankings of encoders, a kind of MP3 Olympics. I doubt if ABC/HR scores can really support such an order of merit.

Once more, thanks to everybody who does this stuff, and big ups to Sebastian.

Canar (Post #3, Nov 28 2008, 07:40)

Another possibility for interpreting the values is to consider them as a total order. I'm not enough of a math nerd to know whether that would have any significant statistical repercussions, but perhaps it would.

MichaelW (Post #4, Nov 28 2008, 09:20)

Now that I'm retired I'm trying to teach myself high-school maths, so what do I know?

But I suspect that the scores in tests like this might not be in a transitive relationship, if that's the right way of putting it.

Just because a > b and b > c, it doesn't necessarily follow that a > c (where > is to be read as "is preferred to").
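
A tiny sketch of what a failure of transitivity would look like (the preference pairs are hypothetical, not from any test):

CODE
from itertools import permutations

def is_transitive(prefers):
    """Check a set of (winner, loser) pairs for transitivity."""
    items = {x for pair in prefers for x in pair}
    for x, y, z in permutations(items, 3):
        if (x, y) in prefers and (y, z) in prefers and (x, z) not in prefers:
            return False
    return True

# A preference cycle: a beats b, b beats c, but c beats a. No ranking fits it.
print(is_transitive({("a", "b"), ("b", "c"), ("c", "a")}))   # False
print(is_transitive({("a", "b"), ("b", "c"), ("a", "c")}))   # True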

Better stop now, before I get totally out of my depth (before??).

Canar (Post #5, Nov 28 2008, 09:24)

I see where you're coming from. I'm just hypothesizing that if we can assert that the codec ratings are in a total order, we can manipulate them mathematically with more validity.

I don't even know if there is any validity to this at all.

muaddib (Post #6, Nov 28 2008, 11:44)

Are you looking for an interpretation of a private or a public listening test?

QUOTE (IgorC @ Nov 28 2008, 02:31):
First of all, statistically speaking, they are tied. But let's suppose there were enough samples to discard this statistical interval.

You cannot discard intervals. It may simply happen that, with enough listeners (or repetitions by one person on different days), the intervals get so small that they no longer overlap.
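
A rough sketch of that effect (made-up scores and a plain normal approximation, not the ANOVA machinery a real test report uses):

CODE
from math import sqrt
from statistics import mean, stdev

def ci95(scores):
    """Approximate 95% confidence interval for the mean."""
    m = mean(scores)
    se = stdev(scores) / sqrt(len(scores))
    return m - 1.96 * se, m + 1.96 * se

few = [4.5, 4.9, 4.3, 4.8, 4.6]   # a handful of ratings: wide interval
many = few * 20                   # same spread but 100 ratings: much narrower
# (repeating ratings like this ignores listener correlation; it only shows the scaling)
print(ci95(few))    # roughly (4.41, 4.83)
print(ci95(many))   # roughly (4.58, 4.66)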

QUOTE (IgorC @ Nov 28 2008, 02:31):
1. 5.0 - indistinguishable from the original.
2. 4.9 - I doubt I will ever hear the difference again, or that I actually heard it at all.
3. 4.8 - Well, I hear the difference, but it was very hard and required a lot of concentration.
4. 4.7 - The sample isn't transparent at all, but the quality is high. This is the point where I say "not transparent".

It would be good to include this in
http://www.hydrogenaudio.org/forums/index....c=67547&hl=
There is also a nice recommendation for ABC/HR grades in that thread.

Alexxander (Post #7, Nov 28 2008, 12:28)

What you're trying to do, IgorC, is draw solid conclusions from cold numbers. This is only possible if all the variables are under control and everybody rates the same way, using the same rating system.

A blind test like ABC/HR only tells you how the participants rated the samples, and almost nothing about whether codec A is more transparent or has more artifacts than codec B. Each individual has his own hearing and way of rating (and both vary over time). The end conclusion depends completely on the selection of participants (I'm leaving aside controllable parameters like the listening environment and the tools used).

So, even ignoring error margins (which are always there!), I cannot agree with your suggested mathematical interpretations, as the cold ABX results mean very little. Personally, I would avoid calculating mathematical relationships between results.

muaddib (Post #8, Nov 28 2008, 12:57)

QUOTE (Alexxander @ Nov 28 2008, 12:28):
The end conclusion depends completely on the selection of participants (I'm leaving aside controllable parameters like the listening environment and the tools used).

It also depends on when each participant gave their grade, because the grade for the same sample from the same participant may vary a LOT between two trials. Even the order of the encoder ratings can differ. There is evidence of this in the results of the public listening tests conducted so far, where the low anchor (or was it the high anchor?) was the same across different tests. Sorry, I don't have enough time to search for the example.