Topic: SoundExpert explained

SoundExpert explained

Reply #25
I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function - an S-curve, generated by integrating a Gaussian distribution...


Did you have a footnote in mind to go with that asterisk?

Anyway, the sigmoid curve need not be the Gaussian cumulative distribution, though that is one common choice; another is the logistic distribution (more: http://en.wikipedia.org/wiki/Link_function#Link_function ). I'd guess a "sigmoid" in this context would mean any positive, smooth, strictly increasing, convex-then-concave function symmetric about (0, 1/2), i.e., one corresponding to a unimodal symmetric distribution, absolutely continuous and with full support.

Each of these choices constitutes a parametric model, meaning that you assume a certain (parametric) family of functions will be a good fit to reality. You then fit the parameters to find the best fit within the family. If a model fits well from level A to level B (where all your observations are), it is common practice to infer that it should perform acceptably at least from somewhere below A to somewhere above B as well. How far you can extrapolate does, of course, depend on circumstances.
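To make the "fit the parameters" step concrete, here is a minimal sketch (made-up trial counts, a logit link, plain Matlab/Octave; none of it is taken from the papers discussed here):
Code: [Select]
% Toy maximum-likelihood fit of a logistic psychometric function.
x = [-12 -9 -6 -3 0 3];                   % impairment level, e.g. diff-signal gain in dB
k = [  2  3  5  9 13 14];                 % detections observed at each level
n = 15*ones(size(x));                     % trials per level
p   = @(t) 1./(1 + exp(-(x - t(1))/t(2)));                % t(1) = 50% point, t(2) = spread
nll = @(t) -sum(k.*log(p(t)) + (n - k).*log(1 - p(t)));   % binomial negative log-likelihood
t = fminsearch(nll, [-3 3])               % best fit within the family; extrapolate with care

(Matlab code)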


Now I think -- though this is outside my field of expertise -- that the choice of link function is more crucial for wider extrapolations. Again, this depends a bit on circumstances; for example, in an ABX listening test the interesting issue is whether you guess better than 50%, while in diagnosis of rare diseases -- or default of sovereign bonds -- you are already in the tail of the distribution.

SoundExpert explained

Reply #26
I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function - an S-curve, generated by integrating a Gaussian distribution...


Did you have a footnote in mind to go with that asterisk?
Yes! Must have deleted it by mistake...

King-Smith, P. E., & Rose, D. (1997). Principles of an adaptive method for measuring the slope of the psychometric function. Vision Research, 37(12), 1595-1604. [PubMed]

...though I've lost my copy of the article - I cited it a decade ago so I must have thought it made sense back then.

Quote
Anyway, the sigmoid curve need not be the Gaussian cumulative distribution, though that is one common choice; another is the logistic distribution (more: http://en.wikipedia.org/wiki/Link_function#Link_function ). I'd guess a "sigmoid" in this context would mean any positive, smooth, strictly increasing, convex-then-concave function symmetric about (0, 1/2), i.e., one corresponding to a unimodal symmetric distribution, absolutely continuous and with full support.

Each of these choices constitutes a parametric model, meaning that you assume a certain (parametric) family of functions will be a good fit to reality. You then fit the parameters to find the best fit within the family. If a model fits well from level A to level B (where all your observations are), it is common practice to infer that it should perform acceptably at least from somewhere below A to somewhere above B as well. How far you can extrapolate does, of course, depend on circumstances.

Now I think -- though this is outside my field of expertise -- that the choice of link function is more crucial for wider extrapolations.
Yep, I agree with all of that. I think the "slightly" different curve shapes don't matter as much as you might expect in practice here since the psychometric data is likely to be rather rough anyway. If you get data on the steep part of the curve, you can probably do quite well even if you're not sure of the shape. If you get data on the shallow part of the curve, you're in more trouble if you don't know the exact shape, but you were already way off anyway.


Measured psychoacoustic thresholds are often 70% (because the procedure is often the one that I quoted on the previous page).

The 50% point on the S-curve doesn't correspond to the "getting better than 50% means it's not just chance" in ABX. In ABX, if you can't hear a thing, you'll (on average) score 50%. That's way off to the left on the S-curve. I guess that 50% on the S-curve gives a 75% score on ABX (???), which can give a very low p (depending on the number of trials).
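(For the 75% guess: under the simple "detect or guess" model, the expected ABX score is p + (1-p)/2, where p is the probability of actually hearing the difference on a given trial; p = 0.5 gives 0.75. A one-liner to check, just as a sketch:)
Code: [Select]
p_detect = 0.5;                         % the 50% point on the S-curve
p_abx = p_detect + (1 - p_detect)*0.5   % expected ABX score = 0.75 under detect-or-guess

(Matlab code)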

Cheers,
David.

SoundExpert explained

Reply #27
[Heavily edited]

The 50% point on the S-curve doesn't correspond to the "getting better than 50% means it's not just chance" in ABX. In ABX, if you can't hear a thing, you'll (on average) score 50%.


Well, since I have access to that reference of yours, I looked it up. A brief review after only one reading:

- They use a logistic model which runs not from 0 to 1, but from the false-positive rate to 1 minus the false-negative rate. The detection threshold is set at a 50% chance of being detected. This is not the same as coinflipping (they use a "Zippy Estimation by Sequential Testing" method, referencing one of King-Smith's earlier works).

- Experiment: one "more detectable" (stronger light in their experiment, could be "more distorted" in ours) signal A and one "less detectable" signal B are displayed, order randomized (the subject knows it is either AB or BA, with 50/50 chance). The subject is asked to identify the order. The difference between A and B in their case is a difference in log luminosity, i.e., they have one explanatory variable. In assessing psychoacoustic lossy encoding you are rather interested in how to minimize the audibility of the reduction down to a given file size, but that is another issue: here we assume that job is done.



As for your "75%", it may -- or may not, I have not checked the ZEST reference of theirs -- refer to the area under the ROC curve vs. the Gini coefficient. AUROC measures Pr[subject's stated ordering matches true ordering]; the Gini coefficient measures Pr[subject's stated ordering matches true ordering] - Pr[subject's stated ordering does not match true ordering], i.e. 2*AUROC - 1 (I have assumed no ties here). Coinflipping (which works here, as the orders are randomized at probability 50/50) yields .5 and 0 respectively. In some applications one targets a value for such a statistic, in others one targets a significance level for better-than-coinflipping. Basically, it is Mann-Whitney's U and its properties.
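A toy computation of those two statistics, just to fix ideas (the scores are made up):
Code: [Select]
% AUROC as the normalized Mann-Whitney U over all (A,B) pairs, plus the Gini rescaling.
sA = [0.9 0.7 0.8 0.6 0.95];            % subject scores for the "more detectable" trials
sB = [0.5 0.6 0.4 0.8 0.3];             % subject scores for the "less detectable" trials
[a, b] = meshgrid(sA, sB);              % all pairwise comparisons
auroc = mean(a(:) > b(:)) + 0.5*mean(a(:) == b(:))   % Pr[stated ordering is correct]
gini  = 2*auroc - 1                                   % 0 for coinflipping, 1 for perfect

(Matlab code)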

SoundExpert explained

Reply #28
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?



Matter of definition, interpretation and use.

1) Consider three chess games which are all "theoretically lost". One is a simple mate in one; another is so hard that if you put 1000 chess players to the task, you won't be able to distinguish it from the starting position by statistical analysis of the outcomes; and the third is so hard that it won't be solved in fifty years. To make the logic clear-cut, assume that the second is like the third, except with 70 intermediate "only moves" (which do not constitute any learning curve for the subsequent ones).

Now, everything else being equal, you will still have a clear strict preference, because you risk meeting one of the very few chess players who actually can win this. You might not know that it is "humanly winnable", but you will absolutely want to insure against the uncertainty if the insurance is free.

Now consider a step-by-step sequence of chess positions, starting from the "third" one above. We index them by "# of very hard moves until the win is clear, as measured by statistics within confidence level [say, p]". How do you define the human-winnability threshold?


2) Consider a 32-bit sound file, then a 31-bit (LSB-truncated) file, and so on. Rank these. You may claim that every file above a "hearing threshold" of slightly below T bits is equivalent. However, what if it is an unfinished product? Are you sure the final mix is going to have the same hearing threshold? If not, then the high-resolution file could very well be more robust -- there might be manipulations which would enable you to hear a difference between the final mix and its T-bit version, although not between the original and its T-bit version. Most 16-bit CDs are mixed at a higher word length, right?
Solution? A "robustness-to-manipulations" measure?
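To put rough numbers on the T-bit ranking, a throwaway sketch (assuming a full-scale signal in [-1,1); not anyone's actual test procedure):
Code: [Select]
% Difference level of T-bit LSB truncation relative to the high-resolution signal.
x = 2*rand(1, 48000) - 1;                 % stand-in for the high-resolution file
for T = [32 24 16 12 8]
    q = floor(x*2^(T - 1)) / 2^(T - 1);   % keep T bits, truncate the rest (no dither)
    fprintf('%2d bits: diff level %7.1f dB\n', T, 20*log10(norm(x - q)/norm(x)));
end

(Matlab code)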


I would certainly agree it is fair to allow for a reasonable margin of error near the threshold of perception.

Quote
- if anyone makes a selling claim, then they have the burden of proof. Then "inaudible difference" is the null hypothesis.


And this really was my concern - if you have a "quality factor" metric that seems to imply one product is "better" than another based on the extrapolated portion of the curve, it is ripe for someone to misuse. For this reason, I think the information on the threshold of perception needs to be preserved.

SoundExpert explained

Reply #29
And this really was my concern - if you have a "quality factor" metric that seems to imply one product is "better" than another based on the extrapolated portion of the curve, it is ripe for someone to misuse. For this reason, I think the information on the threshold of perception needs to be preserved.

Precisely (and on this forum, SE results do get misused)!

There has been a lot of talk about psychometrics, but little to none about psychoacoustics.  When it comes to perceptual coding, it is the latter that is king.

Someone, anyone, provide some data showing a direct correlation between across-the-board "artifact" amplification and the real-world application of lossy audio compression.  I've seen claims that SE results are good for those interested in applications such as surround-sound processing, transcoding and equalization.  Evidence, please!!!

NB: the word artifacts was put in silly quotes for a reason.  We already had the discussion about what constitutes artifacts and the role masking plays.  I am not denying that they can become unmasked through typical real-world usage, but I am denying that across-the-board amplification of a difference signal that is subsequently added back in constitutes real-world usage.

AFAICT none of the criticisms put forth by people like Garf, Sebastian, Woodinville and Saratoga have been sufficiently addressed since they were raised.  It seems we've made no progress over the last four years.

SoundExpert explained

Reply #30
Someone, anyone, provide some data showing a direct correlation between across-the-board "artifact" amplification and the real-world application of lossy audio compression.  I've seen claims that SE results are good for those interested in applications such as surround-sound processing, transcoding and equalization.  Evidence, please!!!

I can well agree that a parameter such as quality margin might not be very useful in everyday lossy-codec usage. The metric was developed for assessing a wider class of low impairments. Ultimately it could be a substitute for (a further development of) the current audio metrics based on THD, SNR, IMD and similar parameters. As opposed to those metrics, the new one has to be sensitive to the psychoacoustic features of human hearing. That's why lossy coders are perfect for a test drive of the metric. They also produce time-accurate output, and the diff signal is easy to extract.

So I prefer to separate the questions:
  • Do we need an audio metric capable of assessing the quality margin of various devices and DSPs?
  • Do we need such a metric for lossy encoders?
  • Is it possible to develop such a metric in principle?
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #31
If we aren't going to consider real-world usage of perceptual audio coding, then why does one need a margin?

What role does an indirect method of measuring artifact detection play when there is a direct method of measurement available, especially when the direct method can be applied to real-world usage?

I think these are pertinent questions when considering whether SE results are to be accepted as a means of support for judging sound quality on this forum.

I found this thread among SoundExpert referrals and was a bit surprised by the almost complete misunderstanding of SE testing methodology, and particularly of how the diff signal is used in SE audio quality metrics.
I feel this needs to be addressed.  The thread in question has nothing to do with SE.  When SE was raised I don't believe there was any misunderstanding.  Speaking only for myself, Serge, I believe I do understand your testing methodology.

SoundExpert explained

Reply #32
What role does an indirect method of measuring artifact detection play when there is a direct method of measurement available, especially when the direct method can be applied to real-world usage?

The lower the impairments, the more expensive and less reliable the results of listening tests. This is a problem.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #33
Breaking masking by amplifying a difference signal by a fixed, arbitrary amount will not guarantee real-world performance.  This is a problem.

Individual ABX testing has always taken precedence over group testing on this forum.  If someone feels that he or she is (or has become) more sensitive to artifacts, then a new test can always be performed.

SoundExpert explained

Reply #34
Breaking masking by amplifying a difference signal by a fixed, arbitrary amount will not guarantee real-world performance.  This is a problem.

This is not a problem, this is an opportunity.
keeping audio clear together - soundexpert.org


SoundExpert explained

Reply #36
[...]
So, yes, difference signal is used in SE testing.
[...]
This is the concept.
[...]
SE testing methodology is new and questionable,

Yes, very questionable. I said this 4 years ago and I'm still saying this now.

but all assumptions look reasonable and SE ratings

Not really. It's not hard to imagine the possibility of signal pairs (main,side) where you can't hear any difference between main and main+side but you can easily hear a difference between main and main+0.5*side. Hint: phase is a bitch. ;-) Your implicit assumption is that both signals are independent. But this is not necessarily the case with perceptual audio coders.

Take for example the MPEG4 tool called PNS (perceptual noise substitution). It just replaces some high frequency noise with synthetically generated noise of the same level. This is done by transmitting the noise level only. Obviously, we can use this tool in cases when the main perceptual feature is the energy level and anything else is not important. Then, we have the following properties: Noise level of original matches the noise level of the encoded result, so energy(main) = energy(main+side). Probability theory tells us that main and main+side are orthogonal. This implies a coherence between main and side of 0.7 -- ZERO POINT SEVEN. Hardly independent. This also implies that a 50/50 mix -- main+0.5*side -- would lose 3dB power. You can easily compute this via
Code: [Select]
main = [1 0];                    % original "noise", unit energy
side = [0 1] - main;             % independent substitute noise minus the original
20*log10(norm(main+0.5*side))    % 50/50 mix: about -3.01 dB relative to main

(Matlab code)
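The same thing with actual noise, as a quick sanity check (a sketch, with zero-mean white noise standing in for the PNS band):
Code: [Select]
N = 1e6;
main  = randn(1, N);                          % original noise band
subst = randn(1, N);                          % independent substitute of the same level
side  = subst - main;                         % the difference signal
corrcoef(main, side)                          % off-diagonal about -0.71, i.e. |coherence| ~ 0.7
20*log10(norm(main + 0.5*side)/norm(main))    % about -3 dB for the 50/50 mix

(Matlab code)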

So, by attenuating the sample-by-sample difference we actually amplify the perceived difference (since we lose power) in this case! What does that tell us? It tells us that you overrate sample-by-sample differences. Perceptual audio coders try to retain certain things so it sounds similar and tolerate other losses. And you're focussing on the "other losses" (as well). What you're doing is basically violating some of a perceptual encoder's principles (like keeping energy levels similar no matter how large the sample-by-sample difference will be). By amplifying the difference you could destroy some signal properties the encoder and our HAS care about much more than you do.

Sound perception is not as simple as you want us to believe. Sample-by-sample differences are not important. And "extrapolating artefacts" this way is nothing but a big waste of time. Even testing with "attenuated artefacts" doesn't tell you anything. Your methodology breaks down because you're assuming that the difference is independent of the original. It is not.

Cheers!
SG

SoundExpert explained

Reply #37
How so?

Because controlled breaking of masking may turn out to be a powerful instrument of audio research and a basis for an audio metric.

It just has to be proved or disproved. IMHO this can't be done in discussions.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #38
Using a difference signal as a signal-detection test probe (using variable gain) is very seriously broken.

Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.

This leads, and yes, I have confirmed it with a simple, unpublished experiment, to "interesting" results in terms of perceived quality.

-----
J. D. (jj) Johnston

SoundExpert explained

Reply #39
Using a difference signal as a signal-detection test probe (using variable gain) is very seriously broken.

Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.


Do they "remove" the high frequencies, or do they remove the information -- i.e., replace it with something which may or may not have the same energy, but has less information content? If you take (1) a sawtooth signal, (2) dither it down to 24 bits, (3) mp3-encode -- what will each signal's Fourier coefficients look like?

(Not a rhetorical question. I don't know.)
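One way to check empirically, as a rough sketch (this assumes the LAME command-line encoder is installed and on the path, and it skips the dithering step):
Code: [Select]
fs = 44100; f0 = 441; t = (0:2*fs-1)'/fs;
x = 2*mod(f0*t, 1) - 1;                          % naive (aliased) sawtooth, 2 s at 441 Hz
audiowrite('saw.wav', x, fs, 'BitsPerSample', 24);
system('lame --quiet saw.wav saw.mp3');
system('lame --quiet --decode saw.mp3 saw_dec.wav');
y = audioread('saw_dec.wav');
n = min(length(x), length(y));                   % ignore codec delay/padding for this sketch
f = (0:n-1)'*fs/n;
plot(f, 20*log10(abs(fft(x(1:n))) + eps), f, 20*log10(abs(fft(y(1:n))) + eps));
xlim([0 fs/2]);                                  % compare the harmonic levels, esp. above ~16 kHz

(Matlab code)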

SoundExpert explained

Reply #40
Breaking masking by amplifying a difference signal by a fixed, arbitrary amount will not guarantee real-world performance.
A lot of radio stations use an Orban processor to juice up the signal (EQ, multi-band compression, limiting). The Orban is pretty good at breaking masking. Lossy audio that might pass a transparency ABX test can become quite bad after "Orbanisation". If your real world is rather predictable (e.g. personal use on a known system), then ABX is probably fine.
If the SoundExpert test is flawed, what kind of stress test would be able to reveal sub-threshold differences that ABX can't?

SoundExpert explained

Reply #41
Using a difference signal as a signal-detection test probe (using variable gain) is very seriously broken.

Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.


Do they "remove" the high frequencies, or do they remove the information -- i.e., replace it with something which may or may not have the same energy, but has less information content? If you take (1) a sawtooth signal, (2) dither it down to 24 bits, (3) mp3-encode -- what will each signal's Fourier coefficients look like?

(Not a rhetorical question. I don't know.)


It depends on the codec. MP3 and AAC (no plus) simply remove the high frequencies. The "plus" series adds in non-signal that sounds kinda-sorta, maybe "ok".
-----
J. D. (jj) Johnston

SoundExpert explained

Reply #42
If the SoundExpert test is flawed, what kind of stress test would be able to reveal sub-threshold differences that ABX can't?

You already gave the answer: "Orbanisation".  If you find some means of testing that provides a direct correlation (perhaps SE will do it, though I'm not holding my breath) then great, you have an alternative; otherwise there will not be another equitable substitute for comparing lossless->orbanisation and lossless->lossy->orbanisation, where you choose the encoder, settings and samples.  ABX or ABC(/HR) will always play a role.  If the differences are "sub-threshold" then for the sake of perceived audio quality there is no difference.

SoundExpert explained

Reply #43
It's not hard to imagine the possibility of signal pairs (main,side) where you can't hear any difference between main and main+side but you can easily hear a difference between main and main+0.5*side.

In practice - never. In all cases the perception of gradually unmasked artifacts is a monotonic function. That was also confirmed by B. Feiten in the already mentioned "Measuring the Coding Margin of Perceptual Codecs with the Difference Signal" (AES Preprint #4417). This is the main point of the SE metric, as stated in the first post (above the graph). Once again - not a single case where the curve was not monotonic, and numerous cases of monotonic behavior. So I treat this as a fact.

Hint: phase is a bitch. ;-) Your implicit assumption is that both signals are independent. But this is not necessarily the case with perceptual audio coders. Take for example the MPEG4 tool called PNS (perceptual noise substitution). It just replaces some high frequency noise with synthetically generated noise of the same level. This is done by transmitting the noise level only. Obviously, we can use this tool in cases when the main perceptual feature is the energy level and anything else is not important. Then, we have the following properties: Noise level of original matches the noise level of the encoded result, so energy(main) = energy(main+side). Probability theory tells us that main and main+side are orthogonal. This implies a coherence between main and side of 0.7 -- ZERO POINT SEVEN. Hardly independent. This also implies that a 50/50 mix -- main+0.5*side -- would lose 3dB power. You can easily compute this via
Code: [Select]
main = [1 0];
side = [0 1] - main;
20*log10(norm(main+0.5*side))

(Matlab code)

So, by attenuating the sample-by-sample difference we actually amplify the perceived difference (since we lose power) in this case! What does that tell us? It tells us that you overrate sample-by-sample differences. Perceptual audio coders try to retain certain things so it sounds similar and tolerate other losses. And you're focussing on the "other losses" (as well). What you're doing is basically violating some of a perceptual encoder's principles (like keeping energy levels similar no matter how large the sample-by-sample difference will be). By amplifying the difference you could destroy some signal properties the encoder and our HAS care about much more than you do. Sound perception is not as simple as you want us to believe. Sample-by-sample differences are not important. And "extrapolating artefacts" this way is nothing but a big waste of time. Even testing with "attenuated artefacts" doesn't tell you anything. Your methodology breaks down because you're assuming that the difference is independent of the original. It is not.

I didn't make such an assumption, quite the opposite - see 1b in the first post. Nevertheless, the case you describe is really interesting. Exaggerated and simplified a bit, it looks like the following:

We have a sound excerpt with a time interval (between tonal parts) that consists purely of, say, white noise. We also have a coder which simply substitutes that noise with an uncorrelated one whenever it detects that there are no tonal parts during the interval. The diff signal will then consist of a boosted noise portion (being uncorrelated, the two noises add rather than cancel). So the version of our excerpt with amplified differences will have a stronger noise part, which can be detected in listening tests, while in practice this is not important for the HAS.
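(Quick check of the "add rather than cancel" point, as a sketch: the difference of two uncorrelated noises of equal power is about 3 dB hotter than either of them.)
Code: [Select]
N = 1e6;
a = randn(1, N); b = randn(1, N);           % original and substituted noise, uncorrelated
10*log10(mean((b - a).^2) / mean(a.^2))     % about +3 dB: the powers add in the difference

(Matlab code)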

Is this the case you wanted to produce? If yes, I will examine it more carefully. It is really interesting, as it helps to determine the limits of the metric.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #44
Consider. Most codecs remove lots of high frequencies. If you add, say, twice the difference signal back to the original, YOU ADD THE PROPER HF ENERGY BACK.

We filter out such frequencies from the resulting test signals and mentioned that in the Diff.Level paper. It is a known problem and we addressed it from the beginning.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #45
I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function - an S-curve, generated by integrating a Gaussian distribution...


Did you have a footnote in mind to go with that asterisk?
Yes! Must have deleted it by mistake...

King-Smith, P. E., & Rose, D. (1997). Principles of an adaptive method for measuring the slope of the psychometric function. Vision Research, 37(12), 1595-1604. [PubMed]

...though I've lost my copy of the article - I cited it a decade ago so I must have thought it made sense back then.

Quote
Anyway, the sigmoid curve need not be the Gaussian cumulative distribution, though that is one common choice; another is the logistic distribution (more: http://en.wikipedia.org/wiki/Link_function#Link_function ). I'd guess a "sigmoid" in this context would mean any positive, smooth, strictly increasing, convex-then-concave function symmetric about (0, 1/2), i.e., one corresponding to a unimodal symmetric distribution, absolutely continuous and with full support.

Each of these choices constitutes a parametric model, meaning that you assume a certain (parametric) family of functions will be a good fit to reality. You then fit the parameters to find the best fit within the family. If a model fits well from level A to level B (where all your observations are), it is common practice to infer that it should perform acceptably at least from somewhere below A to somewhere above B as well. How far you can extrapolate does, of course, depend on circumstances.

Now I think -- though this is outside my field of expertise -- that the choice of link function is more crucial for wider extrapolations.
Yep, I agree with all of that. I think the "slightly" different curve shapes don't matter as much as you might expect in practice here since the psychometric data is likely to be rather rough anyway. If you get data on the steep part of the curve, you can probably do quite well even if you're not sure of the shape. If you get data on the shallow part of the curve, you're in more trouble if you don't know the exact shape, but you were already way off anyway.


Measured psychoacoustic thresholds are often 70% (because the procedure is often the one that I quoted on the previous page).

The 50% point on the S-curve doesn't correspond to the "getting better than 50% means it's not just chance" in ABX. In ABX, if you can't hear a thing, you'll (on average) score 50%. That's way off to the left on the S-curve. I guess that 50% on the S-curve gives a 75% score on ABX (???), which can give a very low p (depending on the number of trials).

Cheers,
David.

Why are such tests not used more with monotonically degrading stuff like lossy encoders? I have made a simple Matlab script for adaptively "honing in" on the most interesting part of the degradation, but reading a paper about the statistics of such tests reminded me how much I have forgotten from my statistics classes.
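Not the actual script, but the flavour of it - a minimal 2-down/1-up staircase sketch, with the listener simulated by a toy logistic curve (all numbers made up):
Code: [Select]
pdetect = @(g) 1./(1 + exp(-(g + 6)/2));         % toy psychometric function of diff gain in dB
pright  = @(g) pdetect(g) + (1 - pdetect(g))/2;  % detect, otherwise guess at 50%
g = 6; step = 2; run = 0; levels = zeros(1, 200);
for trial = 1:200
    levels(trial) = g;
    if rand < pright(g)                          % simulated correct answer
        run = run + 1;
        if run == 2, g = g - step; run = 0; end  % two right in a row -> make it harder
    else
        g = g + step; run = 0;                   % one wrong -> make it easier
    end
end
mean(levels(51:end))                             % hovers near the 70.7%-correct level

(Matlab code)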

-k

SoundExpert explained

Reply #46
Why are such tests not used more with monotonically degrading stuff like lossy encoders? I have made a simple Matlab script for adaptively "honing in" on the most interesting part of the degradation, but reading a paper about the statistics of such tests reminded me how much I have forgotten from my statistics classes.


Because audiophiles tend to shun science? Ooops, did I even say that?

SoundExpert explained

Reply #47
Why are such tests not used more with monotonically degrading stuff like lossy encoders? I have made a simple Matlab script for adaptively "honing in" on the most interesting part of the degradation, but reading a paper about the statistics of such tests reminded me how much I have forgotten from my statistics classes.


Because audiophiles tend to shun science? Ooops, did I even say that?

That may be the reason why so few listening tests are done in general, but not why this particular type is seldom used.

Estimating "just the right" bitrate for a given mp3 encoder seems to be a common issue, and using an adaptive test that spends most of its trials where the point of interest along the "PMF" turns out to be seems sensible.

-k


SoundExpert explained

Reply #49
In all cases the perception of gradually unmasked artifacts is a monotonic function.
How can you say this when SebG and Woodinville both gave you examples to the contrary?

I hit the exact problem Woodinville describes using the method I posted on the first page of this thread - a listener gets stuck in a "false" minimum of audibility because doubling the difference gives you the original signal back (with the part "removed" by the codec being inverted, but that difference is not usually audible). Hardly monotonic - the chance of hearing the artefact becomes zero at a single gain setting (+6 dB), and (with the specific audio I used - YMMV!) leaps back to the "expected" function very quickly on either side of that.
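The arithmetic behind that false minimum, as a sketch (taking the test signal as original + gain x difference, with difference = encoded - original, as earlier in the thread):
Code: [Select]
o = randn(1, 8);  r = 0.1*randn(1, 8);  % toy original, and the component the codec removes
e = o - r;                              % codec output: the original minus that component
test2 = o + 2*(e - o);                  % the difference signal applied at double gain (+6 dB)
max(abs(test2 - (o - 2*r)))             % ~0 (rounding): the original with r's polarity flipped

(Matlab code)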

Cheers,
David.