
SoundExpert explained

I found this thread among SoundExpert referrals and was a bit surprised by the almost complete misunderstanding of SE testing methodology, particularly of how the diff signal is used in SE audio quality metrics. The discussion on the topic from 2006 actually seems more meaningful. So I decided to post some SE basics here for reference purposes. I will use a thought experiment, though it is close to reality.

Suppose we have two sound signals – a main one and a side one. They could be, for example, a short piano passage and some noise. We can prepare several mixes of them in different proportions:
  • equal levels of main and side signals (0dB RMS)
  • half level of side signal (-6dB RMS)
  • quarter level of side signal (-12dB RMS)
  • 1/8 level of side signal (-18dB RMS)
  • 1/16 level of side signal (-24dB RMS)

After normalization all mixes have equal levels, and we can evaluate the perceptibility of the side signal in each mix. Here at SE we found that this perceptibility is a monotonic function of side-signal level and looks like this:

Figure: Side signal perception
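To make the thought experiment concrete, here is a minimal Python sketch of how such mixes could be prepared. The signals, sample rate and normalization step are illustrative assumptions on my part, not the actual SE processing.

```python
# Minimal sketch of the mixing thought experiment (illustrative only; the
# exact signals and processing used by SoundExpert may differ).
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return np.sqrt(np.mean(x ** 2))

def mix(main, side, side_level_db):
    """Mix 'side' into 'main' with the side signal set 'side_level_db' dB
    relative to the RMS level of the main signal, then normalize the result
    back to the main signal's RMS level so all mixes share one playback level."""
    gain = 10 ** (side_level_db / 20) * rms(main) / rms(side)
    m = main + gain * side
    return m * (rms(main) / rms(m))

# Stand-ins for the piano passage (main) and the noise (side)
fs = 44100
t = np.arange(fs) / fs
main = np.sin(2 * np.pi * 440 * t)
side = np.random.default_rng(0).standard_normal(len(t))
mixes = {db: mix(main, side, db) for db in (0, -6, -12, -18, -24)}
```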

(1) In other words, there is a relationship between the objectively measured level of the side signal and its subjectively estimated perceptibility in the mix. What is more:
    (a) this relationship is well described by a second-order curve (assuming levels are in dB);
    (b) the relationship holds for any sound signals, whether they are correlated or not; the only differences are the position and curvature of the curve.
(2) These side-stimulus perceptibility curves are the core of the SE rating mechanism. Each device under test has its own curve, plotted on the basis of SE online listening tests.
(3) Side signals are the difference signals of the devices being tested. Levels of side signals are expressed in dB of the Difference level parameter, which in our case is exactly equal to the RMS level of the side signal.
(4) Subjective grades of perceptibility are the anchor points of the 5-grade impairment scale.
(5) Audio metrics beyond the threshold of audibility are determined by extrapolating these second-order curves (sketched below). Virtual grades in the extrapolated area can be considered objective quality parameters that take human auditory peculiarities into account.
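Points (2), (3) and (5) can be sketched in a few lines of Python. This is my own illustration with invented data points, not SE's actual implementation; Diff.Level is computed here simply as the RMS level of the difference signal in dB relative to the reference, as described above.

```python
import numpy as np

def diff_level_db(reference, test):
    """Diff.Level here: RMS level of the difference signal in dB relative to
    the RMS level of the reference (the exact SE scale offset is ignored)."""
    diff = test - reference
    return 20 * np.log10(np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(reference ** 2)))

# Invented listening-test results: Diff.Level (dB) vs. subjective grade on the
# 5-grade impairment scale, all in the clearly audible range
levels = np.array([-10.0, -15.0, -20.0, -25.0])
grades = np.array([  2.1,   3.0,   3.9,   4.6])

curve = np.poly1d(np.polyfit(levels, grades, deg=2))   # second-order fit

# Extrapolate past the audible range to get a "virtual grade" (the dashed part)
print("virtual grade at -30 dB:", round(float(curve(-30.0)), 2))
```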

So, yes, the difference signal is used in SE testing. We take into account both its level and how the human auditory system perceives it together with the reference signal. Some difference signals with fairly high levels still remain almost imperceptible against the background of the reference signal, and vice versa; the perceptibility curves reflect this.

This is the concept. Many parts of it still need thorough verification in carefully designed listening tests, which are beyond SE's capabilities. All we can do is analyze the grades returned by SE visitors. This will certainly be done, and yet it cannot replace properly organized listening tests.

SE testing methodology is new and open to question, but all the assumptions look reasonable and the SE ratings look promising, at least to me. Time will tell.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #1
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?

SoundExpert explained

Reply #2
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?

The super-goal is to create an audio metric that can assess the quality margin of devices and technologies while taking human auditory characteristics into account. THD+N assesses such a margin, but without regard to the latter. The quality margin exists objectively; we need an instrument for measuring it. By extrapolating those psychometric curves, we create such an instrument. The dashed line could be a first-order or a more complex curve; this is purely a convention. It seems to me that the most natural and simplest way to prolong the curve is to extrapolate it. Without that dashed section, assessment of quality beyond perception is simply impossible.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #3
Without that dashed section assessment of quality beyond perception is just impossible.


Exactly! Which is as it should be - there is no change in "quality" beyond the point of perception, unless you're defining "quality" to mean something imperceptible.

SoundExpert explained

Reply #4
Exactly! Which is as it should be - there is no change in "quality" beyond the point of perception, unless you're defining "quality" to mean something imperceptible.

There are subjective and objective quality parameters. Subjective grades can be obtained only in listening tests. They evaluate perceived audio quality most accurately, but they are helpless for assessing quality margins - an encoder@300kbit/s and an encoder@400kbit/s are the same to them, as are an ADC(-90dB THD+N) and an ADC(-123dB THD+N), and a lot of other audio equipment and technologies. Objective quality parameters like THD, IMD, frequency response etc. cover many aspects of quality-margin estimation but correlate poorly with subjective parameters because they do not take human hearing into account. Also, subjective and objective parameters "live" in separate worlds; you can't "translate" one into the other.

I am sure it is possible to create a quality parameter that can assess quality margin with regard to human hearing. This is the goal of SE's efforts. We propose such a parameter, which combines objective measurement with evaluation by humans.

You can call this quality parameter whatever you like; I prefer simply "quality rating". It's a combination of subjective and objective parameters.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #5
Just to be clear - I am not necessarily questioning your goals here.

The problem I see is that when you extrapolate to infinity a curve that attempts to quantify human perception, you are implicitly claiming that human perception itself extends infinitely. You are redefining "imperceptible" to mean "less perceptible". You need the curve (and the math) to match the realities of human perception, or else any conclusions you attempt to draw from the extrapolations are essentially meaningless. What's the point of trying to create an objective measure that only applies to a hypothetical world?

I repeat my original assertion - the curve should be a flat line when it reaches the point labeled "imperceptible".

SoundExpert explained

Reply #6
If you want to build human-hearing-oriented audio metrics for the area beyond the perception point (p-point), you will need some psychometric relationship in that area, which is impossible by definition – you can't research perception beyond the p-point (*). So any relationship in that area will be artificial or hypothetical. The task is to find the hypothetical relationship that serves the purpose best. In SE metrics it is the extrapolation of the real psychometric curve. In other words, the dashed line is what we actually need for our metrics, and the only purpose of the real part of the curve is to be a basis for plotting that dashed line.

So, by extrapolating the curve I assume that beyond the p-point the relationship revealed by the real part of the curve still holds. You can't prove or disprove this assumption directly because of (*), but it can be done indirectly by comparing SE quality ratings with the results of traditional listening tests on audio material with very small impairments.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #7
If you want to build human-hearing-oriented audio metrics for the area beyond the perception point (p-point), you will need some psychometric relationship in that area, which is impossible by definition – you can't research perception beyond the p-point.


> human-hearing-oriented
> beyond perception point

Does not compute.


You are talking about "quality margin", but there is no such thing as absolute quality. Quality is, essentially, a measure of fitness for a particular purpose. That is, the notion of quality is always tied to a particular application, or a defined set of applications.

So, what kind of application does your extrapolated curve relate to?

If the purpose is simply to compare two codecs or devices as applied to perceived quality of audio reproduction, then the part of the curve below the "imperceptible" threshold should be sufficient. What purpose does the extrapolated part serve then?

SoundExpert explained

Reply #8
Just to be clear, your graph example shows grades where the default noise level (0dB) is quite objectionable, and reducing the noise makes it less and less so - correct?

But with codec testing, you do kind of the opposite. The default noise level (0dB) is usually indistinguishable/transparent, or very nearly so, and to build the "worse quality" part of the curve (the part where people can hear the noise), you have to amplify the coding noise - correct?


People in this thread are saying the scale beyond "imperceptible" makes no sense. I'm not sure whether that's true or not. What you're "measuring" (I put that in quotes - see later) is how far the coding noise sits below the threshold of audibility (or above, if it's audible at the default level). If the second-order curve theory holds true, then to do this you only need sufficient points on the curve where the difference is audible. Points on the curve where the difference is inaudible don't help, because the curve does become a flat line there.


There are several accepted ways to judge the threshold of audibility. I used this one...
Quote
Each masking threshold was determined by a 3-interval, forced-choice task, using a one-up two-down transformed staircase tracking method. This procedure yields the threshold at which the listener will detect the target 70.7% of the time [Levitt, 1971]. The process is as follows.
For each individual measurement, the subject is played three stimuli, denoted A, B, and C. Two presentations consist of the masker only, whilst the third consists of the masker and target. The order of presentation is randomised, and the subject is required to identify the odd-one-out, thus determining whether A, B, or C contains the target. The subject is required to choose one of the three presentations in order to continue with the test, even if this choice is pure guesswork, hence the title “forced choice task.” If the subject fails to identify the target signal, the amplitude of the target is raised by 1 dB for the next presentation. If the subject correctly identifies the target signal twice in succession, then the amplitude of the target is reduced by 1 dB for the next presentation. Hence the amplitude of the target should oscillate about the threshold of detection, as shown in Figure 6.5. In practice, mistakes and lucky guesses by the listener typically cause the amplitude of the target to vary over a greater range than that shown. A reversal (denoted by an asterisk in Figure 6.5) indicates the first incorrect identification following a series of successes (upper asterisks), or the first pair of correct identifications following a series of failures (lower asterisks). The amplitudes at which these reversals occur are averaged to give the final masked threshold. An even number of reversals must be averaged, since an odd number would cause a +ve or –ve bias. Throughout these tests, the final six (out of eight) reversals were averaged to calculate each masked threshold.
The initial amplitude of the target is set such that it should be easily audible. Before the first reversal, whenever the subject correctly identifies the target twice, the amplitude is reduced by 6 dB. After the first reversal, whenever the subject fails to identify the target, the amplitude is increased by 4 dB. After the second reversal, whenever the subject correctly identifies the target twice, the amplitude is reduced by 2 dB. After the third reversal, the amplitude is always changed by 1 dB, and the following six reversals are averaged to calculate each masked threshold. This procedure allows the target amplitude to rapidly approach the masked threshold, and then finely track it. If the target amplitude were changed in 1 dB steps initially, then the descent to the masked threshold would take considerably longer, and add greatly to listener fatigue. In the case where the listener fails to identify the target initially, then the target amplitude is increased by 6 dB for each failed identification, up to the maximum allowed by the replay system (90 dB peak SPL at the listener’s head).

This is normally used for simple noise-masking-tone experiments. It seems to work OK with coding noise, but repeating a moment of coded audio over and over again is quite mind-numbing and makes people listen in a very different way from normal music listening. Whether it pushes their thresholds up or down I don't know. Quite a fascinating subject IMO!
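For readers who haven't met this procedure, here is a simplified simulation of a one-up two-down staircase. The listener model, starting level and fixed 1 dB step size are invented for illustration; the 6/4/2 dB step schedule from the quote is omitted for brevity.

```python
# Simplified simulation of a one-up two-down (transformed) staircase.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)
TRUE_THRESHOLD_DB = -20.0   # invented "true" masked threshold of the listener
SLOPE_DB = 2.0              # invented steepness of the listener's psychometric function

def correct_response(level_db):
    """Simulated 3-interval forced-choice trial: the chance of a correct answer
    rises from 1/3 (pure guessing) towards 1 as the target level rises."""
    p_detect = 0.5 * (1 + erf((level_db - TRUE_THRESHOLD_DB) / (SLOPE_DB * sqrt(2))))
    return rng.random() < 1 / 3 + (2 / 3) * p_detect

level = 0.0                  # start well above threshold
step = 1.0                   # fixed 1 dB steps (the quoted schedule starts coarser)
streak, direction, reversals = 0, 0, []
while len(reversals) < 8:
    if correct_response(level):
        streak += 1
        if streak == 2:                      # two correct in a row -> step down
            streak = 0
            if direction > 0:
                reversals.append(level)      # was rising, now falling: reversal
            direction = -1
            level -= step
    else:
        streak = 0                           # one miss -> step up
        if direction < 0:
            reversals.append(level)          # was falling, now rising: reversal
        direction = +1
        level += step

# Average the last six of eight reversals, as in the quoted procedure
print("estimated 70.7%-correct threshold:", round(np.mean(reversals[2:]), 1), "dB")
```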


It seems to me that your method is far kinder to listeners. If your second-order curve fitting can be justified, then it's a really neat way of finding the threshold of audibility (the crossover from 5.0 "imperceptible" to 4.9 "just perceptible but not annoying" on the usual scale) without even having to test at that (difficult) level.



So far so good. What I'm less convinced of is the implication that a given codec has so much "headroom", and that this is a "good thing".

e.g. on the range of content tested, at a given bitrate/setting, a given codec might be transparent even with the noise elevated by 12dB. It scores well in your test. Fair enough. IMO it would be wrong to draw too much from this conclusion. e.g.
1. It's tempting to think this means it's suitable for transcoding, but it might not be - it might fall apart when transcoded.
2. It's tempting to think this means that audible artefacts will be rarer (and/or less bad) with this codec than with one where the noise becomes audible when elevated by 3dB, but this might be very wrong - this wonderful codec which keeps coding noise 12dB below the threshold of audibility on the content tested might fall apart horribly on some piece of content that hasn't been tested.


I'm sure you know all this! I'm just thinking aloud.

Anyway, I find it fascinating. Thanks for the explanation.

Cheers,
David.

SoundExpert explained

Reply #9
I repeat my original assertion - the curve should be a flat line when it reaches the point labeled "imperceptible".

How wide is a Gaussian distribution?

If an encoder produces an audible flaw for 1/1000 people, on 1/1000 source materials, 1/1000 of the time, that is a lot of tests to sort through blindly in order to find that one audible corner case. And you never know whether it is there until you find it.

If (and that is a big if) subjective scores can be modelled as simple functions, then one could do simple, small-scale listening tests designed to extract those functions' parameters, instead of determining the absolute threshold of audibility. If this extrapolation is sane (I have no idea whether it is), then one could predict the outcome of exhaustive, expensive listening experiments from small ones, and say something clever about the likelihood of a given flaw ever being detected, right?
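A back-of-the-envelope sketch of both halves of this argument, with invented numbers: first the corner-case arithmetic, then a toy psychometric function whose fitted parameters would let you read off detection probabilities that were never tested directly (assuming, as said, that such a simple model holds).

```python
from math import erf, sqrt

# The corner-case arithmetic from the post above
p_flaw = (1 / 1000) * (1 / 1000) * (1 / 1000)   # listener x sample x occasion
print(f"probability per blind trial:    {p_flaw:.0e}")
print(f"expected trials to hit it once: {1 / p_flaw:.0e}")

def p_detect(margin_db, slope_db=3.0):
    """Toy psychometric function: probability of detecting an artefact sitting
    'margin_db' dB above (+) or below (-) the audibility threshold."""
    return 0.5 * (1 + erf(margin_db / (slope_db * sqrt(2))))

# If small tests let you fit the curve's parameters, you could read off
# detection probabilities for margins that were never tested directly
for margin in (+6, 0, -6, -12):
    print(f"{margin:+d} dB -> p(detect) = {p_detect(margin):.4f}")
```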

-k

SoundExpert explained

Reply #10
> human-hearing-oriented
> beyond perception point

Does not compute.

The p-point differs from person to person and depends on training. It is determined conventionally, by fixing the measurement procedure. The area beyond the p-point is not completely deaf territory. Back to your contradiction: "human-hearing-oriented audio metrics for the area beyond perception point" simply means that such metrics should evaluate audio quality as well as it would be evaluated by golden ears in perfectly designed listening tests.

You are talking about "quality margin", but there is no such thing as absolute quality. Quality is, essentially, a measure of fitness for a particular purpose. That is, the notion of quality is always tied to a particular application, or a defined set of applications.

So, what kind of application does your extrapolated curve relate to?

If the purpose is simply to compare two codecs or devices as applied to perceived quality of audio reproduction, then the part of the curve below the "imperceptible" threshold should be sufficient. What purpose does the extrapolated part serve then?

Any applications with small impairments that are difficult (or expensive) to evaluate in standard listening tests: amplifiers with high THD values; noise-shaping, pitch-shifting and other sound-processing algorithms; high-bitrate encoders …
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #11
If this extrapolation is sane (I have no idea if it is), then one could predict the outcome of exhaustive, expensive listening experiments from small ones, and say something clever about the likelihood of a given flaw ever being detected, right?

This is exactly what this new audio metric was designed for.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #12
In the recently closed thread which the OP referred to I mentioned SoundExpert, and greynol replied:
This has been discussed on the forum on more than one occasion. While Serge may take his method seriously, HA does not.
If you're reading this, greynol, would you be so kind as to summarize the (HA) objections?

SoundExpert explained

Reply #13
Just to be clear, your graph example shows grades where the default noise level (0dB) is quite objectionable, and reducing the noise makes it less and less so - correct?

But with codec testing, you do kind of the opposite. The default noise level (0dB) is usually indistinguishable/transparent, or very nearly so, and to build the "worse quality" part of the curve (the part where people can hear the noise), you have to amplify the coding noise - correct?

You're right in principle, but the real figures are slightly different, because for quantitative estimation of the differences introduced by the device under test we use the Diff.Level parameter. The Diff.Level scale is shifted by 3 dB relative to the one on the graph (0dB on the graph = -3dB on the Diff.Level scale).

Yes, in order to build the "worse quality" part, the diff signal is amplified. Usually this part occupies the range between -10dB and -30dB for high-bitrate encoders, depending on the test sample and the particular encoder. Low-bitrate encoders are tested as is, without building the curves.
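To make this concrete, here is a hypothetical sketch of building amplified-difference test items as reference + k·(coded − reference). The synthetic signals are stand-ins and the 3 dB Diff.Level scale offset is ignored, so this is not the actual SE processing chain.

```python
import numpy as np

def rms_db(x, ref):
    """RMS level of x in dB relative to the RMS level of ref."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)) / np.sqrt(np.mean(ref ** 2)))

def amplified_diff_item(reference, coded, boost_db):
    """Test item whose coding noise has been raised by 'boost_db' dB."""
    diff = coded - reference
    return reference + diff * 10 ** (boost_db / 20)

# Synthetic stand-ins for an original excerpt and its coded version
fs = 44100
t = np.arange(fs) / fs
reference = np.sin(2 * np.pi * 440 * t)
coded = reference + 0.001 * np.random.default_rng(0).standard_normal(len(t))

for boost in (0, 6, 12, 18):
    item = amplified_diff_item(reference, coded, boost)
    print(f"+{boost:2d} dB boost -> Diff.Level {rms_db(item - reference, reference):6.1f} dB")
```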

People in this thread are saying the scale beyond "imperceptible" makes no sense. I'm not sure whether that's true or not. What you're "measuring" (I put that in quotes - see later) is how far the coding noise sits below the threshold of audibility (or above, if it's audible at the default level). If the second-order curve theory holds true, then to do this you only need sufficient points on the curve where the difference is audible. Points on the curve where the difference is inaudible don't help, because the curve does become a flat line there.

It seems you missed my point here, or I missed yours. We "measure" exactly how far the coding noise sits above the threshold of audibility on the (vertical) subjective scale. We can measure the amount of that coding noise with Diff.Level, but in order to map it onto the subjective scale we need some curve above the threshold.

There are several accepted ways to judge the threshold of audibility. I used this one...
[...]
It seems to me that your method is far kinder to listeners. If your second-order curve fitting can be justified, then it's a really neat way of finding the threshold of audibility (the crossover from 5.0 "imperceptible" to 4.9 "just perceptible but not annoying" on the usual scale) without even having to test at that (difficult) level.

Yes, the method could be used for that purpose (if it holds true). Multiple iterations around the target threshold would be replaced with several easy listening tests, needed to build the "worse quality" part of the curve. Then you only need to extend it a little bit.

So far so good. What I'm less convinced of is the implication that a given codec has so much "headroom", and that this is a "good thing".

e.g. on the range of content tested, at a given bitrate/setting, a given codec might be transparent even with the noise elevated by 12dB. It scores well in your test. Fair enough. IMO it would be wrong to draw too much from this conclusion. e.g.
1. It's tempting to think this means it's suitable for transcoding, but it might not be - it might fall apart when transcoded.
2. It's tempting to think this means that audible artefacts will be rarer (and/or less bad) with this codec than with one where the noise becomes audible when elevated by 3dB, but this might be very wrong - this wonderful codec which keeps coding noise 12dB below the threshold of audibility on the content tested might fall apart horribly on some piece of content that hasn't been tested.

1. Hard to say. Not only does the noise headroom matter, but also how this headroom maps onto the vertical scale, and that depends on the curve. In any case, a codec with a greater margin will be more suitable for transcoding than one with a lower margin (I do hope so).
2. I think this applies to normal listening tests as well.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #14
SE testing methodology is new and open to question, but all the assumptions look reasonable and the SE ratings look promising, at least to me. Time will tell.


Seeing as all this will be entirely dependent on the short-term spectrum of both signal and interferer, I wonder how you can develop any "metric" that is not specifically designed for one track, or one short bit of music.

In your example, I see no accounting for spectra, which is a key factor for the human auditory system.

If we're talking "one kind of instrument music" vs. "white noise", we have nothing useful at hand. So what is your point?
-----
J. D. (jj) Johnston

SoundExpert explained

Reply #15
Seeing as all this will be entirely dependent on the short-term spectrum of both signal and interferer, I wonder how you can develop any "metric" that is not specifically designed for one track, or one short bit of music.

The metric works as long as you can measure Diff.Level (always) and estimate the annoyance of the diff signal in some sound excerpt (not always; for long excerpts the term "basic audio quality" may be inapplicable). In short - if listening tests are valid for the excerpt, the metric is valid too.

In your example, I see no accounting for spectra, which is a key factor for the human auditory system.

If it is a key factor, the human auditory system will account for it during the listening tests, which are an integral part of the metric.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #16
Seeing as all this will be entirely dependent on the short-term spectrum of both signal and interferer, I wonder how you can develop any "metric" that is not specifically designed for one track, or one short bit of music.

The metric works as long as you can measure Diff.Level (always) and estimate the annoyance of the diff signal in some sound excerpt (not always; for long excerpts the term "basic audio quality" may be inapplicable). In short - if listening tests are valid for the excerpt, the metric is valid too.


Um, I don't think so. I can measure a difference level that is exactly the same, i.e. the same exact SNR, and have enormously different perceived quality.

See "13 dB miracle", please.
-----
J. D. (jj) Johnston

SoundExpert explained

Reply #17
Um, I don't think so. I can measure a difference level that is exactly the same, i.e. the same exact SNR, and have enormously different perceived quality.

See "13 dB miracle", please.

Exactly, the "different perceived quality" will be revealed during listening tests and this will be reflected by the psychometric curve above. So, the same Diff.level will be mapped to different points on subjective scale because of different curves.

The design of this metric was heavily inspired by the "13 dB miracle". It was clear that an audio quality metric can't rely on objective measurements alone (like THD, SNR...); those objective parameters have to be corrected (weighted) by some psychometric relationships - in our case, by the side-signal perception curves.
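A toy illustration of that point, with invented curve coefficients: two artefact types measuring the same Diff.Level end up with different grades once each one's own fitted curve is applied.

```python
import numpy as np

# Two invented perceptibility curves, each fitted through three
# (Diff.Level, grade) points
benign_noise   = np.poly1d(np.polyfit([-10, -20, -30], [3.5, 4.6, 5.3], 2))
nasty_artefact = np.poly1d(np.polyfit([-10, -20, -30], [1.8, 3.2, 4.4], 2))

diff_level = -25.0   # identical measured difference level for both artefacts
print("benign noise grade:  ", round(float(benign_noise(diff_level)), 2))
print("nasty artefact grade:", round(float(nasty_artefact(diff_level)), 2))
```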
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #18
What is the justification for the "dashed" portion of the curve?

Shouldn't it be a flat line once you reach "imperceptible"? If not, once something is imperceptible, how can it become "more imperceptible"?



Matter of definition, interpretation and use.

1) Consider three chess positions which are all "theoretically lost". One is a simple mate in one; the second is so hard that if you put 1000 chess players at the task, you won't be able to distinguish it from the starting position by statistical analysis of the outcomes; and the third is so hard that it won't be solved in fifty years. To make the logic clear-cut, assume that the second is like the third, except with 70 intermediate "only moves" (which do not constitute any learning curve for the subsequent ones).

Now, everything else being equal, you will still have a clear, strict preference, because you risk meeting one of the very few chess players who can actually win this. You might not know that it is "humanly winnable", but you will absolutely want to insure against the uncertainty if doing so is free.

Now consider a step-by-step sequence of chess positions, starting from the "third" one above. We index them by "# of very hard moves until the win is clear, as measured by statistics within confidence level [say, p]". How do you define the human-winnability threshold?


2) Consider a 32-bit sound file, then a 31-bit (LSB-truncated) file, etc. Rank these. You may claim that every file above a "hearing threshold" of slightly below T bits is equivalent. However, what if it is an unfinished product? Are you sure that the final mix is going to have the same hearing threshold? If not, then the high-resolution file could very well be more robust -- there might be manipulations which would enable you to hear a difference between the final mix and its T-bit version, although not between the original and its T-bit version. Most 16-bit CDs are mixed at a higher word length, right?
Solution? A "robustness-to-manipulations" measure?
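A small sketch of this word-length thought experiment, with invented parameters and no dithering, just to show how dropping LSBs raises the error floor:

```python
import numpy as np

def truncate_to_bits(x, bits):
    """Quantize a float signal in [-1, 1) to 'bits' bits by dropping LSBs."""
    q = 2.0 ** (bits - 1)
    return np.floor(x * q) / q

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 44100)        # stand-in for a high-resolution master
for bits in (24, 20, 16, 12):
    err = x - truncate_to_bits(x, bits)
    err_db = 20 * np.log10(np.sqrt(np.mean(err ** 2)))
    print(f"{bits:2d} bits -> truncation error {err_db:6.1f} dBFS")
```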


Of course:
- if no such issues apply, then assigning zero value to superfluous information is at least as good a measure as anything else
- if anyone makes a selling claim, then they have the burden of proof. Then "inaudible difference" is the null hypothesis. You would grab the extra measured quality if it were free, as insurance against audibility, but you would frown upon someone trying to sell you insurance against a disaster which no one has ever substantiated has ever happened or could ever happen. (... well ...: http://en.wikipedia.org/wiki/Alien_abduction_insurance )
- even if we assume that there is some worth to this not-justified-as-generally-audible quality, it is hard to quantify. Justifying that it exists (by measurement) does not mean we can justify a reasonably narrow confidence interval for a particular point on the graph.

 

SoundExpert explained

Reply #19
Exactly, the "different perceived quality" will be revealed during listening tests and this will be reflected by the psychometric curve above. So, the same Diff.level will be mapped to different points on subjective scale because of different curves.


So, then, this curve of yours is only useful to compare like to like. This is, simply put, not very useful.  I don't get your point here.
-----
J. D. (jj) Johnston

SoundExpert explained

Reply #20
So, then, this curve of yours is only useful to compare like to like. This is, simply put, not very useful.  I don't get your point here.

I understood the author as wanting to be able to do limited, inexpensive tests on exaggerated errors, then use this method to extend those results to smaller errors that would normally need large, expensive listening tests.

If this works and can be verified, then it sounds like a good thing.

-k


SoundExpert explained

Reply #22
The technique isn't new, according to this AES paper from 1997: Measuring the Coding Margin of Perceptual Codecs with the Difference Signal
Quote
Inaudible impairments or impairments near the threshold of audibility require a new method to assess the quality. A variable amplification of the impairments to provide the detection can be realized with the help of the difference signal. In a large listening test, the coding margin for 14 test items was measured. A time varying filter bank to modify the difference signal and to enhance the listening conditions is described.

NB: Just after posting I found an old HA post from Serge. Apparently it's his own paper, although the paper states "Feiten, Bernhard" as the author from Deutsche Telekom.

SoundExpert explained

Reply #23
NB: Just after posting I found an old HA post from Serge. Apparently it's his own paper, although the paper states "Feiten, Bernhard" as the author from Deutsche Telekom.

The author of the paper is Bernhard Feiten, for sure. He is in the references both in my own paper and on the SE site. The SE metric could be considered a further development of his approach.
keeping audio clear together - soundexpert.org

SoundExpert explained

Reply #24
That's a mighty big if.

For years people have requested verification and none has been forthcoming.
I think something like it is justified.

I think it's commonly accepted* that signal detection (e.g. artefact detection in these tests) is a psychometric function - an S-curve, generated by integrating a Gaussian distribution...



Figure: psychometric (S-curve) function. The x-axis is level, and the y-axis is the chance of detecting the artefact.

If you know the function takes this shape, then it's apparent that you don't need to test at the threshold. You can test at several levels somewhat above threshold and fit the resulting data to this graph/shape, thus giving you the actual threshold value.

The major problem with this is that, if you are testing only a long way above threshold, then very minor errors in the data will give huge errors in the threshold estimate because the fit to the graph could be wildly wrong.
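A sketch of that fit-the-S-curve idea with invented data. It assumes a cumulative-Gaussian psychometric function and uses scipy.optimize.curve_fit; the shape assumption and the data are mine, not anything SoundExpert actually does.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.special import erf

def psychometric(level_db, threshold_db, slope_db):
    """Detection probability as an integrated Gaussian of level (in dB)."""
    return 0.5 * (1 + erf((level_db - threshold_db) / (slope_db * np.sqrt(2))))

# Invented detection rates, all measured well above the (unknown) threshold
levels = np.array([6.0, 9.0, 12.0, 15.0])   # dB of artefact amplification
p_obs  = np.array([0.62, 0.81, 0.93, 0.99])

(threshold, slope), _ = curve_fit(psychometric, levels, p_obs, p0=(5.0, 3.0))
print(f"fitted threshold: {threshold:.1f} dB, slope: {slope:.1f} dB")
# The caveat above applies: with all the data far above threshold, small errors
# in p_obs can move the fitted threshold a long way.
```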


Now, Sound Expert isn't doing this - it's at least one step away from it, testing at levels where people can hear the artefact all the time, and asking them how bad it sounds.

As you say, we have no proof that a graph of these results can be extrapolated back to find the threshold.


An obvious criticism is that two different kinds of artefacts, 12dB above threshold, might give very different results - i.e. one might be far more annoying than the other. But that's not necessarily a failing - it just means the curve might be steeper for one than for the other - which would become apparent with more points on the curve (at 6dB and 18dB, for example), so it could be accounted for by the method.


It would be interesting to try to prove/disprove all this. A good starting point might be to take one of the archived listening tests from HA with known results, and use exactly the same samples on SoundExpert. The results should speak for themselves.

Cheers,
David.