MPC vs OGG VORBIS vs MP3 at 175 kbps
Reply #74 – 2004-07-24 22:04:52
"What are your conclusions here? I'm interested."

My conclusion is that codec B must be a bit underrated, since an "annoying difference" couldn't be distinguished from the original four times (unless the tester states that he hit the wrong button). The problem is that it could be. Some reasons are easy to explain.

Imagine that you're testing many formats in the same test. The first step is to rate each file. The first one (1L) is excellent, very hard to distinguish (4.5/5); you're not even sure that the difference really exists. The second file suffers by comparison: coarseness is clearly audible (2/5).

Second step: ABX. The first file is hard to ABX and needs a lot of concentration. I could distinguish a slight amount of pre-echo on a precise range, that's all. 14/16 [16 as a fixed value]: not bad. The second file should be much easier to ABX, but the first six trials are bad (2/6). Why? Because all my attention is focused on pre-echo I can't hear, simply because this file doesn't suffer from that problem. By changing the selected range a bit and focusing my attention on another problem, I find again the annoyance I detected immediately the first time, and perform a very nice 16/16 in two minutes. The final score is 18/22. Is your conclusion still the same: "codec B must be a bit underrated"?

There's a serious problem with tests including more than one encoded file: conditions are not equal for all. By changing the order, you could change the ABX scores. Beginning with an easy test could help you warm up your ears and give you confidence, but an easy "victory" could also handicap you by creating excessive confidence. You could be tired after two files if you begin with the two most difficult ones, and so on. Of course, the solution would be to rest your ears as often as needed and to watch your concentration, like a sportsman during a competition.
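The scores in the scenario above can be sanity-checked with a one-sided binomial test, the usual way ABX results are evaluated. This is a minimal sketch of my own (the helper name `abx_p_value` is mine, not from the thread):

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial p-value: the probability of scoring at least
    `correct` out of `trials` by pure guessing (p = 0.5 per trial)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# The scores from the scenario above:
print(f"{abx_p_value(14, 16):.6f}")  # 14/16 -> 0.002090
print(f"{abx_p_value(16, 16):.6f}")  # 16/16 -> 0.000015
print(f"{abx_p_value(18, 22):.6f}")  # pooled 18/22 -> 0.002172
```

All three scores fall well below p = 0.05, which is exactly the point: the pooled 18/22 looks statistically solid even though it mixes a run spoiled by listening for the wrong artifact with a run made trivial by finding the right one.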
"Problem is that some people (including me) can't always spend three or four hours just to achieve one single test including 6 contenders."

"Exactly. I'm not going to spend a whole weekend trying to analyse partly sequential ABX results with additional conditions (...), especially after most people on this board have hammered (but I'm not sure if I repeated it in the ABX tutorial) the necessity of fixing the number of trials before the test begins OR not looking at them during the test for the results to be valid."

Nobody forces you to analyse these ABX results. What kind of conclusions could you draw by computing ABX scores (I'm serious, I still don't understand)? What could you conclude when you see that one file was ABXed at 10/16 and the other one at 15/16? That the second one has stronger flaws? That's a wrong conclusion. The tester is not a robot, does not live in a studio, and is not a champion. He can't necessarily maintain the same level of concentration during a whole test; he can't necessarily keep his ears at the same level of freshness; he logically doesn't have the same familiarity with the reference during the first ABX session as during the sixth and last one.

Fixing a strict number of trials solves these problems if and only if the tester has maintained the same listening abilities (a generic term covering freshness, concentration, motivation, patience, and silence in the room) during the whole test. If the tester admits that his listening conditions changed during the test, there's no need to spend a weekend, or even a minute, computing additional data based on ABX scores, which represent nothing (at least, they don't only reflect the difficulty of the samples; they could also reflect the variations in the listening conditions themselves).

"Roberto's results are perfectly valid:
- Tests were double blind
- The p-value is strictly below 0.05 (<0.01 is a good thing, <0.05 is required)"

And what about the number of listeners?
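The insistence on fixing the number of trials before the test begins can be illustrated with a small Monte Carlo sketch (my own illustration, not from the thread): a pure guesser who checks the running p-value after every trial and stops as soon as it dips below 0.05 "passes" far more often than the nominal 5%.

```python
from math import comb
import random

def min_correct_for_significance(n: int, alpha: float = 0.05) -> int:
    """Smallest k such that P(at least k correct out of n | guessing) < alpha."""
    total = 2 ** n
    tail = 0
    for k in range(n, -1, -1):
        tail += comb(n, k)
        if tail / total >= alpha:
            return k + 1
    return 0

def peeking_false_positive_rate(max_trials: int = 40,
                                n_sims: int = 10000,
                                seed: int = 1) -> float:
    """Fraction of pure guessers who reach p < 0.05 at SOME point
    when they are allowed to stop as soon as the score looks good."""
    rng = random.Random(seed)
    thresholds = [min_correct_for_significance(n) for n in range(1, max_trials + 1)]
    passes = 0
    for _ in range(n_sims):
        correct = 0
        for n in range(1, max_trials + 1):
            correct += rng.random() < 0.5   # a coin-flip "answer"
            if correct >= thresholds[n - 1]:
                passes += 1                 # guesser "passes" and stops
                break
    return passes / n_sims

# With a fixed 16-trial test, a guesser passes only with >= 12 correct,
# i.e. with probability ~0.038 -- below the nominal 5%.
print(min_correct_for_significance(16))     # -> 12
# With peeking after every trial, the guesser's pass rate is far above 5%.
print(peeking_false_positive_rate())
```

The exact inflated rate depends on `max_trials` and the seed, but it is always several times the nominal 5%, which is why the number of trials (or a no-peeking rule) has to be set in advance.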
What about the samples? Many people, including JohnV, ff123 and others, have pointed out that different samples might seriously change the results. Roberto's tests are probably valid (he can't use 100 samples and force 200 HA members to participate), but conclusions built upon the final results are often... questionable. FAAC tied with Nero AAC, or WMA@128 close to a "perceptible but not annoying" difference.

"Your result is valid, V.A.L.I.D. Can't you read the Anova log I posted and its conclusion?"

OK, I was a bit angry. Sorry.

"(...) Therefore the test IS double-blind. (...) A simple blind test would be (...)"

Thank you for the explanation. I thought that a double blind test was a single blind test repeated twice.

"When, rating MPC superior to MP3 9 times out of 10, you get p < 0.05 in the ABC/HR Anova analysis, it is mathematically equivalent to succeeding in a fixed ABX test with p < 0.05."

But that's only true under certain conditions, isn't it? The level of degradation (artifacts) could also play a role, I suppose.

"It has not been much pointed out outside Roberto's tests, but ABC/HR can be a substitute for ABX."

I think it's time to explain this in a tutorial. I'm learning different things (though it's sometimes confusing). A tutorial would be useful. If I have further questions, I'll probably ask them in French (private message): comprehension will be easier for me. Anyway, thanks for the long explanations, and sorry again for the irritating tone of my previous posts.
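The "9 times out of 10" claim quoted above can be checked with a one-sided sign test, which is the binomial calculation underlying that equivalence. A minimal sketch (the function name `sign_test_p` is mine):

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """One-sided sign test: probability of at least `wins` of `n`
    paired ratings favouring one codec if both were truly equal."""
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# MPC rated above MP3 in 9 of 10 paired ABC/HR ratings:
print(f"{sign_test_p(9, 10):.4f}")  # -> 0.0107, comfortably below 0.05
```

The arithmetic is the same as for a 9/10 ABX run, which is why a consistent one-directional ABC/HR rating pattern can serve as a substitute for an explicit ABX test.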