My first listening test.

2005-11-22 22:28:14

Being unable to game for a while, I needed another way to spend the evening indoors, which brought me to something I've been meaning to do for a while: explore my hearing.
I've been using Vorbis q2 on my portable for years now and never had a problem with it (except with a decoder using Tremor in low-accuracy mode), but as I don't listen to the originals all that much, and certainly not to the original and the Ogg right after one another, I had no idea whether the Oggs and the originals actually sounded the same to my ears. I wasn't expecting to hear differences except on problem samples, however.

I started out by surfing to ff123.net. I had read about the various common artifacts, but apart from quantisation noise (I think) in certain MP3 sources (mostly web radio and older MP3s), I had never heard much that I could clearly identify and name.
ff123's artifact training page proved a wonderful resource. Some of the artifacts listed as very obvious seemed subtle to me (blackbird and dr4), whereas others in the obvious category were clear as daylight to my ears (wait, daughter). I decided to ABX rather than use ABC/HR because I wasn't sure I would even be able to hear differences, let alone rate them reliably; it's best to start simple.

I chose to test Vorbis q2 because I use this setting for my portable, and also because I hoped that, judging from the various other tests on this forum, 96 kbps would be a nice balance between popularity, quality and the ease with which artifacts can be spotted with my beginner ears. aoTuV b4.51 was chosen because it's the latest and is designed to bring improvements below q3, and preliminary testing indicates that it does. I like being on the cutting edge, so it's currently my Vorbis of choice.

Too lazy to make samples out of my own collection, and to make comparison easier for others, I chose ten samples from Rarewares, plus castanets.wav, for my first ever test. The list is as follows:

Sample name      Bitrate

castanets        113 kbps
41_30sec         110 kbps
Bachpsichord     113 kbps
female_speech     81 kbps
ItCouldBeSweet    84 kbps
kraftwerk        108 kbps
Layla            116 kbps
Mama             110 kbps
mybloodrusts      89 kbps
NewYorkCity       99 kbps
Waiting          100 kbps

Average bitrate reported by foobar (castanets excluded): 102 kbps (6.25% over the nominal 96 kbps).
Actually, with castanets included it's the same 102 kbps.
Most of these samples are music I own, music I like, or music similar to what I own. NewYorkCity is really the only one I'd never voluntarily listen to, but I put up with it in the name of science.
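As a quick check of the bitrate arithmetic, here is a minimal Python sketch that averages the rounded per-sample figures from the list above. The plain unweighted mean is an assumption made for illustration; foobar's own figure may be computed from unrounded or duration-weighted values, so a difference of about 1 kbps is to be expected. The 96 kbps nominal value is the one mentioned earlier in the post.

```python
# Rounded per-sample bitrates in kbps, copied from the list above.
bitrates = {
    "castanets": 113, "41_30sec": 110, "Bachpsichord": 113,
    "female_speech": 81, "ItCouldBeSweet": 84, "kraftwerk": 108,
    "Layla": 116, "Mama": 110, "mybloodrusts": 89,
    "NewYorkCity": 99, "Waiting": 100,
}

def mean(values):
    """Plain unweighted average (assumption: every sample counts equally)."""
    values = list(values)
    return sum(values) / len(values)

# Average over the ten Rarewares samples, then over all eleven.
without_castanets = mean(v for name, v in bitrates.items() if name != "castanets")
with_castanets = mean(bitrates.values())

# How far the reported 102 kbps sits above the 96 kbps nominal, in percent.
over_nominal = (102 - 96) / 96 * 100

print(f"without castanets: {without_castanets:.1f} kbps")
print(f"with castanets:    {with_castanets:.1f} kbps")
print(f"102 vs 96 kbps:    {over_nominal:.2f}% over nominal")
```

The percentage line reproduces the 6.25% quoted above; the two averages only approximate foobar's report because they start from rounded numbers.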

My setup is an SB Live! using the kX drivers. I listen to my music on the rear speaker output, which is connected to a Sony STR-AV200E tuner/amplifier. During testing I plugged my Sony MDR-EX71SL headphones into the amp and turned off the speakers. I think the headphone output is just wired to the amplifier through a resistor.
I used foobar's ABX plugin for the testing. The DSP plugins I used were the advanced limiter and the PPHS resampler, set to 48000 Hz and to Ultra mode for the occasion. I realise this is far from ideal, but it's all I have.

The results:

castanets: 16/16. Easy. The pre-echo sounds more like the jingle of a tambourine than a puff of air.
41_30sec: I would swear I heard differences, but time and time again I couldn't ABX them successfully. The placebo effect really is a powerful one.
Bachpsichord: Transparent.
female_speech: Transparent, but I didn't spend as much time trying to find differences as with the above two.
ItCouldBeSweet: 16/16. Pre-echo, especially on the first clack.
kraftwerk: 37/54. Smearing of transients, which seem to have less fidelity. This one's on the very edge of my hearing.
Layla: 14/16. Should be 15/16, I clicked the wrong one once. The applause at the start gives it away (slight warbling). I thought I heard a loss of coarseness on the snare drum too, but I didn't ABX the sample without the applause to verify.
Mama: 15/18. Loss of stereo on the big cymbal crash 16 seconds into the sample.
mybloodrusts: 16/16. The guitar amplifier's overdrive in the first few seconds is a bit less pronounced; the effect seems the opposite of the dr4 artifact on ff123's training page.
NewYorkCity: 14/16. Loss of [..] sharpness in the pronunciation of "friend", somewhat similar to mybloodrusts, but more subtle.
Waiting: 14/16. Couldn't find a difference at first, then suddenly it struck me: there is a warbling in the air of the singer's voice, most clear in the first few seconds.
If I did it again it'd be 16/16. I don't understand what could cause this warbling; there's no highly irregular noise like the applause in Layla, and the bitrate isn't especially low.

The results surprised me: although none of the artifacts I heard really annoyed me, I found more differences than I expected. If I had to give ratings, I'd say most would be a 4, with perhaps a 3 for castanets, Layla and Waiting. But I really think I need more experience before I can accurately quantify the severity. Sometimes the Ogg even appeared to sound better! I'm definitely going to keep using q2 for my portable, although if I had more space, I would probably go higher. I'm a pragmatic purist.

I just read another thread explaining that when the results aren't hidden, you shouldn't keep ABXing until you reach the desired confidence level (in my case 0.5% or better). With Mama I was distracted once, and I'm confident I would have no trouble reaching it in 16 tries if I did it again. kraftwerk is another matter. Then again, with such a high number of tries, what is the chance I would achieve this without actually hearing a difference? It seems solid enough to me, but I don't have a degree in statistics. If someone could shed some light on this, or if you have other feedback for me, please do tell!
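For a test with a number of trials fixed in advance, the "what is the chance" question has a simple answer: a one-sided binomial tail. Below is a minimal Python sketch, assuming every trial is an independent 50/50 guess. The function name abx_p_value is made up for illustration and is not part of foobar's ABX plugin. The scores are the ones listed in the results above. Note that this does not account for stopping the test at a moment when the running score happens to look good.

```python
from math import comb

def abx_p_value(correct, trials):
    """Chance of getting at least `correct` answers right out of `trials`
    by pure guessing (one-sided binomial tail with p = 0.5)."""
    favourable = sum(comb(trials, k) for k in range(correct, trials + 1))
    return favourable / 2 ** trials

# Scores as reported in the results section of this post.
results = {
    "castanets": (16, 16), "ItCouldBeSweet": (16, 16),
    "kraftwerk": (37, 54), "Layla": (14, 16), "Mama": (15, 18),
    "mybloodrusts": (16, 16), "NewYorkCity": (14, 16), "Waiting": (14, 16),
}

for name, (correct, trials) in results.items():
    p = abx_p_value(correct, trials)
    print(f"{name:15s} {correct:2d}/{trials:2d}  p = {p:.5f} ({p * 100:.3f}%)")
```

Under the fixed-trial assumption, 14/16 and 15/18 both come out below the 0.5% target, and 37/54 lands close to it, which is why the stopping rule matters most for kraftwerk.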

To give this test more than just personal value, and with the confidence that I can actually hear differences in most samples at q2, I've decided to repeat this with aoTuV b4 and, where neither is transparent, try to ABX the two against each other and attempt to describe the differences heard. I feel that although this has been done already, it wasn't backed up by cold hard ABX figures as thoroughly as I would've liked. EDIT: This means I'll probably expand this post with a follow-up later.

Last of all, a Big Thanks to ff123 for his wonderful page with resources related to testing, Rjamorim of Rarewares fame for providing the samples and for his public testing efforts, Guruboolez for inspiring me with his outstanding work, HydrogenAudio for getting me into all this in the first place, and Xiph in general and Aoyumi in particular for creating and improving the format that, in my opinion, deserves to rule the world.

Reply #1 – 2005-11-22 22:35:59

Cool. Another listening test geek created. ;-)

ff123

Reply #2 – 2005-11-22 22:37:42

Weeee, great to see people are getting so interested in listening tests.

Reply #3 – 2005-11-23 06:51:59

Quote

I just read another thread explaining that when the results aren't hidden, you shouldn't keep ABXing until you reach the desired confidence level (in my case 0.5% or better). With Mama I was distracted once, and I'm confident I would have no trouble reaching it in 16 tries if I did it again. kraftwerk is another matter. Then again, with such a high number of tries, what is the chance I would achieve this without actually hearing a difference?

The chance is higher than you might think. Then again, if you finish the test with a fixed number of trials and afterwards think that you could now score higher, just do another test and combine the two. Why not discard the first one? Well, think about the following:

ABXing the way we do it already suffers from a necessary evil: training. You train yourself to notice subtle differences in a single sample. You'd never do something like this during casual listening. That's also the reason why ABX scores tend to make codecs look worse than they really perform during casual listening.

Back to your hypothetical two tests. The first test in that case represents a "learning" factor. If you discarded it, you would increase the effect described above even more. Additionally, if the tester is allowed to discard tests, then he can cheat. All tests performed need to be taken into account (unless you decided *beforehand* that they are "test runs"). Now, you may ask yourself: "Well, if person A only does one test and it goes well, and person B does three combined tests of which one does not go well (maybe he was tired, etc.), doesn't that give person A's scores an unfair advantage?" You're right: the lower the total number of trials, the lower the confidence and the higher the risk of the tester just striking lucky. That's why some test presentations include a "margin of error".

There's one more reason why the first test may be very valuable. ABXing usually is not well suited to telling you anything about the "obviousness" and "annoyance factor" of a sample. It just tells you how well you were able to tell the difference. That's why two samples may both score 12/12 and yet one of them is much less annoying than the other. To some extent, one can deduce the obviousness from the ABX score, but for that to be accurate many trials are needed, and then training and stress effects come into play. In short, it's difficult. But suppose you have two tests: one where you were barely able to tell the difference, and a second one where you "got it" and scored high. The second test basically just shows that after some training you were very reliably able to tell the difference, but the first one tells you something about how difficult it was to get to that point. Thus, merging the two does not just avoid cheating, it also makes the result more useful and interesting.

- Lyx

edit: whether you show the results and fix the number of trials, or hide the results and stay flexible with the number of trials, is a matter of taste. Personally, I prefer to hide the results, because then I can spontaneously end the test when I feel that my ears are becoming tired and overstressed.
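The warning about watching the running score and stopping once it looks good can be put into numbers. The following is a minimal Python sketch under stated assumptions: a pure guesser (50/50 on every trial) checks the one-sided p-value after each trial and declares success the first time it drops to the target or below. The function names tail_probability and chance_of_false_pass are hypothetical, chosen for this illustration only. The calculation is exact (no random simulation): it tracks the probability of every running score that has not yet "passed".

```python
from math import comb

def tail_probability(correct, trials):
    """One-sided binomial tail: chance of >= `correct` hits by guessing."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

def chance_of_false_pass(max_trials, alpha):
    """Chance that a pure guesser who peeks after every trial sees a
    p-value <= alpha at least once within `max_trials` trials."""
    alive = {0: 1.0}   # correct-count -> probability, for runs not yet passed
    passed = 0.0
    for n in range(1, max_trials + 1):
        # One more trial: each surviving run is right or wrong with chance 1/2.
        spread = {}
        for k, prob in alive.items():
            for new_k in (k, k + 1):
                spread[new_k] = spread.get(new_k, 0.0) + prob / 2
        # Runs whose running score now looks "significant" stop here.
        alive = {}
        for k, prob in spread.items():
            if tail_probability(k, n) <= alpha:
                passed += prob
            else:
                alive[k] = prob
    return passed

for limit in (16, 54):
    rate = chance_of_false_pass(limit, 0.005)
    print(f"up to {limit} trials, peeking every trial: {rate * 100:.2f}% false passes")
```

Even with a cap of 16 trials the guesser's chance of a "pass" comes out above the nominal 0.5%, and it can only grow as the cap is raised. Hiding the results, or fixing the number of trials beforehand, avoids this inflation.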

Reply #4 – 2005-11-23 15:37:47

Hello all, (stands up, waves shyly) I'm a n00b (sits back).

Okay, after seeing all those listening tests, I want to do a listening test too. Nothing too ambitious. I just want to know at what q the encoded file gets unacceptable. And I am going to limit the test to Vorbis aoTuV b4.51.

The background of my test is that I use an iPAQ2210 with GSPlayer to play my song collection. Naturally I'm looking for the lowest file size with still acceptable quality. As I am not using a high-end audio device (Philips HS-320 earplugs) I don't need transparency.

I expect the result to be a table of ratings (acceptable, marginal, not acceptable) with the respective bitrates, cross-tabbed between sample name and q value, i.e.:

           q=10     q=9      ...  q=3     ...  q=-2
sample_1   A (192)  A (186)  ...  M (96)  ...  N (64)
sample_2

and so on.

Has this kind of test been done?

Pointers will be greatly appreciated.
