QT AAC, aoTuV (Vorbis), libopus, LAME (MP3) at high quality settings, Searching for transparency by ABX
Feb 8 2013, 20:15
Joined: 7-February 13
Member No.: 106471
My primary motivation for performing this listening test was to find the lowest QT AAC TVBR setting that is fully transparent for me, because I want to use that for my music collection. Secondary motivation was to find out how the other encoders compare to QT AAC at high quality settings.
This is my first rigorous listening test, and a rather extensive one, so I wanted to share the results with the audio community. I hope others may learn as much from this experiment as I did!
Results in a nutshell (for the impatient)
QT AAC was judged fully transparent at q91 and close to transparent at q82. The one sample in which I still heard a faint difference at q82 had a bitrate of only 128kbps at q82 and 159kbps at q91, so taking that into consideration together with the expected bitrates at q82, I would assume CBR files of 190kbps and up to be reasonably safe for my ears.
AoTuV (Vorbis) was judged very close to transparent at q5 and q6 and fully transparent at q7. If I were to use Vorbis for my music collection I would pick q6 because I think the tradeoff between file size and perceived sound quality is better at that preset than at q7. I would trust CBR files of 200kbps or greater.
Opus was judged fully transparent at VBR with target bitrate 224kbps, which is considerably higher than I expected based on previous reports. At target 192kbps I judged it untransparent, so there's no grey area like in AAC or Vorbis. Opus VBR seems to be a lot less variable than the other codecs, so in CBR mode I would trust Opus files of 230kbps and up.
LAME (MP3) was judged very close to transparent at V1 and V0 and fully transparent at c320. I would pick V0 if I were to use LAME for my music collection. In CBR mode I would trust files of 260kbps or greater.
QT AAC: my installation of Mac OS X included CoreAudio 3.2.6, QuickTime 7.6.6 and QuickTimeX 10.0. I used TVBR mode and overall encoder quality "max".
AoTuV: XLD included release 1. Apart from the target quality setting no options were shown.
Opus: XLD included libopus 1.0.2. I used VBR mode and framesize 20ms. opus-tools 0.1.6 also uses libopus 1.0.2.
LAME: XLD included version 3.99.5. I used VBR mode with -q2 and the new VBR method.
The test setup was in an apartment with reasonably good sound isolation, in a moderately quiet environment with singing birds and low traffic. During ABX trials I kept the room door and the ventilation window closed. Computer fans were turned down. Under those conditions while wearing the headphones, most of the time the only sound I heard was the low humming of the external hard drive that carried the samples. Usually I became unaware of that sound when actively listening to a sample.
I selected 15 samples from the LAME Quality and Listening Test Information page. In 8 out of those samples I didn't hear a difference at any of the encodings I've tested. The remaining 7 samples are numbered below. In addition I included a 10-second fragment from Central Industrial by The Future Sound of London which I had previously found to contain obvious artifacts when encoded with QT AAC q63:
Henceforth I'll refer to these samples by their numbers. See the appendix for detailed discussion of each sample.
General test procedure
As a general preparation I transcoded the WavPack samples to ALAC in order to make them playable in ABXTester. I always used the lossless original as sample A and the lossy compressed file as sample B. I took regular breaks in order to prevent fatigue. The measurements were spread over multiple sessions with almost a week between the first and the last session.
For each codec, I would first encode all samples at the middle preset, i.e. q63 for QT AAC, V5 for LAME, q4 for aoTuV and 96kbps for Opus. Then for each sample I would conduct ABX testing and conclude one of the following levels of quality:
By default I set the audio volume to 5 notches out of 16. If I didn't immediately hear a difference I tended to turn it up to 6 notches, for all samples except #1, which I experienced as very loud already. Occasionally I would try the sample with the channels reversed (by reversing my headphones) in order to test if something new might come to my attention.
After testing all samples at the middle preset I would proceed to higher presets with the samples in which I heard any difference, until I found the minimal preset at which I heard no difference or until I couldn't go higher. A preset was judged "fully transparent" if I heard no difference in any sample, "very close to transparent" if I heard a marginal difference in at most one sample, and "untransparent" otherwise. I decided to assign QT AAC q82 an intermediate category "close to transparent" because I heard a clear but very faint difference in one sample. More on that below. The overall search path from preset to preset generally went like a binary search or similarly "jumpy".
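The judging rule in the previous paragraph can be written down as a small Python sketch. The names `Difference` and `judge_preset` are made up for illustration, and the sketch deliberately leaves out the extra "close to transparent" category that I assigned by hand to QT AAC q82.

```python
from enum import Enum


class Difference(Enum):
    """Per-sample ABX verdict (hypothetical labels for this sketch)."""
    NONE = 0
    MARGINAL = 1
    CLEAR = 2


def judge_preset(verdicts):
    """Classify one preset from a {sample number: Difference} mapping."""
    values = list(verdicts.values())
    clear = sum(v is Difference.CLEAR for v in values)
    marginal = sum(v is Difference.MARGINAL for v in values)
    if clear == 0 and marginal == 0:
        # No difference heard in any sample.
        return "fully transparent"
    if clear == 0 and marginal == 1:
        # A marginal difference in at most one sample.
        return "very close to transparent"
    return "untransparent"


# aoTuV at q6: marginal difference in sample 3, none in sample 1.
print(judge_preset({1: Difference.NONE, 3: Difference.MARGINAL}))
```

Samples that were not retested at a higher preset can simply be left out of the mapping, or entered as `Difference.NONE`, since they were already indistinguishable at a lower preset.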
I executed the above procedure first for QT AAC, then for LAME, then aoTuV and finally Opus. During the course of the experiment I noticed I had become better at detecting artifacts, so in the end I returned to QT AAC to verify my end results for that encoder.
Observed bitrate range: varies wildly around the official expected value. For example, at q63 (135kbps expected) some samples had an average bitrate of 80kbps while others went over 190kbps.
Observed artifacts: even at medium bitrates (q27) most artifacts were slight changes in timbre or texture rather than very obtrusive stand-alone sounds. The exception is sample 8 which obtained some obvious, very sharp "ticks" after encoding which were audible up to q82 at 128kbps average file bitrate.
Stage 1: all samples at q63.
I heard no differences except for a clear difference in sample 8. I decided to ignore that for the moment and to continue my search downwards first.
Stage 2: samples 1-7 at q27.
I heard clear differences in samples 1, 2, 6, 7.
Stage 3: samples 1, 2, 6, 7 at q45.
I heard clear differences in samples 2, 6 but no difference in samples 1, 7.
Stage 4: samples 2, 6 at q54.
I heard no differences anymore and decided q54 to be fully transparent if disregarding sample 8.
Stage 5: sample 8 at q100.
Stage 6: sample 8 at q82.
Stage 7: sample 8 at q73.
Clear difference. I chose q82 as my search result for the time being.
Stage 8: samples 1, 2, 6, 7, 8 at q82 (verification after finishing the other encoders).
I did hear a clear difference in sample 8 after all, though I had to listen to A and B a few times before I noticed it. I heard no difference in the other samples.
Note: I have not reviewed stages 1-4. With my trained ears I might actually hear some additional differences at q54 or even q63 but I haven't tested.
Stage 9: sample 8 at q91.
No difference. I decided q91 to be my final search result for QT AAC.
Observed bitrate range: the spread is somewhat less than in QT AAC. Generally the highest and lowest average bitrates were within 30kbps of the expected bitrate for the given quality preset.
Observed artifacts: no standalone "objects", but changes in timbre or texture could be very un-subtle.
Stage 1: all samples at V5.
I heard clear differences in samples 1, 4, 6, marginal difference in sample 7 and no difference in samples 2, 3, 5, 8.
Stage 2: samples 1, 4, 6, 7 at V3.
Clear differences in samples 1, 6.
Stage 3: samples 1, 6 at V1.
Marginal difference in sample 1. I decided V1 to be my search result for the time being.
Stage 4: sample 1 at V2 (checking for consistency with aoTuV after finishing Opus).
Clear difference. I chose V0 as my final search result instead.
Stage 5: sample 1 at V0 (for completeness, shortly before starting this report).
Marginal difference (yes really, I believe I heard a difference and I identified 18 out of 25 Xs correctly: 72%, p=0.014).
Stage 6: sample 1 at c320.
No difference (at first I thought I heard a difference but ABX testing showed I didn't).
Observed bitrate range: average file bitrate is usually greater than the official target bitrate for the given quality preset. For example, the average bitrates at q4 were all greater than 128kbps. Upwards spread from the target bitrate seemed to be similar to QT AAC.
Observed artifacts: few and subtle. The marginal difference in sample 3 that I consistently heard up to q6 was an attenuation effect: the high frequency components were slightly softened.
Stage 1: all samples at q4.
Clear difference in sample 1, marginal difference in sample 3 and no difference in the other samples.
Stage 2: samples 1, 3 at q6.
Marginal difference in sample 3, no difference in sample 1.
Stage 3: sample 1 at q5.
Marginal difference. I decided q6 to be my search result.
Stage 4: sample 1 at q7 (for completeness, shortly before starting this report).
Observed bitrate range: average bitrates were always very close to the target bitrate, with a spread of less than 10kbps in each direction. I would compare Opus VBR to QT AAC ABR.
Observed artifacts: texture changes, some of them very severe, including "rattling" and "grinding" sounds. Usually the timbre became more "metallic".
Stage 1: all samples at target 96kbps.
Clear differences in samples 1, 2, 4, 5, 6, 7, no difference in samples 3, 8.
Stage 2: samples 1, 2, 4, 5, 6, 7 at target 192kbps.
Clear differences in samples 4, 6, no difference in samples 1, 2, 5, 7.
Stage 3: samples 4, 6 at target 256kbps.
Stage 4: samples 4, 6 at target 224kbps.
No differences. I chose 224kbps to be my search result.
Conclusions and recommendations
QT AAC and aoTuV are the clear winners in this comparison, with QT AAC achieving full transparency at the best compression ratio. I was a bit surprised to find that the highest quality preset is no overkill (for my ears) in LAME. Opus doesn't seem to perform exceptionally well (though better than LAME) at high bitrates although it's known to beat QT HE-AAC (more or less) at 64kbps. This is probably in part explained by the fact that Opus is still very young. Another explanation is that Opus might be more intended for low bitrates, which is somewhat suggested by the way it's described on the Opus home page.
According to the Hydrogenaudio wiki, most people find AAC to be transparent at about 150kbps, Vorbis at about 150-170kbps and LAME at about 160-224kbps. Given the results of this experiment, my ears might be slightly better than average.
If you wish to repeat this experiment, you might be able to save a lot of time by using my results as a hint where to find the most significant differences. The sample details in the appendix may help you to "look" in the right direction. In addition, you can probably start your searches for Opus and LAME at higher presets than I did.
If you just want to use this report as a hint for choosing your ideal encoder setting, I suggest that you perform a miniature version of my experiment using just a single sample in the encoder that you're interested in. If you hear a difference go up one preset until you don't, otherwise do the opposite by going down. Specifically:
For QT AAC, I would recommend listening to sample 8 and starting at q73. If you descend below q54 I recommend listening to samples 2, 6 instead.
For aoTuV, I would recommend listening to sample 3 and starting at q5. If you don't hear any difference switch to sample 1 at q4.
For Opus, you could take sample 4 at target 160kbps.
For LAME, I recommend listening to sample 1 starting at V3.
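The miniature experiment described above boils down to a simple stepwise search. Here is a minimal Python sketch of it. `find_lowest_transparent` and `hears_difference` are hypothetical names; in practice `hears_difference` stands in for a full ABX session at one preset, and the sketch assumes that once a difference disappears it stays gone at every higher preset.

```python
def find_lowest_transparent(presets, start, hears_difference):
    """Return the lowest preset at which no difference is heard.

    presets          -- presets ordered from lowest to highest quality
    start            -- index of the preset to begin with
    hears_difference -- callable(preset) -> bool, i.e. one ABX session
    Returns None if a difference is audible even at the highest preset.
    """
    i = start
    if hears_difference(presets[i]):
        # Difference heard: go up one preset at a time until it is gone.
        while True:
            i += 1
            if i == len(presets):
                return None
            if not hears_difference(presets[i]):
                return presets[i]
    # No difference heard: go down until the next lower preset is audible.
    while i > 0 and not hears_difference(presets[i - 1]):
        i -= 1
    return presets[i]


# Example with my own QT AAC outcome for sample 8:
# audible up to q82, not audible at q91.
qt_presets = [27, 45, 54, 63, 73, 82, 91, 100]
result = find_lowest_transparent(qt_presets, qt_presets.index(73),
                                 lambda q: q <= 82)
print(result)  # 91
```

Each preset is tested at most once, which matters because every call represents a series of ABX trials.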
Appendix: sample details
Loud applause, with a "thank you" yelled through a microphone shortly after the start. The "thank you" is loud but sounds a bit muffled because of the microphone and there's a faint echo to it.
In the lossless original the applause sounds "wet"; you could compare it to rain or perhaps to oil spattering in a hot pan. In audibly different encodings it may sound dryer, noisier and coarser, perhaps like sandblasting, or very coarse and metallic (in Opus at 96kbps target bitrate).
The "thank you" should be a separate sound layered on top of the applause, and should sound fairly smooth. In audibly different encodings you may expect it to interact with the applause in several ways:
Some sawtooth-like signal with an additional trill effect that seems to contain vowels. I'm not sure whether this is a heavily filtered human voice or just something creative from a synthesizer, but either way it sounds quite interesting.
At medium bitrates in QT AAC and Opus it sounded distorted and metallic.
Symphonic fragment with drums, trumpets, violins, vocals and some high-pitched string instrument which I think might be a steel guitar. There's also some high tinkling in the right channel which I suspect is an artifact in the original file coming from the string instrument. Sounds like a soundtrack to an epic 1960s movie.
In aoTuV you may find that the string instrument (the proper sound slightly to the left, not the tinkling in the right channel) is arpeggiated less sharply and sounds softer overall; I would call it a bit "timid" compared to the original.
Bagpipe playing a slow high-pitched melody over a constant bass. The sound is smooth overall although you'll find some irregularity especially in the second long-lasting high note. In the background there's the occasional hollow, raspy, low-pitched sound which might be either the bag being inflated by the artist or (a suggestion of) wind.
Focus on the long-lasting high-pitched notes, especially the very last one. In case of an audible difference you'll find that they sound metallic and/or less smooth, or even outright distorted (Opus at 96kbps target bitrate).
Drums (something that sounds similar to a conga or a djembe) playing a samba-like rhythm. At the start an alto voice sings "aaaa", which is a bit of a shame because the voice will not help you to distinguish the encoded sample from the original and it partially masks the drums.
In Opus at 96kbps target bitrate the high-pitched slap beats sound more metallic than in the original.
Western guitar playing a country tune.
At lower bitrates you might recognise the encoded sample directly because it sounds metallic and perhaps even a bit distorted. At high bitrates you might be able to make out the difference if you focus on the initial arpeggio and the final note. The last note of the initial arpeggio (which lasts longer than the previous notes) might sound a bit rougher than in the original. The final note might sound metallic. The latter difference is probably easier to hear than the former. You probably won't find a difference in the chords.
Monotone (synthetic) drum rhythm with bass, a big tom beating on every second bass beat, an open-closing hi-hat in the right channel alternating with the bass beat, and another closed hi-hat in the left channel beating four times for every bass beat.
You'll only hear a difference at the lesser quality settings, and you are most likely to find it in the closed hi-hat in the left channel.
Synthesizer music of fairly low complexity.
Frankly, the sounds aren't really important, because the main reason to listen to this fragment is the sharp ticks that are introduced by QT AAC. I don't think I need to tell you where they are because you're pretty much guaranteed to hear them at q63 and below.
Since this sample isn't available from the LAME Quality and Listening Test Information page, I made it available for download over here: https://dl.dropbox.com/u/3512486/central%20industrial.m4a
Feb 9 2013, 14:33
TLDR version: greynol is right that fb2k does something slightly different from what I do. I figured out what it calculates and will do the same from now on. So from now on my p-values will be compatible with those of everyone else at the HA forums.
I calculate the probability of a false positive, i.e. the probability that I would get the score if I were guessing randomly.
The probability of identifying a single trial correctly by luck is 0.5, and the probability of identifying it incorrectly is also 0.5. So the probability of identifying N trials all correctly (or all incorrectly) is 0.5^N.
If you identify K out of N trials correctly (meaning some but not all of them) then the probability is greater because there are multiple ways to get exactly K out of N trials correct. For example, if you get 14 out of 15 trials correct then the single incorrect trial might be the first, the second, and so on, so there are 15 ways to get that score, so I'd have to multiply 0.5^15 by 15 to get the true probability of a false positive.
The overall formula for a false positive with exactly K out of N trials correct is 0.5^N * (number of ways to get exactly K out of N correct). This also applies to getting all of the trials or none of them correct, because in both cases there's only one way to do that, so you're just multiplying 0.5^N by 1. The "number of ways to get exactly K out of N correct" is known mathematically as the number of combinations, or the binomial coefficient. In computer-related contexts this is often written as choose(N, K). Therefore the more formal way to write the false positive probability formula is 0.5^N * choose(N, K). This is the binomial distribution (actually a special case of it, because the probability of success is equal to the probability of failure, but that doesn't matter now).
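For anyone who wants to check the numbers, here is a minimal Python sketch of the formula 0.5^N * choose(N, K). The function name `point_probability` is made up for illustration; `math.comb` is the standard library's choose function and needs Python 3.8 or later.

```python
from math import comb


def point_probability(k, n):
    """Probability of getting exactly k out of n ABX trials right by guessing."""
    return comb(n, k) * 0.5 ** n


# 14 out of 15 correct: 15 ways to place the single wrong trial.
print(point_probability(14, 15))

# The 18 out of 25 score from my LAME V0 test.
print(round(point_probability(18, 25), 3))  # 0.014
```

Note that this is the probability of hitting exactly that score, not the probability of doing at least that well.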
I checked whether foobar2000 is calculating the same thing using the first log in this topic. You are right that it isn't the same:
foobar2000 p-value at the fifth trial: 18.8%
what I would calculate: 0.5^5*choose(5, 4) = 0.15625 =~ 15.6%
I checked every fifth value and it turns out that my p-values are consistently close to, but slightly smaller than, the foobar2000 p-values. Fortunately I know where this difference is coming from: foobar2000 is calculating the probability of a false positive if you correctly identify K or more trials out of N. See the math:
0.5^5*choose(5, 4) + 0.5^5*choose(5, 5) = 0.1875 =~ 18.8%
It also works for the other trials. For example, here's the tenth trial:
foobar2000 p-value: 5.5%
doing it manually: 0.5^10*choose(10, 8) + 0.5^10*choose(10, 9) + 0.5^10*choose(10, 10) = 0.0546875 =~ 5.5%
And the sixteenth trial:
manually: 0.5^16 * (choose(16, 14) + choose(16, 15) + choose(16, 16)) = 0.002090454 =~ 0.2%
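The "K or more out of N" sums worked out by hand above can be reproduced with a short Python sketch. The name `abx_p_value` is made up, and the code only mirrors my reading of the log; it is not taken from foobar2000 itself. `math.comb` needs Python 3.8 or later.

```python
from math import comb


def abx_p_value(k, n):
    """Probability of getting k or more out of n trials right by guessing."""
    # Add up choose(n, i) for every score i that is at least as good as k.
    ways = sum(comb(n, i) for i in range(k, n + 1))
    return ways * 0.5 ** n


print(abx_p_value(4, 5))    # 0.1875     (fifth trial)
print(abx_p_value(8, 10))   # 0.0546875  (tenth trial)
print(abx_p_value(14, 16))  # sixteenth trial, about 0.2%
```

Because the sum always includes the "exactly K" term, this value can never be smaller than 0.5^N * choose(N, K), which matches what I saw in the log.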
I'll adopt the foobar2000 style p-values in my future tests in order to make my outcomes comparable with those of other people at the HA forums. It's to my own benefit as well as the slightly larger p-values will force me to be slightly more cautious. Thank you for making me aware of this difference!