Personal evaluation at ~130..135 kbps, 200 samples, AAC (iTunes, Nero) - MP3 - Vorbis aoTuV
guruboolez
post Nov 15 2005, 09:14
Post #1





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



Preliminary notes

Two years ago I performed and published my first two listening tests. Both included different formats and encoders at ~130 kbps and involved a dozen samples: classical music only. My purpose was to see which encoder could produce the best encoding at a friendly bitrate (friendly for portable players) and for a specific kind of music. iTunes AAC & WMAPro appeared to be the best encoders (for me), and the absolute quality of both encoders at such a bitrate surprised me. Last year (December 2004) I performed two similar tests: the first was dedicated to AAC (Nero, Apple, old and new encoders) and the second was a match between the best AAC encoder (Nero Digital “fast” VBR) and the most advanced Vorbis one (aoTuV beta 3). Quality and enjoyment were even higher!

This year I performed a fresh multiformat listening test at 130 kbps. This new test is very different from its predecessors from a methodological point of view. I progressively improved my approach to listening tests and tried to answer all the criticism addressed in the past to previous tests (not necessarily mine). Consequently, my “personal evaluations”, which were at first a friendly exercise feasible in one rainy autumn afternoon, now look like a gigantic task which took me approximately 10 days (shared with family, friends, job, and discouragement) to complete. I improved several points of the methodology; to sum them up:


diversity : the following test is not only based on “classical” music, and also includes several (fifty!) samples of “modern” music.
grading : my grading was once described as “temperamental”, so I decided to keep all marks between two anchors, a low and a high one. This decreases the contrast between different encoders and at the same time increases the difficulty of the whole exercise, but it should also ensure more accurate grading. The low anchor is vital to prevent excessively harsh grading; the high anchor is essential to temper enthusiasm: a very good encoding at 130 kbps should be marked with regard to an excellent, high-bitrate one. The presence of both anchors should guarantee fair grading: not too low, not too high.
complexity : some listening tests have been reproached for focusing only on “critical” or “complex” samples. This may be a problem with some VBR implementations, which sometimes lower the bitrate too much on “non-complex” samples. In my opinion, a listening test should include both types of samples, at least to verify that non-complex/low bitrate parts are encoded as well as complex/high bitrate ones. Usually, VBR encoders handle non-complex parts very well. Usually… The complexity range of my gallery of samples is wide enough to represent all situations (from ultra-low bitrates to ultra-high ones) and to check the strength of VBR implementations.
abundance : a bunch of 12…15 samples is maybe not enough to give an accurate idea of the strengths and weaknesses of different encoders. I experienced it myself in the past: my previous tests didn’t reveal some problems I only noticed afterwards in real usage and, more importantly, they were unable to expose the recurrence of the detected problems. Detecting one problem (like rumbling or ringing) is one thing; measuring how often this problem occurs is another. My test is based on 200 samples; this number should be enough to expose all common problems plus several uncommon ones, and is also sufficient to get an idea of how often they recur. This is in my opinion the biggest advantage of my personal listening test over collective ones (which must stay friendly to avoid discouragement and attract a lot of testers).
statistical analysis : it might appear trivial to mention this, but statistical analysis of the results and confidence bars are present (they were not used last year or the year before).
“Apples and Oranges” : no need to recall the problem. This test only involves VBR encoders. No debate this time.
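
The post does not say which statistical procedure produced the confidence bars. As a rough illustration only, here is a minimal Python sketch that assumes a plain normal-approximation 95% interval on the mean mark of one encoder. The function name and the marks are made up for the example and are not taken from the test.

```python
import math
import statistics


def mean_with_ci(marks, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Assumption: normal approximation on the mean; the original test may
    have used a different procedure.
    """
    mean = statistics.mean(marks)
    # Standard error of the mean, from the sample standard deviation.
    sem = statistics.stdev(marks) / math.sqrt(len(marks))
    return mean, z * sem


# Hypothetical marks on the 1.0 to 5.0 scale, for illustration only.
hypothetical_marks = [4.5, 4.0, 5.0, 3.5, 4.2, 4.8, 3.9, 4.6]
mean, half_width = mean_with_ci(hypothetical_marks)
print(f"mean mark {mean:.2f} +/- {half_width:.2f}")
```

Two encoders whose intervals overlap heavily cannot be separated with much confidence, which is the point of showing bars next to the averages.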


THE TEST: CHOICE OF ENCODERS


The market of audio encoders is ruled by a Darwinian process: only the strongest survive. Between my first test (October 2003) and this one (November 2005), only a few encoders really progressed. Most others (some of them still in use) are unchanged or only changed once: MPC, WMA (Standard and Pro), faac, all MP3 encoders (except LAME). Another one appeared and disappeared in the meantime (Compaact!).
On the hardware side, the situation is now very different from the one I knew two years ago. Back then, with the exception of one or two devices, AAC and Vorbis support in hardware players was more a dream than a reality. Testing different audio formats was useful for a virtual and open future, rich in dreams and promises. Now, the concrete situation is more interesting than dreams. MP3 and WMA (Std) are still the two well-established formats, but Vorbis now benefits from growing interest from several manufacturers, and if AAC still looks like an Apple monopoly, the iPod market has at least mutated into several forms (flash memory players, Microdrive™ based jukeboxes). One victim of reality is WMAPro, still not supported; and the growing popularity of WMA labeled as PlaysForSure (based on WMA Std) seems to sentence WMAPro to a long exile.
For all these considerations, I restricted the test to the most usable and interesting encoders: AAC-LC (highly developed by Apple and Nero Digital), MP3 (vigorous as ever, thanks to LAME devs), Vorbis (saved from inertia by Aoyumi). Besides these four encoders, I added two anchors. More precisely:

Apple AAC: I used iTunes 6.0.0.18 (based on QuickTime 7.03), at 128 kbps and with the recently added VBR mode. This is the first time I have tested Apple AAC in VBR. I sadly discovered that this encoder uses the same trick as the MP3 encoder included in iTunes: the frame size never drops below the targeted bitrate (apart maybe from digital silence). In other words, for 128 VBR encodings the bitrate starts at 128 kbps and increases with complexity. Needless to say, if the average bitrate stays close to the target, the variations are necessarily limited. One advantage: this restricted mode prevents the VBR engine from using inadequately low bitrate frames, and should guard against bad surprises compared to a CBR encoding.

Nero Digital AAC: I used the very new encoder released two weeks ago (aac.dll v.3 and aacenc32.dll v.4.2.1.0), in VBR mode too. The -internet profile is the closest to 128 kbps (slightly lower with classical music, but higher with non-classical). I didn’t use the “fast” mode, which is now pretty similar but probably inferior to the “high” one.

LAME MP3: I used the latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 command to the -V5 --vbr-new mode. For the second group of samples, and to slightly lower the bitrate, I simply used -V5 --vbr-new.

Vorbis: I used aoTuV beta 4 (4.5 was released during the testing phase) instead of the official 1.1.1, which corresponds to the 18-month-old aoTuV beta 2. I used -q4,25 for the first group and -q4,00 for the second.

As low anchor, I looked for something really low and also usable in batch mode. I found a very old AAC encoder on ReallyRareWares called mbaacencoder version 0.3: it’s awfully slow, the quality is terrible, and as an anecdote it is ideal to get an idea of all the progress made around AAC between 1999 (release date of mbaacencoder) and 2005 (Apple and Nero Digital). I tried to get joint stereo and LC profile in batch mode, but the encoder apparently stayed in default mode (Main Profile, 128 kbps and dual stereo).

As high anchor, I didn’t hesitate and used LAME 3.97 beta 1 -V2 --vbr-new (or --preset standard), which is a reference for efficient, high-quality and universal encodings. Furthermore, it would be interesting to evaluate the remaining gap between modern implementations of AAC and Vorbis at ~128 kbps and HQ MP3 at ~192 kbps.





SAMPLES

The test hinges on two big groups of samples: 150 for the “classical” music group and 50 for the “non-classical” (or “various”, or “modern”, or “popular”… choose your own) group. I already used the first group in three different tests in the past (80 kbps, 96 kbps, and LAME -V5). The complete collection is available for download. The 2nd group consists of all (35) non-classical samples used in previous collective listening tests; they’re all available on rarewares. To decrease the gap between the first and the second group I’ve added 15 other samples, all recently submitted for the postponed 64 kbps listening test of Sebastian Mares. Most of these last files may still be available.


THE BITRATE

The bitrate comparison is more accurate for the first group: it’s based on full tracks (6 min 30 sec. per file on average) instead of short samples (10 sec. on average), and, last but not least, the complete collection is very representative of my entire library. For the second group of samples, I proceeded differently and based the bitrate calculation on the 50 samples themselves (which are longer: 24 sec. on average) and on external data (a bitrate table for LAME posted by someone else). This way of evaluating the bitrate is not very precise, but I don’t have enough material to build a more accurate bitrate table. That’s why I tried to minimize the difference in bitrate between all settings, and changed the command line for Vorbis (from -q4,25 to -q4,00) and LAME (--athaa-sensitivity 1 was removed).
To sum up the data (a complete bitrate table will follow in the next days):
CODE

CLASSICAL (full tracks)

low anchor 128,00 kbps (estimated)
AAC iTunes 133,33 kbps [+4,16 %]
AAC Nero 125,71 kbps [-1,79 %]
MP3 LAME 130,81 kbps [+2,20 %]
Vorbis aoTuV 131,69 kbps [+2,88 %]
high anchor 181,46 kbps [+41,77 %]


NON-CLASSICAL (short samples)

low anchor 128,00 kbps (estimated)
AAC iTunes 137,31 kbps [+7,27 %]
AAC Nero 134,10 kbps [+4,76 %]
MP3 LAME¹ 137,82 kbps [+7,67 %]
Vorbis aoTuV² 133,42 kbps [+4,23 %]
high anchor 196,28 kbps [+53,34 %]

¹ with --athaa-sensitivity 1 bitrate reaches 139,38 kbps
² with -q4,25 bitrate reaches 140,21 kbps
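
The bracketed percentages in the table appear to be the deviation of each average bitrate from the 128 kbps target (which is also the estimated bitrate of the low anchor). A minimal Python sketch under that assumption, using the classical-group figures from the table; the names `TARGET_KBPS` and `deviation_percent` are illustrative only:

```python
# Assumption: percentages are measured against the nominal 128 kbps target.
TARGET_KBPS = 128.0


def deviation_percent(bitrate_kbps, target_kbps=TARGET_KBPS):
    """Relative deviation of an average bitrate from the target, in percent."""
    return (bitrate_kbps - target_kbps) / target_kbps * 100.0


# Average bitrates of the classical group, copied from the table above.
classical = {
    "AAC iTunes": 133.33,
    "AAC Nero": 125.71,
    "MP3 LAME": 130.81,
    "Vorbis aoTuV": 131.69,
    "high anchor": 181.46,
}

for name, kbps in classical.items():
    print(f"{name:<13} {kbps:7.2f} kbps [{deviation_percent(kbps):+.2f} %]")
```

The same function can be applied to the non-classical figures to see how far each setting drifts from the target on short samples.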




TESTING CONDITIONS

The full test consists of pure ABC grading. The double-blind test conditions are ensured by schnofler’s ABC/HR 0.5 beta (2005.08.31) software. All samples were decoded by CLI decoders within ABC/HR; offsets were removed each time and minor differences in gain were systematically corrected (the highest difference reached 1.2 dB). A small note for Vorbis: all files were decoded with foobar2000 (I still can’t make ABC/HR decode Vorbis files). There are no ABX comparisons: it’s a luxury I can’t afford with 1200 files awaiting evaluation (200 x 6). If a difference is really unsure, I don’t rank the file. In the end I ranked the reference instead of the encoded file 16 times (6 of these mistakes concern the high anchor). The error rate is below 1.5%. I didn’t discard the errors from the final results (they don’t have a significant impact).
My hardware setup: Beyerdynamic DT-531 headphones; Audigy2 soundcard; Onkyo A-5 amp.
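
The figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the variable and function names are illustrative, and reading the 1.2 dB figure as an amplitude (not power) ratio is an assumption.

```python
SAMPLES = 200
CONTENDERS = 6  # four encoders plus the low and high anchors

total_files = SAMPLES * CONTENDERS


def error_rate_percent(mistakes, total):
    """Share of gradings where the reference was ranked instead of the encoding."""
    return mistakes / total * 100.0


def db_to_amplitude_ratio(gain_db):
    """Convert a gain difference in dB to a linear amplitude ratio (assumed 20*log10)."""
    return 10 ** (gain_db / 20.0)


print(total_files)                                    # 1200
print(round(error_rate_percent(16, total_files), 2))  # 1.33
print(round(db_to_amplitude_ratio(1.2), 3))
```

So 16 mistakes out of 1200 gradings is about 1.33%, consistent with the "below 1.5%" figure.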



DREAM AND REALITY…


Last words before posting the results: I planned to write a complete review, including a complete synthesis of the most common problems encountered in this test. Different encoders have different problems, and some of them are recurrent. For example, LAME often produces a weird kind of rumbling (noise in low frequencies) and smearing; Vorbis still has issues with what I call “microdetails” (blurred and replaced by noise) and sometimes coarseness; iTunes sometimes suffers from a form of ringing I can’t define; Nero Digital has serious trouble on tonal passages and poor pre-echo performance.
I haven’t compiled this memento yet; it should interest developers more than users. But I am publishing the results now, because I feel that it’s time for me to close this test (honestly, seeing ABC/HR running somewhere drives me mad or sick).
Results are published as big png files; file size is not an issue (only 111 kb) but the image size may cause issues on small display resolutions (800x600). I apologize for the inconvenience. Short comments follow the graphs. Here again, I planned to write more detailed comments, but I fear that the week-end and maybe the month will be over before I achieve what I planned to do. I postponed several activities during the last two weeks to perform and present this test, but I can’t continue anymore. If I remember correctly there’s a life outside ABC/HR :) I also suspect that most people don’t read comments or details and are more interested in the final ranking. That’s why the results I’ll post today are a bit in “raw” form. I sincerely apologize, and will try to (slowly) give more textual substance in the next days. Now, the results.

This post has been edited by guruboolez: Nov 15 2005, 09:38
guruboolez
post Nov 15 2005, 09:14
Post #2





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



RESULTS




I. CLASSICAL: 5 electronic/artificial samples micro-group









II. CLASSICAL: 60 orchestral & chamber samples macro-group








III. CLASSICAL: 55 solo instruments samples macro-group









IV. CLASSICAL: 30 samples macro-group









V. NON-CLASSICAL or MODERN or VARIOUS: 50 samples macro-group





This post has been edited by guruboolez: Jan 29 2006, 13:58
guruboolez
post Nov 15 2005, 09:15
Post #3





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



A few words to conclude the test…
It’s pretty clear that all encoders tested here deliver good or even very good output quality. There is currently no winner between AAC (iTunes) and Vorbis. It’s funny to see that the results are so close at the finish line when the problems are so different. Encodings are not fully transparent, but quality is in my opinion excellent most of the time (but not always).
LAME gives MP3 the chance to stay competitive against AAC and Vorbis. Not fully competitive, but the efficiency of this format commands respect.
Nero Digital’s implementation of AAC is slightly disappointing, especially with classical music, which is still a weak point of this encoder. But the quality is far from a disaster (which wasn’t the case two years ago), is on average really good, gets even better with “non-classical” music and should satisfy several users.
Last but not least, the difference among all these encoders is really small (don't look too closely at the "zoomed" plots :) )

But the average mark is somewhat misleading. LAME’s quality is ~0.5 point lower than iTunes or Vorbis, but that doesn’t mean, for example, that the quality of encoded albums is 0.5 point lower. This lower ranking is rather the expression of higher fragility than of lower quality. LAME and Nero Digital are more prone to serious distortions than Vorbis or iTunes AAC at the same bitrate. With such encoders, the concept of quality may be replaced by the concept of strength or robustness. To illustrate this I made the following histogram (sorry for the poor quality, I’ll change it later):



Here, Vorbis and iTunes both get a mark between 4.5 and 5.0 for 50% of the tested samples, whereas Nero only achieves this level (near-transparency or full transparency) for 20% of the same samples. With the classical group of samples, 30% of them were ranked below 3.0 with Nero, whereas iTunes and Vorbis got such marks for less than 10% of the samples. The two winners are stronger, and can handle more situations than LAME and Nero Digital AAC.
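
The percentages read off the histogram are simply shares of samples whose mark falls in a given range. A minimal Python sketch of that computation; the marks are made up for illustration (the per-sample marks are not given in the post) and the function names are hypothetical:

```python
def share_in_range(marks, low, high):
    """Percentage of marks m with low <= m <= high."""
    hits = sum(1 for m in marks if low <= m <= high)
    return hits / len(marks) * 100.0


def share_below(marks, threshold):
    """Percentage of marks strictly below the threshold."""
    hits = sum(1 for m in marks if m < threshold)
    return hits / len(marks) * 100.0


# Hypothetical marks for one encoder, NOT data from the test.
example_marks = [5.0, 4.5, 4.0, 2.5]

print(share_in_range(example_marks, 4.5, 5.0))  # share of (near-)transparent samples
print(share_below(example_marks, 3.0))          # share of seriously flawed samples
```

Comparing these two shares between encoders is one way to express the "robustness" idea: two encoders can have similar averages while one of them fails badly on many more samples.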

This post has been edited by guruboolez: Dec 29 2005, 22:57
PoisonDan
post Nov 15 2005, 09:30
Post #4





Group: Members (Donating)
Posts: 678
Joined: 10-December 01
From: Belgium
Member No.: 622



QUOTE (guruboolez @ Nov 15 2005, 10:14 AM)
(honestly, seeing ABC/HR running somewhere drives me mad or sick).
*

I can imagine that. Boy, performing this test must have been such a huge task... I'm extremely impressed! blink.gif

Thanks a lot for sharing this with us, it's very interesting (especially now that you also included non-classical music).

My hat's off to you, Sir!


--------------------
Over thinking, over analyzing separates the body from the mind.
vinnie97
post Nov 15 2005, 09:34
Post #5





Group: Members
Posts: 472
Joined: 6-March 03
Member No.: 5360



Bravo! You are much braver and more patient than I am! It would seem that buying from the iTunes store isn't such a bad quality sacrifice, going by your test. ;) Also, it's too bad aoTuV saw another update in the middle of your test... now you have to start again... only kidding! I don't think the quality level you tested was tuned any further in 4.5.

Thanks again, your blind tests are one of the top attractions around here.
guruboolez
post Nov 15 2005, 09:42
Post #6





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



Changes in aoTuV beta 4.5 are for lower settings (up to -q3,00). Fortunately, I would (exceptionally) say ;)
krazy
post Nov 15 2005, 09:45
Post #7





Group: Members
Posts: 493
Joined: 3-June 03
Member No.: 6981



Once again guruboolez, thank you for your amazingly informative tests! And thanks for subjecting your ears to the rigours of modern music.. :D

It's also nice to see that Aoyumi's work on vorbis is keeping it at the forefront of modern audio compression.
Synthetic Soul
post Nov 15 2005, 09:53
Post #8





Group: Super Moderator
Posts: 4887
Joined: 12-August 04
From: Exeter, UK
Member No.: 16217



Thank you guruboolez.

These tests are so important to the community.


--------------------
I'm on a horse.
arman68
post Nov 15 2005, 10:13
Post #9





Group: Members
Posts: 111
Joined: 11-December 01
Member No.: 625



shock1.gif

QUOTE (guruboolez)
Consequently, my “personal evaluations” which were first a friendly exercise feasible in one rainy, autumnal afternoon now looks as a gigantic task which took me approximately 10 days (shared with family, friends, job, and discouragement) to complete. I improved several point of the methodology


I am in awe...

I always do my own personal ABX test for my personal usage, but it is nothing compared to the enormous amount of work you do. Your tests and public results are very much appreciated, thank you.

edit: just finished reading the test results twice (to go through all the details), and I find it interesting that Nero still does not match Itunes, even though it uses a true VBR mode, whereas Itunes does not. I have been testing the new nero codec in VBR LC mode at lower bitrates for my W800i, and have been disappointed by it. What I did not do is compare it to Itunes. I will now.

This post has been edited by arman68: Nov 15 2005, 10:24
Daijoubu
post Nov 15 2005, 10:32
Post #10





Group: Members
Posts: 98
Joined: 22-February 03
From: Quebec, Montreal
Member No.: 5117



That must have been a heap load of data to compile ohmy.gif
Did you nose bleed? tongue.gif
tycho
post Nov 15 2005, 10:40
Post #11





Group: Members
Posts: 345
Joined: 5-August 03
Member No.: 8183



Invaluable tests again. Thank you so much. Vorbis aoTuV is the leading codec at medium bitrates (tied with iTunes AAC). And from other tests you did, Vorbis also shines at low and high bitrates. Nice to confirm that LAME -V2 --vbr-new is still superior to iTunes AAC at medium bitrates (and pretty tied with AAC ~180kbps, I guess).
LadFromDownUnder
post Nov 15 2005, 10:41
Post #12





Group: Members (Donating)
Posts: 90
Joined: 30-July 03
From: New Zealand
Member No.: 8083



As we say "down under", "Good on ya, mate!"
Atlantis
post Nov 15 2005, 11:18
Post #13





Group: Members
Posts: 250
Joined: 27-December 02
From: ROMA, Italy
Member No.: 4269



Thanks Guru!


--------------------
Vital papers will demonstrate their vitality by spontaneously moving from where you left them to where you can't find them.
ffooky
post Nov 15 2005, 11:41
Post #14





Group: Members
Posts: 261
Joined: 8-July 04
Member No.: 15184



Cheers Guru, fascinating stuff.
ilikedirtthe2nd
post Nov 15 2005, 11:45
Post #15





Group: Members
Posts: 470
Joined: 26-October 01
From: Germany
Member No.: 352



Thanks a lot, very interesting test again!
robert
post Nov 15 2005, 11:46
Post #16


LAME developer


Group: Developer
Posts: 788
Joined: 22-September 01
Member No.: 5



Thanks Guruboolez, very informative.

About LAME encoder being not well balanced:
QUOTE
LAME MP3: I used latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 command to the -V5 --vbr-new mode. For the second group of samples and to slightly lower the bitrate I simply used -V5 --vbr-new.

I'm wondering, would your results be different if the encoder settings had been the same for the classical and non-classical groups?
Enig123
post Nov 15 2005, 12:07
Post #17





Group: Members
Posts: 208
Joined: 11-April 02
Member No.: 1749



Oh. I don't think this test can be ignored only because it was done by just one person. Nero really needs to do some work to improve their AAC implementation (maybe it's already in Ivan's brain ;) ).

Thank you guruboolez for your great work.
dimzon
post Nov 15 2005, 12:09
Post #18





Group: Banned
Posts: 149
Joined: 1-September 05
Member No.: 24248



Thanx!
guruboolez, how about a low-bitrate comparison (64 kbps and below)?
pest
post Nov 15 2005, 12:15
Post #19





Group: Members
Posts: 208
Joined: 12-March 04
From: Germany
Member No.: 12686



blink.gif

you must be crazy
impressive work!

thanks a lot guruboolez
Zurman
post Nov 15 2005, 12:38
Post #20





Group: Members
Posts: 238
Joined: 22-February 04
Member No.: 12193



Nice listening test, as always wink.gif
rjamorim
post Nov 15 2005, 13:11
Post #21


Rarewares admin


Group: Members
Posts: 7515
Joined: 30-September 01
From: Brazil
Member No.: 81



Awesome, awesome, awesome.


Very big thanks, Francis. You're a legend. smile.gif


--------------------
Get up-to-date binaries of Lame, AAC, Vorbis and much more at RareWares:
http://www.rarewares.org
guruboolez
post Nov 15 2005, 13:23
Post #22





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (robert @ Nov 15 2005, 11:46 AM)
QUOTE
LAME MP3: I used latest alpha of 3.98 (alpha 2) in order to add the --athaa-sensitivity 1 command to the -V5 --vbr-new mode. For the second group of samples and to slightly lower the bitrate I simply used -V5 --vbr-new.

I'm wondering, would your results be different if the encoder settings had been the same for the classical and non-classical groups?
*


I don't think so. The --athaa-sensitivity command prevents a specific kind of ringing (I usually call it "background ringing"), and I don't remember any sample of the second group suffering from this problem (there are maybe one or two of them).

I already noticed this disparity in performance between the classical group and the "various" samples during my summer listening tests performed at 80 kbps and 96 kbps.
The difference is also not very large. And as you can see on the distribution histograms, the main difference occurs in the last part (ranking > 4.5). ~40% of the tested samples (classical) were ranked below 4.5 with LAME, but the proportion falls to 20% for the second category. It seems that for LAME, there are more "easy"-to-handle situations in my sample gallery than among the 50 samples I collected from various listening tests. (I don't know if I'm really clear...).
guruboolez
post Nov 15 2005, 13:28
Post #23





Group: Members (Donating)
Posts: 3474
Joined: 7-November 01
From: Strasbourg (France)
Member No.: 420



QUOTE (dimzon @ Nov 15 2005, 12:09 PM)
Thanx!
guruboolez, how about a low-bitrate comparison (64 kbps and below)?
*

I'm not very happy with the quality of current encoders at this bitrate. It's not really suitable for my personal use. Curiosity would therefore be my only motivation for such an exercise.
Busemann
post Nov 15 2005, 13:36
Post #24





Group: Members
Posts: 730
Joined: 5-January 04
Member No.: 10970



Surprising to see how close Vorbis and iTunes are to the high anchor. I guess one could safely use 160kbps VBR for transparency with iTunes now (I previously used 192kbps).
QuantumKnot
post Nov 15 2005, 14:27
Post #25





Group: Developer
Posts: 1245
Joined: 16-December 02
From: Australia
Member No.: 4097



To guruboolez, thank you for yet another incredibly fascinating and informative listening test. smile.gif

I am again very pleased to see Vorbis doing so well. Full credits to Aoyumi for his wonderful work. I'm also very pleased to see iTunes AAC doing so well too. It seems we do get value for money with these two encoders (ie. they're free!!! even better biggrin.gif )
