Multiformat listening test @ ~64kbps: Results, Results and post-test discussion
IgorC
post Apr 12 2011, 00:40
Post #1





Group: Members
Posts: 1572
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



The test is finished, results are available here:

http://listening-tests.hydrogenaudio.org/igorc/results.html

Summary: CELT/Opus won, Apple HE-AAC is better than Nero HE-AAC, and Vorbis has caught up with Nero HE-AAC.
Garf
post Apr 12 2011, 01:02
Post #2


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



If someone can assist with a bitrate table or per-sample results, that would be nice...
Garf
post Apr 12 2011, 01:06
Post #3


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



Oh, and given that Opus is open source, it would be pretty interesting for our audience if one of the developers could give a technical explanation of which codec features and design decisions enabled it to win this test. :)
AllanP
post Apr 12 2011, 01:14
Post #4





Group: Members
Posts: 8
Joined: 28-May 08
Member No.: 53862



I just wonder one thing: when the Vorbis encoder was tested, how was it lowpassed? Was it tested with the default 14 kHz lowpass?


--------------------
256 kbps Apple AAC bought iTunes music
Garf
post Apr 12 2011, 01:15
Post #5


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (AllanP @ Apr 12 2011, 02:14) *
I just wonder one thing: when the Vorbis encoder was tested, how was it lowpassed? Was it tested with the default 14 kHz lowpass?


You can see the exact settings used for each codec here:

http://listening-tests.hydrogenaudio.org/igorc/index.htm

AllanP
post Apr 12 2011, 01:22
Post #6





Group: Members
Posts: 8
Joined: 28-May 08
Member No.: 53862



QUOTE (Garf @ Apr 12 2011, 02:15) *
You can see the exact settings used for each codec here:

http://listening-tests.hydrogenaudio.org/igorc/index.htm


Ah, thanks, sorry, I did not see it.

It says -q 0.1, so I assume it was the default 14 kHz lowpass.


--------------------
256 kbps Apple AAC bought iTunes music
romor
post Apr 12 2011, 03:08
Post #7





Group: Members
Posts: 672
Joined: 16-January 09
Member No.: 65630



Congratulations to CELT/Opus! :)

I wanted to compare the testers' ratings per sample, but it seems that every tester got a randomized testing sequence.
Is there any way I can get such data and produce the plot I want? In case that's not clear: I want to know the source sample formats (for the 5 rating bins) for each tester.

Thanks

Edit: never mind, I found a way. It seems the sample name suffixes are the same (the ones describing the 5 bins at the header of each test result).

This post has been edited by romor: Apr 12 2011, 03:24


--------------------
scripts: http://goo.gl/M1qVLQ
IgorC
post Apr 12 2011, 03:50
Post #8





Group: Members
Posts: 1572
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



I think the results of lessthanjoey and AlexB are also still anonymous. That will be changed.
If anyone is interested in his/her own results, there is a key, or email me and I will send them.


Oh, I have participated in this test too. :)
Garf had the key for my results and had checked them.

It's also good to use strong words like "thank you, great job" sparingly. But this time I want to say a big thank you to all participants and to the people who helped to conduct this test.
Sebastian Mares - for his previous public tests. This test benefited much from them.
AlexB - for providing pre-decoded packages and being here.
Especially Garf.

And many other people who were around here. Your time is valuable and highly appreciated.

This post has been edited by IgorC: Apr 12 2011, 04:20
googlebot
post Apr 12 2011, 08:06
Post #9





Group: Members
Posts: 698
Joined: 6-March 10
Member No.: 78779



I'm stunned by the CELT/Opus results! I would have assumed that your toolbox is smaller than usual when you are targeting low delay. And now CELT even beats the others by a wide margin.

Thanks for the great work, guys!
Alex B
post Apr 12 2011, 12:59
Post #10





Group: Members
Posts: 1303
Joined: 14-September 05
From: Helsinki, Finland
Member No.: 24472



Thanks guys! Interesting results.


One note though:

CODE
Read 5 treatments, 531 samples => 10 comparisons
    Means:
          Vorbis   Nero_HE-AAC  Apple_HE-AAC          Opus    AAC-LC@48k
           3.513         3.547         3.817         3.999         1.656


For processing the result .txt files with chunky I organized them into sample folders. I removed the results that were marked "invalid" and results that apparently had a fixed newer version (marked as such). I had a duplicate problem with romor's results (a couple of duplicates in a subfolder), but I decided to keep the newer result files. I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what could cause the difference. Did you disqualify more results after creating the rar package, or does "531 samples" mean something other than the total number of result files?

Here's how chunky parses the 566 result files I have:

CODE
% Result file produced by chunky-0.8.4-beta
% ..\chunky.exe --codec-file=..\codecs.txt -n --ratings=results --warn -p 0.05
%
% Sample Averages:

Vorbis    Nero    Apple    CELT    Anchor
2.56    4.28    4.19    2.67    1.87
2.95    4.20    4.03    2.36    1.68
3.42    3.51    3.98    4.73    2.51
4.12    3.84    4.49    4.64    2.18
4.18    3.59    3.87    4.52    1.95
3.35    3.68    3.34    4.00    1.56
3.86    2.98    2.96    3.50    1.85
4.03    3.78    4.09    4.49    2.02
3.60    3.71    3.89    3.94    1.51
4.28    2.78    2.19    4.12    1.44
4.12    3.93    4.17    4.39    1.70
3.25    3.18    3.20    4.14    1.77
3.83    3.63    3.86    4.56    1.41
3.49    3.81    4.01    4.27    1.37
4.15    3.84    4.08    4.76    2.04
3.97    2.74    3.09    4.38    1.74
3.35    3.24    4.15    4.44    1.56
2.68    2.96    3.63    4.10    1.51
3.58    4.37    4.88    3.73    1.76
3.40    4.10    4.68    4.26    1.61
3.80    3.49    3.55    4.43    1.38
3.81    3.30    4.27    4.26    1.13
3.59    3.14    3.51    4.09    1.18
3.29    3.61    3.88    4.16    1.36
3.66    3.84    4.37    3.86    1.55
2.78    3.99    4.18    2.82    1.57
3.62    3.88    3.92    3.93    1.34
3.39    4.03    4.39    3.96    1.46
3.61    4.12    4.36    4.09    1.54
4.42    3.48    4.29    4.68    1.82

% Codec averages:
% 3.60    3.63    3.92    4.08    1.65


--------------------
http://listening-tests.freetzi.com
Garf
post Apr 12 2011, 13:53
Post #11


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (Alex B @ Apr 12 2011, 13:59) *
I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what can cause the difference.


I get the same result as you. It looks like the results submitted on the 10th of April are missing.

Edit: See below.

This post has been edited by Garf: Apr 12 2011, 15:28
Alex B
post Apr 12 2011, 14:14
Post #12





Group: Members
Posts: 1303
Joined: 14-September 05
From: Helsinki, Finland
Member No.: 24472



For comparison I uploaded a rar package of my "chunky" folder. It contains the reorganized result files and phong's chunky (Windows version). The command line I used is in the instructions.txt file.

I had to partially rename the result files to reorganize them into the sample folders. In addition, I needed to change all r.wav strings inside the result files to .wav before chunky could work. I batch processed the files with Notepad++. I believe it was a "safe" edit.

The package is here: http://www.hydrogenaudio.org/forums/index....showtopic=88033
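(The same "r.wav" edit could also be scripted instead of done in Notepad++; a minimal sketch, assuming the reorganized result files sit as .txt files under a "chunky" folder - the folder name here is only an example:)

CODE
import pathlib

# Replace every "r.wav" with ".wav" inside the result files, as described above.
for path in pathlib.Path("chunky").rglob("*.txt"):    # assumed folder layout
    text = path.read_text(encoding="utf-8", errors="ignore")
    if "r.wav" in text:
        path.write_text(text.replace("r.wav", ".wav"), encoding="utf-8")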

This post has been edited by Alex B: Apr 12 2011, 14:45


--------------------
http://listening-tests.freetzi.com
NullC
post Apr 12 2011, 14:48
Post #13





Group: Developer
Posts: 200
Joined: 8-July 03
Member No.: 7653



QUOTE
For processing the result .txt files with chunky I organized them into sample folders. I removed the results that were marked "invalid" and results that apparently had a fixed newer version (marked as such). I had a duplicate problem with romor's results (a couple of duplicates in a subfolder), but I decided to keep the newer result files. I got 566 remaining result files. Assuming I did not make lots of mistakes, I wonder what could cause the difference. Did you disqualify more results after creating the rar package, or does "531 samples" mean something other than the total number of result files?


Sounds like you didn't eliminate the listeners with more than 4 invalid results.

The filtering rules on the page are:

* If the listener ranked the reference worse than 4.5 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the low anchor at 5.0 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the reference below 5.0 on more than 4 samples, all of that listener's results were discarded.

You'll have to modify chunky to get that behavior.
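For anyone redoing the screening outside chunky, here is a minimal sketch of those three rules. It assumes the result files have already been parsed into one record per listener and sample, with a rating for the hidden reference and the low anchor; that data structure is hypothetical, not chunky's own format.

CODE
def screen_results(results, max_flawed_refs=4):
    """Apply the three post-screening rules listed above.

    `results` is assumed to be a list of dicts such as
    {"listener": "09", "sample": "Sample01", "reference": 5.0, "low_anchor": 1.2, ...}
    """
    # Rule 3: listeners who rated the hidden reference below 5.0 on more than
    # max_flawed_refs samples are dropped entirely.
    flawed = {}
    for r in results:
        if r["reference"] < 5.0:
            flawed[r["listener"]] = flawed.get(r["listener"], 0) + 1
    banned = {listener for listener, n in flawed.items() if n > max_flawed_refs}

    kept = []
    for r in results:
        if r["listener"] in banned:
            continue                     # rule 3
        if r["reference"] < 4.5:
            continue                     # rule 1: reference ranked worse than 4.5
        if r["low_anchor"] == 5.0:
            continue                     # rule 2: low anchor rated as transparent
        kept.append(r)
    return kept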


This post has been edited by NullC: Apr 12 2011, 15:12
Garf
post Apr 12 2011, 14:49
Post #14


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (Alex B @ Apr 12 2011, 15:14) *
For comparison I uploaded a rar package of my "chunky" folder. it contains the reorganized result files and phong's chunky (Windows version). The command line I used is in the instructions.txt file

I had to partially rename the result files to reorganize them into the sample folders. In addition I needed to change all r.wav strings in filenames to .wav before chunky could work. I batch processed the files with Notepad++. I believe it was a "safe" edit.

The package is here: http://www.hydrogenaudio.org/forums/index....showtopic=88033


Thanks, I didn't have the triaged results here, so this was welcome. By the way, chunky has quite dangerous behavior: by default, it squashes all listeners together per sample for the overall results. In other words, it's discarding most of the information in the test, as if only a single listener did all samples! The per-sample results don't suffer from that, so those should be fine.
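To make that concrete, here is a toy illustration with made-up ratings (not data from this test): collapsing each sample to a single mean before the overall analysis throws away the listener dimension, so the overall statistics see far fewer data points than were actually collected.

CODE
# Made-up ratings for one codec: ratings[sample] = {listener: score}
ratings = {
    "Sample01": {"L1": 4.5, "L2": 2.0, "L3": 4.4},
    "Sample02": {"L1": 4.6, "L2": 2.2},
    "Sample03": {"L1": 4.4, "L3": 4.3},
}

# chunky-style squashing: collapse every sample to one mean first ...
sample_means = [sum(v.values()) / len(v) for v in ratings.values()]
# ... then analyze only those means, as if a single listener rated each sample once.
print(len(sample_means), "data points instead of",
      sum(len(v) for v in ratings.values()))   # prints: 3 data points instead of 7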

Edit: Whoops, I indeed missed some results that should have been discarded.

This post has been edited by Garf: Apr 12 2011, 15:18
Garf
post Apr 12 2011, 14:59
Post #15


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (NullC @ Apr 12 2011, 15:48) *
Sounds like you didn't eliminate the listeners with more than 4 invalid results.

The filtering rules on the page are:

* If the listener ranked the reference worse than 4.5 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the low anchor at 5.0 on a sample, the listener's results for that sample were discarded.
* If the listener ranked the reference below 5.0 on more than 4 samples, all of that listener's results were discarded.

You'll have to modify chunky to get that behavior.


Ah, good point. There were two discarded listeners, and I got those. I saw one result with a rated reference that didn't cause an invalidation, so I got that correctly.

But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.
Alex B
post Apr 12 2011, 15:02
Post #16





Group: Members
Posts: 1303
Joined: 14-September 05
From: Helsinki, Finland
Member No.: 24472



QUOTE (NullC @ Apr 12 2011, 16:48) *
Sounds like you didn't eliminate the listeners with more than 4 invalid results.


I removed two folders (= listeners) before doing the tasks I mentioned:

- 09 (too many invalid results. The listener has never answered any email)
- 27 (something gone wrong or cheater )

I trusted the comments in the folder and file names. I did not look inside each and every result file.


--------------------
http://listening-tests.freetzi.com
Alex B
post Apr 12 2011, 15:09
Post #17





Group: Members
Posts: 1303
Joined: 14-September 05
From: Helsinki, Finland
Member No.: 24472



QUOTE (Garf @ Apr 12 2011, 16:59) *
But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.

Perhaps "low anchor" would be more accurate. wink.gif


--------------------
http://listening-tests.freetzi.com
NullC
post Apr 12 2011, 15:17
Post #18





Group: Developer
Posts: 200
Joined: 8-July 03
Member No.: 7653



QUOTE (Alex B @ Apr 12 2011, 07:02) *
QUOTE (NullC @ Apr 12 2011, 16:48) *
Sounds like you didn't eliminate the listeners with more than 4 invalid results.


I removed two folders (= listeners) before doing the tasks I mentioned:

- 09 (too many invalid results. The listener has never answered any email)
- 27 (something gone wrong or cheater )

I trusted the comments in the folder and file names. I did not look inside each and every result file.


Ah, okay!

(moving and amending from my edited post, since others already replied. Sorry)
The users who should have been excluded under that rule are 09, 27, and 22, but IgorC decided to keep 22 (because 22 didn't understand the procedure at first but got better later). I also expected 21 to be filtered (because he rated only the low anchor on almost all of the samples: 23/30 are either low-anchor-only or invalid, including many of the really obvious ones).

This post has been edited by NullC: Apr 12 2011, 15:19
Alex B
post Apr 12 2011, 15:35
Post #19





Group: Members
Posts: 1303
Joined: 14-September 05
From: Helsinki, Finland
Member No.: 24472



QUOTE (Garf @ Apr 12 2011, 16:59) *
But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.

I found six "low anchor = 5.0" instances (I outputted a csv file from chunky and sorted the data by the low anchor column in Excel)

My math says 560. smile.gif

(or did you actually remove the "rated but accepted reference" instance?)


--------------------
http://listening-tests.freetzi.com
Garf
post Apr 12 2011, 15:42
Post #20


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (Alex B @ Apr 12 2011, 16:35) *
QUOTE (Garf @ Apr 12 2011, 16:59) *
But there are a few results with 5.0's for the reference. After discarding those, I'm at 559 samples now.

I found six "low anchor = 5.0" instances (I outputted a csv file from chunky and sorted the data by the low anchor column in Excel)

My math says 560. smile.gif

(or did you actually remove the "rated but accepted reference" instance?)


No. But after running chunky I only had 565, not 566 files. It appears to reject one input file for some reason (this is on Linux).

A lesson here is that the post-screened dataset should be published too, because it's easy to make mistakes there, and it would make things easier for people who want to do other or further analysis. But considering NullC's comment, the results on the site are probably correct.
Alex B
post Apr 12 2011, 16:14
Post #21





Group: Members
Posts: 1303
Joined: 14-September 05
From: Helsinki, Finland
Member No.: 24472



Regarding the bitrate table,

I guess that CELT/Opus is not supported in any program that can display and/or export accurate bitrate data.

If the bitrate needs to be calculated from the file size, should the size of the Ogg container data be subtracted from the file size before performing the calculation? What would be the correct amount?

Would the bitrate value then be comparable with the values that foobar shows for the other contenders? (It is quite simple to export bitrate data from foobar.)


--------------------
http://listening-tests.freetzi.com
motion_blur
post Apr 12 2011, 16:15
Post #22





Group: Members
Posts: 13
Joined: 8-March 11
Member No.: 88816



QUOTE (Alex B @ Apr 12 2011, 16:02) *
QUOTE (NullC @ Apr 12 2011, 16:48) *
Sounds like you didn't eliminate the listeners with more than 4 invalid results.


I removed two folders (= listeners) before doing the tasks I mentioned:

- 09 (too many invalid results. The listener has never answered any email)
- 27 (something gone wrong or cheater )

I trusted the comments in the folder and file names. I did not look inside each and every result file.


#27 are my results. I do not know if something went wrong, but I am definitely not a cheater.
Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.
NullC
post Apr 12 2011, 17:54
Post #23





Group: Developer
Posts: 200
Joined: 8-July 03
Member No.: 7653



QUOTE (motion_blur @ Apr 12 2011, 08:15) *
#27 are my results. I do not know if something went wrong, but I am definitely not a cheater.
Over a week ago, I sent Igor some wave files he asked for, but he has not answered my email yet.


I think it's really unfortunate that Igor released a file with the word "cheater" in it. There are so many ways for a result to go weird that have nothing to do with cheating.

Your results can be excluded purely on the basis of the previously published confused-reference criteria (2, 4, 9, 22, 30 invalid), so that should settle the question of whether excluding them was correct, and it should have been left at that. Even with good and careful listeners this can happen, and it's nothing anyone should take too personally.

That said, your results are pretty weird: you ranked the reference fairly low (e.g. 3) on a couple of comparisons where many people found the reference and the codec indistinguishable. I think you also failed to reverse your preference on some samples where the other listeners changed their preference (behavior characteristic of a non-blind test?).

I don't mean to cause offense, but were you listening via speakers, or could you have far less HF sensitivity than most of the other listeners (if you are male and older than most participants, the answer might well be yes)? Any other ideas why your results might be very different, both overall and on specific samples?

This post has been edited by NullC: Apr 12 2011, 18:23
NullC
post Apr 12 2011, 18:10
Post #24





Group: Developer
Posts: 200
Joined: 8-July 03
Member No.: 7653



QUOTE (Alex B @ Apr 12 2011, 08:14) *
Regarding the bitrate table,
I guess that CELT/Opus is not supported in any program that can display and/or export accurate bitrate data.
If the bitrate needs to be calculated from the file size, should the size of the Ogg container data be subtracted from the file size before performing the calculation? What would be the correct amount?
Would the bitrate value then be comparable with the values that foobar shows for the other contenders? (It is quite simple to export bitrate data from foobar.)


If you wish to remove the container overhead for the Vorbis and Opus files, you can use a tool like ogg-dump from oggztools to extract all the packet sizes.

On a few samples Vorbis suffers a bit because the Vorbis headers are fairly large compared to an 8-second 64 kbit/s file (e.g. Sample01), but I don't think the container overhead is all that considerable.
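For anyone who wants a rough number without ogg-dump, the Ogg page overhead can also be measured directly: every page starts with "OggS", a 27-byte header, and a segment table whose lacing values give the page body size. A minimal sketch of that (my own, not one of the tools mentioned above); note that the "payload" it reports still includes the codec setup headers, and you need the clip duration to turn bytes into kbps:

CODE
import sys

def ogg_payload_and_overhead(path):
    """Split an Ogg file into codec payload bytes and container (page header) bytes."""
    payload = overhead = 0
    data = open(path, "rb").read()
    pos = 0
    while pos + 27 <= len(data) and data[pos:pos + 4] == b"OggS":
        nsegs = data[pos + 26]                        # number of lacing values
        body = sum(data[pos + 27:pos + 27 + nsegs])   # page body size from the segment table
        overhead += 27 + nsegs                        # page header + segment table
        payload += body
        pos += 27 + nsegs + body
    return payload, overhead

if __name__ == "__main__":
    payload, overhead = ogg_payload_and_overhead(sys.argv[1])
    duration = float(sys.argv[2])                     # clip length in seconds, e.g. 8.0
    print("payload (incl. codec headers): %.1f kbps" % (payload * 8 / duration / 1000.0))
    print("Ogg page overhead: %d bytes" % overhead)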

This post has been edited by NullC: Apr 12 2011, 18:15
IgorC
post Apr 12 2011, 18:13
Post #25





Group: Members
Posts: 1572
Joined: 3-January 05
From: ARG/RUS
Member No.: 18803



Yes, I was too strict. Sorry about it.

Some of the listeners preferred Nero over Vorbis or vice versa. Some of them rated Vorbis higher than the HE-AAC codecs.
Others preferred Apple HE-AAC over CELT on the second half of the samples. These variations are all fine.
Still, on average Opus/CELT was better for every listener with enough results.
It was very strange that you ranked Opus as low as the low anchor (as on sample 10 and many others), where ALL other listeners scored it very well.
Your average scores (including the 5 invalid samples):
Vorbis - 3.53
Nero - 3.15
Apple - 3.51
CELT - 2.34


Maybe your hardware has some issues.

Earlier I also wrote to ask you to re-run the whole test, because there were 5 invalid results and your whole set of results was discarded.

This post has been edited by IgorC: Apr 12 2011, 18:18
