Jeff Atwood's "Great MP3 Bitrate Experiment", from the Coding Horror blog
2Bdecided
post Jul 3 2012, 11:52
Post #26


ReplayGain developer


Group: Developer
Posts: 5138
Joined: 5-November 01
From: Yorkshire, UK
Member No.: 409



Like someone else said, I think some testers got a bit confused about how they were supposed to be rating these things.

Though I find it amazing that anyone could perfectly rate them (except by accident or cheating). I guess I just have cloth ears!

Cheers,
David.
splice
post Jul 3 2012, 15:04
Post #27





Group: Members
Posts: 125
Joined: 23-July 03
Member No.: 7935



QUOTE (greynol @ Jun 28 2012, 14:58) *
One of these:

[image]
Tch. Kids these days... obviously never read Billboard.


--------------------
Regards,
Don Hills
mzil
post Jul 3 2012, 15:40
Post #28





Group: Members
Posts: 606
Joined: 5-August 07
Member No.: 45913



QUOTE (nevermind @ Jul 2 2012, 23:21) *
maybe if I remove some of the people who rated the lowest sample higher than CD it would be more interesting, and I think I have found something unusual....


See: 1.1 "Discarding unfavorable data"

This post has been edited by mzil: Jul 3 2012, 15:42
db1989
post Jul 3 2012, 15:47
Post #29





Group: Super Moderator
Posts: 5275
Joined: 23-June 06
Member No.: 32180



There is a fundamental difference between data that are unfavourable and data that do not meet the requirements of the test.

Many users submitted their data under the incorrect assumption that the scale of 1–5 was a rank of their preference for each individual sample, with each value being useable only once. In actuality, the scale was supposed to be used as their rating of perceived quality for each sample, with no limit to the number of occurrences.

So, I don’t think your reference is relevant.

Whether or not it’s possible to confidently identify the data that do not meet the actual specification, to discard them, and to retain sufficient numbers to draw a useful conclusion is another question entirely.
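To make the distinction concrete, here is a rough sketch in Python (the data and names are invented for illustration; nothing below comes from the actual results file) of the kind of check that would be needed: flagging respondents whose five scores form a permutation of 1–5, which is what a misunderstood "ranking" would look like. The obvious catch is that a genuine rater can produce such a set by accident, which is exactly why I doubt it can be done confidently.

CODE
# Sketch: flag respondents who appear to have used 1-5 as a ranking
# (each value exactly once) rather than as five independent quality
# ratings. The data and names here are invented for illustration.

def looks_like_ranking(scores):
    """True if the five scores use each of 1..5 exactly once."""
    return sorted(scores) == [1, 2, 3, 4, 5]

responses = {
    "listener_a": [5, 4, 3, 2, 1],  # permutation -> probably ranked
    "listener_b": [5, 5, 4, 4, 3],  # repeats -> probably rated quality
}

suspect = [who for who, s in responses.items() if looks_like_ranking(s)]
print(suspect)  # ['listener_a'] -- though it could be a genuine rating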

This post has been edited by db1989: Jul 3 2012, 17:23
Reason for edit: wording
mzil
post Jul 3 2012, 18:26
Post #30





Group: Members
Posts: 606
Joined: 5-August 07
Member No.: 45913



You shouldn't cherry pick raw data under any circumstances in a properly unbiased, double blind test. It makes the test suspect, regardless of the test conductor's intentions, good or evil. If there was poor wording or a misunderstanding in the instructions, then one needs to conduct a fundamentally new test, not discard raw data one "believes" to be compromised.

[In different circumstances I'd accept using one test in an attempt to find certain "gifted" test subjects, who are then retested, however. This could be used, for instance, to find "golden-eared" listeners.]

This post has been edited by mzil: Jul 3 2012, 18:40
Canar
post Jul 3 2012, 18:47
Post #31





Group: Super Moderator
Posts: 3361
Joined: 26-July 02
From: princegeorge.ca
Member No.: 2796



Disregarding all listeners who rated WAV as less than 5 gives us this chart (based on his Excel file, can't be bothered to make it more "accurate"):

[chart: filtered results]

And our original:

[chart: original results]

Note how that removes much of the preference for the first option and brings all the other options roughly in line with what we would expect.
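For anyone who wants to reproduce this from the spreadsheet, the filter amounts to something like the following sketch (the filename and column names are invented; adjust them to match the actual file):

CODE
# Keep only listeners who rated the lossless (WAV) sample a full 5,
# then recompute the per-sample averages. Filename and column names
# are hypothetical.
import pandas as pd

df = pd.read_excel("results.xlsx")
cols = ["128kbps", "160kbps", "192kbps", "320kbps", "WAV"]

kept = df[df["WAV"] == 5]
print(len(kept), "of", len(df), "listeners kept")
print(kept[cols].mean().round(2))  # per-sample averages after filtering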

This post has been edited by Canar: Jul 3 2012, 18:58


--------------------
You cannot ABX the rustling of jimmies.
No mouse? No problem.
db1989
post Jul 3 2012, 18:51
Post #32





Group: Super Moderator
Posts: 5275
Joined: 23-June 06
Member No.: 32180



Disclaimer: meandering musings

QUOTE (mzil @ Jul 3 2012, 18:26) *
You can't cherry pick raw data under any circumstances. It makes the test invalid regardless of your intentions, good or evil. If there was poor wording or a misunderstanding in the instructions, then you need to conduct a fundamentally new test, not discard raw data you "believe" to be compromised.

I don't disagree in principle. Hail science! I was just pointing out that, however scientifically tenuous it might be, excluding data because they were submitted in the wrong format is not exactly equivalent to excluding data because they aren't conducive to someone's ulterior motive(s). At the very least, it's not equivalent ethically: one is done in an effort to improve the reliability of a conclusion, whereas the other is done merely out of cynical self-interest.

Scientific ethics aside (just for a moment! wink.gif), is such filtering of incorrectly calibrated data even likely to be possible in any real-life study with any probability of preserving its objective reliability? I lack the experience to answer either way, and I suspect that it's better avoided anyway due to the same concerns that you've raised – but in this case, I don't think it's very likely that one could do it. That was what I meant by my closing sentence, although I should have given it more consideration.

Of course, as you implied, this question should never arise: collection of data should be designed so as to preclude any of them being 'incorrect' or ambiguous. In this specific case, the take-home message is that instructions must be clear and unambiguous, so that respondents can provide useful data. It's a shame how this test is somewhat marred by its shortcomings in that area and, as I said, how this confounding factor can't be removed post hoc.

QUOTE (Canar @ Jul 3 2012, 18:47) *
Disregarding all listeners who rated WAV as less than 5
Since you've just reminded me of something I wondered about earlier: how about disregarding all respondents whose data sets included each number only once? Or am I getting desperate here? tongue.gif
Canar
post Jul 3 2012, 19:11
Post #33





Group: Super Moderator
Posts: 3361
Joined: 26-July 02
From: princegeorge.ca
Member No.: 2796



QUOTE (db1989 @ Jul 3 2012, 10:51) *
Since you’ve just reminded me of something I wondered about earlier: how about disregarding all respondents whose data sets included each number only once? Or am I getting desperate here? tongue.gif

I also excluded all data sets consisting of one number for all entries.

Combining our approaches (restricted to WAV=5, only entries with duplicates) does not provide good results either: [4.11, 3.79, 5, 3.79, 3.52]
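The combined filter is roughly this (a sketch with invented rows and column names, not the actual data):

CODE
# Combined filter: WAV rated 5, not one number for every entry, and at
# least one duplicated value (so the scores cannot be a pure 1-5 ranking).
import pandas as pd

df = pd.DataFrame(
    [[5, 4, 3, 2, 1],   # pure ranking -> excluded
     [5, 5, 4, 4, 3],   # has duplicates -> kept
     [3, 3, 3, 3, 3]],  # one number for every entry -> excluded
    columns=["WAV", "320kbps", "192kbps", "160kbps", "128kbps"],
)

def keep(row):
    scores = row.tolist()
    return (row["WAV"] == 5           # rated lossless a full 5
            and len(set(scores)) > 1  # not all-identical entries
            and len(set(scores)) < 5) # a duplicate exists -> not a ranking

print(df[df.apply(keep, axis=1)].mean())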

This post has been edited by Canar: Jul 3 2012, 19:16


--------------------
You cannot ABX the rustling of jimmies.
No mouse? No problem.
Porcus
post Jul 3 2012, 20:28
Post #34





Group: Members
Posts: 1842
Joined: 30-November 06
Member No.: 38207



QUOTE (mzil @ Jul 3 2012, 19:26) *
You shouldn't cherry pick raw data under any circumstances in a properly unbiased, double blind test. It makes the test suspect, regardless of the test conductor's intentions, good or evil. If there was poor wording or a misunderstanding in the instructions, then one needs to conduct a fundamentally new test, not discard raw data one "believes" to be compromised.


Well ... opinions certainly differ on that one. As far as I know, there is no universally agreed-upon treatment of outliers.

However, if the null is random ranking, then various statistical models could cope with those who rank the other way around. You could formulate the alternative hypothesis as H1: after possibly switching the order of rankings, they are still more concordant with bitrate than is consistent with the null. But if you start looking at the data, you are mining, and that is not without issues either.


Now for designing a new test, you are of course free to look at your old data with any creativity you can imagine. You are essentially looking for any pattern that could be tested.
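To make that concrete, one possible setup (a sketch with invented numbers, not a claim about how this experiment was or should be analysed) is to take |Kendall's tau| between a listener's scores and the bitrates, so that reversed rankings still count as concordant, and compare it to a null distribution obtained by shuffling the scores:

CODE
# H0: scores are assigned to bitrates at random.
# H1: |tau| between scores and bitrate is larger than chance; the
# absolute value lets "reversed" listeners count as concordant too.
# All numbers below are invented for illustration.
import random
from scipy.stats import kendalltau

bitrates = [128, 160, 192, 320, 1411]  # 1411 kbps standing in for CD audio
scores = [2, 3, 3, 4, 5]               # one hypothetical listener

tau, _ = kendalltau(bitrates, scores)
observed = abs(tau)

# Null distribution: |tau| for random assignments of the same scores.
null = []
for _ in range(10000):
    t, _ = kendalltau(bitrates, random.sample(scores, len(scores)))
    null.append(abs(t))

p = sum(t >= observed for t in null) / len(null)
print(f"|tau| = {observed:.2f}, permutation p = {p:.3f}")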


--------------------
One day in the Year of the Fox came a time remembered well
mzil
post Jul 3 2012, 21:24
Post #35





Group: Members
Posts: 606
Joined: 5-August 07
Member No.: 45913



http://news.change.org/stories/cherry-pick...ientific-method

QUOTE
As far as I know, there is no universally agreed-upon treatment of outliers
You count them.

As always, if you discover there is a flaw in the test design, then you chuck ALL the data in the trash bin and design a new test. You don't go back and cherry-pick (i.e. keep) only the data you feel, with your "completely objective and unbiased view", is "legit".

This post has been edited by mzil: Jul 3 2012, 21:31
db1989
post Jul 3 2012, 21:30
Post #36





Group: Super Moderator
Posts: 5275
Joined: 23-June 06
Member No.: 32180



Shall I just repeat what I've already said about your allegation that the exclusion of incorrectly formatted data – which was not done by the actual researcher, it must be emphasised – is equivalent to cynical cherry-picking in favour of an ulterior motive? Or are you the only one who gets to repeat yourself?

I don't disagree in principle that one should always endeavour to solve problems at the earliest/proper point, i.e. the experimental design in this case. I was just musing hypothetically. That last word is important, since it's me who's twittering away to myself here, rather than the researcher having actually done this or anything like it! Looking back, I do not agree with the filtering suggested by nevermind, which began all of this, but again: that's different from asking whether one can filter data that were not formatted correctly. Which, again, isn't something I think can be done reliably – but it was just a hypothetical question about the possibility of putting a Band-Aid on a less than optimally designed test, not prodding something in a direction according to self-interest.

This post has been edited by db1989: Jul 3 2012, 21:37
Reason for edit: adding entire second paragraph of elaboration
mzil
post Jul 3 2012, 21:33
Post #37





Group: Members
Posts: 606
Joined: 5-August 07
Member No.: 45913



I think there is a belief here that as long as the motives are "pure, and unmotivated by desired outcome", cherry picking is "OK". I don't feel that way. There could be things which are unforeseen by all of us.

This post has been edited by mzil: Jul 3 2012, 21:39
Canar
post Jul 3 2012, 21:39
Post #38





Group: Super Moderator
Posts: 3361
Joined: 26-July 02
From: princegeorge.ca
Member No.: 2796



Data are data. If there has been some kind of procedural error and it's not feasible to re-run the experiment, it's entirely legit to restrict your data down to the valid subset, if there is some easy way to do so. If, due to some error, only 10% of your data are actually valid, and you can identify that 10% post hoc, there is no reason not to analyze that 10%. It might redeem the entire experiment.

Ideally, yes, you re-run the experiment and try to ensure that 100% of your data are valid. This is not always feasible, nor should it be absolutely required.


--------------------
You cannot ABX the rustling of jimmies.
No mouse? No problem.
saratoga
post Jul 3 2012, 22:05
Post #39





Group: Members
Posts: 4967
Joined: 2-September 02
Member No.: 3264



QUOTE (mzil @ Jul 3 2012, 16:24) *
As always, if you discover there is a flaw in the test design, then you chuck ALL the data in the trash bin and design a new test.


While I understand your motivation, this is basically just your unsupportable opinion.
Canar
post Jul 3 2012, 22:23
Post #40





Group: Super Moderator
Posts: 3361
Joined: 26-July 02
From: princegeorge.ca
Member No.: 2796



QUOTE (saratoga @ Jul 3 2012, 14:05) *
basically just your unsupportable opinion
Careful, we've stepped below science into its superstructure: philosophy of science. Here there be dragons: terrible things that could render all the lovely objectivity around here into little more than "unsupportable opinion"...


--------------------
You cannot ABX the rustling of jimmies.
No mouse? No problem.
mzil
post Jul 3 2012, 23:18
Post #41





Group: Members
Posts: 606
Joined: 5-August 07
Member No.: 45913



QUOTE (Canar @ Jul 3 2012, 16:39) *
If there has been some kind of procedural error and it's not feasible to re-run the experiment, it's entirely legit to restrict your data down to the valid subset, if there is some easy way to do so.
...

Huh? I suspect you don't really mean this, unless I am just completely mis-reading it. The feasibility or ease of re-running a test doesn't make a difference as to the legitimacy of the original test.

To paraphrase what you have written, one could say "If it is difficult to re-run a test, then we should accept at least the subset of the data that we believe wasn't compromised, due to the known error", [as long as we still have a large enough sample left over to make the results statistically significant, I guess]. "If it is easy to re-run the test, however, then the original data is suspect, should be ignored, and we should do the re-test."

The difficulty in re-running a test, but this time without the design flaw, doesn't change whether the original test data is legit or not. It either is or it isn't, regardless of the time needed/ease/difficulty in conducting a new test without the design flaw. Right?
---

"Cherry picking" is a type of confirmation bias, more accurately called a "fallacy of suppressed evidence" and may very well be unconscious in nature, despite its sinister sounding name. I wasn't, however, trying to speak poorly of anyone here or question their motives, but I seem to be alone here in thinking that claims of "pure and unbiased" motivation, which of course all scientists think applies to them wink.gif , doesn't suddenly make cherry picking "acceptable". Everyone thinks their selection process is "sound, pure, and motivated only by the unbiased pursuit of truth".

As it says here, one's motivation may indeed be pure and honest, but the fallacy name, even if not a very good name, still applies:

"If the relevant information is not intentionally suppressed by rather inadvertently overlooked, the fallacy of suppressed evidence also is said to occur, although the fallacyís name is misleading in this case. The fallacy is also called the Fallacy of Incomplete Evidence and Cherry-Picking the Evidence.."

I unfortunately don't have any more time on my hands to devote to this, so I'm outta here.
Happy July 4th everyone!

This post has been edited by mzil: Jul 3 2012, 23:35
saratoga
post Jul 4 2012, 00:19
Post #42





Group: Members
Posts: 4967
Joined: 2-September 02
Member No.: 3264



QUOTE (mzil @ Jul 3 2012, 18:18) *
"Cherry picking" is a type of confirmation bias, more accurately called a "fallacy of suppressed evidence" and may very well be unconscious in nature, despite its sinister sounding name. I wasn't, however, trying to speak poorly of anyone here or question their motives, but I seem to be alone here in thinking that claims of "pure and unbiased" motivation, which of course all scientists think applies to them wink.gif , doesn't suddenly make cherry picking "acceptable". Everyone thinks their selection process is "sound, pure, and motivated only by the unbiased pursuit of truth".


I don't think anyone is saying that cherry picking doesn't exist. I think the point is that your remarks about cherry picking are not really relevant in this particular instance.

QUOTE (Canar @ Jul 3 2012, 17:23) *
QUOTE (saratoga @ Jul 3 2012, 14:05) *
basically just your unsupportable opinion
Careful, we've stepped below science into its superstructure: philosophy of science.


Which is why it is incorrect to make universal assertions about how things must be done.
Porcus
post Jul 4 2012, 00:59
Post #43





Group: Members
Posts: 1842
Joined: 30-November 06
Member No.: 38207



QUOTE (mzil @ Jul 3 2012, 22:24) *
QUOTE
As far as I know, there is no universally agreed-upon treatment of outliers
You count them.

Before or after you have defined them?


QUOTE (mzil @ Jul 3 2012, 22:24) *
As always, if you discover there is a flaw in the test design, then you chuck ALL the data in the trash bin and design a new test.

Well go tell that to a paleontologist tongue.gif


No, seriously: have a look at http://en.wikipedia.org/wiki/Meta-analysis..._and_weaknesses.


--------------------
One day in the Year of the Fox came a time remembered well
