FLAC I/O less efficient than STDIN, direct file access can be almost twice as slow as STDIN
skamp
post Nov 26 2012, 18:26
Post #1





Group: Developer
Posts: 1412
Joined: 4-May 04
From: France
Member No.: 13875



I was optimizing caudec when I came across this oddity. Basically, letting /usr/bin/flac access .flac files on a slowish HDD directly for decoding ('flac -d file.flac') was in one particular case almost twice as slow as piping the files to /usr/bin/flac via STDIN ('cat file.flac | flac -d -').

I used a double album for testing, made of 37 tracks for a total of about 1 GiB, located on a HDD that tops out at about 70 MB/s. Incidentally, flac decodes on my machine at a similar rate.

I ran caudec twice (figuratively - I repeated the tests many times) with 8 concurrent processes, for decoding those FLAC files to WAV on a ramdisk. I made sure to drop all caches between each run. First run was with direct file access, and completed in 40 seconds. Second run was with piping to STDIN, and completed in 25 seconds.

The difference was much less pronounced, surprisingly, on a USB flash drive that tops out at 35 MB/s (34 seconds vs. 30 seconds), and non-existent on a RAID 1 array that tops out at 130 MB/s and on an SSD that tops out at 500 MB/s. I experienced similar differences with WavPack.

Does anyone have any idea of what's going on?


--------------------
See my profile for measurements, tools and recommendations.
phofman
post Nov 26 2012, 19:11
Post #2





Group: Members
Posts: 283
Joined: 14-February 12
Member No.: 97162



Running the reading and decoding processes in parallel?
Destroid
post Nov 26 2012, 20:19
Post #3





Group: Members
Posts: 545
Joined: 4-June 02
Member No.: 2220



I would imagine reading and writing to the same mechanical disk would be the culprit. This is supported by the USB drive measurement, but that doesn't appear to explain everything. If I understand correctly, the plain version of decoding is slower than a workaround-type STDIN decoding. Although I am unfamiliar with *nix, there are two possibilities (either or both):

OS - handling of STDIO regarding write-buffering via the HDD driver and/or CPU;
BIN - when outputting the data from a STDIN source, it reads/writes larger chunks, and the chunk size works better with the write-buffering and/or CPU cache (I recall something about FB2K regarding differences in the seek table when the encoder used STDIN; not sure why this would also affect decoding).


--------------------
"Something bothering you, Mister Spock?"
skamp
post Nov 26 2012, 20:59
Post #4








QUOTE (phofman @ Nov 26 2012, 19:11) *
Running the reading and decoding processes in parallel?


Yes.

QUOTE (Destroid @ Nov 26 2012, 20:19) *
I would imagine reading and writing to the same mechanical disk would be the culprit.


Like I said, I was writing to a ramdisk! The drives were only read from, not written to.


Destroid
post Nov 26 2012, 22:02
Post #5








Ok, I see that in your third paragraph. (Pardon my oversight, I posted right before a medical appointment so apparently I was distracted.)

If I read correctly:
FLAC -d [HDD -> RAMdisk] = 40s
FLAC STDIO [HDD -> RAMdisk] = 25s
FLAC -d [flashdrive -> RAMdisk] = 34s
FLAC STDIO [flashdrive -> RAMdisk] = 30s
RAID = no tangible difference
SSD = no tangible difference

You mentioned WavPack having similar behavior, so it doesn't seem the binary is the culprit either. I don't suppose using fewer concurrent processes would improve the performance of FLAC -d, but it might be worth checking. It is also unclear how these processes are distributed, but I wondered if multiple decode threads caused an unintended bottleneck (especially with fast-decoding formats).

edit: I should also have mentioned I thought an instance of STDIO was limited to one thread per file, but this may be a bad assumption on my part.

This post has been edited by Destroid: Nov 26 2012, 22:08


skamp
post Nov 26 2012, 22:22
Post #6








QUOTE (Destroid @ Nov 26 2012, 22:02) *
If I read correctly:


Yes.

QUOTE (Destroid @ Nov 26 2012, 22:02) *
I don't suppose using fewer concurrent processes would improve the performance of FLAC -d, but it might be worth checking.


Using 4 processes instead of 8: HDD direct: 45 seconds, HDD STDIN: 27 seconds; USB direct: 34 seconds, USB STDIN: 31 seconds. Note that I'm using a quad-core CPU with hyperthreading (4 cores, 8 threads).

QUOTE (Destroid @ Nov 26 2012, 22:02) *
I should also have mentioned I thought an instance of STDIO was limited to one thread per file, but this may be a bad assumption on my part.


Yes. Really, the only difference here is that I delegated the reading to /usr/bin/cat. That alone magically improves performance, particularly in the HDD case. /usr/bin/cat is doing something right that /usr/bin/flac is doing wrong, or so it seems anyway.


greensdrive
post Nov 26 2012, 23:16
Post #7





Group: Members
Posts: 28
Joined: 20-May 11
Member No.: 90802



Try using just one thread for cat and the same for flac -d; perhaps one binary is optimized for multi-core and one is not.

It seems to me that flac would use RAM to decode. Having less RAM available (because of the ramdisk) may be limiting flac's ability to decode, but the same could be said for cat. That's assuming you used actual RAM, and not swap or other temporary hard drive space, for the ramdisk.

Also, /usr/bin/flac seems like a binary provided by your distribution. Maybe try a more optimized one that you compiled yourself, or even one from Rarewares (if they have it), since caudec supports Wine anyway.
yourlord
post Nov 26 2012, 23:28
Post #8





Group: Members
Posts: 198
Joined: 1-March 11
Member No.: 88621



I tested this on my machine with mostly insignificant differences.

On a 171MB FLAC -8 encoded file.
I ran one decode to allow Linux to cache the FLAC file in RAM and then discarded the results.

I did 3 runs...

A proper process efficient redirection to standard input:
flac -o test.wav -d - < 01-A\ Change\ Of\ Seasons.flac

real 0m3.740s
user 0m3.432s
sys 0m0.300s


A process inefficient pipe from cat:
cat 01-A\ Change\ Of\ Seasons.flac | flac -o test.wav -d -

real 0m3.869s
user 0m3.428s
sys 0m0.720s


Allowing FLAC to read the file itself:
flac -o test.wav -d 01-A\ Change\ Of\ Seasons.flac

real 0m3.765s
user 0m3.392s
sys 0m0.336s

I ran 3 runs of each test and while the numbers fluctuated slightly, the time spread remained similar on all runs.

In the given example run:
The difference between the best run (process-efficient redirect) and the worst (pipe from cat) is 129 ms.
The difference between the process-efficient redirect and directly reading is only 25 ms.

Given this admittedly abysmally inadequate sample size, it would appear that shell STDIN redirection provides the fastest decode, but the difference between redirection and directly reading the file is small enough to basically dismiss as noise. It would appear that in all cases the context switches involved with invoking cat yield the slowest results by a significant margin.

This post has been edited by yourlord: Nov 26 2012, 23:29
skamp
post Nov 26 2012, 23:42
Post #9








QUOTE (yourlord @ Nov 26 2012, 23:28) *
I tested this on my machine with mostly insignificant differences.


For obvious reasons:

QUOTE (yourlord @ Nov 26 2012, 23:28) *
On a 171MB FLAC -8 encoded file.


You used a single file (my experiments use many, concurrently) that amounts to a rather small amount of data (I used a total of 1 GiB in order to make the differences more pronounced)!

QUOTE (yourlord @ Nov 26 2012, 23:28) *
I ran one decode to allow Linux to cache the FLAC file in RAM and then discarded the results.


You let your OS cache your file on purpose, so it was decoded from RAM. Why? I'm talking about hard drive access. Do you even understand what I'm talking about?

QUOTE (yourlord @ Nov 26 2012, 23:28) *
A proper process efficient redirection to standard input:
flac -o test.wav -d - < 01-A\ Change\ Of\ Seasons.flac


Actually, I tried that, and it's a lot less efficient in my experiments than piping cat's output: my test, using that method, completes in 40 seconds (versus 25) off my HDD.

QUOTE (yourlord @ Nov 26 2012, 23:28) *
Given this admittedly abysmally inadequate sample size


Your entire testing process is completely off-topic.


yourlord
post Nov 27 2012, 01:21
Post #10








QUOTE (skamp @ Nov 26 2012, 17:42) *
You let your OS cache your file on purpose, so it was decoded from RAM. Why? I'm talking about hard drive access. Do you even understand what I'm talking about?


I eliminated the hard drive from the equation because you're trying to investigate an issue with many variables in play.
My goal was to narrow this down to one factor, the method of delivering data to flac, and to test to see if there is a discernible and significant performance pattern related to them.
As expected, in my limited testing, using shell builtin redirection was faster than spawning a wasted cat process and was slightly faster than letting flac read it directly.

You came here with a question about why your script shows this anomaly, and the first step is to break down the steps you use and test whether there is an inherent inefficiency in any of them.
You raised the question of how performance differs based on how the data is delivered to flac, and I set out to test each method to see if any one carries a significant performance hit. You take a slight performance penalty for spawning unneeded processes, and it has an impact even on a single decode. Multiply that ~129 ms by several hundred decodes and it adds up. I was simply presenting a small data-set test to show the performance differences between the methods you asked about. It may or may not be the source of your problem, but it's a data point that can be considered and then confirmed or eliminated as a contributing factor.

You came here with a problem and I tried to provide a small bit of data to aid in your investigation. I'm sorry if my attempt at helping offends you.

QUOTE (skamp @ Nov 26 2012, 17:42) *
QUOTE (yourlord @ Nov 26 2012, 23:28) *
A proper process efficient redirection to standard input:
flac -o test.wav -d - < 01-A\ Change\ Of\ Seasons.flac


Actually, I tried that, and it's a lot less efficient in my experiments than piping cat's output: my test, using that method, completes in 40 seconds (versus 25) off my HDD.



Then there is something else terribly wrong. Shell native redirection should ALWAYS be faster than piping the data from cat. There's an entire process that no longer needs to be spawned and managed (cat) for every single decode operation.

I'm not sure why there appears to be a slight performance hit for having flac read the file directly. I'd need to dig into the sources to see where the difference lies but I would imagine there was a lot more thought put into efficient IO by the people writing bash than by the guy who wrote flac.
Axon
post Nov 27 2012, 01:32
Post #11





Group: Members (Donating)
Posts: 1984
Joined: 4-January 04
From: Austin, TX
Member No.: 10933



You might try `blockdev --setra 65536 --setfra 65536 <device>` to set blockdev/fs readahead to ridiculously high values.

It's possible that the difference in performance between the HDD, the USB drive and the RAID is primarily due to small I/O timing differences between the processes tickling the pagecache in different ways.
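In sketch form, where /dev/sdb is a placeholder for the drive in question (both commands need root), and keeping in mind that readahead is counted in 512-byte sectors:

```shell
# Query the current readahead of the device (in 512-byte sectors).
blockdev --getra /dev/sdb   # kernels commonly default to 256 (= 128 KiB)

# Raise both the block-device and the filesystem readahead, as suggested.
blockdev --setra 65536 --setfra 65536 /dev/sdb

# 65536 sectors * 512 bytes = 32 MiB of readahead -- deliberately absurd:
echo $((65536 * 512 / 1024 / 1024))   # prints 32
```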
Axon
post Nov 27 2012, 01:43
Post #12








QUOTE (yourlord @ Nov 26 2012, 18:21) *
Then there is something else terribly wrong. Shell native redirection should ALWAYS be faster than piping the data from cat. There's an entire process that no longer needs to be spawned and managed (cat) for every single decode operation.


Not true. The pipe adds an extra layer of buffering between the filesystem read and the decoding process (and one whose size is adjusted dynamically by the kernel). With a redirect, whenever flac read()s stdin for new data, the read goes right to the kernel. With a pipe, the filesystem read may have already been completed by cat.
Axon
post Nov 27 2012, 02:46
Post #13








OP might also try tuning the I/O scheduler; see e.g. Documentation/block/switching-sched.txt, Documentation/block/cfq-iosched.txt, and Documentation/block/deadline-iosched.txt.
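Concretely, the active scheduler can be inspected and switched at runtime through sysfs (sda being a placeholder; which schedulers are available depends on the kernel build, and writing requires root):

```shell
# The active scheduler for a disk is shown in square brackets:
cat /sys/block/sda/queue/scheduler   # e.g. "noop deadline [cfq]"

# Extract just the active one:
sed -n 's/.*\[\(.*\)\].*/\1/p' /sys/block/sda/queue/scheduler

# Switch schedulers on the fly (takes effect immediately, not persistent):
echo cfq > /sys/block/sda/queue/scheduler
```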
skamp
post Nov 27 2012, 02:52
Post #14








QUOTE (Axon @ Nov 27 2012, 01:32) *
You might try `blockdev --setra 65536 --setfra 65536 <device>` to set blockdev/fs readahead to ridiculously high values.


Bingo! With those values, the test on the HDD completed in 16 seconds in all cases! All my drives were set to 256 sectors (128 KiB). I noticed that performance improved dramatically when adjusting that value a single step up (512), and 2048 (1 MiB) sounds like a rather sane value.
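For the record, the readahead value is counted in 512-byte sectors, which is where those KiB figures come from. The udev rule below is only a guess at one way to make the setting persistent, not something caudec configures:

```shell
# Readahead values are in 512-byte sectors:
echo $((256  * 512 / 1024))   # 128  -> the old default, 128 KiB
echo $((512  * 512 / 1024))   # 256  -> one step up, 256 KiB
echo $((2048 * 512 / 1024))   # 1024 -> the "sane" value, 1 MiB

# A possible way to persist it across reboots (assumption: udev-based distro),
# e.g. in /etc/udev/rules.d/60-readahead.rules:
#   ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{bdi/read_ahead_kb}="1024"
```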


Axon
post Nov 27 2012, 19:59
Post #15








Cool.

Note that --setra and --setfra are completely different settings, IIRC. Setting these values too high could compromise performance in other applications, so unless the drive is devoted to music, you should probably tune them down appropriately.

I'm rather curious as to if you can improve performance at the default readahead values by instead tuning CFQ params.
skamp
post Nov 27 2012, 21:59
Post #16








QUOTE (Axon @ Nov 27 2012, 19:59) *
I'm rather curious as to if you can improve performance at the default readahead values by instead tuning CFQ params.


Yes: 23 seconds with CFQ/readahead at 256 (vs. 40 seconds with deadline), 17 seconds with CFQ/readahead at 16384.

I completely forgot that I changed the scheduler to deadline years ago.


Axon
post Nov 28 2012, 01:20
Post #17








QUOTE (skamp @ Nov 27 2012, 14:59) *
QUOTE (Axon @ Nov 27 2012, 19:59) *
I'm rather curious as to if you can improve performance at the default readahead values by instead tuning CFQ params.


Yes: 23 seconds with CFQ/readahead at 256 (vs. 40 seconds with deadline), 17 seconds with CFQ/readahead at 16384.

I completely forgot that I changed the scheduler to deadline years ago.


Yes, I could understand how CFQ would perform better in this sort of workload. But 1 MiB readahead still seems a *tad* too high. I was imagining that tweaking things like /sys/block/*/queue/iosched/slice_idle (or other settings described in cfq-iosched.txt) could help.
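For reference, those tunables live in sysfs per device (sda being a placeholder; writing needs root, and the exact set of files depends on the kernel version):

```shell
# List the CFQ tunables exposed for a disk:
ls /sys/block/sda/queue/iosched/   # slice_idle, quantum, back_seek_max, ...

# slice_idle is in milliseconds; setting it to 0 disables the idle wait
# between requests from the same process:
cat /sys/block/sda/queue/iosched/slice_idle
echo 0 > /sys/block/sda/queue/iosched/slice_idle
```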

This post has been edited by Axon: Nov 28 2012, 01:24
