Mar 31 2010, 02:27
Joined: 16-October 03
Member No.: 9337
I have only now become aware of Gregory S. Chudov's effort to develop CTDB (CUETools DB). I am very excited about this, as I have been suggesting it to spoon (dbpoweramp's developer) for over 5 years.
For others who have missed it:
What's it for?
You have probably heard about AccurateRip, a wonderful database of CD rip checksums, which helps you make sure your CD rip is an exact copy of the original CD. What it can tell you is how many other people got the same data when copying this CD. CUETools Database is an extension of this idea.
What are the advantages?
* The most important feature is the ability not only to detect, but also to correct small numbers of errors that occurred in the ripping process.
* It's free of the offset problems. You don't even need to set up offset correction for your CD drive to be able to verify and, more importantly, submit rips to the database. The database doesn't care about pressings: different pressings of the same CD are treated as the same disc.
* Verification results are easier to deal with. There are exactly three possible outcomes: rip is correct, rip contains correctable errors, rip is unknown (or contains errors beyond repair).
* If there's a match, you can be certain it's really a match, because in addition to the recovery record the database uses a well-known CRC32 checksum of the whole CD image (except for 10*588 offset samples in the first and last seconds of the disc). This checksum is used as a rip ID in CTDB.
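To make the last point concrete, here is a minimal sketch of a CRC32 over a disc image with the edge samples left out. It assumes raw 16-bit stereo PCM (4 bytes per sample) and that "10*588 offset samples" means dropping 5880 samples from each end; the function name is made up for illustration, and the exact byte range CTDB uses may differ.

```python
import zlib

BYTES_PER_SAMPLE = 4        # 16-bit stereo: 2 channels * 2 bytes (assumed layout)
SKIP_SAMPLES = 10 * 588     # samples excluded at each end, per the description above


def trimmed_crc32(pcm: bytes) -> int:
    """CRC32 of a raw PCM image with SKIP_SAMPLES dropped from both ends.

    Hypothetical helper, not the CUETools implementation.
    """
    skip = SKIP_SAMPLES * BYTES_PER_SAMPLE
    if len(pcm) <= 2 * skip:
        raise ValueError("image is too short to trim")
    return zlib.crc32(pcm[skip:len(pcm) - skip]) & 0xFFFFFFFF
```

Two images that differ only inside the skipped edge regions give the same value under this sketch.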
What are the downsides and limitations?
* CUETools DB doesn't bother with tracks. Your rip as a whole is either good/correctable, or it isn't. If one of the tracks is damaged beyond repair, CTDB cannot tell which one.
* If your rip contains errors, the verification/correction process will involve downloading about 200 KB of data, which is much more than it takes for AccurateRip.
* The verification process is slower than with AR.
* The database was just born and at the moment contains far fewer CDs than AR.
How many errors can a rip contain and still be repairable?
* That depends. The best case scenario is when there's one continuous damaged area up to 30-40 sectors (about half a second) long.
* The worst case scenario is 4 non-continuous damaged sectors in (very) unlucky positions.
What information does the database contain for each submission?
* CD TOC (Table Of Contents), i.e. length of every track.
* Offset-finding checksum, i.e. a small (16 byte) recovery record for a set of samples throughout the CD, which makes it possible to detect the offset difference between the rip in the database and your rip, even if your rip contains some errors.
* CRC32 of the whole disc (except for some lead-in/lead-out samples).
* Submission date, artist, title.
* 180 KB recovery record, which is stored separately and accessed only when verifying a broken rip or repairing it.
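The list above could be pictured as a record like the one below. This is only an illustration: the class name, field names, and units are made up, and this is not the actual CTDB schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class CtdbEntry:  # hypothetical name, not the real CTDB schema
    track_lengths: List[int]   # TOC: length of every track (sectors assumed as unit)
    offset_checksum: bytes     # 16-byte offset-finding record
    crc32: int                 # CRC32 of the whole disc, also used as the rip ID
    submitted: date
    artist: str
    title: str
    # The ~180 KB recovery record is kept separately and fetched only on demand.

    def __post_init__(self) -> None:
        if len(self.offset_checksum) != 16:
            raise ValueError("offset checksum must be 16 bytes")
```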
Mar 31 2010, 18:46
Joined: 2-October 08
Member No.: 59035
Of course CTDB is open, and the code required to use it is LGPLed like all CUETools libraries. The only problem is that it's in C#; I wonder if I will have to provide a .dll with a C interface at some point. The algorithm is not very simple, and there's quite a lot of code.
The basic algorithm is a Reed-Solomon code on 16-bit words. Unfortunately, 32-bit Reed-Solomon is extremely slow, and 16-bit Reed-Solomon can be used directly only on chunks of up to 64k words == 128 kbytes. So I have to process the file as a matrix with rows of 10 sectors (5880 samples == 11760 words/columns). Such a matrix has up to ~30000 rows for a 70 minute CD, so I can use 16-bit Reed-Solomon for each column independently. Using the same notation as in Wikipedia, it's a (65536,65528) code, which produces 8 words for each column. So the total size is 8*11760*16bit = 188160 bytes.
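The numbers in the paragraph above can be re-derived in a few lines. This is just the arithmetic, not CUETools code; the 75 sectors per second figure is the standard audio CD rate, and the constant names are mine.

```python
SAMPLES_PER_SECTOR = 588
WORDS_PER_SAMPLE = 2       # one 16-bit word per channel, stereo
SECTORS_PER_SECOND = 75    # standard audio CD rate
SECTORS_PER_ROW = 10
PARITY_WORDS = 8           # recovery words produced per column
BYTES_PER_WORD = 2

columns = SECTORS_PER_ROW * SAMPLES_PER_SECTOR * WORDS_PER_SAMPLE
rows_70_min = 70 * 60 * SECTORS_PER_SECOND // SECTORS_PER_ROW
record_bytes = PARITY_WORDS * columns * BYTES_PER_WORD

# Each column (data rows plus parity words) must fit in one 64k-word codeword.
assert rows_70_min + PARITY_WORDS <= 65536

print(columns, rows_70_min, record_bytes)  # 11760 31500 188160
```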
An N-word recovery record can detect and correct up to N/2 erroneous words, so this 8-word recovery record can correct up to 4 errors in each column. N cannot be much smaller, but it also cannot be much larger, because processing time grows proportionally to N, so N=8 was chosen as the highest value which is still "fast enough" - close to FLAC decoding speed.
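A rough way to see how the row layout and the N/2 limit combine is the toy model below. It is not the real decoder. It assumes damaged sectors line up exactly with the 10-sector row grid (no offset) and that a damaged sector corrupts every word it covers, so two damaged sectors hit the same columns whenever their indices are equal modulo 10.

```python
from collections import Counter

SECTORS_PER_ROW = 10
MAX_ERRORS_PER_COLUMN = 8 // 2   # N/2 with N = 8


def is_repairable(damaged_sectors) -> bool:
    """Toy model: True if no column sees more than N/2 damaged rows."""
    # Sectors with the same slot (index mod 10) occupy the same columns,
    # one row each, so count distinct damaged sectors per slot.
    per_slot = Counter(s % SECTORS_PER_ROW for s in set(damaged_sectors))
    return all(n <= MAX_ERRORS_PER_COLUMN for n in per_slot.values())


print(is_repairable(range(1000, 1040)))     # True: 40-sector burst, 4 rows per slot
print(is_repairable([5, 15, 25, 35, 45]))   # False: 5 scattered sectors share a slot
```

Under these assumptions a continuous burst of 40 sectors still fits within the limit, while a fifth damaged sector landing in an already full slot does not.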
Row size doesn't have such an impact on performance, so it can be easily extended in the future, so that popular CDs can have larger recovery records. The current size was chosen so that if the database contained as many entries as AccurateRip, it would fit on a 1TB drive. I also took into account that making records larger only helps in the best-case scenario when the damage is sequential (scratches etc). When damage occurs at random points, fixing it requires a larger N, not a larger row size, but this has a performance impact. So the current record size was chosen to be more or less balanced.
Is there a point in better identification of where the damage is, when the database is unable to fix it?
Discs don't have to pass AR before being added to the CTDB. AR is used only as a kind of proof that there is a physical CD with such content when adding with CUETools.
CD rippers can add CDs to CTDB even if AR doesn't know them. There are already a number of CDs in the database submitted by CUERipper, and some of them have confidence 1, which means they didn't pass the AR check or weren't found in AR.
This post has been edited by Gregory S. Chudov: Mar 31 2010, 19:03