Song fingerprinting tools, using FooID
odyssey
post Nov 14 2009, 15:04
Post #51





Group: Members
Posts: 2296
Joined: 18-May 03
From: Denmark
Member No.: 6695



It crashes hard, not with the crash-handler. I figured that your tool correctly ignores regular streams, but the problem I experience occurs while using foo_mslive to stream WMA audio. foo_biometric doesn't see these as streams and tries to fingerprint them. Taking one at a time doesn't make foobar2000 crash; it just waits until it times out or something like that. However, take 5 streams or so at the same time, and it will crash after a while.

Did you fix the FLAC-problem?


--------------------
Can't wait for a HD-AAC encoder :P
musicmusic
post Nov 15 2009, 00:00
Post #52


Columns UI developer


Group: Developer
Posts: 3034
Joined: 20-December 02
From: United Kingdom
Member No.: 4177



QUOTE (odyssey @ Nov 14 2009, 14:04) *
It crashes hard, not with the crash-handler.
Not sure if you mean it hangs, or just exits with no error. If it hangs, you can create a dump using Task Manager.

QUOTE (odyssey @ Nov 14 2009, 14:04) *
I figured that your tool correctly ignores regular streams, but the problem I experience occurs while using foo_mslive to stream WMA audio. foo_biometric doesn't see these as streams and tries to fingerprint them. Taking one at a time doesn't make foobar2000 crash; it just waits until it times out or something like that. However, take 5 streams or so at the same time, and it will crash after a while.
Possibly foo_mslive doesn't handle decoding multiple streams at the same time properly. I don't think I have any special handling for streams etc. I'll try to have a look, as you are probably right that it shouldn't bother scanning them.

QUOTE (odyssey @ Nov 14 2009, 14:04) *
Did you fix the FLAC-problem?
As far as I could see, the floating-point output of the FLAC decoder was different in the two cases. fb2k normally decodes to floating point, and FooID takes floats as input, so I'm not sure that I'm doing anything inherently wrong, unless anyone else has any input. foo_bitcompare is happy, though I don't know exactly what it does.


--------------------
.
partneriflight
post Nov 24 2009, 19:52
Post #53





Group: Members
Posts: 1
Joined: 24-November 09
Member No.: 75270



QUOTE (Garf @ Jan 16 2009, 00:15) *
I restored the relevant part of http://foosic.org


So the site seems to be down again. Anywhere I could find matching.zip?

Thanks!
odyssey
post Apr 21 2010, 03:04
Post #54





Group: Members
Posts: 2296
Joined: 18-May 03
From: Denmark
Member No.: 6695



I'm a little puzzled about the results (and/or the threshold levels).

Often it finds a similar track (which it should, IMHO), such as the instrumental track (from a CDM), a different cut or a slightly different version, but at the same time it completely ignores some tracks that should be the exact same track, just on a different album. I then tried lowering the threshold to a minimum of 50%, and the results are mostly the same, except that I now get some large groups with completely different songs in them. It seems to easily confuse electronic tracks and club mixes in particular, which mostly begin with a simple beat.

Can someone clarify?


--------------------
Can't wait for a HD-AAC encoder :P
romor
post Nov 2 2012, 06:46
Post #55





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



Can I revive this a bit? :)

I realize that the developer is unavailable, but maybe someone else can shed some light.

I downloaded Garf's source library out of curiosity, although I'm C illiterate.
From what I can guess browsing it, I assume that the track is downsampled and sliced into a fixed number of parts, and then the dominant harmonic is extracted for each part?

I then converted fingerprint hex string:
CODE
dec = array([int('0x' + d['FINGERPRINT_FOOID'][x:x+2], 16) for x in xrange(0, 424, 2)])

So by splitting the string at every 2nd character and converting each pair to decimal, I get an array. However, I can't get any match between similar tracks using standard correlation tests.
I then made histograms of the tested arrays with 10 and then 16 bins, and ran the same simple correlation tests (Pearson, Spearman, Kendall ...) on the histogram values, but again no luck.

Any tips from someone more knowledgeable?


--------------------
scripts: http://goo.gl/M1qVLQ
foosion
post Nov 2 2012, 10:27
Post #56





Group: FB2K Moderator (Donating)
Posts: 4433
Joined: 24-February 03
Member No.: 5153



Check the definition of the t_fingerprint structure in common.h:
CODE
/*
    fingerprint storage
*/
struct t_fingerprint
{
    /*
        fingerprint version
    */
    short version;
    /*
        length in centiseconds
    */
    int length;
    /*
        average line fit, times 1000
    */
    short avg_fit;
    /*
        average dominant line, times 100
    */
    short avg_dom;
    /*
        spectral fits, 4 bits times 16 bands = 32 times 87 frames
        -> 348 bytes
    */
    unsigned char r[348];
    /*
        spectral doms, 6 bits times 87 frames = 65.25
    */
    unsigned char dom[66];
};

The content of this structure is packed into a byte array in fp_calculate function in fooid.c:
CODE
    memcpy(buff, &(fi->fp.version), sizeof(short));
    buff += sizeof(short);
    memcpy(buff, &(fi->fp.length), sizeof(int));
    buff += sizeof(int);
    memcpy(buff, &(fi->fp.avg_fit), sizeof(short));
    buff += sizeof(short);
    memcpy(buff, &(fi->fp.avg_dom), sizeof(short));
    buff += sizeof(short);
    memcpy(buff, &(fi->fp.r), sizeof(unsigned char) * 348);
    buff += sizeof(unsigned char) * 348;
    memcpy(buff, &(fi->fp.dom), sizeof(unsigned char) * 66);
    buff += sizeof(unsigned char) * 66;

As you can see, different parts of the fingerprint contain different values. If you need more details about the algorithm, you could try to contact Garf. Since he developed FooID during his time at university, he might have written a paper about it. ;)
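
A minimal Python sketch of how the packed byte array could be split back into its fields. It assumes the fields are stored back to back exactly as the memcpy sequence writes them, with a 16-bit short, a 32-bit int and little-endian byte order (typical for x86, but an assumption here). `parse_fingerprint` and the dictionary keys are hypothetical names, not part of libfooid.

```python
import struct

# Assumed layout: no padding between fields, little-endian,
# short = 2 bytes, int = 4 bytes.
FP_FORMAT = "<hihh348s66s"
FP_SIZE = struct.calcsize(FP_FORMAT)  # 2 + 4 + 2 + 2 + 348 + 66 bytes


def parse_fingerprint(raw):
    """Split a packed fingerprint (bytes) into its named fields."""
    if len(raw) != FP_SIZE:
        raise ValueError("expected %d bytes, got %d" % (FP_SIZE, len(raw)))
    version, length, avg_fit, avg_dom, r, dom = struct.unpack(FP_FORMAT, raw)
    return {
        "version": version,
        "length": length,    # centiseconds, per the struct comment
        "avg_fit": avg_fit,  # stored times 1000, per the struct comment
        "avg_dom": avg_dom,  # stored times 100, per the struct comment
        "r": r,              # 348 bytes of packed spectral fits
        "dom": dom,          # 66 bytes of packed spectral doms
    }
```

Under these assumptions the header takes 10 bytes (2 + 4 + 2 + 2), so `r` starts at byte offset 10 and `dom` at offset 358.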


--------------------
http://foosion.foobar2000.org/ - my components for foobar2000
romor
post Nov 2 2012, 12:40
Post #57





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



Thanks foosion :) That was naive of me...
Can I ask for further assistance on how to decode this structure?

For example, take the variable `r`: if I convert each "unsigned char" in [10:358] to its ordinal integer, I don't get the expected results (doing correlation), so I assume that's not how it should be done.
In common.h (as quoted nicely) there is an equation, "4 bits times 16 bands = 32 times 87 frames -> 348 bytes", which I can't make sense of.
4*87 is 348, so if I divide `r` into 87 frames, I'll get 4 bytes per frame, or?

This post has been edited by romor: Nov 2 2012, 12:41


--------------------
scripts: http://goo.gl/M1qVLQ
romor
post Nov 2 2012, 13:41
Post #58





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



OK, I get that I should group every 4 bytes in this subsequence, then convert each char to bits, slice it in 2 and convert each group of 4 bits to an integer, so that I get 16 bins from each 4 bytes. The results still aren't satisfactory.

Never mind.

This post has been edited by romor: Nov 2 2012, 13:42


--------------------
scripts: http://goo.gl/M1qVLQ
romor
post Nov 3 2012, 11:39
Post #59





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



My previous post is not correct. It might make sense if we decoded each character in the FOOID tag, but that's just wrong of course, as the string actually represents a byte stream.

I assume there are 8 (4-bit) bands in 4 bytes. There must be a typo in the code.

Here is my notebook, which I have cleaned up a bit now, in case anyone gets a similar idea: http://nbviewer.ipython.org/url/dl.dropbox...nb/foo_id.ipynb


--------------------
scripts: http://goo.gl/M1qVLQ
foosion
post Nov 6 2012, 18:04
Post #60





Group: FB2K Moderator (Donating)
Posts: 4433
Joined: 24-February 03
Member No.: 5153



According to the code in spectrum.c the r field contains 16 bands per frame, but only 2 bits are used per band.
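
A rough Python sketch of what that layout implies: 16 bands at 2 bits each is 32 bits, i.e. 4 bytes per frame, and 87 frames give the 348 bytes of `r`. The bit order within each byte (lowest two bits first) and the frame-after-frame ordering are assumptions I have not checked against spectrum.c, and `unpack_fits` is a hypothetical name.

```python
FRAMES = 87
BANDS = 16
BYTES_PER_FRAME = 4  # 16 bands * 2 bits = 32 bits


def unpack_fits(r):
    """Expand the packed r field into FRAMES lists of BANDS values (0..3)."""
    if len(r) != FRAMES * BYTES_PER_FRAME:
        raise ValueError("r must be %d bytes" % (FRAMES * BYTES_PER_FRAME))
    frames = []
    for f in range(FRAMES):
        chunk = r[f * BYTES_PER_FRAME:(f + 1) * BYTES_PER_FRAME]
        bands = []
        for byte in chunk:
            # ASSUMPTION: the first band sits in the two lowest bits.
            for shift in (0, 2, 4, 6):
                bands.append((byte >> shift) & 0b11)
        frames.append(bands)
    return frames
```

If the real bit order turns out to be the reverse, only the `shift` tuple needs to change.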


--------------------
http://foosion.foobar2000.org/ - my components for foobar2000
romor
post Nov 6 2012, 19:27
Post #61





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



Thanks foosion. I had discarded that idea because of the 2-bit capacity.

I tried again now, with what I initially thought of: cross-correlating not the values but the histograms. Here, for example, is a plot for all 16 bands (same playlist as in the example): http://i.imgur.com/9Q3tc.png
It was just a silly idea. Things can't work that way.

Coincidentally, just yesterday, I had a tweet about Mel-frequency cepstral coefficients (MFCC). They are used for voice recognition and could also provide genre classification. Computing MFCCs is relatively easy, especially in Python: `import mfcc` ;). Searching further I found some papers, but they relate to classification (machine learning), not to track-to-track comparison, which should be much easier and deducible from these coefficients alone. I'm not there yet, though.



--------------------
scripts: http://goo.gl/M1qVLQ
foosion
post Nov 6 2012, 19:50
Post #62





Group: FB2K Moderator (Donating)
Posts: 4433
Joined: 24-February 03
Member No.: 5153



I remember that Garf talked on IRC about the fingerprint matching algorithm he used on foosic.org. What I don't remember is how he did the matching.


--------------------
http://foosion.foobar2000.org/ - my components for foobar2000
Garf
post Nov 8 2012, 13:37
Post #63


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



I uploaded some Python source that reads and matches fingerprints here:
http://sjeng.org/ftp/fooid.py

If you have specific questions, I'll try to answer.

The fits data represents how "flat" or "spiky" the band is, quantized to a value from 0..3.
romor
post Nov 9 2012, 01:45
Post #64





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



Thanks Garf.

I tried your suggestion, and here is an example result: http://nbviewer.ipython.org/url/dl.dropbox...ynb/fooid.ipynb
It's the same playlist as in my previous post. Decoding is done differently, without the `struct` module, but the results are the same, which I checked just in case.

So I assume you know I'm looking for similar tracks, but result doesn't show any candidates. Do you perhaps have further suggestions?
foosion mentioned you may have some paper written for your fingerprinting - is it so, and is it public?


--------------------
scripts: http://goo.gl/M1qVLQ
Garf
post Nov 12 2012, 22:43
Post #65


Server Admin


Group: Admin
Posts: 4885
Joined: 24-September 01
Member No.: 13



QUOTE (romor @ Nov 9 2012, 01:45) *
So I assume you know I'm looking for similar tracks, but result doesn't show any candidates. Do you perhaps have further suggestions?


I would say that it's working as intended. Only the exact same song, possibly after having gone through lossy compression etc., should match with high confidence.

If you want to search for similar songs, drop the confidence, investigate a large sample, and have a look at the mutually closest matching ones. But in my experience the songs that are judged close won't really sound that similar to a human listener. I think we use different criteria to judge that than the ones libfooid measures, even if they're psycho-acoustically very robust.

There are exceptions - you will likely see "live" recordings match the studio ones fairly well.
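
To illustrate the "mutual closest matching" idea, here is a small Python sketch. It assumes you already have a square matrix of pairwise match scores between 0 and 1 (for instance produced by running a matcher over every pair of tracks). How those scores are computed is outside this sketch, and `mutual_best_matches` and `min_score` are hypothetical names.

```python
def mutual_best_matches(similarity, min_score=0.5):
    """Return (i, j, score) for pairs that are each other's best match.

    similarity is a square list of lists; similarity[i][j] is the match
    score between tracks i and j.
    """
    n = len(similarity)

    def best(i):
        # Index of the highest-scoring track other than i itself.
        others = [j for j in range(n) if j != i]
        if not others:
            return None
        return max(others, key=lambda j: similarity[i][j])

    pairs = []
    for i in range(n):
        j = best(i)
        # Report each pair once (i < j) and only if the choice is mutual.
        if j is not None and j > i and best(j) == i:
            if similarity[i][j] >= min_score:
                pairs.append((i, j, similarity[i][j]))
    return pairs
```

Lowering `min_score` corresponds to dropping the confidence; the mutual requirement filters out tracks that merely happen to be the least bad match for something.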

QUOTE
foosion mentioned you may have some paper written for your fingerprinting - is it so, and is it public?


I have a PowerPoint presentation, but it doesn't go in depth, and it's in Dutch.
ncmaothvez
post Nov 13 2012, 02:53
Post #66





Group: Members
Posts: 1
Joined: 13-November 12
Member No.: 104469



romor, are you using Garf's (?) fooid library as is, or are you trying to rewrite it in another language? If you're using it as is, then you might have run into the same problem I had when I messed around with fooid. It's been nearly a year since I last touched the project, so the details are a bit fuzzy.

The short version :)
As far as I can tell, fooid has a problem detecting the start of some songs. This throws off the analysis, and the resulting fingerprints are too different to indicate a good match, even when two songs sound the same. In some cases the fingerprints have a 0% match even though the songs sound identical. So it could be that your fingerprint comparison algorithm is actually OK but the fingerprints coming from fooid are bad.

fooid will always produce fingerprints with a 100% match if you compare two identical file copies of the same song, though.


The long version :)
Line 101 in fooid.c, downloadable from the Google project page, says:
QUOTE
if (fabs(data[(pos * fid->channels) + c]) >= (1.0f/32768.0f - EPSILON))

and as far as I can remember this detects when the intro silence ends and the song starts. As written, that detection threshold is just a single LSB above absolute silence!

In my case this was way too sensitive. Most duplicates in my song collection were detected properly, but a lot of songs were detected as being 0% similar even though they were perfectly identical sound-wise. The reason was that the sampling of the song data had started at different points in time, due to noise exceeding one LSB before the songs started. I worked with nearly noiseless MP3 and FLAC files, but I suspect that if one tries to compare, let's say, recordings of vinyl records with static noise before the song starts, one would probably end up with a lot more 0% matches.

When I replaced the '1.0f/32768.0f' part with a much higher value (I can't remember how much; I tried several different values, and the threshold is a value between 0.0 and 1.0), the results improved significantly: identical versions of songs (regardless of encoding method) were detected as 95% similar or better, off-vocal versions of songs compared to their on-vocal versions were detected with around 75% similarity, and everything else fell below 50% similarity. Don't quote me on the exact numbers, but I remember seeing very well-defined groups of similarity values.

After changing the threshold I ran into another problem, though: identical versions of songs with a gradually increasing sound level in the intro, rather than a well-defined start beat, would still be detected as 0% identical if the rate of sound level change in the intro differed between the two songs being compared.

I suspect that the threshold detection is the culprit here too. If the rate of volume change during the intro is different, then the song start level is detected at different points in time, and the sampling of the song thus starts at different points in time.
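
A toy Python sketch of the effect described above, not the fooid code itself: it works on a mono list of floats, ignores the EPSILON term, and `find_start` is a hypothetical helper. It only shows how a single LSB of noise in the lead-in moves the detected start when the threshold is 1/32768, while a higher threshold (0.01 here, chosen arbitrarily) gives the same start for both copies.

```python
LSB = 1.0 / 32768.0  # one 16-bit step, the threshold used in the quoted line


def find_start(samples, threshold):
    """Index of the first sample whose magnitude reaches threshold, or None."""
    for i, s in enumerate(samples):
        if abs(s) >= threshold:
            return i
    return None


# Two copies of the same "song": 100 samples of silence, then audio.
# The second copy has a single LSB of noise at sample 40 in the lead-in.
song = [0.5, -0.4, 0.3]
clean = [0.0] * 100 + song
noisy = [0.0] * 40 + [LSB] + [0.0] * 59 + song

start_clean_lsb = find_start(clean, LSB)   # audio start
start_noisy_lsb = find_start(noisy, LSB)   # triggered by the noise sample
start_clean_high = find_start(clean, 0.01)
start_noisy_high = find_start(noisy, 0.01)
```

With the one-LSB threshold the two copies start 60 samples apart, so every later analysis frame is shifted; with the higher threshold both start at sample 100. A fade-in would still defeat a fixed threshold, as noted above.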
romor
post Nov 13 2012, 03:33
Post #67





Group: Members
Posts: 673
Joined: 16-January 09
Member No.: 65630



Thanks ncmaothvez for your input.
I'm not rewriting fooid. I just thought I'd play with the data provided by foo_biometric, and maybe extract features other than same-song matching. In case I find something interesting, I'll be able to provide a general script so that any foobar user can use it (in VBS perhaps).
Garf provided some formulas and I haven't tried anything other than that, but I will try one of these days.

Also, the MFCC approach should be interesting, and I'll get back to that too, as I won't need to scan the whole song to calculate MFCCs and extract song features, just a couple of samples.


--------------------
scripts: http://goo.gl/M1qVLQ