Comparing two cd databases, Need formats for cdinfo and / or regexes
DrDoogie
post Nov 20 2003, 16:53
Post #1








Hi all!

My challenge / problem is this:
1. I have freedb, which I can parse with Perl to output "artist<TAB>album". It has around 140K entries, each of which is a CD.
2. I have "another" database, also in "artist<TAB>album" format, with more than 10K entries.

I want to find out how many of the entries in 1 can also be found in 2. Unfortunately, I have no experience in generating patterns dynamically, nor in using Spell / ISpell (which check for spelling errors).

So I am curious which Perl packages or other tools you would recommend as the approach most likely to succeed.

Database 2 is currently so messy that I can only find around 5% of its entries in 1.

At this stage, I am grateful for any and all suggestions.

PS: I use Linux, so I have access to all the tools available for that platform.

This post has been edited by DrDoogie: Nov 20 2003, 16:54
Jasper
post Nov 21 2003, 10:08
Post #2








Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates, or even clean up the entire table.
DrDoogie
post Nov 21 2003, 23:29
Post #3








QUOTE (Jasper @ Nov 21 2003, 01:08 AM)
Wouldn't it make sense to put all the entries into a single database with MySQL or something similar? Then you could just use a simple SQL query to find duplicates, or even clean up the entire table.

Mmm, I suppose I could use some "case-insensitive 'like'" stuff in MySQL, but why?

Perhaps you don't know what a regular expression is.

Say that you have the name of an artist in two formats:
A. "Mike Oldfield"
B. "Oldfield, Mike"

To match these two, you need a regular expression, for instance this one:
CODE
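# Turn "Last, First" into "First Last"; anchored, and tolerant of missing or extra spaces after the comma
s/^([^,]+),\s*(.+)$/$2 $1/;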
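
To show how that substitution could be used for the actual comparison, here is a minimal sketch that reduces both spellings to one comparison key. The helper name artist_key is made up for this example, and lowercasing and whitespace collapsing are assumptions about what should count as "the same artist":

```perl
use strict;
use warnings;

# Hypothetical helper: reduce an artist name to a comparison key.
sub artist_key {
    my ($name) = @_;
    $name =~ s/^\s+|\s+$//g;                # trim leading/trailing whitespace
    $name =~ s/^([^,]+),\s*(.+)$/$2 $1/;    # "Oldfield, Mike" -> "Mike Oldfield"
    $name =~ s/\s+/ /g;                     # collapse runs of whitespace
    return lc $name;                        # compare case-insensitively
}

print artist_key("Oldfield, Mike") eq artist_key("Mike Oldfield")
    ? "match\n" : "no match\n";
```

Note that a blind swap like this will also reorder names that legitimately contain a comma (band names, for instance), so it probably needs an exception list in practice.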


Also, for various erroneous entries in the album title, I have so far come up with some other patterns, which I read from a file like this:
CODE
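# Each line: pattern<TAB>replacement[<TAB>modifier], each field optionally wrapped in '...'
while (my $line = <album_patterns>) {
       chomp $line;
       next if $line =~ /^$/ || $line =~ /^#/;      # skip blank lines and comments
       my ($pattern, $replacement, $modifier) = split /\t/, $line;
       for ($pattern, $replacement, $modifier) {
               next unless defined;                 # the modifier field may be missing
               s/^'(.*)'$/$1/;                      # strip the surrounding quotes
       }
       $albumPatterns{$pattern} = $replacement;     # $modifier is parsed but not used yet
}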
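
One thing the loop above does not show is how the loaded pairs get applied, given that the replacement strings contain $1, $2 and so on as plain text. A minimal sketch of one way to do it follows; the helper names expand_template and apply_patterns are made up for this example, and it assumes at most nine capture groups and no modifiers:

```perl
use strict;
use warnings;

# Hypothetical helper: fill $1..$9 placeholders in a replacement template.
sub expand_template {
    my ($template, @groups) = @_;
    $template =~ s/\$([1-9])/defined($groups[$1 - 1]) ? $groups[$1 - 1] : ''/ge;
    return $template;
}

# Hypothetical helper: apply every pattern => replacement pair to one title.
sub apply_patterns {
    my ($title, $patterns) = @_;
    for my $pattern (sort keys %$patterns) {    # sorted only to get a stable order
        my $template = $patterns->{$pattern};
        $title =~ s/$pattern/expand_template($template, $1, $2, $3, $4, $5, $6, $7, $8, $9)/e;
    }
    return $title;
}

# One sample pair, written the way it would look after the quotes are stripped.
my %albumPatterns = (
    '[Vv]ol(ume|\.)[\W\s]?(\d*|[a-zA-Z]*)' => '[VOLUMENUMBER $2]',
);
print apply_patterns('Greatest Hits Vol. 2', \%albumPatterns), "\n";
# prints "Greatest Hits [VOLUMENUMBER 2]"
```

Because a hash has no inherent order, patterns that depend on each other (year before yearspan, say) would need an array of pairs instead.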


These are the patterns, though I should note that they are not finished yet. Also, the Unicode setup on my box is broken, so I have to write the patterns in a somewhat clumsy way:
CODE
# year
#'(\D('[1-9]\d|[1-9]\d{3}))'    '[YEAR: $1]'
# yearspan
#'(\D('[1-9]\d|[1-9]\d{3}))(\s*.?\s*)(('[1-9]\d|[1-9]\d{3})\D?)'        '[YEARSPAN: $2$3$5]'
# volumenumber
#'[Vv]ol(ume|\.)[\W\s]?(\d*|[a-zA-Z]*)' '[VOLUMENUMBER $2]'
# volumespan
#'[Vv]ol(\.|ume)?s?[\W\s]+(\w+)(.*[Vv]ol(\.|ume)?s?)?(\W+(\w+))' '[VOLUMESPAN: $2_$6]'
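
For the counting itself (how many entries of one list turn up in the other), a hash lookup on normalized keys is probably enough once the cleanup patterns have done their work. The sketch below is only an illustration: the helper entry_key and the sample data are made up, and the normalization is limited to trimming, whitespace collapsing and lowercasing:

```perl
use strict;
use warnings;

# Hypothetical helper: build a comparison key from one "artist<TAB>album" line.
sub entry_key {
    my ($line) = @_;
    my ($artist, $album) = split /\t/, $line, 2;
    for ($artist, $album) {
        $_ = '' unless defined;    # tolerate lines without a tab
        s/^\s+|\s+$//g;            # trim
        s/\s+/ /g;                 # collapse whitespace
        $_ = lc;                   # ignore case
    }
    return "$artist\t$album";
}

# Made-up sample data standing in for the two parsed databases.
my @freedb = ("Mike Oldfield\tTubular Bells", "Pink Floyd\tAnimals");
my @other  = ("mike  oldfield\tTUBULAR BELLS", "Some Band\tSome Album");

my %seen;
$seen{ entry_key($_) } = 1 for @other;                  # index the second list
my $found = grep { $seen{ entry_key($_) } } @freedb;    # count freedb entries present in it
print "$found of ", scalar(@freedb), " freedb entries found\n";
```

With the real data, the two arrays would be replaced by reading the two files line by line; only the keys of one list need to be held in memory.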


This post has been edited by DrDoogie: Nov 21 2003, 23:30