Discussion: Searching
Yuri
2002-08-25 23:34:13 UTC
First, introductions: I'm an Engineering student (B.Eng. Computer Systems)
from Adelaide, Australia, in the third year of my four year course (and
loving it).

With regard to the full-text search engine:

Some feedback would be nice. I've had one or two comments on my schemings on
the dev board, but I'm hoping more people read this list than the boards :) .
I really don't want to write a slab of code only to find out that what I did
wasn't really what was wanted...

The last week or two have been rather hectic, so I haven't gotten around to
slogging through the indexing code for the current system (besides which, I
find Perl scripts amazingly annoying to decrypt :o - the lack of any and all
comments in the source doesn't help...), but from the html doc, the whole
thing looks a little clunky (no offense to anyone/anything, that's
just how it looks)... if someone can just tell me the index's table structure
then I'd be quite grateful ;)
- What I need to know about this is: *exactly* what information is stored in
each table, and in what form?

On the upside, I've put together a near-optimal solution to this particular
problem (I think ;) ) - take a look at the threads on the dev board. (Note: I
worked out that the reindexing issue isn't as bad as I thought it might be -
the index should function admirably with 6Mb of bucket for every million
database inserts or so (a (very) small bit of processing has to be done on
new entries to add them to the index), and the lists can be re-merged every
month or whenever, with very little extra processing or temporary storage
required...)

About the issues raised about the current search:
Firstly, the ascii issue (also the alternate-spellings issue):
This is trivial for things like accents and missing/extra apostrophes:
strip the character down to its base letter for the former, and discard the
shorter string for the latter ( Michael's -> michael, L'Industrie ->
industrie ). If this is done on the requested keyword as well as on the index,
there won't be a problem.
For an added layer of fuzziness, a soundex-like encoding system could
be used, but this would of course require an index encoded in the same way,
and would return more search results, especially for one-word
searches, which may become a server load problem when people want to browse
through them... (comments?)
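
To make that concrete, here's a rough sketch in Python of the kind of
normalisation I mean (purely illustrative - the real thing would be a C
module, and the function name is made up):

    import unicodedata

    def normalise(word):
        # Accents: decompose the character, then drop the combining
        # marks, so e.g. u-with-umlaut becomes a plain u.
        decomposed = unicodedata.normalize("NFD", word)
        base = "".join(c for c in decomposed
                       if not unicodedata.combining(c))
        # Apostrophes: keep only the longest fragment, discarding
        # the shorter string(s).
        longest = max(base.split("'"), key=len)
        return longest.lower()

    assert normalise("Michael's") == "michael"
    assert normalise("L'Industrie") == "industrie"

Run the same thing over both the index and the query, and the two sides
always agree.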
About the odd behaviour of boolean operations in the current search:
I'm not debugging that. Perl diving is a dangerous job, full of nasty
surprises and unforeseen consequences ... if the original author can still
debug it, good luck to him. (Note: I don't have a problem with Perl or with
writing Perl, but reading/debugging Perl isn't something I enjoy.)
About duplicate elimination:
I had kind of assumed that this would be one of the main administrative
uses of the full-text search, as it is a trivial task, especially with a
hashed search as I've described on the dev board. You don't even need any
misspelling correction: the hashed search as described is quick enough to
check the similarity of the whole file, and > 75% word matches are almost
certainly the same thing (assuming the user didn't misspell more than 25% of
the words, that is), so we do:

    for each record in the database:
        if there are matches with more than 50% relevance:
            print a list of matches in order of relevance
            let the user decide what to do

and that's practically a Python program - Perl users take note :)
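
Fleshed out just a little (record shapes and thresholds invented for
illustration, re-using the normalise() sketch from above):

    def word_set(text):
        return {normalise(w) for w in text.split()}

    def relevance(entry, other):
        # fraction of the entry's words that also appear in the
        # other - the >75% word-match test described above
        words = word_set(entry)
        if not words:
            return 0.0
        return len(words & word_set(other)) / len(words)

    def find_duplicates(records, threshold=0.5):
        # records: {discid: full text of the entry} - invented shape
        for discid, text in records.items():
            hits = sorted(((relevance(text, other), oid)
                           for oid, other in records.items()
                           if oid != discid), reverse=True)
            for score, oid in hits:
                if score > threshold:
                    print("%s ~ %s (%.0f%%)" % (discid, oid, score * 100))

(The real version would of course go through the hashed index rather
than comparing every pair.)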


anyway, feedback? please?

-Yuri
Joerg Hevers
2002-08-25 15:36:07 UTC
Hello,
Post by Yuri
The last week or two have been rather hectic, so I haven't gotten around to
slogging through the indexing code for the current system (besides which, I
find Perl scripts amazingly annoying to decrypt :o - the lack of any and all
comments in the source doesn't help...), but from the html doc, the whole
thing looks a little clunky (no offense to anyone/anything, that's
just how it looks)... if someone can just tell me the index's table structure
then I'd be quite grateful ;)
I guess the only person who can tell you is Gerhard Gonter, the
author. He is also on this list - let's hope he answers your mail.
Post by Yuri
On the upside, I've put together a near-optimal solution to this particular
problem (I think ;) )
The solution you described on the board looks great to me - if you
can put it into code and it works as intended ;)
Post by Yuri
strip the character down to it's base letter for the former, and discard the
shorter string for the latter ( Michael's -> michael, L'Industrie ->
indtustrie ). If this is done on the requested keyword as well as the index,
there won't be a problem.
Sounds good. But what will we do about entries in UTF-8, once that's
implemented?
Post by Yuri
I had kind of assumed that this would be one of the main administrative
uses of the full-text search, as it is a trivial task, especially with a
hashed search as I've described on the dev board. You don't even need any
misspelling correction: the hashed search as described is quick enough to
check the similarity of the whole file, and > 75% word matches are almost
certainly the same thing (assuming the user didn't misspell more than 25% of
the words, that is), so we do:
print a list of matches in order of relevance
let the user decide what to do
and that's practically a Python program - Perl users take note :)
The problem with duplicates is that even though something is a
duplicate regarding the titles etc., it is most likely not regarding
the discid and track offsets. This may be because different
pressings of a CD are available or because someone submitted info
for a CD which he burned from MP3s himself. We cannot delete such
duplicates if we want the original CD to be recognized as an exact
match. That's the problem. A good idea would be the possibility to
link several entries (but not with a hardlink in the filesystem, like
the current server software allows) - the track offsets of each entry
should be preserved, but the titles should be the same for all "linked"
entries. I'd imagine that this could be solved quite well if we move
to a relational database.
As for real duplicates: we have a script that checks for them, and we
run it from time to time. Currently we would be able to remove about
3500 duplicates from the database by running this script - but that's
just a fraction of the duplicates with different discids and track
offsets...

btw: you didn't send the email from the address you registered with on
this list - therefore I had to approve it. Approving every post to a
mailing list manually is quite annoying, so I'd like to ask everyone to
make sure that they use the right sender address.

Joerg
t***@home.com
2002-08-25 17:59:46 UTC
Joerg, Yuri,

I think I've got this coming from the right address - if not I
apologize.

I'm Tom, BTW. I'm a C++ developer from Toronto, working on wireless
networking on Windows. And if no one else has already claimed this
task, I'd like to write the C++ (Windows) client code for these new
features.

The issue, which you both raised, of duplicates is very important in my
opinion. Once the search functionality is built into the server it will
be used much more (it appears to me to be the most requested feature),
and duplicates can often make search results quite ugly. For an extreme
example try 'Buena Vista Social Club'. For a more reasonable example
try 'Bob Seger/Greatest Hits'. There are 6 editions of this album
listed in the results (ignoring the `various` entry), where there should
be only two (one of them is a 'bonus tracks' edition). This problem
affects the query command as well.

I like your solution Joerg: linking. I imagine it working something
like this:

For each set of matching albums, one could be randomly chosen as the
master (or choose the one with the most extended data?). A field could
be added to each album that pointed to its master. For the master it
would point to itself or, even better, contain a list of the albums
pointing to it. If the entry is new and this field isn't set yet then
just assume it is its own master.

For new entries this field could be set by the same process that updates
the index. For any two albums, if the # of tracks is equal, and the
track names and lengths are close, then they are the same album. This
would be a time-consuming computation because it could not use the
existing fuzzy query (track lengths can be +-10 sec in different
editions of the same album).
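
Sketched in Python (the record fields are invented, and difflib is just
a stand-in for whatever fuzzy comparison we'd actually use):

    import difflib

    def titles_close(a, b, cutoff=0.8):
        return difflib.SequenceMatcher(None, a.lower(),
                                       b.lower()).ratio() >= cutoff

    def same_album(a, b, slack=10):
        # a, b: albums with .track_titles and .track_lengths (seconds)
        if len(a.track_lengths) != len(b.track_lengths):
            return False
        lengths_ok = all(abs(x - y) <= slack
                         for x, y in zip(a.track_lengths, b.track_lengths))
        names_ok = all(titles_close(x, y)
                       for x, y in zip(a.track_titles, b.track_titles))
        return lengths_ok and names_ok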

The text search index could be significantly shrunk because it would
only include 'master' albums, not 'edition' albums.

If, in the future, we provided the ability to add and view additional
fields thru the web, we would only show the 'master' albums.
Eventually, 'edition' albums could be stripped of all data except tracks
offsets.

The new field (the link) could be implemented as a list of
<genre>/<discID> pairs. For non-masters, the list would only contain
one item.
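
For instance (genre and discids invented):

    # master: its list names all the editions pointing at it
    master  = {"genre": "rock", "discid": "860a9c0b",
               "links": ["rock/7b0a9d0c", "rock/950aa10d"]}

    # non-master 'edition': the list holds just its master
    edition = {"genre": "rock", "discid": "7b0a9d0c",
               "links": ["rock/860a9c0b"]}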

The web search currently returns:
<genre>, <discID>, <artist>, <album title>
for each album. I would suggest that it also return this new link field
so that you don't have to read each discID to figure out which albums to
hide from your user. It would also be nice if it returned the # of
tracks to help in identifying 'bonus track' editions (but I guess that
can be derived from the discID).

Tom.
v***@infoeng.flinders.edu.au
2002-08-27 22:57:08 UTC
OK, I hope this has the correct "From" field...

About UTF-8: it doesn't affect the index, but it does affect the hashing
function. Also, there has to be a decision as to whether to strip some UTF-8
characters down to ascii for misspelling/ascii-client reasons, or leave them
as-is: e.g. a good part of the Latin-1 supplement (0080 - 00FF) and most of
Latin Extended-A (0100 - 017F) consist of "modified" letters which might
often be replaced by their "base" letters in user input. IMHO these should be
converted to their base characters in the fulltext search index, to improve
the number of 'valid' matches. This doesn't mean we try to convert everything
to ascii (that would be silly at best), just that e.g. u with umlaut (00FC)
ends up as a straight u. The combining diacritical marks (0300 - 036F, I
think) can just be dropped. I'm not sure what to do about things like the
control pictures (2400 - 243F) - should these be 'applied' and the result
'read', or are they intended to be 'read' verbatim?
Anyway, for the time being I'll just read 8-bit characters in the hashing,
and I'll make the hashing function a separate module (which is almost
obligatory anyway) so that it can be changed without fuss.
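
Something like this is the seam I have in mind (sketched in Python for
readability - the real module will be C, and the names and constants are
illustrative):

    import unicodedata

    def fold_to_base(text):
        # Latin-1 supplement / Latin Extended-A letters fold to their
        # base letter (u with umlaut -> u); the combining marks
        # (0300 - 036F) are simply dropped.
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(c for c in decomposed
                       if not (0x0300 <= ord(c) <= 0x036F))

    def hash_word(word, buckets=1 << 20):
        # plain FNV-1a over the folded bytes - any decent string hash
        # would do; changing the encoding rules only touches this module
        h = 0x811C9DC5
        for b in fold_to_base(word).lower().encode("utf-8"):
            h = ((h ^ b) * 0x01000193) & 0xFFFFFFFF
        return h % buckets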

Joerg is quite right about database fulltext searches being slow... this is
why fulltext search benchmarks are as rare as hen's teeth, and indexed search
speed is on MySQL's wishlist. Speed-wise, like I said, my method should
be near-optimal :P - there certainly shouldn't be a speed decrease from the
existing html search (if there is, well, :-X I'll just go stick my head in a
bucket of water ... I've got one ready ... hang on a minute ...)

As regards the existing html search, I tried to give it a spin, but it dumps
out with "Can't locate Net/freedb/file.pm" (which I assume is a freedb file
parser from a freedb perl binding?), and I couldn't find any reference to
this in CVS or on the freedb site, and it doesn't seem to be part of the
CDDB::File or Net::freedb perl modules ... (help?)

Linking files with different discids would probably be a good idea (e.g. the
file could consist of one line: "LINK=<discid>", or maybe include track
offsets too ... also, two-way linking might be useful (i.e. the master
containing a record of its 'copies')) - finding the duplicates is the easy
bit once the user has supplied title, artist and track info. The "hard" ;)
bit is what to do about it ... let the user confirm the match, link
automatically if it is a good match, or set some sort of certainty threshold
below which the user chooses? Anyway, this kind of basic database admin is a
slightly different problem, it just happens to be a lot easier with fulltext
searching...
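
(To put numbers on the threshold option - these are plucked out of the air:)

    def decide(score):
        # score: 0.0 - 1.0 certainty from the duplicate check
        if score >= 0.90:
            return "link automatically"
        elif score >= 0.50:
            return "ask the user to confirm"
        return "leave alone"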

For the time being, I'll just write the module in C for ascii, as a
standalone app for easy testing. I'll set it up so that integration into the
server software will consist merely of adding a couple of function calls and
including the module sources. You can probably expect a working prototype in
a few days to a few weeks (I just started work experience, 9-5, 5 days/week,
20 weeks, and I'm still getting used to it after the light hours at Uni),
which I'll put up on CVS or whatever so you can have a tinker with it.
Migration to C++ should be trivial - just a bit of preprocessor work - and I
can do a Java version for the Java server, but that'll be much further down
the track.

Well, keep the suggestions/comments coming...

G'night,
-Yuri
Joerg Hevers
2002-08-27 15:35:31 UTC
Hi,
Post by v***@infoeng.flinders.edu.au
OK, I hope this has the correct "From" field...
Yes, was correct :)
Post by v***@infoeng.flinders.edu.au
About UTF-8: it doesn't affect the index, but it does affect the hashing
function. Also, there has to be a decision as to whether to strip some UTF-8
characters down to ascii for misspelling/ascii-client reasons, or leave them
as-is: e.g. a good part of the Latin-1 supplement (0080 - 00FF) and most of
Latin Extended-A (0100 - 017F) consist of "modified" letters which might
often be replaced by their "base" letters in user input. IMHO these should be
converted to their base characters in the fulltext search index, to improve
the number of 'valid' matches. This doesn't mean we try to convert everything
to ascii (that would be silly at best), just that e.g. u with umlaut (00FC)
ends up as a straight u. The combining diacritical marks (0300 - 036F, I
think) can just be dropped. I'm not sure what to do about things like the
control pictures (2400 - 243F) - should these be 'applied' and the result
'read', or are they intended to be 'read' verbatim?
I don't know enough about UTF-8 to be able to comment on this. :(
Post by v***@infoeng.flinders.edu.au
As regards the existing html search, I tried to give it a spin, but it dumps
out with "Can't locate Net/freedb/file.pm" (which I assume is a freedb file
parser from a freedb perl binding?), and I couldn't find any reference to
this in CVS or on the freedb site, and it doesn't seem to be part of the
CDDB::File or Net::freedb perl modules ... (help?)
Seems like you forgot to get the required stuff from the hyx-tools at
http://sourceforge.net/projects/hyx-tools
The p5-net-freedb package contains the necessary modules. lmd is also
needed - for generating the index.
Post by v***@infoeng.flinders.edu.au
Linking files with different diskids would probably be a good idea (eg. file
could consist of one line: "LINK=<discid>" or maybe include track offsets too
Yes, we should _definitely_ keep the track offsets of the entries to
be linked.
Post by v***@infoeng.flinders.edu.au
The "hard" ;) bit is what to do
about it ... let the user confirm the match, link automatically if it is a
good match, or set some sort of certainty threshold below which the user
chooses? Anyway, this kind of basic database admin is a slightly different
problem, it just happens to be a lot easier with fulltext searching...
I'd say link automatically if the match is "good enough" and the track
offsets are at least "fuzzy matching".
Post by v***@infoeng.flinders.edu.au
For the time being, I'll just write the module in C for ascii, as a
standalone app for easy testing. I'll set it up so that integration into the
server software will consist merely of adding a couple of function calls and
including the module sources. You can probably expect a working prototype in
a few days to a few weeks (I just started work experience, 9-5, 5 days/week,
20 weeks, and I'm still getting used to it after the light hours at Uni),
which I'll put up on CVS or whatever so you can have a tinker with it.
Great :) If you want to use our CVS repository, please give me your
Sourceforge nick and I'll add you as a developer.

Well, it would be great to hear some comments from other people as
well ;)

- Jörg
t***@rogers.com
2002-08-27 18:26:09 UTC
I think I got the address right this time; please tell me if I didn't.

Regarding UTF-8: the proposal to strip down some characters sounds like
a good idea.

Regarding linked entries...
Post by Joerg Hevers
Post by v***@infoeng.flinders.edu.au
The "hard" ;) bit is what to do about it ... let the user confirm the
match, link automatically if it is a good match, or set some sort of
certainty threshold below which the user chooses? Anyway, this kind of
basic database admin is a slightly different problem, it just happens
to be a lot easier with fulltext searching...
I'd say link automatically if the match is "good enough" and the track
offsets are at least "fuzzy matching".
That's fine as long as there is a way for the user to get the linked
albums that weren't returned. For instance, if the album returned by
the search contains a field that lists the albums that link to it.

Also, it should always return the same album of a linked set - i.e. one
album of the set should be designated as the master in the database. If
a user searches for and finds an album, and then searches for and finds
the same album two days later, the client software should be able to
determine that this is the same album.

I'd be happy to describe sample applications that require these
features, if anybody wants me to.

Tom.
Joerg Hevers
2002-08-27 20:09:18 UTC
Hello,
Post by t***@rogers.com
I think I got the address right this time; please tell me if I didn't.
Yes, the address was right :)
Post by t***@rogers.com
Regarding linked entries...
[snip]
Post by t***@rogers.com
Post by Joerg Hevers
I'd say link automatically if the match is "good enough" and
the track offsets are at least "fuzzy matching".
That's fine as long as there is a way for the user to get the linked
albums that weren't returned. For instance, if the album returned by
the search contains a field that lists the albums that link to it.
The idea is to have the data fields saved only once! I.e. such a
linked entry would have the track offsets and then just a link field
that tells which other entry it's linked to - but none of the data
fields. This way we can get exact matches for all the variations of
the entry, but will always have one (hopefully) correct set of data
fields in the master entry. Therefore there are no longer really any
entries that aren't returned.
The idea is good, I think - the only question is whether it is
possible to put this into code and who will do it. This probably
requires a rewrite of major parts of the server software.

- Joerg
t***@rogers.com
2002-08-27 23:20:50 UTC
Post by Joerg Hevers
The idea is to have the data fields saved only once!
[snip]
The idea is good, I think - the only question is whether it is
possible to put this into code and who will do it. This probably
requires a rewrite of major parts of the server software.
Yes, I agree with everything you said.

When I said I wanted to be able to get the other, non-master, entries I
meant the offsets. Any other data that might still be contained in
these non-master entries would be redundant.

I can't personally offer to write the code - I've never used C, Linux,
or any database - but perhaps I could find someone who would do a code
swap with me? Otherwise I can write (C++, Win32) client code, and I
could prototype an algorithm that determines which albums should be
linked and what data the master should have. I could also do manual
'eye-ball' work to verify that the master-album's data is correct for
all the albums I own.

But I've never let the lack of a skill stop me from having and voicing
an opinion! So here are my thoughts on the implementation of this. (I
refer to the link as a field here, based on the existing flat-file
database.)

An initial, partial, implementation wouldn't be too hard, but it
wouldn't save any space - it would take a little more. It would
involve:
(1) Adding the link field to the database. This means changing the
server code to parse and return this new field. (If the field doesn't
exist then the album is its own master). I think Yuri was talking about
adding some new fields, so this new field could be added at the same
time.
(2) Writing code to create the link field for all existing albums, based
on similar songs and offsets. It would also pick the best data for the
master album (eg. the longest extended data field). The first part of
this code would also be used for maintenance: it would run as part of
the index update process, and would determine links for all new albums.
(3) In the search, only adding master albums to the index.
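
For (2), the master-picking part might look something like this (a
rough Python sketch; 'best data' = longest extended data, as suggested
above, and the dict shape is invented):

    def pick_master(group):
        return max(group, key=lambda album: len(album.get("EXTD", "")))

    def assign_links(group):
        # group: albums already judged to be the same album
        master = pick_master(group)
        master["LINK"] = ["%s/%s" % (a["genre"], a["discid"])
                          for a in group if a is not master]
        for album in group:
            if album is not master:
                album["LINK"] = ["%s/%s" % (master["genre"],
                                            master["discid"])]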

A fuller implementation would involve removing everything but the
offsets from all non-master albums, and altering large amounts of the
server code so that it automatically uses the links whenever it
encounters a non-master album.

Logically, the master-album becomes a whole new datatype, distinct from
the album which just contains a discID and offsets. The master-albums
could even get their own, sequential, 32-bit ID number.

These are not conflicting implementations, but stages of the same
implementation.

Tom.
Joerg Hevers
2002-08-29 17:19:40 UTC
Hello,
Post by t***@rogers.com
When I said I wanted to be able to get the other, non-master, entries I
meant the offsets. Any other data that might still be contained in
these non-master entries would be redundant.
Would this really be necessary? I mean, the server software could
answer with the (best-)matching track offsets - exactly as it currently
does. I don't think we need an extra function to get all linked
entries.
Post by t***@rogers.com
I could also do manual
'eye-ball' work to verify that the master-album's data is correct for
all the albums I own.
Well, I guess that's settled then. We can always start a freedb cleanup
campaign - especially if we have a database editing interface on the
website. But this only makes sense once we have the entries linked
properly.
Post by t***@rogers.com
An initial, partial, implementation wouldn't be too hard, but it
wouldn't save any space - it would take a little more. It would
(1) Adding the link field to the database. This means changing the
server code to parse and return this new field. (If the field doesn't
exist then the album is its own master). I think Yuri was talking about
adding some new fields, so this new field could be added at the same
time.
Yes, I think the set number field Yuri proposed would be very helpful.
As for the LINK-field, I don't think we need to give that info to the
users - this can be handled internally without any disadvantages.

- Joerg
t***@rogers.com
2002-08-29 19:11:07 UTC
Post by Joerg Hevers
Would this really be necessary? I mean, the server software could
answer with the (best-)matching track offsets - exactly as it currently
does. I don't think we need an extra function to get all linked
entries.
Let me give some examples.

Example #1: In the spring I wrote an extension to Gnucleus (the
open-source gnutella client). The user selects an album (thru a search
interface) and the program goes and gets all the tracks. It evaluates
search results based on a fuzzy text match and on the playtime of
the search result. To do this playtime comparison I had to get the
offsets of all the editions of an album (I did it by looking at similar
albums in the search results).

Example #2: Now I'm doing a retagger/renamer for MP3 and Ogg files, and
based on the postings I see, I think other people are working on the
same thing (mine will be free and probably open-source). If a set of
files has ID3 tags then I can do a freedb text search. Then I take a
result and see if it matches the names and playtimes of the files. Note
that I might only have some of the tracks in an album. Again, I need
all the possible offsets for the search results.
Post by Joerg Hevers
[snip]
Yes, I think the set number field Yuri proposed would be very helpful.
As for the LINK-field, I don't think we need to give that info to the
users - this can be handled internally without any disadvantages.
Yes, it is better to handle it internally, but it is also more work.

Tom.
Joerg Hevers
2002-08-29 19:22:16 UTC
Hello,
Post by t***@rogers.com
Post by Joerg Hevers
As for the LINK-field, I don't think we need to give
that info to the users - this can be handled internally
without any disadvantages.
Yes, it is better to handle it internally, but it is also more work.
Implementing a new protocol level (which would be required if we
changed the info that is returned to the user) is more work! Handling
it internally shouldn't be difficult - we already do something like
that with the DYEAR and DGENRE fields, which are only presented to
the querying application if it's using protocol level 5 (or higher).
Filtering one line out of what is displayed to users should be
pretty easy.

- Joerg
