Yuri
2002-08-25 23:34:13 UTC
First, introductions: I'm an Engineering student (B.Eng. Computer Systems)
from Adelaide, Australia, in the third year of my four year course (and
loving it).
With regard to the full-text search engine:
Some feedback would be nice. I've had one or two comments on my schemings on
the dev board, but I'm hoping more people read this list than the boards :) .
I really don't want to write a slab of code to find out what I did wasn't
really what was wanted.....
The last week or two have been rather hectic, so I haven't gotten around to
slogging through the indexing code for the current system (besides which, I
find Perl scripts amazingly annoying to decrypt :) - the lack of any and all
comments in the source doesn't help...) but from the HTML doc, the whole
thing looks a little clunky (no offense to anyone/anything, that's
just how it looks)... if someone can just tell me the index's table structure
then I'd be quite grateful ;)
- What I need to know about this is: *exactly* what information is stored in
each table, and in what form?
On the upside, I've put together a near-optimal solution to this particular
problem (I think ;) ) - take a look at the threads in the dev board. (Note: I
worked out that the reindexing issue isn't as bad as I thought it might be -
The index should function admirably with 6Mb of bucket for every million
database inserts or so (a (very) small bit of processing has to be done on
new entries to add them to the index), and the lists can be re-merged every
month or whenever, with very little extra processing or temporary storage
requirement...)
About the issues raised about the current search:
Firstly, the ASCII issue (also the alternate-spellings issue):
This is trivial for things like accents and missing/extra apostrophes:
strip the character down to its base letter for the former, and discard the
shorter string for the latter ( Michael's -> michael, L'Industrie ->
industrie ). If this is done on the requested keyword as well as the index,
there won't be a problem.
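To make the idea concrete, here's a rough sketch of that normalization in Python. The exact rules (lowercasing, keeping the longest apostrophe-separated fragment) are my own assumptions about how it'd be done, not a spec:

```python
import unicodedata

def normalize(word):
    """Strip accents and possessive apostrophes from a keyword.

    A sketch of the normalization described above: accented characters
    are reduced to their base letter, and for apostrophes we keep the
    longer fragment and discard the shorter one.
    """
    # Decompose accented characters and drop the combining marks,
    # e.g. 'e with acute accent' -> 'e'.
    decomposed = unicodedata.normalize("NFKD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep the longest fragment around any apostrophes:
    # Michael's -> michael, L'Industrie -> industrie.
    parts = stripped.split("'")
    longest = max(parts, key=len)
    return longest.lower()
```

Running the same function over both the index terms and the query keyword is what guarantees they can't disagree.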
For an added layer of fuzziness, a soundex-like encoding system could
be used, but this would of course require an index encoded in the same way,
and would return more search results (of course), especially for one-word
searches, which may become a server load problem when people want to browse
through them.. (comments?)
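For reference, one candidate encoding is classic Soundex (first letter plus three digits); this is just an illustration of the kind of function that would have to be applied identically at index time and query time:

```python
def soundex(word):
    """Classic Soundex code: first letter plus up to three digits.

    One possible "soundex-like" encoding; any scheme would do, as
    long as the index is built with the same function used on queries.
    """
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    if not word:
        return ""
    encoded = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        # Only emit a digit when it differs from the previous code,
        # so runs of similar consonants collapse to one digit.
        if digit and digit != prev:
            encoded += digit
        if ch not in "hw":  # h and w don't reset the previous code
            prev = digit
    return (encoded + "000")[:4]
```

Because many distinct words share a code (Robert and Rupert both map to R163), a soundex index necessarily returns more hits per lookup, which is exactly the server-load concern above.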
About the odd behaviour of boolean operations in the current search:
I'm not debugging that. Perl diving is a dangerous job full of nasty
surprises and unforeseen consequences ... if the original author can still
debug it, good luck to him. (Note: I don't have a problem with Perl or
writing Perl, but reading/debugging Perl isn't something I enjoy)
About the duplicate elimination:
I had kind of assumed that this would be one of the main administrative
uses of the full text search, as it is a trivial task, especially with a
hashed search as I've described on the dev board. You don't even need any
misspelling correction: the hashed search as described is quick enough to
check the similarity of the whole file, and > 75% word matches are almost
certainly the same thing (assuming the user didn't misspell more than 25% of
the words that is) so we do
for each record in the database:
    if there are matches with more than 50% relevance:
        print a list of matches in order of relevance
        let the user decide what to do
and that's practically a Python program - Perl users take note :)
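In fact, here's roughly what it looks like as real Python. This is a naive sketch using plain word sets and an all-pairs comparison; in the actual system the word sets would come straight out of the hashed index rather than being rebuilt per record, and the threshold values are just the ones mentioned above:

```python
def word_overlap(text_a, text_b):
    """Fraction of the smaller record's words found in the other.

    Stand-in for the hashed-index similarity check: > 0.75 here
    corresponds to the "75% word matches" near-certain-duplicate case.
    """
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a or not words_b:
        return 0.0
    shared = words_a & words_b
    return len(shared) / min(len(words_a), len(words_b))

def find_duplicates(records, threshold=0.5):
    """List candidate duplicate pairs above the relevance threshold,
    most similar first, for a human to make the final call on."""
    matches = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = word_overlap(records[i], records[j])
            if score > threshold:
                matches.append((score, records[i], records[j]))
    matches.sort(reverse=True)  # order of relevance
    return matches
```

The point stands that the hard part is the index, not this loop: once similarity lookups are cheap, duplicate elimination is a report plus a human decision.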
anyway, feedback? please?
-Yuri