Kevin Dalley
2003-12-15 09:13:46 UTC
Please comment on my artist / title splitting algorithm.
For the PostgreSQL port of cddbd, I am splitting each DTITLE and
TTITLE into separate artist and title fields. I am trying to maintain
backwards compatibility, so that clients using protocol 6 and below
will see DTITLE as "artist / title", while clients using protocol 7
will see separate DTITLE and DARTIST fields.
Here is the current DBFORMAT excerpt for DTITLE:
DTITLE: Technically, this may consist of any data, but by
convention contains the artist and disc title (in that order)
separated by a "/" with a single space on either side to
separate it from the text. There may be other "/" characters
in the DTITLE, but not with space on both sides, as that
character sequence is exclusively reserved as delimiter of
artist and disc title! If the "/" is absent, it is implied
that the artist and disc title are the same, although in this
case the name should rather be specified twice, separated by
the delimiter. If the disc is a sampler containing titles of
various artists, the disc artist should be set to "Various"
(without the quotes).
When determining DARTIST, I read DTITLE, using the currently available
db_read function.
No occurrence of " / ":
--------------------
If " / " does not appear, I assume that DTITLE is the title, and that
the artist is "". This does not strictly match the DTITLE description
above, but is probably close to reality. If a file has:
DTITLE=Ella Fitzgerald /. The Memorial Album
to use a real life example, then the artist is blank and the title is
"Ella Fitzgerald /. The Memorial Album".
DTITLE=Ella Fitzgerald /. The Memorial Album
DARTIST=
This isn't accurate, but is the best which can be done with an
incorrect entry. Older clients will see exactly what they saw
before. A newer client will see "DARTIST=" and realize that there is
an error. The user will split DTITLE into the correct DTITLE and
DARTIST, and the world will be well. Older clients will then see the
correct:
DTITLE=Ella Fitzgerald / The Memorial Album
Alternatively, I could make DARTIST and DTITLE identical if " / " is
missing. This option is closer to DBFORMAT. Should I take this
approach?
As of the December, 2003 release of freedb, there are over 11,000
DTITLE entries without " / ".
Occurrence of " / "
--------------------
When I find the first " / ", I split everything before " / " into
artist and everything after " / " into title.
DTITLE=Ella Fitzgerald / Ella Swings Lightly
is transformed into
DTITLE=Ella Swings Lightly
DARTIST=Ella Fitzgerald
This is correct.
On the other hand:
DTITLE=AC / DC / ALLES ODER LIGHT
is turned into:
DTITLE=DC / ALLES ODER LIGHT
DARTIST=AC
The good news is that this solution is backwards compatible. An old
client will see the same information as before. When a user with a
new client notices the problem, it will be fixed. Older clients will
then see the correct version:
DTITLE=AC/DC / ALLES ODER LIGHT
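The two cases above can be sketched roughly as follows. This is a small Python illustration, not the cddbd code itself; the name split_dtitle and the identical_if_missing flag are made up for the example. The flag switches between the blank-artist behaviour and the DBFORMAT-style alternative of making artist and title identical.

```python
SEPARATOR = " / "


def split_dtitle(dtitle, identical_if_missing=False):
    """Split a DTITLE value into (artist, title) at the first " / ".

    Hypothetical helper for illustration only; not part of cddbd.
    """
    artist, sep, title = dtitle.partition(SEPARATOR)
    if sep:
        # Everything before the first " / " is the artist; everything
        # after it (including any later " / ") is the title.
        return artist, title
    if identical_if_missing:
        # DBFORMAT reading: artist and title are the same.
        return dtitle, dtitle
    # No delimiter: treat the whole string as the title, blank artist.
    return "", dtitle


print(split_dtitle("Ella Fitzgerald / Ella Swings Lightly"))
print(split_dtitle("AC / DC / ALLES ODER LIGHT"))
print(split_dtitle("Ella Fitzgerald /. The Memorial Album"))
```

The second call returns ("AC", "DC / ALLES ODER LIGHT"), which is the wrong-but-recoverable split described above.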
After the split, DTITLE may contain " / ", but DARTIST may not contain
" / ". This must be prohibited in software. When a user includes
" / " in DARTIST, " / " will be replaced with "/". There are over
8,000 entries with 2 appearances of " / " in DTITLE, with about 15,000
occurrences in TTITLE.
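To make the backwards-compatibility claim concrete, here is a hedged sketch of the reverse direction: rebuilding the combined DTITLE for protocol 6 and older clients. The names clean_artist and join_dtitle are invented for this example, and emitting the title alone when the artist is blank is my assumption about how the old view stays unchanged.

```python
SEPARATOR = " / "


def clean_artist(artist):
    """Collapse " / " inside an artist name to "/" so it cannot be
    mistaken for the artist/title delimiter."""
    return artist.replace(SEPARATOR, "/")


def join_dtitle(artist, title):
    """Rebuild the combined DTITLE shown to older clients (illustrative)."""
    artist = clean_artist(artist)
    if not artist:
        # Assumption: a blank artist means the entry never had a
        # delimiter, so the title is emitted on its own.
        return title
    return artist + SEPARATOR + title


# Before any correction the legacy view is unchanged.
print(join_dtitle("AC", "DC / ALLES ODER LIGHT"))   # AC / DC / ALLES ODER LIGHT
# After a user enters "AC / DC" as the artist, it is stored as "AC/DC".
print(join_dtitle("AC / DC", "ALLES ODER LIGHT"))   # AC/DC / ALLES ODER LIGHT
```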
Handling TTITLE
--------------------
TTITLE is more confusing. I probably need to make some changes here.
First, I will describe my current implementation, then I'll mention
some alternatives.
Option 1 - Current implementation
--------------------
Currently, I look for " / " in each TTITLE,
as with DTITLE above. If " / " appears in TTITLE, then artist and
title are split as in DTITLE. If any artist appears in any track,
then the TARTIST for that track is set. Any tracks which don't have
an artist are assigned the value of DARTIST. This is not backwards
compatible, but might be acceptable anyway. Older clients will see
most of the TTITLEs changed, because DARTIST is added to them.
Option 2 - some TARTIST are blank
--------------------
This is the same as option 1, except that a TTITLE without an artist
is left with a blank TARTIST. This method is backwards compatible. I
suspect that it has some problems, but I'm not sure what they are.
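The difference between options 1 and 2 can be shown with a short sketch. It is only an illustration: split_tracks, legacy_ttitles, and the sample track names are invented, and I assume here that the option 1 fallback applies to every track that lacks an artist.

```python
SEPARATOR = " / "


def split_tracks(ttitles, dartist, inherit_disc_artist):
    """Return one (tartist, ttitle) pair per track.

    inherit_disc_artist=True  -> option 1 (missing artist taken from DARTIST)
    inherit_disc_artist=False -> option 2 (missing artist left blank)
    """
    result = []
    for raw in ttitles:
        artist, sep, title = raw.partition(SEPARATOR)
        if not sep:
            # No delimiter in this TTITLE: the whole string is the title.
            artist = dartist if inherit_disc_artist else ""
            title = raw
        result.append((artist, title))
    return result


def legacy_ttitles(tracks):
    """Rebuild the TTITLE strings an older client would receive."""
    return [a + SEPARATOR + t if a else t for a, t in tracks]


# Made-up sample data.
raw = ["Guest Artist / First Track", "Second Track"]
option1 = split_tracks(raw, "Disc Artist", inherit_disc_artist=True)
option2 = split_tracks(raw, "Disc Artist", inherit_disc_artist=False)

print(legacy_ttitles(option1))  # second TTITLE gains "Disc Artist / "
print(legacy_ttitles(option2))  # identical to the original TTITLEs
```

Under option 1 the round trip changes the second TTITLE, which is the incompatibility mentioned above; under option 2 older clients get back exactly what was stored.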
Option 3 - requiring Various
--------------------
According to DBFORMAT, a disc which has different artists for each
track must have the disc artist set to "Various". Under this
interpretation, any disc artist set to Various would have a different
artist for each track. If " / " appears in the track, the artist would be
found as in DTITLE. If there isn't an artist, then artist is blank.
This method is backwards compatible, but may not be good enough.
There are over 1700 artists starting with "Vari", many of which mean
"various" in some language, covering more than 90,000 titles. There
are even more DTITLEs which should be "Various", but don't start with
"Vari". Of course, these files violate the DBFORMAT standard.
Should discs with multiple artists be forced to have a DARTIST of
"Various"?
There are around 600,000 tracks which have a DARTIST of "Various",
but no additional information in TTITLE.
Multiple artists
--------------------
Each track is allowed to have multiple artists and each artist has a
role which describes what the artist did for the track. When
translating from older protocols, a default role of artist is used.
Here is an example.
TARTIST0.0=Art Tatum
TARTIST0.1=Buddy DeFranco
TARTIST0.2=Red Callender
TARTIST0.3=Bill Douglass
TARTIST0.4=Rudy Vallee
TARTIST1.0=Art Tatum
TARTIST1.1=Buddy DeFranco
TARTIST1.2=Red Callender
TARTIST1.3=Bill Douglass
TARTIST1.4=Richard Rodgers
TARTIST1.5=Lorenz Hart
...
TROLE0.0=piano
TROLE0.1=clarinet
TROLE0.2=bass
TROLE0.3=drums
TROLE0.4=composer
TROLE1.0=piano
TROLE1.1=clarinet
TROLE1.2=bass
TROLE1.3=drums
TROLE1.4=composer
TROLE1.5=lyrics
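For anyone who wants to experiment with this layout, here is a rough sketch of how a client might group the TARTISTn.m and TROLEn.m lines per track. The function name parse_track_credits and the regular expression are my own illustration, not part of any protocol definition; a missing role falls back to the default role of artist.

```python
import re
from collections import defaultdict

# Matches lines such as "TARTIST0.4=Rudy Vallee" or "TROLE1.5=lyrics".
FIELD = re.compile(r"^(TARTIST|TROLE)(\d+)\.(\d+)=(.*)$")


def parse_track_credits(lines, default_role="artist"):
    """Group credit lines into {track: [(artist, role), ...]} (illustrative)."""
    artists = defaultdict(dict)
    roles = defaultdict(dict)
    for line in lines:
        match = FIELD.match(line)
        if not match:
            continue  # skip anything that is not a credit line
        kind, track, index, value = match.groups()
        target = artists if kind == "TARTIST" else roles
        target[int(track)][int(index)] = value
    credits = {}
    for track, by_index in artists.items():
        # Pair each artist with the role at the same index, if any.
        credits[track] = [
            (by_index[i], roles[track].get(i, default_role))
            for i in sorted(by_index)
        ]
    return credits


sample = [
    "TARTIST0.0=Art Tatum",
    "TARTIST0.4=Rudy Vallee",
    "TARTIST1.0=Art Tatum",
    "TROLE0.0=piano",
    "TROLE0.4=composer",
]
print(parse_track_credits(sample))
```

In the sample, track 1 has no TROLE line, so Art Tatum is given the default role there.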
Currently, my implementation only allows one artist for DARTIST, but
I think I will change that. Does anyone have opinions on that topic?
--
Kevin Dalley
***@kelphead.org