RDBMS schema (MySQL)

Hello,

Post by Ryan Fox
I've created a schema for cddb data storage that I submit for
comments/feedback. It is below, followed by my comments. In case it
gets mangled, it's also available at
http://www.amerisuk.com/~rfox/cddb/schema.sql . If you like, you can
plop it into MySQL (version >= 3.23.23) using a command like 'mysql -u
root -p < ./schema.sql' after you save it to a file.

Hmm, I'm not sure if I understand everything correctly (I'm pretty

Post by Ryan Fox
create database cddb;
use cddb;
create table disc (
id int unique auto_increment not null,
length int not null,
discid char(8) not null unique primary key,

You cannot make the discid a unique key - currently the
discid/genre category pair is unique, which is certainly not optimal.
As there are lots of collisions we would loose too many entries by
using that solution - this is simply not feasible. You would need a
new discid for that!

Post by Ryan Fox
revisionlead varchar(255),
revisionversion varchar(255) not null,
revisionrelease varchar(255),
revisionlevel varchar(255),

What are these fields meant for? An entry has a revision, but why 4
different fields titles revision-something?
btw: We also need to store the name of the submitting program and it's
version. Perhaps we should also store when an entry was submitted
(important for generating update-archives (we should still think about
releasing the database archives).

Post by Ryan Fox
artist varchar(255) not null,
title varchar(255) not null,
releaseyear decimal(4),
maingenre char(16) not null,
genre varchar(255) not null,
numtitles int not null,
extendeddata text,

ok.

Post by Ryan Fox
playorder varchar(255),

We don't need to store anything here - the PLAYORDER= line is always
empty.

Post by Ryan Fox
fulltext key(artist), fulltext key(title), fulltext (artist,title),
key(maingenre), key(genre), key(releaseyear), key(discid)
);

Looks like you want to implement search via the database. As database
searches can be pretty slow (especially for left hand wildcard
searches) we should IMHO rather use Yuri's method for search.

Post by Ryan Fox
create table discoffsets (
disc_id int not null,
subid int not null,
offset int not null,
key(disc_id), key(subid)
);
create table disctracks (
disc_id int not null,
number int not null,
name varchar(255) not null,

At least currently track names can have a zero length. This is useful
e.g. for "fake tracks" like on CDs, which have a bonus track on track
99. We should also have a separate field for track artist name - e.g
for samplers.

Post by Ryan Fox
extendeddata text,
key(disc_id), key(name)
);
There are 3 tables. Disc, discoffsets, and disctracks. Disc contains
info such as album title, genre, release year, etc., and there would be
1 record per album. In the interest of data normalization, I think
artist name and genre should have separate tables. This would make it
easier to search for albums by the same artist or genre, as well as use
less disk space. I did not implement the disc table in this fashion
because of the difficulty in converting from the current format to that
normalized format. It would probably be easier if we moved from the
current format to this intermediate sql format, and then moved towards
data normalization in a later version.

For the production system we shouldn't make changes too often - as all
the mirrors have to do them and we need to verify that the changes
were done properly etc.
Question is, if we really need this separate tables, if we use the
method proposed by Yuri for searching.

Post by Ryan Fox
The discoffsets table references an id of an album, and a subid (more on
that in a minute), and an offset. There would be 1 record per album,
per unique variation, per offset. For example, if the cddb database
contained to different versions of album A, there would be 1 record in
the disc table (as all of that info is the same between the 2 albums),
and 1 set of records in the disc offset table (1 per offset) for each
version of the album.

So far so good. But have you thought about how you want to find the
matching entry easily? At least for exact matches the server software
needs to find a certain discid - which may not be the discid of the
master CD. searching in the discoffsets table for that might not be
optimal, I think.

Post by Ryan Fox
Using this method, this should take care of the current linked album
discussion, as well aas move us down a path towards better data storage.
Hopefully I'm not too far off track with this. Please respond with
thoughts/questions/flames/whatever. :)

Well, this were my first thoughts. Now you can flame me or whatever ;)

- Joerg

Ryan Fox

2002-08-29 17:51:32 UTC

Post by Ryan Fox
create table disc (
id int unique auto_increment not null,
length int not null,
discid char(8) not null unique primary key,

Good to know. Actually looking my schema back over, I realize that I
made some mistakes. Sounds like the above lines should be

id int unique auto_increment not null primary key,
length int not null,
discid char(8) not null,

Post by Ryan Fox
revisionlead varchar(255),
revisionversion varchar(255) not null,
revisionrelease varchar(255),
revisionlevel varchar(255),

What are these fields meant for? An entry has a revision, but why 4
different fields titles revision-something?

They're meant for the "# Revision:" line. The format docs i looked at
said that this like could have up to 4 parts, and I figured if we're
storing them, we might as well store them separate. Not a big issue,
either way.

Post by Joerg Hevers
btw: We also need to store the name of the submitting program and it's
version. Perhaps we should also store when an entry was submitted
(important for generating update-archives (we should still think about
releasing the database archives).

Ok. I added the lines

submitter varchar(255),
submitteddate date,

Post by Joerg Hevers
We don't need to store anything here - the PLAYORDER= line is always
empty.

Ok. Removed.

Post by Joerg Hevers
Looks like you want to implement search via the database. As database
searches can be pretty slow (especially for left hand wildcard
searches) we should IMHO rather use Yuri's method for search.

I should have been more clear. If everything was in rdbms, this would
work for the server software, and the web site search.

With MySQL's fulltext, text searches like this are very quick.
(Actually, I'm not sure that MySQL's fulltext supports left hand
wildcards). I wrote a search engine, using mysql as a backend, and
fulltext indexes on the page title and description. Searching a couple
million rows happened in less than a second on a low cost intel
machine. Creating the indexes takes a few days, but that only needs to
happen after the initial import.

Check out
http://www.mysql.com/documentation/mysql/bychapter/manual_Reference.html#Fulltext_Search for some more info.

Post by Joerg Hevers
At least currently track names can have a zero length. This is useful
e.g. for "fake tracks" like on CDs, which have a bonus track on track
99.

That's fine. An empty string in MySQL is different than a NULL value,
so even though the column is defined as not null, it'll accept empty
strings.

Post by Joerg Hevers
We should also have a separate field for track artist name - e.g
for samplers.

Ok. A added an artist field in the tracknames table.

Post by Joerg Hevers
For the production system we shouldn't make changes too often - as all
the mirrors have to do them and we need to verify that the changes
were done properly etc.
Question is, if we really need this separate tables, if we use the
method proposed by Yuri for searching.

I believe Yuri and I's methods are similar, except as he is suggesting
the general idea of using indexes, I am suggesting the specific idea of
MySQL's fulltext indexing. If I'm wrong, please correct me

Eh. I meant to go back and explain the subid field, but it looks like I
never did. That's ok, I got the schema wrong anyway. It's
specification should have been:
subid char(8) not null,
This field would be the unique discid (not to be confused with the
album's index in the disc table) of the version. Thus the example album
with 2 versions would have 1 set of entries in the discoffsets table
with an disc_id of 1, and a subid of '348d8d33'. It's matching sister
album would have entries in the discoffsets table with a disc_id of 1
(same album title, etc), but a subid of 'dc33989a'. Then you can easily
group together albums with more than 1 disc id without data duplication.

I'm not sure what you mean by an 'exact match', but if you give me the
criteria you're searching on, and what you want to return, I can craft
an sql query for this schema that will do that.

Post by Joerg Hevers
Well, this were my first thoughts. Now you can flame me or whatever ;)

They were great. Thank you!
Ryan

Joerg Hevers

2002-08-29 18:51:30 UTC

Hello,

Post by Ryan Fox
id int unique auto_increment not null primary key,
length int not null,
discid char(8) not null,

Why would you need the discid here at all?

Post by Ryan Fox
revisionlead varchar(255),
revisionversion varchar(255) not null,
revisionrelease varchar(255),
revisionlevel varchar(255),

What are these fields meant for? An entry has a revision, but why 4
different fields titles revision-something?

The # Revision: line just contains one integer value of 0 or higher.
You must have misread something. Perhaps you read the part for the
"Submitted via:" line and thought it was about the "Revision:"-line?

Ok. I added the lines
submitter varchar(255),

If you mean the person: privacy issue, should not be stored. If you
mean the program, we might think about storing the program version
separately.

I should have been more clear. If everything was in rdbms, this would
work for the server software, and the web site search.

Performance is most important - there will most likely be lots of
users querying the database at the same time. So we should test this
carefully...

Post by Ryan Fox
That's fine. An empty string in MySQL is different than a NULL value,
so even though the column is defined as not null, it'll accept empty
strings.

/me should have known that.

Eh. I meant to go back and explain the subid field, but it looks like I
never did. That's ok, I got the schema wrong anyway. It's
subid char(8) not null,
This field would be the unique discid (not to be confused with the
album's index in the disc table) of the version. Thus the example album
with 2 versions would have 1 set of entries in the discoffsets table
with an disc_id of 1, and a subid of '348d8d33'. It's matching sister
album would have entries in the discoffsets table with a disc_id of 1
(same album title, etc), but a subid of 'dc33989a'. Then you can easily
group together albums with more than 1 disc id without data duplication.
I'm not sure what you mean by an 'exact match', but if you give me the
criteria you're searching on, and what you want to return, I can craft
an sql query for this schema that will do that.

I can also give you that query string:

SELECT DISTINCT disc_id from discoffsets where subid='xxx'

But I was thinking if it wouldn't be better to have a separate table
for the assignment of discids to our new IDs.
btw: Don't forget that currently the genre category is part of our key
- I don't think that we can leave it out as easily as you probably
thought. So we might need to put the maingenre in the discoffsets
table as well, even though all LINKed entries _should_ be in the same
mein genre category, they most likely are not at the moment and we
might not be able to clean this mess up easily, as we cannot "move"
all entries to one genre because of discid collisions.

- Joerg

Ryan Fox

2002-08-29 19:35:01 UTC

Post by Ryan Fox
id int unique auto_increment not null primary key,
length int not null,
discid char(8) not null,

Why would you need the discid here at all?

I'm certainly not understanding something. Doesn't anyone ever search
on discid?

Post by Ryan Fox
They're meant for the "# Revision:" line. The format docs i looked at
said that this like could have up to 4 parts, and I figured if we're
storing them, we might as well store them separate. Not a big issue,
either way.

Yes, I think i misread it. My fault.

Post by Joerg Hevers
If you mean the person: privacy issue, should not be stored. If you
mean the program, we might think about storing the program version
separately.

I meant the program. Fixing the last problem should take care of this
as well.

Post by Joerg Hevers
Performance is most important - there will most likely be lots of
users querying the database at the same time. So we should test this
carefully...

Yep.

Post by Ryan Fox
That's fine. An empty string in MySQL is different than a NULL value,
so even though the column is defined as not null, it'll accept empty
strings.

/me should have known that.

That's ok. I tested it out just to be sure before replying. :)

Post by Ryan Fox
I'm not sure what you mean by an 'exact match', but if you give me the
criteria you're searching on, and what you want to return, I can craft
an sql query for this schema that will do that.

SELECT DISTINCT disc_id from discoffsets where subid='xxx'

Ok.

Post by Joerg Hevers
But I was thinking if it wouldn't be better to have a separate table
for the assignment of discids to our new IDs.

I had thought of that too, but I don't think there would be any
significant performance difference between the 2 ways, even under high
load.

Post by Joerg Hevers
btw: Don't forget that currently the genre category is part of our key
- I don't think that we can leave it out as easily as you probably
thought. So we might need to put the maingenre in the discoffsets
table as well, even though all LINKed entries _should_ be in the same
mein genre category, they most likely are not at the moment and we
might not be able to clean this mess up easily, as we cannot "move"
all entries to one genre because of discid collisions.

I think my schema is ok in this aspect. The main reason being that all
the tables are linked to the main album via the id index in the disc
table, not the discid. This would allow multiple albums to have the
same discid, and still be uniquely identified by the id index.

Thanks for your feedback. I'm certain I have a lot better feel for the
data structure now then when I started.

Thanks,
Ryan

t***@rogers.com

2002-08-29 19:59:15 UTC

Post by Ryan Fox
id int unique auto_increment not null primary key,
length int not null,
discid char(8) not null,

Why would you need the discid here at all?

I'm certainly not understanding something. Doesn't anyone
ever search on discid?

When someone searches on discid (+ offsets) it finds the entry in the
offsets table and then links to the album.

But I think we need this field (discid) here and that it should be an
array of genre+discid - ie. links into the offsets table. Or, if you
implement the new albumid-discid table that one of you mentioned, it
would need to go both ways. Can the DB do that?

Post by Ryan Fox
I think my schema is ok in this aspect. The main reason
being that all the tables are linked to the main album via
the id index in the disc table, not the discid. This would
allow multiple albums to have the same discid, and still be
uniquely identified by the id index.

That sounds like a good design, but would it work in practise? I guess
a CD-ripper program could work as follows: user inserts CD, a matching
entry is found and presented to user, user rejects this result and
submits a new album. I guess that would work. Good idea.

Tom.

Post by Ryan Fox
Thanks for your feedback. I'm certain I have a lot better
feel for the data structure now then when I started.
Thanks,
Ryan
_______________________________________________
fdb-dev mailing list
http://dtype.org/mailman/listinfo/fdb-dev

Brian K. Hughes

2002-08-29 20:50:29 UTC

Hi all,

I'm a developer for Acoustica, whose CD Burner product is due to be
released this month with freedb support. Here's a question one of you may
be able to answer that's sort of related to the schema discussion (if not,
feel free to tell me to go away and I'll repost with a different subject
*grin*).

I pull a disc ID from the TOC of a CD and go query freedb with it. The DB
gives me two possible choices for the album... both use the same artist,
title, and genre (which is confusing to the user to begin with), but each
has a different disc ID and none of them match the original disc ID. If
the search is based on the disc ID, how is it that A) I'm finding other
records that do not have my disc ID, and B) why do none of the results
returned have my disc ID? Incidentally, if I grab the track times and
calculate the disc ID myself, I get the same ID that's in the TOC of the CD.

Thanks!

----------
Brian K. Hughes
Programmer and Sound Byter
Acoustica.com - "Software should be easy to use!"
Voice: 559-692-2224, Fax: 559-692-2214
Web: www.acoustica.com, www.cdburner.com

----------

Joerg Hevers

2002-08-29 20:55:17 UTC

Hi,

Post by Brian K. Hughes
I'm a developer for Acoustica, whose CD Burner product is due to be
released this month with freedb support. Here's a question one of you may
be able to answer that's sort of related to the schema discussion (if not,
feel free to tell me to go away and I'll repost with a different subject
*grin*).

Well, it doesn't really fit in here, but anyway...

Post by Brian K. Hughes
I pull a disc ID from the TOC of a CD and go query freedb with it. The DB
gives me two possible choices for the album... both use the same artist,
title, and genre (which is confusing to the user to begin with), but each
has a different disc ID and none of them match the original disc ID. If
the search is based on the disc ID, how is it that A) I'm finding other
records that do not have my disc ID, and B) why do none of the results
returned have my disc ID?

The magic of fuzzy matching ;)
The server software includes a method for finding inexact matches.
So if the server doesn't find an exact match, it tries to find inexact
matches and if it finds some, it gives you a list of the possible
matches (return-code 211).

- Joerg

Joerg Hevers

2002-08-29 20:04:58 UTC

Hello,

Post by Ryan Fox
id int unique auto_increment not null primary key,
length int not null,
discid char(8) not null,

Why would you need the discid here at all?

I'm certainly not understanding something. Doesn't anyone ever search
on discid?

Yes, but if you search on discid you need to search through all the
entries, not just the "master" entries! So you need to look in the
trackoffsets table anyway - so why would you need the discid here?

The discid/genre pair must be unique - also for all the variations of
a disc. Otherwise the "cddb read" command won't work anymore. Let's
say you have an entry with master discid 11111111 in genre category
"rock" and a variation with discid 12111111. You have another master
entry with discid 12111111 in genre "rock". Which entry do you return
if someone does a "cddb read rock 12111111". The variation of the
first entry as well as the second entry would match this "cddb read".
We must not change the existing protocol! Backwards-compatibility
should be our main goal!

An interesting question for the implementation might be, how we can do
the transmission of new entries from the master to the mirrors, so
that they keep in sync - but that question can most likely not be
answered in this stage.

- Joerg

t***@rogers.com

2002-08-29 19:36:44 UTC

< create table disc (
< id int unique auto_increment not null,
< length int not null,
< discid char(8) not null unique primary key,

How big is a MySQL int? Is it signed or unsigned? 32bits would be
sufficient for the album id ('id' in your table).

DiscId shouldn't be char(8), it should be a 32bit unsigned integer.

data normalization, I think artist name and genre should have
separate tables. This would make it easier to search for
albums by the same artist or genre, as well as use less disk
space. I did not implement the disc table in this fashion
because of the difficulty in converting from the current
format to that normalized format. It would probably be
easier if we moved from the current format to this
intermediate sql format, and then moved towards data
normalization in a later version.

Having a seperate table for artist would be good, but there are
complications.

For genre though, why don't we simply use a 4 bit code (I think ID3v1
defines enumeration values for these genres). Then, (genre + discID)
could be used as a unique key. You will need to use this key in the
offsets table as well.

This key would be 36 bits, so you might as well make it 64 bits. I know
that windows (C/C++) has a 64bit unsigned integer type, do Linux, Java,
dotNet?

Perhaps we should also store when an entry was submitted (important

for generating update-> > archives (we should still think about
releasing the database archives).

Good idea.

Tom.

t***@rogers.com

2002-08-29 21:28:25 UTC

Sorry, I don't understand - why would we need that here? We
can reference all the information via the id field. All
information that can change between variations should not be
stored in the disc table (i.e. maingenre, length, discid).

I see, you've taken care of it in your discvariations table. I justed
wanted to be sure I could lookup all the offset variations given a disc
(master/album/whatever). For the reasons I mentioned in the 'searching'
E-mails.

create table disc (
id int unique auto_increment not null,
revision int not null,
program varchar(255) not null,
programversion varchar(255),
submitteddate date,
artist varchar(255) not null,
title varchar(255) not null,
releaseyear decimal(4),
genre varchar(255) not null,
numtitles int not null,
extendeddata text,
fulltext key(artist), fulltext key(title), fulltext
(artist,title), key(maingenre), key(releaseyear) );
create table discvariations (
var_id int unique auto_increment not null,
disc_id int not null, (--> see id in table disc)
subid int not null,

Instead of 'disc_id' could we call it 'ablum_id' or something, or else
it'll get confused with the existing DiscID field.

I take it that subid is DiscID. I'm missing something: how is the
freedb read implemented? Doesn't this table (discvariations) need to be
keyed on genre+subid (as well as disc_id)? And then doesn't var_id
become redundent? var_id is better then genre+subid because it is
shorter and less cumbersome, but we'll always need genre+subid.

Tom.

maingenre char(16) not null, (we can use a number
here as well)
length int not null, (length can vary between the variations)
key(disc_id), key(subid)
);
create table discoffsets (
var_id int not null, (--> see var_id in table variations)
number int not null, (we need the track number the offset
belongs to)
offset int not null,
key(var_id), key(number)
);
create table disctracks (
disc_id int not null,
number int not null,
name varchar(255) not null,
artist varchar(255) not null,
extendeddata text,
key(disc_id), key(name)
);
This way we don't even have to define a "master-entry". Disc
information and trackoffsets etc. are clearly separated.
- Joerg
_______________________________________________
fdb-dev mailing list
http://dtype.org/mailman/listinfo/fdb-dev

Joerg Hevers

2002-08-29 22:07:54 UTC