Home All Groups Group Topic Archive Search About

Quick question about rowguid column



Author
2 Mar 2005 8:43 PM
WB
Is there any issue with assigning the primary key to a rowguid column?
Thoughts?

WB

Author
2 Mar 2005 10:25 PM
pdxJaxon
Yes.....

1. Creates a VERY Wide Index  that is inefficient (Compared to INT for
example)

2. IF this is also the Clustered Index (I recommend against), Then ALL
Indexes must propogate this GUID in their lookup tables

3. Clustering on GUID can cause page splits,etc

Company I recently worked at just added an identity column to all tables in
DB and clustered on the Ident instead of the GUID. Inserts of records sped
up tremendously. I cant quote the exact number of records, bu the "Process"
went form 17 minutes to literally seconds....


Food for thought

Some companies (My current employer for example) want to use a GUID type of
identifier as opposed to integers for whatever reason. So what they do, to
ensure that the GUID values are Monotonically increasing in value (to avoid
page splits), is the create their own custom "GUID" type data generator in
which the algorithm ensure incrementing values.


Greg Jackson
PDX, Oregon




Show quoteHide quote
"WB" <none> wrote in message news:OOk1gk2HFHA.2456@TK2MSFTNGP09.phx.gbl...
> Is there any issue with assigning the primary key to a rowguid column?
> Thoughts?
>
> WB
>
>
Are all your drivers up to date? click for free checkup

Author
17 Mar 2005 2:57 PM
Anthony Thomas
Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or
GUID.  Usually, the Business Key the surrogate is proxiing for is a better
Cluster Index candidate.

Sincerely,


Anthony Thomas


--

"pdxJaxon" <GregoryAJack***@Hotmail.com> wrote in message
news:uaC7za3HFHA.1948@TK2MSFTNGP14.phx.gbl...
Yes.....

1. Creates a VERY Wide Index  that is inefficient (Compared to INT for
example)

2. IF this is also the Clustered Index (I recommend against), Then ALL
Indexes must propogate this GUID in their lookup tables

3. Clustering on GUID can cause page splits,etc

Company I recently worked at just added an identity column to all tables in
DB and clustered on the Ident instead of the GUID. Inserts of records sped
up tremendously. I cant quote the exact number of records, bu the "Process"
went form 17 minutes to literally seconds....


Food for thought

Some companies (My current employer for example) want to use a GUID type of
identifier as opposed to integers for whatever reason. So what they do, to
ensure that the GUID values are Monotonically increasing in value (to avoid
page splits), is the create their own custom "GUID" type data generator in
which the algorithm ensure incrementing values.


Greg Jackson
PDX, Oregon




Show quoteHide quote
"WB" <none> wrote in message news:OOk1gk2HFHA.2456@TK2MSFTNGP09.phx.gbl...
> Is there any issue with assigning the primary key to a rowguid column?
> Thoughts?
>
> WB
>
>
Author
17 Mar 2005 3:41 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:uPDlrHwKFHA.3340@TK2MSFTNGP14.phx.gbl...
> Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or
> GUID.  Usually, the Business Key the surrogate is proxiing for is a better
> Cluster Index candidate.

    There's no such thing as "usually".  There are plenty of applications
for either type of clustering key -- and clustering on a sequential integer
can be especially beneficial in many situations.  You should be very careful
with that assumption.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
17 Mar 2005 4:23 PM
Anthony Thomas
Name one.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OYusbfwKFHA.3132@TK2MSFTNGP12.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:uPDlrHwKFHA.3340@TK2MSFTNGP14.phx.gbl...
> Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or
> GUID.  Usually, the Business Key the surrogate is proxiing for is a better
> Cluster Index candidate.

    There's no such thing as "usually".  There are plenty of applications
for either type of clustering key -- and clustering on a sequential integer
can be especially beneficial in many situations.  You should be very careful
with that assumption.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
17 Mar 2005 4:59 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:%23gGre3wKFHA.1176@TK2MSFTNGP12.phx.gbl...
> Name one.

    Inserting data.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
18 Mar 2005 8:27 AM
Anthony Thomas
Yes, and Inserting Data in Business Key order shouldn't cause an inordinate
amount of page splits if you have your fill factors and database maintenance
routines set correctly.  And, having the Cluster Index on the Business Key
will far outweigh any split issues compared to any scans you may encounter
and gather the covering aspects of the clustered index plus it is typlically
the Business Key that most range queries are conducted on.

Care to try again?  Are you really telling me that page splits is the BEST
reason you can come up with?  Please.  Are you really telling me that was
the whole design goal with having Clustered Indexes over Heaps to begin
with?  You've got to be kidding me.

Can I get a witness?

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OuxM5KxKFHA.4052@tk2msftngp13.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:%23gGre3wKFHA.1176@TK2MSFTNGP12.phx.gbl...
> Name one.

    Inserting data.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
18 Mar 2005 3:32 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl...
>
> Can I get a witness?


    No.  Never worked in a 24-7 (4-nines, and we're trying for 5 this year),
high-volume OLTP environment, have you?  Good luck maintaining those fill
factors when you have no maintenence window.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
18 Mar 2005 5:26 PM
Anthony Thomas
Would agree that the DBREINDEX can be troublesome in highly available
systems.  But the fill factors should be static from a design perspective.
Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive
table access only on the table it is currently running against.  Running the
Cluster Index rebuild WITH DROP EXISTING can speed up the process.

But then, again, that is a management issue.  The 5 9's do NOT refer to
TOTAL TIME, but to SLA time.  And, EVERY SYSTEM HAS TO HAVE AN SLA, which
would include a maintenance window.  If you don't have an APPROPRIATE SLA,
then your 5 9's don't mean very much.  But, yes, we maintain 5, and even 6,
9's of availability, but properly measured against reasonable SLAs that have
actually been documented and constructed with the various Business Units
that are demanding the system availabilities.

I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as
delusional as the Business Units may be, then sacrificing performance over
page splits would be one reason to not care about the Cluster Index
placement.  But, then, why bother and just use a HEAP.  You're not going to
get much better performance from an ill-chosen Clustered Index. Some, but
not much.

Sincerely,


Anthony Thomas


--

  "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:%23Ra$B$8KFHA.2252@TK2MSFTNGP15.phx.gbl...
  "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
  news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl...
  >
  > Can I get a witness?


      No.  Never worked in a 24-7 (4-nines, and we're trying for 5 this
year),
  high-volume OLTP environment, have you?  Good luck maintaining those fill
  factors when you have no maintenence window.


  --
  Adam Machanic
  SQL Server MVP
  http://www.datamanipulation.net
  --
Author
18 Mar 2005 5:37 PM
Anthony Thomas
Since you brought it up...well, and I brought it up, the topic of SLAs and
availability metrics is an interestin topic in its own right.

Craig Mullins has some good articles on the topic you should check out.

http://www.craigsmullins.com/dbta_006.htm

Sincerely,


Anthony Thomas


--

"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl...
Would agree that the DBREINDEX can be troublesome in highly available
systems.  But the fill factors should be static from a design perspective.
Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive
table access only on the table it is currently running against.  Running the
Cluster Index rebuild WITH DROP EXISTING can speed up the process.

But then, again, that is a management issue.  The 5 9's do NOT refer to
TOTAL TIME, but to SLA time.  And, EVERY SYSTEM HAS TO HAVE AN SLA, which
would include a maintenance window.  If you don't have an APPROPRIATE SLA,
then your 5 9's don't mean very much.  But, yes, we maintain 5, and even 6,
9's of availability, but properly measured against reasonable SLAs that have
actually been documented and constructed with the various Business Units
that are demanding the system availabilities.

I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as
delusional as the Business Units may be, then sacrificing performance over
page splits would be one reason to not care about the Cluster Index
placement.  But, then, why bother and just use a HEAP.  You're not going to
get much better performance from an ill-chosen Clustered Index. Some, but
not much.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:%23Ra$B$8KFHA.2252@TK2MSFTNGP15.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl...
>
> Can I get a witness?


    No.  Never worked in a 24-7 (4-nines, and we're trying for 5 this year),
high-volume OLTP environment, have you?  Good luck maintaining those fill
factors when you have no maintenence window.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
18 Mar 2005 5:39 PM
Anthony Thomas
http://www.dbta.com/columnists/craig_mullins/dba_corner_0902.html

http://www.craigsmullins.com/dbta_026.htm


--

"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:OH8L1F%23KFHA.2748@TK2MSFTNGP09.phx.gbl...
Since you brought it up...well, and I brought it up, the topic of SLAs and
availability metrics is an interestin topic in its own right.

Craig Mullins has some good articles on the topic you should check out.

http://www.craigsmullins.com/dbta_006.htm

Sincerely,


Anthony Thomas


--

"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl...
Would agree that the DBREINDEX can be troublesome in highly available
systems.  But the fill factors should be static from a design perspective.
Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive
table access only on the table it is currently running against.  Running the
Cluster Index rebuild WITH DROP EXISTING can speed up the process.

But then, again, that is a management issue.  The 5 9's do NOT refer to
TOTAL TIME, but to SLA time.  And, EVERY SYSTEM HAS TO HAVE AN SLA, which
would include a maintenance window.  If you don't have an APPROPRIATE SLA,
then your 5 9's don't mean very much.  But, yes, we maintain 5, and even 6,
9's of availability, but properly measured against reasonable SLAs that have
actually been documented and constructed with the various Business Units
that are demanding the system availabilities.

I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as
delusional as the Business Units may be, then sacrificing performance over
page splits would be one reason to not care about the Cluster Index
placement.  But, then, why bother and just use a HEAP.  You're not going to
get much better performance from an ill-chosen Clustered Index. Some, but
not much.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:%23Ra$B$8KFHA.2252@TK2MSFTNGP15.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl...
>
> Can I get a witness?


    No.  Never worked in a 24-7 (4-nines, and we're trying for 5 this year),
high-volume OLTP environment, have you?  Good luck maintaining those fill
factors when you have no maintenence window.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
18 Mar 2005 9:34 PM
Adam Machanic
You should probably read Kimberly Tripp's Q&A site:

http://www.sqlskills.com/ConsolidatedQA.asp

This will clarify a lot of things for you, I think.  Heaps are certainly not a good choice for insert performance, as gaps will be filled if rows are deleted.  And I'm confused about your comment regarding sacrificing performance -- clustering on an IDENTITY or other sequential key will do the opposite in many cases.  Insert, and in many cases read performance will both benefit.  I've run extensive tests to prove this (I'm a load testing fanatic) and you'll find upon reading that web page that Kimberly apparently agrees with me.  I'm not sure what basis your arguments have, but you may want to run some tests for yourself.

Thanks for the links on SLAs -- I will send them to our business team :-)


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


  "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl...
  Would agree that the DBREINDEX can be troublesome in highly available systems.  But the fill factors should be static from a design perspective.  Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive table access only on the table it is currently running against.  Running the Cluster Index rebuild WITH DROP EXISTING can speed up the process.

  But then, again, that is a management issue.  The 5 9's do NOT refer to TOTAL TIME, but to SLA time.  And, EVERY SYSTEM HAS TO HAVE AN SLA, which would include a maintenance window.  If you don't have an APPROPRIATE SLA, then your 5 9's don't mean very much.  But, yes, we maintain 5, and even 6, 9's of availability, but properly measured against reasonable SLAs that have actually been documented and constructed with the various Business Units that are demanding the system availabilities.

  I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as delusional as the Business Units may be, then sacrificing performance over page splits would be one reason to not care about the Cluster Index placement.  But, then, why bother and just use a HEAP.  You're not going to get much better performance from an ill-chosen Clustered Index. Some, but not much.

  Sincerely,


  Anthony Thomas


  --
Author
22 Mar 2005 1:05 PM
Anthony Thomas
Okay.  I took you up on your suggestions.

Yes, I read Kimberly Tripp's Q&A and you two agree, to an extent.  She has a
basis for "Defaulting" to clustering the PK IDENTITY regardless if it is a
surrogate key or not.  She also admits that this will create a "hot spot" in
the data file but assumes the end-users will immediately want to retrieve
this information after insertion and that the hot spots is mitigated by
having the recent insert already in cache.  She also conceedes that the
Business Key would be an alternative Cluster Index candidate in some
situations.

I also took your advice and ran my own tests.  Here's what I discovered:

The page splits happen only as a course of inserts; so, table types that
have many more inserts versus other CRUD or Query operations may benefit for
what you two are suggesting.

I also agree with the fact that HEAPs are detrimental for the myriad of
reasons you and Kimberly point out.

However, I still disagree with the notion of using IDENTITY Clustered
Indexes as a matter of default.  Here is why:

The majority of tables are not of the type the two of you suggest would be
beneficial for using an IDENTITY attribute.  The majority of tables are of
the reference type.  For example, a Customer, Author, Title, Orders, and the
various other look up and reference kinds of tables.  Once entered, these
tables are queried and/or joined for reference information far more often in
query type statements than they ever are versus the intitial INSERT.
Moreover, my previous comments ring true.  That more often than not, a
SELECT * or at least many of the columns for these reference tables are
included.  In which case, the Clustered Index is chosen more often than not
because it is a covering index.  However, these types of tables typically
JOIN and are FILTERED by the very Business Key that defines the Unique
Contraint.  When the Clustered Index is the surrogate IDENTITY you end up
forcing a Clustered Index Scan whereas having the Unique Constraint as the
Clustered Index, you end up with a Clustered Index Seek, which is more
efficient.

So, as a matter of "default," you will cover more tables if you choose the
Business Key as the Cluster Index instead of the surrogate IDENTITY and will
end up with more efficient queries.

So, what to do with the handful of transactional tables, that are low in
number of tables, but high in the quantity of data?  My tests have shown
that the page splits is controlled more by the FILL FACTOR, as I suggested,
than by choosing the surrogate key over the Business Key.  Now, you are
correct in that choosing increasing keys will force the splits at the ends,
which Kimberly and I both agree will create a local "hot spot."  What we
disagree on is whether or not this is desirable.  Next, having an
appropriate FILL Factor and choosing the Business Key for the Cluster Index,
can distribute this "hot spot" activity throughout the database files.  I
contend that this is more desirable and have found that even with the
default FILL Factor causes few splits than using the surrogate key as the
Cluster Index.

Finally, there are situations, only with the transaction tables, that the
IDENTITY is NOT a surrogate, but the Business Key itself.  In these
situations, only, I think we agree to use this as the Cluster Key, but only
because our definitions have coincided, not because of any concession on
either of our part.

I am sorry, but the tests I have just conducted, although emperical and
limited, at least suggest that what I am say bears truth.  I would be
curious to know the particulars of your tests in order that I attempt to
reproduce your results.

Thanks for all of your time.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:uha51IALFHA.568@TK2MSFTNGP09.phx.gbl...
You should probably read Kimberly Tripp's Q&A site:

http://www.sqlskills.com/ConsolidatedQA.asp

This will clarify a lot of things for you, I think.  Heaps are certainly not
a good choice for insert performance, as gaps will be filled if rows are
deleted.  And I'm confused about your comment regarding sacrificing
performance -- clustering on an IDENTITY or other sequential key will do the
opposite in many cases.  Insert, and in many cases read performance will
both benefit.  I've run extensive tests to prove this (I'm a load testing
fanatic) and you'll find upon reading that web page that Kimberly apparently
agrees with me.  I'm not sure what basis your arguments have, but you may
want to run some tests for yourself.

Thanks for the links on SLAs -- I will send them to our business team :-)


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl...
Would agree that the DBREINDEX can be troublesome in highly available
systems.  But the fill factors should be static from a design perspective.
Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive
table access only on the table it is currently running against.  Running the
Cluster Index rebuild WITH DROP EXISTING can speed up the process.

But then, again, that is a management issue.  The 5 9's do NOT refer to
TOTAL TIME, but to SLA time.  And, EVERY SYSTEM HAS TO HAVE AN SLA, which
would include a maintenance window.  If you don't have an APPROPRIATE SLA,
then your 5 9's don't mean very much.  But, yes, we maintain 5, and even 6,
9's of availability, but properly measured against reasonable SLAs that have
actually been documented and constructed with the various Business Units
that are demanding the system availabilities.

I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as
delusional as the Business Units may be, then sacrificing performance over
page splits would be one reason to not care about the Cluster Index
placement.  But, then, why bother and just use a HEAP.  You're not going to
get much better performance from an ill-chosen Clustered Index. Some, but
not much.

Sincerely,


Anthony Thomas


--
Author
22 Mar 2005 4:03 PM
Adam Machanic
Hi Anthony,

A few comments:

> The page splits happen only as a course of inserts; so, table types that
> have many more inserts versus other CRUD or Query operations may benefit
for
> what you two are suggesting.

This is the environment in which I work currently, and that was the
assumption I made when testing -- the systems at my place of work are used
to process orders from numerous e-commerce sites -- we do all of the order
logic externally (in a big queue system, primarily), then insert all of the
order data.  It's queried very rarely once inserted -- most customers will
never return to ask for their order data, and our systems don't need to
query it again 90% of the time.  That's one reason I mentioned OLTP -- most
of my experience before I came here was with data warehousing projects and
this would certainly not apply there!  I don't know what other OLTP
scenarios exist -- I assumed we were fairly consistent with the norm.  Is
that not true in your experience?

> However, I still disagree with the notion of using IDENTITY Clustered
> Indexes as a matter of default.  Here is why:

Just to clarify, I don't remember anyone saying "default"!  I certainly
wouldn't suggest a one-size fits all.  I believe our debate was about
whether or not they could be used in any situation -- if I implied that it's
good in every situation and that's what caused this debate, I apologize.
That was certainly not my thought on the matter.

> beneficial for using an IDENTITY attribute.  The majority of tables are of
> the reference type.  For example, a Customer, Author, Title, Orders, and
the
> various other look up and reference kinds of tables.  Once entered, these
> tables are queried and/or joined for reference information far more often
in
> query type statements than they ever are versus the intitial INSERT.

Customers/Orders are two of the most inserted-into tables in the systems I
work on (the most inserts happen on the Line Items table, since most orders
have more than one item).  But assuming you had an IDENTITY surrogate key, I
see no reason why clustering on it would not be a good idea even in tables
that don't change often, like Author.  Assume, that the table had all sorts
of columns for each author -- primary publisher, editor, city of residence,
maybe some biographical info, etc, etc, etc.  Most queries into the system,
however, would be for books by that author.  If the table were clustered by
the IDENTITY, a non-clustered index could be created on the Author Name
column, and since non-clustered indexes include the clustering key as the
leaf value, a bookmark lookup would not be necessary when using this table
to query, e.g., the Books table.  So that would be one case in which I still
think that clustering on the IDENTITY would be valuable.  What would you
cluster on?

> Moreover, my previous comments ring true.  That more often than not, a
> SELECT * or at least many of the columns for these reference tables are
> included.  In which case, the Clustered Index is chosen more often than
not

Not in my experience; see the Authors table listed above; and many other
'reference' tables have only a single column.  For example "Countries" will
have a column called "Country".  In that case, I agree, clustering on the
surrogate key (if there is one) is probably not necessary.  But in such
small tables as those, it really doesn't matter too much what you cluster
on -- the entire table is going to be in memory anyway.

> So, as a matter of "default," you will cover more tables if you choose the
> Business Key as the Cluster Index instead of the surrogate IDENTITY and
will
> end up with more efficient queries.

Again, there should never, ever, be a default!  We as data modelers should
always strive for the best solution, and one-size-fits-all is never the best
solution except in baseball cap manufacturing :)

> than by choosing the surrogate key over the Business Key.  Now, you are
> correct in that choosing increasing keys will force the splits at the
ends,
> which Kimberly and I both agree will create a local "hot spot."  What we
> disagree on is whether or not this is desirable.  Next, having an
> appropriate FILL Factor and choosing the Business Key for the Cluster
Index,
> can distribute this "hot spot" activity throughout the database files.  I
> contend that this is more desirable and have found that even with the
> default FILL Factor causes few splits than using the surrogate key as the
> Cluster Index.

This probably depends a great deal on the disk system that you're using --
it's not desirable, if you're doing heavy inserts, to have the write heads
jumping all around the disk.  It's much better to keep them at the end of
the disk, writing sequentially.  However, in certain RAID scenarios (and
probably SANs, which I don't have experience with), that equation changes.
What types of disks did you tesk on?  I ran my tests on a small RAID 10
setup -- but results appear to remain consistent on our larger production
system.

> I am sorry, but the tests I have just conducted, although emperical and
> limited, at least suggest that what I am say bears truth.  I would be
> curious to know the particulars of your tests in order that I attempt to
> reproduce your results.

    What were your tests?  Can you post them?  I've attached a couple of
very simple insert-only tests and their results, which show a huge
difference in time -- I'll try to find some of my other tests and post those
as well; I also tested various sequential and random reads, etc.  I did find
that clustering on "business" keys is beneficial for read in a variety of
cases, but at least here, that's totally overhsadowed by the need for fast
inserts.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


[attached file: sequential_tests.zip]
Author
22 Mar 2005 10:38 PM
Hugo Kornelis
On Tue, 22 Mar 2005 07:05:30 -0600, Anthony Thomas wrote:

(snip)
(snip)
>Moreover, my previous comments ring true.  That more often than not, a
>SELECT * or at least many of the columns for these reference tables are
>included.  In which case, the Clustered Index is chosen more often than not
>because it is a covering index.  However, these types of tables typically
>JOIN and are FILTERED by the very Business Key that defines the Unique
>Contraint.  When the Clustered Index is the surrogate IDENTITY you end up
>forcing a Clustered Index Scan whereas having the Unique Constraint as the
>Clustered Index, you end up with a Clustered Index Seek, which is more
>efficient.
(snip)

Hi Anthony,

Yes, they are often filtered by the business key that defines the unique
constraint.

No, they are hardly ever joined by that business key. The whole point of
introducing an IDENTITY surrogate key is to use that integer value for
all references instead of the (often longer, sometimes spanned) business
key. So I'd expect joins to be using the identity surrogate key.

If most queries include a filter on the business key, AND the optimizer
decides to use that filter first in it's execution plan, than a
clustered index on the business key would be better than clustering on
the surrogate key. But if many queries don't filter the business key
(because the filters are on columns in other tables, and this table is
only joined in to display some extra columns), OR even if the queries do
filter on the business key, but the optimizer decides that the best plan
will filter other tables first, then join this table and apply the
remainging filter conditions, then the clustered index on the identity
column would be best.

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
22 Mar 2005 10:42 PM
Adam Machanic
"Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
news:887141lbp95i5ibkgarn8t1ga3812gpbs0@4ax.com...
>
> If most queries include a filter on the business key, AND the optimizer
> decides to use that filter first in it's execution plan, than a
> clustered index on the business key would be better than clustering on
> the surrogate key. But if many queries don't filter the business key
> (because the filters are on columns in other tables, and this table is
> only joined in to display some extra columns), OR even if the queries do
> filter on the business key, but the optimizer decides that the best plan
> will filter other tables first, then join this table and apply the
> remainging filter conditions, then the clustered index on the identity
> column would be best.

    Excellent point :)


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
23 Mar 2005 8:01 AM
Anthony Thomas
Ok.  A few replies...and I'll try to keep this short to minimize the
responding topics.

1. You are corrent in that I mispoke concerning a Joining on the Business
Key.  Yes, that is the whole point of the surrogate, but as the predominant
restriction condition is still true.

2. A Cluster Index on the surrogate key WILL NOT make the join process
faster.  Think about it.  If you have to process a MERGE or INNER LOOP join,
with either a prefetch or not, it will have to do a matched seek.  Both a
clustered index and a non-clustered index would be sufficient for the
singleton select lookups.  Your suggestion that IDENTITY CLUSTERED INDEXES
are more efficient JOIN candidates is hyped but not supported by evidence.

3. Yes, queries against the reference tables are my chief concern because I
see this every day with vended solutions that use the PK IDENTITY Clustered
Index approach, and it kills the systems.  Chiefly because the bulk of the
queries, like Customers, etc., are searched on Name Ranges, which the
IDENTITY Clustered Primary Key is the worst candidate.

4. This is a matter of "default."  The very Q & A Kimberly wrote says this
explicitly, the very link you provided.  And, it is what vendor after vendor
after vendor is shoving into the market place WITHOUT giving it a second
thought much less thought per individual table.  Plus, the tool, SQL Server,
does this by default if you do not have the foresight to tell it otherwise.

5. The Orders table being a reference table is a hybrid, it is usually the
Order Details tables that I was thinking of when I was attempting to
distinguish between the two, reference and transaction, respectively.  Note,
I have ONLY been talking about OLTP databases, just the table styles within
an OLTP database.

6. My tests showed that the IDENTITY CLUSTERED INDEX had MORE SPLITS, even
on a transaction type table than a Business Key.  Now, this boggled my mind;
it was almost 3 to 1 greater.  But think about it; these are B-Tree indexes.
If you use an increasing attribute, especially a monitonically increasing
one like an IDENTITY, you've unbalanced the index; you are making it
lop-sided.  You end up causing more Intermediate Node splits more often than
if you had randomly split the leafs throughout the table.  Now, there may be
a better explaination, but from the first pass tests I conducted, this is
what I saw.

7. The initial question was regarding INSERTS, splits, versus the payoff of
using the Business Key for queries.  So, my tests have been first conducted
on individual tables.  I am looking into multi-table, with relationships,
types of tests to conduct next.  For now, however, here is what I did.

CREATE TABLE MyTable
    (MyID INT IDENTITY NOT NULL
        PRIMARY KEY CLUSTERED
    ,MyName VARCHAR(30)    NOT NULL
        UNIQUE NONCLUSTERED
    ,MyDescription VARCHAR(50)    NOT NULL
    ,MyCreateDate    DATETIME    NOT NULL
        DEFAULT (GETDATE())
    )

CREATE NONCLUSTERED INDEX IX01_MyTable
    ON MyTable(MyCreateDate)

Now, I ran several passes changing the Clustered Index attribute, single
column, and then reconducting the test.

The test was to monitor the split behavior, both at the leaf as well as the
node level and then look at the fragmentation once completed with the
inserts of 1,000,000 records.  Each insert was a single row at a time to
simulate individual transacitons.

The MyID was IDENTITY; so, no explicit value inserted.  MyName was preceeded
with a CHAR(x) value randomly created, then the length was varied between
the single character to the maximum value.  MyDescription was just garbage,
but the length was randomly filled.  MyCreateDate was allowed to assume the
transaction time default.

After the inserts and fragmentation analysis, a set of predefined queries
were ran with the execution plan, client statistics, I/O and Time
statistics.

The queries were:

SELECT * FROM MyTable
SELECT * FROM MyTable WHERE MyID = 550000
SELECT * FROM MyTable WHERE MyName LIKE 'M%'
SELECT * FROM MyTable WHERE MyName = 'M'
SELECT * FROM MyTable WHERE MyDate BETWEEN <1/2 way through the run time>
and <3/4 way through the run time>
SELECT * FROM MyTable WHERE MyDate = <a randomly selected time from the run
time>

SELECT MyName, MyCreateDate FROM MyTable
SELECT MyName, MyCreateDate FROM MyTable WHERE MyID = 550000
SELECT MyName, MyCreateDate FROM MyTable WHERE MyName LIKE 'M%'
SELECT MyName, MyCreateDate FROM MyTable WHERE MyName = 'M'
SELECT MyName, MyCreateDate FROM MyTable WHERE MyDate BETWEEN <1/2 way
through the run time> and <3/4 way through the run time>
SELECT MyName, MyCreateDate FROM MyTable WHERE MyDate = <a randomly selected
time from the run time>

I made two passes through these, once by just letting it run, the other by
running the DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS in between
statements.

This results were saved and then the whole thing with the same constraints
but just a diffenent choice of clustered index: IDENTITY (MyID), Business
Key (MyName), and then an increasing but not necessarily unique attribute
(MyCreateDate).

I did my analysis on both sides, the fragmentation and efficiency, time and
I/O statistics, of the inserts, and the execution of the above queries.

At the moment, I am attempting to summarize the results, given they are
quite extensive.  However, the examination still bears some of what everyone
else is saying, but there are hidden dangers, the Extent Fragmentation for
one.  And, the Business Key still produced the best executions for these
queries.

Now, I would not claim that these tests are scientific research, but they
did give me better insight in to what I had originaly had considered, but
nothing would suggest that I detract from my initial statement, that as a
matter of default, which is what most solutions that are present to me are,
the Business Key is still the best candidate for the Clustered Index, which,
as the DBA, is one of the few, but most effective influence I can apply to
an already designed solution.

Sincerely,


Anthony Thomas

--

Show quoteHide quote
"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OZP1pBzLFHA.1956@TK2MSFTNGP15.phx.gbl...
"Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
news:887141lbp95i5ibkgarn8t1ga3812gpbs0@4ax.com...
>
> If most queries include a filter on the business key, AND the optimizer
> decides to use that filter first in it's execution plan, than a
> clustered index on the business key would be better than clustering on
> the surrogate key. But if many queries don't filter the business key
> (because the filters are on columns in other tables, and this table is
> only joined in to display some extra columns), OR even if the queries do
> filter on the business key, but the optimizer decides that the best plan
> will filter other tables first, then join this table and apply the
> remainging filter conditions, then the clustered index on the identity
> column would be best.

    Excellent point :)


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
23 Mar 2005 10:47 AM
Hugo Kornelis
On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote:

(snip)
>2. A Cluster Index on the surrogate key WILL NOT make the join process
>faster.  Think about it.  If you have to process a MERGE or INNER LOOP join,
>with either a prefetch or not, it will have to do a matched seek.  Both a
>clustered index and a non-clustered index would be sufficient for the
>singleton select lookups.  Your suggestion that IDENTITY CLUSTERED INDEXES
>are more efficient JOIN candidates is hyped but not supported by evidence.

Hi Anthony,

I've thought about it.

If the (expected) number of rows is small, a clustered index would
typically require a clustered index seek; a nonclustered index would
require an index seek followed by a bookmark lookup. Doubles the number
of logical reads and increases the number of physical reaads.

If the (expected) number of rows is large, a clustered index would give
the optimizer the option to use a merge join; I don't see that as a
viable option in the nonclustered case, unless the index is covering
(merge join the index, then do 100,000 bookmark lookups? sort the table,
then merge join, bypassing the index?)


(snip)
>4. This is a matter of "default."  The very Q & A Kimberly wrote says this
>explicitly, the very link you provided.  And, it is what vendor after vendor
>after vendor is shoving into the market place WITHOUT giving it a second
>thought much less thought per individual table.  Plus, the tool, SQL Server,
>does this by default if you do not have the foresight to tell it otherwise.

I think we'll both agree that defaults are fine if you just need to make
a simple DB and don't want nor need to spend time tuning. If you're
taking development seriously and you're producing a DB that is of any
serious significance to the business, you'll take none of the defaults
for granted. This applies to clustered indexes just as it doess to
configuration options, recovery models, security settings, maintenance
plans, etc.

For the purpose of a simple DB with no need for advanced tuning, the SQL
Server default is (IMO) fine as it is - all tables (should) have a
primary key, not all have a unique constraint or other supporting index,
and since most tables perform better with clustered key than as a heap,
it makes sense to default the PK to clustered.

Vendors that fail to finetune their product is a royal pain - but not
really related to this discussion.


(snip)
>Now, I would not claim that these tests are scientific research, but they
>did give me better insight in to what I had originaly had considered, but
>nothing would suggest that I detract from my initial statement, that as a
>matter of default, which is what most solutions that are present to me are,
>the Business Key is still the best candidate for the Clustered Index, which,
>as the DBA, is one of the few, but most effective influence I can apply to
>an already designed solution.

The tests are good, but they only test typical one-table searches. Most
queries in a typical database use joins or subqueries to combine data
from two or more tables. I don't have the time to run extensive tests
like you did, but if you have, you might try to add a few tests. Some
typical queries you might encounter in an ordering system and that might
be interesting to test are:

SELECT     c.CustName, c.CustAddr, o.OrdDate, o.HandledBy,
           p.ProdName, p.ListPrice, od.Qty
FROM       Orders AS o
INNER JOIN Customers AS c
      ON   c.CustID = o.CustID
INNER JOIN OrderDetails AS od
      ON   od.OrderID = o.OrderID
INNER JOIN Products AS p
      ON   p.ProdID = od.ProdID
WHERE      c.CustName = 'Nanotech Inc.'

or

WHERE      o.OrdDate BETWEEN '20050321' AND '20050323'

or

WHERE      p.ProdName = '17" TFT monitor'

This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no
surrogate key for od), and natural keys on c.CustName, p.ProdName,
(o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that
clustering the surrogate key will be faster than clustering the natural
key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering
on the natural key will win (as only the covering index needs to be
used).

Of course, a complete test would also involve testing with EXISTS / NOT
EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the
above woould be a good starting point. :-)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
23 Mar 2005 3:59 PM
Anthony Thomas
Thank you for your comments.  Yes, I do agree that the test need to start
broading there scope into the addition, and more typical kinds of queries.

The intial discussion however began with a question regarding the use of a
ROWGUID as a clustering index.  I made an generalist statement that
surrogate keys make lousy cluster index candidates, which the response
initiated this thread.  The proof offered was a matter of lessing splits.

My test were solely directed at the question of splits versus the querying
benefit of clustering the Business Key.  But, yes, I am interested in the
additional questions now that I have started tearing the issue apart to
examine.

As to the above question, I have evidence that not only do single table type
queries perform better by having the Business Key clustered, the split
factor offered as a counter argument has actually proven false by clustering
the PK IDENTITY surrogate.  There are an enourmous amount of EXTENT
FRAGMENTATION.  Adam also suggested that the surrogate was a good choice
because given his system, he does not have the luxury of running the
DBREINDEX maintenance.  My results show that he has to that he is not
getting what he thought he was.

Now, about "defaults," I agree.  Unfortuantely, we have the majority of our
systems vended, by BIG SOFTWARE companies, that take the advice and the
default behavior of the SQLEM tool and generate hundreds of tables, all with
surrogate PK IDENTITY clustered, as a matter of default, precisely because
the tool does it and people like Kimberly Tripp and Adam Mechanic, and,
perhaps, even you, espouse their use.

The point of these was to attempt to give general advice that would prove
more beneficial than what has been typically given.  And, then, proceeded to
prove my point.

So, I would like someone else to conduct a similar test as I have presented
and report on whether or not they same the same FRAGMENTATION
characteristics and perfomance issues when ran under these various
scenarios.  1 M row inserts are not necessary.  I saw the same behaviour at
the 10 K rows level, it is just magnified when you can stress the system as
I did.

Sincerely,


Anthony Thomas


--

"Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
news:3mg241phmksolj3da4it1vf4g380pvu5dg@4ax.com...
On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote:

(snip)
>2. A Cluster Index on the surrogate key WILL NOT make the join process
>faster.  Think about it.  If you have to process a MERGE or INNER LOOP
join,
>with either a prefetch or not, it will have to do a matched seek.  Both a
>clustered index and a non-clustered index would be sufficient for the
>singleton select lookups.  Your suggestion that IDENTITY CLUSTERED INDEXES
>are more efficient JOIN candidates is hyped but not supported by evidence.

Hi Anthony,

I've thought about it.

If the (expected) number of rows is small, a clustered index would
typically require a clustered index seek; a nonclustered index would
require an index seek followed by a bookmark lookup. Doubles the number
of logical reads and increases the number of physical reaads.

If the (expected) number of rows is large, a clustered index would give
the optimizer the option to use a merge join; I don't see that as a
viable option in the nonclustered case, unless the index is covering
(merge join the index, then do 100,000 bookmark lookups? sort the table,
then merge join, bypassing the index?)


(snip)
>4. This is a matter of "default."  The very Q & A Kimberly wrote says this
>explicitly, the very link you provided.  And, it is what vendor after
vendor
>after vendor is shoving into the market place WITHOUT giving it a second
>thought much less thought per individual table.  Plus, the tool, SQL
Server,
>does this by default if you do not have the foresight to tell it otherwise.

I think we'll both agree that defaults are fine if you just need to make
a simple DB and don't want nor need to spend time tuning. If you're
taking development seriously and you're producing a DB that is of any
serious significance to the business, you'll take none of the defaults
for granted. This applies to clustered indexes just as it doess to
configuration options, recovery models, security settings, maintenance
plans, etc.

For the purpose of a simple DB with no need for advanced tuning, the SQL
Server default is (IMO) fine as it is - all tables (should) have a
primary key, not all have a unique constraint or other supporting index,
and since most tables perform better with clustered key than as a heap,
it makes sense to default the PK to clustered.

Vendors that fail to finetune their product is a royal pain - but not
really related to this discussion.


(snip)
>Now, I would not claim that these tests are scientific research, but they
>did give me better insight in to what I had originaly had considered, but
>nothing would suggest that I detract from my initial statement, that as a
>matter of default, which is what most solutions that are present to me are,
>the Business Key is still the best candidate for the Clustered Index,
which,
>as the DBA, is one of the few, but most effective influence I can apply to
>an already designed solution.

The tests are good, but they only test typical one-table searches. Most
queries in a typical database use joins or subqueries to combine data
from two or more tables. I don't have the time to run extensive tests
like you did, but if you have, you might try to add a few tests. Some
typical queries you might encounter in an ordering system and that might
be interesting to test are:

SELECT     c.CustName, c.CustAddr, o.OrdDate, o.HandledBy,
           p.ProdName, p.ListPrice, od.Qty
FROM       Orders AS o
INNER JOIN Customers AS c
      ON   c.CustID = o.CustID
INNER JOIN OrderDetails AS od
      ON   od.OrderID = o.OrderID
INNER JOIN Products AS p
      ON   p.ProdID = od.ProdID
WHERE      c.CustName = 'Nanotech Inc.'

or

WHERE      o.OrdDate BETWEEN '20050321' AND '20050323'

or

WHERE      p.ProdName = '17" TFT monitor'

This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no
surrogate key for od), and natural keys on c.CustName, p.ProdName,
(o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that
clustering the surrogate key will be faster than clustering the natural
key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering
on the natural key will win (as only the covering index needs to be
used).

Of course, a complete test would also involve testing with EXISTS / NOT
EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the
above woould be a good starting point. :-)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
23 Mar 2005 6:13 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:O%23Rn%23F8LFHA.2420@TK2MSFTNGP12.phx.gbl...
>
> factor offered as a counter argument has actually proven false by
clustering
> the PK IDENTITY surrogate.  There are an enourmous amount of EXTENT
> FRAGMENTATION.  Adam also suggested that the surrogate was a good choice

    What's the basis for these conclusions?  I see no correlation between
what you're asserting and what I've seen in the real world.

    For example, I just re-ran the "ident.sql" script I posted a day or two
ago in this thread.  Here is the result of DBCC SHOWCONTIG after inserting
10,000,000 rows clustered on an IDENTITY column:

DBCC SHOWCONTIG scanning 'somestuff' table...
Table: 'somestuff' (2073058421); index ID: 1, database ID: 14
TABLE level scan performed.
- Pages Scanned................................: 279686
- Extents Scanned..............................: 35080
- Extent Switches..............................: 35079
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.66% [34961:35080]
- Logical Scan Fragmentation ..................: 0.00%
- Extent Scan Fragmentation ...................: 9.39%
- Avg. Bytes Free per Page.....................: 140.8
- Avg. Page Density (full).....................: 98.26%


Is 9% really "enormous"?

.... I'll leave it as an exercise for you to run the "noident" script, since
it takes many hours and I don't have the patience right now to wait that
long.  But I know you won't be pleased with the fragmentation reported at
the end of that run.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


Show quoteHide quote
> because given his system, he does not have the luxury of running the
> DBREINDEX maintenance.  My results show that he has to that he is not
> getting what he thought he was.
>
> Now, about "defaults," I agree.  Unfortuantely, we have the majority of
our
> systems vended, by BIG SOFTWARE companies, that take the advice and the
> default behavior of the SQLEM tool and generate hundreds of tables, all
with
> surrogate PK IDENTITY clustered, as a matter of default, precisely because
> the tool does it and people like Kimberly Tripp and Adam Mechanic, and,
> perhaps, even you, espouse their use.
>
> The point of these was to attempt to give general advice that would prove
> more beneficial than what has been typically given.  And, then, proceeded
to
> prove my point.
>
> So, I would like someone else to conduct a similar test as I have
presented
> and report on whether or not they same the same FRAGMENTATION
> characteristics and perfomance issues when ran under these various
> scenarios.  1 M row inserts are not necessary.  I saw the same behaviour
at
> the 10 K rows level, it is just magnified when you can stress the system
as
> I did.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
> news:3mg241phmksolj3da4it1vf4g380pvu5dg@4ax.com...
> On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote:
>
> (snip)
> >2. A Cluster Index on the surrogate key WILL NOT make the join process
> >faster.  Think about it.  If you have to process a MERGE or INNER LOOP
> join,
> >with either a prefetch or not, it will have to do a matched seek.  Both a
> >clustered index and a non-clustered index would be sufficient for the
> >singleton select lookups.  Your suggestion that IDENTITY CLUSTERED
INDEXES
> >are more efficient JOIN candidates is hyped but not supported by
evidence.
>
> Hi Anthony,
>
> I've thought about it.
>
> If the (expected) number of rows is small, a clustered index would
> typically require a clustered index seek; a nonclustered index would
> require an index seek followed by a bookmark lookup. Doubles the number
> of logical reads and increases the number of physical reaads.
>
> If the (expected) number of rows is large, a clustered index would give
> the optimizer the option to use a merge join; I don't see that as a
> viable option in the nonclustered case, unless the index is covering
> (merge join the index, then do 100,000 bookmark lookups? sort the table,
> then merge join, bypassing the index?)
>
>
> (snip)
> >4. This is a matter of "default."  The very Q & A Kimberly wrote says
this
> >explicitly, the very link you provided.  And, it is what vendor after
> vendor
> >after vendor is shoving into the market place WITHOUT giving it a second
> >thought much less thought per individual table.  Plus, the tool, SQL
> Server,
> >does this by default if you do not have the foresight to tell it
otherwise.
>
> I think we'll both agree that defaults are fine if you just need to make
> a simple DB and don't want nor need to spend time tuning. If you're
> taking development seriously and you're producing a DB that is of any
> serious significance to the business, you'll take none of the defaults
> for granted. This applies to clustered indexes just as it doess to
> configuration options, recovery models, security settings, maintenance
> plans, etc.
>
> For the purpose of a simple DB with no need for advanced tuning, the SQL
> Server default is (IMO) fine as it is - all tables (should) have a
> primary key, not all have a unique constraint or other supporting index,
> and since most tables perform better with clustered key than as a heap,
> it makes sense to default the PK to clustered.
>
> Vendors that fail to finetune their product is a royal pain - but not
> really related to this discussion.
>
>
> (snip)
> >Now, I would not claim that these tests are scientific research, but they
> >did give me better insight in to what I had originaly had considered, but
> >nothing would suggest that I detract from my initial statement, that as a
> >matter of default, which is what most solutions that are present to me
are,
> >the Business Key is still the best candidate for the Clustered Index,
> which,
> >as the DBA, is one of the few, but most effective influence I can apply
to
> >an already designed solution.
>
> The tests are good, but they only test typical one-table searches. Most
> queries in a typical database use joins or subqueries to combine data
> from two or more tables. I don't have the time to run extensive tests
> like you did, but if you have, you might try to add a few tests. Some
> typical queries you might encounter in an ordering system and that might
> be interesting to test are:
>
> SELECT     c.CustName, c.CustAddr, o.OrdDate, o.HandledBy,
>            p.ProdName, p.ListPrice, od.Qty
> FROM       Orders AS o
> INNER JOIN Customers AS c
>       ON   c.CustID = o.CustID
> INNER JOIN OrderDetails AS od
>       ON   od.OrderID = o.OrderID
> INNER JOIN Products AS p
>       ON   p.ProdID = od.ProdID
> WHERE      c.CustName = 'Nanotech Inc.'
>
> or
>
> WHERE      o.OrdDate BETWEEN '20050321' AND '20050323'
>
> or
>
> WHERE      p.ProdName = '17" TFT monitor'
>
> This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no
> surrogate key for od), and natural keys on c.CustName, p.ProdName,
> (o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that
> clustering the surrogate key will be faster than clustering the natural
> key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering
> on the natural key will win (as only the covering index needs to be
> used).
>
> Of course, a complete test would also involve testing with EXISTS / NOT
> EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the
> above woould be a good starting point. :-)
>
> Best, Hugo
> --
>
> (Remove _NO_ and _SPAM_ to get my e-mail address)
>
>
Author
24 Mar 2005 7:36 AM
Anthony Thomas
My apologies, but I'll have to go back through the thread and grab a copy of
your script.  I'm curious now as to whether or not I have a configuration
issue on my test installation.

No, 9% is not enormous, but what I saw on the one I posted and ran with, at
the 1,000, 10,000, 100,000, and 1,000,000 row runs was nearer to 90%.  Do
you think it flips over around 10,000,000?  I don't see how that's possible
if at all the other scales were showing similar results and none as what you
posted here.

Thanks everyone for keeping up with this as I do want to get to a conclusive
discription as what we have been observing across our installations.

I've looked back through the thread and although you mention your script in
one of the posts you did not include it, or I can't get access to it through
my new reader.  Could you post yours again?

Because my results don't even come close to what yours did as far as Extent
Fragmentation, I reran a quick 1 million row insert on a completely
different installation, different OS, different SQL Server build, different
SAN, I still get the same results, which are different than yours.  I'd
really love to conduct the same test as yours.

Here's my quick results on the new box:

Clustered PK IDENTITY 1,000,000 Rows Test

cur_date                                               cntr_value
------------------------------------------------------ -----------
2005-03-24 00:35:07.577                                       7247

cur_date                                               cntr_value
------------------------------------------------------ -----------
2005-03-24 00:59:08.170                                      23390


DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table...
Table: 'MyTableCIndexTest' (213575799); index ID: 1, database ID: 6
TABLE level scan performed.
- Pages Scanned................................: 10056
- Extents Scanned..............................: 1262
- Extent Switches..............................: 1261
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.60% [1257:1262]
- Logical Scan Fragmentation ..................: 0.01%
- Extent Scan Fragmentation ...................: 52.38%
- Avg. Bytes Free per Page.....................: 41.2
- Avg. Page Density (full).....................: 99.49%
DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table...
Table: 'MyTableCIndexTest' (213575799); index ID: 2, database ID: 6
LEAF level scan performed.
- Pages Scanned................................: 4164
- Extents Scanned..............................: 525
- Extent Switches..............................: 4162
- Avg. Pages per Extent........................: 7.9
- Scan Density [Best Count:Actual Count].......: 12.52% [521:4163]
- Logical Scan Fragmentation ..................: 9.70%
- Extent Scan Fragmentation ...................: 98.86%
- Avg. Bytes Free per Page.....................: 412.3
- Avg. Page Density (full).....................: 94.91%
DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table...
Table: 'MyTableCIndexTest' (213575799); index ID: 3, database ID: 6
LEAF level scan performed.
- Pages Scanned................................: 1856
- Extents Scanned..............................: 234
- Extent Switches..............................: 233
- Avg. Pages per Extent........................: 7.9
- Scan Density [Best Count:Actual Count].......: 99.15% [232:234]
- Logical Scan Fragmentation ..................: 0.11%
- Extent Scan Fragmentation ...................: 99.57%
- Avg. Bytes Free per Page.....................: 14.1
- Avg. Page Density (full).....................: 99.83%
DBCC execution completed. If DBCC printed error messages, contact your
system administrator.

I also ran the DBCC SHOWCONTIG as a resultset with ALL_LEVELS and I'm get
fragmentation even at the node levels.

I am running a 10 million test just to see if this has anything to do with
it--although I doubt it--but it calls into question how two totally
different installations could produce similar result and yet be completely
different than yours.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OcxzXQ9LFHA.3420@tk2msftngp13.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:O%23Rn%23F8LFHA.2420@TK2MSFTNGP12.phx.gbl...
>
> factor offered as a counter argument has actually proven false by
clustering
> the PK IDENTITY surrogate.  There are an enourmous amount of EXTENT
> FRAGMENTATION.  Adam also suggested that the surrogate was a good choice

    What's the basis for these conclusions?  I see no correlation between
what you're asserting and what I've seen in the real world.

    For example, I just re-ran the "ident.sql" script I posted a day or two
ago in this thread.  Here is the result of DBCC SHOWCONTIG after inserting
10,000,000 rows clustered on an IDENTITY column:

DBCC SHOWCONTIG scanning 'somestuff' table...
Table: 'somestuff' (2073058421); index ID: 1, database ID: 14
TABLE level scan performed.
- Pages Scanned................................: 279686
- Extents Scanned..............................: 35080
- Extent Switches..............................: 35079
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.66% [34961:35080]
- Logical Scan Fragmentation ..................: 0.00%
- Extent Scan Fragmentation ...................: 9.39%
- Avg. Bytes Free per Page.....................: 140.8
- Avg. Page Density (full).....................: 98.26%


Is 9% really "enormous"?

.... I'll leave it as an exercise for you to run the "noident" script, since
it takes many hours and I don't have the patience right now to wait that
long.  But I know you won't be pleased with the fragmentation reported at
the end of that run.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


Show quoteHide quote
> because given his system, he does not have the luxury of running the
> DBREINDEX maintenance.  My results show that he has to that he is not
> getting what he thought he was.
>
> Now, about "defaults," I agree.  Unfortuantely, we have the majority of
our
> systems vended, by BIG SOFTWARE companies, that take the advice and the
> default behavior of the SQLEM tool and generate hundreds of tables, all
with
> surrogate PK IDENTITY clustered, as a matter of default, precisely because
> the tool does it and people like Kimberly Tripp and Adam Mechanic, and,
> perhaps, even you, espouse their use.
>
> The point of these was to attempt to give general advice that would prove
> more beneficial than what has been typically given.  And, then, proceeded
to
> prove my point.
>
> So, I would like someone else to conduct a similar test as I have
presented
> and report on whether or not they same the same FRAGMENTATION
> characteristics and perfomance issues when ran under these various
> scenarios.  1 M row inserts are not necessary.  I saw the same behaviour
at
> the 10 K rows level, it is just magnified when you can stress the system
as
> I did.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
> news:3mg241phmksolj3da4it1vf4g380pvu5dg@4ax.com...
> On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote:
>
> (snip)
> >2. A Cluster Index on the surrogate key WILL NOT make the join process
> >faster.  Think about it.  If you have to process a MERGE or INNER LOOP
> join,
> >with either a prefetch or not, it will have to do a matched seek.  Both a
> >clustered index and a non-clustered index would be sufficient for the
> >singleton select lookups.  Your suggestion that IDENTITY CLUSTERED
INDEXES
> >are more efficient JOIN candidates is hyped but not supported by
evidence.
>
> Hi Anthony,
>
> I've thought about it.
>
> If the (expected) number of rows is small, a clustered index would
> typically require a clustered index seek; a nonclustered index would
> require an index seek followed by a bookmark lookup. Doubles the number
> of logical reads and increases the number of physical reaads.
>
> If the (expected) number of rows is large, a clustered index would give
> the optimizer the option to use a merge join; I don't see that as a
> viable option in the nonclustered case, unless the index is covering
> (merge join the index, then do 100,000 bookmark lookups? sort the table,
> then merge join, bypassing the index?)
>
>
> (snip)
> >4. This is a matter of "default."  The very Q & A Kimberly wrote says
this
> >explicitly, the very link you provided.  And, it is what vendor after
> vendor
> >after vendor is shoving into the market place WITHOUT giving it a second
> >thought much less thought per individual table.  Plus, the tool, SQL
> Server,
> >does this by default if you do not have the foresight to tell it
otherwise.
>
> I think we'll both agree that defaults are fine if you just need to make
> a simple DB and don't want nor need to spend time tuning. If you're
> taking development seriously and you're producing a DB that is of any
> serious significance to the business, you'll take none of the defaults
> for granted. This applies to clustered indexes just as it doess to
> configuration options, recovery models, security settings, maintenance
> plans, etc.
>
> For the purpose of a simple DB with no need for advanced tuning, the SQL
> Server default is (IMO) fine as it is - all tables (should) have a
> primary key, not all have a unique constraint or other supporting index,
> and since most tables perform better with clustered key than as a heap,
> it makes sense to default the PK to clustered.
>
> Vendors that fail to finetune their product is a royal pain - but not
> really related to this discussion.
>
>
> (snip)
> >Now, I would not claim that these tests are scientific research, but they
> >did give me better insight in to what I had originaly had considered, but
> >nothing would suggest that I detract from my initial statement, that as a
> >matter of default, which is what most solutions that are present to me
are,
> >the Business Key is still the best candidate for the Clustered Index,
> which,
> >as the DBA, is one of the few, but most effective influence I can apply
to
> >an already designed solution.
>
> The tests are good, but they only test typical one-table searches. Most
> queries in a typical database use joins or subqueries to combine data
> from two or more tables. I don't have the time to run extensive tests
> like you did, but if you have, you might try to add a few tests. Some
> typical queries you might encounter in an ordering system and that might
> be interesting to test are:
>
> SELECT     c.CustName, c.CustAddr, o.OrdDate, o.HandledBy,
>            p.ProdName, p.ListPrice, od.Qty
> FROM       Orders AS o
> INNER JOIN Customers AS c
>       ON   c.CustID = o.CustID
> INNER JOIN OrderDetails AS od
>       ON   od.OrderID = o.OrderID
> INNER JOIN Products AS p
>       ON   p.ProdID = od.ProdID
> WHERE      c.CustName = 'Nanotech Inc.'
>
> or
>
> WHERE      o.OrdDate BETWEEN '20050321' AND '20050323'
>
> or
>
> WHERE      p.ProdName = '17" TFT monitor'
>
> This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no
> surrogate key for od), and natural keys on c.CustName, p.ProdName,
> (o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that
> clustering the surrogate key will be faster than clustering the natural
> key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering
> on the natural key will win (as only the covering index needs to be
> used).
>
> Of course, a complete test would also involve testing with EXISTS / NOT
> EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the
> above woould be a good starting point. :-)
>
> Best, Hugo
> --
>
> (Remove _NO_ and _SPAM_ to get my e-mail address)
>
>
Author
24 Mar 2005 2:16 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl...
>
> I've looked back through the thread and although you mention your script
in
> one of the posts you did not include it, or I can't get access to it
through
> my new reader.  Could you post yours again?


I posted it as an attachment; maybe it didn't work for some reason.  You can
download it here instead:

http://www.datamanipulation.net/sequential_tests.zip



--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
24 Mar 2005 3:13 PM
Anthony Thomas
Thanks.  I'll start looking into it.

By the way, the preliminary results on the 10 million rows test are looking
bleek.  Same results so far.  The SHOWCONTIG is reporting the same scanned
extents as switches, but then reports anywhere from 50% to 98% Extent
Fragmentation, both at the report level and the tableresults at each of the
node levels.

I do not think this is a matter of installation, given similar results
across several different hosts, but, perhaps, the method of my tests.

Any ideas would be appreciated.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:eEwgMwHMFHA.1172@TK2MSFTNGP12.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl...
>
> I've looked back through the thread and although you mention your script
in
> one of the posts you did not include it, or I can't get access to it
through
> my new reader.  Could you post yours again?


I posted it as an attachment; maybe it didn't work for some reason.  You can
download it here instead:

http://www.datamanipulation.net/sequential_tests.zip



--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
24 Mar 2005 3:31 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:umohbRIMFHA.3296@TK2MSFTNGP15.phx.gbl...
>
> By the way, the preliminary results on the 10 million rows test are
looking
> bleek.  Same results so far.  The SHOWCONTIG is reporting the same scanned
> extents as switches, but then reports anywhere from 50% to 98% Extent
> Fragmentation, both at the report level and the tableresults at each of
the
> node levels.

    This is the identity-based test I posted that you're referring to?  Or
one of the tests you posted?


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


Show quoteHide quote
>
> I do not think this is a matter of installation, given similar results
> across several different hosts, but, perhaps, the method of my tests.
>
> Any ideas would be appreciated.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
> news:eEwgMwHMFHA.1172@TK2MSFTNGP12.phx.gbl...
> "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
> news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl...
> >
> > I've looked back through the thread and although you mention your script
> in
> > one of the posts you did not include it, or I can't get access to it
> through
> > my new reader.  Could you post yours again?
>
>
> I posted it as an attachment; maybe it didn't work for some reason.  You
can
> download it here instead:
>
> http://www.datamanipulation.net/sequential_tests.zip
>
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
Author
24 Mar 2005 3:51 PM
Anthony Thomas
A replay of one of my tests, but on a diffenent server, entirely, and at the
10 M level.

I just got yours, but as soon as this round of my earlier results are
completed, I'm going to be running your's on the same hardware.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:uHqteaIMFHA.1308@tk2msftngp13.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:umohbRIMFHA.3296@TK2MSFTNGP15.phx.gbl...
>
> By the way, the preliminary results on the 10 million rows test are
looking
> bleek.  Same results so far.  The SHOWCONTIG is reporting the same scanned
> extents as switches, but then reports anywhere from 50% to 98% Extent
> Fragmentation, both at the report level and the tableresults at each of
the
> node levels.

    This is the identity-based test I posted that you're referring to?  Or
one of the tests you posted?


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--


Show quoteHide quote
>
> I do not think this is a matter of installation, given similar results
> across several different hosts, but, perhaps, the method of my tests.
>
> Any ideas would be appreciated.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
> news:eEwgMwHMFHA.1172@TK2MSFTNGP12.phx.gbl...
> "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
> news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl...
> >
> > I've looked back through the thread and although you mention your script
> in
> > one of the posts you did not include it, or I can't get access to it
> through
> > my new reader.  Could you post yours again?
>
>
> I posted it as an attachment; maybe it didn't work for some reason.  You
can
> download it here instead:
>
> http://www.datamanipulation.net/sequential_tests.zip
>
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
Author
24 Mar 2005 4:00 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:ON9mnmIMFHA.3548@TK2MSFTNGP10.phx.gbl...
> A replay of one of my tests, but on a diffenent server, entirely, and at
the
> 10 M level.

Please post a script similar to mine for your test -- it should create a new
database, go into the database, and run the full test.  That way we can test
the exact same things on our respective hardware.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
11 Apr 2005 7:24 AM
Anthony Thomas
Whew!  This has been a long time in coming, but I think it will be well
worth it to everyone who has been following the discussion.

My apologies for the delay, but had to do quite a bit of research to get a
handle on what is going on—and I still haven't gotten all my questions
answered—but here it goes regardless, not to mention some real work now and
again at my job.

Not to be an SOB but, Adam, Kimberly, and anyone else that has been touting
this PK IDENTITY CLUSTERED index nonsense are WRONG!

How's that for an opener?  Ok, now seriously, I don't really intend to be
that rude, but I have been inundated with this kind of horse pucky for so
long and having this same argument with vendors routinely, that it is nice
to have gotten some cooperation in getting some real proof to shove back in
everyone's faces.

Since this has been a very long thread, let me try to summarize what the
extent of these tests.

There was a general question regarding the use of various data types for the
PK, specifically GUIDs.  My reply was generalistic, but in essence, a better
recommendation, from my point of view, than what has been pushed around to
date.  I recommended to use the Business Key as the clustered index.  That a
surrogate key was fine for the PK definition, but not the clustered
index...always.  It was the always part that prompted further response.

Two examples were suggested as plausible scenarios for when one might want
to choose an IDENTITY as the clustered index: for INSERTS in order to reduce
the number of page splitting, and for multi-table JOIN conditions.  As the
JOIN scenario was suggested later in the thread and a much more complicated
condition to test, the nature of this investigation was limited to the issue
of page splits.

Furthermore, although there might be some disagreement, I believe there is
general acceptance that the Business Key provides the best clustered index
candidate base solely on its usage in SELECT Queries.  Although I have
evidence from these test to support this claim, I would rather focus on the
discussion of page splits in this response.

Basically, I ran tests and showed that regardless of which attribute was
chosen as the clustered index, they all showed evidence of excessive page
splitting, but that the Business Key candidate provided better querying
performance.  Adam ran tests that showed that a reduction of page splitting
provided better INSERT performance when using the IDENTITY.  In order to
rule out hardware differneces, Adam graciously provided me his scripts to
run.  And the results are...if I run Adam's scripts, I get similar results
to what he claims.

Ok, before you run off and claim that I am full of horse pucky, I need to
provide the results of the last several weeks worth of my subsequent tests
that explains these differences.  As this forum is lousy as a medium for
presenting results, I will digest them here and present validation on
subsequent responses, as warranted.

1. I was surprised at how similar our two test scripts turned out to be.  We
both used very similar mechanisms for generating 10 million row INSERT
simulations; however, we did differ in the base table configuration and
attribute definitions.  It is these differences that explains quite a bit of
the interesting results of my subsequent tests.

2. The claim was that an IDENTITY as clustered index provided better
performance than a Business Key as the clustered index.  The problem was
that Adam's counter-test was not a KEY; it was not an IDENTITY, and it was
not unique, far from it.

I used a VARCHAR(30) attibute with a random leading character, with a random
length, up to 30 characters, while he used another INT, just not an
IDENTITY, but randomized.  The problem is that he only used a random INT
between 0 and 300,000.  But this was a test for 10 million rows inserted.
So, this attribute as clustered can not possibly come close to uniqueness,
300,000/10,000,000.

Now, even Kimberly Tripp's site that Adam used as reference for his claims
recommends that the clustered index be nearly unique, if not unique,
exclusively.  However, by using IDENTITY, this is assured; whereas, the
attribute Adam used in his counter example only had a cardinality of 3% at
best.

My tests claimed that a Business Key, a Key, which is unique by definition,
just as an IDENTITY would be, provides just as good INSERT performance as an
IDENTITY.  In Adam's tests, the IDENTITY performed about 2.5 to 5 times
faster than the randomized, but not selective, INT.  My unique, VARCHAR KEY
performed just as well, within 15 minutes as fast as the use of the
IDENTITY.  I even went as far as testing a DATETIME default attirbute
clustered index, which too performed nearly as well, which was NOT unique,
but still highly selective, about 30% to 50% unique, but also step-wise
increasing.

So, my tests showed high fragmentation, but also showed that the IDENTITY,
the unique Business Key VARCHAR(30), and a DATETIME default all performed
similarly to 10 million inserts when each was defined as the clustered
index.  Only Adam's non-unique, non-selective INT performed badly.  So,
don't use a non-selective candidate as your clustered index if you are
worried about INSERT performance.  However, the original claim was that a
Business Key would prove better than an IDENTITY and my tests prove this.

I went as far as keeping Adam's non-ident test, but made the clustered index
INT attribute random but unique by replacing the following code in his
script:

insert somestuff(custid, somedata)
    values(convert(int, rand() * 300000)
    ,replicate('0', convert(int, rand() * 400)))

With the following:

insert somestuff(custid, somedata)
    values(convert(int, rand() * 213.74836470) * 10000000 + @i
    ,replicate('0', convert(int, rand() * 400))
    )

The new unique, but random, INT performed just as well as the clustered
index as the IDENTITY did.  So, it is the selectivity of the attribute that
controls the performance, not the page splits.

Also, since I was getting about 0.45 millisecond response per INSERT and
since SQL Server's DATETIME resolution is only 3.33 milliseconds, I was
getting about 14% unique values; however, the GETDATE() functions values,
which I used a a default, are still increasing.  So, this attribute too
performed well as the clustered index, as far as INSERTs go.  The query
performance given this clustered index was mixed depending on the query
types, which was the other half of the examination and justification for
recommending choosing the Business Key as the clustered index.

Two more points I'd like to discuss.  First, even though Adam's ident test
produced a DBCC SHOWCONTIG result that reported only a 9% Extent
Fragmentation value, when you ran DBCC SHOWCONTIG WITH TABLERESULTS, you see
that the nodes were 99% Extent fragmented.

The next point concerns the Extent Fragmentation on my tests.  In my table
definition, I had a generic description attribute that was populated with
both random characters and random length.  Adam's also contained a
description attribute, but was populated with fixed '0' but random length.
Moreover, my attribute was only VARCHAR(50) while Adam's was VARCHAR(400).

My clustered IDENTITY test produced 50% Table level Extent Fragmention,
Adam's only 9%, also at the table level; however, after modifying my
description to still be randomized characters, but allowed to reach a
randomized legth to match Adam's 400 characters, my Table level Extent
Fragmentation droped to only 17%.  My guess is that if I also used a fixed
'0' character, my fragmention would also have drop to around the 9% mark.

DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table...
Table: 'MyTableCIndexTest' (469576711); index ID: 1, database ID: 5
TABLE level scan performed.
- Pages Scanned................................: 319688
- Extents Scanned..............................: 40096
- Extent Switches..............................: 40095
- Avg. Pages per Extent........................: 8.0
- Scan Density [Best Count:Actual Count].......: 99.66% [39961:40096]
- Logical Scan Fragmentation ..................: 0.00%
- Extent Scan Fragmentation ...................: 17.13%
- Avg. Bytes Free per Page.....................: 150.2
- Avg. Page Density (full).....................: 98.14%

I hope these results have proven as interesting to you as they did to me.

Therefore,  I still claim that the Business Key is the best candidate for
the clustered index...always.  Except, now I have to construct the
multi-table test so I can prove Hugo wrong as well.

Sincerely,


Anthony Thomas





--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:eseabqIMFHA.4092@tk2msftngp13.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:ON9mnmIMFHA.3548@TK2MSFTNGP10.phx.gbl...
> A replay of one of my tests, but on a diffenent server, entirely, and at
the
> 10 M level.

Please post a script similar to mine for your test -- it should create a new
database, go into the database, and run the full test.  That way we can test
the exact same things on our respective hardware.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
11 Apr 2005 4:46 PM
Adam Machanic
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl...
>
> I used a VARCHAR(30) attibute with a random leading character, with a
random
> length, up to 30 characters, while he used another INT, just not an
> IDENTITY, but randomized.  The problem is that he only used a random INT
> between 0 and 300,000.  But this was a test for 10 million rows inserted.
> So, this attribute as clustered can not possibly come close to uniqueness,
> 300,000/10,000,000.

    I should have explained better when I sent the script -- it is intended
to simulate a Web site with repeat visitors.  Assuming 300,000 customers, a
lot of them coming back (somewhat randomly, as customers do) to place new
orders.  I'm not sure what kind of unique business key you could use in that
case, at least not one that isn't absolutely gigantic!  The question was,
would clustering on the CustomerId be better for performance -- given that
the majority of queries against the table are based on CustomerId (customers
asking for their own orders / order history), this was quite a valid
question.  The answer was that selecting from the table with this cluster
was definitely faster -- but not enough to balance out the insert cost.  And
I think you'll find in that case that defining a cluster based on a
"business key" will be expensive enough to far outweigh any perceived
benefit.  What columns would make up this business key?  And how large will
it end up being?  Do you really want to add 20+ bytes (just an estimate; I
haven't actually figured out how to generate a good key for that situation)
to every row of your non-clustered indexes?

    Again, I think you need to broaden your outlook a bit.  No solution is
going to be right every time.  Stop trying to find a formula ("IDENTITY is
always good" vs. "IDENTITY is always bad").  Every system is different, and
if you try to find a magic one-size-fits-all solution you'll very quickly
find situations in which it doesn't apply.  That's why the best answer
that's given to most of the questions in these forums is, "it depends" --
which is a superset of the answer that's usually given, "test it on your
end!"  If you can figure out problem classes in which IDENTITY does not
apply, that's great.  Sounds like a good article for SQL Server Magazine or
SQL Server Professional.  But don't assume that your answer will hold true
in all environments or for all business problems.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
11 Apr 2005 6:05 PM
Anthony Thomas
No, no, thank's for the additional info.  However, I realize NO one size
fits all solution exists and each entity within the context of the Business
Model needs to be evaluated, independently and as a member of the overall
design.  But, and this is a big but, Kimberly's site, as well as others,
have attempted to suggest that having a surrogate IDENTITY PK and making it
clustered, although not in all, but MOST situations is superior.  I have
just proven them wrong.

In your specific case, I understand what you are saying, but please hear me.
I said the Business Key, that is, THE KEY that makes the Entity at least 1NF
(first normal form).  If you don't have one, then you are not even close to
an RDBMS solution.

Furthermore, in your specific case, that IDENTITY IS NOT A SURROGATE.  This
situation is not necessarily unique, but given what you are doing, you do
not want to define your table by anything other than a unique instance of
some event, a session in this case.  It would make complete sense in this
situation to have duplicate records based on the CustomerID.  It is the
session information that makes this entity unique, which is typically
described by just such an IDENTITY.  But, in this case the IDENTITY IS THE
BUSINESS KEY.

However, in this discussion and subsequent testing, I have also proven that
a not necessisarily unique but piecewise increasing attribute, such as
SessionLoginDateTime or SessionStartDateTime as a DATETIME data type,
provides JUST AS GOOD INSERT performance as the IDENTITY when it is defined
as the Clustered Index.  So, with these two choices, IDENTITY, which no one
would query, or filter by, often, versus a DATETIME field, which would be
used quite often, by both singleton selects as well as range queries, would
still be a better Candidate Attribute for the Clustered Index than the
IDENTITY.  I have also measured the size impact using this definition versus
an Identity and it doesn't even come close to the problems associated with
another "default," the use of UNICODE.  So, the sizing of the indexes
because of this choice of Clustered Index is trival in comparison.

As a practioner, you must understand that most vended solutions hosted on
SQL Server ASSUME the defaults far too often.  Moreover, sites such as
Kimberly's that actually reccomend using such a default, IDENTITY surrogates
as PK and Clustered Index, also drive such horrid solutions.

My hyperbole was meant to counteract such lame implementations.  I think we
both agree that a thoughtful design provides the best solution; however,
given we rarely get to work with such designs, if you take my advice over
what is being touted as the alternative, both your SELECT and CRUD
operations will tend be more productive versus just the CRUD operations.

One final comment, I do not disagree with the use of Surrogates; however,
the definition itself requires that there actually already preexist an
Alternate KEY; otherwise, what are you surrogating.  This Alternate Key
should be protected in order to safegaurd the data integrity through the use
of a Unique Contraint.  I am suggesting that declaring CONSTRAINT name
UNIQUE CLUSTERED along with INT IDENTITY(whatever seed, whatever increment)
PRIMARY KEY NONCLUSTERED is a better physical design than the PRIMARY KEY
CLUSTERED solution alone and/or alternative.

This situation should not be taken as lightly as you seem to be taking it.
And it shouldn't be brushed away as simple hyperbole.  This is the number
one performance, locking, blocking, and deadlock issue that we have accross
our more than 200 databases, 30 servers, 5 clustered installations, and this
one generic, holistic, one size fits all solution is making a marked
difference in the performance of our systems, unanimously.

I do appreciate your engagement in this excerise and discussion.

Sincerely,

Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl...
>
> I used a VARCHAR(30) attibute with a random leading character, with a
random
> length, up to 30 characters, while he used another INT, just not an
> IDENTITY, but randomized.  The problem is that he only used a random INT
> between 0 and 300,000.  But this was a test for 10 million rows inserted.
> So, this attribute as clustered can not possibly come close to uniqueness,
> 300,000/10,000,000.

    I should have explained better when I sent the script -- it is intended
to simulate a Web site with repeat visitors.  Assuming 300,000 customers, a
lot of them coming back (somewhat randomly, as customers do) to place new
orders.  I'm not sure what kind of unique business key you could use in that
case, at least not one that isn't absolutely gigantic!  The question was,
would clustering on the CustomerId be better for performance -- given that
the majority of queries against the table are based on CustomerId (customers
asking for their own orders / order history), this was quite a valid
question.  The answer was that selecting from the table with this cluster
was definitely faster -- but not enough to balance out the insert cost.  And
I think you'll find in that case that defining a cluster based on a
"business key" will be expensive enough to far outweigh any perceived
benefit.  What columns would make up this business key?  And how large will
it end up being?  Do you really want to add 20+ bytes (just an estimate; I
haven't actually figured out how to generate a good key for that situation)
to every row of your non-clustered indexes?

    Again, I think you need to broaden your outlook a bit.  No solution is
going to be right every time.  Stop trying to find a formula ("IDENTITY is
always good" vs. "IDENTITY is always bad").  Every system is different, and
if you try to find a magic one-size-fits-all solution you'll very quickly
find situations in which it doesn't apply.  That's why the best answer
that's given to most of the questions in these forums is, "it depends" --
which is a superset of the answer that's usually given, "test it on your
end!"  If you can figure out problem classes in which IDENTITY does not
apply, that's great.  Sounds like a good article for SQL Server Magazine or
SQL Server Professional.  But don't assume that your answer will hold true
in all environments or for all business problems.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
11 Apr 2005 6:23 PM
Anthony Thomas
As another thought in regards to my other response I just made, perhaps the
IDENTITY is NOT the Business Key as I had first assumed with regards to your
session table.  Perhaps the composite DATETIME and CustomerID is the
Business Key, and, again, we are still talking about surrogate versus
Business Key as the proper choice for the Clustered Index.  Certainly, this
Alternate Key must at least be a Business Key candidate and would be worth
examining.

I realize that everyone is probably getting bored with such an academic
discussion, but you are helping me drive the questions that are helping me
to justify my position to all of the vendors I argue with on a daily basis
over this topic.

It has occurred to me that, in addition to Hugo's suggestion that a PK
IDENTITY clustered index may be preferrable for complex, multi-table,
queries, I have yet to test the performance of a composite clustered index.

If anyone is interested, I will post the results here as the
others--shortly, I hope.

Sincerely,


Anthony Thomas


--

"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl...
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl...
>
> I used a VARCHAR(30) attibute with a random leading character, with a
random
> length, up to 30 characters, while he used another INT, just not an
> IDENTITY, but randomized.  The problem is that he only used a random INT
> between 0 and 300,000.  But this was a test for 10 million rows inserted.
> So, this attribute as clustered can not possibly come close to uniqueness,
> 300,000/10,000,000.

    I should have explained better when I sent the script -- it is intended
to simulate a Web site with repeat visitors.  Assuming 300,000 customers, a
lot of them coming back (somewhat randomly, as customers do) to place new
orders.  I'm not sure what kind of unique business key you could use in that
case, at least not one that isn't absolutely gigantic!  The question was,
would clustering on the CustomerId be better for performance -- given that
the majority of queries against the table are based on CustomerId (customers
asking for their own orders / order history), this was quite a valid
question.  The answer was that selecting from the table with this cluster
was definitely faster -- but not enough to balance out the insert cost.  And
I think you'll find in that case that defining a cluster based on a
"business key" will be expensive enough to far outweigh any perceived
benefit.  What columns would make up this business key?  And how large will
it end up being?  Do you really want to add 20+ bytes (just an estimate; I
haven't actually figured out how to generate a good key for that situation)
to every row of your non-clustered indexes?

    Again, I think you need to broaden your outlook a bit.  No solution is
going to be right every time.  Stop trying to find a formula ("IDENTITY is
always good" vs. "IDENTITY is always bad").  Every system is different, and
if you try to find a magic one-size-fits-all solution you'll very quickly
find situations in which it doesn't apply.  That's why the best answer
that's given to most of the questions in these forums is, "it depends" --
which is a superset of the answer that's usually given, "test it on your
end!"  If you can figure out problem classes in which IDENTITY does not
apply, that's great.  Sounds like a good article for SQL Server Magazine or
SQL Server Professional.  But don't assume that your answer will hold true
in all environments or for all business problems.


--
Adam Machanic
SQL Server MVP
http://www.datamanipulation.net
--
Author
11 Apr 2005 9:33 PM
Daniel Joskovski
Hi Antony,
I regularly read your posts, and find them very educative.
But in this case I think you are wrong. I will be very short in explaining
my point of view.
You talk about surrogate versus business keys like candidates for clustered
index.
If you have good database design (3NF), you will have tables with business
keys only.
Tables with surrogate keys are (almost without exceptions) result of
normalization, so when you have surrogate key that key is result of
normalization to 2NF so you will not have good alternate business key in the
table with surrogate key. So If I have to chose Surrogate versus business
keys is like choosing between normalization and denormalization. Tests will
not prove nothing since sometimes we prefer denormalization for better
performing queries. But that is only my point of view.

Regards,
Daniel


Show quoteHide quote
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:#y4PJPsPFHA.2604@TK2MSFTNGP10.phx.gbl...
> As another thought in regards to my other response I just made, perhaps
the
> IDENTITY is NOT the Business Key as I had first assumed with regards to
your
> session table.  Perhaps the composite DATETIME and CustomerID is the
> Business Key, and, again, we are still talking about surrogate versus
> Business Key as the proper choice for the Clustered Index.  Certainly,
this
> Alternate Key must at least be a Business Key candidate and would be worth
> examining.
>
> I realize that everyone is probably getting bored with such an academic
> discussion, but you are helping me drive the questions that are helping me
> to justify my position to all of the vendors I argue with on a daily basis
> over this topic.
>
> It has occurred to me that, in addition to Hugo's suggestion that a PK
> IDENTITY clustered index may be preferrable for complex, multi-table,
> queries, I have yet to test the performance of a composite clustered
index.
>
> If anyone is interested, I will post the results here as the
> others--shortly, I hope.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
> news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl...
> "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
> news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl...
> >
> > I used a VARCHAR(30) attibute with a random leading character, with a
> random
> > length, up to 30 characters, while he used another INT, just not an
> > IDENTITY, but randomized.  The problem is that he only used a random INT
> > between 0 and 300,000.  But this was a test for 10 million rows
inserted.
> > So, this attribute as clustered can not possibly come close to
uniqueness,
> > 300,000/10,000,000.
>
>     I should have explained better when I sent the script -- it is
intended
> to simulate a Web site with repeat visitors.  Assuming 300,000 customers,
a
> lot of them coming back (somewhat randomly, as customers do) to place new
> orders.  I'm not sure what kind of unique business key you could use in
that
> case, at least not one that isn't absolutely gigantic!  The question was,
> would clustering on the CustomerId be better for performance -- given that
> the majority of queries against the table are based on CustomerId
(customers
> asking for their own orders / order history), this was quite a valid
> question.  The answer was that selecting from the table with this cluster
> was definitely faster -- but not enough to balance out the insert cost.
And
> I think you'll find in that case that defining a cluster based on a
> "business key" will be expensive enough to far outweigh any perceived
> benefit.  What columns would make up this business key?  And how large
will
> it end up being?  Do you really want to add 20+ bytes (just an estimate; I
> haven't actually figured out how to generate a good key for that
situation)
> to every row of your non-clustered indexes?
>
>     Again, I think you need to broaden your outlook a bit.  No solution is
> going to be right every time.  Stop trying to find a formula ("IDENTITY is
> always good" vs. "IDENTITY is always bad").  Every system is different,
and
> if you try to find a magic one-size-fits-all solution you'll very quickly
> find situations in which it doesn't apply.  That's why the best answer
> that's given to most of the questions in these forums is, "it depends" --
> which is a superset of the answer that's usually given, "test it on your
> end!"  If you can figure out problem classes in which IDENTITY does not
> apply, that's great.  Sounds like a good article for SQL Server Magazine
or
> SQL Server Professional.  But don't assume that your answer will hold true
> in all environments or for all business problems.
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
Author
12 Apr 2005 10:42 AM
Anthony Thomas
Normalization to 2NF only creates new etities with FK relations.  The FK
uses and refers back to the parent surrogate; however, the new entitiy
itself still needs to be in 1NF, which means it must define a KEY.  One
could use yet another surrogate, a new one, to use on this new table, but it
still needs the independent key for which you are surrogating.

If this is not the case, then you probably goofed your normalization.  For 1
to 1 relations, not all, but some, are called identifying relationships.  In
this case, I could see where the surrogate key of the parent becomes the
business key on the child.  But then, as I keep arguing with Adam, that
IDENTITY is the business key and you wouldn't have too many other choices
for the clustered index.

I appreciate your comments, but you should go back to your normalization and
review.  Rule #1: every entity has a KEY and this exists BEFORE the creation
of any surrogates.

Sincerely,


Anthony Thomas

--

"Daniel Joskovski" <omnis@NOSPAMunetREMOVECAPS.com.mk> wrote in message
news:ewxmM2tPFHA.2460@TK2MSFTNGP10.phx.gbl...
Hi Antony,
I regularly read your posts, and find them very educative.
But in this case I think you are wrong. I will be very short in explaining
my point of view.
You talk about surrogate versus business keys like candidates for clustered
index.
If you have good database design (3NF), you will have tables with business
keys only.
Tables with surrogate keys are (almost without exceptions) result of
normalization, so when you have surrogate key that key is result of
normalization to 2NF so you will not have good alternate business key in the
table with surrogate key. So If I have to chose Surrogate versus business
keys is like choosing between normalization and denormalization. Tests will
not prove nothing since sometimes we prefer denormalization for better
performing queries. But that is only my point of view.

Regards,
Daniel


Show quoteHide quote
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:#y4PJPsPFHA.2604@TK2MSFTNGP10.phx.gbl...
> As another thought in regards to my other response I just made, perhaps
the
> IDENTITY is NOT the Business Key as I had first assumed with regards to
your
> session table.  Perhaps the composite DATETIME and CustomerID is the
> Business Key, and, again, we are still talking about surrogate versus
> Business Key as the proper choice for the Clustered Index.  Certainly,
this
> Alternate Key must at least be a Business Key candidate and would be worth
> examining.
>
> I realize that everyone is probably getting bored with such an academic
> discussion, but you are helping me drive the questions that are helping me
> to justify my position to all of the vendors I argue with on a daily basis
> over this topic.
>
> It has occurred to me that, in addition to Hugo's suggestion that a PK
> IDENTITY clustered index may be preferrable for complex, multi-table,
> queries, I have yet to test the performance of a composite clustered
index.
>
> If anyone is interested, I will post the results here as the
> others--shortly, I hope.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
> news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl...
> "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
> news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl...
> >
> > I used a VARCHAR(30) attibute with a random leading character, with a
> random
> > length, up to 30 characters, while he used another INT, just not an
> > IDENTITY, but randomized.  The problem is that he only used a random INT
> > between 0 and 300,000.  But this was a test for 10 million rows
inserted.
> > So, this attribute as clustered can not possibly come close to
uniqueness,
> > 300,000/10,000,000.
>
>     I should have explained better when I sent the script -- it is
intended
> to simulate a Web site with repeat visitors.  Assuming 300,000 customers,
a
> lot of them coming back (somewhat randomly, as customers do) to place new
> orders.  I'm not sure what kind of unique business key you could use in
that
> case, at least not one that isn't absolutely gigantic!  The question was,
> would clustering on the CustomerId be better for performance -- given that
> the majority of queries against the table are based on CustomerId
(customers
> asking for their own orders / order history), this was quite a valid
> question.  The answer was that selecting from the table with this cluster
> was definitely faster -- but not enough to balance out the insert cost.
And
> I think you'll find in that case that defining a cluster based on a
> "business key" will be expensive enough to far outweigh any perceived
> benefit.  What columns would make up this business key?  And how large
will
> it end up being?  Do you really want to add 20+ bytes (just an estimate; I
> haven't actually figured out how to generate a good key for that
situation)
> to every row of your non-clustered indexes?
>
>     Again, I think you need to broaden your outlook a bit.  No solution is
> going to be right every time.  Stop trying to find a formula ("IDENTITY is
> always good" vs. "IDENTITY is always bad").  Every system is different,
and
> if you try to find a magic one-size-fits-all solution you'll very quickly
> find situations in which it doesn't apply.  That's why the best answer
> that's given to most of the questions in these forums is, "it depends" --
> which is a superset of the answer that's usually given, "test it on your
> end!"  If you can figure out problem classes in which IDENTITY does not
> apply, that's great.  Sounds like a good article for SQL Server Magazine
or
> SQL Server Professional.  But don't assume that your answer will hold true
> in all environments or for all business problems.
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
Author
15 Apr 2005 12:48 PM
Anthony Thomas
Well, I just finished running the Composite Index tests and here are the
results.

I used Adam's original NOIDENT script but added a DATETIME attribute with a
DEFUALT (GETDATE()) constraint.

I then left the NONCLUSTERED INDEX on the IDENTITY and another on the
CustID.  I then created a CLUSTERED INDEX on the composite somedate and
custid attributes.  I also left Adam original logic for loading the 10
million rows where he randomized custid with only 300,000 possible values,
which in this many inserts, only produced 3% unique rows based solely on
custid.  However, as SQL Server's resolution is about 3.33 ms and I achieved
about 2 records inserted per 1 ms, the somedate DATETIME attribute produced
about 14% unique on this attribute, but it is piecewise increasing.

The results were similar to the strict datetime clustered index results I
ran earlier in my own tests at the total run time was nearly the same as the
IDENT test provided by Adam.  The fragmentation was a little higher, 0%
Logical Fragmentation and 10% Table Level Extent Fragmentation versus the 9%
for the IDENTITY Clustered.  However, as with ALL the other tests, Adams or
mine, the NODE level Extent Fragementation ranged from 50% up to 99%, but
this was the same when using the IDENTITY as the clustered index.

So, the point is that even the composite (DATETIME, INTEGER) index performed
as well as the IDENTITY.

Let's be very clear here.  Everyone has agreed so far that having a Business
Key, or candidate, as the clustered index provides better SELECT performance
on single table queries.  What we have been investigating is the INSERT
performance of having the IDENTITY clustered as an alternative with the
thought that reducing page splits would speed up operations on a system that
primarily performs this activity.

The results are, for single table manipulation, INSERT or SELECT, having a
Business Key or cadidate is better, or as good, as using the surrogate
IDENTITY attribute.

So, I go back to my original statement, WHEN IS IT BETTER to choose the
IDENTITY as the clustered index?  I claim that it is NEVER.

Hugo, however, has also suggested that I need to examine more complicated
operations due to potential joining benefits by having the clustered index
on the IDENTITY.  I disagree, but will abstain until I perform those tests
as well.

Any other thoughts?

Sincerely,


Anthony Thomas


--

"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:OXupmy0PFHA.1932@tk2msftngp13.phx.gbl...
Normalization to 2NF only creates new etities with FK relations.  The FK
uses and refers back to the parent surrogate; however, the new entitiy
itself still needs to be in 1NF, which means it must define a KEY.  One
could use yet another surrogate, a new one, to use on this new table, but it
still needs the independent key for which you are surrogating.

If this is not the case, then you probably goofed your normalization.  For 1
to 1 relations, not all, but some, are called identifying relationships.  In
this case, I could see where the surrogate key of the parent becomes the
business key on the child.  But then, as I keep arguing with Adam, that
IDENTITY is the business key and you wouldn't have too many other choices
for the clustered index.

I appreciate your comments, but you should go back to your normalization and
review.  Rule #1: every entity has a KEY and this exists BEFORE the creation
of any surrogates.

Sincerely,


Anthony Thomas

--

"Daniel Joskovski" <omnis@NOSPAMunetREMOVECAPS.com.mk> wrote in message
news:ewxmM2tPFHA.2460@TK2MSFTNGP10.phx.gbl...
Hi Antony,
I regularly read your posts, and find them very educative.
But in this case I think you are wrong. I will be very short in explaining
my point of view.
You talk about surrogate versus business keys like candidates for clustered
index.
If you have good database design (3NF), you will have tables with business
keys only.
Tables with surrogate keys are (almost without exceptions) result of
normalization, so when you have surrogate key that key is result of
normalization to 2NF so you will not have good alternate business key in the
table with surrogate key. So If I have to chose Surrogate versus business
keys is like choosing between normalization and denormalization. Tests will
not prove nothing since sometimes we prefer denormalization for better
performing queries. But that is only my point of view.

Regards,
Daniel


Show quoteHide quote
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
news:#y4PJPsPFHA.2604@TK2MSFTNGP10.phx.gbl...
> As another thought in regards to my other response I just made, perhaps
the
> IDENTITY is NOT the Business Key as I had first assumed with regards to
your
> session table.  Perhaps the composite DATETIME and CustomerID is the
> Business Key, and, again, we are still talking about surrogate versus
> Business Key as the proper choice for the Clustered Index.  Certainly,
this
> Alternate Key must at least be a Business Key candidate and would be worth
> examining.
>
> I realize that everyone is probably getting bored with such an academic
> discussion, but you are helping me drive the questions that are helping me
> to justify my position to all of the vendors I argue with on a daily basis
> over this topic.
>
> It has occurred to me that, in addition to Hugo's suggestion that a PK
> IDENTITY clustered index may be preferrable for complex, multi-table,
> queries, I have yet to test the performance of a composite clustered
index.
>
> If anyone is interested, I will post the results here as the
> others--shortly, I hope.
>
> Sincerely,
>
>
> Anthony Thomas
>
>
> --
>
> "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message
> news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl...
> "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message
> news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl...
> >
> > I used a VARCHAR(30) attibute with a random leading character, with a
> random
> > length, up to 30 characters, while he used another INT, just not an
> > IDENTITY, but randomized.  The problem is that he only used a random INT
> > between 0 and 300,000.  But this was a test for 10 million rows
inserted.
> > So, this attribute as clustered can not possibly come close to
uniqueness,
> > 300,000/10,000,000.
>
>     I should have explained better when I sent the script -- it is
intended
> to simulate a Web site with repeat visitors.  Assuming 300,000 customers,
a
> lot of them coming back (somewhat randomly, as customers do) to place new
> orders.  I'm not sure what kind of unique business key you could use in
that
> case, at least not one that isn't absolutely gigantic!  The question was,
> would clustering on the CustomerId be better for performance -- given that
> the majority of queries against the table are based on CustomerId
(customers
> asking for their own orders / order history), this was quite a valid
> question.  The answer was that selecting from the table with this cluster
> was definitely faster -- but not enough to balance out the insert cost.
And
> I think you'll find in that case that defining a cluster based on a
> "business key" will be expensive enough to far outweigh any perceived
> benefit.  What columns would make up this business key?  And how large
will
> it end up being?  Do you really want to add 20+ bytes (just an estimate; I
> haven't actually figured out how to generate a good key for that
situation)
> to every row of your non-clustered indexes?
>
>     Again, I think you need to broaden your outlook a bit.  No solution is
> going to be right every time.  Stop trying to find a formula ("IDENTITY is
> always good" vs. "IDENTITY is always bad").  Every system is different,
and
> if you try to find a magic one-size-fits-all solution you'll very quickly
> find situations in which it doesn't apply.  That's why the best answer
> that's given to most of the questions in these forums is, "it depends" --
> which is a superset of the answer that's usually given, "test it on your
> end!"  If you can figure out problem classes in which IDENTITY does not
> apply, that's great.  Sounds like a good article for SQL Server Magazine
or
> SQL Server Professional.  But don't assume that your answer will hold true
> in all environments or for all business problems.
>
>
> --
> Adam Machanic
> SQL Server MVP
> http://www.datamanipulation.net
> --
>
>
Author
23 Mar 2005 9:30 PM
Hugo Kornelis
On Wed, 23 Mar 2005 09:59:05 -0600, Anthony Thomas wrote:

>Thank you for your comments.  Yes, I do agree that the test need to start
>broading there scope into the addition, and more typical kinds of queries.
>
>The intial discussion however began with a question regarding the use of a
>ROWGUID as a clustering index.  I made an generalist statement that
>surrogate keys make lousy cluster index candidates, which the response
>initiated this thread.

Hi Anthony,

I think this thread is mainly a result of you posting a statement that
was too generalist:

>>> Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or
>>> GUID.  Usually, the Business Key the surrogate is proxiing for is a better
>>> Cluster Index candidate.

In situations like this, there is no "ALWAYS", and hardly ever a
"Usually". I consider this quote (from your original post to this
thread) bad advice - not because it's false, but because it gives the
impression that there might be one single "always correct" solution,
where there really isn't.

>  The proof offered was a matter of lessing splits.

Actually, you didn't ask for a proof, but for one example. :-)

Anyway, since I don't consider myself an expert in the field of mass
inserts and their effect on page allocation, I've decided not to mingle
in this part of your exchange with Adam. I do read it with great
interest, though!


(snip text about fragmentation)
>
>Now, about "defaults," I agree.  Unfortuantely, we have the majority of our
>systems vended, by BIG SOFTWARE companies, that take the advice and the
>default behavior of the SQLEM tool and generate hundreds of tables, all with
>surrogate PK IDENTITY clustered, as a matter of default, precisely because
>the tool does it and people like Kimberly Tripp and Adam Mechanic, and,
>perhaps, even you, espouse their use.

You clearly haven't read many of my newsgroup posts. For if you had,
you'd never have suggested that I, perhaps, might espouse the use of
identity PKs, either clustered or nonclustered. In fact, I've posted
many messages where I fulminate against the mindless slapping of an
identity key on any table.

Just to clarify things, my stance toward surrogate keys is:
a) Always define the natural key during data modeling, before even
starting to create tables.
b) Usually, the natural key is the best candidate for primary key.
c) In some cases, a surrogate key might be worth consideration: if the
natural key is composite (2 columns is not a problem, 4 or more is a
definite candidate for contemplating a surrogate key) or very long
(usually varchar columns with names or descriptions), AND the table is
being referred to in other tables, then you should consider a surrogate
key.
d) If a surrogate key is used, it should always be IN ADDITION TO the
natural key - the natural key columns should still be included in the
table and defined in a UNIQUE constraint.
e) If a surrogate key is used, identity is usually a better choice than
uniqueidentifier.
f) The surrogate key should only be used internally and never exposed to
the end user. They should use the natural key to identify instances of
an entity.

And in questions regarding clustered vs nonclustered index, my usual
reply is that it depends and should be tested in the environment where
it will run (or one as similar as possible). I do sometimes advise a
specific choice if a specific query is included in the question, but
never without the warning that any change to indexexs might impact
performance of other parts of the application and should therefor be
tested thoroughly.


>The point of these was to attempt to give general advice that would prove
>more beneficial than what has been typically given.  And, then, proceeded to
>prove my point.

I think that the only general advice that can safely be given in SQL
Server related issues is "it depends" or "you really should test all
variants on your end and check the results".


(snip text about fragmentation)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
24 Mar 2005 4:56 AM
Anthony Thomas
As to the first, you are correct on all counts.

As to the generalist advice, yep.  It was hyperbole expressly because of
articles like Kimberly's Q & A site that Adam hooked me up with that
expressly gives as opinion that most tables should be defined with a
surrogate Identity and that this should be the definition of the clustered
key.

I got exactly what I intended: to antagonize someone into a discussion on
the topic just so this conversion could occur-I often find one gets engaged
when irritated.

So, now, we have quite a healthy thread occurring and some research
expressed precisely because I was so beligerent.

As for your design issues?  RIGHT ON!  Couldn't have said it better myself.
Have been preaching this mantra for what seems like forever.  One of these
days, maybe some of our vendors will actually listen.

Thanks for all of your feedback.  Now, beligerence aside, I know that I am
among like-minded folk.

Sincerely,


Anthony Thomas


--

"Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message
news:ilm341ds2bgl8235vi90mlgifsvvfn5384@4ax.com...
On Wed, 23 Mar 2005 09:59:05 -0600, Anthony Thomas wrote:

>Thank you for your comments.  Yes, I do agree that the test need to start
>broading there scope into the addition, and more typical kinds of queries.
>
>The intial discussion however began with a question regarding the use of a
>ROWGUID as a clustering index.  I made an generalist statement that
>surrogate keys make lousy cluster index candidates, which the response
>initiated this thread.

Hi Anthony,

I think this thread is mainly a result of you posting a statement that
was too generalist:

>>> Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or
>>> GUID.  Usually, the Business Key the surrogate is proxiing for is a
better
>>> Cluster Index candidate.

In situations like this, there is no "ALWAYS", and hardly ever a
"Usually". I consider this quote (from your original post to this
thread) bad advice - not because it's false, but because it gives the
impression that there might be one single "always correct" solution,
where there really isn't.

>  The proof offered was a matter of lessing splits.

Actually, you didn't ask for a proof, but for one example. :-)

Anyway, since I don't consider myself an expert in the field of mass
inserts and their effect on page allocation, I've decided not to mingle
in this part of your exchange with Adam. I do read it with great
interest, though!


(snip text about fragmentation)
>
>Now, about "defaults," I agree.  Unfortuantely, we have the majority of our
>systems vended, by BIG SOFTWARE companies, that take the advice and the
>default behavior of the SQLEM tool and generate hundreds of tables, all
with
>surrogate PK IDENTITY clustered, as a matter of default, precisely because
>the tool does it and people like Kimberly Tripp and Adam Mechanic, and,
>perhaps, even you, espouse their use.

You clearly haven't read many of my newsgroup posts. For if you had,
you'd never have suggested that I, perhaps, might espouse the use of
identity PKs, either clustered or nonclustered. In fact, I've posted
many messages where I fulminate against the mindless slapping of an
identity key on any table.

Just to clarify things, my stance toward surrogate keys is:
a) Always define the natural key during data modeling, before even
starting to create tables.
b) Usually, the natural key is the best candidate for primary key.
c) In some cases, a surrogate key might be worth consideration: if the
natural key is composite (2 columns is not a problem, 4 or more is a
definite candidate for contemplating a surrogate key) or very long
(usually varchar columns with names or descriptions), AND the table is
being referred to in other tables, then you should consider a surrogate
key.
d) If a surrogate key is used, it should always be IN ADDITION TO the
natural key - the natural key columns should still be included in the
table and defined in a UNIQUE constraint.
e) If a surrogate key is used, identity is usually a better choice than
uniqueidentifier.
f) The surrogate key should only be used internally and never exposed to
the end user. They should use the natural key to identify instances of
an entity.

And in questions regarding clustered vs nonclustered index, my usual
reply is that it depends and should be tested in the environment where
it will run (or one as similar as possible). I do sometimes advise a
specific choice if a specific query is included in the question, but
never without the warning that any change to indexexs might impact
performance of other parts of the application and should therefor be
tested thoroughly.


>The point of these was to attempt to give general advice that would prove
>more beneficial than what has been typically given.  And, then, proceeded
to
>prove my point.

I think that the only general advice that can safely be given in SQL
Server related issues is "it depends" or "you really should test all
variants on your end and check the results".


(snip text about fragmentation)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)
Author
25 Mar 2005 9:42 PM
Hugo Kornelis
On Wed, 23 Mar 2005 22:56:47 -0600, Anthony Thomas wrote:

>As to the first, you are correct on all counts.
>
>As to the generalist advice, yep.  It was hyperbole expressly because of
>articles like Kimberly's Q & A site that Adam hooked me up with that
>expressly gives as opinion that most tables should be defined with a
>surrogate Identity and that this should be the definition of the clustered
>key.
>
>I got exactly what I intended: to antagonize someone into a discussion on
>the topic just so this conversion could occur-I often find one gets engaged
>when irritated.

Hi Anthony,

And a truly interesting discussion it is! I do hope that you and Adam
will continue it, in only to find out the cause of the differences
between your test results. I'm intrigued!

>As for your design issues?  RIGHT ON!  Couldn't have said it better myself.
>Have been preaching this mantra for what seems like forever.

Great! The more this mantra is preached, the better.

(After reading a days worth newsgroup posts, I often find myself hoping
that the typical design posted in the typical newsgroup question is not
exemplary for the average computer professional in the world - but I
also often find myself fearing that my hope is in vain...)

Best, Hugo
--

(Remove _NO_ and _SPAM_ to get my e-mail address)

Bookmark and Share