|
sql
newsgroups
|
|||||||||||||||||||||||
|
|||||||||||||||||||||||
Quick question about rowguid column
Is there any issue with assigning the primary key to a rowguid column?
Thoughts? WB Yes.....
1. Creates a VERY Wide Index that is inefficient (Compared to INT for example) 2. IF this is also the Clustered Index (I recommend against), Then ALL Indexes must propogate this GUID in their lookup tables 3. Clustering on GUID can cause page splits,etc Company I recently worked at just added an identity column to all tables in DB and clustered on the Ident instead of the GUID. Inserts of records sped up tremendously. I cant quote the exact number of records, bu the "Process" went form 17 minutes to literally seconds.... Food for thought Some companies (My current employer for example) want to use a GUID type of identifier as opposed to integers for whatever reason. So what they do, to ensure that the GUID values are Monotonically increasing in value (to avoid page splits), is the create their own custom "GUID" type data generator in which the algorithm ensure incrementing values. Greg Jackson PDX, Oregon Show quoteHide quote "WB" <none> wrote in message news:OOk1gk2HFHA.2456@TK2MSFTNGP09.phx.gbl... > Is there any issue with assigning the primary key to a rowguid column? > Thoughts? > > WB > > Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or
GUID. Usually, the Business Key the surrogate is proxiing for is a better Cluster Index candidate. Sincerely, Anthony Thomas -- "pdxJaxon" <GregoryAJack***@Hotmail.com> wrote in message 1. Creates a VERY Wide Index that is inefficient (Compared to INT fornews:uaC7za3HFHA.1948@TK2MSFTNGP14.phx.gbl... Yes..... example) 2. IF this is also the Clustered Index (I recommend against), Then ALL Indexes must propogate this GUID in their lookup tables 3. Clustering on GUID can cause page splits,etc Company I recently worked at just added an identity column to all tables in DB and clustered on the Ident instead of the GUID. Inserts of records sped up tremendously. I cant quote the exact number of records, bu the "Process" went form 17 minutes to literally seconds.... Food for thought Some companies (My current employer for example) want to use a GUID type of identifier as opposed to integers for whatever reason. So what they do, to ensure that the GUID values are Monotonically increasing in value (to avoid page splits), is the create their own custom "GUID" type data generator in which the algorithm ensure incrementing values. Greg Jackson PDX, Oregon Show quoteHide quote "WB" <none> wrote in message news:OOk1gk2HFHA.2456@TK2MSFTNGP09.phx.gbl... > Is there any issue with assigning the primary key to a rowguid column? > Thoughts? > > WB > > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message There's no such thing as "usually". There are plenty of applicationsnews:uPDlrHwKFHA.3340@TK2MSFTNGP14.phx.gbl... > Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or > GUID. Usually, the Business Key the surrogate is proxiing for is a better > Cluster Index candidate. for either type of clustering key -- and clustering on a sequential integer can be especially beneficial in many situations. You should be very careful with that assumption. Name one.
Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message There's no such thing as "usually". There are plenty of applicationsnews:OYusbfwKFHA.3132@TK2MSFTNGP12.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:uPDlrHwKFHA.3340@TK2MSFTNGP14.phx.gbl... > Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or > GUID. Usually, the Business Key the surrogate is proxiing for is a better > Cluster Index candidate. for either type of clustering key -- and clustering on a sequential integer can be especially beneficial in many situations. You should be very careful with that assumption. "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Inserting data.news:%23gGre3wKFHA.1176@TK2MSFTNGP12.phx.gbl... > Name one. Yes, and Inserting Data in Business Key order shouldn't cause an inordinate
amount of page splits if you have your fill factors and database maintenance routines set correctly. And, having the Cluster Index on the Business Key will far outweigh any split issues compared to any scans you may encounter and gather the covering aspects of the clustered index plus it is typlically the Business Key that most range queries are conducted on. Care to try again? Are you really telling me that page splits is the BEST reason you can come up with? Please. Are you really telling me that was the whole design goal with having Clustered Indexes over Heaps to begin with? You've got to be kidding me. Can I get a witness? Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message Inserting data.news:OuxM5KxKFHA.4052@tk2msftngp13.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:%23gGre3wKFHA.1176@TK2MSFTNGP12.phx.gbl... > Name one. "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message No. Never worked in a 24-7 (4-nines, and we're trying for 5 this year),news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl... > > Can I get a witness? high-volume OLTP environment, have you? Good luck maintaining those fill factors when you have no maintenence window. Would agree that the DBREINDEX can be troublesome in highly available
systems. But the fill factors should be static from a design perspective. Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive table access only on the table it is currently running against. Running the Cluster Index rebuild WITH DROP EXISTING can speed up the process. But then, again, that is a management issue. The 5 9's do NOT refer to TOTAL TIME, but to SLA time. And, EVERY SYSTEM HAS TO HAVE AN SLA, which would include a maintenance window. If you don't have an APPROPRIATE SLA, then your 5 9's don't mean very much. But, yes, we maintain 5, and even 6, 9's of availability, but properly measured against reasonable SLAs that have actually been documented and constructed with the various Business Units that are demanding the system availabilities. I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as delusional as the Business Units may be, then sacrificing performance over page splits would be one reason to not care about the Cluster Index placement. But, then, why bother and just use a HEAP. You're not going to get much better performance from an ill-chosen Clustered Index. Some, but not much. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message No. Never worked in a 24-7 (4-nines, and we're trying for 5 thisnews:%23Ra$B$8KFHA.2252@TK2MSFTNGP15.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl... > > Can I get a witness? year), high-volume OLTP environment, have you? Good luck maintaining those fill factors when you have no maintenence window. -- Adam Machanic SQL Server MVP http://www.datamanipulation.net -- Since you brought it up...well, and I brought it up, the topic of SLAs and
availability metrics is an interestin topic in its own right. Craig Mullins has some good articles on the topic you should check out. http://www.craigsmullins.com/dbta_006.htm Sincerely, Anthony Thomas -- "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Would agree that the DBREINDEX can be troublesome in highly availablenews:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl... systems. But the fill factors should be static from a design perspective. Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive table access only on the table it is currently running against. Running the Cluster Index rebuild WITH DROP EXISTING can speed up the process. But then, again, that is a management issue. The 5 9's do NOT refer to TOTAL TIME, but to SLA time. And, EVERY SYSTEM HAS TO HAVE AN SLA, which would include a maintenance window. If you don't have an APPROPRIATE SLA, then your 5 9's don't mean very much. But, yes, we maintain 5, and even 6, 9's of availability, but properly measured against reasonable SLAs that have actually been documented and constructed with the various Business Units that are demanding the system availabilities. I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as delusional as the Business Units may be, then sacrificing performance over page splits would be one reason to not care about the Cluster Index placement. But, then, why bother and just use a HEAP. You're not going to get much better performance from an ill-chosen Clustered Index. Some, but not much. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message No. Never worked in a 24-7 (4-nines, and we're trying for 5 this year),news:%23Ra$B$8KFHA.2252@TK2MSFTNGP15.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl... > > Can I get a witness? high-volume OLTP environment, have you? Good luck maintaining those fill factors when you have no maintenence window.
http://www.dbta.com/columnists/craig_mullins/dba_corner_0902.html
http://www.craigsmullins.com/dbta_026.htm --
"Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Since you brought it up...well, and I brought it up, the topic of SLAs andnews:OH8L1F%23KFHA.2748@TK2MSFTNGP09.phx.gbl... availability metrics is an interestin topic in its own right. Craig Mullins has some good articles on the topic you should check out. http://www.craigsmullins.com/dbta_006.htm Sincerely, Anthony Thomas -- "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Would agree that the DBREINDEX can be troublesome in highly availablenews:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl... systems. But the fill factors should be static from a design perspective. Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive table access only on the table it is currently running against. Running the Cluster Index rebuild WITH DROP EXISTING can speed up the process. But then, again, that is a management issue. The 5 9's do NOT refer to TOTAL TIME, but to SLA time. And, EVERY SYSTEM HAS TO HAVE AN SLA, which would include a maintenance window. If you don't have an APPROPRIATE SLA, then your 5 9's don't mean very much. But, yes, we maintain 5, and even 6, 9's of availability, but properly measured against reasonable SLAs that have actually been documented and constructed with the various Business Units that are demanding the system availabilities. I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as delusional as the Business Units may be, then sacrificing performance over page splits would be one reason to not care about the Cluster Index placement. But, then, why bother and just use a HEAP. You're not going to get much better performance from an ill-chosen Clustered Index. Some, but not much. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message No. Never worked in a 24-7 (4-nines, and we're trying for 5 this year),news:%23Ra$B$8KFHA.2252@TK2MSFTNGP15.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eCMwAS5KFHA.2860@TK2MSFTNGP10.phx.gbl... > > Can I get a witness? high-volume OLTP environment, have you? Good luck maintaining those fill factors when you have no maintenence window. You should probably read Kimberly Tripp's Q&A site:
http://www.sqlskills.com/ConsolidatedQA.asp This will clarify a lot of things for you, I think. Heaps are certainly not a good choice for insert performance, as gaps will be filled if rows are deleted. And I'm confused about your comment regarding sacrificing performance -- clustering on an IDENTITY or other sequential key will do the opposite in many cases. Insert, and in many cases read performance will both benefit. I've run extensive tests to prove this (I'm a load testing fanatic) and you'll find upon reading that web page that Kimberly apparently agrees with me. I'm not sure what basis your arguments have, but you may want to run some tests for yourself. Thanks for the links on SLAs -- I will send them to our business team :-) "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl... Would agree that the DBREINDEX can be troublesome in highly available systems. But the fill factors should be static from a design perspective. Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive table access only on the table it is currently running against. Running the Cluster Index rebuild WITH DROP EXISTING can speed up the process.But then, again, that is a management issue. The 5 9's do NOT refer to TOTAL TIME, but to SLA time. And, EVERY SYSTEM HAS TO HAVE AN SLA, which would include a maintenance window. If you don't have an APPROPRIATE SLA, then your 5 9's don't mean very much. But, yes, we maintain 5, and even 6, 9's of availability, but properly measured against reasonable SLAs that have actually been documented and constructed with the various Business Units that are demanding the system availabilities. I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as delusional as the Business Units may be, then sacrificing performance over page splits would be one reason to not care about the Cluster Index placement. But, then, why bother and just use a HEAP. You're not going to get much better performance from an ill-chosen Clustered Index. Some, but not much. Sincerely, Anthony Thomas -- Okay. I took you up on your suggestions.
Yes, I read Kimberly Tripp's Q&A and you two agree, to an extent. She has a basis for "Defaulting" to clustering the PK IDENTITY regardless if it is a surrogate key or not. She also admits that this will create a "hot spot" in the data file but assumes the end-users will immediately want to retrieve this information after insertion and that the hot spots is mitigated by having the recent insert already in cache. She also conceedes that the Business Key would be an alternative Cluster Index candidate in some situations. I also took your advice and ran my own tests. Here's what I discovered: The page splits happen only as a course of inserts; so, table types that have many more inserts versus other CRUD or Query operations may benefit for what you two are suggesting. I also agree with the fact that HEAPs are detrimental for the myriad of reasons you and Kimberly point out. However, I still disagree with the notion of using IDENTITY Clustered Indexes as a matter of default. Here is why: The majority of tables are not of the type the two of you suggest would be beneficial for using an IDENTITY attribute. The majority of tables are of the reference type. For example, a Customer, Author, Title, Orders, and the various other look up and reference kinds of tables. Once entered, these tables are queried and/or joined for reference information far more often in query type statements than they ever are versus the intitial INSERT. Moreover, my previous comments ring true. That more often than not, a SELECT * or at least many of the columns for these reference tables are included. In which case, the Clustered Index is chosen more often than not because it is a covering index. However, these types of tables typically JOIN and are FILTERED by the very Business Key that defines the Unique Contraint. When the Clustered Index is the surrogate IDENTITY you end up forcing a Clustered Index Scan whereas having the Unique Constraint as the Clustered Index, you end up with a Clustered Index Seek, which is more efficient. So, as a matter of "default," you will cover more tables if you choose the Business Key as the Cluster Index instead of the surrogate IDENTITY and will end up with more efficient queries. So, what to do with the handful of transactional tables, that are low in number of tables, but high in the quantity of data? My tests have shown that the page splits is controlled more by the FILL FACTOR, as I suggested, than by choosing the surrogate key over the Business Key. Now, you are correct in that choosing increasing keys will force the splits at the ends, which Kimberly and I both agree will create a local "hot spot." What we disagree on is whether or not this is desirable. Next, having an appropriate FILL Factor and choosing the Business Key for the Cluster Index, can distribute this "hot spot" activity throughout the database files. I contend that this is more desirable and have found that even with the default FILL Factor causes few splits than using the surrogate key as the Cluster Index. Finally, there are situations, only with the transaction tables, that the IDENTITY is NOT a surrogate, but the Business Key itself. In these situations, only, I think we agree to use this as the Cluster Key, but only because our definitions have coincided, not because of any concession on either of our part. I am sorry, but the tests I have just conducted, although emperical and limited, at least suggest that what I am say bears truth. I would be curious to know the particulars of your tests in order that I attempt to reproduce your results. Thanks for all of your time. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message You should probably read Kimberly Tripp's Q&A site:news:uha51IALFHA.568@TK2MSFTNGP09.phx.gbl... http://www.sqlskills.com/ConsolidatedQA.asp This will clarify a lot of things for you, I think. Heaps are certainly not a good choice for insert performance, as gaps will be filled if rows are deleted. And I'm confused about your comment regarding sacrificing performance -- clustering on an IDENTITY or other sequential key will do the opposite in many cases. Insert, and in many cases read performance will both benefit. I've run extensive tests to prove this (I'm a load testing fanatic) and you'll find upon reading that web page that Kimberly apparently agrees with me. I'm not sure what basis your arguments have, but you may want to run some tests for yourself. Thanks for the links on SLAs -- I will send them to our business team :-) "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Would agree that the DBREINDEX can be troublesome in highly availablenews:eXByZ$9KFHA.3832@TK2MSFTNGP12.phx.gbl... systems. But the fill factors should be static from a design perspective. Also, the DBREINDEX is semi-online in that it only needs to obtain exclusive table access only on the table it is currently running against. Running the Cluster Index rebuild WITH DROP EXISTING can speed up the process. But then, again, that is a management issue. The 5 9's do NOT refer to TOTAL TIME, but to SLA time. And, EVERY SYSTEM HAS TO HAVE AN SLA, which would include a maintenance window. If you don't have an APPROPRIATE SLA, then your 5 9's don't mean very much. But, yes, we maintain 5, and even 6, 9's of availability, but properly measured against reasonable SLAs that have actually been documented and constructed with the various Business Units that are demanding the system availabilities. I suppose that if your SLAs DEMANDED 24x7x52, with NO MAINTENANCE, as delusional as the Business Units may be, then sacrificing performance over page splits would be one reason to not care about the Cluster Index placement. But, then, why bother and just use a HEAP. You're not going to get much better performance from an ill-chosen Clustered Index. Some, but not much. Sincerely, Anthony Thomas -- Hi Anthony,
A few comments: > The page splits happen only as a course of inserts; so, table types that This is the environment in which I work currently, and that was the> have many more inserts versus other CRUD or Query operations may benefit for > what you two are suggesting. assumption I made when testing -- the systems at my place of work are used to process orders from numerous e-commerce sites -- we do all of the order logic externally (in a big queue system, primarily), then insert all of the order data. It's queried very rarely once inserted -- most customers will never return to ask for their order data, and our systems don't need to query it again 90% of the time. That's one reason I mentioned OLTP -- most of my experience before I came here was with data warehousing projects and this would certainly not apply there! I don't know what other OLTP scenarios exist -- I assumed we were fairly consistent with the norm. Is that not true in your experience? > However, I still disagree with the notion of using IDENTITY Clustered Just to clarify, I don't remember anyone saying "default"! I certainly> Indexes as a matter of default. Here is why: wouldn't suggest a one-size fits all. I believe our debate was about whether or not they could be used in any situation -- if I implied that it's good in every situation and that's what caused this debate, I apologize. That was certainly not my thought on the matter. > beneficial for using an IDENTITY attribute. The majority of tables are of Customers/Orders are two of the most inserted-into tables in the systems I> the reference type. For example, a Customer, Author, Title, Orders, and the > various other look up and reference kinds of tables. Once entered, these > tables are queried and/or joined for reference information far more often in > query type statements than they ever are versus the intitial INSERT. work on (the most inserts happen on the Line Items table, since most orders have more than one item). But assuming you had an IDENTITY surrogate key, I see no reason why clustering on it would not be a good idea even in tables that don't change often, like Author. Assume, that the table had all sorts of columns for each author -- primary publisher, editor, city of residence, maybe some biographical info, etc, etc, etc. Most queries into the system, however, would be for books by that author. If the table were clustered by the IDENTITY, a non-clustered index could be created on the Author Name column, and since non-clustered indexes include the clustering key as the leaf value, a bookmark lookup would not be necessary when using this table to query, e.g., the Books table. So that would be one case in which I still think that clustering on the IDENTITY would be valuable. What would you cluster on? > Moreover, my previous comments ring true. That more often than not, a Not in my experience; see the Authors table listed above; and many other> SELECT * or at least many of the columns for these reference tables are > included. In which case, the Clustered Index is chosen more often than not 'reference' tables have only a single column. For example "Countries" will have a column called "Country". In that case, I agree, clustering on the surrogate key (if there is one) is probably not necessary. But in such small tables as those, it really doesn't matter too much what you cluster on -- the entire table is going to be in memory anyway. > So, as a matter of "default," you will cover more tables if you choose the Again, there should never, ever, be a default! We as data modelers should> Business Key as the Cluster Index instead of the surrogate IDENTITY and will > end up with more efficient queries. always strive for the best solution, and one-size-fits-all is never the best solution except in baseball cap manufacturing :) > than by choosing the surrogate key over the Business Key. Now, you are This probably depends a great deal on the disk system that you're using -- > correct in that choosing increasing keys will force the splits at the ends, > which Kimberly and I both agree will create a local "hot spot." What we > disagree on is whether or not this is desirable. Next, having an > appropriate FILL Factor and choosing the Business Key for the Cluster Index, > can distribute this "hot spot" activity throughout the database files. I > contend that this is more desirable and have found that even with the > default FILL Factor causes few splits than using the surrogate key as the > Cluster Index. it's not desirable, if you're doing heavy inserts, to have the write heads jumping all around the disk. It's much better to keep them at the end of the disk, writing sequentially. However, in certain RAID scenarios (and probably SANs, which I don't have experience with), that equation changes. What types of disks did you tesk on? I ran my tests on a small RAID 10 setup -- but results appear to remain consistent on our larger production system. > I am sorry, but the tests I have just conducted, although emperical and What were your tests? Can you post them? I've attached a couple of> limited, at least suggest that what I am say bears truth. I would be > curious to know the particulars of your tests in order that I attempt to > reproduce your results. very simple insert-only tests and their results, which show a huge difference in time -- I'll try to find some of my other tests and post those as well; I also tested various sequential and random reads, etc. I did find that clustering on "business" keys is beneficial for read in a variety of cases, but at least here, that's totally overhsadowed by the need for fast inserts. -- Adam Machanic SQL Server MVP http://www.datamanipulation.net -- [attached file: sequential_tests.zip] On Tue, 22 Mar 2005 07:05:30 -0600, Anthony Thomas wrote:
(snip) (snip) >Moreover, my previous comments ring true. That more often than not, a Hi Anthony,>SELECT * or at least many of the columns for these reference tables are >included. In which case, the Clustered Index is chosen more often than not >because it is a covering index. However, these types of tables typically >JOIN and are FILTERED by the very Business Key that defines the Unique >Contraint. When the Clustered Index is the surrogate IDENTITY you end up >forcing a Clustered Index Scan whereas having the Unique Constraint as the >Clustered Index, you end up with a Clustered Index Seek, which is more >efficient. (snip) Yes, they are often filtered by the business key that defines the unique constraint. No, they are hardly ever joined by that business key. The whole point of introducing an IDENTITY surrogate key is to use that integer value for all references instead of the (often longer, sometimes spanned) business key. So I'd expect joins to be using the identity surrogate key. If most queries include a filter on the business key, AND the optimizer decides to use that filter first in it's execution plan, than a clustered index on the business key would be better than clustering on the surrogate key. But if many queries don't filter the business key (because the filters are on columns in other tables, and this table is only joined in to display some extra columns), OR even if the queries do filter on the business key, but the optimizer decides that the best plan will filter other tables first, then join this table and apply the remainging filter conditions, then the clustered index on the identity column would be best. Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message Excellent point :)news:887141lbp95i5ibkgarn8t1ga3812gpbs0@4ax.com... > > If most queries include a filter on the business key, AND the optimizer > decides to use that filter first in it's execution plan, than a > clustered index on the business key would be better than clustering on > the surrogate key. But if many queries don't filter the business key > (because the filters are on columns in other tables, and this table is > only joined in to display some extra columns), OR even if the queries do > filter on the business key, but the optimizer decides that the best plan > will filter other tables first, then join this table and apply the > remainging filter conditions, then the clustered index on the identity > column would be best. Ok. A few replies...and I'll try to keep this short to minimize the
responding topics. 1. You are corrent in that I mispoke concerning a Joining on the Business Key. Yes, that is the whole point of the surrogate, but as the predominant restriction condition is still true. 2. A Cluster Index on the surrogate key WILL NOT make the join process faster. Think about it. If you have to process a MERGE or INNER LOOP join, with either a prefetch or not, it will have to do a matched seek. Both a clustered index and a non-clustered index would be sufficient for the singleton select lookups. Your suggestion that IDENTITY CLUSTERED INDEXES are more efficient JOIN candidates is hyped but not supported by evidence. 3. Yes, queries against the reference tables are my chief concern because I see this every day with vended solutions that use the PK IDENTITY Clustered Index approach, and it kills the systems. Chiefly because the bulk of the queries, like Customers, etc., are searched on Name Ranges, which the IDENTITY Clustered Primary Key is the worst candidate. 4. This is a matter of "default." The very Q & A Kimberly wrote says this explicitly, the very link you provided. And, it is what vendor after vendor after vendor is shoving into the market place WITHOUT giving it a second thought much less thought per individual table. Plus, the tool, SQL Server, does this by default if you do not have the foresight to tell it otherwise. 5. The Orders table being a reference table is a hybrid, it is usually the Order Details tables that I was thinking of when I was attempting to distinguish between the two, reference and transaction, respectively. Note, I have ONLY been talking about OLTP databases, just the table styles within an OLTP database. 6. My tests showed that the IDENTITY CLUSTERED INDEX had MORE SPLITS, even on a transaction type table than a Business Key. Now, this boggled my mind; it was almost 3 to 1 greater. But think about it; these are B-Tree indexes. If you use an increasing attribute, especially a monitonically increasing one like an IDENTITY, you've unbalanced the index; you are making it lop-sided. You end up causing more Intermediate Node splits more often than if you had randomly split the leafs throughout the table. Now, there may be a better explaination, but from the first pass tests I conducted, this is what I saw. 7. The initial question was regarding INSERTS, splits, versus the payoff of using the Business Key for queries. So, my tests have been first conducted on individual tables. I am looking into multi-table, with relationships, types of tests to conduct next. For now, however, here is what I did. CREATE TABLE MyTable (MyID INT IDENTITY NOT NULL PRIMARY KEY CLUSTERED ,MyName VARCHAR(30) NOT NULL UNIQUE NONCLUSTERED ,MyDescription VARCHAR(50) NOT NULL ,MyCreateDate DATETIME NOT NULL DEFAULT (GETDATE()) ) CREATE NONCLUSTERED INDEX IX01_MyTable ON MyTable(MyCreateDate) Now, I ran several passes changing the Clustered Index attribute, single column, and then reconducting the test. The test was to monitor the split behavior, both at the leaf as well as the node level and then look at the fragmentation once completed with the inserts of 1,000,000 records. Each insert was a single row at a time to simulate individual transacitons. The MyID was IDENTITY; so, no explicit value inserted. MyName was preceeded with a CHAR(x) value randomly created, then the length was varied between the single character to the maximum value. MyDescription was just garbage, but the length was randomly filled. MyCreateDate was allowed to assume the transaction time default. After the inserts and fragmentation analysis, a set of predefined queries were ran with the execution plan, client statistics, I/O and Time statistics. The queries were: SELECT * FROM MyTable SELECT * FROM MyTable WHERE MyID = 550000 SELECT * FROM MyTable WHERE MyName LIKE 'M%' SELECT * FROM MyTable WHERE MyName = 'M' SELECT * FROM MyTable WHERE MyDate BETWEEN <1/2 way through the run time> and <3/4 way through the run time> SELECT * FROM MyTable WHERE MyDate = <a randomly selected time from the run time> SELECT MyName, MyCreateDate FROM MyTable SELECT MyName, MyCreateDate FROM MyTable WHERE MyID = 550000 SELECT MyName, MyCreateDate FROM MyTable WHERE MyName LIKE 'M%' SELECT MyName, MyCreateDate FROM MyTable WHERE MyName = 'M' SELECT MyName, MyCreateDate FROM MyTable WHERE MyDate BETWEEN <1/2 way through the run time> and <3/4 way through the run time> SELECT MyName, MyCreateDate FROM MyTable WHERE MyDate = <a randomly selected time from the run time> I made two passes through these, once by just letting it run, the other by running the DBCC FREEPROCCACHE and DBCC DROPCLEANBUFFERS in between statements. This results were saved and then the whole thing with the same constraints but just a diffenent choice of clustered index: IDENTITY (MyID), Business Key (MyName), and then an increasing but not necessarily unique attribute (MyCreateDate). I did my analysis on both sides, the fragmentation and efficiency, time and I/O statistics, of the inserts, and the execution of the above queries. At the moment, I am attempting to summarize the results, given they are quite extensive. However, the examination still bears some of what everyone else is saying, but there are hidden dangers, the Extent Fragmentation for one. And, the Business Key still produced the best executions for these queries. Now, I would not claim that these tests are scientific research, but they did give me better insight in to what I had originaly had considered, but nothing would suggest that I detract from my initial statement, that as a matter of default, which is what most solutions that are present to me are, the Business Key is still the best candidate for the Clustered Index, which, as the DBA, is one of the few, but most effective influence I can apply to an already designed solution. Sincerely, Anthony Thomas -- Show quoteHide quote"Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message Excellent point :)news:OZP1pBzLFHA.1956@TK2MSFTNGP15.phx.gbl... "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message news:887141lbp95i5ibkgarn8t1ga3812gpbs0@4ax.com... > > If most queries include a filter on the business key, AND the optimizer > decides to use that filter first in it's execution plan, than a > clustered index on the business key would be better than clustering on > the surrogate key. But if many queries don't filter the business key > (because the filters are on columns in other tables, and this table is > only joined in to display some extra columns), OR even if the queries do > filter on the business key, but the optimizer decides that the best plan > will filter other tables first, then join this table and apply the > remainging filter conditions, then the clustered index on the identity > column would be best. On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote:
(snip) >2. A Cluster Index on the surrogate key WILL NOT make the join process Hi Anthony,>faster. Think about it. If you have to process a MERGE or INNER LOOP join, >with either a prefetch or not, it will have to do a matched seek. Both a >clustered index and a non-clustered index would be sufficient for the >singleton select lookups. Your suggestion that IDENTITY CLUSTERED INDEXES >are more efficient JOIN candidates is hyped but not supported by evidence. I've thought about it. If the (expected) number of rows is small, a clustered index would typically require a clustered index seek; a nonclustered index would require an index seek followed by a bookmark lookup. Doubles the number of logical reads and increases the number of physical reaads. If the (expected) number of rows is large, a clustered index would give the optimizer the option to use a merge join; I don't see that as a viable option in the nonclustered case, unless the index is covering (merge join the index, then do 100,000 bookmark lookups? sort the table, then merge join, bypassing the index?) (snip) >4. This is a matter of "default." The very Q & A Kimberly wrote says this I think we'll both agree that defaults are fine if you just need to make>explicitly, the very link you provided. And, it is what vendor after vendor >after vendor is shoving into the market place WITHOUT giving it a second >thought much less thought per individual table. Plus, the tool, SQL Server, >does this by default if you do not have the foresight to tell it otherwise. a simple DB and don't want nor need to spend time tuning. If you're taking development seriously and you're producing a DB that is of any serious significance to the business, you'll take none of the defaults for granted. This applies to clustered indexes just as it doess to configuration options, recovery models, security settings, maintenance plans, etc. For the purpose of a simple DB with no need for advanced tuning, the SQL Server default is (IMO) fine as it is - all tables (should) have a primary key, not all have a unique constraint or other supporting index, and since most tables perform better with clustered key than as a heap, it makes sense to default the PK to clustered. Vendors that fail to finetune their product is a royal pain - but not really related to this discussion. (snip) >Now, I would not claim that these tests are scientific research, but they The tests are good, but they only test typical one-table searches. Most>did give me better insight in to what I had originaly had considered, but >nothing would suggest that I detract from my initial statement, that as a >matter of default, which is what most solutions that are present to me are, >the Business Key is still the best candidate for the Clustered Index, which, >as the DBA, is one of the few, but most effective influence I can apply to >an already designed solution. queries in a typical database use joins or subqueries to combine data from two or more tables. I don't have the time to run extensive tests like you did, but if you have, you might try to add a few tests. Some typical queries you might encounter in an ordering system and that might be interesting to test are: SELECT c.CustName, c.CustAddr, o.OrdDate, o.HandledBy, p.ProdName, p.ListPrice, od.Qty FROM Orders AS o INNER JOIN Customers AS c ON c.CustID = o.CustID INNER JOIN OrderDetails AS od ON od.OrderID = o.OrderID INNER JOIN Products AS p ON p.ProdID = od.ProdID WHERE c.CustName = 'Nanotech Inc.' or WHERE o.OrdDate BETWEEN '20050321' AND '20050323' or WHERE p.ProdName = '17" TFT monitor' This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no surrogate key for od), and natural keys on c.CustName, p.ProdName, (o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that clustering the surrogate key will be faster than clustering the natural key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering on the natural key will win (as only the covering index needs to be used). Of course, a complete test would also involve testing with EXISTS / NOT EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the above woould be a good starting point. :-) Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) Thank you for your comments. Yes, I do agree that the test need to start
broading there scope into the addition, and more typical kinds of queries. The intial discussion however began with a question regarding the use of a ROWGUID as a clustering index. I made an generalist statement that surrogate keys make lousy cluster index candidates, which the response initiated this thread. The proof offered was a matter of lessing splits. My test were solely directed at the question of splits versus the querying benefit of clustering the Business Key. But, yes, I am interested in the additional questions now that I have started tearing the issue apart to examine. As to the above question, I have evidence that not only do single table type queries perform better by having the Business Key clustered, the split factor offered as a counter argument has actually proven false by clustering the PK IDENTITY surrogate. There are an enourmous amount of EXTENT FRAGMENTATION. Adam also suggested that the surrogate was a good choice because given his system, he does not have the luxury of running the DBREINDEX maintenance. My results show that he has to that he is not getting what he thought he was. Now, about "defaults," I agree. Unfortuantely, we have the majority of our systems vended, by BIG SOFTWARE companies, that take the advice and the default behavior of the SQLEM tool and generate hundreds of tables, all with surrogate PK IDENTITY clustered, as a matter of default, precisely because the tool does it and people like Kimberly Tripp and Adam Mechanic, and, perhaps, even you, espouse their use. The point of these was to attempt to give general advice that would prove more beneficial than what has been typically given. And, then, proceeded to prove my point. So, I would like someone else to conduct a similar test as I have presented and report on whether or not they same the same FRAGMENTATION characteristics and perfomance issues when ran under these various scenarios. 1 M row inserts are not necessary. I saw the same behaviour at the 10 K rows level, it is just magnified when you can stress the system as I did. Sincerely, Anthony Thomas -- "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote:news:3mg241phmksolj3da4it1vf4g380pvu5dg@4ax.com... (snip) >2. A Cluster Index on the surrogate key WILL NOT make the join process Hi Anthony,>faster. Think about it. If you have to process a MERGE or INNER LOOP join, >with either a prefetch or not, it will have to do a matched seek. Both a >clustered index and a non-clustered index would be sufficient for the >singleton select lookups. Your suggestion that IDENTITY CLUSTERED INDEXES >are more efficient JOIN candidates is hyped but not supported by evidence. I've thought about it. If the (expected) number of rows is small, a clustered index would typically require a clustered index seek; a nonclustered index would require an index seek followed by a bookmark lookup. Doubles the number of logical reads and increases the number of physical reaads. If the (expected) number of rows is large, a clustered index would give the optimizer the option to use a merge join; I don't see that as a viable option in the nonclustered case, unless the index is covering (merge join the index, then do 100,000 bookmark lookups? sort the table, then merge join, bypassing the index?) (snip) >4. This is a matter of "default." The very Q & A Kimberly wrote says this I think we'll both agree that defaults are fine if you just need to make>explicitly, the very link you provided. And, it is what vendor after vendor >after vendor is shoving into the market place WITHOUT giving it a second >thought much less thought per individual table. Plus, the tool, SQL Server, >does this by default if you do not have the foresight to tell it otherwise. a simple DB and don't want nor need to spend time tuning. If you're taking development seriously and you're producing a DB that is of any serious significance to the business, you'll take none of the defaults for granted. This applies to clustered indexes just as it doess to configuration options, recovery models, security settings, maintenance plans, etc. For the purpose of a simple DB with no need for advanced tuning, the SQL Server default is (IMO) fine as it is - all tables (should) have a primary key, not all have a unique constraint or other supporting index, and since most tables perform better with clustered key than as a heap, it makes sense to default the PK to clustered. Vendors that fail to finetune their product is a royal pain - but not really related to this discussion. (snip) >Now, I would not claim that these tests are scientific research, but they The tests are good, but they only test typical one-table searches. Most>did give me better insight in to what I had originaly had considered, but >nothing would suggest that I detract from my initial statement, that as a >matter of default, which is what most solutions that are present to me are, >the Business Key is still the best candidate for the Clustered Index, which, >as the DBA, is one of the few, but most effective influence I can apply to >an already designed solution. queries in a typical database use joins or subqueries to combine data from two or more tables. I don't have the time to run extensive tests like you did, but if you have, you might try to add a few tests. Some typical queries you might encounter in an ordering system and that might be interesting to test are: SELECT c.CustName, c.CustAddr, o.OrdDate, o.HandledBy, p.ProdName, p.ListPrice, od.Qty FROM Orders AS o INNER JOIN Customers AS c ON c.CustID = o.CustID INNER JOIN OrderDetails AS od ON od.OrderID = o.OrderID INNER JOIN Products AS p ON p.ProdID = od.ProdID WHERE c.CustName = 'Nanotech Inc.' or WHERE o.OrdDate BETWEEN '20050321' AND '20050323' or WHERE p.ProdName = '17" TFT monitor' This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no surrogate key for od), and natural keys on c.CustName, p.ProdName, (o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that clustering the surrogate key will be faster than clustering the natural key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering on the natural key will win (as only the covering index needs to be used). Of course, a complete test would also involve testing with EXISTS / NOT EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the above woould be a good starting point. :-) Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message What's the basis for these conclusions? I see no correlation betweennews:O%23Rn%23F8LFHA.2420@TK2MSFTNGP12.phx.gbl... > > factor offered as a counter argument has actually proven false by clustering > the PK IDENTITY surrogate. There are an enourmous amount of EXTENT > FRAGMENTATION. Adam also suggested that the surrogate was a good choice what you're asserting and what I've seen in the real world. For example, I just re-ran the "ident.sql" script I posted a day or two ago in this thread. Here is the result of DBCC SHOWCONTIG after inserting 10,000,000 rows clustered on an IDENTITY column: DBCC SHOWCONTIG scanning 'somestuff' table... Table: 'somestuff' (2073058421); index ID: 1, database ID: 14 TABLE level scan performed. - Pages Scanned................................: 279686 - Extents Scanned..............................: 35080 - Extent Switches..............................: 35079 - Avg. Pages per Extent........................: 8.0 - Scan Density [Best Count:Actual Count].......: 99.66% [34961:35080] - Logical Scan Fragmentation ..................: 0.00% - Extent Scan Fragmentation ...................: 9.39% - Avg. Bytes Free per Page.....................: 140.8 - Avg. Page Density (full).....................: 98.26% Is 9% really "enormous"? .... I'll leave it as an exercise for you to run the "noident" script, since it takes many hours and I don't have the patience right now to wait that long. But I know you won't be pleased with the fragmentation reported at the end of that run. Show quoteHide quote > because given his system, he does not have the luxury of running the > DBREINDEX maintenance. My results show that he has to that he is not > getting what he thought he was. > > Now, about "defaults," I agree. Unfortuantely, we have the majority of our > systems vended, by BIG SOFTWARE companies, that take the advice and the > default behavior of the SQLEM tool and generate hundreds of tables, all with > surrogate PK IDENTITY clustered, as a matter of default, precisely because > the tool does it and people like Kimberly Tripp and Adam Mechanic, and, > perhaps, even you, espouse their use. > > The point of these was to attempt to give general advice that would prove > more beneficial than what has been typically given. And, then, proceeded to > prove my point. > > So, I would like someone else to conduct a similar test as I have presented > and report on whether or not they same the same FRAGMENTATION > characteristics and perfomance issues when ran under these various > scenarios. 1 M row inserts are not necessary. I saw the same behaviour at > the 10 K rows level, it is just magnified when you can stress the system as > I did. > > Sincerely, > > > Anthony Thomas > > > -- > > "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message > news:3mg241phmksolj3da4it1vf4g380pvu5dg@4ax.com... > On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote: > > (snip) > >2. A Cluster Index on the surrogate key WILL NOT make the join process > >faster. Think about it. If you have to process a MERGE or INNER LOOP > join, > >with either a prefetch or not, it will have to do a matched seek. Both a > >clustered index and a non-clustered index would be sufficient for the > >singleton select lookups. Your suggestion that IDENTITY CLUSTERED INDEXES > >are more efficient JOIN candidates is hyped but not supported by evidence. > > Hi Anthony, > > I've thought about it. > > If the (expected) number of rows is small, a clustered index would > typically require a clustered index seek; a nonclustered index would > require an index seek followed by a bookmark lookup. Doubles the number > of logical reads and increases the number of physical reaads. > > If the (expected) number of rows is large, a clustered index would give > the optimizer the option to use a merge join; I don't see that as a > viable option in the nonclustered case, unless the index is covering > (merge join the index, then do 100,000 bookmark lookups? sort the table, > then merge join, bypassing the index?) > > > (snip) > >4. This is a matter of "default." The very Q & A Kimberly wrote says this > >explicitly, the very link you provided. And, it is what vendor after > vendor > >after vendor is shoving into the market place WITHOUT giving it a second > >thought much less thought per individual table. Plus, the tool, SQL > Server, > >does this by default if you do not have the foresight to tell it otherwise. > > I think we'll both agree that defaults are fine if you just need to make > a simple DB and don't want nor need to spend time tuning. If you're > taking development seriously and you're producing a DB that is of any > serious significance to the business, you'll take none of the defaults > for granted. This applies to clustered indexes just as it doess to > configuration options, recovery models, security settings, maintenance > plans, etc. > > For the purpose of a simple DB with no need for advanced tuning, the SQL > Server default is (IMO) fine as it is - all tables (should) have a > primary key, not all have a unique constraint or other supporting index, > and since most tables perform better with clustered key than as a heap, > it makes sense to default the PK to clustered. > > Vendors that fail to finetune their product is a royal pain - but not > really related to this discussion. > > > (snip) > >Now, I would not claim that these tests are scientific research, but they > >did give me better insight in to what I had originaly had considered, but > >nothing would suggest that I detract from my initial statement, that as a > >matter of default, which is what most solutions that are present to me are, > >the Business Key is still the best candidate for the Clustered Index, > which, > >as the DBA, is one of the few, but most effective influence I can apply to > >an already designed solution. > > The tests are good, but they only test typical one-table searches. Most > queries in a typical database use joins or subqueries to combine data > from two or more tables. I don't have the time to run extensive tests > like you did, but if you have, you might try to add a few tests. Some > typical queries you might encounter in an ordering system and that might > be interesting to test are: > > SELECT c.CustName, c.CustAddr, o.OrdDate, o.HandledBy, > p.ProdName, p.ListPrice, od.Qty > FROM Orders AS o > INNER JOIN Customers AS c > ON c.CustID = o.CustID > INNER JOIN OrderDetails AS od > ON od.OrderID = o.OrderID > INNER JOIN Products AS p > ON p.ProdID = od.ProdID > WHERE c.CustName = 'Nanotech Inc.' > > or > > WHERE o.OrdDate BETWEEN '20050321' AND '20050323' > > or > > WHERE p.ProdName = '17" TFT monitor' > > This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no > surrogate key for od), and natural keys on c.CustName, p.ProdName, > (o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that > clustering the surrogate key will be faster than clustering the natural > key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering > on the natural key will win (as only the covering index needs to be > used). > > Of course, a complete test would also involve testing with EXISTS / NOT > EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the > above woould be a good starting point. :-) > > Best, Hugo > -- > > (Remove _NO_ and _SPAM_ to get my e-mail address) > > My apologies, but I'll have to go back through the thread and grab a copy of
your script. I'm curious now as to whether or not I have a configuration issue on my test installation. No, 9% is not enormous, but what I saw on the one I posted and ran with, at the 1,000, 10,000, 100,000, and 1,000,000 row runs was nearer to 90%. Do you think it flips over around 10,000,000? I don't see how that's possible if at all the other scales were showing similar results and none as what you posted here. Thanks everyone for keeping up with this as I do want to get to a conclusive discription as what we have been observing across our installations. I've looked back through the thread and although you mention your script in one of the posts you did not include it, or I can't get access to it through my new reader. Could you post yours again? Because my results don't even come close to what yours did as far as Extent Fragmentation, I reran a quick 1 million row insert on a completely different installation, different OS, different SQL Server build, different SAN, I still get the same results, which are different than yours. I'd really love to conduct the same test as yours. Here's my quick results on the new box: Clustered PK IDENTITY 1,000,000 Rows Test cur_date cntr_value ------------------------------------------------------ ----------- 2005-03-24 00:35:07.577 7247 cur_date cntr_value ------------------------------------------------------ ----------- 2005-03-24 00:59:08.170 23390 DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table... Table: 'MyTableCIndexTest' (213575799); index ID: 1, database ID: 6 TABLE level scan performed. - Pages Scanned................................: 10056 - Extents Scanned..............................: 1262 - Extent Switches..............................: 1261 - Avg. Pages per Extent........................: 8.0 - Scan Density [Best Count:Actual Count].......: 99.60% [1257:1262] - Logical Scan Fragmentation ..................: 0.01% - Extent Scan Fragmentation ...................: 52.38% - Avg. Bytes Free per Page.....................: 41.2 - Avg. Page Density (full).....................: 99.49% DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table... Table: 'MyTableCIndexTest' (213575799); index ID: 2, database ID: 6 LEAF level scan performed. - Pages Scanned................................: 4164 - Extents Scanned..............................: 525 - Extent Switches..............................: 4162 - Avg. Pages per Extent........................: 7.9 - Scan Density [Best Count:Actual Count].......: 12.52% [521:4163] - Logical Scan Fragmentation ..................: 9.70% - Extent Scan Fragmentation ...................: 98.86% - Avg. Bytes Free per Page.....................: 412.3 - Avg. Page Density (full).....................: 94.91% DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table... Table: 'MyTableCIndexTest' (213575799); index ID: 3, database ID: 6 LEAF level scan performed. - Pages Scanned................................: 1856 - Extents Scanned..............................: 234 - Extent Switches..............................: 233 - Avg. Pages per Extent........................: 7.9 - Scan Density [Best Count:Actual Count].......: 99.15% [232:234] - Logical Scan Fragmentation ..................: 0.11% - Extent Scan Fragmentation ...................: 99.57% - Avg. Bytes Free per Page.....................: 14.1 - Avg. Page Density (full).....................: 99.83% DBCC execution completed. If DBCC printed error messages, contact your system administrator. I also ran the DBCC SHOWCONTIG as a resultset with ALL_LEVELS and I'm get fragmentation even at the node levels. I am running a 10 million test just to see if this has anything to do with it--although I doubt it--but it calls into question how two totally different installations could produce similar result and yet be completely different than yours. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message What's the basis for these conclusions? I see no correlation betweennews:OcxzXQ9LFHA.3420@tk2msftngp13.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:O%23Rn%23F8LFHA.2420@TK2MSFTNGP12.phx.gbl... > > factor offered as a counter argument has actually proven false by clustering > the PK IDENTITY surrogate. There are an enourmous amount of EXTENT > FRAGMENTATION. Adam also suggested that the surrogate was a good choice what you're asserting and what I've seen in the real world. For example, I just re-ran the "ident.sql" script I posted a day or two ago in this thread. Here is the result of DBCC SHOWCONTIG after inserting 10,000,000 rows clustered on an IDENTITY column: DBCC SHOWCONTIG scanning 'somestuff' table... Table: 'somestuff' (2073058421); index ID: 1, database ID: 14 TABLE level scan performed. - Pages Scanned................................: 279686 - Extents Scanned..............................: 35080 - Extent Switches..............................: 35079 - Avg. Pages per Extent........................: 8.0 - Scan Density [Best Count:Actual Count].......: 99.66% [34961:35080] - Logical Scan Fragmentation ..................: 0.00% - Extent Scan Fragmentation ...................: 9.39% - Avg. Bytes Free per Page.....................: 140.8 - Avg. Page Density (full).....................: 98.26% Is 9% really "enormous"? .... I'll leave it as an exercise for you to run the "noident" script, since it takes many hours and I don't have the patience right now to wait that long. But I know you won't be pleased with the fragmentation reported at the end of that run. Show quoteHide quote > because given his system, he does not have the luxury of running the > DBREINDEX maintenance. My results show that he has to that he is not > getting what he thought he was. > > Now, about "defaults," I agree. Unfortuantely, we have the majority of our > systems vended, by BIG SOFTWARE companies, that take the advice and the > default behavior of the SQLEM tool and generate hundreds of tables, all with > surrogate PK IDENTITY clustered, as a matter of default, precisely because > the tool does it and people like Kimberly Tripp and Adam Mechanic, and, > perhaps, even you, espouse their use. > > The point of these was to attempt to give general advice that would prove > more beneficial than what has been typically given. And, then, proceeded to > prove my point. > > So, I would like someone else to conduct a similar test as I have presented > and report on whether or not they same the same FRAGMENTATION > characteristics and perfomance issues when ran under these various > scenarios. 1 M row inserts are not necessary. I saw the same behaviour at > the 10 K rows level, it is just magnified when you can stress the system as > I did. > > Sincerely, > > > Anthony Thomas > > > -- > > "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message > news:3mg241phmksolj3da4it1vf4g380pvu5dg@4ax.com... > On Wed, 23 Mar 2005 02:01:00 -0600, Anthony Thomas wrote: > > (snip) > >2. A Cluster Index on the surrogate key WILL NOT make the join process > >faster. Think about it. If you have to process a MERGE or INNER LOOP > join, > >with either a prefetch or not, it will have to do a matched seek. Both a > >clustered index and a non-clustered index would be sufficient for the > >singleton select lookups. Your suggestion that IDENTITY CLUSTERED INDEXES > >are more efficient JOIN candidates is hyped but not supported by evidence. > > Hi Anthony, > > I've thought about it. > > If the (expected) number of rows is small, a clustered index would > typically require a clustered index seek; a nonclustered index would > require an index seek followed by a bookmark lookup. Doubles the number > of logical reads and increases the number of physical reaads. > > If the (expected) number of rows is large, a clustered index would give > the optimizer the option to use a merge join; I don't see that as a > viable option in the nonclustered case, unless the index is covering > (merge join the index, then do 100,000 bookmark lookups? sort the table, > then merge join, bypassing the index?) > > > (snip) > >4. This is a matter of "default." The very Q & A Kimberly wrote says this > >explicitly, the very link you provided. And, it is what vendor after > vendor > >after vendor is shoving into the market place WITHOUT giving it a second > >thought much less thought per individual table. Plus, the tool, SQL > Server, > >does this by default if you do not have the foresight to tell it otherwise. > > I think we'll both agree that defaults are fine if you just need to make > a simple DB and don't want nor need to spend time tuning. If you're > taking development seriously and you're producing a DB that is of any > serious significance to the business, you'll take none of the defaults > for granted. This applies to clustered indexes just as it doess to > configuration options, recovery models, security settings, maintenance > plans, etc. > > For the purpose of a simple DB with no need for advanced tuning, the SQL > Server default is (IMO) fine as it is - all tables (should) have a > primary key, not all have a unique constraint or other supporting index, > and since most tables perform better with clustered key than as a heap, > it makes sense to default the PK to clustered. > > Vendors that fail to finetune their product is a royal pain - but not > really related to this discussion. > > > (snip) > >Now, I would not claim that these tests are scientific research, but they > >did give me better insight in to what I had originaly had considered, but > >nothing would suggest that I detract from my initial statement, that as a > >matter of default, which is what most solutions that are present to me are, > >the Business Key is still the best candidate for the Clustered Index, > which, > >as the DBA, is one of the few, but most effective influence I can apply to > >an already designed solution. > > The tests are good, but they only test typical one-table searches. Most > queries in a typical database use joins or subqueries to combine data > from two or more tables. I don't have the time to run extensive tests > like you did, but if you have, you might try to add a few tests. Some > typical queries you might encounter in an ordering system and that might > be interesting to test are: > > SELECT c.CustName, c.CustAddr, o.OrdDate, o.HandledBy, > p.ProdName, p.ListPrice, od.Qty > FROM Orders AS o > INNER JOIN Customers AS c > ON c.CustID = o.CustID > INNER JOIN OrderDetails AS od > ON od.OrderID = o.OrderID > INNER JOIN Products AS p > ON p.ProdID = od.ProdID > WHERE c.CustName = 'Nanotech Inc.' > > or > > WHERE o.OrdDate BETWEEN '20050321' AND '20050323' > > or > > WHERE p.ProdName = '17" TFT monitor' > > This assumes surrogate keys on c.CustID, o.OrderID, p.ProdID (no > surrogate key for od), and natural keys on c.CustName, p.ProdName, > (o.CustID + o.OrdDate), (od.OrderID + od.ProdID). My gut feeling is that > clustering the surrogate key will be faster than clustering the natural > key. If you remove c.CustAddr, o.HandledBy and p.ListPrice, clustering > on the natural key will win (as only the covering index needs to be > used). > > Of course, a complete test would also involve testing with EXISTS / NOT > EXISTS subqueries, with GROUP BY and aggregates, etc. But I think the > above woould be a good starting point. :-) > > Best, Hugo > -- > > (Remove _NO_ and _SPAM_ to get my e-mail address) > > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message I posted it as an attachment; maybe it didn't work for some reason. You cannews:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl... > > I've looked back through the thread and although you mention your script in > one of the posts you did not include it, or I can't get access to it through > my new reader. Could you post yours again? download it here instead: http://www.datamanipulation.net/sequential_tests.zip Thanks. I'll start looking into it.
By the way, the preliminary results on the 10 million rows test are looking bleek. Same results so far. The SHOWCONTIG is reporting the same scanned extents as switches, but then reports anywhere from 50% to 98% Extent Fragmentation, both at the report level and the tableresults at each of the node levels. I do not think this is a matter of installation, given similar results across several different hosts, but, perhaps, the method of my tests. Any ideas would be appreciated. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message I posted it as an attachment; maybe it didn't work for some reason. You cannews:eEwgMwHMFHA.1172@TK2MSFTNGP12.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl... > > I've looked back through the thread and although you mention your script in > one of the posts you did not include it, or I can't get access to it through > my new reader. Could you post yours again? download it here instead: http://www.datamanipulation.net/sequential_tests.zip "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message This is the identity-based test I posted that you're referring to? Ornews:umohbRIMFHA.3296@TK2MSFTNGP15.phx.gbl... > > By the way, the preliminary results on the 10 million rows test are looking > bleek. Same results so far. The SHOWCONTIG is reporting the same scanned > extents as switches, but then reports anywhere from 50% to 98% Extent > Fragmentation, both at the report level and the tableresults at each of the > node levels. one of the tests you posted? Show quoteHide quote > > I do not think this is a matter of installation, given similar results > across several different hosts, but, perhaps, the method of my tests. > > Any ideas would be appreciated. > > Sincerely, > > > Anthony Thomas > > > -- > > "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message > news:eEwgMwHMFHA.1172@TK2MSFTNGP12.phx.gbl... > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message > news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl... > > > > I've looked back through the thread and although you mention your script > in > > one of the posts you did not include it, or I can't get access to it > through > > my new reader. Could you post yours again? > > > I posted it as an attachment; maybe it didn't work for some reason. You can > download it here instead: > > http://www.datamanipulation.net/sequential_tests.zip > > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > A replay of one of my tests, but on a diffenent server, entirely, and at the
10 M level. I just got yours, but as soon as this round of my earlier results are completed, I'm going to be running your's on the same hardware. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message This is the identity-based test I posted that you're referring to? Ornews:uHqteaIMFHA.1308@tk2msftngp13.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:umohbRIMFHA.3296@TK2MSFTNGP15.phx.gbl... > > By the way, the preliminary results on the 10 million rows test are looking > bleek. Same results so far. The SHOWCONTIG is reporting the same scanned > extents as switches, but then reports anywhere from 50% to 98% Extent > Fragmentation, both at the report level and the tableresults at each of the > node levels. one of the tests you posted? Show quoteHide quote > > I do not think this is a matter of installation, given similar results > across several different hosts, but, perhaps, the method of my tests. > > Any ideas would be appreciated. > > Sincerely, > > > Anthony Thomas > > > -- > > "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message > news:eEwgMwHMFHA.1172@TK2MSFTNGP12.phx.gbl... > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message > news:%23sxc0REMFHA.3500@TK2MSFTNGP14.phx.gbl... > > > > I've looked back through the thread and although you mention your script > in > > one of the posts you did not include it, or I can't get access to it > through > > my new reader. Could you post yours again? > > > I posted it as an attachment; maybe it didn't work for some reason. You can > download it here instead: > > http://www.datamanipulation.net/sequential_tests.zip > > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Please post a script similar to mine for your test -- it should create a newnews:ON9mnmIMFHA.3548@TK2MSFTNGP10.phx.gbl... > A replay of one of my tests, but on a diffenent server, entirely, and at the > 10 M level. database, go into the database, and run the full test. That way we can test the exact same things on our respective hardware. Whew! This has been a long time in coming, but I think it will be well
worth it to everyone who has been following the discussion. My apologies for the delay, but had to do quite a bit of research to get a handle on what is going on—and I still haven't gotten all my questions answered—but here it goes regardless, not to mention some real work now and again at my job. Not to be an SOB but, Adam, Kimberly, and anyone else that has been touting this PK IDENTITY CLUSTERED index nonsense are WRONG! How's that for an opener? Ok, now seriously, I don't really intend to be that rude, but I have been inundated with this kind of horse pucky for so long and having this same argument with vendors routinely, that it is nice to have gotten some cooperation in getting some real proof to shove back in everyone's faces. Since this has been a very long thread, let me try to summarize what the extent of these tests. There was a general question regarding the use of various data types for the PK, specifically GUIDs. My reply was generalistic, but in essence, a better recommendation, from my point of view, than what has been pushed around to date. I recommended to use the Business Key as the clustered index. That a surrogate key was fine for the PK definition, but not the clustered index...always. It was the always part that prompted further response. Two examples were suggested as plausible scenarios for when one might want to choose an IDENTITY as the clustered index: for INSERTS in order to reduce the number of page splitting, and for multi-table JOIN conditions. As the JOIN scenario was suggested later in the thread and a much more complicated condition to test, the nature of this investigation was limited to the issue of page splits. Furthermore, although there might be some disagreement, I believe there is general acceptance that the Business Key provides the best clustered index candidate base solely on its usage in SELECT Queries. Although I have evidence from these test to support this claim, I would rather focus on the discussion of page splits in this response. Basically, I ran tests and showed that regardless of which attribute was chosen as the clustered index, they all showed evidence of excessive page splitting, but that the Business Key candidate provided better querying performance. Adam ran tests that showed that a reduction of page splitting provided better INSERT performance when using the IDENTITY. In order to rule out hardware differneces, Adam graciously provided me his scripts to run. And the results are...if I run Adam's scripts, I get similar results to what he claims. Ok, before you run off and claim that I am full of horse pucky, I need to provide the results of the last several weeks worth of my subsequent tests that explains these differences. As this forum is lousy as a medium for presenting results, I will digest them here and present validation on subsequent responses, as warranted. 1. I was surprised at how similar our two test scripts turned out to be. We both used very similar mechanisms for generating 10 million row INSERT simulations; however, we did differ in the base table configuration and attribute definitions. It is these differences that explains quite a bit of the interesting results of my subsequent tests. 2. The claim was that an IDENTITY as clustered index provided better performance than a Business Key as the clustered index. The problem was that Adam's counter-test was not a KEY; it was not an IDENTITY, and it was not unique, far from it. I used a VARCHAR(30) attibute with a random leading character, with a random length, up to 30 characters, while he used another INT, just not an IDENTITY, but randomized. The problem is that he only used a random INT between 0 and 300,000. But this was a test for 10 million rows inserted. So, this attribute as clustered can not possibly come close to uniqueness, 300,000/10,000,000. Now, even Kimberly Tripp's site that Adam used as reference for his claims recommends that the clustered index be nearly unique, if not unique, exclusively. However, by using IDENTITY, this is assured; whereas, the attribute Adam used in his counter example only had a cardinality of 3% at best. My tests claimed that a Business Key, a Key, which is unique by definition, just as an IDENTITY would be, provides just as good INSERT performance as an IDENTITY. In Adam's tests, the IDENTITY performed about 2.5 to 5 times faster than the randomized, but not selective, INT. My unique, VARCHAR KEY performed just as well, within 15 minutes as fast as the use of the IDENTITY. I even went as far as testing a DATETIME default attirbute clustered index, which too performed nearly as well, which was NOT unique, but still highly selective, about 30% to 50% unique, but also step-wise increasing. So, my tests showed high fragmentation, but also showed that the IDENTITY, the unique Business Key VARCHAR(30), and a DATETIME default all performed similarly to 10 million inserts when each was defined as the clustered index. Only Adam's non-unique, non-selective INT performed badly. So, don't use a non-selective candidate as your clustered index if you are worried about INSERT performance. However, the original claim was that a Business Key would prove better than an IDENTITY and my tests prove this. I went as far as keeping Adam's non-ident test, but made the clustered index INT attribute random but unique by replacing the following code in his script: insert somestuff(custid, somedata) values(convert(int, rand() * 300000) ,replicate('0', convert(int, rand() * 400))) With the following: insert somestuff(custid, somedata) values(convert(int, rand() * 213.74836470) * 10000000 + @i ,replicate('0', convert(int, rand() * 400)) ) The new unique, but random, INT performed just as well as the clustered index as the IDENTITY did. So, it is the selectivity of the attribute that controls the performance, not the page splits. Also, since I was getting about 0.45 millisecond response per INSERT and since SQL Server's DATETIME resolution is only 3.33 milliseconds, I was getting about 14% unique values; however, the GETDATE() functions values, which I used a a default, are still increasing. So, this attribute too performed well as the clustered index, as far as INSERTs go. The query performance given this clustered index was mixed depending on the query types, which was the other half of the examination and justification for recommending choosing the Business Key as the clustered index. Two more points I'd like to discuss. First, even though Adam's ident test produced a DBCC SHOWCONTIG result that reported only a 9% Extent Fragmentation value, when you ran DBCC SHOWCONTIG WITH TABLERESULTS, you see that the nodes were 99% Extent fragmented. The next point concerns the Extent Fragmentation on my tests. In my table definition, I had a generic description attribute that was populated with both random characters and random length. Adam's also contained a description attribute, but was populated with fixed '0' but random length. Moreover, my attribute was only VARCHAR(50) while Adam's was VARCHAR(400). My clustered IDENTITY test produced 50% Table level Extent Fragmention, Adam's only 9%, also at the table level; however, after modifying my description to still be randomized characters, but allowed to reach a randomized legth to match Adam's 400 characters, my Table level Extent Fragmentation droped to only 17%. My guess is that if I also used a fixed '0' character, my fragmention would also have drop to around the 9% mark. DBCC SHOWCONTIG scanning 'MyTableCIndexTest' table... Table: 'MyTableCIndexTest' (469576711); index ID: 1, database ID: 5 TABLE level scan performed. - Pages Scanned................................: 319688 - Extents Scanned..............................: 40096 - Extent Switches..............................: 40095 - Avg. Pages per Extent........................: 8.0 - Scan Density [Best Count:Actual Count].......: 99.66% [39961:40096] - Logical Scan Fragmentation ..................: 0.00% - Extent Scan Fragmentation ...................: 17.13% - Avg. Bytes Free per Page.....................: 150.2 - Avg. Page Density (full).....................: 98.14% I hope these results have proven as interesting to you as they did to me. Therefore, I still claim that the Business Key is the best candidate for the clustered index...always. Except, now I have to construct the multi-table test so I can prove Hugo wrong as well. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message Please post a script similar to mine for your test -- it should create a newnews:eseabqIMFHA.4092@tk2msftngp13.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:ON9mnmIMFHA.3548@TK2MSFTNGP10.phx.gbl... > A replay of one of my tests, but on a diffenent server, entirely, and at the > 10 M level. database, go into the database, and run the full test. That way we can test the exact same things on our respective hardware. "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message I should have explained better when I sent the script -- it is intendednews:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl... > > I used a VARCHAR(30) attibute with a random leading character, with a random > length, up to 30 characters, while he used another INT, just not an > IDENTITY, but randomized. The problem is that he only used a random INT > between 0 and 300,000. But this was a test for 10 million rows inserted. > So, this attribute as clustered can not possibly come close to uniqueness, > 300,000/10,000,000. to simulate a Web site with repeat visitors. Assuming 300,000 customers, a lot of them coming back (somewhat randomly, as customers do) to place new orders. I'm not sure what kind of unique business key you could use in that case, at least not one that isn't absolutely gigantic! The question was, would clustering on the CustomerId be better for performance -- given that the majority of queries against the table are based on CustomerId (customers asking for their own orders / order history), this was quite a valid question. The answer was that selecting from the table with this cluster was definitely faster -- but not enough to balance out the insert cost. And I think you'll find in that case that defining a cluster based on a "business key" will be expensive enough to far outweigh any perceived benefit. What columns would make up this business key? And how large will it end up being? Do you really want to add 20+ bytes (just an estimate; I haven't actually figured out how to generate a good key for that situation) to every row of your non-clustered indexes? Again, I think you need to broaden your outlook a bit. No solution is going to be right every time. Stop trying to find a formula ("IDENTITY is always good" vs. "IDENTITY is always bad"). Every system is different, and if you try to find a magic one-size-fits-all solution you'll very quickly find situations in which it doesn't apply. That's why the best answer that's given to most of the questions in these forums is, "it depends" -- which is a superset of the answer that's usually given, "test it on your end!" If you can figure out problem classes in which IDENTITY does not apply, that's great. Sounds like a good article for SQL Server Magazine or SQL Server Professional. But don't assume that your answer will hold true in all environments or for all business problems. No, no, thank's for the additional info. However, I realize NO one size
fits all solution exists and each entity within the context of the Business Model needs to be evaluated, independently and as a member of the overall design. But, and this is a big but, Kimberly's site, as well as others, have attempted to suggest that having a surrogate IDENTITY PK and making it clustered, although not in all, but MOST situations is superior. I have just proven them wrong. In your specific case, I understand what you are saying, but please hear me. I said the Business Key, that is, THE KEY that makes the Entity at least 1NF (first normal form). If you don't have one, then you are not even close to an RDBMS solution. Furthermore, in your specific case, that IDENTITY IS NOT A SURROGATE. This situation is not necessarily unique, but given what you are doing, you do not want to define your table by anything other than a unique instance of some event, a session in this case. It would make complete sense in this situation to have duplicate records based on the CustomerID. It is the session information that makes this entity unique, which is typically described by just such an IDENTITY. But, in this case the IDENTITY IS THE BUSINESS KEY. However, in this discussion and subsequent testing, I have also proven that a not necessisarily unique but piecewise increasing attribute, such as SessionLoginDateTime or SessionStartDateTime as a DATETIME data type, provides JUST AS GOOD INSERT performance as the IDENTITY when it is defined as the Clustered Index. So, with these two choices, IDENTITY, which no one would query, or filter by, often, versus a DATETIME field, which would be used quite often, by both singleton selects as well as range queries, would still be a better Candidate Attribute for the Clustered Index than the IDENTITY. I have also measured the size impact using this definition versus an Identity and it doesn't even come close to the problems associated with another "default," the use of UNICODE. So, the sizing of the indexes because of this choice of Clustered Index is trival in comparison. As a practioner, you must understand that most vended solutions hosted on SQL Server ASSUME the defaults far too often. Moreover, sites such as Kimberly's that actually reccomend using such a default, IDENTITY surrogates as PK and Clustered Index, also drive such horrid solutions. My hyperbole was meant to counteract such lame implementations. I think we both agree that a thoughtful design provides the best solution; however, given we rarely get to work with such designs, if you take my advice over what is being touted as the alternative, both your SELECT and CRUD operations will tend be more productive versus just the CRUD operations. One final comment, I do not disagree with the use of Surrogates; however, the definition itself requires that there actually already preexist an Alternate KEY; otherwise, what are you surrogating. This Alternate Key should be protected in order to safegaurd the data integrity through the use of a Unique Contraint. I am suggesting that declaring CONSTRAINT name UNIQUE CLUSTERED along with INT IDENTITY(whatever seed, whatever increment) PRIMARY KEY NONCLUSTERED is a better physical design than the PRIMARY KEY CLUSTERED solution alone and/or alternative. This situation should not be taken as lightly as you seem to be taking it. And it shouldn't be brushed away as simple hyperbole. This is the number one performance, locking, blocking, and deadlock issue that we have accross our more than 200 databases, 30 servers, 5 clustered installations, and this one generic, holistic, one size fits all solution is making a marked difference in the performance of our systems, unanimously. I do appreciate your engagement in this excerise and discussion. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message I should have explained better when I sent the script -- it is intendednews:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl... > > I used a VARCHAR(30) attibute with a random leading character, with a random > length, up to 30 characters, while he used another INT, just not an > IDENTITY, but randomized. The problem is that he only used a random INT > between 0 and 300,000. But this was a test for 10 million rows inserted. > So, this attribute as clustered can not possibly come close to uniqueness, > 300,000/10,000,000. to simulate a Web site with repeat visitors. Assuming 300,000 customers, a lot of them coming back (somewhat randomly, as customers do) to place new orders. I'm not sure what kind of unique business key you could use in that case, at least not one that isn't absolutely gigantic! The question was, would clustering on the CustomerId be better for performance -- given that the majority of queries against the table are based on CustomerId (customers asking for their own orders / order history), this was quite a valid question. The answer was that selecting from the table with this cluster was definitely faster -- but not enough to balance out the insert cost. And I think you'll find in that case that defining a cluster based on a "business key" will be expensive enough to far outweigh any perceived benefit. What columns would make up this business key? And how large will it end up being? Do you really want to add 20+ bytes (just an estimate; I haven't actually figured out how to generate a good key for that situation) to every row of your non-clustered indexes? Again, I think you need to broaden your outlook a bit. No solution is going to be right every time. Stop trying to find a formula ("IDENTITY is always good" vs. "IDENTITY is always bad"). Every system is different, and if you try to find a magic one-size-fits-all solution you'll very quickly find situations in which it doesn't apply. That's why the best answer that's given to most of the questions in these forums is, "it depends" -- which is a superset of the answer that's usually given, "test it on your end!" If you can figure out problem classes in which IDENTITY does not apply, that's great. Sounds like a good article for SQL Server Magazine or SQL Server Professional. But don't assume that your answer will hold true in all environments or for all business problems. As another thought in regards to my other response I just made, perhaps the
IDENTITY is NOT the Business Key as I had first assumed with regards to your session table. Perhaps the composite DATETIME and CustomerID is the Business Key, and, again, we are still talking about surrogate versus Business Key as the proper choice for the Clustered Index. Certainly, this Alternate Key must at least be a Business Key candidate and would be worth examining. I realize that everyone is probably getting bored with such an academic discussion, but you are helping me drive the questions that are helping me to justify my position to all of the vendors I argue with on a daily basis over this topic. It has occurred to me that, in addition to Hugo's suggestion that a PK IDENTITY clustered index may be preferrable for complex, multi-table, queries, I have yet to test the performance of a composite clustered index. If anyone is interested, I will post the results here as the others--shortly, I hope. Sincerely, Anthony Thomas -- "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message I should have explained better when I sent the script -- it is intendednews:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl... "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl... > > I used a VARCHAR(30) attibute with a random leading character, with a random > length, up to 30 characters, while he used another INT, just not an > IDENTITY, but randomized. The problem is that he only used a random INT > between 0 and 300,000. But this was a test for 10 million rows inserted. > So, this attribute as clustered can not possibly come close to uniqueness, > 300,000/10,000,000. to simulate a Web site with repeat visitors. Assuming 300,000 customers, a lot of them coming back (somewhat randomly, as customers do) to place new orders. I'm not sure what kind of unique business key you could use in that case, at least not one that isn't absolutely gigantic! The question was, would clustering on the CustomerId be better for performance -- given that the majority of queries against the table are based on CustomerId (customers asking for their own orders / order history), this was quite a valid question. The answer was that selecting from the table with this cluster was definitely faster -- but not enough to balance out the insert cost. And I think you'll find in that case that defining a cluster based on a "business key" will be expensive enough to far outweigh any perceived benefit. What columns would make up this business key? And how large will it end up being? Do you really want to add 20+ bytes (just an estimate; I haven't actually figured out how to generate a good key for that situation) to every row of your non-clustered indexes? Again, I think you need to broaden your outlook a bit. No solution is going to be right every time. Stop trying to find a formula ("IDENTITY is always good" vs. "IDENTITY is always bad"). Every system is different, and if you try to find a magic one-size-fits-all solution you'll very quickly find situations in which it doesn't apply. That's why the best answer that's given to most of the questions in these forums is, "it depends" -- which is a superset of the answer that's usually given, "test it on your end!" If you can figure out problem classes in which IDENTITY does not apply, that's great. Sounds like a good article for SQL Server Magazine or SQL Server Professional. But don't assume that your answer will hold true in all environments or for all business problems. Hi Antony,
I regularly read your posts, and find them very educative. But in this case I think you are wrong. I will be very short in explaining my point of view. You talk about surrogate versus business keys like candidates for clustered index. If you have good database design (3NF), you will have tables with business keys only. Tables with surrogate keys are (almost without exceptions) result of normalization, so when you have surrogate key that key is result of normalization to 2NF so you will not have good alternate business key in the table with surrogate key. So If I have to chose Surrogate versus business keys is like choosing between normalization and denormalization. Tests will not prove nothing since sometimes we prefer denormalization for better performing queries. But that is only my point of view. Regards, Daniel Show quoteHide quote "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:#y4PJPsPFHA.2604@TK2MSFTNGP10.phx.gbl... > As another thought in regards to my other response I just made, perhaps the > IDENTITY is NOT the Business Key as I had first assumed with regards to your > session table. Perhaps the composite DATETIME and CustomerID is the > Business Key, and, again, we are still talking about surrogate versus > Business Key as the proper choice for the Clustered Index. Certainly, this > Alternate Key must at least be a Business Key candidate and would be worth > examining. > > I realize that everyone is probably getting bored with such an academic > discussion, but you are helping me drive the questions that are helping me > to justify my position to all of the vendors I argue with on a daily basis > over this topic. > > It has occurred to me that, in addition to Hugo's suggestion that a PK > IDENTITY clustered index may be preferrable for complex, multi-table, > queries, I have yet to test the performance of a composite clustered index. > > If anyone is interested, I will post the results here as the > others--shortly, I hope. > > Sincerely, > > > Anthony Thomas > > > -- > > "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message > news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl... > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message > news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl... > > > > I used a VARCHAR(30) attibute with a random leading character, with a > random > > length, up to 30 characters, while he used another INT, just not an > > IDENTITY, but randomized. The problem is that he only used a random INT > > between 0 and 300,000. But this was a test for 10 million rows inserted. > > So, this attribute as clustered can not possibly come close to uniqueness, > > 300,000/10,000,000. > > I should have explained better when I sent the script -- it is intended > to simulate a Web site with repeat visitors. Assuming 300,000 customers, a > lot of them coming back (somewhat randomly, as customers do) to place new > orders. I'm not sure what kind of unique business key you could use in that > case, at least not one that isn't absolutely gigantic! The question was, > would clustering on the CustomerId be better for performance -- given that > the majority of queries against the table are based on CustomerId (customers > asking for their own orders / order history), this was quite a valid > question. The answer was that selecting from the table with this cluster > was definitely faster -- but not enough to balance out the insert cost. And > I think you'll find in that case that defining a cluster based on a > "business key" will be expensive enough to far outweigh any perceived > benefit. What columns would make up this business key? And how large will > it end up being? Do you really want to add 20+ bytes (just an estimate; I > haven't actually figured out how to generate a good key for that situation) > to every row of your non-clustered indexes? > > Again, I think you need to broaden your outlook a bit. No solution is > going to be right every time. Stop trying to find a formula ("IDENTITY is > always good" vs. "IDENTITY is always bad"). Every system is different, and > if you try to find a magic one-size-fits-all solution you'll very quickly > find situations in which it doesn't apply. That's why the best answer > that's given to most of the questions in these forums is, "it depends" -- > which is a superset of the answer that's usually given, "test it on your > end!" If you can figure out problem classes in which IDENTITY does not > apply, that's great. Sounds like a good article for SQL Server Magazine or > SQL Server Professional. But don't assume that your answer will hold true > in all environments or for all business problems. > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > Normalization to 2NF only creates new etities with FK relations. The FK
uses and refers back to the parent surrogate; however, the new entitiy itself still needs to be in 1NF, which means it must define a KEY. One could use yet another surrogate, a new one, to use on this new table, but it still needs the independent key for which you are surrogating. If this is not the case, then you probably goofed your normalization. For 1 to 1 relations, not all, but some, are called identifying relationships. In this case, I could see where the surrogate key of the parent becomes the business key on the child. But then, as I keep arguing with Adam, that IDENTITY is the business key and you wouldn't have too many other choices for the clustered index. I appreciate your comments, but you should go back to your normalization and review. Rule #1: every entity has a KEY and this exists BEFORE the creation of any surrogates. Sincerely, Anthony Thomas -- "Daniel Joskovski" <omnis@NOSPAMunetREMOVECAPS.com.mk> wrote in message I regularly read your posts, and find them very educative.news:ewxmM2tPFHA.2460@TK2MSFTNGP10.phx.gbl... Hi Antony, But in this case I think you are wrong. I will be very short in explaining my point of view. You talk about surrogate versus business keys like candidates for clustered index. If you have good database design (3NF), you will have tables with business keys only. Tables with surrogate keys are (almost without exceptions) result of normalization, so when you have surrogate key that key is result of normalization to 2NF so you will not have good alternate business key in the table with surrogate key. So If I have to chose Surrogate versus business keys is like choosing between normalization and denormalization. Tests will not prove nothing since sometimes we prefer denormalization for better performing queries. But that is only my point of view. Regards, Daniel Show quoteHide quote "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:#y4PJPsPFHA.2604@TK2MSFTNGP10.phx.gbl... > As another thought in regards to my other response I just made, perhaps the > IDENTITY is NOT the Business Key as I had first assumed with regards to your > session table. Perhaps the composite DATETIME and CustomerID is the > Business Key, and, again, we are still talking about surrogate versus > Business Key as the proper choice for the Clustered Index. Certainly, this > Alternate Key must at least be a Business Key candidate and would be worth > examining. > > I realize that everyone is probably getting bored with such an academic > discussion, but you are helping me drive the questions that are helping me > to justify my position to all of the vendors I argue with on a daily basis > over this topic. > > It has occurred to me that, in addition to Hugo's suggestion that a PK > IDENTITY clustered index may be preferrable for complex, multi-table, > queries, I have yet to test the performance of a composite clustered index. > > If anyone is interested, I will post the results here as the > others--shortly, I hope. > > Sincerely, > > > Anthony Thomas > > > -- > > "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message > news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl... > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message > news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl... > > > > I used a VARCHAR(30) attibute with a random leading character, with a > random > > length, up to 30 characters, while he used another INT, just not an > > IDENTITY, but randomized. The problem is that he only used a random INT > > between 0 and 300,000. But this was a test for 10 million rows inserted. > > So, this attribute as clustered can not possibly come close to uniqueness, > > 300,000/10,000,000. > > I should have explained better when I sent the script -- it is intended > to simulate a Web site with repeat visitors. Assuming 300,000 customers, a > lot of them coming back (somewhat randomly, as customers do) to place new > orders. I'm not sure what kind of unique business key you could use in that > case, at least not one that isn't absolutely gigantic! The question was, > would clustering on the CustomerId be better for performance -- given that > the majority of queries against the table are based on CustomerId (customers > asking for their own orders / order history), this was quite a valid > question. The answer was that selecting from the table with this cluster > was definitely faster -- but not enough to balance out the insert cost. And > I think you'll find in that case that defining a cluster based on a > "business key" will be expensive enough to far outweigh any perceived > benefit. What columns would make up this business key? And how large will > it end up being? Do you really want to add 20+ bytes (just an estimate; I > haven't actually figured out how to generate a good key for that situation) > to every row of your non-clustered indexes? > > Again, I think you need to broaden your outlook a bit. No solution is > going to be right every time. Stop trying to find a formula ("IDENTITY is > always good" vs. "IDENTITY is always bad"). Every system is different, and > if you try to find a magic one-size-fits-all solution you'll very quickly > find situations in which it doesn't apply. That's why the best answer > that's given to most of the questions in these forums is, "it depends" -- > which is a superset of the answer that's usually given, "test it on your > end!" If you can figure out problem classes in which IDENTITY does not > apply, that's great. Sounds like a good article for SQL Server Magazine or > SQL Server Professional. But don't assume that your answer will hold true > in all environments or for all business problems. > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > Well, I just finished running the Composite Index tests and here are the
results. I used Adam's original NOIDENT script but added a DATETIME attribute with a DEFUALT (GETDATE()) constraint. I then left the NONCLUSTERED INDEX on the IDENTITY and another on the CustID. I then created a CLUSTERED INDEX on the composite somedate and custid attributes. I also left Adam original logic for loading the 10 million rows where he randomized custid with only 300,000 possible values, which in this many inserts, only produced 3% unique rows based solely on custid. However, as SQL Server's resolution is about 3.33 ms and I achieved about 2 records inserted per 1 ms, the somedate DATETIME attribute produced about 14% unique on this attribute, but it is piecewise increasing. The results were similar to the strict datetime clustered index results I ran earlier in my own tests at the total run time was nearly the same as the IDENT test provided by Adam. The fragmentation was a little higher, 0% Logical Fragmentation and 10% Table Level Extent Fragmentation versus the 9% for the IDENTITY Clustered. However, as with ALL the other tests, Adams or mine, the NODE level Extent Fragementation ranged from 50% up to 99%, but this was the same when using the IDENTITY as the clustered index. So, the point is that even the composite (DATETIME, INTEGER) index performed as well as the IDENTITY. Let's be very clear here. Everyone has agreed so far that having a Business Key, or candidate, as the clustered index provides better SELECT performance on single table queries. What we have been investigating is the INSERT performance of having the IDENTITY clustered as an alternative with the thought that reducing page splits would speed up operations on a system that primarily performs this activity. The results are, for single table manipulation, INSERT or SELECT, having a Business Key or cadidate is better, or as good, as using the surrogate IDENTITY attribute. So, I go back to my original statement, WHEN IS IT BETTER to choose the IDENTITY as the clustered index? I claim that it is NEVER. Hugo, however, has also suggested that I need to examine more complicated operations due to potential joining benefits by having the clustered index on the IDENTITY. I disagree, but will abstain until I perform those tests as well. Any other thoughts? Sincerely, Anthony Thomas -- "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message Normalization to 2NF only creates new etities with FK relations. The FKnews:OXupmy0PFHA.1932@tk2msftngp13.phx.gbl... uses and refers back to the parent surrogate; however, the new entitiy itself still needs to be in 1NF, which means it must define a KEY. One could use yet another surrogate, a new one, to use on this new table, but it still needs the independent key for which you are surrogating. If this is not the case, then you probably goofed your normalization. For 1 to 1 relations, not all, but some, are called identifying relationships. In this case, I could see where the surrogate key of the parent becomes the business key on the child. But then, as I keep arguing with Adam, that IDENTITY is the business key and you wouldn't have too many other choices for the clustered index. I appreciate your comments, but you should go back to your normalization and review. Rule #1: every entity has a KEY and this exists BEFORE the creation of any surrogates. Sincerely, Anthony Thomas -- "Daniel Joskovski" <omnis@NOSPAMunetREMOVECAPS.com.mk> wrote in message I regularly read your posts, and find them very educative.news:ewxmM2tPFHA.2460@TK2MSFTNGP10.phx.gbl... Hi Antony, But in this case I think you are wrong. I will be very short in explaining my point of view. You talk about surrogate versus business keys like candidates for clustered index. If you have good database design (3NF), you will have tables with business keys only. Tables with surrogate keys are (almost without exceptions) result of normalization, so when you have surrogate key that key is result of normalization to 2NF so you will not have good alternate business key in the table with surrogate key. So If I have to chose Surrogate versus business keys is like choosing between normalization and denormalization. Tests will not prove nothing since sometimes we prefer denormalization for better performing queries. But that is only my point of view. Regards, Daniel Show quoteHide quote "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message news:#y4PJPsPFHA.2604@TK2MSFTNGP10.phx.gbl... > As another thought in regards to my other response I just made, perhaps the > IDENTITY is NOT the Business Key as I had first assumed with regards to your > session table. Perhaps the composite DATETIME and CustomerID is the > Business Key, and, again, we are still talking about surrogate versus > Business Key as the proper choice for the Clustered Index. Certainly, this > Alternate Key must at least be a Business Key candidate and would be worth > examining. > > I realize that everyone is probably getting bored with such an academic > discussion, but you are helping me drive the questions that are helping me > to justify my position to all of the vendors I argue with on a daily basis > over this topic. > > It has occurred to me that, in addition to Hugo's suggestion that a PK > IDENTITY clustered index may be preferrable for complex, multi-table, > queries, I have yet to test the performance of a composite clustered index. > > If anyone is interested, I will post the results here as the > others--shortly, I hope. > > Sincerely, > > > Anthony Thomas > > > -- > > "Adam Machanic" <amachanic@hotmail._removetoemail_.com> wrote in message > news:OaMg1XrPFHA.2348@tk2msftngp13.phx.gbl... > "Anthony Thomas" <ALTho***@kc.rr.com> wrote in message > news:eBKcSfmPFHA.1476@TK2MSFTNGP09.phx.gbl... > > > > I used a VARCHAR(30) attibute with a random leading character, with a > random > > length, up to 30 characters, while he used another INT, just not an > > IDENTITY, but randomized. The problem is that he only used a random INT > > between 0 and 300,000. But this was a test for 10 million rows inserted. > > So, this attribute as clustered can not possibly come close to uniqueness, > > 300,000/10,000,000. > > I should have explained better when I sent the script -- it is intended > to simulate a Web site with repeat visitors. Assuming 300,000 customers, a > lot of them coming back (somewhat randomly, as customers do) to place new > orders. I'm not sure what kind of unique business key you could use in that > case, at least not one that isn't absolutely gigantic! The question was, > would clustering on the CustomerId be better for performance -- given that > the majority of queries against the table are based on CustomerId (customers > asking for their own orders / order history), this was quite a valid > question. The answer was that selecting from the table with this cluster > was definitely faster -- but not enough to balance out the insert cost. And > I think you'll find in that case that defining a cluster based on a > "business key" will be expensive enough to far outweigh any perceived > benefit. What columns would make up this business key? And how large will > it end up being? Do you really want to add 20+ bytes (just an estimate; I > haven't actually figured out how to generate a good key for that situation) > to every row of your non-clustered indexes? > > Again, I think you need to broaden your outlook a bit. No solution is > going to be right every time. Stop trying to find a formula ("IDENTITY is > always good" vs. "IDENTITY is always bad"). Every system is different, and > if you try to find a magic one-size-fits-all solution you'll very quickly > find situations in which it doesn't apply. That's why the best answer > that's given to most of the questions in these forums is, "it depends" -- > which is a superset of the answer that's usually given, "test it on your > end!" If you can figure out problem classes in which IDENTITY does not > apply, that's great. Sounds like a good article for SQL Server Magazine or > SQL Server Professional. But don't assume that your answer will hold true > in all environments or for all business problems. > > > -- > Adam Machanic > SQL Server MVP > http://www.datamanipulation.net > -- > > On Wed, 23 Mar 2005 09:59:05 -0600, Anthony Thomas wrote:
>Thank you for your comments. Yes, I do agree that the test need to start Hi Anthony,>broading there scope into the addition, and more typical kinds of queries. > >The intial discussion however began with a question regarding the use of a >ROWGUID as a clustering index. I made an generalist statement that >surrogate keys make lousy cluster index candidates, which the response >initiated this thread. I think this thread is mainly a result of you posting a statement that was too generalist: >>> Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or In situations like this, there is no "ALWAYS", and hardly ever a>>> GUID. Usually, the Business Key the surrogate is proxiing for is a better >>> Cluster Index candidate. "Usually". I consider this quote (from your original post to this thread) bad advice - not because it's false, but because it gives the impression that there might be one single "always correct" solution, where there really isn't. > The proof offered was a matter of lessing splits. Actually, you didn't ask for a proof, but for one example. :-)Anyway, since I don't consider myself an expert in the field of mass inserts and their effect on page allocation, I've decided not to mingle in this part of your exchange with Adam. I do read it with great interest, though! (snip text about fragmentation) > You clearly haven't read many of my newsgroup posts. For if you had,>Now, about "defaults," I agree. Unfortuantely, we have the majority of our >systems vended, by BIG SOFTWARE companies, that take the advice and the >default behavior of the SQLEM tool and generate hundreds of tables, all with >surrogate PK IDENTITY clustered, as a matter of default, precisely because >the tool does it and people like Kimberly Tripp and Adam Mechanic, and, >perhaps, even you, espouse their use. you'd never have suggested that I, perhaps, might espouse the use of identity PKs, either clustered or nonclustered. In fact, I've posted many messages where I fulminate against the mindless slapping of an identity key on any table. Just to clarify things, my stance toward surrogate keys is: a) Always define the natural key during data modeling, before even starting to create tables. b) Usually, the natural key is the best candidate for primary key. c) In some cases, a surrogate key might be worth consideration: if the natural key is composite (2 columns is not a problem, 4 or more is a definite candidate for contemplating a surrogate key) or very long (usually varchar columns with names or descriptions), AND the table is being referred to in other tables, then you should consider a surrogate key. d) If a surrogate key is used, it should always be IN ADDITION TO the natural key - the natural key columns should still be included in the table and defined in a UNIQUE constraint. e) If a surrogate key is used, identity is usually a better choice than uniqueidentifier. f) The surrogate key should only be used internally and never exposed to the end user. They should use the natural key to identify instances of an entity. And in questions regarding clustered vs nonclustered index, my usual reply is that it depends and should be tested in the environment where it will run (or one as similar as possible). I do sometimes advise a specific choice if a specific query is included in the question, but never without the warning that any change to indexexs might impact performance of other parts of the application and should therefor be tested thoroughly. >The point of these was to attempt to give general advice that would prove I think that the only general advice that can safely be given in SQL>more beneficial than what has been typically given. And, then, proceeded to >prove my point. Server related issues is "it depends" or "you really should test all variants on your end and check the results". (snip text about fragmentation) Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) As to the first, you are correct on all counts.
As to the generalist advice, yep. It was hyperbole expressly because of articles like Kimberly's Q & A site that Adam hooked me up with that expressly gives as opinion that most tables should be defined with a surrogate Identity and that this should be the definition of the clustered key. I got exactly what I intended: to antagonize someone into a discussion on the topic just so this conversion could occur-I often find one gets engaged when irritated. So, now, we have quite a healthy thread occurring and some research expressed precisely because I was so beligerent. As for your design issues? RIGHT ON! Couldn't have said it better myself. Have been preaching this mantra for what seems like forever. One of these days, maybe some of our vendors will actually listen. Thanks for all of your feedback. Now, beligerence aside, I know that I am among like-minded folk. Sincerely, Anthony Thomas -- "Hugo Kornelis" <hugo@pe_NO_rFact.in_SPAM_fo> wrote in message On Wed, 23 Mar 2005 09:59:05 -0600, Anthony Thomas wrote:news:ilm341ds2bgl8235vi90mlgifsvvfn5384@4ax.com... >Thank you for your comments. Yes, I do agree that the test need to start Hi Anthony,>broading there scope into the addition, and more typical kinds of queries. > >The intial discussion however began with a question regarding the use of a >ROWGUID as a clustering index. I made an generalist statement that >surrogate keys make lousy cluster index candidates, which the response >initiated this thread. I think this thread is mainly a result of you posting a statement that was too generalist: >>> Cluster Index definition on surrogate keys are ALWAYS a bad idea, INT or In situations like this, there is no "ALWAYS", and hardly ever a>>> GUID. Usually, the Business Key the surrogate is proxiing for is a better >>> Cluster Index candidate. "Usually". I consider this quote (from your original post to this thread) bad advice - not because it's false, but because it gives the impression that there might be one single "always correct" solution, where there really isn't. > The proof offered was a matter of lessing splits. Actually, you didn't ask for a proof, but for one example. :-)Anyway, since I don't consider myself an expert in the field of mass inserts and their effect on page allocation, I've decided not to mingle in this part of your exchange with Adam. I do read it with great interest, though! (snip text about fragmentation) > You clearly haven't read many of my newsgroup posts. For if you had,>Now, about "defaults," I agree. Unfortuantely, we have the majority of our >systems vended, by BIG SOFTWARE companies, that take the advice and the >default behavior of the SQLEM tool and generate hundreds of tables, all with >surrogate PK IDENTITY clustered, as a matter of default, precisely because >the tool does it and people like Kimberly Tripp and Adam Mechanic, and, >perhaps, even you, espouse their use. you'd never have suggested that I, perhaps, might espouse the use of identity PKs, either clustered or nonclustered. In fact, I've posted many messages where I fulminate against the mindless slapping of an identity key on any table. Just to clarify things, my stance toward surrogate keys is: a) Always define the natural key during data modeling, before even starting to create tables. b) Usually, the natural key is the best candidate for primary key. c) In some cases, a surrogate key might be worth consideration: if the natural key is composite (2 columns is not a problem, 4 or more is a definite candidate for contemplating a surrogate key) or very long (usually varchar columns with names or descriptions), AND the table is being referred to in other tables, then you should consider a surrogate key. d) If a surrogate key is used, it should always be IN ADDITION TO the natural key - the natural key columns should still be included in the table and defined in a UNIQUE constraint. e) If a surrogate key is used, identity is usually a better choice than uniqueidentifier. f) The surrogate key should only be used internally and never exposed to the end user. They should use the natural key to identify instances of an entity. And in questions regarding clustered vs nonclustered index, my usual reply is that it depends and should be tested in the environment where it will run (or one as similar as possible). I do sometimes advise a specific choice if a specific query is included in the question, but never without the warning that any change to indexexs might impact performance of other parts of the application and should therefor be tested thoroughly. >The point of these was to attempt to give general advice that would prove I think that the only general advice that can safely be given in SQL>more beneficial than what has been typically given. And, then, proceeded to >prove my point. Server related issues is "it depends" or "you really should test all variants on your end and check the results". (snip text about fragmentation) Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) On Wed, 23 Mar 2005 22:56:47 -0600, Anthony Thomas wrote:
>As to the first, you are correct on all counts. Hi Anthony,> >As to the generalist advice, yep. It was hyperbole expressly because of >articles like Kimberly's Q & A site that Adam hooked me up with that >expressly gives as opinion that most tables should be defined with a >surrogate Identity and that this should be the definition of the clustered >key. > >I got exactly what I intended: to antagonize someone into a discussion on >the topic just so this conversion could occur-I often find one gets engaged >when irritated. And a truly interesting discussion it is! I do hope that you and Adam will continue it, in only to find out the cause of the differences between your test results. I'm intrigued! >As for your design issues? RIGHT ON! Couldn't have said it better myself. Great! The more this mantra is preached, the better.>Have been preaching this mantra for what seems like forever. (After reading a days worth newsgroup posts, I often find myself hoping that the typical design posted in the typical newsgroup question is not exemplary for the average computer professional in the world - but I also often find myself fearing that my hope is in vain...) Best, Hugo -- (Remove _NO_ and _SPAM_ to get my e-mail address) |
|||||||||||||||||||||||