DB2 LUW Performance: A Self Tuning Memory Manager Case Study


Posted by Scott on April 16, 2009, 11:58 pm
in DB2 Case Studies ( DB2 Performance)

I was recently contacted by a DBA at a large insurance company. After upgrading from DB2 V8.2 to 9.1.0.3, the DBA activated the Self Tuning Memory Manager (STMM) in the hope that it would solve all performance woes. To the contrary, performance got worse, and the DBA noted a marked increase in SORT overflows. He even opened a PMR with IBM. But this isn't a case of STMM gone crazy; something else is going on, but what? Let's take a look at the numbers and discover the truth.
I was recently contacted by a DBA at a large insurance company. After upgrading from DB2 V8.2 to 9.1.0.3, the DBA activated the Self Tuning Memory Manager (STMM) in the hope that it would solve all performance woes. To the contrary, performance got worse, and the DBA noted a marked increase in SORT overflows. He even opened a PMR with IBM. But this isn't a case of STMM gone crazy; something else is going on, but what? Let's take a look at the numbers and discover the truth.
Average Result Set Size (ARSS)

The ARSS for this database was 17.08 meaning that the typical SELECT statement returns seventeen rows. An ARSS > 10 typically indicates a data warehouse database.

I inquired with the DBA to see if, in his opinion, the database really was a data warehouse. He replied that the database was an OLTP database. Here is our first sign of trouble. We have an OLTP database that is returning unusually large result sets to the application - this often happens when application developers fail to let DB2 do the work. When applications pull large result sets from the database and do their own filtering and sorting, average result set sizes can be abnormally large.

Index Read Efficiency (IREF)

The IREF for this database was 37.17 meaning that DB2 picks up and evaluates 37+ rows to return just one row on average. This is indicative of scans, and scans in an OLTP database are bad (or evil as my daughter would say).

Synchronous Read Percentage (SRP)

The SRP for this database was an amazingly low 0.5%. This further indicates that there is a great deal of asynchronous prefetch I/O occuring ( 99.5% prefetch! ). When the SRP is low in an OLTP database, it means that indexes are missing or sub-optimally defined. Quite frankly, this is the worst SRP I've seen in an OLTP database in several years.

Number of Transactions Completed (TXCNT ) : 97,361,114 (this database is no slouch when it comes to trying to get work done! )

Selects per Transaction (SELTX)

The average transaction was running 1.04 Selects for each transaction. This number is on the unusually low side. Typical SELTX for OLTP range from 3-15. The value of 1.07 means the transactions are small in work scope - there is a commit after almost every SELECT.

DML per Transaction (DMLTX)

The average transaction was running 0.0356 DML per transaction. This is lower than the worldwide normal range of .5 to 4. Here 1 Select statement is being accompanied by a rare Insert/Update/Delete statement, on average. With very light DML activity like this, our tuning focus needs to be on data retrieval.

Sorts per Transaction (SRTTX)

On average, each transaction in this database performed 1.22 sorts. Here is another indicator of potential "opportunity for improvement" because, with 1 SELTX, this means that, on average, each SELECT is performing 1 to 2 sorts. It is more likely, however, that some statements don't perform any sorting and other statements are performing multiple or several sorts to retrieve required result sets.

SRTTX is an important cost metric because this metric can be used to verify tuning success.

Sort Overflows per Transaction

On average, each transaction performed 0.0013 sort overflows. The good news here is that very few of the sorts (1.22 per TX) are overflowing SORTHEAP memory. Still, SORT is a four-letter word and sorts are EVIL - they burn up CPU cycles even if performed in memory. This cost metric can be used to verify tuning success.

Rows Read per Transaction (DB-RRTX)
Rows Fetched per Transaction (DB-FETTX)

The average transaction was reading 663.72 rows to fetch just 17.86 rows on average. This is very inefficient as indicated by the IREF metric. CPU consumption on this machine is much higher than it needs to be.

Bufferpool Logical Reads per Transaction (BPLRTX)

The average transaction is performing 169.82 logical bufferpool page reads. As noted in an earlier blog post, this cost metric is one of the best cost measurements for verifying your tuning success. Bufferpool logical page reads equate in direct proportion to CPU time consumed.

Bufferpool Logical INDEX Reads per Transaction (BPLITX)

This database is performing 105.80 logical index page reads for every transaction. Since 105.80 is much greater than SELTX (1) x 4 x 1.5 (6), this indicates that there are a significant number of leaf page scans occurring (DB2 is using an index but MatchCols = 0). Leaf page scans are expensive and CPU costly. BPLITX provides another excellent cost measurement to verify your tuning success.

Overall Read Time (ms) (ORMS)

The average time to perform a physical read for this database was 1.38ms. This is good (but much higher than our previous case study), and perhaps one of this database's saving graces. However, 99.5% of the I/O is prefetch asynchronous I/O which tends to be much faster than synchronous I/O.

Overall Write Time (ms) (OWMS)

The average time to perform a physical write for this database was 0.44ms. This is very good. 98.23% of writes are being performed asynchronously. This is also good.

Average Sort ms (SRTMS)
Sort time per Transaction (SRTMSTX)
Sort Overflow Percentage (SRTOVFLOPCT)

The average time to complete a sort was 0.29ms and each transaction spent 0.349ms sorting. While these sorts are small and fast, they may be burning up CPU cycles unnecessarily if they can be eliminated. 0.001% percent of Sorts are overflowing the SORTHEAP. As a percentage, this is good - but, for the 127,302 transactions that experienced a sort overflow, there were probably some unhappy end users looking at an hour glass for too long.

Average Lock Wait Time (LCKMS) and Lock Time per Transaction (LCKMSTX)
Fighting Over Data - Lock Contention

The average lock duration was 537.6ms (Total Lock Wait Time divided by number of Lock Waits), and each transaction suffered 0.029ms of lock wait time (Total Lock Wait Time divided by number of Transactions). While this isn't bad, zero time spent waiting on locks is ideal. Further, there were 70 lock timeouts. I expect that some of this lock wait time and timeouts could be reduced if better indexes were available to reduce the number of scans. Locks are a symptom of poor performing SQL, not a problem.

The Catalog Cache Hit Ratio (CATHR)

The CATHR was 99.98%. This is excellent. The Catalog Cache is not an AUTOMATIC tuning (STMM) participant.

The Package Cache Hit Ratio (PKGHR)

The PKGHR was 99.9%. This is excellent. Too bad this database has so many problems with data and index scans.

Database Files Closed (DFC)

The database had closed 3,774 files. Closing files burns up CPU cycles and slows down SQL performance.

Database Bufferpool Index Hit Ratio (DB-BPIHR) :: Database Bufferpool Overall Hit Ratio (DB-BPOHR)

The bufferpool index hit ratio was 99.999% and the overall bufferpool hit ratio was 98.65%. While these hit ratios look good, remember that bufferpool hit ratios can be very misleading (give you a false sense of security and tuning success) when scans are occurring in the bufferpools. For this database, in memory data and index page scans are degrading performance and setting the CPUs on fire.

Asynchronous Pages Read per Request (APPR)

The APPR for this database was 9.99. APPR cannot exceed the EXTENTSIZE for tablespaces, and the average/typical EXTENTSIZE for tablespaces in this database was 32. Therefore, the value of 9.99 is NOT good and means that prefetch is working ineffectively. Worse yet, THE BAD NEWS is that this OLTP database is satisfying its read requests with asynchronous prefetch I/O 99.5% of the time!!!

Table Rows Read per Transaction (TBRRTX)

This database contains hundreds of tables. We did a quick scan of the Tables snapshot and found 5 tables having Rows Read per Transaction greater than 10. Some of the TBRRTX values exceeded 1,000 or more. This is further evidence that SCANS are prolific in this database.

Summary This database is suffering from several mosquito swarm infestations. The numbers indicate that several needed indexes are missing, and some of those that exist could be better defined. The absence of an optimal index physical design is contributing to excessive scans, leaf page scans, unnecessary sorts, and lock wait time. Based on evaluations of hundreds of databases from around the world, my instinct tells me that CPU utilization is about 70-90% higher on this machine than it needs to be. While response time performance is reported to be fairly good presently, it could be improved 10-25% or more by reducing CPU consumption with an improved physical design. Furthermore, if more application users are added, the abnormally high CPU consumption may cause this database to fail to scale.

STMM VERDICT: NOT GUILTY

For this database, STMM was dynamically shifting memory away from SORTHEAP and into the bufferpools to help address the prolific and costly scans. The memory removed from SORTHEAP caused SORT OVERFLOWS to increase and the DBA's phone to ring with calls from unhappy end users.

STMM was doing its job properly by doing it's best to compensate for physical design flaws. Despite the marketing hype, there is still no substitute for SQL Workload Analysis and proper physical design that addresses workload costs. STMM is NOT a little blue miracle pill, though it can help elevate and optimize performance in the presence of good physical design and overall database health.

Next Steps

The DBA needs to perform SQL Cost Aggregation to identify and isolate problematic costly statements that are performing the scans, then cure these opportunities for improvement with high quality indexes. Also, read this blog post about Index Cardinality.

I also advised the DBA to close his PMR with IBM. Under the duress of this database's highly flawed physical design, STMM was behaving consistently with expectations. As a temporary measure, until the massive scan problems could be cured, I further suggested that he remove sort memory from the scope of STMM's authority by setting it to a fixed quantity sufficiently large enough to abate the majority of sort overflows.

More Interesting Reading

The Shameless Marketing Moment

DBI customers typically enjoy 30-90% CPU savings and improved response times within just a few hours of using DBI's integrated performance management tools for DB2 LUW.

Brother-Panther Logo If you find all of the measurements and analysis above to be helpful, you don't have to crunch the numbers by hand. DBI's Brother-Panther provides industry leading, best of breed, statement workload analysis capabilities plus computes all of the metrics discussed in this blog post. You will quickly see the statements that have high aggregate costs, and you can pass individual statements or entire costly workloads to the IBM Design Advisor for rapid solutions. After you have implemented a physical design change with the intent of improving performance, Brother-Panther's Performance Trend Charts with integrated, correlated, and plotted change event histories make it easy to verify your performance improvement success AND track STMM change effectiveness. No tool on the market makes the full performance management life cycle of identify, isolate, cure, and verify easier, nor can any other tool deliver better performance results.

With kindest regards,
Scott

Scott Hayes
President & CEO, DBI
IBM GOLD Consultant
DBI is an IBM Advanced, Industry Optimized, Business Partner
DBI is an Oracle Technology Network Partner
DBI provides products and services to over 2,000 customers worldwide
Your Performance IS Our Business

DBI Logo

Post from : http://www.dbisoftware.com/blog/db2_performance.php
Printed from : http://www.dbisoftware.com/blog/db2_performance.php?id=125