
Performance Management of a Data Warehouse

Introduction

Data warehouse projects are popular within the business world today. Competitive advantages are maintained or gained through the strategic use of business information, which is analyzed to find ways to attract new customers and sell more products to existing customers. The benefits of this analysis have caused business executives to push for data warehouse technology, and expectations for these projects are high. This document examines the performance characteristics of a data warehouse and looks at how expectations for these projects can be set and managed. This paper focuses on the performance aspects of a warehouse running on an IBM® mainframe and using DB2 UDB for OS/390 as the database.

Performance Characteristics of a Data Warehouse Environment

The art of performance tuning has always been about matching workloads for execution with resources for that execution. Therefore, the beginning of a performance tuning strategy for a data warehouse must include the characterization of the data warehouse workloads. To perform that characterization, there must be some common metrics to differentiate one workload from another.

Once the workloads have been characterized, some analysis should be performed to determine the impact of executing multiple workloads at the same time. Some workloads will not work well together and, when executed concurrently, will degrade each other's performance. In such cases, it is best to keep these workloads from running at the same time. The workload that consists of the maintenance programs for the warehouse should be tracked closely because of its impact on availability. While not strictly a performance issue, data marts also have their place in a warehouse strategy: one factor to evaluate in deciding whether to establish marts is the ability to segregate workloads across multiple data marts and thus reduce the occasions when competing workloads degrade performance.

One issue to address in a warehouse environment is whether there will be uniform workloads or whether all work will be unique. Some companies may find queries that are executed on a regular basis and thus can be characterized as a workload. Others may find the dynamic nature of the environment very difficult to characterize. This will be addressed in more detail later. The performance analyst can get help with pattern matching from end users while trying to determine the workload characteristics of the company's warehouse.

Workload arrival rates, and how those arrival rates can be influenced, must be combined with the workload characterization. Queries against a warehouse are not driven by customers transacting business with the company, but rather by users who are searching for information to make the business run more smoothly and be more responsive to its customers. Therefore, the timing of these queries can be under some control. Typically, this control will work best when it is integrated into the warehouse service-level agreements, making it part of the agreement between IS and the users of the system. For example, this control might be structured by setting different response time goals for different workloads or groups of users based on day of the week or hour of the day. This would not guarantee that the work would arrive at certain times, but it would encourage submission of workloads at different times.

Sometimes availability is overlooked in the evaluation of performance. If a system is unavailable, then it is not performing. Therefore, the ability of the platform to deliver the required availability is critical. Some might question the need for high availability of a warehouse compared to the availability requirements of an operational system. A warehouse may, in fact, need 24x7 availability. Consider that queries against a warehouse must process large volumes of data, which may take hours or perhaps days. Planned outages might be tolerated if they are scheduled around these user queries, but an unplanned outage in the middle or near the end of a long-running query may be unacceptable to users.

In addition, the more a company uses a warehouse to make strategic business decisions, the more the warehouse becomes just as critical as the operational systems that process the current orders. Many have argued the value of a warehouse to project future buying patterns and needs to ensure that the business remains competitive in a changing marketplace. These decisions affect whether there will be future orders to record in an operational system.

Realistic service-level agreements

Realistic service-level agreements can be achieved within the management of a data warehouse. Testing the warehouse before production implementation is the basis of realistic service-level agreements. This will probably occur near the end of the testing phase and should include tests against the same data that will actually exist in the production warehouse at its inception. This will have to be coordinated with the users to ensure that the tests represent realistic queries.

If a tool generates the queries, this is an opportunity to determine how much control you have within the tool to generate efficient SQL. Achieving the best SQL for your system may take some time and effort working with the tool's vendor and with some level of SQL EXPLAIN information. The long hours spent designing the warehouse schema to meet the users' requirements will have been wasted if the generated SQL queries do not take advantage of that design.

There are many choices of platforms on which to run the data warehouse. Performance is sometimes a consideration when making this choice, but not always. If performance of the warehouse is not a major consideration, then service-level agreements for the warehouse will have to be adjusted based on the ability of the platform to deliver service.

Service-level agreements for a data warehouse may need to be more fluid than those established for an operational system. An operational system may have a requirement that 90% of all transactions complete in less than one second. This is easy to track, report and understand. In a data warehouse system, however, the requirement might instead be to produce an answer within one second for every 10,000 pages processed by the query. While the user may require some education on such a metric, it gives the system administrator more control over the delivery of service and the resources needed to deliver on service-level agreements. Metrics that can be obtained without traces are preferable because of the overhead of DB2® traces.
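
As a minimal sketch of how such a goal could be tracked, the following query assumes a hypothetical QUERY_HISTORY table, populated from DB2 accounting data, with one row per query and columns QUERYNO, PAGES_PROCESSED and ELAPSED_SEC; the goal shown is one second of elapsed time per 10,000 pages processed.

    SELECT QUERYNO,
           PAGES_PROCESSED,
           PAGES_PROCESSED / 10000.0 AS GOAL_SECONDS,
           ELAPSED_SEC               AS ACTUAL_SECONDS,
           CASE WHEN ELAPSED_SEC <= PAGES_PROCESSED / 10000.0
                THEN 'MET' ELSE 'MISSED'
           END                       AS SLA_STATUS
      FROM QUERY_HISTORY;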

Service-level agreements need to be adjustable after implementation rather than rigid. The goal of the data warehouse is to deliver value to the business. Value is delivered in terms of more business, growth in new business areas or reduced costs for the business. When this is successful, the company executives will want more. This will generate an increase in warehouse activities that will strain the initial resources. Service-level agreements along with capacity planning information can pave the way for warehouse growth as it proves its value to the company and expands its mission.

In addition, time-sensitive information may occasionally be required for business opportunities that exist for only a short time. The ability to adjust resources to meet specific workloads can affect the attainment of other service-level agreements. Because of these adjustments and the fluid nature of the SLA system, attainment reporting should include some level of prioritization, recognizing that some workloads were considered less important for brief periods and reducing the impact of missing the SLA for those workloads. Furthermore, within the SLA reporting structure, the emphasis should always be on percentages (of attainment, of resource utilization, and so forth) because the actual resources and response times could, and probably will, change over time.
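
Continuing the sketch above, a percentage-based attainment report by priority could be derived from the same hypothetical QUERY_HISTORY table, here assumed to also record a WORKLOAD_PRIORITY and whether each query met its goal (SLA_STATUS):

    SELECT WORKLOAD_PRIORITY,
           COUNT(*) AS QUERIES,
           SUM(CASE WHEN SLA_STATUS = 'MET' THEN 1 ELSE 0 END) * 100.0
             / COUNT(*) AS PCT_ATTAINED
      FROM QUERY_HISTORY
     GROUP BY WORKLOAD_PRIORITY;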

Workload characterization

Workload characterization may vary between companies. A company may characterize by the business units that use the warehouse, by applications or by the type of query executed against the warehouse. Some companies may characterize by all of the above because there is merit to each of these characterizations. However, if the primary goal is to improve the performance of the warehouse, then the important characterization is by the type of queries executed against the warehouse. This characterization makes it possible to understand the problems that competing workloads might generate, and minimizing the times when workloads compete will allow for better performance of the warehouse.

An example of competing workloads is when two queries are executed at the same time against two different databases, but the underlying tables reside on disk drives that are attached to the same controller on the same physical channel to DASD. Those two queries will be competing for the same resource – a channel to the disk drives. Queuing will begin to occur and wait time will be inevitable. In computing systems, wait time is the enemy of performance.

Another example of a competing workload is when two queries are executed at the same time against the same table. One is a well-refined query that uses the available indexes to retrieve a relatively small result set and the other is a broad query that scans certain partitions to generate a large result set. In this case, the synchronous I/O for the random reads of the well-refined query will more than likely have to wait for the asynchronous I/O of the broad query, because of the number of pages retrieved by the asynchronous I/O and the time required to perform that operation. Again, the enemy of performance is wait time, and in this case, the unfortunate loser is the person with the well-constructed SQL statement.

The purpose of workload characterization is to define the resource requirements so that workloads that compete for resources can be executed at different times.

Level of characterization for a workload

If workload characterization were performed for an operational application, the level of characterization might be a DB2 thread. However, for a data warehouse, the sensible level of characterization is at the query level since the vast majority of work in a warehouse environment will be performed at that level. Therefore, for performance improvements of a data warehouse, the characterization of workloads should be at the individual SQL statement level.

Some might be concerned about tracking individual SQL statements and might fear that the amount of data would be overwhelming. In a warehouse environment, however, the number of statements executed daily is significantly smaller than in an OLTP application, and the majority of these statements run much longer.

Data warehouse metrics

Metrics can be used for a variety of purposes in a data warehouse. They can provide for workload characterization, serve as the basis of realistic service-level agreements, help measure processor and I/O subsystem throughput and help measure query efficiency.

Metrics for workload characterization

The purpose of workload characterization is twofold: to avoid executing competing workloads at the same time and to promote simultaneous execution of complementary workloads.

It would be impossible to list all the metrics involved in workload characterization of a data warehouse. However, the object names used to produce a query's result set are important. If multiple queries access the same objects sequentially, then the size of the buffer pool for the accessed objects could determine the nature of these workloads as being competing or complementary. If the buffer pool(s) were large enough to contain a high percentage of the object's pages, then this workload could be very complementary.

However, if the pool contained a small percentage of the object's pages, then this workload could compete for the scarce buffer pool resource quickly. In this case, the important metrics are the object name, the average number of concurrent queries, the buffer pool for the named object, the size of the buffer pool and the average and maximum utilization of the buffer pool. These pieces of information would allow the user to determine if the workloads were complementary or competing.
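
As a sketch of how this comparison might be made, the following query uses the NACTIVE page counts in SYSIBM.SYSTABLESPACE together with a hypothetical BPOOL_SIZES table (captured, for example, from -DISPLAY BUFFERPOOL output) to show what percentage of its assigned pool each object would occupy; the database name DWHDB is illustrative.

    SELECT TS.DBNAME, TS.NAME, TS.BPOOL, TS.NACTIVE,
           BP.VPSIZE,
           TS.NACTIVE * 100.0 / BP.VPSIZE AS PCT_OF_POOL
      FROM SYSIBM.SYSTABLESPACE TS,
           BPOOL_SIZES BP
     WHERE TS.BPOOL  = BP.BPNAME
       AND TS.DBNAME = 'DWHDB'
     ORDER BY PCT_OF_POOL DESC;

Objects whose page counts dwarf their pool are the ones most likely to turn concurrent sequential access into a competing workload.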

Interval or event data

Should the metrics represent individual events or a summarization of individual events over an interval of time? Event data is absolutely the most accurate data to work with when analyzing past history with the hope of predicting the future. However, most companies do not maintain event data historically because of the volume of data accumulated.

For operational systems, the amount of event data collected would be larger than any data warehouse. Additionally, the difference that event data would make in the analysis process would be small compared to the difference in cost to achieve that accuracy. For these reasons, companies will not collect event data for operational systems. However, the amount of information collected for a data warehouse would be much smaller. Therefore, collecting event data for a data warehouse is not out of the question.

This would not be true of all types of event data concerned with the operation of a data warehouse. Input/output (I/O) operations occur so frequently and are of such short duration that collecting this type of event data would be out of the question. The time required to record the event could exceed the time of the event itself. When this happens, maintaining the event data would greatly increase the overhead of operating the data warehouse and would not be justifiable. However, collecting detailed event data for every query executed against the data warehouse is a viable option, since the number of daily queries is small compared to the queries executed against an operational system.

Therefore, the metrics of analysis will probably include both interval data and event data. With that recognition comes the difficult task of organizing and evaluating the data so that the interval data can be matched to the event data in a coherent fashion. For example, accounting statistics are collected on every query against the warehouse; I/O statistics are accumulated on intervals of 15 minutes; buffer pool statistics are accumulated on intervals of 15 minutes. If a normal query executes in less than 15 minutes, how can the analyst take into account the interval data when the query was not executing during the entire interval? What happens when the query lasts 15 minutes, but spans two 15-minute intervals because it started in the middle of one and finished in the middle of another? These questions highlight the issues associated with the attempt to mix interval and event data.
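
One workable convention is to prorate each interval record across the queries that overlap it. The sketch below assumes hypothetical QUERY_EVENT (QUERYNO, START_TS, END_TS) and BP_INTERVAL (BPNAME, INTV_START, INTV_END, SYNC_IO) tables built from accounting and statistics data, and apportions each 15-minute interval's synchronous I/O count to a query in proportion to the number of seconds of that interval during which the query was running.

    SELECT Q.QUERYNO, I.BPNAME,
           SUM( I.SYNC_IO *
                ( ( DAYS(CASE WHEN Q.END_TS < I.INTV_END
                              THEN Q.END_TS ELSE I.INTV_END END) * 86400
                  + MIDNIGHT_SECONDS(CASE WHEN Q.END_TS < I.INTV_END
                              THEN Q.END_TS ELSE I.INTV_END END) )
                - ( DAYS(CASE WHEN Q.START_TS > I.INTV_START
                              THEN Q.START_TS ELSE I.INTV_START END) * 86400
                  + MIDNIGHT_SECONDS(CASE WHEN Q.START_TS > I.INTV_START
                              THEN Q.START_TS ELSE I.INTV_START END) )
                ) / 900.0 ) AS PRORATED_SYNC_IO
      FROM QUERY_EVENT Q, BP_INTERVAL I
     WHERE I.INTV_START < Q.END_TS
       AND I.INTV_END   > Q.START_TS
     GROUP BY Q.QUERYNO, I.BPNAME;

The linear apportionment is only an approximation, but it makes interval and event data comparable without collecting additional trace records.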

Event data needs to have as much detail associated with the event as is possible to collect without incurring unacceptable overhead. This data can then be used to tune the performance of specific queries in the data warehouse and to characterize the queries for workload analysis. As an example, event data associated with a particular query should have all the information concerning that query's use of DB2 buffer pools, getpage requests and I/O events to avoid trying to extract that information from an interval record.

Interval data is extremely valuable in judging the health of a data warehouse. It allows for trending, so the analyst can see where the warehouse has been and have a good idea of where the warehouse is going. Interval data can provide for tuning of a DB2 buffer pool for system-wide activity; after that, event data can be used for finer adjustments, such as identifying objects that should be moved from one pool to another. Interval data can also be valuable in pattern matching. If specific events occur repeatedly and have the same impact on the system, they will be easier to notice with interval data because the analyst is working with smaller amounts of data that report over longer time frames. This is especially true if the events only repeat once a month or once a quarter.

Metrics for performance at the subsystem level

Characterizing workloads is important, but that is not the only aspect of tuning a DB2 data warehouse for maximum performance. Metrics should be maintained to ensure that the DB2 system is tuned for the workload it is executing. From a subsystem perspective, the analyst should track DB2's use of buffer pools, the EDM pool, the RID pool, the sort pool, logs, checkpointing and locking. DB2's use of pools is a trade-off: memory is used to avoid physical I/O operations and thus reduce wait time for DB2 queries. Logging and checkpointing activities are closely aligned and must be tuned because DB2 updates can go no faster than the log.

In a warehouse system, because of the small number of updates, logging and checkpointing will not be as important as pool usage. Likewise, locking conflicts are less important in a warehouse environment. Locking conflicts are greatly reduced because of the small number of updates, the bind and SQL options that allow queries to read through locks and the ability to place an object in read-only status for specific periods of time.

In a warehouse environment, the single system-wide tuning task with a chance for significant changes in performance is the management of DB2's use of memory in its four types of pools. This tuning should give DB2 as much memory as possible without generating excessive overhead through MVS paging. Depending on the version of DB2 used, the performance analyst will have options such as data spaces for buffer pools and the use of ESO hiperspace. The greatest challenge in this tuning is that three of the pools have no external controls through DB2 commands, and the one that does have an external interface through DB2 commands provides little assistance: the user must know exactly what to change in the buffer pools to improve performance.

The traditional approach to buffer pool tuning has been to collect detailed information for a short time, like 15 to 20 minutes, at a peak period, and then make buffer pool adjustments to get optimum performance when the workload matches that time interval. The difficulty with this approach is that DB2 seldom matches the same workload used to create this optimum environment. As a result, the performance gains from this type of tuning are not always ideal. Part of the problem with this tuning methodology is the changing workloads based on query execution that DB2 is trying to manage.

No one has yet fully accepted the challenge of tuning DB2 in this way, but the optimum performance from a DB2 warehouse environment can only be achieved when the buffer pools are dynamically adjusted through DB2 commands based on the workloads that are executing. This is true for any DB2 subsystem, not just one that is running a warehouse. There are three elements to this tuning, described in the sections that follow:

Separation of objects across buffer pools

Pools must be structured so that objects with a common usage pattern are together. Objects that are accessed randomly should be put together and separated from objects that are accessed sequentially. If only randomly accessed objects are in a pool, memory will be better used by retaining pages for longer periods of time, and the likelihood of a subsequent hit in the pool for a getpage request will increase, thus avoiding I/O processing.

Other recommendations include keeping all system objects in BP0 and establishing a separate BP7 for all sort work objects (DSNDB07). Depending on the amount of virtual storage that is available, further separation can be obtained by placing indexes and data in different pools. Even greater separation can be achieved by separating scanned indexes from probed indexes. These recommendations apply to tablespace objects as well.
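
A minimal sketch of this separation, using illustrative database, tablespace and index names, assigns objects to pools with standard DDL; the pool numbers are only an example of the convention described above, with system objects remaining in BP0 and DSNDB07 sort work in BP7.

    ALTER TABLESPACE DWHDB.CUSTTS BUFFERPOOL BP1;  -- randomly accessed data
    ALTER TABLESPACE DWHDB.HISTTS BUFFERPOOL BP2;  -- sequentially scanned data
    ALTER INDEX DWHUSER.XCUST1    BUFFERPOOL BP3;  -- probed index
    ALTER INDEX DWHUSER.XHIST1    BUFFERPOOL BP4;  -- scanned index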

Sizing and sizing adjustments of buffer pools

One of the dynamics of a DB2 subsystem is that workloads are constantly changing. This means that some objects are accessed heavily during some periods of the day and lightly at other times. If there are separated objects in multiple pools, then at any point in the day there will be pools with excessive amounts of storage allocated because the objects are currently being accessed lightly. Other objects are not achieving maximum performance because their access is so heavy that the pool becomes a constraining factor.

If virtual storage were truly boundless, then all pools would be sufficiently large so that no work would ever be constrained by the size of the pool. Unfortunately, we are not there yet; therefore, there is some value to adjusting pool sizes based on the objects that have a high access pattern.
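
In practice, this adjustment is made with the Display Bufferpool and Alter Bufferpool commands; the pool name and size below are purely illustrative.

    -DISPLAY BUFFERPOOL(BP2) DETAIL
    -ALTER BUFFERPOOL(BP2) VPSIZE(40000)

The DETAIL output shows the getpage, prefetch and I/O activity for the pool, which is the interval-level evidence for deciding whether the pool is currently a constraining factor.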

Definition of pool thresholds

Before discussing the management of pool threshold values, it makes sense to enumerate those eight threshold values and their usage by DB2. The first five can be changed with the Alter Bufferpool command; the last three are fixed by DB2 and cannot be changed.
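
For reference, the eight virtual pool thresholds are commonly documented as five adjustable ones (the sequential steal threshold VPSEQT, the parallel sequential threshold VPPSEQT, the assisting parallel sequential threshold VPXPSEQT, the deferred write threshold DWQT and the vertical deferred write threshold VDWQT) and three fixed ones (the sequential prefetch threshold at 90%, the data management threshold at 95% and the immediate write threshold at 97.5%). The adjustable five are set per pool with the Alter Bufferpool command; the pool name and percentages below are illustrative.

    -ALTER BUFFERPOOL(BP2) VPSEQT(80) VPPSEQT(50) VPXPSEQT(0) DWQT(50) VDWQT(10)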

Management of pool thresholds

These pool thresholds are used by DB2 to either encourage or discourage the use of prefetch activities within the system and to clear updated pages from the pool so those pages can be used for other purposes. Prefetch activities within DB2 provide a benefit to applications that are sequentially processing the rows in a table or to applications requiring a large number of rows from a table. By loading pages into a pool in anticipation of their need, DB2 can reduce the overall number of I/O operations in the system and thus reduce wait time for these types of application queries.

Prefetch can be called for at bind time if the access path chosen is a sequential scan of a table. It can be dynamically invoked if DB2 determines the query is processing the data sequentially. It can also be invoked to sort a long list of row identifiers for prefetch when the application requires a large number of rows from a table that are not necessarily in exact sequence. These operations benefit the system by reducing wait times; however, they can also have a negative impact on the system if, and when, these operations fill the pool with data pages.

When the pool is filled, DB2 has to ensure that other functions, such as synchronous I/O operations, can be performed for running queries. These thresholds allow DB2 to execute prefetch operations as much as possible while preventing these operations from dominating the use of a buffer pool. Because DB2 maintains a log of all updates performed by an SQL statement, the system does not have to write the updated pages to DASD immediately. This allows DB2 to accumulate changes to a page in the buffer pool and then write all the changes to that page with a single I/O operation. This facility makes for a well-running DB2 subsystem. When the pool becomes full, DB2 gradually begins writing pages to DASD sooner, and the benefits of a single write I/O operation for multiple updates are eroded.

Workload characterization and threshold information can be used to dynamically tune a DB2 subsystem. Workload characterization can differentiate between objects that are primarily accessed sequentially and objects that are primarily accessed randomly. By placing these objects in different buffer pools, the analyst can manipulate thresholds to either maximize or minimize prefetch activities based on the workload characterization. Also, by separating these objects, each pool can be used for maximum efficiency.

In this environment, thresholds could be set to coincide with the primary usage of the pool. However, when the objects in the pool are not following their normal access patterns, the operator could be informed of these changes and make adjustments through the Alter Bufferpool command to tailor the thresholds to match the current access patterns. When the queries that caused this abnormal access pattern are finished, the operator could be informed to change the thresholds back to their original desired state.
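
As a sketch of that sequence, a pool that normally serves random access might temporarily favor prefetch while a large scan runs and then be restored; the pool name and percentages are illustrative.

    -ALTER BUFFERPOOL(BP1) VPSEQT(80)
    (the large sequential queries run against objects in BP1)
    -ALTER BUFFERPOOL(BP1) VPSEQT(20)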

Additionally, the use of many pools with objects separated by access patterns could put a strain on virtual storage usage by the DB2 subsystem. This could be caused by sizing the pools for efficiency at peak access times. Not all of the pools and not all of the objects in a DB2 subsystem will be accessed at peak rates all hours of the day. Therefore, to use virtual storage in a responsible manner, DB2 buffer pools would be virtual storage-friendly if they could expand and contract based on the use of objects in a particular pool.

The DB2 system administrator, in concert with the MVS systems programmer, could decide how much virtual storage to allow for DB2 buffer pools. At that point, DB2 would set the pool sizes based on some priority system to match the most politically important processing time of the day. As workload changes indicate diminished use of important objects and increased use of less important objects in the system, the operator could expand and contract pools with the Alter Bufferpool command to maintain the total amount of virtual storage allocated for DB2 buffer pools. The operator could, at the same time, redistribute that storage based on the object usage at the time. As new objects are accessed, changes could be made to create a match between the objects being accessed and the pool sizes for those objects.
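
For example, if the DB2 buffer pools were capped at a fixed total of virtual storage, the operator could shrink a lightly used pool and grow a heavily used one by the same amount, keeping the total constant; the pool names and sizes below are illustrative.

    (BP2 drops from 40,000 to 20,000 buffers, releasing storage)
    -ALTER BUFFERPOOL(BP2) VPSIZE(20000)
    (BP4 grows from 40,000 to 60,000 buffers, absorbing the released storage)
    -ALTER BUFFERPOOL(BP4) VPSIZE(60000)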

Metrics for performance at the subsystem level – summary

At a subsystem level, in a warehouse environment, the control and distribution of virtual storage for all four kinds of pools can have a dramatic effect on the performance of the warehouse. This would require monitoring pool usage information and occasionally adjusting virtual storage allocation across a variety of pools with the objective of maintaining peak performance from the DB2 subsystem.

The example that has been described in detail concerned the use of buffer pools. However, the same methodology could be applied to all four types of pools in a DB2 subsystem. The metrics required for this tuning effort could easily be derived from interval data. By using interval data, the cost associated with data collection would be held to a minimum, while the performance gains could potentially be tremendous.

Metrics for performance at the individual query level

Queries can be tuned by changing the SQL syntax, by changing the schema or by using DB2 optimizer hints.

The queries executed against a warehouse also provide the metrics necessary for tuning. In fact, tuning individual queries provides significantly greater performance improvements than tuning at the subsystem level. The metrics involved can be divided into two groups: metrics for the performance of the SQL syntax and metrics for the performance of the schema.

Metrics for performance of the SQL syntax

The third tuning method, DB2 optimizer hints, uses metrics obtained from the first two methods to tune DB2 queries. The first two methods each rely on their own metrics. The first group of metrics comprises information relating to the SQL syntax.

These metrics contain information generated by the DB2 optimizer as output from the Explain process. They indicate how DB2 will retrieve the data to process the query. The method of access, whether a scan or index access, is an example of the information that can be gathered. Combined with the name of any index DB2 might use to process the query, this identifies which DB2 resources will be used and, by exclusion, which will not. In addition to these metrics, there are statistics maintained by DB2 in the catalog concerning the objects processed to produce the query results.

Furthermore, depending on your use and collection of DB2 object statistics, metrics might be required directly from the objects themselves rather than from the catalog. This would be the case if you have chosen not to update DB2 catalog statistics for fear that access paths will change and cause performance degradation. Finally, SQL syntax information, such as the columns used in predicates and the operators used inside those predicates, is invaluable in tuning DB2 query performance. This last set of metrics is the most difficult to collect and must be obtained by parsing the SQL text.
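
A minimal sketch of collecting the access path metrics, using an illustrative query number, schema and PLAN_TABLE owner:

    EXPLAIN PLAN SET QUERYNO = 101 FOR
      SELECT C.CUST_NAME, SUM(S.SALE_AMT)
        FROM DWHUSER.CUSTOMER C, DWHUSER.SALES S
       WHERE C.CUST_ID = S.CUST_ID
         AND S.SALE_DATE >= '2000-01-01'
       GROUP BY C.CUST_NAME;

    SELECT QUERYNO, QBLOCKNO, PLANNO, METHOD, TNAME,
           ACCESSTYPE, MATCHCOLS, ACCESSNAME, INDEXONLY, PREFETCH
      FROM DWHUSER.PLAN_TABLE
     WHERE QUERYNO = 101
     ORDER BY QBLOCKNO, PLANNO;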

It is also important to understand how these metrics would be used. The access path information could be combined with object statistics to produce a list of statements that generated a large number of I/O operations to produce the result set. Each shop will have its own number to assign to that "large I/O" threshold.
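
One way to build such a list is to join the Explain output with the catalog page counts and flag tablespace scans (ACCESSTYPE 'R') against large tables; the 100,000-page cutoff is only an example of a shop-defined threshold.

    SELECT P.QUERYNO, P.TNAME, T.NPAGES
      FROM DWHUSER.PLAN_TABLE P, SYSIBM.SYSTABLES T
     WHERE T.CREATOR = P.CREATOR
       AND T.NAME    = P.TNAME
       AND P.ACCESSTYPE = 'R'
       AND T.NPAGES > 100000
     ORDER BY T.NPAGES DESC;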

Analysis of these statements might lead to changes in the predicate structure so that DB2 would consider indexes for access path selection that would have been ruled out otherwise. Analysis might also identify SQL usage patterns that were less than optimal and provide a basis for SQL training of warehouse users to avoid SQL syntax that generates excessive I/O operations and causes poor query response time.

Collecting information concerning all queries would allow for performance analysis very similar to the business analysis performed using the warehouse itself. Pattern matching on predicate operators and on the columns used in predicates would provide invaluable information to the performance analyst. It would allow the analyst to distinguish broad misunderstandings of DB2 performance and SQL usage from individual users who simply require more training to avoid poor SQL that slows down the warehouse's performance.

Unfortunately, DB2 does not provide all this information today. The access path selection information can be obtained from the mini-bind IFCID 22 for dynamic queries and from the PLAN_TABLE table for static queries that are bound with the EXPLAIN(YES) parameter. The object statistics can be collected from the DB2 catalog, or from third-party analysis tools if the catalog is not kept current. The column names used in predicates could be obtained from the catalog after parsing the SQL text maintained in the SYSSTMT and SYSPACKSTMT tables, but this would only assist with static queries, and the overwhelming majority of queries in a warehouse environment are dynamic. The predicate information provides the greatest value, but unfortunately, there is no current mechanism to capture and maintain that information for dynamic queries. Therefore, the tuning of individual queries is largely left to historical analysis of thread history information and user education on good SQL coding techniques.

Metrics for performance of the schema

When analyzing the schema, it is important to collect information over time concerning object usage within the schema at the SQL statement level. To make performance improvements, it is important to know which SQL statements access which objects at particular times. Furthermore, the object accesses must be categorized as synchronous or asynchronous requests, so that scans, for whatever reason, are identified. Additionally, the type of object access should be identified to distinguish between read and update access. It is not required, but it is useful to know which forms of update are being processed, distinguishing between updates, inserts and deletes.

This information can be organized into two matrices. Both could have a list of objects, such as table and index names, and associated with each object would be the SQL statements that drive access to that object. At the statement level, the number of getpage requests, subdivided by asynchronous and synchronous requests, would be detailed. These detailed numbers would be rolled up to the object level for total object getpage requests.

The second matrix would have similar object information and a count of the number of select and update statements executed over a particular time interval. The longer the time interval is, the better the analysis will be. The update information could include a breakdown of insert, update and delete statements for more detailed analysis. From this information, index objects could be identified that generate little or no read activity and, therefore, are potential candidates for elimination.
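
A sketch of both matrices, assuming a hypothetical STMT_OBJECT_STATS table built from thread history data with one row per statement execution per object (OBJECT_NAME, STMT_ID, STMT_TYPE, SYNC_GETPAGES, ASYNC_GETPAGES):

    -- Matrix 1: getpage requests by object and statement,
    -- split into synchronous and asynchronous requests
    SELECT OBJECT_NAME, STMT_ID,
           SUM(SYNC_GETPAGES)  AS SYNC_GETPAGES,
           SUM(ASYNC_GETPAGES) AS ASYNC_GETPAGES
      FROM STMT_OBJECT_STATS
     GROUP BY OBJECT_NAME, STMT_ID;

    -- Matrix 2: statement counts by object and statement type
    -- over the collection interval
    SELECT OBJECT_NAME,
           SUM(CASE WHEN STMT_TYPE = 'SELECT' THEN 1 ELSE 0 END) AS SELECTS,
           SUM(CASE WHEN STMT_TYPE = 'INSERT' THEN 1 ELSE 0 END) AS INSERTS,
           SUM(CASE WHEN STMT_TYPE = 'UPDATE' THEN 1 ELSE 0 END) AS UPDATES,
           SUM(CASE WHEN STMT_TYPE = 'DELETE' THEN 1 ELSE 0 END) AS DELETES
      FROM STMT_OBJECT_STATS
     GROUP BY OBJECT_NAME;

Index objects whose rows in the second matrix show update activity but little or no read activity are the elimination candidates described above.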

Objects that have a high insert and delete activity could be targeted for increased free space during a reorganization to avoid frequent reorganizations. Objects that have no delete activity could be analyzed to determine the need for some archival process based on business requirements of the data. All of these activities could indirectly affect the performance of the application. Additionally, this information can have great value in reducing the DASD requirements of the warehouse. Disk drives may be inexpensive, but they are not free.

For direct improvement of performance, these matrices would pinpoint statements that scan tables frequently and provide opportunities for tuning improvements. The Explain from DB2 might indicate that an access path using List Prefetch is being used, when the actual data shows that a List Prefetch path has been abandoned for lack of storage or because too many rows have been selected. The object/statement/getpage request matrix identifies the problem and quantifies the impact to the DB2 subsystem.

When viewing the matrix from a DB2 subsystem or data-sharing group perspective, the analyst can tune the SQL causing the greatest impact on the DB2 subsystem. By tying this information back to a specific user, the analyst can target users that might need further SQL education. This information is of greatest value when working with dynamic SQL because the only Explain of dynamic SQL is the IFCID 22 Mini-bind, which is seldom externalized through a DB2 trace.

Conclusion

There are a variety of metrics that can be used effectively to tune a DB2 data warehouse. Decide on the metrics that are most appropriate for your site and gather them at a detailed level. In fact, establish a data warehouse of performance metrics and mine this warehouse just as your end users mine theirs. This may give you the opportunity to learn from them or even empathize with them. This paper presents a general methodology for collecting metrics without mandating what metrics each site should use. Remember that the collection of data is not an end unto itself. The metrics collected must be analyzed and presented so that they become information and knowledge about your warehouse.

Helping You Maintain Advantage

BMC Software Professional Services helps companies maintain their advantage through a comprehensive suite of consulting services and education offerings designed to ensure ongoing business availability of critical applications. Our packaged and custom service offerings maximize product potential, reduce project risk, accelerate the time to value, and improve operations. The Service Assurance Center by BMC Software is our premier custom solution, helping customers achieve service-level management. In addition, our support organization provides the crucial documentation and responsive problem resolution to keep your business on track.
