ACME Widgets are implementing several Oracle Business Intelligence Applications (OBIA) modules to enable reporting against their new EBS R12 deployment. As well as using the provided OBIEE metadata model (RPD) and reports, they have extended the metadata model and built their own reports.
Having recently loaded a year's data volumes into their UAT environment, they have found response times for approximately 25% of reports to be unacceptable for the intended purpose of interactive reporting. Errors were also seen on both the database and the application server, caused by temporary space usage by the respective stack components.
The performance assessment focussed solely on OBIEE and related components. The performance of ODI was not investigated.
My findings are given in detail below, but the key points are as follows:
No misconfiguration was observed on the database or application server that would directly account for the performance issues seen.
Without the implementation changes recommended in this document being made in OBIEE, the application server was seen to be under CPU, memory and I/O pressure with just a single user executing the report under test. Comprehensive performance testing has not yet been carried out, and it may be that other reports will put similar undue pressure on the application server.
Rudimentary concurrency ('load') testing raised concern about the capacity of the database and/or application server to support high workloads without response times increasing and server resources being exhausted. Further testing with realistic workload models is necessary to identify exact capacity limits and to prioritise any necessary performance optimisation work.
The nature of the data model (physical and RPD) is such that query 'compile' times in OBIEE are relatively high, and these currently set the lower bound on report response times, even with BI Server caching in place.
The following steps are recommended to be undertaken immediately, in order to help stabilise the deployment and further assess the performance implications of OBIA going into UAT.
These steps are all recommended for the long-term health and success of the OBIA project, but should not impact on the immediate UAT activity. Further explanation and detail of each recommendation can be found in the Performance Assessment section later in this document.
Apply the recommended changes to the opmn.xml file, once a repeatable capacity test framework is in place to validate them.

The approach taken was to establish a strong empirical base for subsequent analysis and validation of configuration changes ("tuning"). In a stack as complex as OBIEE it is critical that a holistic view is taken and attention is not focussed unduly on any one component.
A single OBIEE dashboard with a single set of parameter values was chosen for detailed analysis of behaviour throughout the stack. This led to the observations and recommendations below. Following on from this a basic capacity test was performed, with results and analysis in the Capacity Test section below.
All work and investigation was done in the NPC environment, as used by UAT.
Software Versions:
The standard BI Apps RPD is in use, along with a bespoke Subject Area (ACME Widgets Financials - GL Hierarchies) built against a modified Core business model that is used for many of the reports.
Security is integrated with EBS and there is also use of the LDAP provided by WebLogic Server.
Whilst BI Server Caching is enabled, no cache management strategy is in place. It is essential that one is implemented to ensure that users do not see stale data, and that the cache provides optimal performance benefits.
Cache seeding can be done in two ways: from an Agent, using the 'Oracle BI Server Cache' destination, or with the SASeedQuery ODBC procedure to load a specific query's results into the cache.

Cache purging can be done by issuing one of the cache management ODBC calls. These calls can be made from an Agent or through the ODBC or JDBC BI Server interface. Cache purging can also be done through Event Polling Tables, which integrate easily with the ETL but are less tightly coupled.
It is best to integrate cache management directly into the ETL, so that as soon as new data is available the cache is purged, and as soon as the cache is purged it is reseeded.
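As an illustrative sketch of what an ETL-triggered purge-and-reseed step could look like (the DSN, credentials, file path and seed query below are placeholder assumptions; in practice the seed queries would be the logical SQL of the real reports):

# Purge the whole cache, then reseed it with a known 'chunk' query
cat > /tmp/cache_refresh.lsql <<'EOF'
call SAPurgeAllCache();
call SASeedQuery('SELECT "GL Segment2 - Account"."L30 Account" FROM "ACME Widgets Financials - GL Hierarchies"');
EOF
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s /tmp/cache_refresh.lsql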
The BI Server can make use of cache entries to satisfy reports that are not a direct hit but are a subset or summary of what is in the cache. Therefore it is a good idea to seed the cache with 'chunks' of data matching the reports that users run, rather than trying to pre-empt every single permutation. For example, seed the cache with the P&L report by week for all weeks, and any P&L report then run for a single week should be able to make use of the cache entry. This approach should be optimal, but requires testing in order to perfect it. The use of the cache can be validated with nqquery.log and the Administration Tool, and the request variable SKIP_PHYSICAL_QUERY_EXEC can be used to see how OBIEE will react to a given query without actually executing it.
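For example, a minimal sketch of this check through nqcmd (DSN and credentials are placeholders, and the query shown stands in for the real P&L logical SQL):

# Ask the BI Server to plan -- but not execute -- the query, then check nqquery.log for cache hits
cat > /tmp/cache_check.lsql <<'EOF'
SET VARIABLE SKIP_PHYSICAL_QUERY_EXEC=1 : SELECT "GL Segment2 - Account"."L30 Account" FROM "ACME Widgets Financials - GL Hierarchies"
EOF
nqcmd -d AnalyticsWeb -u weblogic -p Password01 -s /tmp/cache_check.lsql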
Each Physical table in the RPD has a checkbox indicating whether the BI Server may cache the results of queries that use it. The Administration Tool's Query function can be used to identify any objects with this option disabled. During testing, at least two tables were found with this option disabled.
ACME Widgets may also want to consider placing the BI Server cache on fast disk, or on a RAM disk if server resources allow. The latter adds complexity: the BI Server expects the cache to be persistent (since it is written to disk), so a RAM disk approach would need to handle planned server restarts by re-populating the RAM disk with the files that the BI Server expects to find present.
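A minimal sketch of the RAM disk approach, assuming a tmpfs mount that DATA_STORAGE_PATHS in NQSConfig.INI would then point to (mount point, size and backup location are all assumptions):

# Create and mount a RAM disk for the BI Server cache
mkdir -p /mnt/bi_cache
mount -t tmpfs -o size=4g tmpfs /mnt/bi_cache
# Before a planned restart, preserve the cache files to persistent storage...
cp -rp /mnt/bi_cache/. /u01/bi_cache_backup/
# ...and restore them before the BI Server comes back up
cp -rp /u01/bi_cache_backup/. /mnt/bi_cache/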
When OBIEE runs a report it generates its own execution plan for the logical query, prior to generating one or more physical queries that it sends to the Oracle database. OBIEE is capable of performing its own database-like functions, including filtering, aggregation and stitching, and it will sometimes opt to execute a query with less work done on the database and more on the BI Server. From a performance point of view this is not always desirable. The execution plan generation can be seen by setting LOGLEVEL=5.
Of interest in this implementation is the use of sub-request caching and the effect that it has on the physical SQL queries generated. By default OBIEE will cache sub-requests of the query being run. This can be disabled by setting DISABLE_SUBREQUEST_CACHING=1, either per query using a Request Variable to override the System Session Variable, or globally by setting the System Session Variable in an Initialisation Block in the RPD.
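For per-query testing, the Request Variable is prefixed to the logical SQL of the analysis; a sketch of the form (the query itself is a placeholder):

SET VARIABLE DISABLE_SUBREQUEST_CACHING=1 : <logical query>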
In the P&L report it was seen that:
The response times of these executions were not consistent. Anecdotally, it may be that a side-effect of the seven database queries generated as a result of sub-request caching is that, by executing concurrently, they make more efficient use of database resources than a single database query would. The corollary, however, is that the BI Server has to do more work to process the seven datasets returned from the database, and database resources may be exhausted sooner by the multiple concurrent queries.
Further investigation is required into the most appropriate setting for this variable, in conjunction with DISABLE_CACHE_SEED.
Note that there were a few of these errors present in the WLS AdminServer log, and further testing/monitoring is recommended to determine whether there is a memory capacity problem:
<BEA-310003>
<Free memory in the server is 4,495,096 bytes. There is danger of OutOfMemoryError>
Oracle recommends several performance-related changes to the opmn.xml file. These have not been implemented yet; details are in the OBIEE 11g Infrastructure Performance Tuning Guide (Doc ID 1333049.1).

The Usage Tracking tables S_NQ_ACCT and S_NQ_DB_ACCT have been created in the main Data Warehouse schema SIT_DW.

When the BI Server returns data from the database that it has to perform additional processing on (for example, stitching together the results of multiple database queries) it will use temporary files on disk if it cannot do the processing in memory. These files can grow to large sizes (several GB), which can cause operational issues. ACME Widgets have already reported this error:
[nQSError: 10058] A general error has occurred.
[nQSError: 43113] Message returned from OBIS.
[nQSError: 46017] Sort has no work space in the work directory.
[nQSError: 46073] Operation 'write()' on file '/u01/BI_FMW/instances/instance1/tmp/OracleBIServerComponent/coreapplication_obis1/obis_temp/nQS_16908414_889_49755138.TMP' failed with error: (2) A file or directory in the path name does not exist.. (HY000)
I would recommend moving the work area to its own filesystem, so that heavy temporary file usage by the BI Server cannot exhaust space needed by the rest of the stack.
To change the path used by the BI Server for writing temporary files, update the WORK_DIRECTORY_PATHS element in the NQSConfig.INI configuration file.
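For example, assuming a dedicated filesystem mounted at /u02/obis_tmp (the path is an assumption):

WORK_DIRECTORY_PATHS = "/u02/obis_tmp";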
The following bash script can be used to monitor the space in use by the BI Server temporary files. It will write a CSV log file that can be parsed by a tool such as Excel to show the use over time:
export FMW_HOME=/u01/BI_FMW
export OUTBASE=/tmp/nqstmp_usage
# Every five seconds, append a timestamp and the total space used (KB) by the BI Server temp files
while true; do echo "$(date),$(du -s $FMW_HOME/instances/instance1/tmp/OracleBIServerComponent/coreapplication_obis1/obis_temp | cut -f1)" >> $OUTBASE.du; sleep 5; done
Whilst on site I implemented OBIEE Connection Pool Instrumentation, so that the source of OBIEE activity on the database can be easily identified. The RPD code to implement this is provided in the file instrumentation_patch.xml, with full details available here. Details of the OBIEE connection can now be seen through the Enterprise Manager pages, as well as in database system tables such as V$SESSION.
The dashboard investigated was Management/Profit and Loss. It is now made up of four analyses: at week and period level, and at L1 and L2 of the hierarchy. An OBIEE Condition object is used to determine which analysis to display on the dashboard page.
The use of large CASE statements in order to realise the required account structure in reports such as the P&L suggests that the data as stored on disk is not held in a way that is optimal for reporting. The impact of this is that, whilst remodelling the data on disk (and thus the RPD too) would not be an insubstantial piece of work, it may be that without it response times can never be brought into the second/sub-second range expected by many users of interactive reporting.
An example snippet of the kind of statement follows:
case
when (("ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L31 Account Code" = 'XMGX0')
  and ("ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L30 Account Code" = 'MGM01'))
then /*turnover third party*/
     "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L30 Account Code" || ' '
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L30 Account" || '~'
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L31 Account Code" || '~'
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment6 - Customer Type"."L31 Customer Code" || '~'
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment6 - Customer Type"."L30 Customer Code" || ' '
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment6 - Customer Type"."L30 Customer"
when (("ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L31 Account Code" = 'XMGX0')
  and ("ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L30 Account Code" = 'MGM03'))
then /*turnover not third party*/
     "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L30 Account Code" || ' '
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L30 Account" || '~'
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L31 Account Code" || '~'
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L29 Account Code" || ' '
  || "ACME Widgets Financials - GL Hierarchies"."GL Segment2 - Account"."L29 Account"
A Presentation Variable is currently used to implement the functionality allowing the user to switch the scale of facts (millions/thousands/full). This could be done more efficiently by creating a Session Variable holding the default value for the user, optionally overridden with a Request Variable. Setting the value to the actual order of magnitude, instead of a text string, would also remove the need to CAST/CASE the value as is currently seen in the logical SQL:
(cast(( case when ('millions' ='millions') then 1000000 when ('millions' ='thousands') then 1000 else 1 end ) as float
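A sketch of the alternative follows; SCALE_FACTOR is a hypothetical Session Variable name, defaulted in an Initialisation Block and overridden here by a Request Variable, so the expression receives a plain numeric divisor rather than the CAST/CASE construct:

SET VARIABLE SCALE_FACTOR=1000000 : SELECT <measure> / VALUEOF(NQ_SESSION.SCALE_FACTOR) FROM "ACME Widgets Financials - GL Hierarchies"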
Using Conditions to determine which analysis to display in a dashboard page is a common technique, but it is worth being aware that each Condition generates a query to the BI Server, so as user concurrency increases, the number of Condition queries running simultaneously will increase proportionally.
OBIEE provides metrics through the Dynamic Monitoring System (DMS). These give crucial information about the internal functioning of OBIEE over time, including the number of active sessions and the number of available connections to the database. In the graph below, the number of connections to the database is saturated and requests begin to queue up; this would manifest itself to the user only as increased response times.
DMS metrics can be accessed in several ways, including:
DMS Spy, deployed by default on the AdminServer
http://obiee-server-01:7001/dms
http://obiee-server-01:7001/dms/index.html?format=metrictable&cache=false&prefetch=false&table=Oracle_BI_DB_Connection_Pool&orderby=Name
PerfMon, through OBIEE
http://obiee-server-01:9704/analytics/saw.dll?perfmon
Enterprise Manager (Fusion Middleware Control)
http://obiee-server-01:7001/em
Go to Business Intelligence -> coreapplication -> Capacity Management -> Metrics
For more permanent capture of DMS metrics I recommend obi-metrics-agent (http://ritt.md/obi-metrics-agent), which captures the metrics to file or to a data store, from where they can be analysed and monitored with tools such as Grafana.
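As a lighter-weight interim measure, the DMS Spy metric-table URL shown above can simply be polled on a schedule; a sketch, with placeholder credentials and interval:

# Sample the connection pool metric table every 60 seconds via DMS Spy
while true; do
    curl -s -u weblogic:Password01 "http://obiee-server-01:7001/dms/index.html?format=metrictable&cache=false&table=Oracle_BI_DB_Connection_Pool" >> /tmp/dms_conn_pool.log
    sleep 60
done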
The following metrics will highlight any configuration/capacity bottlenecks internal to OBIEE:
Oracle BI DB Connection Pool/* Connection Pool -> Current Queued Requests
Oracle BI Thread Pool/DB Gateway -> Current Queued Requests
Oracle BI Thread Pool/Server -> Current Queued Requests
Oracle BI Thread Pool/Usage Tracking -> Current Queued Requests
Oracle BI PS Chart Engine/Oracle BI PS Chart Engine -> Current Charts Queued
Oracle BI PS Query Cache/Oracle BI PS Query Cache -> Current Queued Queries
Oracle BI PS Thread Pools/AsyncLogon -> Current Jobs Queued
Oracle BI PS Thread Pools/Cancel -> Current Jobs Queued
Oracle BI PS Thread Pools/ChartThreadPool -> Current Jobs Queued
Oracle BI PS Thread Pools/HostLookup -> Current Jobs Queued
Oracle BI PS Thread Pools/IndexUsersAccount -> Current Jobs Queued
Oracle BI PS Thread Pools/KPI -> Current Jobs Queued
Oracle BI PS Thread Pools/KPICancel -> Current Jobs Queued
Oracle BI PS Thread Pools/MarketingRunJob -> Current Jobs Queued
Oracle BI PS Thread Pools/Query -> Current Jobs Queued
Oracle BI PS Thread Pools/Scorecard -> Current Jobs Queued
Oracle BI PS Thread Pools/ScorecardCancel -> Current Jobs Queued
Oracle BI PS Thread Pools/SocketRPCServer -> Current Jobs Queued
Oracle BI PS Thread Pools/Thumbnails -> Current Jobs Queued
The application server contains several core dump files. These can often be large, and should be moved off the primary application filesystem to another area for further analysis, with a Service Request (SR) raised with Oracle against them as required.
/u01/BI_FMW/instances/instance1/core
/u01/BI_FMW/instances/instance1/core_19922992/core
/u01/BI_FMW/instances/instance1/core_20971604/core
/u01/BI_FMW/instances/instance1/core_7143562/core
/u01/BI_FMW/instances/instance1/core.15073418.14162448
A stack trace for each core dump can usually be found in the console~coreapplication log file for the system component that generated the dump.
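These console logs live under the instance diagnostics directory; for example (the exact component directory names may differ):

# List the console log files that may contain the stack traces
ls -l /u01/BI_FMW/instances/instance1/diagnostics/logs/*/coreapplication_obi*/console~coreapplication*.log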
I reviewed the database initialisation parameters (init.ora) against the recommendations from Oracle, and observed no reason to change these further. One setting worth noting is disk_asynch_io, which is currently enabled and which Oracle's Oracle Business Intelligence Applications Version 11g Performance Recommendations (Doc ID 1963225.1) document recommends disabling. This is a rather broad recommendation to make, given the relationship between the setting and the specific I/O capabilities of a system, and I would be inclined to leave it enabled unless testing proves otherwise.

Tests were executed by submitting the Logical SQL to the BI Server through nqcmd. The test script run-test-keep-output.sh is provided.
Note that the analysis in question has since been optimised, but the investigation here should serve as a template for future issues.
Using the above data we can create a time profile:
That over a third of the time is spent in the BI Server is a concern given the slow performance, and the impact on OS resources can be seen, as discussed below.
Even with a BI Server cache hit, the query takes 36 seconds to compile. No amount of hardware will make this run faster; the compilation time has to be reduced by optimising the analysis and/or the RPD.
The workload impact on both Database and Application server is well illustrated by these two graphs, showing a single query executing and the workload passing from Database to Application server: