Business logic changes often have an impact on structure or semantics of information stored in the application database, so a new client-side software most likely comes along with a need to perform mass data updates, to ensure that the database conforms with those changes. Frequently this is done in maintenance windows, when the service is taken off-line and is unavailable for some period of time. And because time is money, you have to do that updates as fast as possible, considering that the enterprise-class databases can be extremely large. Besides, maybe even more important thing is to leave that data in consistent state after the whole process.
Let's assume that no one will modify an application data except the maintenance staff - the application is shut down and the database server (Oracle in this case) is ready to perform an upgrade. Are there any caveats you can encounter?
Procedural approach vs. single statement
One of the most frequent mistakes made by Oracle developers I have seen (probably caused by old habits taken from imperative languages) is to do something procedurally, what can be done in a single SQL statement. Not only the latter case tends to execute faster than procedural approach, but also use significantly fewer resources. Consider the example (
big_table is a copy of dba_objects with some extra data pumped in, and has 500 000 records):(I've also seen other variations of that loop, for example with
SELECT FOR UPDATE cursor and UPDATE using CURRENT OF clause, which does basically the same but with row locking, so it is much slower - which makes no sense on the assumption that no other concurrent transactions will modify that data anyway).The problem with that kind of loop is a context switching between pl/sql and sql engines - every loop iteration involves a switch to sql engine (binding variables, if any), sql statement execution and switch back to pl/sql (again, binding return variables, if any). [1, 2]
With a single statement (which can also be boosted by Parallel DML feature) we get:
Of course not every job can be completed in a single statement - in the case of complicated problem it may be not possible (some people say it's always possible) - or it is possible, but awkward, so for some reason you must do it procedurally. But even a procedural approach can run faster than "ordinary" loop - just use array processing in conjunction with bulk SQL [3] (employ
FETCH BULK COLLECT with LIMIT and FORALL to reduce context switching issues). Optimal LIMIT X value can be choosen experimentally, through testing - don't blindly presume that "the bigger is the better":For example:
One can ask why don't just fetch all data, process it and do the bulk update in a single statement. The answer is simple - you can exhaust the process memory and get
ORA-04030 when you fetch all records without reasonable limit. Also keep in mind that this is a simple example, normally you wouldn't implement loop processing just to apply the trivial LOWER function to the table column.Create instead update?
Consider re-creating the table using
CREATE AS SELECT, doing necessary data transformations in the select statement. For example:Of course, you need extra free space to achieve this (as much as original table size in that case) and you will also have to re-create constraints, indexes etc. on the new table. Furthermore, some of those operations can be parallelized (Parallel DDL feature) or done without logging if necessary. You can also try to create a new empty table and populate it with data using direct-path insert (
INSERT /*+append */ INTO ... SELECT ...).Other considerations
In the event of mass data updates, where DML statements can take hours, I bet you could probably hear people saying: "But what about undo? We don't have sufficient undo space to handle such a big updates". Their answer is to commit frequently, so they break a transaction into smaller chunks, that can be commited independently, or they commit every N rows in a procedural loop (hopefully they write the code in a way that allows the whole task to be re-executed in case of error, yet leave data in consistent state).
My answer to previously stated question is: "You don't have enough undo? Then size your undo space accordingly to the task that have to be done". People tend to think about undo like a somewhat limited resource (or worse - like an overhead), but it's really a configuration issue that need to be resolved to handle long-running transactions correctly. You can always allocate more undo space for that big task, execute it, and if you don't need that extra undo space any more, you can revert to the previous undo size. It requires some ahead-of-time planning work to be done, not to mention close cooperation between developers and database administrator, but it is certainly possible.
Of course, there are situations with limited maintenance window time (you must complete the task within fixed X-hours window, and the time is critical). There is always a possibility for that big update to be unexpectedly interrupted - for example, you are executing computational intensive ten-million rows update on a table, only to discover that processing of the last row somehow cause an exception, and the whole statement is rolled back. You can apply a loop with array processing here (maybe even with those infamous cyclic commits), with an exception handler that will cope with any kind of errors, but remember that you can leave the system in inconsistent state - you've "done" your job, but you haven't modified all necessary data after all, so you must somehow deal with those errors and data incosistency, sooner or later. Which strategy would you use to achieve your goals (with all its advantages and disadvantages) - is a matter of your carefully thought out decision (preferably backed by performing authoritative tests).
Parallel execution
Parallel execution features of Oracle DBMS are methods that can cause a significant speed-up of certain operations on big data sets (if applied correctly) - especially on multi-core and multi-processor systems. Many of those features were not meant for use in OLTP environments, as they attempt to use all available resources (by default), thus limiting scalability in heavy-transactional systems. But if we are talking about mass data updates during system maintenance downtime, there will (or should) be no other concurrent transactions taking place, and the processing power of database server will be right there at our demand.
Maybe the simplest way to achieve simultaneous execution is so called "do-it-yourself" parallelism (I often call this "parallelism for the poor"). If you have a couple of independent tasks - for example, updating few non-related tables or processing non-overlapping set of rows in one table - you can submit them in a group of jobs (yes, i'm talking about
dbms_job or dbms_scheduler facility), which will automatically execute them in concurrent sessions, if the job queue has sufficient free processes to handle them all. You can also simply execute them by hand (or using script) in many sessions. Of course you have to be sure that those tasks are truly independent and will not cause concurrency-related issues (such as blocking other transactions with locks, or other resource contention problems that result with wait events). Will it be faster than the serial process? Hopefully yes, but it depends on the kind of operations you are running and available resources, so you must always test and measure it on representative data sets before going into production.Dividing the job into smaller sub-tasks - this is basically what the parallel-enabled Oracle features do - break the table (or index) into several pieces that can be handled independently by many concurrent processes (so called parallel execution servers). Oracle has many features related to parallel execution (though some of them are available only with Enterprise Edition license [4]), but they will be the subject of another post:
- Parallel query
- Parallel DML
- Parallel DDL
- Parallel pipelined functions
DBMS_PARALLEL_EXECUTE(new in 11g release 2) [5]
