Hibernate Batch Processing – Why you may not be using it. (Even if you think you are)

April 23, 2008

Hibernate batch processing is powerful, but it has many pitfalls that developers must be aware of in order to use it properly and efficiently. Most developers discover batching by attempting a large operation and learning the hard way why it is needed: they run out of memory. Once this is resolved, they assume that batching is working properly. The problem is that even if you are flushing your first-level cache, you may not be batching your SQL statements.

By default, Hibernate flushes the session in the following situations:

  • Before some queries
  • When commit() is executed
  • When session.flush() is executed

The thing to note here is that every persistent object is placed into the first-level cache (your JVM's memory) and stays there until the session is flushed and cleared. So if you are iterating over a million objects, you will have at least a million objects in memory.
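
For contrast, here is a minimal sketch of the naive version (the Customer class, its constructor arguments, and the sessionFactory are placeholders borrowed from the example below); nothing is ever evicted from the session, so memory grows with every iteration:

// Naive bulk insert: every saved Customer stays attached to the session's
// first-level cache until commit, so memory use grows linearly.
Session session = sessionFactory.openSession();
Transaction tx = session.beginTransaction();
for ( int i = 0; i < 1000000; i++ ) {
    Customer customer = new Customer(.....);
    session.save(customer);   // cached in the session, never released
}
tx.commit();    // the flush finally happens here, far too late
session.close();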

To avoid this problem you need to call the flush() and then clear() methods on the session at regular intervals. The Hibernate documentation recommends that you flush every n records, where n is equal to the hibernate.jdbc.batch_size parameter, and its batch processing example shows a trivial case. Let's look at a slightly more complicated example:

public void doBatch() {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    for ( int i = 0; i < 100000; i++ ) {
        Customer customer = new Customer(.....);
        Cart cart = new Cart(...);
        // note we are adding the cart to the customer,
        // so this object needs to be persisted as well
        customer.setCart(cart);
        session.save(customer);
        if ( i % 20 == 0 ) { // 20, same as the JDBC batch size
            // flush a batch of inserts and release memory:
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}

Assuming the Customer cascades its save to the Cart object, you would expect to see something like this in your SQL logs:

insert into Customer values (...)
insert into Cart values(...)

There are two reasons for batching your Hibernate database interactions. The first is to maintain a reasonable first-level cache size so that you do not run out of memory. The second is to batch the inserts and updates so that they are executed efficiently by the database. The example above accomplishes the first goal but not the second.

The problem is that Hibernate looks at each SQL statement and checks whether it is the same as the previously executed statement. If it is, and the batch_size limit has not been reached, Hibernate batches the two statements together using JDBC2 batching. However, if your statements look like the example above, Hibernate sees alternating insert statements and executes an individual insert for each record processed. So one million new customers would mean a total of two million individual insert statements in this case. This is extremely bad for performance.
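
For context, "JDBC2 batch" refers to the batching API introduced in JDBC 2.0. A simplified sketch of what Hibernate does under the hood when consecutive statements match (the table columns here are hypothetical):

PreparedStatement ps = connection.prepareStatement(
        "insert into Customer (id, name) values (?, ?)"); // hypothetical columns
for ( int i = 0; i < 20; i++ ) {
    ps.setLong(1, i);
    ps.setString(2, "name-" + i);
    ps.addBatch();    // queued on the client side, nothing sent yet
}
ps.executeBatch();    // all 20 inserts travel to the database together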

The solution is very simple. Just add the following two lines to your Hibernate configuration:

<prop key="hibernate.order_inserts">true</prop>
<prop key="hibernate.order_updates">true</prop>        

These two parameters tell Hibernate to sort the insert and update statements before trying to batch them up. So if you have 20 inserts for a Customer object and 20 inserts for a Cart, those will be sorted and each call to flush will result in two JDBC2 batch executions of 20 statements each.
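
One caveat worth noting: ordering only pays off when JDBC batching is enabled in the first place, which requires hibernate.jdbc.batch_size to be set as well. As a rough sketch, the equivalent programmatic configuration (assuming a plain Hibernate Configuration rather than the Spring-style properties above; mappings and connection settings omitted) might look like this:

Configuration cfg = new Configuration()
    .setProperty("hibernate.jdbc.batch_size", "20")   // enable JDBC2 batching
    .setProperty("hibernate.order_inserts", "true")   // sort inserts by entity
    .setProperty("hibernate.order_updates", "true");  // sort updates by entity
SessionFactory sessionFactory = cfg.buildSessionFactory();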