Why doesn’t Hibernate automatically update changed objects?

February 7, 2009

Have you ever asked yourself this question? Have you ever been surprised that changes made to a persistent object are not committed to the database? This is a common problem and can be caused by many things. One nasty cause shows up when you use a common Hibernate batch processing pattern in which session.flush() and session.clear() are used to manage memory and improve performance. Do you see the problem with the following code snippet?

public void doBatch() {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    List<Person> personList = session.createQuery("from Person").list();
    int i = 0;
    for (Person person : personList) {
        person.setName("newName"); // this change should be caught by Hibernate and cause an update statement to be generated
        if ( ++i % 20 == 0 ) { // 20, same as the JDBC batch size
            // flush a batch of updates and release memory:
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}

You might expect that every Person object that didn’t already have “newName” as its name would be updated in the database with that new value. That is the correct assumption for the first 20 Person objects. From #21 on, however, nothing will be updated in the database. This is because Hibernate only detects changes for objects held in the first level cache: it compares the current field values of each persistent object against the values it holds in the cache, and when a field changes it schedules an update for that object. Calling session.clear() clears the first level cache, which is great for saving memory, but it also means Hibernate can no longer auto-detect changes to those objects. There are two easy solutions to this problem:

public void doBatch() {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    List<Person> personList = session.createQuery("from Person").list();
    int i = 0;
    for (Person person : personList) {
        String newValue = "newName";
        if (!newValue.equals(person.getName())) {
            session.update(person); // manually reattach: the object may no longer be in the first level cache
        }
        person.setName(newValue);
        if ( ++i % 20 == 0 ) { // 20, same as the JDBC batch size
            // flush a batch of updates and release memory:
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}

Notice the manual check for a value change and the call to session.update(). The manual check prevents scheduling updates for objects whose values have not changed.

Another solution is to use a scrollable result set. With a scrollable result set, a persistent object is not loaded into the first level cache until next() is called, so you do not have to worry about the object being cleared from the cache before you make changes to it.
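Here is a minimal sketch of what that might look like using Hibernate’s ScrollableResults API (the doBatchWithScroll method name is just illustrative):

public void doBatchWithScroll() {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    // each Person is loaded into the session only when next() advances to it
    ScrollableResults results = session.createQuery("from Person")
            .scroll(ScrollMode.FORWARD_ONLY);
    int i = 0;
    while (results.next()) {
        Person person = (Person) results.get(0);
        person.setName("newName"); // the object is in the first level cache, so the change is detected
        if (++i % 20 == 0) { // 20, same as the JDBC batch size
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}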

Hibernate Batch Processing – Why you may not be using it. (Even if you think you are)

April 23, 2008

Hibernate batch processing is powerful, but it has many pitfalls that developers must be aware of in order to use it properly and efficiently. Most people who use batching probably discover it by trying to perform a large operation and finding out the hard way why batching is needed: they run out of memory. Once this is resolved they assume that batching is working properly. The problem is that even if you are flushing your first level cache, you may not be batching your SQL statements.

By default, Hibernate flushes the session at the following times:

  • Before some queries
  • When commit() is executed
  • When session.flush() is executed

The thing to note here is that until the session is flushed and cleared, every persistent object is kept in the first level cache (your JVM’s memory). So if you are iterating over a million objects you will have at least a million objects in memory.

To avoid this problem you need to call the flush() and then clear() methods on the session at regular intervals. The Hibernate documentation recommends that you flush every n records, where n is equal to the hibernate.jdbc.batch_size parameter. A Hibernate batch example shows a trivial batch process. Let’s look at a slightly more complicated example:

public void doBatch() {
    Session session = sessionFactory.openSession();
    Transaction tx = session.beginTransaction();
    for ( int i=0; i<100000; i++ ) {
        Customer customer = new Customer(.....);
        Cart cart = new Cart(...);
        // note we are adding the cart to the customer, so this object
        // needs to be persisted as well
        customer.setCart(cart);
        session.save(customer);
        if ( i % 20 == 0 ) { // 20, same as the JDBC batch size
            // flush a batch of inserts and release memory:
            session.flush();
            session.clear();
        }
    }
    tx.commit();
    session.close();
}

Assuming the Customer cascades save to the Cart object you would expect to see something like this in your SQL logs:

insert into Customer values (...)
insert into Cart values(...)

There are two reasons for batching your Hibernate database interactions. The first is to maintain a reasonable first level cache size so that you do not run out of memory. The second is to batch the inserts and updates so that they are executed efficiently by the database. The example above accomplishes the first goal but not the second.

The problem is that Hibernate looks at each SQL statement and checks whether it is the same statement as the previously executed one. If it is, and the batch_size limit hasn’t been reached, Hibernate adds it to the current JDBC2 batch. However, if your statements look like the example above, Hibernate will see alternating insert statements and will issue an individual insert statement for each record processed. So 1 million new customers would result in a total of 2 million individual insert statements in this case. This is extremely bad for performance.

The solution is very simple. Just add the following two lines to your Hibernate configuration.

<prop key="hibernate.order_inserts">true</prop>
<prop key="hibernate.order_updates">true</prop>        

These two parameters tell Hibernate to sort the insert and update statements before trying to batch them up. So if you have 20 inserts for Customer objects and 20 inserts for Cart objects, they will be sorted and each call to flush will result in two JDBC2 batch executions of 20 statements each.
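For reference, these ordering properties work together with the JDBC batch size setting; a minimal sketch of the three batching-related properties side by side (the value of 20 is just an example and should match your flush interval):

<prop key="hibernate.jdbc.batch_size">20</prop>
<prop key="hibernate.order_inserts">true</prop>
<prop key="hibernate.order_updates">true</prop>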


Hot Stock

February 1, 2008

Wish I would have got in early!


Using Terracotta To Cluster a Single JVM Master/Worker Application

October 5, 2007

I developed a batch processing system for my company that follows the Master/Worker design pattern. Of course, I did this before things like Spring Batch or Open Data Grid were available. It now works well enough (robust, task retry, generic workers, generic task partitioning, etc.) that it is hard for me to justify ripping it out and replacing it with one of the open source alternatives now available. However, I’d still like to take the system to the next level and distribute tasks among multiple JVMs (and machines) rather than running everything in a single multi-threaded JVM. I’ve done test implementations in the past using Spring and JMS and that worked just fine, but this blog inspired me to give Terracotta a try.

The basic approach I took when developing our batch processing framework was to use the Java 1.5 concurrency package, and especially the ExecutorService, to simplify the concurrency and threading issues inherent in a parallel processing framework. The framework uses four basic components: a Task, a TaskPartitioner, a TaskExecutor, and a TaskMaster.

public interface Task {
...
}

public interface TaskPartitioner {
   public Task getTasks(Task parentTask) throws PartitionException;
}

public interface TaskExecutor extends Runnable {
   /**
    * Execute a single task.  
    * This task should be a leaf-node that is 
    * currently in executable state. 
    * The task will be treated as an independent
    * unit-of-work and any transactional operations will
    * be committed upon completion of task execution
    */
   public void executeTask(Task task) throws BatchException;
}

public interface TaskMaster {
   /**
    * This method takes a root task, partitions it into sub-tasks
    * and then executes those sub-tasks when they enter
    * an executable state.  
    */
   public void runTask(Task rootTask) throws BatchException;
}

The basic idea is that a client submits a Task and a TaskPartitioner, which are processed by the batch framework. The framework splits the Task into sub-tasks and may continue splitting those sub-tasks until it gets to an executable child. I’ll write about the TaskIterator in another post. As Tasks become executable they are submitted to the ExecutorService. Here is a very basic implementation of the TaskMaster interface that takes a root task and submits the executable sub-tasks to the ExecutorService when they are in an executable state.

public class SimpleTaskMaster implements TaskMaster {
   private final ExecutorService executorService = Executors.newFixedThreadPool(5);

   public void runTask(Task rootTask) throws BatchException {

       final TaskIterator taskIterator = new TaskIterator(rootTask);
       while (taskIterator.hasNext()) {
           Task task = taskIterator.next();
           if (task == null) {
               // this means we have tasks
               // waiting to execute but they
               // are blocked by other tasks,
               // most likely serial dependencies
               try {
                  Thread.sleep(100);
               } catch (InterruptedException ie) {}
               continue;
           }
           executorService.execute(new BasicTaskExecutor(task));
       }
   }
}

Now that the basic framework is defined, we can take it to the next level by replacing the SimpleTaskMaster with a slightly modified version that enables the introduction of Terracotta in the next step. By replacing the direct use of the ExecutorService with a BlockingQueue, we can distribute the queue and have our workers pull from it, in the same JVM or a different one, using Terracotta’s clustering capabilities.

public class SimpleWorkQueue implements WorkQueue {
   private final BlockingQueue<Task> workQueue = new LinkedBlockingQueue<Task>();

   public Task getWork() throws InterruptedException {
     return workQueue.take(); // blocks if empty
   }

   public void addWork(Task executableTask) {
     try {
       workQueue.put(executableTask);
     } catch (InterruptedException ie) {
       Thread.currentThread().interrupt();
     }
   }
}

public class BasicTaskMaster2 implements TaskMaster {
   private final WorkQueue workQueue;

   public BasicTaskMaster2(WorkQueue workQueue) {
     this.workQueue = workQueue;
   }

   public void runTask(Task rootTask) throws BatchException {

       final TaskIterator taskIterator = new TaskIterator(rootTask);
       while (taskIterator.hasNext()) {
           Task task = taskIterator.next();
           if (task == null) {
               // this means we have tasks
               // waiting to execute but they
               // are blocked by other tasks,
               // most likely serial dependencies
               try {
                  Thread.sleep(100);
               } catch (InterruptedException ie) {}
               continue;
           }
           // instead of directly using the ExecutorService
           // we add tasks to the workQueue, which
           // may or may not be distributed across
           // multiple JVMs.  The nice part is that it works
           // in a clustered or non-clustered environment
           workQueue.addWork(task);
       }
   }
}

OK, finally… Time to use Terracotta to cluster this thing. Since I was already using Spring for DI, it made perfect sense to use it to help cluster the application. Doing so was surprisingly simple. First I had to create the Terracotta configuration that distributes the workQueue.

<tc:tc-config xmlns:tc="http://www.terracotta.org/config"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.terracotta.org/schema/terracotta-4.xsd">
    <servers>
        <server host="%i" name="sample"/>
    </servers>
    <clients>
        <logs>%(user.home)/terracotta/client-logs/spring/coordination/%D</logs>
    </clients>
    <application>
        <spring>
            <jee-application name="*">
                <instrumented-classes>
                    <include>
                        <class-expression>com.quantumretail.qlogic.batch.Task</class-expression>
                    </include>
                </instrumented-classes>

                <locks>
                    <autolock>
                        <method-expression>* *..*.*(..)</method-expression>
                    </autolock>
                </locks>

                <application-contexts>
                    <application-context> 
                        <paths>
                            <path>*-context.xml</path>
                        </paths>
                        <beans>
                            <bean name="workQueue"/>
                        </beans>
                    </application-context>
                </application-contexts>
            </jee-application>
        </spring>
    </application>
</tc:tc-config>

The XML above basically tells Terracotta to distribute the Spring bean named workQueue. I’ve oversimplified a bit, because you will also need to tell Terracotta to instrument any classes that may be referenced by Task, since those objects will be distributed as well. The Spring configuration itself is pretty simple: just define the workQueue bean and inject it into the BasicTaskMaster2 bean, something like the sketch below.
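A minimal sketch of what master-context.xml might look like, assuming constructor injection and re-using the package name from the Terracotta config above (the taskMaster bean name is illustrative):

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">

    <!-- the bean Terracotta distributes (matches <bean name="workQueue"/> above) -->
    <bean name="workQueue" class="com.quantumretail.qlogic.batch.SimpleWorkQueue"/>

    <!-- the master partitions tasks and feeds the shared queue -->
    <bean name="taskMaster" class="com.quantumretail.qlogic.batch.BasicTaskMaster2">
        <constructor-arg ref="workQueue"/>
    </bean>

</beans>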

Now that the Master side of the Master/Worker is ready for clustering, we have to define the Worker. The Worker is designed to run in one or more JVMs and simply sits and waits for Tasks to become available. When one is available it executes it and then looks for the next task.

public class SimpleTaskExecutor implements TaskExecutor {
   private final WorkQueue workQueue;
   // the clustered workers can be multi-threaded as well
   private transient ExecutorService executorService =
         Executors.newFixedThreadPool(5);

   public SimpleTaskExecutor(final WorkQueue workQueue) {
      // the injected workQueue may be a local or
      // clustered object, it doesn't matter to this class
      this.workQueue = workQueue;
   }

   public void start() {
      while (true) {
         try {
            final Task task = workQueue.getWork(); // blocks until work is available
            executorService.execute(...) // submit a Runnable that calls executeTask(task)
         } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return;
         }
      }
   }
}

At this point the clustered framework is ready to run. So once the Terracotta server is running, we start the Master JVM, which will add work to the workQueue.

public class Master {
   public static void main(String[] args) {
      new ClassPathXmlApplicationContext("master-context.xml");
   }
}

And then run one-to-n workers.

public class Worker {
   public static void main(String[] args) {
      ApplicationContext ctx =
          new ClassPathXmlApplicationContext("worker-context.xml");
      SimpleTaskExecutor taskExecutor = (SimpleTaskExecutor) ctx.getBean("simpleTaskExecutor");
      taskExecutor.start();
   }
}

Now we have a clustered Master/Worker application that is ‘Ready for Work’. Obviously I’ve skipped a ton of details, and as described the application would not yet be a robust, fault-tolerant batch processing framework, but it should give you an idea of how easy it is to take a single-JVM application and turn it into a clustered application with a little refactoring and a little XML.


SimpleDateFormat is not so Simple

September 22, 2007

I came across this blog, Another Code Treasure, today. It shows an example of a poorly written method to validate a date and then offers a better alternative. Here is the better alternative provided in the blog:

private static SimpleDateFormat mySpanishDateFormatter =
        new SimpleDateFormat("ddMMMyyyy", new Locale("es"));

public static boolean isSpanishDateStringValid(String dateStr) {
    if (dateStr == null)
        return false;
    mySpanishDateFormatter.setLenient(false);
    try {
        mySpanishDateFormatter.parse(dateStr);
        return true;
    } catch (ParseException e) {
        return false;
    }
}

This code is fine as long as you have a single-threaded application. But what happens if you have a multi-threaded application? The answer is we don’t know, because SimpleDateFormat is not thread-safe. The code above will likely cause an exception to be thrown or produce incorrect results. The fix is to store the SimpleDateFormat in a ThreadLocal variable. This way each thread gets its own instance of the SimpleDateFormat. Here is an example of a thread-safe implementation of the date validation code:

private static final ThreadLocal<DateFormat> threadLocalFormat =
    new ThreadLocal<DateFormat>();

private static DateFormat getDateFormat() {
    if (threadLocalFormat.get() == null) {
        threadLocalFormat.set(new SimpleDateFormat("ddMMMyyyy", new Locale("es")));
    }
    return threadLocalFormat.get();
}

public static boolean isSpanishDateStringValid(String dateStr) {
    if (dateStr == null)
        return false;
    DateFormat format = getDateFormat(); // each thread gets its own instance
    format.setLenient(false);
    try {
        format.parse(dateStr);
        return true;
    } catch (ParseException e) {
        return false;
    }
}

Don’t Let Your Non Thread-Safe Variables Escape!

September 16, 2007

I was asked to help debug a problem in a project that was about to be deployed. It was one of those ‘fun’ transient problems that only happen when you can least afford it, like when the customer is testing the product.

The project consists of a JSP/AJAX web tier that connects to the DB via a JPA EntityManager. The problem we had was that every now and then the user would see a nasty assertion error for an uncaught exception displayed in their browser. (Why the exception was not handled nicely is another subject worth writing about.)

The code uses a custom static class to provide access to the EntityManager (using a DI framework instead would be nice). It has some actions that span multiple calls, so the EntityManager for each thread was stored in a ThreadLocal variable. This should ensure that each thread gets its own EntityManager, and hence its own transaction. I noticed a few synchronization problems when I first looked at the class. For example:

public static EntityManager getEntityManager() {
  EntityManager entityManager = managerThreadLocal.get();
  if (entityManager == null || !entityManager.isOpen()) {
    if (entityManagerFactory == null) {
      // unsynchronized lazy initialization: two threads can
      // race here and each create its own factory
      entityManagerFactory =
        Persistence.createEntityManagerFactory("myDS");
    }
    entityManager = entityManagerFactory.createEntityManager();
    managerThreadLocal.set(entityManager);
  }
  return entityManager;
}

So to fix this we made the entityManagerFactory a private static final member variable of the class. Since the entityManager itself is ThreadLocal, we shouldn’t need to synchronize access to it, since only one thread should ever be accessing its own EntityManager at any one time.
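A sketch of the revised lookup with the factory created eagerly in a static final field (the persistence unit name "myDS" is carried over from the snippet above, and the ThreadLocal declaration is shown only to make the sketch self-contained):

private static final EntityManagerFactory entityManagerFactory =
    Persistence.createEntityManagerFactory("myDS"); // created once, safely, at class load time

private static final ThreadLocal<EntityManager> managerThreadLocal =
    new ThreadLocal<EntityManager>();

public static EntityManager getEntityManager() {
  EntityManager entityManager = managerThreadLocal.get();
  if (entityManager == null || !entityManager.isOpen()) {
    entityManager = entityManagerFactory.createEntityManager();
    managerThreadLocal.set(entityManager);
  }
  return entityManager;
}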

Of course this did not fix our main problem. The real problem turned out to be that the method above breaks thread confinement. Sound the alarms! We have an escaping variable! Exposing a public method that returns a ThreadLocal-scoped variable is dangerous because we have no control over what other classes in the system will do with that variable once it is returned. In this project the problem turned out to be that one class adopted this ThreadLocal-scoped variable and turned it into a member variable. So sometimes it worked, and other times it didn’t.

The best way to resolve this is to not let the entityManager escape. Encapsulate all access to this variable inside a class that is responsible for ensuring thread-safe access to the non-thread-safe variable. This solution is cumbersome and adds a layer of abstraction that isn’t core to the business logic, but it is certainly much better than transient assertion errors visible to the customer. Switching to a declarative transaction management implementation such as this one, Transaction Management Using Spring, is a better solution in the long run.
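As a rough sketch of that encapsulation idea (class and method names here are hypothetical, and it ignores the multi-call actions mentioned earlier), the EntityManager never leaves the helper class; callers hand it a unit of work instead:

public class EntityManagerTemplate {

  private static final EntityManagerFactory entityManagerFactory =
      Persistence.createEntityManagerFactory("myDS");

  // the EntityManager stays confined to this class; callers never see it
  public <T> T execute(EntityManagerCallback<T> callback) {
    EntityManager entityManager = entityManagerFactory.createEntityManager();
    EntityTransaction tx = entityManager.getTransaction();
    try {
      tx.begin();
      T result = callback.doInEntityManager(entityManager);
      tx.commit();
      return result;
    } catch (RuntimeException e) {
      if (tx.isActive()) {
        tx.rollback();
      }
      throw e;
    } finally {
      entityManager.close();
    }
  }

  public interface EntityManagerCallback<T> {
    T doInEntityManager(EntityManager entityManager);
  }
}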