|
|
Features Java Persistence on the Grid: Approaches to Integration
JPA - the enterprise standard for accessing relational data in Java
By: Shaun Smith
Jun. 2, 2009 04:00 PM
The Java Persistence API (JPA) is the enterprise standard for accessing relational data in Java. JPA provides support for mapping Java objects to a database schema and includes a simple programming API and expressive query language for retrieving mapped entities from a database and writing back changes made to these entities. JPA offers developers productivity gains over writing and maintaining their own mapping code allowing a single API regardless of the platform, application server, or persistence provider implementation. Besides the productivity gains the leading implementations offer developers valuable performance and scalability benefits through the inclusion of caching solutions. These caching solutions allow frequently accessed entities to be cached which reduces the number of queries going to the database and the amount of processing time spent converting database query results into objects. Caching can have a significant positive effect on application performance. JPA and Data Grids The relevance of data grids to today's enterprise applications is huge, yet their usage is still limited to technology specialists. Data grids are becoming mainstream and developers should consider grid architectures when developing applications and be aware that an application might be expected to scale up to a grid in the future. Consider a banking system that processes incoming deposit and withdrawal requests by validating all fields before writing them to the database. Validation might include whether the account is valid, whether the requester is the account owner, whether the account contains sufficient funds for the request, etc. You can imagine there are many other validations that could be performed in such a system. The amount of data you have to read from the database to perform the validation of a single request can be significant and result in a large number of queries. Fortunately building such a database-centric application in JPA is straightforward. You map each of the classes in your domain to the database and write the necessary JP QL queries to retrieve the objects required for validation. The system may have to read large amounts of data from the database to process each request, but it works. Now if we want to dramatically increase the throughput of this system we'd have to address its single greatest bottleneck: querying the database for validation data. Most JPA implementations either provide an L2 cache or support the integration of third-party L2 caches. But if we have to handle very large numbers of requests that arrive in a random order it's unlikely we'll have the required reference data in cache. Caches are useful when you're repeatedly accessing the same data. If your access pattern is random then it's unlikely your cache will contain what you need when you need it. Of course you can always increase your cache size to better your odds of a hit, but each server only has so much heap. Data grids provide a way to exceed the heap size of a single server and distribute your cached objects over a cluster of servers. The challenge is to integrate data grid technology with JPA to increase throughput without requiring complete application rewrites. Of course as is typically the case with software systems, there's more then one approach to integration, each with its advantages and disadvantages. Let's look at different integration architectures and how we could use them. Data Grid as Middle Tier Object Cache For example, consider a simple validation method in our banking system: public boolean isValidAccount(Request request) { With the data grid integrated as the L2 cache, the find() will check the grid for the desired Account. If not found, it can then proceed to query the underlying database. However, if the grid is warmed with all the Accounts then there will be no need to query the database. Warming the appropriate caches can eliminate database access from the validation process entirely. Primary key finds are easily directed to the data grid but what about JP QL queries? Consider this method, which finds the Customer associated with a request using a non-primary key query: public Customer getTxCustomer(Request request) Querying the data grid for an object that matches an arbitrary criterion is problematic. First it requires that the data grid provides some sort of query framework and second that the JPA/data grid integration can translate from JP QL into this framework. If both requirements are satisfied then it's possible that the query in our example could be directed to the grid and not the database. One of the most valuable features of this approach is the possibility of parallel query execution. It stands to reason that the query in our example could be executed in parallel on all the servers in the grid to find the desired object. However, a query that returns many objects is much more interesting. Each grid server could execute a query in parallel to identify those objects it holds that match a given criteria. Performing such a query 10 times in parallel on 10,000 objects is going to be much faster than one time on 100,000 objects. The more servers the smaller the number of objects on each server and the faster the query executes! Unfortunately there's one complication with queries that return multiple results. Unlike a primary key find() in which a cache miss could automatically result in a database query, it's not clear whether the results obtained from the grid are sufficient. Perhaps only half of the objects you're looking for are in the grid so a grid query wouldn't return the other half from the database. Warming the cache solves this problem by ensuring all objects are in the grid but that isn't always possible. However, for a given use case, you may know whether a particular query should be directed to the grid or to the database. The way you effect query execution in JPA is through query hints. Perhaps something like: Customer customer = entityManager Of course there's no standard JPA hint for whether to direct a query to a data grid or not. This means that you'd have to introduce implementation-specific hints into your code. Fortunately the JPA specification requires that implementations ignore hints they don't understand so your code isn't tightly coupled to any particular one through hints. Updating Objects Data Grid as System of Record In this architecture, all JPA operations that would normally have resulted in SQL directed to the database are directed instead to the data grid. This includes all queries and all updates. Essentially we replace the database entirely with the data grid. With JP QL translation support we can continue to use JPA as our programming API while working with data stored exclusively in the middle tier. For systems that don't need long-term persistent storage this is ideal. And if more storage or query performance is required you simply add servers to the grid. Database-backed Data Grid Mix and Match - Heterogeneous Configuration If we can configure how to read/write/query on an Entity-by-Entity basis we can mix architectures. Consider a stock trading application. In such an application you've got "enduring" Entities like Companies, Stocks, and Bonds. But you've also got transient Entities like Bids and Asks. An Entity-level configuration would enable JPA to use the data grid as a cache for persistent Entities, like Company, and the data grid as the system of record for transient Entities like Bids. Scaling JPA with a Data Grid Traditionally, scaling up a JPA application is done by increasing the number of servers in the application cluster and using a load balancer to distribute the work evenly. But as you increase the cluster size you are limited in what you can cache without introducing inter-process messaging and locking. Updates to shared data must be communicated to all cluster servers to ensure no JPA caches contain stale data. For a cluster with N servers this means each update will require N-1 messages. As you increase the number of servers in the cluster the cost of processing a single concurrent update per server increases quadratically according to (N-1)² because each server must message every other server for every update. Worse still, as the cluster grows each server will have to spend a significant amount of its available processing time dealing with incoming update messages. These non-linear communication and update processing costs means that while traditional approaches to clustering JPA applications that employ caching do work well, they are limited to small-to-medium-size clusters. A data grid solves this communication problem by having only one shared copy of an object accessible from all servers. An update doesn't require messaging to all servers because they'll each pick up the change next time (if ever) they need the updated object. In a data grid with a scalable peer-to-peer communication architecture (i.e., one without a central message routing bottleneck) an update requires communicating to the server that stores the object and to the server(s) that stores a backup copy in case of failover. In this case, the communication cost for processing a single concurrent update per server is described by the linear function C(N) where C is a constant reflecting the number of copies (primary and backups). This linear update cost means that it's possible to scale JPA application using a data grid to large clusters and achieve much higher throughput than would typically be possible. Challenges As discussed earlier, staleness due to updates made in other cluster servers is traditionally solved by messaging although it has its limitations. In high-transaction-rate systems where the messaging overhead is significant, JPA applications tend to minimize their cache usage and rely on the database to ensure they have the most recent version of the data. Ironically, as transaction rates increase and the value of caching increases it's often disabled because the cost of maintaining cache coherence is too expensive. The use of a data grid to virtually eliminate the messaging and update processing overhead means that high-transaction-rate systems can take advantage of caching to achieve even higher throughput without having to manage staleness. Querying is another challenge. JPA defines the general-purpose JP QL that is in many ways similar to SQL and includes many of the same notions. The goal of JP QL is to provide an object-based query language that's easy to translate into SQL for execution on a relational database. Of course data grids aren't relational databases and each has its own query framework. The extent that JP QL can be translated and executed on a particular grid depends on the expressiveness of the grid's query framework. Another challenging area is object relationships. JPA supports a number of relationship types along with the notion of embedded objects. Relationship support varies by data grid product and each has its subtleties. Issues include: what kind of relationships are supported; whether objects can have relationships across the grid or must be co-located; and what query operators are supported on relationships. The answer to this last question obviously has a big impact on what kind of JP QL queries can be executed. This list is definitely not exhaustive but it highlights the kinds of issues that have an impact on JPA/data grid integration. Conclusion References Reader Feedback: Page 1 of 1
|
|||