Support for Query Batches

squirm32
User
Posts: 7
Joined: 06-Aug-2009
# Posted on: 16-Oct-2010 14:15:48   

Is there a way to perform multiple queries in a single round trip to the database similar to what NHibernate does with future queries?

I'm using the 3.0 version of LLBLGen.

Otis
LLBLGen Pro Team
Posts: 39861
Joined: 17-Aug-2003
# Posted on: 17-Oct-2010 12:06:56   

Future queries are meant to fetch related entities, am I correct? We use prefetch paths for that (eager loading). LLBLGen Pro doesn't use query batching because we need per-entity functionality to work, e.g. validation/authorization/auditing, and that's not possible when batch-based fetching is used.

Frans Bouma | Lead developer LLBLGen Pro
squirm32
User
Posts: 7
Joined: 06-Aug-2009
# Posted on: 17-Oct-2010 16:26:38   

I'm not familiar with NHibernate, but I believe future queries simply let you batch statements into one round-trip. They are not necessarily for related data. Here are a few very simple examples of what NH does. The first is unrelated data and the second is for related data.

http://elegantcode.com/2009/01/25/nhibernate-and-future-queries/
http://gnschenker.blogspot.com/2007/06/optimize-nhibernate-queries.html

Let me back up and give you a bit more information about my underlying motivation for asking this question. Our system experiences an extremely high transactional volume. As such, we are always looking for ways to decrease the load on our DB servers. I have been told by co-workers who conducted tests that, at least in some situations, when executing queries that involve joins, we can save a few milliseconds per call overall by doing selects to each of the individual tables and then joining the results in memory on the calling machine vs. letting the DB server do the joins for us. That seems hard for me to believe but, assuming it's true, we could reduce our load greatly by executing queries in this manner.

I am the resident expert on LLBLGen at my workplace so the question posed to me was, "Can LLBLGen do multiple, individual queries to retrieve related data?" My answer was, "yes" by using prefetch paths. However, those execute in separate round-trips which would obviously negate any performance gain we would see. The real solution seemed to be having prefetch paths execute in a single round-trip. As I typed an alternate version of this response, I started to realize that this was not going to solve the problem either because the DB server would still have to do joins to get the related data. There's no way to get around joins generally in prefetch paths OR in the NHibernate examples I referenced above.

I think the real solution I'm after would be to allow multiple entity fetches to be executed in the same batch (round-trip). That would give the developer the flexibility to ask for any data they want in a single db call, get the convenience of having the results projected into entities automatically, and allow the developer to assemble/relate those as desired.

I admittedly don't understand all the ramifications this would have on other aspects of the LLBLGen system, but on the surface it seems reasonable.

Otis
LLBLGen Pro Team
Posts: 39861
Joined: 17-Aug-2003
# Posted on: 18-Oct-2010 11:28:02   

squirm32 wrote:

I'm not familiar with NHibernate, but I believe future queries simply let you batch statements into one round-trip. They are not necessarily for related data. Here are a few very simple examples of what NH does. The first is unrelated data and the second is for related data.

http://elegantcode.com/2009/01/25/nhibernate-and-future-queries/
http://gnschenker.blogspot.com/2007/06/optimize-nhibernate-queries.html

OK, but where is the 'advantage'? The single roundtrip? In LLBLGen Pro you can keep the connection open if you want, making a second round trip almost instantaneous, with no downsides.

Adapter makes this easy. Just set 'DataAccessAdapter.KeepConnectionOpen' to true, fetch multiple sets, set it to false and call CloseConnection on the DataAccessAdapter instance.
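For illustration, a minimal sketch of that pattern in adapter code (CustomerEntity/OrderEntity are placeholder names from a hypothetical generated model, not anything specific to your schema):

using (var adapter = new DataAccessAdapter())
{
    // keep the connection open across multiple fetches
    adapter.KeepConnectionOpen = true;
    try
    {
        var customers = new EntityCollection<CustomerEntity>();
        adapter.FetchEntityCollection(customers, null);   // first fetch opens the connection

        var orders = new EntityCollection<OrderEntity>();
        adapter.FetchEntityCollection(orders, null);      // second fetch re-uses the same open connection
    }
    finally
    {
        adapter.KeepConnectionOpen = false;
        adapter.CloseConnection();
    }
}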

Let me back up and give you a bit more information about my underlying motivation for asking this question. Our system experiences an extremely high transactional volume.

... which means a lot of inserts/updates, or a lot of selects? That makes a difference.

As such, we are always looking for ways to decrease the load on our DB servers. I have been told by co-workers who conducted tests that, at least in some situations, when executing queries that involve joins, we can save a few milliseconds per call overall by doing selects to each of the individual tables and then joining the results in memory on the calling machine vs. letting the DB server do the joins for us. That seems hard for me to believe but, assuming it's true, we could reduce our load greatly by executing queries in this manner.

Doing joins outside the DB is not more efficient. The DB can use indexes to join sets and cached hash-sets to do the nested loop over the hashes to join sets together. Furthermore, everything happens inside the DB's memory, so there's no data transport to the outside. Mind you: the data joined is often not materialized into objects on the client; you join to filter on a related entity. If you fetch a lot of projections on joins, you should look into materialized/indexed views. These are read-only sets which are stored on-disk as-is, and can greatly increase performance on joined sets. Doing joins outside the DB has higher costs with respect to data transport and join logic; the only advantage is that you use CPU/memory outside the DB server, but that's about it.

I am the resident expert on LLBLGen at my workplace so the question posed to me was, "Can LLBLGen do multiple, individual queries to retrieve related data?" My answer was, "yes" by using prefetch paths. However, those execute in separate round-trips which would obviously negate any performance gain we would see.

They execute in separate queries but the connection is kept open, so the difference is minimal: the SQL has to be generated anyway, the SQL also has to be interpreted by the RDBMS anyway, and the resultset has to be returned anyway. The only extra is an extra connection made at a low API level, which really isn't the cost of your query (and definitely not a couple of ms), as the connection is already open: there's already a listener active at the RDBMS level, ready to interpret commands.

Furthermore: prefetch paths generate more efficient queries in a lot of cases, because they are generated based on the data of the parent node. This means that if the # of elements in the parent node is below the threshold, it will generate an IN (...) clause instead of a subquery, which is much quicker. Queries which are executed together can't do that and always have to fall back onto subqueries.
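The threshold itself is a setting on the adapter. As a rough sketch (the exact property name/spelling can differ per runtime version, so double-check against your installation):

// parent result sets at or below the threshold produce an IN (@p1, ..., @pN) clause
// for the child node's query instead of a subquery
adapter.ParameterizedPrefetchPathThreshold = 100;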

We fetch 1 node at a time because this is the most efficient way to do it: multi-branch graphs and 1:n relationships cause data duplication if you don't fetch per node, which makes the resultset that has to be returned (also with futures!) potentially extremely big. Data transport is often a slowdown in queries, as it takes a while (relatively speaking) to get data transported to the client, and if this is brought down to a minimum, overall performance will go up.

The real solution seemed to be having prefetch paths execute in a single round-trip. As I typed an alternate version of this response, I started to realize that this was not going to solve the problem either because the DB server would still have to do joins to get the related data. There's no way to get around joins generally in prefetch paths OR in the NHibernate examples I referenced above.

I think the real solution I'm after would be to allow multiple entity fetches to be executed in the same batch (round-trip). That would give the developer the flexibility to ask for any data they want in a single db call, get the convenience of having the results projected into entities automatically, and allow the developer to assemble/relate those as desired.

Multiple, unrelated resultsets really can only work with MARS; otherwise you need to add columns to the single resultset to return the subsequent resultsets of the subsequent queries executed. This is not efficient at all, because it's likely the sets don't have the same # of rows. As 1 set is returned, the # of rows returned is equal to the # of rows in the largest subset.

But again, the real gain is already available: a single open connection. All commands are already executed over the same connection which is open if you ordered LLBLGen Pro to keep the connection open.

Still, it's really not a wise idea to join on the client to gain performance, simply because data transport and less efficient joins (no index usage) are slower, so overall your queries are slower.

IMHO what you should look into is lowering the number of locks on tables. With high-transaction-volume databases, it's key that your system can keep inserting rows without blocking selects. One idea that can help is, for example, to create two databases: one for reading and one for writing. The reading one is read-only and only used for selects; it uses materialized/indexed views and the built-in indexed view system of the RDBMS, which auto-updates indexed views when a table changes due to an update. Alternatively, use a set of tables instead of views and rebuild the read-only database in those tables a couple of times a day or once every night, depending on whether you need the data to be very up to date or not.

Often, data for read purposes (not for modification!) can be stale for a little while, which greatly reduces the pressure on databases because you won't have blocking selects that have to wait while updates/inserts run on the same table. Also (but this is a no-brainer): make sure all fields touched in filters are indexed.

Frans Bouma | Lead developer LLBLGen Pro
squirm32
User
Posts: 7
Joined: 06-Aug-2009
# Posted on: 18-Oct-2010 17:28:22   

Thanks for your detailed and thoughtful response.

Your point about leaving the adapter open for multiple queries is well taken. I went back to look at the NH examples to see if that was the difference. If they dispose of the connection between each query then that would obviously explain their need for a batching mechanism. I understand this limitation does not exist in LLBLGen. I assume the session object in NH is equivalent to the adapter and would provide similar ability to keep the connection open though, so I now struggle to see the real need for the NH future query support.

There may be more than an academic difference (however negligible) between the execution times of running two queries in the same batch and returning them at the same time vs. running each one individually (even on the same open connection). However, the only way fetching related data could be optimized over the current prefetch functionality would be in extremely limited cases.

For example, if we have...

TableA:
ID (PK)

TableB:
ID (PK)
A_ID (FK to TableA.ID)

... and we want to retrieve a single entity A and its related B entities using A's PK, then we could do...

SELECT * FROM TableA WHERE ID = [value]
SELECT * FROM TableB WHERE A_ID = [value] AND [Some other optional filter on B]

Notice here that I'm not going to do joins on the resultsets on the client because I don't really have anything to join. In this specific case, the optimal way to execute these two statements would be to submit a single batch to the server that returned two resultsets. In this specific case, this would also have to be more efficient than using prefetch paths because I don't have the overhead of having to run any prefetch path code. Again, I do realize that this statement only holds true when we are only filtering TableA on columns also found in TableB.
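To make the idea concrete, here's a plain ADO.NET sketch (outside of LLBLGen) of one batch that returns two resultsets; the connection string, id value and materialization code are placeholders, and the optional extra filter on TableB is omitted:

using System.Data.SqlClient;

static void FetchAWithRelatedB(string connectionString, int idValue)
{
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(
        "SELECT * FROM TableA WHERE ID = @id; " +
        "SELECT * FROM TableB WHERE A_ID = @id;", conn))
    {
        cmd.Parameters.AddWithValue("@id", idValue);
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read()) { /* materialize the single TableA row */ }
            reader.NextResult();   // advance to the second resultset returned by the same round trip
            while (reader.Read()) { /* materialize the related TableB rows */ }
        }
    }
}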

I concede that I have not done tests to definitively measure the difference in execution speed between running two separate commands in the same batch and running them individually in different batches on the same open connection (as you suggested earlier). I suspect you may have tested this and concluded that any savings weren't worth the effort. Still, it strikes me that this pattern is frequently used to retrieve data, and even if there were a 1 ms difference, given our transaction volume, we might see an overall benefit.

Walaa
Support Team
Posts: 14993
Joined: 21-Aug-2005
# Posted on: 19-Oct-2010 10:27:09   

In this specific case, this would also have to be more efficient than using prefetch paths because I don't have the overhead of having to run any prefetch path code.

What prefetch path code would need to be run? Using a prefetch path would result in the same thing: 2 queries sent over the same connection and 2 resultsets returned.

squirm32
User
Posts: 7
Joined: 06-Aug-2009
# Posted on: 19-Oct-2010 13:25:06   

No prefetch code would need to be run. I was elaborating on the scenario where both of those queries could be submitted to SQL Server in the same batch instead of individually. With prefetch paths you have this:
1. Open DB connection
2. SELECT * FROM TableA WHERE ID = [value]
3. Receive results from query.
4. SELECT * FROM TableB WHERE A_ID = [value] AND [Some other optional filter on B]
5. Receive results from query.
6. Close DB connection.

What I was proposing was a way to do this:
1. Open DB connection
2. Submit batch with both select queries at one time.
3. Receive multiple result set response from DB.
4. Close DB connection.

I understand Frans's assertion that there is no significant difference between these two operations. I still think there may be utility in being able to have multiple queries run in the same batch, but I am fine with his answer.

Otis
LLBLGen Pro Team
Posts: 39861
Joined: 17-Aug-2003
# Posted on: 19-Oct-2010 16:32:58   

squirm32 wrote:

No prefetch code would need to be run. I was elaborating on the scenario where both of those queries could be submitted to SQL Server in the same batch instead of individually. With prefetch paths you have this:
1. Open DB connection
2. SELECT * FROM TableA WHERE ID = [value]
3. Receive results from query.
4. SELECT * FROM TableB WHERE A_ID = [value] AND [Some other optional filter on B]
5. Receive results from query.
6. Close DB connection.

What I was proposing was a way to do this:
1. Open DB connection
2. Submit batch with both select queries at one time.
3. Receive multiple result set response from DB.
4. Close DB connection.

I understand Frans's assertion that there is no significant difference between these two operations. I still think there may be utility in being able to have multiple queries run in the same batch, but I am fine with his answer.

Actually, your proposal could be slower (but also faster). Let me show an example (this is true for set fetching, not single-entity fetches, which are so quick that it's academic to debate speed, really).

Say you fetch a bunch of orders and the customers they belong to. Say the set of orders is TableA and the set of customers is TableB. Say the # of customers for the set of orders is below the ParameterizedPrefetchPathThreshold you've set (e.g. the threshold is 100 and you have 50 customers).

In your case, you have to specify as queries:
1: SELECT ... FROM Orders WHERE <order filter>
2: SELECT ... FROM Customers WHERE CustomerID IN (SELECT CustomerID FROM Orders WHERE <order filter>)

However, with prefetch paths and 2 queries, this can be optimized. The second query will become:
2: SELECT ... FROM Customers WHERE CustomerID IN (@p1, @p2, ..., @p50)

This is quicker, as the query doesn't have to run over Orders again.
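For reference, in adapter code the prefetch path version of this fetch looks roughly like the following sketch (entity, field and path names come from a hypothetical Orders/Customers model, someDate is a placeholder, and the exact overloads may vary per version):

// fetch orders matching a filter and eagerly load each order's related customer
var orders = new EntityCollection<OrderEntity>();
var filter = new RelationPredicateBucket(OrderFields.OrderDate >= someDate);

var path = new PrefetchPath2(EntityType.OrderEntity);
path.Add(OrderEntity.PrefetchPathCustomer);   // fetched with its own query: IN (@p1..@pN) or a subquery

using (var adapter = new DataAccessAdapter())
{
    adapter.FetchEntityCollection(orders, filter, path);
}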

So you can optimize subsequent queries based on data already fetched, but you can only do that if you execute 1 query at a time (obviously). This might be a bit slower due to subsequent commands being executed (although I doubt anyone could measure that, as the biggest hurdle, the connection, is kept open), but it's a lot quicker due to smarter SQL.

It's a give-and-take system though, as I don't want to suggest that this system is quicker in all situations. It's possible to design a situation where this isn't quicker and a different approach might be faster. However, in generic code it's impossible to decide what to do (take approach X over approach Y or vice versa), as data in the DB (statistics) is needed to determine that, and that data is obviously not available in memory.

So in short: if you're looking for more performance, it's key to simply execute smarter SQL and less SQL, limit locks, etc. It's far quicker to simply re-use a pre-rendered webpage from the cache for 5 minutes than to rebuild the webpage every time from data obtained from the DB, even if that DB/query is very quick (just an example).

Frans Bouma | Lead developer LLBLGen Pro