FetchEntityCollectionAsync and PrefetchPath

kievBug
User
Posts: 105
Joined: 09-Jan-2009
# Posted on: 22-Feb-2018 17:36:46   

Hey guys,

I'm using v5.3.2.0 (Adapter) with SQL Server, trying to load data:


                var parameters = new QueryParameters
                {
                    AllowDuplicates = true,
                    RowsToTake = 100,
                    RowsToSkip = 0,
                    PrefetchPathToUse = prefetchPath,
                    CollectionToFetch = toReturn,
                    FilterToUse = filter,
                    RelationsToUse = relations,
                    SorterToUse = sortExpression,
                };

                await adapter.FetchEntityCollectionAsync(parameters, CancellationToken.None);

As you can see, I've specified 100 records and it returns 100; the table has a lot of records. prefetchPath has several items in it.

When I look at the queries generated, I see that the prefetch path queries do have a filter based on the main query, but they are not limited to the number of records from the main query and load a lot of data.

Here is an example of the queries:

Main query:


SELECT TOP(@p2) 
                [dbo].[bulBulletin].[ID]                 AS [Id],
                [dbo].[bulBulletin].[Title],
                [dbo].[bulBulletin].[Type]
FROM   [dbo].[bulBulletin]
WHERE  (([dbo].[bulBulletin].[Type] = @p3 AND [dbo].[bulBulletin].[Deleted] = @p4))
ORDER  BY [dbo].[bulBulletin].[ID] DESC 

Prefetch query:


SELECT [dbo].[bulBulletinGroup].[BulletinID] AS [BulletinId],
       [dbo].[bulBulletinGroup].[GroupID]   AS [GroupId]
FROM   [dbo].[bulBulletinGroup]
WHERE  ([dbo].[bulBulletinGroup].[BulletinID] IN
        (SELECT [dbo].[bulBulletin].[ID] AS [Id]
         FROM   [dbo].[bulBulletin]
         WHERE  ((([dbo].[bulBulletin].[Type] = @p1
               AND [dbo].[bulBulletin].[Deleted] = @p2))))) 

For each record in the Bulletin table there are 2 records in the BulletinGroup table, and as you can see, all records from the group table are being fetched.

Why do you need to fetch all of them if the main query gets only 100 records?

Anton

Walaa
Support Team
Posts: 14950
Joined: 21-Aug-2005
# Posted on: 22-Feb-2018 20:54:28   

PrefetchPath with Paging

LLBLGen Pro supports paging functionality in combination with prefetch paths. If you want to use paging together with prefetch paths, be sure to set DataAccessAdapter.ParameterisedPrefetchPathThreshold to a value larger than the page size you want to use. You can use paging with a page size larger than DataAccessAdapter.ParameterisedPrefetchPathThreshold, but it will be less efficient.

It's set to 50 by default.
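For example, a sketch based on the code in the first post (the exact surrounding setup and adapter construction will vary per project):

```csharp
// Raise the parameterised prefetch path threshold above the page size (100),
// so the prefetch queries filter on the fetched parent IDs directly instead
// of re-running the parent query as a subquery without the row limit.
adapter.ParameterisedPrefetchPathThreshold = 200; // > RowsToTake
await adapter.FetchEntityCollectionAsync(parameters, CancellationToken.None);
```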

kievBug
User
Posts: 105
Joined: 09-Jan-2009
# Posted on: 22-Feb-2018 21:40:08   

So if my max page size is 500, then I should change it to 500? I just do not think that fetching all records from the related table (groups in this case) is efficient; in my opinion it is super inefficient, and I'm really not sure why, in such a simple case, I need to configure something and it doesn't work out of the box.

Is there a better way to deal with it? Is there anything that you can do on your side?

Anton

Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 22-Feb-2018 23:12:33   

kievBug wrote:

So if my max page size is 500, then I should change it to 500? I just do not think that fetching all records from the related table (groups in this case) is efficient; in my opinion it is super inefficient, and I'm really not sure why, in such a simple case, I need to configure something and it doesn't work out of the box.

Because there are limits on the # of parameters you can send to a database (some support more than others), you need to configure this. To make it work, it needs to send the subquery as an IN clause with parameters, one for each parent's ID. But the generic code which handles prefetch paths can't decide how many parameters are allowed for the database you're targeting, so you have to specify the limit. This is also because these queries are in general expensive, so you are forced to think about what you're doing.
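For the queries shown earlier in the thread, the parameterised variant would take roughly this shape (an illustrative sketch, not literal generated SQL; parameter names are made up):

```sql
SELECT [dbo].[bulBulletinGroup].[BulletinID] AS [BulletinId],
       [dbo].[bulBulletinGroup].[GroupID]    AS [GroupId]
FROM   [dbo].[bulBulletinGroup]
WHERE  [dbo].[bulBulletinGroup].[BulletinID] IN
       (@p1, @p2, @p3, /* ... */ @p100) -- one parameter per fetched parent row
```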

Is there a better way to deal with it? Is there anything that you can do on your side?

It's literally one line of code: set the threshold, done. I don't see why that's a problem, but maybe I'm missing something?

There's no other way, as I don't know what the query would look like. I also don't see what else we could do, as we can't do much more than we do now, considering the limitations I described above and the reasons the system works the way it does.

Frans Bouma | Lead developer LLBLGen Pro
kievBug
User
Posts: 105
Joined: 09-Jan-2009
# Posted on: 23-Feb-2018 00:41:15   

I think you guys should be interested in providing value out of the box; for me the value is in the best performance possible. I think you have all the info you need: you know what database is used, its limit on the number of parameters, which query gives better results (e.g. that if the number of parameters is greater than 300 it starts to slow down), the table structure, and all the query parameters like top and skip. Based on all of those you can fetch the data efficiently and make this decision instead of me. I'm using an ORM to abstract all of this away and not deal with SQL optimization, at least in simple cases. I believe everybody who started with LLBLGen experienced the same issue. Adding the line of code is not the problem; the problem is how much time it takes to find out why 100,000 records are being loaded for a 100-record request, ask here for help, and learn that I need to configure something to generate a query with better performance, i.e. for production use. My expectation is that LLBLGen generates the best query possible, and this was one of the main reasons for switching from EF a long time ago.

Anton

Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 23-Feb-2018 09:33:47   

kievBug wrote:

I think you guys should be interested in providing value out of the box; for me the value is in the best performance possible. I think you have all the info you need: you know what database is used, its limit on the number of parameters, which query gives better results (e.g. that if the number of parameters is greater than 300 it starts to slow down), the table structure, and all the query parameters like top and skip. Based on all of those you can fetch the data efficiently and make this decision instead of me.

That's easy to say, but how do we know what the 'best' query is? We don't know how many rows are in the table, for instance. We can't make a decision based on what we have, as the biggest factor, the table size, is unknown. Plus, these kinds of things aren't simple: they're the core of e.g. SQL optimization engines, and if you look at how good those are in databases like Oracle (read: not that great), you know even corporations with that many resources don't get it right, so how can we, with even less information?

I'm using an ORM to abstract all of this away and not deal with SQL optimization, at least in simple cases. I believe everybody who started with LLBLGen experienced the same issue. Adding the line of code is not the problem; the problem is how much time it takes to find out why 100,000 records are being loaded for a 100-record request, ask here for help, and learn that I need to configure something to generate a query with better performance, i.e. for production use.

I understand that. So to get to something that works (and sorry, but we can't make decisions like this for you): is throwing an exception a good solution? As in: throw an exception when a prefetch path is used with paging and the threshold is too low, so a subquery would be issued. That would have notified you instantly and you could have anticipated it (an exception which states what's wrong, not "Invalid operation" or equivalent 'helpful' text).

Combined with that, we could add a setting (which you'd have to learn about, though we can enforce that through the exception) that forces a parameterized query for paging queries no matter what the threshold is. The setting then gets three values: default (the current behavior), exception, and adjust (which might crash in the DB).

I get that you want us to simply adjust the threshold to match the query, but say the DB supports just 300 parameters and you need 500: what do we do? Say we adjust the parameters to 500 no matter what; the query will crash and you would wonder why it created a query with 500 parameters when the DB clearly supports only 300. In the case of SQL Server it's 2100, I think (have to look it up), but using that many parameters has side effects too. What I'm saying is that issuing a prefetch path with that many rows might not be the right choice here: a projection with a join might be a better choice.

My expectation is that LLBLGen generates the best query possible and this was one of the main reasons for switching from EF long time ago.

We try, though we can't always do what seems obvious, because we don't know all the information, like the table size. And even if most of the info is available, code has to make the decision, which isn't as simple as saying "make the right decision".

So, I get that you're frustrated with this. Please read what I proposed above and give us feedback on whether that would work.

Frans Bouma | Lead developer LLBLGen Pro
kievBug
User
Posts: 105
Joined: 09-Jan-2009
# Posted on: 26-Feb-2018 17:09:46   

I'm not asking you to change the parameter at all; that is not the point.

The current way of fetching the data for a prefetch path doesn't work well and requires optimization; it is not efficient, as it fetches all data from the related table when only 100 records are fetched from the main one.

As for me, you guys have to come up with a more efficient strategy for fetching data from the related table and only fetch the data you need, not everything. It is causing performance issues in applications. Configuring the parameter is just a temporary workaround, not a solution.

I understand that it is super hard (a lot of constraints, not enough data about table size), but I do believe in you guys :-) you can find a way and solve it.

Anton

Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 26-Feb-2018 17:21:16   

kievBug wrote:

I'm not asking you to change the parameter at all; that is not the point.

The current way of fetching the data for a prefetch path doesn't work well and requires optimization; it is not efficient, as it fetches all data from the related table when only 100 records are fetched from the main one.

Sure, though that's a result of it picking the wrong strategy. There are two implemented: one with a subquery and one with an IN clause. It will only pick the second one if some criteria are met (so it's sure it won't crash). Your query didn't meet those criteria (the threshold is set to 50 by default, and your page size of 100 exceeds it).

It can't know it has to switch to a different strategy, as it doesn't know what 'a small amount' is. It's software; it has to work with data, and it can't pick a strategy without that data. So it can't select the IN clause strategy when the threshold isn't met, as it currently doesn't know for sure what to do. We can add some rules so it picks the right strategy more cleverly, but that has consequences, e.g. it might run into a crash (when you fetch 4000 root entities an IN clause won't work), and it can't know whether those consequences are acceptable to the developer/user, so a setting is required. Currently we don't have one; a setting could give you the control you need so the query picks the right strategy.

As for me, you guys have to come up with a more efficient strategy for fetching data from the related table and only fetch the data you need, not everything. It is causing performance issues in applications. Configuring the parameter is just a temporary workaround, not a solution.

I understand that it is super hard (a lot of constraints, not enough data about table size), but I do believe in you guys :-) you can find a way and solve it.

It's not a temporary workaround; it's telling the algorithm what to do in a given situation, as it can't decide these things at the moment.

You want a solution with no action from your side, but that's not going to happen, sorry, because there isn't one: it can't pick the 'IN clause' strategy 'just because'; it needs rules on which it can base its decisions. There's currently one rule: the prefetch path threshold. We can stretch that a little, but not as far as always picking the IN clause: we can't make that decision for you. You have to make it, as always picking that clause in paging queries can lead to crashes, and you'd then of course ask us why the query crashed.

Paging plus prefetch paths is a tricky situation; there's no real solution other than an IN clause, so a configuration setting which tells the system what to do in that situation (where the threshold is too small) is best, as you then decide what is acceptable for your application. We can never decide that for you.

Frans Bouma | Lead developer LLBLGen Pro
kievBug
User
Posts: 105
Joined: 09-Jan-2009
# Posted on: 26-Feb-2018 18:00:02   

As for me, the main thing is the desire to find a solution. For example, in this case:

Let's assume we're dealing with SQL Server (it should be similar for other DBs as well).

  1. We know that if the number of parameters is > 300, it already starts to behave poorly. We just use that number in our fetch algorithm. If 300 is not the right number, then OK, we need to do some tests and figure it out. It doesn't have to be the same number for each DB.

  2. We have the case when we need to fetch more than 300 records; what do we do then?

There are two cases:
- when the related table is small, it is going to be much faster to fetch it all and join in memory
- when the related table is bigger than the main one, we should not fetch it all

Obviously, we do not have table size info. So what can we do?
- get that info somehow
- ignore it, and try to come up with a generic method, which might not be as good as the first option

IMHO, we need both of these methods, just because we want our solution to perform, and to be configurable so that devs can get the best performance out of LLBLGen.

A generic method of loading data: what can we do? One option is to use temp tables, i.e. data for the join field from the main table goes into a temp table, and we use it when fetching data from the related one. I'm pretty sure that if we think about it more, we can find other ways as well. Obviously the option to load everything should be considered the fallback, as it is the most expensive one.
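As an illustration only (this is not something LLBLGen generates; it is a hypothetical sketch of the temp-table idea on SQL Server, reusing the tables from the queries earlier in the thread):

```sql
-- Stage the page of parent IDs in a temp table, then join the related
-- table against it, so only the child rows of the 100 parents are fetched.
SELECT TOP(100) [ID] INTO #PageIds
FROM   [dbo].[bulBulletin]
WHERE  [Type] = @p1 AND [Deleted] = @p2
ORDER  BY [ID] DESC;

SELECT g.[BulletinID], g.[GroupID]
FROM   [dbo].[bulBulletinGroup] g
JOIN   #PageIds p ON p.[ID] = g.[BulletinID];
```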

Getting data about table sizes is essentially similar to just giving the developer an option to pick a fetch strategy. For different tables, the dev can specify different fetching strategies; that is a good feature, but if we want less coding, we could specify it in the designer when we create the model, which is probably not a bad option either.

I'm pretty sure that we can find other options and come up with a better solution.

The main question is whether YOU want to do this or not: is there value in it for you and your customers?

And based on your responses it doesn't look like you see value in it, and it is definitely your call. For me this is the value I see in LLBLGen: abstraction, performance, ease of use.

Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 27-Feb-2018 10:40:54   

Good points.

The main problem is that this code is written on top of the prefetch path system; it can't go to the DB 'just to pull extra data'. But thinking about it, knowing the size of the child table isn't actually needed: you have to apply predicates on the child table regardless of its size. If it's small, it doesn't matter; if it's big, you want to limit the # of rows fetched, and the subquery system isn't useful for this case (paging).

For prefetch paths, the # of parent rows dictates what is chosen: subquery (default) or IN clause. In normal prefetch path cases, the query which fetched the root rows has some sort of predicate defining the set, so it can be used in the subquery and both strategies perform well. This doesn't need changing. (The only optimization which might be necessary is to use joins to merge in-database instead of in-memory; however, performance tests show it's slower in most cases, so we didn't implement it. EF also switched from joins to query-per-node in EF Core.)

In the case of paging, we can't use a subquery with the parent predicates, as paging queries are 'special': they can't really be used inside a subquery. So we work with the root rows as-is, and I agree that using a subquery here is in almost all cases not what you want, so this really should change.

I think the main problem is passing the set of identifying values for the parent rows to the DB to limit the set of child rows. As this is for paging, pages in general aren't that big, so we could silently switch to always using the IN clause optimization, regardless of the threshold. The reason is that whatever you choose, the parent row IDs have to be passed to the DB to use them in predicates on the child rows: whether you insert them into a temp table for joins or otherwise, they have to be passed. An alternative, which is in almost all cases much slower, is to re-run the paging query and, instead of returning the set of rows in the page, insert the IDs into a temp table.

So if we silently switch to the IN clause optimization for paging queries with prefetch paths, the problem we then run into is page size * # of identifying fields > max # of parameters allowed. This is a problem one would run into otherwise as well (i.e. when setting the threshold manually), so the remedy in that case is to use smaller pages.

The other problem, which also occurs when setting the threshold manually, is the performance degradation when using lots of parameters. This is however not determinable for us, for the simple reason that it's not a constant: for some tables of a certain size it might be faster to simply run the query with 1100 parameters, and for others to run 2 batch queries of 550 parameters each. Batch queries are likely slower due to the multiple roundtrips and query parsing overhead, so we can't say "we start batching when we get over 300 parameters (or some other arbitrary number)".

So to recap:
- Agreed, we shouldn't use subquery filtering on paging queries with prefetch paths at all.
- A solution is to silently switch to IN clause queries, as the alternative (subqueries) isn't useful.
- For all solutions, the child rows have to be filtered using a predicate relying on the parent rows, which in all cases requires the parent rows' identifying field values (which can be FK fields!). All solutions thus require passing these values to the DB, which requires parameters (and thus runs into the limit, no matter what).
- The # of values in the IN clause equals the # of rows in the page * the # of identifying fields in the parent rows, which in general is quite low (< 500). If it exceeds the maximum for the DB, a manually set threshold would have hit that too, and the simple remedy is to limit the page size, as there's no other option (all parameters have to be passed to the DB).
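The parameter arithmetic in that last point can be sketched as a hypothetical check (the helper name is made up; the 2100 default is SQL Server's per-request parameter limit, and other databases differ):

```csharp
// Sketch: does the IN clause for a prefetch path fit within the DB's
// parameter limit? Values needed = rows in page * identifying fields per row.
static bool InClauseFits(int pageSize, int identifyingFieldCount,
                         int maxDbParameters = 2100)
{
    return pageSize * identifyingFieldCount <= maxDbParameters;
}

// 100 rows, single-field PK: 100 parameters, well within the limit.
// InClauseFits(100, 1)  -> true
// 1200 rows, 2-field compound PK: 2400 parameters, exceeds the limit;
// the remedy is a smaller page size.
// InClauseFits(1200, 2) -> false
```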

Passing large sets of parameters is something Stack Overflow has experimented with, e.g. packed into a string and unpacked in the DB with SQL, passed as table-valued parameters, etc., and it's all limited and restricted: the packed-string variant obviously doesn't work with values that are strings, as we won't pass literals no matter what, and table-valued parameters are interesting but require a type to be created in the DB, which sucks, as we'd then have to alter the schema at runtime, which is likely not allowed for the user running the code.

I think this is in line with what you wanted. In your case you wouldn't have noticed anything. I don't see a way to do this faster, as we can't even use a join in the DB because the parent rows come from a paging query.

If I've missed anything, let me know; otherwise we'll make the change to always pick IN clauses for paging queries with prefetch paths in v5.4.

Frans Bouma | Lead developer LLBLGen Pro
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 08-Mar-2018 12:46:29   

It turned out this was already implemented, but the implementation contained a bug: it only took effect when there were rows to skip. That was too rigid: the first page obviously doesn't skip any rows but does run into this problem. Additionally, if you fetched, say, 100 entities plus prefetch paths (so you specified a limit), it would also fetch all rows.

This has been corrected: it now always picks the parameterized prefetch path route if the root node's entities are fetched with a limit or paging. So in your situation it would have picked that route, regardless of the threshold.

Frans Bouma | Lead developer LLBLGen Pro