SELECT TOP does not work in some cases

Posts   
 
    
Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 10:38:09   

Hello

I noticed in version 2004.0.2 that SELECT TOP is not working in some cases.

I traced the problem to the LLBL source: TOP is ignored when relations are specified. Is this a new feature of this version? And why is it done this way? I think relations and TOP don't conflict with each other.

        if(maxNumberOfItemsToReturn > 0)
        {
            // only emit TOP if DISTINCT is emitted as well or when there are no relations or when the limit is 1 as duplicates don't hurt a limit of 1,
            // otherwise set the flag in the RetrievalQuery object for manual limitation.
            if(distinctEmitted || !relationsSpecified || (maxNumberOfItemsToReturn==1))
            {
                queryText.AppendFormat(" TOP {0}", maxNumberOfItemsToReturn);
            }
            else
            {
                selectQuery.RequiresClientSideLimitation=true;
                selectQuery.MaxNumberOfItemsToReturnClientSide = maxNumberOfItemsToReturn;
            }
        }
Marcus avatar
Marcus
User
Posts: 747
Joined: 23-Apr-2004
# Posted on: 31-Aug-2005 10:46:43   

I think in this case LLBLGen performs a client-side TOP, meaning that all rows are returned from the DB and later filtered in the LLBLGen code. I seem to remember Frans giving a reason for this in a previous thread...

Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 10:54:52   

Marcus wrote:

I think in this case LLBLGen performs a client-side TOP, meaning that all rows are returned from the DB and later filtered in the LLBLGen code. I seem to remember Frans giving a reason for this in a previous thread...

Sometimes it's a client-side TOP, sometimes server-side.

Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 11:01:40   

Marcus wrote:

I think in this case LLBLGen performs a client-side TOP, meaning that all rows are returned from the DB and later filtered in the LLBLGen code. I seem to remember Frans giving a reason for this in a previous thread...

OK, I found that old thread, but I can only agree with the DISTINCT part. Why does adding any relation to a query make it omit TOP?

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 31-Aug-2005 11:32:24   

TOP only works if:
- no joins are present, or
- joins are present and DISTINCT can be applied.

Otherwise you can get duplicates, and TOP will then return the top n rows with some of them duplicated, so you effectively get fewer rows back.

DISTINCT is applied if possible, though if there is a DISTINCT-violating type in the select list (a blob, for example), DISTINCT is omitted and client-side limitation is performed.

So it doesn't mean TOP doesn't work; it means the limitation feature is done differently, to get the proper results. Because only unique entities are returned, returning 100 rows containing 25 duplicates would result in 75 entities, which isn't correct, hence it will then perform client-side limitation.
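To make the arithmetic concrete, here is a small, purely hypothetical sketch (in Python; this is not LLBLGen or SQL Server code) of why a server-side TOP over joined rows can deliver fewer unique entities than requested, while client-side limitation keeps reading until it has the requested number of unique rows:

```python
def server_side_top(rows, n):
    # TOP n in SQL terms: take the first n raw rows, duplicates included
    return rows[:n]

def client_side_limit(rows, n):
    # client-side limitation: keep reading and de-duplicating until
    # n unique entities have been collected
    seen, result = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
            if len(result) == n:
                break
    return result

# a join that duplicates every "customer" row twice (one row per "order")
joined_rows = [c for c in range(10) for _ in range(2)]  # [0, 0, 1, 1, ..., 9, 9]

top_4 = server_side_top(joined_rows, 4)       # [0, 0, 1, 1]: only 2 unique entities
unique_4 = client_side_limit(joined_rows, 4)  # [0, 1, 2, 3]: 4 unique entities
```

Asking for 4 entities via a plain TOP 4 yields only 2 unique ones here, which is the incorrect result the engine avoids.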

Could you give an example when it doesn't work? (i.e.: with your code, table setup and data)

Frans Bouma | Lead developer LLBLGen Pro
Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 12:53:30   

Otis wrote:

TOP only works if: So it doesn't mean TOP doesn't work, it means the limitation feature is done differently, to get the proper results. Because only unique entities are returned, returning 100 rows with 25 duplicates will result in 75 entities so that's not working, hence it will then perform client side limitations.

Could you give an example when it doesn't work? (i.e.: with your code, table setup and data)

Well, I have a table with 700,000 records and want to return only the TOP 100 records from this table in a dynamic list, so joins are present in my query. And guess what: this simple operation results in a FULL scan of that table, returning only 100 results filtered on the client side. Very nice behaviour simple_smile Since it happens in a webservice call, the client times out and never gets a result back.

I expected the result limit to be translated into a TOP 100 clause in the SELECT SQL, but no: for some reason you decided not to include TOP when relations are present (I guess it's a new feature of 2004.0.2, because older versions worked as I expected, I believe, but I'm not 100% sure).

I agree that joins can duplicate records, but I think that should be left for the developer to consider.

Marcus avatar
Marcus
User
Posts: 747
Joined: 23-Apr-2004
# Posted on: 31-Aug-2005 13:24:23   

Andrius wrote:

I agree that joins can duplicate records, but I think that should be left for the developer to consider.

I've got to agree with this one... Is there an "AllowDuplicates" flag?

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 31-Aug-2005 13:30:17   

Andrius wrote:

Otis wrote:

TOP only works if: So it doesn't mean TOP doesn't work, it means the limitation feature is done differently, to get the proper results. Because only unique entities are returned, returning 100 rows with 25 duplicates will result in 75 entities so that's not working, hence it will then perform client side limitations.

Could you give an example when it doesn't work? (i.e.: with your code, table setup and data)

Well, I have a table with 700,000 records and want to return only the TOP 100 records from this table in a dynamic list, so joins are present in my query. And guess what: this simple operation results in a FULL scan of that table, returning only 100 results filtered on the client side. Very nice behaviour simple_smile Since it happens in a webservice call, the client times out and never gets a result back.

'Client side limitation' means in the DataAccessAdapter (adapter) or in the DaoBase (selfservicing). So what you're experiencing is not caused by that, as the webservice internally should limit the rows. The row limitation on the 'client' (i.e. internal to the ORM support classes) is done by the DbDataAdapter used to fill the datatable in the case of a dynamic list or typed list, so it simply reads the first 100 rows from the datareader internally (an overload of DbDataAdapter.Fill supports this).

You may sarcastically call it 'nice behaviour', but it's the only way to get correct results.

If you specify a limit of 100, and relations, and allowDuplicates = true, you won't get DISTINCT. As you didn't post any code before, I didn't know what you were doing. As you're filling a dynamic list, you have to specify allowDuplicates = false. For entities this is already set by default, so it will work with entities without a problem.

I expected the result limit to be translated into a TOP 100 clause in the SELECT SQL, but no: for some reason you decided not to include TOP when relations are present (I guess it's a new feature of 2004.0.2, because older versions worked as I expected, I believe, but I'm not 100% sure).

It was a bug, because it resulted in incorrect results; hence it was corrected in 1.0.2004.2

I agree that joins can duplicate records, but I think that should be left for the developer to consider.

No, because the developer wants correct results, and there is only one 'correct' in this case. So if I leave it up to the developer, I get requests why the results are wrong and that it should be fixed.

Exactly the same problem exists in paging. If DISTINCT can't be applied and the query contains joins, duplicates can occur, so paging results are not correct by default; hence it switches to client-side paging (i.e. skipping n rows, then reading pagesize rows, filtering out duplicates in the case of entities).
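That client-side paging step can be sketched as follows (hypothetical Python, not the actual ORM code): skip past the unique rows belonging to earlier pages, then collect pagesize unique rows, discarding join-produced duplicates throughout:

```python
def client_side_page(rows, page_number, page_size):
    # skip the unique rows belonging to earlier pages, de-duplicating
    # as we go, then collect page_size unique rows for the requested page
    seen, page = set(), []
    to_skip = (page_number - 1) * page_size
    skipped = 0
    for row in rows:
        if row in seen:
            continue  # duplicate produced by a join: ignore it
        seen.add(row)
        if skipped < to_skip:
            skipped += 1
            continue
        page.append(row)
        if len(page) == page_size:
            break
    return page

# with duplicated join output, page 2 of size 3 still yields unique rows 3, 4, 5
duplicated = [c for c in range(10) for _ in range(2)]
```

The point is that paging offsets are counted in unique entities, not raw rows, which a server-side TOP over a duplicating join cannot guarantee.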

Frans Bouma | Lead developer LLBLGen Pro
Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 31-Aug-2005 13:40:36   

Marcus wrote:

Andrius wrote:

I agree that joins can duplicate records, but I think that should be left for the developer to consider.

I've got to agree with this one... Is there an "AllowDuplicates" flag?

The main reason is that there is just one routine which produces the SELECT query. It doesn't know nor care where the call comes from. If you specify allowDuplicates = true in a typedlist/dynlist fetch WITH relations, and a limit of n rows, what do you want? n rows WITH dupes or n rows WITHOUT dupes? (And before you say "with dupes, I didn't care about dupes": a list of rows with dupes is of absolutely no value.) The routine which generates the SELECT query doesn't know that client-side filtering of dupes can't be done in the case of a typedlist/dynlist fetch (it has no other choice either). So it switches to client-side filtering, as there is no other option.

The datatable filler then uses a dataset fill with a DbDataAdapter and specifies the limit there. This should fetch the amount from the resultset.

Frans Bouma | Lead developer LLBLGen Pro
Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 14:54:00   

I agree that joins can duplicate records, but I think that should be left for the developer to consider.

No, because the developer wants correct results, and there is only one 'correct' in this case. So if I leave it up to the developer, I get requests why the results are wrong and that it should be fixed.

Exactly the same problem exists in paging. If DISTINCT can't be applied and the query contains joins, duplicates can occur, so paging results are not correct by default; hence it switches to client-side paging (i.e. skipping n rows, then reading pagesize rows, filtering out duplicates in the case of entities).

BTW I am using an adapter. And by 'client side' I mean the client of the SQL Server DB.

You should consider giving your query builder engine simple parameters to switch DISTINCT on/off, and to switch the TOP clause on/off.

What you are saying is true for EntityCollections, but not for dynamically constructed queries. With dynamic queries (lists) you need more flexibility, because you usually use them in unusual cases.

My problem is that the query takes too long to execute, and I need only 100 results, no matter whether they are duplicates or not, because I am constructing the query dynamically and can take care of that myself. It doesn't matter to me in this case that the adapter in the end returns 100 rows; it simply takes too long to wait for those 100 rows when the table has 700,000 records. Correct results with the wrong timing don't help me at all.

My suggestion: keep the current behaviour when fetching EntityCollections, but allow the flexibility I need with dynamic queries. I think this will make your engine better and more flexible in the long run.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 31-Aug-2005 15:20:19   

Andrius wrote:

I agree that joins can duplicate records, but I think that should be left for the developer to consider.

No, because the developer wants correct results, and there is only one 'correct' in this case. So if I leave it up to the developer, I get requests why the results are wrong and that it should be fixed.

Exactly the same problem exists in paging. If DISTINCT can't be applied and the query contains joins, duplicates can occur, so paging results are not correct by default; hence it switches to client-side paging (i.e. skipping n rows, then reading pagesize rows, filtering out duplicates in the case of entities).

BTW I am using an adapter. And by 'client side' I mean the client of the SQL Server DB.

OK simple_smile

You should consider giving your query builder engine simple parameters to switch DISTINCT on/off, and to switch the TOP clause on/off.

It has these already, but the engine's philosophy is that such a setting should have a meaning, so if I switch off DISTINCT, request TOP, and specify a set of joins, that's not going to work.

What you are saying is true for EntityCollections, but not for dynamically constructed queries. With dynamic queries (lists) you need more flexibility, because you usually use them in unusual cases.

OK, though you can switch on DISTINCT, which should enable TOP: specify false for allowDuplicates in the FetchTypedList() call (pick an overload which accepts that parameter). This will then result in DISTINCT being emitted if possible, and thus also TOP.

My problem is that the query takes too long to execute, and I need only 100 results, no matter whether they are duplicates or not, because I am constructing the query dynamically and can take care of that myself. It doesn't matter to me in this case that the adapter in the end returns 100 rows; it simply takes too long to wait for those 100 rows when the table has 700,000 records. Correct results with the wrong timing don't help me at all.

OK, though SQL Server applies TOP after the complete query. This will thus still result in a scan of your complete table, building of the result set, and only then TOP. Some optimization can be performed, but often that's not possible, as TOP has to be done after ORDER BY, for example... I doubt (but am not sure) that TOP in the query will make the query very quick.

A SET ROWCOUNT switch would avoid this and simply cut off the set, but that would give bad results when sorting.

Now, the resultset is sent in small chunks, so the datatable filler in the SqlDataAdapter won't fetch 700,000 rows and then cut off after 100; it will fetch 100 rows and simply close the connection.
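The effect of cutting off a streaming result set can be illustrated with a small hypothetical sketch (Python standing in for the datareader/DbDataAdapter behaviour, not actual .NET code): because rows are pulled lazily, stopping after 100 means the remaining rows are never transferred:

```python
rows_sent = []  # tracks how many rows the "server" actually streamed

def streaming_reader(total_rows):
    # stands in for a datareader: rows leave the server lazily, one at a time
    for i in range(total_rows):
        rows_sent.append(i)
        yield i

def fetch_limited(reader, max_rows):
    # stands in for a fill operation with a max record count:
    # stop pulling rows once max_rows have been read
    out = []
    for row in reader:
        out.append(row)
        if len(out) == max_rows:
            break
    return out

first_100 = fetch_limited(streaming_reader(700_000), 100)
# only 100 rows ever crossed the wire, not 700,000
```

The expensive part in the reported scenario is therefore the server-side scan and sort, not the row transfer.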

My suggestion: keep the current behaviour when fetching EntityCollections, but allow the flexibility I need with dynamic queries. I think this will make your engine better and more flexible in the long run.

You mean: TOP no matter what, even if Distinct isn't possible?

Could you try specifying 'false' for allowDuplicates?

Frans Bouma | Lead developer LLBLGen Pro
Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 15:26:22   

OK, though SQL Server applies TOP after the complete query. This will thus still result in a scan of your complete table, building of the result set, and only then TOP. Some optimization can be performed, but often that's not possible, as TOP has to be done after ORDER BY, for example... I doubt (but am not sure) that TOP in the query will make the query very quick.

You are wrong on that: TOP is considered during the query; LIMIT is executed after the query.

Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 15:48:03   

You mean: TOP no matter what, even if Distinct isn't possible?

Yes, because it has to do with the execution time of the query.

Could you try specifying 'false' for allowDuplicates?

Yes, it worked.

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 31-Aug-2005 16:11:29   

Andrius wrote:

OK, though SQL Server applies TOP after the complete query. This will thus still result in a scan of your complete table, building of the result set, and only then TOP. Some optimization can be performed, but often that's not possible, as TOP has to be done after ORDER BY, for example... I doubt (but am not sure) that TOP in the query will make the query very quick.

You are wrong on that: TOP is considered during the query; LIMIT is executed after the query.

When I check the execution plan for:


select top 50 c.* 
from customers c inner join orders o
on c.customerid = o.customerid
order by o.orderdate desc

TOP is the last phase, after the join and ORDER BY have been performed. Perhaps you're using a different DB, as LIMIT is not a SQL Server statement (MySQL, if I'm not mistaken).

It is more efficient than no TOP statement, absolutely right, though I doubt it will be very efficient, as TOP has to be applied after everything else is done; otherwise you couldn't have TOP after sorting.

Frans Bouma | Lead developer LLBLGen Pro
Marcus avatar
Marcus
User
Posts: 747
Joined: 23-Apr-2004
# Posted on: 31-Aug-2005 16:59:48   

Otis wrote:

The main reason is that there is just one routine which produces the SELECT query. It doesn't know nor care where the call comes from. If you specify allowDuplicates = true in a typedlist/dynlist fetch WITH relations, and a limit of n rows, what do you want? n rows WITH dupes or n rows WITHOUT dupes? (And before you say "with dupes, I didn't care about dupes": a list of rows with dupes is of absolutely no value.) The routine which generates the SELECT query doesn't know that client-side filtering of dupes can't be done in the case of a typedlist/dynlist fetch (it has no other choice either). So it switches to client-side filtering, as there is no other option.

Well that shut me up! smile stuck_out_tongue_winking_eye

Andrius
User
Posts: 68
Joined: 04-Apr-2005
# Posted on: 31-Aug-2005 17:09:21   

Otis wrote:

When I check the execution plan for:


select top 50 c.* 
from customers c inner join orders o
on c.customerid = o.customerid
order by o.orderdate desc

TOP is the last phase, after the join and ORDER BY have been performed. Perhaps you're using a different DB, as LIMIT is not a SQL Server statement (MySQL, if I'm not mistaken).

It is more efficient than no TOP statement, absolutely right, though I doubt it will be very efficient, as TOP has to be applied after everything else is done; otherwise you couldn't have TOP after sorting.

OK maybe I'm wrong.

arschr
User
Posts: 893
Joined: 14-Dec-2003
# Posted on: 31-Aug-2005 18:11:09   

I think if you take the order by clause off (or order by a clustered index field) you'll skip the table scan. (Yes I just tested with a table with 500 million rows and got back the top 100 in under a second).

pilotboba
User
Posts: 434
Joined: 05-Aug-2005
# Posted on: 31-Aug-2005 19:45:09   

arschr wrote:

I think if you take the order by clause off (or order by a clustered index field) you'll skip the table scan. (Yes I just tested with a table with 500 million rows and got back the top 100 in under a second).

Correct. This issue isn't black and white as to when TOP is applied; it depends on the query: whether there is a covering index and/or whether you are ordering. At least for SQL Server, from the BOL:

If a SELECT statement that includes TOP also has an ORDER BY clause, the rows to be returned are selected from the ordered result set. The entire result set is built in the specified order and the top n rows in the ordered result set are returned.
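That BOL behaviour can be mimicked with a small, purely illustrative sketch (Python rather than SQL): taking the top n of an ordered set requires examining every input row, whether the set is fully sorted first or processed with a shortcut such as a heap:

```python
import heapq

data = [5, 1, 9, 3, 7, 2, 8]  # stand-in for an unordered result set

# "build the entire result set in order, then take the top n"
top3_full_sort = sorted(data)[:3]

# a heap-based shortcut avoids a full sort but still examines every row
top3_heap = heapq.nsmallest(3, data)
```

Both approaches must touch every row, which is why a TOP combined with ORDER BY on a non-indexed column cannot skip the scan.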

Frans, I see your quandary. Perhaps there should be a 'forceServerSideTop' switch, with documentation that if you say allowDuplicates = false you may not get X records.

However, I think there is a moot point here. My assumption would be that the query runs at the same speed whether you do an ORDER BY with or without the TOP. Since the data reader is closed once the first 100 records are pulled, there is no difference in traffic. Andrius, have you tested this in Query Analyzer?

Of course, I believe that if the user is so adamant about how the SQL is written, he could use a stored proc to deliver the data, right?

BOb

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 01-Sep-2005 10:48:12   

pilotboba wrote:

arschr wrote:

I think if you take the order by clause off (or order by a clustered index field) you'll skip the table scan. (Yes I just tested with a table with 500 million rows and got back the top 100 in under a second).

Correct. This issue isn't black and white as to when TOP is applied; it depends on the query: whether there is a covering index and/or whether you are ordering. At least for SQL Server, from the BOL:

If a SELECT statement that includes TOP also has an ORDER BY clause, the rows to be returned are selected from the ordered result set. The entire result set is built in the specified order and the top n rows in the ordered result set are returned.

Frans, I see your quandary. Perhaps there should be a 'forceServerSideTop' switch, with documentation that if you say allowDuplicates = false you may not get X records.

However, I think there is a moot point here. My assumption would be that the query runs at the same speed whether you do an ORDER BY with or without the TOP. Since the data reader is closed once the first 100 records are pulled, there is no difference in traffic.

That's indeed the idea. simple_smile

Frans Bouma | Lead developer LLBLGen Pro
hlesesne avatar
hlesesne
User
Posts: 47
Joined: 22-Jul-2004
# Posted on: 02-Sep-2005 06:36:28   

May I hop into this thread for a moment? I think it may answer some problems I have been having with DISTINCT being added.

I have a query that uses a LEFT JOIN on two fields in a single table and then checks for null. I believe that my null check will prevent any possibility of duplicates, but my resulting SQL looks like the following:

SELECT DISTINCT TOP 10 [dbo].[tblLeadCategory].[LeadID] AS [LeadID],[dbo].[tblCategoryLetter].[LetterID] AS [LetterID] 
    FROM (( [dbo].[tblCategoryLetter]  
        INNER JOIN [dbo].[tblLeadCategory]  ON  [dbo].[tblCategoryLetter].[CategoryID]=[dbo].[tblLeadCategory].[CategoryID]) 
        LEFT JOIN [dbo].[tblLeadLetter]  ON  [dbo].[tblCategoryLetter].[LetterID]=[dbo].[tblLeadLetter].[LetterID] AND ( [dbo].[tblLeadLetter].[LeadID] = [dbo].[tblLeadCategory].[LeadID])) 
    WHERE ( [dbo].[tblLeadCategory].[AddDate] + [dbo].[tblCategoryLetter].[LetterWait] <= GetDate() And [dbo].[tblLeadLetter].[LeadLetterID] IS NULL)

Please note that the GetDate() in the second to last line was inserted to replace a parameter.

The query takes a LONG time to execute (1 million + records in tblLeadCategory), but by removing the DISTINCT I get results in under 1 second.

By looking at the above code, is there any way around this limitation? There could be hundreds of thousands of rows returned if I exclude DISTINCT and TOP, so a client-side filter isn't really suitable.

Thanks for all the great discussion on this topic and I hope you can help. As it stands now, I may have to circumvent my DAL for this function in order to increase speed cry ...

Best regards,

Hal

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39614
Joined: 17-Aug-2003
# Posted on: 02-Sep-2005 11:11:39   

hlesesne wrote:

May I hop into this thread for a moment as I think it may answer some problems that I have been having with DISTINCT being added?

I have a query that uses a LEFT JOIN on two fields in a single table and then checks for null. I believe that my null check will prevent any possibility of duplicates, but my resulting SQL looks like the following:

SELECT DISTINCT TOP 10 [dbo].[tblLeadCategory].[LeadID] AS [LeadID],[dbo].[tblCategoryLetter].[LetterID] AS [LetterID] 
    FROM (( [dbo].[tblCategoryLetter]  
        INNER JOIN [dbo].[tblLeadCategory]  ON  [dbo].[tblCategoryLetter].[CategoryID]=[dbo].[tblLeadCategory].[CategoryID]) 
        LEFT JOIN [dbo].[tblLeadLetter]  ON  [dbo].[tblCategoryLetter].[LetterID]=[dbo].[tblLeadLetter].[LetterID] AND ( [dbo].[tblLeadLetter].[LeadID] = [dbo].[tblLeadCategory].[LeadID])) 
    WHERE ( [dbo].[tblLeadCategory].[AddDate] + [dbo].[tblCategoryLetter].[LetterWait] <= GetDate() And [dbo].[tblLeadLetter].[LeadLetterID] IS NULL)

Please note that the GetDate() in the second to last line was inserted to replace a parameter.

Only 1:1 relations don't result in duplicates; all other joins can potentially result in duplicate rows. Also, with m:1 relations, selecting the m side, you don't necessarily run into duplicates, but that's not determinable up front.

The query takes a LONG time to execute (1 million + records in tblLeadCategory), but by removing the DISTINCT I get results in under 1 second.

By looking at the above code, is there any way around this limitation? There could be hundreds of thousands of rows returned if I exclude DISTINCT and TOP, so a client-side filter isn't really suitable.

No, that's not true. The datareader will send small chunks of data to the client, and the server-side cursor is effectively closed once the last row to read has been read. If you create a dynamic list from your query and don't specify allowDuplicates = false (so specify true), you'll get client-side limitation; will that be faster?

Frans Bouma | Lead developer LLBLGen Pro