Top-N query

Term of “TOP-N” intuitively makes thinking what they query should run faster than just ORDERED query, as only N elements needed from it.
But, as well known, not always..

This blog describes method how using proper index would optimize TOP-N  query making it benefit from fact what only N rows needed.

This is how common case would look before optimization:

(for reference will call this Example 1)

CREATE TABLE EMPLOYEE
  (
    EMPLOYEE_ID NUMBER,
    BRANCH      VARCHAR2(100),
    DEPARTMENT  VARCHAR2(20),
    HIRE_DATE   DATE,
    SALARY      NUMBER
  );

CREATE INDEX IDX1_EMPLOYEE ON EMPLOYEE (BRANCH,DEPARTMENT);

And table has data distribution like:

SQL> select count(1) from employee where branch='US' and department='IT';

COUNT(1)
----------------------
818008

SQL> select count(1) from employee;

COUNT(1)
----------------------
4745976

TOP-N query is:

SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
order by salary
) where rownum<=3;

3 rows selected.

Elapsed: 00:00:03.42

Execution Plan
----------------------------------------------------------
Plan hash value: 2997571357

----------------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name          | Rows  | Bytes |TempSpc| Cost (%CPU)| Time       |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |               |     3 |   297 |       | 13754	 (1)| 00:02:46 |
|*  1 |  COUNT STOPKEY                       |               |       |       |       |              |          |
|   2 |   VIEW                               |               |   818K|    77M|       | 13754	 (1)| 00:02:46 |
|*  3 |    SORT ORDER BY STOPKEY             |               |   818K|    22M|    34M| 13754	 (1)| 00:02:46 |
|   4 |     TABLE ACCESS BY INDEX ROWID      | EMPLOYEE      |   818K|    22M|       |  7240	 (1)| 00:01:27 |
|*  5 |      INDEX RANGE SCAN                | IDX1_EMPLOYEE |   818K|       |       |  2728	 (1)| 00:00:33 |
----------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=3)    
   5 - access("BRANCH"='US' AND "DEPARTMENT"='IT')  

Statistics  ----------------------------------------------------------
        0  recursive calls
        0  db block gets
     5910  consistent gets
        0  physical reads 
        0  redo size  
      665  bytes sent via SQL*Net to client 
      364  bytes received via SQL*Net from client  
        2  SQL*Net roundtrips to/from client  
        1  sorts (memory)   
        0  sorts (disk)    
        3  rows processed   

“5910 consistent gets” – That is pretty expensive 3 rows..

“rownum<=3” helps here: query does not have to finish full sort (“SORT ORDER BY STOPKEY”), only 3 top elements required. And depending on sort algorithm been used it can be more or less help. But complete index range scan needs to be done in order to find top elements (“INDEX RANGE SCAN rows=818K”). Rownum does not help there. And its bigger overheat.

The ideal TOP-N optimization for the query would be: adding “ORDER BY” column to the end of the index:

 
SQL> CREATE INDEX IDX2_EMPLOYEE ON EMPLOYEE (BRANCH,DEPARTMENT,SALARY);

(for reference will call this Example 2)

SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
order by salary
) where rownum<=3;

3 rows selected.

Elapsed: 00:00:01.03

Execution Plan
----------------------------------------------------------
Plan hash value: 3266040937

-----------------------------------------------------------------------------------------------
| Id  | Operation                     | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |               |     3 |   297 |     5	(0)| 00:00:01 |
|*  1 |  COUNT STOPKEY                |               |       |       |            |          |
|   2 |   VIEW                        |               |     3 |   297 |     5	(0)| 00:00:01 |
|   3 |    TABLE ACCESS BY INDEX ROWID| EMPLOYEE      |   818K|    22M|     5	(0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | IDX2_EMPLOYEE |     3 |       |     3	(0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   4 - access("BRANCH"='US' AND "DEPARTMENT"='IT')

Statistics
----------------------------------------------------------
	  1  recursive calls
	  0  db block gets
	  7  consistent gets
	  2  physical reads
	  0  redo size
	665  bytes sent via SQL*Net to client
	364  bytes received via SQL*Net from client
	  2  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	  3  rows processed

This execution have several advantages comparing to previous:

  • In this execution plan “SORT” is not mentioned at all, because index is ordered structure.
  • Oracle just takes takes first 3 elements, relying on fact its already ordered,- it will be top entries. I.e. only doing Bounded Range scan is performed (INDEX RANGE SCAN Rows=3). It is huge saving, what is reflected in cost and gets. Consistent gets dropped to 7.

Pretty simple stuff.
Picture below shows Bounded INDEX RANGE SCAN stopped by rownum STOPKEY in visual form:

I would generalize conditions where index helps for TOP-N query:

  1. Having index what has leading columns as:
    • Zero or more columns mentioned in “WHERE” clause. If column is in WHERE condition but not part of index, that condition need to be have low selectivity.
    • “ORDER BY” column(s).
  2. Between 1st column in the index and “ORDER BY” column it should be no columns what are not mentioned in “WHERE” clause.
  3. Columns what stand before “ORDER BY” column in index, has to have “=” or “IN” condition in WHERE predicate. “IN” need additional handing (explained below).
    It actually can be condition in WHERE clause with “>” “<” “!=” “BETWEEN” (i.e. condition with indefinite set of possible combinations on the right), but columns mentioned in it should not be part of the index.

All that may sound a little too abstract. Here are some examples.

RE: “If column is in WHERE condition but not part of index, that condition need to be have low selectivity”

In the table I have 99% of employees was hired 10 years ago or less.
So here is query with low selectivity on HIRE_DATE. Hire date is not part of index:

(for reference will call this Example 3)

SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
  and hire_date>sysdate-365*10
order by salary
) where rownum<=3;

Execution Plan
----------------------------------------------------------
Plan hash value: 3266040937

-----------------------------------------------------------------------------------------------
| Id  | Operation                     | Name	      | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |               |     3 |   297 |     5	(0)| 00:00:01 |
|*  1 |  COUNT STOPKEY                |               |       |       |            |          |
|   2 |   VIEW                        |               |     3 |   297 |     5	(0)| 00:00:01 |
|*  3 |    TABLE ACCESS BY INDEX ROWID| EMPLOYEE      |   818K|   22M |     5	(0)| 00:00:01 |
|*  4 |     INDEX RANGE SCAN          | IDX2_EMPLOYEE |     3 |       |     3	(0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)    3 - filter("HIRE_DATE">SYSDATE@!-3650)
   4 - access("BRANCH"='US' AND "DEPARTMENT"='IT')

Statistics
----------------------------------------------------------
	  1  recursive calls
	  0  db block gets
	  7  consistent gets
	  0  physical reads
	  0  redo size
	665  bytes sent via SQL*Net to client
	364  bytes received via SQL*Net from client
	  2  SQL*Net roundtrips to/from client
	  0  sorts (memory)
	  0  sorts (disk)
	  3  rows processed

We see same effective execution as before: 7 consistent gets. Cost 5.
Oracle here use almost same plan as shown at example 2:
Do bounded RANGE SCAN. And yes, its not 100% guaranteed first 3 rows from index will be the ones query would return, as HIRE_DATE condition still needs to be checked (3 – filter(“HIRE_DATE”>SYSDATE@!-3650)). But 99% of employees were hired less than 10 years do, so 99% chance what HIRE_DATE condition will be positive. And even if not lucky and hit those remaining 1% and one of the records would fail on HIRE_DATE condition, – just get next element from index.

Now, in contrast – query with high selectivity on HIRE_DATE:

SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
  and hire_date<sysdate-365*10
order by salary
) where rownum<=3;
Execution Plan
----------------------------------------------------------
Plan hash value: 2997571357

------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |               |     3 |   297 |  7253	 (1)| 00:01:28 |
|*  1 |  COUNT STOPKEY                 |               |       |       |            |	       |
|   2 |   VIEW                         |               |    23 |  2277 |  7253	 (1)| 00:01:28 |
|*  3 |    SORT ORDER BY STOPKEY       |               |    23 |   667 |  7253	 (1)| 00:01:28 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| EMPLOYEE      |    23 |   667 |  7252	 (1)| 00:01:28 |
|*  5 |      INDEX RANGE SCAN          | IDX1_EMPLOYEE |   818K|       |  2728	 (1)| 00:00:33 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=3)
   4 - filter("HIRE_DATE"<SYSDATE@!-3650)
   5 - access("BRANCH"='US' AND "DEPARTMENT"='IT')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       5910  consistent gets
          0  physical reads
          0  redo size
        489  bytes sent via SQL*Net to client
        353  bytes received via SQL*Net from client
          1  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          3  rows processed

Plan from example 3 would not work here as “HIRE_DATE<sysdate-3650” would filter out 99% of rows which are gotten from index. And even the facts rows from index are ordered as needed, would not benefit much, as high count of rowid lookups on table (99% of which will fail on “HIRE_DATE<sysdate-3650”) would kill the deal.
So oracle does not use last column in the index for Sorting, but execute explicit sort “SORT ORDER BY STOPKEY” Index only used for WHERE predicates. Very similar plan to the one in example 1.

RE: “Columns what stand before “ORDER BY” column in index, has to have “=” or “IN” condition in WHERE clause…”

This would be representation of the case in scope of example I follow:

SQL> drop index IDX2_EMPLOYEE;

Index dropped.

Elapsed: 00:00:00.98
SQL> CREATE INDEX IDX3_EMPLOYEE ON EMPLOYEE (BRANCH,DEPARTMENT,hire_date,SALARY);

Index created.

SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
  and hire_date>sysdate-100
order by salary
) where rownum<=3

Execution Plan
----------------------------------------------------------
Plan hash value: 2997571357

------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name	       | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |               |     3 |   297 |  7254	 (1)| 00:01:28 |
|*  1 |  COUNT STOPKEY                 |               |       |       |            |          |
|   2 |   VIEW                         |               | 12880 |  1245K|  7254	 (1)| 00:01:28 |
|*  3 |    SORT ORDER BY STOPKEY       |               | 12880 |   364K|  7254	 (1)| 00:01:28 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| EMPLOYEE      | 12880 |   364K|  7252	 (1)| 00:01:28 |
|*  5 |      INDEX RANGE SCAN          | IDX1_EMPLOYEE |   818K|       |  2728	 (1)| 00:00:33 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=3)
   3 - filter(ROWNUM<=3)    4 - filter("HIRE_DATE">SYSDATE@!-100)
   5 - access("BRANCH"='US' AND "DEPARTMENT"='IT')

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
       5910  consistent gets
          0  physical reads
          0  redo size
        662  bytes sent via SQL*Net to client
        364  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
          3  rows processed

The plan again looks very similar to Example 1. Oracle does not use SORT of Salary from index, but doing explicit sort.

Picture below shows why.

test1

If single Bounded INDEX RANGE SCAN is used (showed by arrows), the row marked in Red will be missed.

From what shown on the picture – we can do 2 index range scans, and order union of the outputs.
Yes it would be case with “IN” condition. But remember here is “>” condition – i.e. not 2, but possible infinite combination of values, and so – infinite combination of index range scans. It will be for sure more expensive than just follow plan as in Example one, what Oracle did.

RE: “RE: Between 1st column in the index and “ORDER BY” column it should be no columns what are not mentioned in “WHERE” clause.”

This would be query for this case, where offended column in index is HIRE_DATE:

SQL> CREATE INDEX IDX3_EMPLOYEE ON EMPLOYEE (BRANCH,DEPARTMENT,hire_date,SALARY);

SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
order by salary
) where rownum<=3 

and query above is basically equivalent to following one:

 SQL> select  * from (
select * from employee a
where department='IT'
  and branch='US'
  and hire_date > sysdate-1000000 and hire_date < sysdate+1000000
order by salary
) where rownum<=3

And it would be exactly case as described in previous example.

There is a little special case with IN condition, as i mentioned earlier.
I will describe it in next blog.

Advertisements
This entry was posted in SQL and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s