Oracle、MySQL、PostgreSQL、OceanBase、万里开源数据库比较系列(十七): IN ( MAX Subquery )

在关系数据库中两表关联在oracle中使用IN( subquery)的语法很常见，但kevin发现在MySQL中subquery使用MAX聚合参数时，会导致主查询Full scan而无法使用索引范围扫描, 当遇到大表时可能性能下降明显，测试发现有的库使用子查询做为驱动表，有的是使用filter从主查询过滤。下面简单测试。

先看oracle

QL> CREATE TABLE employee1 (id int,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Table created.

SQL> insert into employee1 select rownum,'anbob'||level,'weejar','xxxxxxxxxxx','yyyyyyy' from dual connect by rownum<=100000; 100000 rows created. SQL> create table employee2 as select * from employee1;
Table created.

SQL> create index idx_emp_id1 on employee1(id);
Index created.

SQL> create index idx_emp_id2 on employee2(id);
Index created.

SQL> create index idx_emp_lname1 on employee1(lastname);
Index created.

SQL> create index idx_emp_lname2 on employee2(lastname);
Index created.

SQL> explain plan for select * from employee2 where id in(select max(id) from employee1 where lastname='anbob1');
Explained.

SQL> @x2

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Plan hash value: 616252811

---------------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |                |     1 |    43 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED   | EMPLOYEE2      |     1 |    43 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                     | IDX_EMP_ID2    |     1 |       |     1   (0)| 00:00:01 |
|   3 |    SORT AGGREGATE                      |                |     1 |    16 |            |          |
|   4 |     TABLE ACCESS BY INDEX ROWID BATCHED| EMPLOYEE1      |     1 |    16 |     2   (0)| 00:00:01 |
|*  5 |      INDEX RANGE SCAN                  | IDX_EMP_LNAME1 |     1 |       |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("ID"= (SELECT MAX("ID") FROM "EMPLOYEE1" "EMPLOYEE1" WHERE "LASTNAME"='anbob1'))
   5 - access("LASTNAME"='anbob1')

Note:
执行计划可阅读性非常棒，不解释。

再看Postgresql

-- 表结构与构建数据省略
weejar=# explain select id from employee2 where id in(select max(id) from employee1);
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1.76..10.31 rows=1 width=4)
   ->  Result  (cost=1.62..1.63 rows=1 width=4)
         InitPlan 1 (returns $0)
           ->  Limit  (cost=0.14..1.62 rows=1 width=4)
                 ->  Index Only Scan Backward using idx_emp_1 on employee1  (cost=0.14..44.66 rows=30 width=4)
                       Index Cond: (id IS NOT NULL)
   ->  Index Only Scan using employee2_pkey on employee2  (cost=0.14..8.15 rows=1 width=4)
         Index Cond: (id = ($0))
(8 行记录)

Note:
Postgresql也和oracle一样符合预期。

MySQL社区版v8.0

[root@oel7db1 ~]# mysql -uroot -pwww.anbob.com -h127.0.0.1 -P3306 anbob
mysql: [Warning] Using a password on the command line interface can be insecure.
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 8
Server version: 8.0.20-commercial MySQL Enterprise Server - Commercial

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MYSQL_root@127.0.0.1 [anbob]> use anbob;
Database changed

MYSQL_root@127.0.0.1 [anbob]> CREATE TABLE employee1 (id int,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.17 sec)

MYSQL_root@127.0.0.1 [anbob]> CREATE TABLE employee2 (id int primary key,LastName varchar(255),FirstName varchar(255),Address varchar(255),profile varchar(255));
Query OK, 0 rows affected (0.09 sec)

MYSQL_root@127.0.0.1 [anbob]> create index idx_emp_1 on employee1(id);
Query OK, 0 rows affected (0.09 sec)
Records: 0 Duplicates: 0 Warnings: 0

MYSQL_root@127.0.0.1 [anbob]> DELIMITER $$
MYSQL_root@127.0.0.1 [anbob]> CREATE PROCEDURE BulkInsert()
-> BEGIN
-> DECLARE i INT DEFAULT 1;
-> truncate table employee1;
-> truncate table employee2;
-> WHILE (i <= 20000)
 DO -> INSERT INTO employee1 (id, FirstName, Address) VALUES(i, CONCAT("user","-",i), CONCAT("address","-",i));
-> INSERT INTO employee2 (id,FirstName, Address) VALUES(i,CONCAT("user","-",i), CONCAT("address","-",i));
-> SET i = i+1;
-> END WHILE;
-> END
-> $$
Query OK, 0 rows affected (0.12 sec)

MYSQL_root@127.0.0.1 [anbob]> DELIMITER ;
MYSQL_root@127.0.0.1 [anbob]> call BulkInsert();
-- 吐槽， MYSQL这存储过程调用真是慢到怀疑人生。

MYSQL_root@127.0.0.1 [anbob]> explain select id from employee1 where id in(select max(id) from employee2) ;
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
| id | select_type        | table     | partitions | type  | possible_keys | key           | key_len | ref  | rows  | filtered | Extra                        |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
|  1 | PRIMARY            | employee1 | NULL       | index | NULL          | idx_emp_name1 | 1023    | NULL | 20160 |   100.00 | Using where; Using index     |
|  2 | DEPENDENT SUBQUERY | NULL      | NULL       | NULL  | NULL          | NULL          | NULL    | NULL |  NULL |     NULL | Select tables optimized away |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)


MYSQL_root@127.0.0.1 [anbob]> explain format=tree select id from employee1 where id in(select max(id) from employee2) \G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (employee1.id,(select #2))  (cost=2111.85 rows=20160)
    -> Index scan on employee1 using idx_emp_name1  (cost=2111.85 rows=20160)
    -> Select #2 (subquery in condition; dependent)
        -> Filter: ((employee1.id) = (max(employee2.id)))
            -> Rows fetched before execution

1 row in set (0.00 sec)


MYSQL_root@127.0.0.1 [anbob]> create index idx_emp_name1 on employee1(firstname);
MYSQL_root@127.0.0.1 [anbob]> create index idx_emp_name2 on employee2(firstname);

MYSQL_root@127.0.0.1 [anbob]> explain select id from employee1 where id in(select max(id) from employee2 where firstname='user-1');
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
| id | select_type        | table     | partitions | type  | possible_keys | key           | key_len | ref  | rows  | filtered | Extra                        |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
|  1 | PRIMARY            | employee1 | NULL       | index | NULL          | idx_emp_name1 | 1023    | NULL | 20160 |   100.00 | Using where; Using index     |
|  2 | DEPENDENT SUBQUERY | NULL      | NULL       | NULL  | NULL          | NULL          | NULL    | NULL |  NULL |     NULL | Select tables optimized away |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+------------------------------+
2 rows in set, 1 warning (0.00 sec)

-------------------------+
 -> Filter: (employee1.id,(select #2))  (cost=2111.85 rows=20160)
    -> Index scan on employee1 using idx_emp_name1  (cost=2111.85 rows=20160)
    -> Select #2 (subquery in condition; dependent)
        -> Filter: ((employee1.id) = (max(employee2.id)))
            -> Rows fetched before execution
+---------------------------------------------------------


MYSQL_root@127.0.0.1 [anbob]> explain select * from employee1 where id in(select max(id) from employee2 where firstname IN('user-1','user-2')) ;
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+--------------------------+
| id | select_type | table     | partitions | type  | possible_keys | key           | key_len | ref  | rows  | filtered | Extra                    |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+--------------------------+
|  1 | PRIMARY     | employee1 | NULL       | ALL   | NULL          | NULL          | NULL    | NULL | 20160 |   100.00 | Using where              |
|  2 | SUBQUERY    | employee2 | NULL       | range | idx_emp_name2 | idx_emp_name2 | 1023    | NULL |     2 |   100.00 | Using where; Using index |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+-------+----------+--------------------------+
2 rows in set, 1 warning (0.00 sec)

MYSQL_root@127.0.0.1 [anbob]> explain select * from employee1 where id in(select id from employee2 where firstname IN('user-1','user-2')) ;
+----+-------------+-----------+------------+--------+-----------------------+---------------+---------+--------------------+------+----------+--------------------------+
| id | select_type | table     | partitions | type   | possible_keys         | key           | key_len | ref                | rows | filtered | Extra                    |
+----+-------------+-----------+------------+--------+-----------------------+---------------+---------+--------------------+------+----------+--------------------------+
|  1 | SIMPLE      | employee2 | NULL       | range  | PRIMARY,idx_emp_name2 | idx_emp_name2 | 1023    | NULL               |    2 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | employee1 | NULL       | eq_ref | PRIMARY               | PRIMARY       | 4       | anbob.employee2.id |    1 |   100.00 | NULL                     |
+----+-------------+-----------+------------+--------+-----------------------+---------------+---------+--------------------+------+----------+--------------------------+
2 rows in set, 1 warning (0.01 sec)


-- 如果没有聚合函数MAX ，是用子查询做为驱动表。同时emp1 表也使用上了索引.

MYSQL_root@127.0.0.1 [anbob]> explain  analyze select * from employee1 
    where id in(select max(id) from employee2 where firstname IN('user-1','user-2')) \G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (employee1.id,employee1.id in (select #2))  (cost=2111.85 rows=20160) (actual time=0.086..65.882 rows=1 loops=1)
    -> Table scan on employee1  (cost=2111.85 rows=20160) (actual time=0.036..26.958 rows=20000 loops=1)
    -> Select #2 (subquery in condition; run only once)
        -> Filter: ((employee1.id = ``.`max(id)`))  (actual time=0.001..0.001 rows=0 loops=20001)
            -> Limit: 1 row(s)  (actual time=0.001..0.001 rows=0 loops=20001)
                -> Index lookup on  using  (max(id)=employee1.id)  (actual time=0.001..0.001 rows=0 loops=20001)
                    -> Materialize with deduplication  (actual time=0.001..0.001 rows=0 loops=20001)
                        -> Aggregate: max(employee2.id)  (actual time=0.032..0.032 rows=1 loops=1)
                            -> Filter: (employee2.FirstName in ('user-1','user-2'))  (cost=1.50 rows=2) (actual time=0.019..0.028 rows=2 loops=1)
                                -> Index range scan on employee2 using idx_emp_name2  (cost=1.50 rows=2) (actual time=0.017..0.026 rows=2 loops=1)

1 row in set (0.07 sec)

MYSQL_root@127.0.0.1 [anbob]> explain  analyze select * from employee1 
     where id in(select id from employee2 where firstname IN('user-1','user-2'))  \G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join  (cost=3.68 rows=2) (actual time=0.028..0.040 rows=2 loops=1)
    -> Filter: (employee2.FirstName in ('user-1','user-2'))  (cost=1.50 rows=2) (actual time=0.018..0.026 rows=2 loops=1)
        -> Index range scan on employee2 using idx_emp_name2  (cost=1.50 rows=2) (actual time=0.016..0.024 rows=2 loops=1)
    -> Single-row index lookup on employee1 using PRIMARY (id=employee2.id)  (cost=1.04 rows=1) (actual time=0.006..0.006 rows=1 loops=2)

1 row in set (0.01 sec)

Note:
MySQL中子查询使用了MAX聚合函数，导致主查询全表扫，如果表较大时，性能会变化明显。

从MySQL 8.0.21开始再可以支持子查询做为驱动表。 SET SESSION optimizer_switch=”subquery_to_derived=ON”;

MYSQL_root@127.0.0.1 [anbob]> show variables like '%optimizer_switch%'\G;
*************************** 1. row ***************************
Variable_name: optimizer_switch
        Value: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,use_invisible_indexes=off,skip_scan=on,hash_join=on
1 row in set (0.00 sec)

而之前只能改写SQL使用join 
MYSQL_root@127.0.0.1 [anbob]> explain
    -> select * from
    -> employee1 e1
    -> join employee2 e2 on e2.id=e1.id
    -> where e2.firstname IN('user-1','user-2');
+----+-------------+-------+------------+--------+-----------------------+---------------+---------+-------------+------+----------+-----------------------+
| id | select_type | table | partitions | type   | possible_keys         | key           | key_len | ref         | rows | filtered | Extra                 |
+----+-------------+-------+------------+--------+-----------------------+---------------+---------+-------------+------+----------+-----------------------+
|  1 | SIMPLE      | e2    | NULL       | range  | PRIMARY,idx_emp_name2 | idx_emp_name2 | 1023    | NULL        |    2 |   100.00 | Using index condition |
|  1 | SIMPLE      | e1    | NULL       | eq_ref | PRIMARY               | PRIMARY       | 4       | anbob.e2.id |    1 |   100.00 | NULL                  |
+----+-------------+-------+------------+--------+-----------------------+---------------+---------+-------------+------+----------+-----------------------+
2 rows in set, 1 warning (0.00 sec)

Note：
在MYSQL中还是老老实时的单表查询吧，优化器有点弱，相比oracle差的太远。

看一下OceanBase for MySQL模式.

-- 构建数据同MYSQL完全一样。

obclient [test]> explain select id from employee1 where id in(select max(id) from employee2) \G
*************************** 1. row ***************************
Query Plan: ==========================================================
|ID|OPERATOR         |NAME                |EST. ROWS|COST|
----------------------------------------------------------
|0 |NESTED-LOOP JOIN |                    |199      |169 |
|1 | SUBPLAN SCAN    |VIEW1               |1        |46  |
|2 |  SCALAR GROUP BY|                    |1        |46  |
|3 |   SUBPLAN SCAN  |VIEW2               |1        |46  |
|4 |    TABLE SCAN   |employee2(Reverse)  |1        |46  |
|5 | TABLE SCAN      |employee1(idx_emp_1)|199      |77  |
==========================================================

Outputs & filters:
-------------------------------------
  0 - output([employee1.id]), filter(nil),
      conds(nil), nl_params_([VIEW1.max(id)])
  1 - output([VIEW1.max(id)]), filter(nil),
      access([VIEW1.max(id)])
  2 - output([T_FUN_MAX(VIEW2.id)]), filter(nil),
      group(nil), agg_func([T_FUN_MAX(VIEW2.id)])
  3 - output([VIEW2.id]), filter(nil),
      access([VIEW2.id])
  4 - output([employee2.id]), filter(nil),
      access([employee2.id]), partitions(p0),
      limit(1), offset(nil)
  5 - output([employee1.id]), filter(nil),
      access([employee1.id]), partitions(p0)

obclient [test]> explain select * from employee2 where id in(select max(id) from employee1 where firstname='user-1') \G
*************************** 1. row ***************************
Query Plan: ==============================================================
|ID|OPERATOR         |NAME                    |EST. ROWS|COST|
--------------------------------------------------------------
|0 |NESTED-LOOP JOIN |                        |1        |114 |
|1 | SUBPLAN SCAN    |VIEW1                   |1        |92  |
|2 |  SCALAR GROUP BY|                        |1        |92  |
|3 |   SUBPLAN SCAN  |VIEW2                   |1        |92  |
|4 |    LIMIT        |                        |1        |92  |
|5 |     TOP-N SORT  |                        |1        |92  |
|6 |      TABLE SCAN |employee1(idx_emp_name1)|1        |92  |
|7 | TABLE GET       |employee2               |1        |22  |
==============================================================

Outputs & filters:
-------------------------------------
  0 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
      conds(nil), nl_params_([VIEW1.max(id)])
  1 - output([VIEW1.max(id)]), filter(nil),
      access([VIEW1.max(id)])
  2 - output([T_FUN_MAX(VIEW2.id)]), filter(nil),
      group(nil), agg_func([T_FUN_MAX(VIEW2.id)])
  3 - output([VIEW2.id]), filter(nil),
      access([VIEW2.id])
  4 - output([employee1.id]), filter(nil), limit(1), offset(nil)
  5 - output([employee1.id]), filter(nil), sort_keys([employee1.id, DESC]), topn(1)
  6 - output([employee1.id]), filter([(T_OP_IS_NOT, employee1.id, NULL, 0)]),
      access([employee1.id]), partitions(p0)
  7 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
      access([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), partitions(p0)

obclient [test]> explain select * from employee2 where id in(select id from employee1 where firstname='user-1') \G
*************************** 1. row ***************************
Query Plan: =============================================================
|ID|OPERATOR        |NAME                    |EST. ROWS|COST|
-------------------------------------------------------------
|0 |NESTED-LOOP JOIN|                        |1        |114 |
|1 | SUBPLAN SCAN   |VIEW2                   |1        |92  |
|2 |  MERGE DISTINCT|                        |1        |92  |
|3 |   SORT         |                        |1        |92  |
|4 |    TABLE SCAN  |employee1(idx_emp_name1)|1        |92  |
|5 | TABLE GET      |employee2               |1        |22  |
=============================================================

Outputs & filters:
-------------------------------------
  0 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
      conds(nil), nl_params_([VIEW2.VIEW1.id])
  1 - output([VIEW2.VIEW1.id]), filter(nil),
      access([VIEW2.VIEW1.id])
  2 - output([employee1.id]), filter(nil),
      distinct([employee1.id])
  3 - output([employee1.id]), filter(nil), sort_keys([employee1.id, ASC])
  4 - output([employee1.id]), filter(nil),
      access([employee1.id]), partitions(p0)
  5 - output([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), filter(nil),
      access([employee2.id], [employee2.LastName], [employee2.FirstName], [employee2.Address], [employee2.profile]), partitions(p0)

Note:
在OB for mysql中也是支持子查询做为驱动表的。在for oracle租户执行计划是一样的。

GreatDB万里开源数据库

GreatDB Cluster[test]> explain format=tree select id from employee1 where id in(select max(id) from employee2 where firstname='user-1');
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| -> Filter: (employee1.id,(select #2))  (cost=14.99 rows=111)
    -> Index scan on employee1 using idx_emp_name1 {[datanode3:datanode3_1]: SELECT `id`, `_hidden_pk_` FROM `test`.`employee1`;}  (cost=14.99 rows=111)
    -> Select #2 (subquery in condition; dependent)
        -> Filter: ((employee1.id) = (max(employee2.id)))
            -> Rows fetched before execution  (cost=0.00..0.00 rows=1)
 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

GreatDB Cluster[test]> explain select id from employee1 where id in(select max(id) from employee2 where firstname='user-1');
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------+
| id | select_type        | table     | partitions | type  | possible_keys | key           | key_len | ref  | rows | filtered | Extra                        |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------+
|  1 | PRIMARY            | employee1 | NULL       | index | NULL          | idx_emp_name1 | 1023    | NULL |  111 |   100.00 | Using where; Using index     |
|  2 | DEPENDENT SUBQUERY | NULL      | NULL       | NULL  | NULL          | NULL          | NULL    | NULL | NULL |     NULL | Select tables optimized away |
+----+--------------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+------------------------------+
2 rows in set, 1 warning (0.01 sec)

GreatDB Cluster[test]> explain  select * from employee1 where id in(select max(id) from employee2 where firstname IN('user-1','user-2')) ;
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+---------------------------------------------------------------------------------------------+
| id | select_type | table     | partitions | type  | possible_keys | key           | key_len | ref  | rows | filtered | Extra                                                                                       |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+---------------------------------------------------------------------------------------------+
|  1 | PRIMARY     | employee1 | NULL       | ALL   | NULL          | NULL          | NULL    | NULL |  111 |   100.00 | Using where                                                                                 |
|  2 | SUBQUERY    | employee2 | NULL       | range | idx_emp_name2 | idx_emp_name2 | 1023    | NULL |   20 |   100.00 | Using pushed condition (`test`.`employee2`.`FirstName` in ('user-1','user-2')); Using index |
+----+-------------+-----------+------------+-------+---------------+---------------+---------+------+------+----------+---------------------------------------------------------------------------------------------+
2 rows in set, 1 warning (0.00 sec)

GreatDB Cluster[test]>

Note:
万里数据库并没有把子查询做为驱动表，仿佛把subquery的值以filter的形式传给了主查询employee1 ,并且使用的索引，与oracle的执行路径接近，而不是像mysql改写后或pg一样子查询驱动的方式。

References

https://www.modb.pro/db/1730045837189931008