Channel: Jimmy He – OracleBlog

How to find the statements behind runaway WAL growth in PostgreSQL


A long time ago I wrote an article, "How to find the process generating archive logs like crazy", about locating the session responsible for runaway archive generation in an Oracle database. Today we ran into a similar problem in a PostgreSQL database.

For a period of time, WAL was growing at a furious rate of roughly 1 GB per minute. The xlog segments were being copied off to the archive so aggressively that they were switched to the archive directory before streaming replication could ship them to the standby, and the primary and standby became disconnected.

After diagnosing the problem together with the developers, we found the cause: an UPDATE statement that modified a huge number of rows, each run updating over 2 million of the table's 10-million-plus rows. Ten of the table's 14 columns were indexed, and the updates were non-HOT. The statement ran once an hour, with 12 similar statements per run. The developers added an extra filter to the WHERE clause, cutting the rows updated from over 2 million down to just a few, which solved the problem.
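A rough back-of-the-envelope model (my own sketch, not a measurement from the incident) shows why non-HOT updates on a heavily indexed table amplify writes so badly: a non-HOT update creates a new heap tuple and therefore a new entry in every index on the table.

```python
# Toy model of index write amplification for non-HOT UPDATEs in PostgreSQL.
# Each non-HOT-updated row inserts one new entry per index; the real WAL
# volume also includes heap tuples and full-page images, so this is a floor.

def index_entry_writes(rows_updated: int, index_count: int) -> int:
    """Index entries written by a batch of non-HOT updates (heap excluded)."""
    return rows_updated * index_count

# Roughly the incident above: ~2 million rows updated, 10 indexed columns.
print(index_entry_writes(2_000_000, 10))  # 20000000 extra index entries per run
```

With 12 such statements running every hour, it is easy to see how this adds up to about 1 GB of WAL per minute.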

Afterwards I kept wondering: without help from the developers, could a DBA find the problem, and the offending statement, from information inside the database itself? I got the answer by chance while chatting with Liang Hai'an of Ping An Technology.

In Oracle, excessive archiving is caused by overly frequent DML statements, and the same holds in PostgreSQL. The difference is that Oracle exposes per-session redo size in v$sesstat, whereas PostgreSQL's pg_stat_activity only shows session information and carries no per-statement WAL figures. But since WAL is likewise produced by heavy DML, we can look in pg_catalog.pg_stat_all_tables for rapidly changing tuple counters (n_tup_ins, n_tup_upd, n_tup_del; mainly updates in our case) and work back to the statements generating all that DML.

The approach is as follows:

1. Enable DML auditing in PostgreSQL by setting log_statement='mod' in postgresql.conf.

2. Snapshot pg_catalog.pg_stat_all_tables at one point in time:

create table orasup1 as
select date_trunc('second',now()) as sample_time,schemaname,relname,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd from pg_catalog.pg_stat_all_tables;

3. Snapshot pg_catalog.pg_stat_all_tables again at a later point:

create table orasup2 as
select date_trunc('second',now()) as sample_time,schemaname,relname,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd from pg_catalog.pg_stat_all_tables;

4. Check which objects received the most DML per unit of time:

select t2.schemaname,t2.relname,
 (t2.n_tup_ins-t1.n_tup_ins) as delta_ins,
 (t2.n_tup_upd-t1.n_tup_upd) as delta_upd,
 (t2.n_tup_del-t1.n_tup_del) as delta_del,
(t2.n_tup_ins+t2.n_tup_upd+t2.n_tup_del-t1.n_tup_ins-t1.n_tup_upd-t1.n_tup_del) as del_dml,
(EXTRACT (EPOCH FROM  t2.sample_time::timestamp )::float-EXTRACT (EPOCH FROM  t1.sample_time::timestamp )::float) as delta_second,
round(cast((t2.n_tup_ins+t2.n_tup_upd+t2.n_tup_del-t1.n_tup_ins-t1.n_tup_upd-t1.n_tup_del)/(EXTRACT (EPOCH FROM  t2.sample_time::timestamp )::float-EXTRACT (EPOCH FROM  t1.sample_time::timestamp )::float)as numeric),2) as delta_dml_per_sec
from  orasup2 t2, orasup1 t1
where t2.schemaname=t1.schemaname and t2.relname=t1.relname
order by delta_dml_per_sec desc limit 10;
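The delta arithmetic in the query above can be re-implemented in a few lines of Python (for illustration only; the column names mirror pg_stat_all_tables, and the sample values are made up):

```python
# Compute the per-second DML rate for one table from two snapshots,
# exactly as the SQL above does: sum the tuple-counter deltas and divide
# by the elapsed seconds between the two sample times.

from datetime import datetime

def dml_rate(t1: dict, t2: dict) -> float:
    """DML operations per second between two snapshots of one table."""
    delta_dml = sum(t2[k] - t1[k] for k in ("n_tup_ins", "n_tup_upd", "n_tup_del"))
    delta_sec = (t2["sample_time"] - t1["sample_time"]).total_seconds()
    return round(delta_dml / delta_sec, 2)

# Hypothetical snapshots taken one minute apart:
snap1 = {"sample_time": datetime(2018, 1, 1, 10, 0, 0),
         "n_tup_ins": 100, "n_tup_upd": 5_000, "n_tup_del": 0}
snap2 = {"sample_time": datetime(2018, 1, 1, 10, 1, 0),
         "n_tup_ins": 100, "n_tup_upd": 125_000, "n_tup_del": 0}

print(dml_rate(snap1, snap2))  # 2000.0
```

A table updating at thousands of tuples per second is a prime suspect for the WAL growth.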

5. Now that we have the objects with the most DML, combining this with the audit log from step 1 gives us the corresponding statements.

6. Clean up the battlefield: drop table orasup1; drop table orasup2; and restore the audit level to log_statement='ddl'.


The impact of more than 20 indexes on a remote database table


Yesterday a colleague attended a seminar where a case was mentioned. A query against a remote database over a dblink had always been fast, but after an index was added on the remote database, the query suddenly became slow.

Analysis showed that the dblink query had been using an index when it queried the remote database. But once the added index brought the remote table past 20 indexes, the first index ever created on the table was ignored; since the query happened to rely on that first index, once it was ignored the only option left was a full table scan.

Hearing this case, I looked it up. The Oracle documentation on Managing a Distributed Database contains this passage:

Several performance restrictions relate to access of remote objects:

Remote views do not have statistical data.
Queries on partitioned tables may not be optimized.
No more than 20 indexes are considered for a remote table.
No more than 20 columns are used for a composite index.

That is, if a remote table has more than 20 indexes, the extra ones are not considered. This passage has been in the documentation since Oracle 9i and is still present in 12.2.

So with more than 20 indexes, is it the new index that gets ignored, or an old one? And how do we make Oracle notice an ignored index again? Let's test it.
(The tests use a 12.1.0.2 remote database and a 12.2.0.1 local database. If the test procedure doesn't interest you, skip straight to the "Summary" at the end.)

(1) Initialize the test tables:

-- Create the remote table:
DROP TABLE t_remote;

CREATE TABLE t_remote (
col01 NUMBER,
col02 NUMBER,
col03 VARCHAR2(50),
col04 NUMBER,
col05 NUMBER,
col06 VARCHAR2(50),
col07 NUMBER,
col08 NUMBER,
col09 VARCHAR2(50),
col10 NUMBER,
col11 NUMBER,
col12 VARCHAR2(50),
col13 NUMBER,
col14 NUMBER,
col15 VARCHAR2(50),
col16 NUMBER,
col17 NUMBER,
col18 VARCHAR2(50),
col19 NUMBER,
col20 NUMBER,
col21 VARCHAR2(50),
col22 NUMBER,
col23 NUMBER,
col24 VARCHAR2(50),
col25 NUMBER,
col26 NUMBER,
col27 VARCHAR2(50)
);


alter table t_remote modify (col01 not null);

INSERT INTO t_remote
SELECT
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*')
FROM dual
CONNECT BY level <= 10000;

commit;


create unique index t_remote_i01_pk on t_remote (col01);
alter table t_remote add (constraint t_remote_i01_pk primary key (col01) using index t_remote_i01_pk);

create index t_remote_i02 on t_remote (col02);
create index t_remote_i03 on t_remote (col03);
create index t_remote_i04 on t_remote (col04);
create index t_remote_i05 on t_remote (col05);
create index t_remote_i06 on t_remote (col06);
create index t_remote_i07 on t_remote (col07);
create index t_remote_i08 on t_remote (col08);
create index t_remote_i09 on t_remote (col09);
create index t_remote_i10 on t_remote (col10);
create index t_remote_i11 on t_remote (col11);
create index t_remote_i12 on t_remote (col12);
create index t_remote_i13 on t_remote (col13);
create index t_remote_i14 on t_remote (col14);
create index t_remote_i15 on t_remote (col15);
create index t_remote_i16 on t_remote (col16);
create index t_remote_i17 on t_remote (col17);
create index t_remote_i18 on t_remote (col18);
create index t_remote_i19 on t_remote (col19);
create index t_remote_i20 on t_remote (col20);

exec dbms_stats.gather_table_stats(user,'T_REMOTE');

-- Create the local table:
drop table t_local;

CREATE TABLE t_local (
col01 NUMBER,
col02 NUMBER,
col03 VARCHAR2(50),
col04 NUMBER,
col05 NUMBER,
col06 VARCHAR2(50)
);

INSERT INTO t_local
SELECT
rownum, rownum, rpad('*',50,'*'),
rownum, rownum, rpad('*',50,'*')
FROM dual
CONNECT BY level <= 50;

COMMIT;

create index t_local_i01 on t_local (col01);
create index t_local_i02 on t_local (col02);
create index t_local_i03 on t_local (col03);
create index t_local_i04 on t_local (col04);
create index t_local_i05 on t_local (col05);
create index t_local_i06 on t_local (col06);

exec dbms_stats.gather_table_stats(user,'t_local');


create database link dblink_remote CONNECT TO test IDENTIFIED BY test USING 'ora121';


SQL> select host_name from v$instance@dblink_remote;

HOST_NAME
----------------------------------------------------------------
testdb2

SQL> select host_name from v$instance;

HOST_NAME
----------------------------------------------------------------
testdb10

SQL>

As you can see, the remote table has 27 columns, with indexes so far on only the first 20, and the first column is the primary key. The local table has 6 columns, all of them indexed.

(2) Round one: the remote table has 20 indexes.
Test case 1:
With 20 indexes on the remote table, join the local table and the remote table on the first column of each:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col01=r.col01
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |    53 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |    53   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     1   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL01"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>
-- Note the WHERE :1="COL01" here: it is precisely this condition that lets the remote side use the primary key instead of a full table scan. Let's take this statement to the remote database and run it there.

On the remote database:
SQL> explain plan for
  2  SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL01";

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 829680338

-----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                 |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| T_REMOTE        |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | T_REMOTE_I01_PK |     1 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL01"=TO_NUMBER(:1))

14 rows selected.

We can see that the remote table's execution plan uses the primary key.

Test case 2:
With 20 indexes on the remote table, join the local table's first column to the remote table's 20th column:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col01=r.col20
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

On the remote database:
PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

We can see that the remote table's execution plan uses an index range scan.

Test case 3:
With 20 indexes on the remote table, join the local table's 2nd column to the remote table's 2nd column:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col02=r.col02
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

On the remote database:
SQL> explain plan for
  2  SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 2505594687

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I02 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL02"=TO_NUMBER(:1))

14 rows selected.

SQL>

We can see that the remote table's execution plan uses an index range scan.

Test case 4:
With 20 indexes on the remote table, join the local table's 2nd column to the remote table's 20th column:

select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
from t_local l, t_remote@dblink_remote r
where l.col02=r.col20
;

select * from table( dbms_xplan.display_cursor(null, null, 'typical LAST') );
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

On the remote database:
SQL> explain plan for
  2  SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

We can see that the remote table's execution plan uses an index range scan.

(3) Create the 21st index:

create index t_remote_i21 on t_remote (col21);
exec dbms_stats.gather_table_stats(user,'T_REMOTE');

(4) The remote table now has 21 indexes; repeat the four tests above:

Test case 1:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL01"="R"."COL01")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

--Note that the earlier WHERE :1="COL01" is gone. Even without taking the statement to the remote side, we can guess it is now a full table scan.

On the remote database:
SQL> explain plan for
  2  SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 4187688566

------------------------------------------------------------------------------
| Id  | Operation         | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |          | 10000 |   615K|   238   (0)| 00:00:01 |
|   1 |  TABLE ACCESS FULL| T_REMOTE | 10000 |   615K|   238   (0)| 00:00:01 |
------------------------------------------------------------------------------

8 rows selected.

SQL>

We can see that when the join condition is the remote table's first column, the index on that column is ignored and the execution plan chooses a full table scan.

Test case 2:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

On the remote database:
SQL> explain plan for
  2  SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

We can see that when the join condition is the remote table's 20th column, the index on that 20th column is not ignored and the execution plan uses it.

Test case 3:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

On the remote database:
SQL> explain plan for
  2  SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 2505594687

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I02 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL02"=TO_NUMBER(:1))

14 rows selected.

SQL>

We can see that when the join condition is the remote table's 2nd column, the index on that 2nd column is not ignored and the execution plan uses it.

Test case 4:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

On the remote database:
SQL> explain plan for
  2  SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20";

Explained.

SQL> select * from table(dbms_xplan.display());

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
Plan hash value: 3993494813

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |     1 |    63 |     2   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | T_REMOTE_I20 |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

PLAN_TABLE_OUTPUT
------------------------------------------------------------------------------------------------------------
---------------------------------------------------

   2 - access("COL20"=TO_NUMBER(:1))

14 rows selected.

SQL>

We can see that when the join condition is the remote table's 20th column, the index on that 20th column is not ignored and the execution plan uses it.

What we can conclude so far: once the 21st index is created on the remote table, a dblink join between the local and remote tables whose condition uses the column of the remote table's first-created index will ignore that index and do a full table scan. A join on the column of the second-created index is unaffected.
It looks as if there is a sliding window of 20 usable indexes: when the 21st is created, the 1st is simply no longer seen.
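The hypothesis can be stated as a toy model (an assumption drawn from this test, not documented Oracle internals): the remote optimizer appears to consider only the 20 most recently created indexes, in creation order.

```python
# Toy model of the hypothesized behavior: only the newest 20 indexes
# (by creation order) on a remote table are considered over a dblink.

MAX_REMOTE_INDEXES = 20  # the limit stated in the Oracle distributed-database docs

def considered_indexes(creation_order: list) -> list:
    """Indexes the remote optimizer still considers: the last 20 created."""
    return creation_order[-MAX_REMOTE_INDEXES:]

# t_remote_i01 .. t_remote_i21, in creation order, as in the test above:
indexes = [f"t_remote_i{n:02d}" for n in range(1, 22)]
live = considered_indexes(indexes)
print("t_remote_i01" in live)  # False: creating the 21st pushed the 1st out
print("t_remote_i02" in live)  # True: the 2nd index is still considered
```

The next round of tests, with a 22nd index, checks whether the window really slides this way.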


(5) Create the 22nd index and check whether the guess above holds.

create index t_remote_i22 on t_remote (col22);
exec dbms_stats.gather_table_stats(user,'T_REMOTE');

(6) The remote table now has 22 indexes; repeat the four tests above:

Test case 1:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL01"="R"."COL01")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

Test case 2:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

Test case 3:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL02"="R"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

Test scenario 4:

PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 2
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>


The tests above do confirm our guess: for a remote table accessed through a dblink, Oracle is only aware of the columns of the 20 most recently created indexes. The window of recognized indexes is 20 wide; as soon as a new index is created, the oldest one is ignored.
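To see which indexes currently fall inside that 20-index window, you can list the remote table's indexes by creation time from the data dictionary. A minimal sketch, run on the remote database (T_REMOTE comes from this test; adapt the table name to your own objects):

```sql
-- Run on the remote database: list T_REMOTE's indexes, newest first.
-- Per the behavior observed above, indexes beyond the 20 most recently
-- created ones are the candidates a dblink join will ignore.
SELECT o.object_name, o.created
FROM   user_objects o
JOIN   user_indexes i ON i.index_name = o.object_name
WHERE  o.object_type = 'INDEX'
AND    i.table_name  = 'T_REMOTE'
ORDER  BY o.created DESC;
```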

(7) Let's try rebuilding the index to see whether it has any effect:
Rebuild the 2nd index:

alter index t_remote_i02 rebuild;
exec dbms_stats.gather_table_stats(user,'T_REMOTE');


(8) After rebuilding the 2nd index, repeat the 4 tests above:

--Test scenario 1:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  04schqc3d9rgm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col01

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL01"="R"."COL01")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL01","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

--Test scenario 2:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  5rwtbwcnv0tsm, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col01=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>


--Test scenario 3:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   156 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL02"="R"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>


--Test scenario 4:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

So we can see that rebuilding the index does not "reawaken" it.
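This is consistent with what the data dictionary shows: ALTER INDEX ... REBUILD updates an object's LAST_DDL_TIME but leaves its CREATED timestamp unchanged, so the index's position in the creation-time window does not move. A quick check, as a sketch:

```sql
-- Before the rebuild: note CREATED and LAST_DDL_TIME.
SELECT object_name, created, last_ddl_time
FROM   user_objects
WHERE  object_name = 'T_REMOTE_I02';

ALTER INDEX t_remote_i02 REBUILD;

-- After the rebuild: CREATED is unchanged; only LAST_DDL_TIME moves forward.
SELECT object_name, created, last_ddl_time
FROM   user_objects
WHERE  object_name = 'T_REMOTE_I02';
```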


(9) Let's try to drop and recreate the 2nd index.

drop index t_remote_i02;
create index t_remote_i02 on t_remote (col02);

exec dbms_stats.gather_table_stats(user,'T_REMOTE');


(10) Repeat tests 3 and 4 above:

Test 3:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  81ctrx5huhfvq, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col02

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL02"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

Test 4:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  407pxjh9mgbry, child number 1
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col02=r.col20

Plan hash value: 631452043

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   103 (100)|          |        |      |
|   1 |  NESTED LOOPS      |          |    50 |  6300 |   103   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE |     1 |    66 |     2   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL20","COL25","COL26","COL27" FROM "T_REMOTE" "R" WHERE :1="COL20"
       (accessing 'DBLINK_REMOTE' )



23 rows selected.

SQL>

At this point we can actually predict that the index on col03 of the remote table will not be used. Let's verify with a test:
Test 5:
PLAN_TABLE_OUTPUT
---------------------------------------------------------------------------------------------------------
SQL_ID  bhkczcfrhvsuw, child number 0
-------------------------------------
select l.col06,l.col05,l.col04,r.col27, r.col26,r.col25 from t_local l,
t_remote@dblink_remote r where l.col03=r.col03

Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |       |       |   157 (100)|          |        |      |
|*  1 |  HASH JOIN         |          |   500K|    89M|   157   (1)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  5400 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   781K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL03"="R"."COL03")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL03","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



28 rows selected.

SQL>

We can see that dropping and then recreating the index does "reawaken" the 2nd index. This also proves that the moving 20-index recognition window slides by index creation time.

In summary:

1. For a join between a local table and a remote table over a dblink, if the remote table has no more than 20 indexes, nothing is affected.
2. If the number of indexes on the remote table grows to 21 or more, Oracle ignores the earliest-created index when executing the remote operation. The window slides over 20 indexes: the newest index is recognized and the oldest falls out. If the join condition uses the column of that earliest index, the query falls back to a full table scan because the index is ignored.
3. To "reawaken" awareness of a forgotten index, rebuilding it has no effect; you must drop & create the index.
4. When the local table is small, the remote table is large with more than 20 indexes, and the join column belongs to the earliest index, consider the DRIVING_SITE hint to push the local rows to the remote site, where the join can then see that index (see the example at the end). Whether to use the hint means weighing the cost of shipping the whole local table to the remote site against the cost of a remote full table scan.

Appendix: with 22 indexes in place, try the DRIVING_SITE hint:

SQL> select  l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
  2  from t_local l, t_remote@dblink_remote r
  3  where l.col02=r.col02
  4  ;

50 rows selected.

Elapsed: 00:00:00.03

Execution Plan
----------------------------------------------------------
Plan hash value: 830255788

-----------------------------------------------------------------------------------------------
| Id  | Operation          | Name     | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|*  1 |  HASH JOIN         |          |    50 |  6300 |   156   (0)| 00:00:01 |        |      |
|   2 |   TABLE ACCESS FULL| T_LOCAL  |    50 |  3000 |     3   (0)| 00:00:01 |        |      |
|   3 |   REMOTE           | T_REMOTE | 10000 |   644K|   153   (0)| 00:00:01 | DBLIN~ | R->S |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - access("L"."COL02"="R"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL25","COL26","COL27" FROM "T_REMOTE" "R" (accessing
       'DBLINK_REMOTE' )



Statistics
----------------------------------------------------------
        151  recursive calls
          0  db block gets
        246  consistent gets
         26  physical reads
          0  redo size
       2539  bytes sent via SQL*Net to client
        641  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
         10  sorts (memory)
          0  sorts (disk)
         50  rows processed

SQL>

--We can see the remote table uses a full table scan.

SQL> select /*+DRIVING_SITE(r)*/ l.col06,l.col05,l.col04,r.col27, r.col26,r.col25
  2  from t_local l, t_remote@dblink_remote r
  3  where l.col02=r.col02
  4  ;

50 rows selected.

Elapsed: 00:00:00.03

Execution Plan
----------------------------------------------------------
Plan hash value: 1716516160

-------------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name         | Rows  | Bytes | Cost (%CPU)| Time     | Inst   |IN-OUT|
-------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT REMOTE      |              |    50 |  6450 |   103   (0)| 00:00:01 |        |      |
|   1 |  NESTED LOOPS                |              |    50 |  6450 |   103   (0)| 00:00:01 |        |      |
|   2 |   NESTED LOOPS               |              |    50 |  6450 |   103   (0)| 00:00:01 |        |      |
|   3 |    REMOTE                    | T_LOCAL      |    50 |  3300 |     3   (0)| 00:00:01 |      ! | R->S |
|*  4 |    INDEX RANGE SCAN          | T_REMOTE_I02 |     1 |       |     1   (0)| 00:00:01 | ORA12C |      |
|   5 |   TABLE ACCESS BY INDEX ROWID| T_REMOTE     |     1 |    63 |     2   (0)| 00:00:01 | ORA12C |      |
-------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("A2"."COL02"="A1"."COL02")

Remote SQL Information (identified by operation id):
----------------------------------------------------

   3 - SELECT "COL02","COL04","COL05","COL06" FROM "T_LOCAL" "A2" (accessing '!' )


Note
-----
   - fully remote statement
   - this is an adaptive plan


Statistics
----------------------------------------------------------
        137  recursive calls
          0  db block gets
        213  consistent gets
         25  physical reads
          0  redo size
       2940  bytes sent via SQL*Net to client
        641  bytes received via SQL*Net from client
          5  SQL*Net roundtrips to/from client
         10  sorts (memory)
          0  sorts (disk)
         50  rows processed

SQL>

--We can see the local table uses a full table scan, but the remote table now uses the index on the 2nd column.


Privileges on PG tables granted to a read-only user slowly disappear


More and more internet companies are using PostgreSQL, and we are no exception.

Yesterday a developer asked me to create a read-only user, abc_tmp_test, and to grant it access to 32 tables owned by the mkl_rw user. OK, a simple and easy request, done in no time.

But today the developer came back saying that some of the tables granted yesterday still could not be read, and asked me to take a look. At first I assumed I had missed some yesterday, so I apologized and ran the grants again. Afterwards I checked all 32 tables; every one of them could be queried by the read-only user, so I confidently told the developer that all of yesterday's tables were now granted, that I had verified it myself, and that nothing could be missing this time.

Unbelievably, half an hour later the developer was back: no good, a few tables still had no privileges. My earlier session was still open, so I reran the same checks, and sure enough the privileges were gone. What on earth? Was there a Lei Feng (an invisible do-gooder) at work inside the database?

I granted the privileges once more and checked information_schema.table_privileges, confirming that 32 new rows appeared after the grant. This time I did not tell the developer the job was done. Instead I waited a while and queried again: 28 rows. A while later: 16 rows!

In other words, after I granted SELECT on the 32 tables to the read-only user, the privileges on some of those tables slowly vanished over time! There was no first-granted-first-lost pattern to the disappearances either, but the survivors were always the same 16 tables. I was starting to question my life choices…

Could there be a limit on the number of tables that can be granted in PG, no more than 16? I could not find any such parameter.

Could those 16 tables have some special setting? Nothing in their CREATE TABLE statements suggested so.

Could a grant need a checkpoint to flush it to disk? I tested a checkpoint; the privileges were lost all the same.

Could there really be a Lei Feng in here? And people claim PG is as capable and as stable as Oracle, yet it loses basic grants?

While I was checking parameters one by one, a colleague went through the log and found drop table statements…

So that was it. The case can be reproduced with the following test:

-bash-4.2$ psql -d mybidbaa -U mkl_rw
psql (9.6.2)
Type "help" for help.

mybidbaa=> --Create table bbbdba_test and grant it to user abc_tmp_test
mybidbaa=> create table bbbdba_test as select now();
SELECT 1
mybidbaa=>
mybidbaa=>
mybidbaa=> grant select on mkl_rw.bbbdba_test to abc_tmp_test;
GRANT
mybidbaa=>
mybidbaa=>
mybidbaa=>
mybidbaa=>
mybidbaa=> \q
-bash-4.2$ psql -d mybidbaa -U abc_tmp_test
psql (9.6.2)
Type "help" for help.

mybidbaa=>
mybidbaa=> --Logged in as abc_tmp_test, the bbbdba_test table can be queried.
mybidbaa=> select * from mkl_rw.bbbdba_test;
              now
-------------------------------
 2017-11-16 16:08:14.123217+08
(1 row)

mybidbaa=> \q
-bash-4.2$
-bash-4.2$
-bash-4.2$ psql -d mybidbaa -U mkl_rw
psql (9.6.2)
Type "help" for help.

mybidbaa=> --Drop table bbbdba_test, then recreate it.
mybidbaa=> drop table mkl_rw.bbbdba_test;
DROP TABLE
mybidbaa=>
mybidbaa=>
mybidbaa=> create table bbbdba_test as select now();
SELECT 1
mybidbaa=> \q
-bash-4.2$
-bash-4.2$
-bash-4.2$
-bash-4.2$ psql -d mybidbaa -U abc_tmp_test
psql (9.6.2)
Type "help" for help.

mybidbaa=> --As you can see, the privilege is gone!!
mybidbaa=> select * from mkl_rw.bbbdba_test;
ERROR:  permission denied for relation bbbdba_test
mybidbaa=> \q
-bash-4.2$
-bash-4.2$
-bash-4.2$
-bash-4.2$

Yes: if a table is dropped and then recreated, the privileges originally granted on it to the read-only user disappear with it.
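The underlying reason is that a grant is stored in the ACL of the table object itself (pg_class.relacl); dropping the table removes that ACL together with the object, and the recreated table is a brand-new object with a new OID and an empty ACL. A quick way to observe this, as a sketch:

```sql
-- Run before and after the drop + recreate: the oid changes, and
-- relacl on the new table is NULL until it is granted again.
SELECT oid, relname, relacl
FROM   pg_class
WHERE  relname = 'bbbdba_test';
```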

I asked the developers whether anything drops and recreates tables. They confirmed it: a piece of code does periodically drop and recreate the tables, one by one!!

Why drop and recreate at all? The developers said the data is purged through a framework call, and that is simply how the framework does it.

OK. Since the goal is only to purge data, with no change to the table structure, truncate will do the job. As the following test shows, the privileges are not lost.

-bash-4.2$
-bash-4.2$
-bash-4.2$ psql -d mybidbaa -U mkl_rw
psql (9.6.2)
Type "help" for help.

mybidbaa=> grant select on mkl_rw.bbbdba_test to abc_tmp_test;
GRANT
mybidbaa=> truncate table mkl_rw.bbbdba_test;
TRUNCATE TABLE
mybidbaa=> \q
-bash-4.2$ psql -d mybidbaa -U abc_tmp_test
psql (9.6.2)
Type "help" for help.

mybidbaa=> select * from mkl_rw.bbbdba_test;
 now
-----
(0 rows)

mybidbaa=> \q
-bash-4.2$

In the end the developers changed the code, and after I regranted the 32 tables the privileges no longer slowly disappeared.
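If the drop-and-recreate pattern cannot be changed, PostgreSQL also offers default privileges, which survive this pattern because every newly created table picks them up automatically. A sketch, assuming the tables are always recreated by mkl_rw in the mkl_rw schema:

```sql
-- Run once as mkl_rw (or a superuser): from now on, any table that
-- mkl_rw creates in schema mkl_rw is automatically SELECTable by
-- abc_tmp_test, even after a drop + recreate cycle.
ALTER DEFAULT PRIVILEGES FOR ROLE mkl_rw IN SCHEMA mkl_rw
    GRANT SELECT ON TABLES TO abc_tmp_test;
```

Note that this affects only tables created afterwards; already-existing tables still need a one-off GRANT.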

Two lessons can be learned from this story:
1. The world is full of surprises: there is no Lei Feng in the database, only all kinds of logic you would never have expected.
2. Fortunately our database build standard requires log_statement = 'ddl', which is what let us find the clue in the log. (In fact our build standards for both Oracle and PG record DDL.)

Officially recommended MySQL parameter settings


Today, while hunting for a MySQL patch, I found a very good article on MetaLink: Oracle's officially recommended best practice for MySQL parameter settings in an OLTP environment.

The settings below can help system performance a great deal, but I still recommend adapting them to your actual situation.

APPLIES TO:

MySQL Server – Version 5.6 and later
Information in this document applies to any platform.

PURPOSE

Strongly recommended initial settings for MySQL Server when used for OLTP or benchmarking.

SCOPE

For DBAs having OLTP-like workloads or doing benchmarking.

DETAILS

We recommend that when using MySQL Server 5.6 you include the following settings in your my.cnf or my.ini file if you have a transaction processing or benchmark workload. They are also a good starting point for other workloads. These settings replace some of our flexible server defaults for smaller configurations with values that are better for higher load servers. These are starting points. For most settings the optimal value depends on the specific workload and you should ideally test to find out what settings are best for your situation. The suggestions are also likely to be suitable for 5.7 but 5.7-specific notes and recommendations are a work in progress.

If a support engineer advises you to change a setting, accept that advice because it will have been given after considering the data they have collected about your specific situation.

 

Changes to make in all cases

These improve on the defaults to improve performance in some cases, reducing your chance of encountering trouble.

innodb_stats_persistent = 1         # Also use ANALYZE TABLE for all tables periodically
innodb_read_io_threads = 16       # Check pending read requests in SHOW ENGINE INNODB STATUS to see if more might be useful, if seldom more than 64 * innodb_read_io_threads, little need for more.
innodb_write_io_threads = 4
table_open_cache_instances = 16 # 5.7.8 onwards defaults to 16

metadata_locks_hash_instances = 256 # better hash from 5.6.15,5.7.3. Irrelevant and deprecated from 5.7.4 due to change in metadata locking

 

 

Main settings to review

Also make these additions and adjust as described to find reasonably appropriate values:

innodb_buffer_pool_size: the single most important performance setting for most workloads; see the memory section later for more on this and InnoDB log files. Consider increasing innodb_buffer_pool_instances (5.6 manual) from the default of 8 to buffer pool size / 2GB (so 32 for a 64G pool) if concurrency is high; some old benchmark results illustrate why.

innodb_stats_persistent_sample_pages: a value in the 100 to 1000 range will produce better statistics and is likely to produce better query optimising for non-trivial queries. The time taken by ANALYZE TABLE is proportional to this and this many dives will be done for each index, so use some care about setting it to very large values.

innodb_flush_neighbors = 0 if you have SSD storage. Do not change from server default of 1 if you are using spinning disks. Use 0 if both.

innodb_page_size: consider 4k for SSD because this better matches the internal sector size on older disks but be aware that some might use the newer 16k sector size, if so, use that. Check your drive vendor for what it uses.

innodb_io_capacity: for a few spinning disks and lower end SSD the default is OK, but 100 is probably better for a single spinning disk. For higher end and bus-attached flash consider 1000. Use smaller values for systems with low write loads, larger with high. Use the smallest value needed for flushing and purging to keep up unless you see more modified/dirty pages than you want in the InnoDB buffer pool. Do not use extreme values like 20000 or more unless you have proved that lower values are not sufficient for your workload. It regulates flushing rates and related disk i/o. You can seriously harm performance by setting this or innodb_io_capacity_max too high and wasting disk i/o operations with premature flushing.

innodb_io_capacity_max: for a few spinning disks and lower end SSD the default is OK but 200-400 is probably better for a single spinning disk. For higher end and bus-attached flash consider 2500. Use smaller values for systems with low write loads, larger with high. Use the smallest value needed for flushing and purging to keep up. Twice innodb_io_capacity will often be a good choice and this can never be lower than innodb_io_capacity.

innodb_log_file_size = 2000M is a good starting point. Avoid making it too small, that will cause excessive adaptive flushing of modified pages. More guidance here.

innodb_lru_scan_depth: Reduce if possible. Uses disk i/o and can be a CPU and disk contention source. This multiplied by innodb_buffer_pool_instances sets how much work the page cleaner thread does each second, attempting to make that many pages free. Increase or decrease this to keep the result of multiplying the two about the same whenever you change innodb_buffer_pool_instances, unless you are deliberately trying to tune the LRU scan depth. Adjust up or down so that there are almost never no free pages, but do not set it much larger than needed because the scans have a significant performance cost. A smaller value than the default is probably suitable for most workloads; give 100 a try instead of the default if you just want a lower starting point for your tuning, then adjust upwards to keep some free pages most of the time. Increase innodb_page_cleaners to the lower of CPU count or buffer pool instances if it cannot keep up; there are limits to how much writing one thread can get done. 4 is a useful change for 5.6; the 5.7 default is already 4. The replication SQL thread(s) can be seriously delayed if there are not usually free pages, since they have to wait for one to be made free. Error log messages like "page_cleaner: 1000ms intended loop took 8120ms. The settings might not be optimal." usually indicate either that the page cleaner has been told to do more work than is possible in one second, so reduce the scan depth, or that there is disk contention to fix. The page cleaner has high thread priority in 5.7, so it is particularly important not to tell it to do too much work; that helps it keep up. Document 2014477.1 has details of related settings and measurements that can help to tune this.

innodb_checksum_algorithm=strict_crc32 if a new installation, else crc32 for backwards compatibility. 5.5 and earlier cannot read tablespaces created with crc32. Crc32 is faster and particularly desirable for those using very fast storage systems like bus-attached flash such as Fusion-IO with high write rates.

innodb_log_compressed_pages = 0 if using compression. This avoids saving two copies of changes to the InnoDB log, one compressed, one not, so reduces InnoDB log writing amounts. Particularly significant if the log files are on SSD or bus-attached flash, something that should often be avoided if practical though it can help with commit rates if you do not have a write caching disk controller, at the cost of probably quite shortened SSD lifetime.

binlog_row_image = minimal assuming all tables have primary key, unsafe if not, it would prevent applying the binary logs or replication from working. Saves binary log space. Particularly significant if the binary logs are on SSD or flash, something that should often be avoided.

table_definition_cache: Set to the typical number of actively used tables within MySQL. Use SHOW GLOBAL STATUS and verify that Opened_table_definitions is not increasing by more than a few per minute. Increase until that is true or the value becomes 30000 or more, if that happens, evaluate needs and possibly increase further. Critical: see Performance Schema memory notes. Do not set to values that are much larger than required or you will greatly increase the RAM needs of PS in 5.6, much less of an issue in 5.7. Note that in 5.6 801 can cause four times the PS RAM usage of 800 by switching to large server calculation rules and 400 can be about half that of 401 if no other setting causes large rules.

table_open_cache: set no smaller than table_definition_cache, usually twice that is a good starting value. Use SHOW GLOBAL STATUS and verify that Opened_tables is not increasing by more than a few per minute. Increase until that is true or the value becomes 30000 or more, if that happens, evaluate needs and possibly increase further. Critical: see Performance Schema memory notes. Do not set to values that are much larger than required in 5.6, much less of an issue in 5.7, or you will greatly increase the RAM needs of PS. Note that in 5.6 4001 can cause four times the PS RAM usage of 4000 by switching to large server calculation rules and 2000 can be about half that of 2001 if no other setting causes large rules.

max_connections: This is also used for autosizing Performance Schema. Do not set it to values that are far higher than really required or you will greatly increase the memory usage of PS. If you must have a large value here because you are using a connection cache, consider using a thread cache as well to reduce the number of connections to the MySQL server. Critical: see Performance Schema memory notes. Do not set to values that are much larger than required or you will greatly increase the RAM needs of PS. Note that 303 can cause four times the PS RAM usage of 302 by switching to large server calculation rules and 151 can be about half that of 302 if no other setting causes large rules.

open_files_limit: This is also used for autosizing Performance Schema. Do not set it to values that are far higher than really required in 5.6, less of an issue in 5.7.

sort_buffer_size = 32k is likely to be faster for OLTP; change to that from the server default of 256k. Use SHOW GLOBAL STATUS to check Sort_merge_passes. If the count is 0 or increasing by up to 10-20 per second you can decrease this and probably get a performance increase. If the count is increasing by less than 100 per second that is also probably good and a smaller sort_buffer_size may be better. Use care with large sizes: setting this to 2M can reduce throughput for some workloads by 30% or more. If you see high values for Sort_merge_passes, identify the queries that are performing the sorts and either improve indexing or set the session value of sort_buffer_size to a larger value just for those queries.
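For that last point, a per-query override might look like this (a sketch; the orders query and the 4MB value are purely illustrative):

```sql
-- Raise the sort buffer only for one heavy sorting query in this session,
-- then drop back to the server default so OLTP statements keep the small value.
SET SESSION sort_buffer_size = 4 * 1024 * 1024;   -- 4MB, illustrative value

SELECT customer_id, SUM(amount) AS total          -- hypothetical reporting query
FROM   orders
GROUP  BY customer_id
ORDER  BY total DESC;

SET SESSION sort_buffer_size = DEFAULT;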

innodb_adaptive_hash_index (5.6 manual) Try both 0 and 1, 0 may show improvement if you do a lot of index scans, particularly in very heavy read-only or read-mostly workloads. Some people prefer always 0 but that misses some workloads where 1 helps. There’s an improvement in concurrency from 5.7.8 to use multiple partitions and the option innodb_adaptive_hash_index_parts was added, this may change the best setting from 0 to 1 for some workloads, at the cost of slower DBT3 benchmark result with a single thread only. More work planned for 5.8.

innodb_doublewrite (5.6 manual) consider 0/off instead of the default 1/on if you can afford the data protection loss for high write load workloads. This has gradually changed from neutral to positive in 5.5 to more negative for performance in 5.6 and now 5.7.


Where there is a recommendation to check SHOW GLOBAL STATUS output you should do that after the server has been running for some time under load and has stabilised. Many values take some time to reach their steady state levels or rates.
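The status counters referenced above can be sampled like this; take two readings some minutes apart under steady load and compare the deltas:

```sql
-- Counters referenced in the recommendations above:
-- Opened_table_definitions (table_definition_cache),
-- Opened_tables (table_open_cache), Sort_merge_passes (sort_buffer_size).
SHOW GLOBAL STATUS WHERE Variable_name IN
    ('Opened_table_definitions', 'Opened_tables', 'Sort_merge_passes');

-- Pending read requests (relevant to innodb_read_io_threads) appear here:
SHOW ENGINE INNODB STATUS\G
```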

 

 

SSD-specific settings

Ensure that trim support is enabled in your operating system, it usually is.

Set innodb_page_size=4k unless you want a larger size to try to increase compression efficiency or have an SSD with 16k sectors. Use innodb_flush_neighbors=0 .

 

Memory usage and InnoDB buffer pool

For the common case where InnoDB is storing most data, setting innodb_buffer_pool_size to a suitably large value is the key to good performance. Expect to use most of the RAM in the server for this, likely at least 50% on a dedicated database server.
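As an illustrative my.cnf sketch for a dedicated server with 64 GB of RAM (the numbers are assumptions for illustration, not a sizing recommendation for your hardware):

```ini
# my.cnf -- illustrative sizing for a dedicated 64 GB database server
[mysqld]
innodb_buffer_pool_size      = 48G   # ~75% of RAM; leave room for PS, OS log caching, connections
innodb_buffer_pool_instances = 8     # reduce buffer pool mutex contention on large pools
```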

The Performance Schema can be a far more substantial user of RAM than in previous versions, particularly in 5.6; it is less of an issue in 5.7. You should check the amount of RAM allocated for it using SHOW ENGINE PERFORMANCE_SCHEMA STATUS. Any increase of max_connections, open_files_limit, table_open_cache or table_definition_cache above the defaults causes PS to switch to allocating more RAM to allow faster or more extensive monitoring. For this reason, in 5.6 in particular, take great care not to set those values larger than required, or adjust the PS memory allocation settings directly. You may need to set PS settings directly to lower values if you have tens of thousands of infrequently accessed tables, or you can set a lower value in my.cnf and change to a higher value in the server init file. It is vital to consider the PS memory allocations in the RAM budget of the server. See On configuring the Performance Schema for more details on how to get started with tuning it.

If all of max_connections, table_definition_cache and table_open_cache are the same as or lower than their 151, 400 and 2000 defaults, small sizing rules will be used. If all are no more than twice the defaults, medium rules will be used, at about twice the small memory consumption (e.g. 98 megabytes instead of 52 megabytes). If any is more than twice its default, large rules will be used, and the memory usage can be about eight times the small consumption (e.g. 400 megabytes). For this reason, avoid going just over the 302, 800 and 4000 values for these settings if PS is being used, or use direct settings for PS sizes. The size examples are with little data and all other settings default; production servers may see significantly larger allocations. From 5.7 the PS uses more dynamic on-demand allocation, so these settings are less likely to be troublesome and memory usage will vary more with demand than with startup settings.
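To see what PS is actually costing and which settings are driving the sizing rules, two statements suffice:

```sql
-- Per-buffer PS allocations; the performance_schema.memory row at the end is the total
SHOW ENGINE PERFORMANCE_SCHEMA STATUS;

-- The settings that drive the small/medium/large PS sizing rules
SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('max_connections', 'table_definition_cache', 'table_open_cache');
```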

Very frequent and unnecessarily large memory allocations are costly and per-connection allocations can be more costly and also can greatly increase the RAM usage of the server. Please take particular care to avoid over-large settings for: read_buffer_size, read_rnd_buffer_size, join_buffer_size, sort_buffer_size, binlog_cache_size and net_buffer_length. For OLTP work the defaults or smaller values are likely to be best. Bigger is not usually better for these workloads. Use caution with larger values, increasing sort_buffer_size from the default 256k to 4M was enough to cut OLTP performance by about 30% in 5.6. If you need bigger values for some of these, do it only in the session running the query that needs something different.

The operating system is likely to cache the total size of log files configured with innodb_log_file_size. Be sure to allow for this in your memory budget.

thread_stack is also a session setting, but it is set to the minimum safe value for using stored procedures; do not reduce it if you use those. A reduction of at most 32k might work for other workloads, but remember that the server will crash effectively at random if you get it wrong. We increase this only when our tests show that the stack size is too small for safe operation, so there is no need for you to increase it either. Best not to touch this setting unless you are both desperate and an expert.

 

Operating systems

CPU affinity: if you are limiting the number of CPU cores, use CPU affinity to obtain that core count from the smallest possible number of physical CPUs, to reduce CPU-to-CPU hardware consistency overhead. On Linux use commands like taskset -cp 1-4 $(pidof mysqld); on Windows use START /AFFINITY or the Task Manager affinity controls.

 

Linux

Memory allocator: we ship builds that use libc malloc, which is OK up to about 8-16 concurrent threads. Beyond that, switch to TCMalloc using the mysqld_safe --malloc-lib option or LD_PRELOAD, or experiment with the similar and possibly slightly faster jemalloc, which may do better with memory fragmentation, though we greatly reduced potential fragmentation in MySQL 5.6. TCMalloc 1.4 was shipped with many MySQL versions until 5.6.31 and 5.7.13. At the time of writing TCMalloc 2.5 is the latest version, so you may want to experiment with that and jemalloc to see which works best for your workload and system.

IO scheduler: use noop or deadline. In rare cases CFQ can work better, perhaps on SAN systems, but usually it is significantly slower. echo noop > /sys/block/{DEVICE-NAME}/queue/scheduler .

nice: using nice -10 in mysqld_safe can make a small performance difference on dedicated servers, sometimes larger on highly contended servers. nice -20 can be used, but you may find it hard to connect interactively if mysqld is overloaded, and -10 is usually sufficient. If you really want -20, use -19 so you can still set the client mysql to -20 to get in and kill a rogue query.

Use cat /proc/$(pgrep -n mysqld)/limits to check the ulimit values for a running process. You may need ulimit -n to set the maximum open files per process and ulimit -u per user. The MySQL open_files_limit setting should set this, but verify and adjust directly if needed.

It is often suggested to use numactl --interleave=all to prevent heavy swapping when a single large InnoDB buffer pool is allocated entirely on one CPU. Two alternatives exist: the primary one is using multiple InnoDB buffer pools to stop the allocations all landing on one CPU. In addition, check with SHOW VARIABLES whether your version has been built with support for the innodb_numa_interleave setting. If it is present, turn it on by setting it to 1. It switches to interleaved mode (MPOL_INTERLEAVE) before allocating the buffer pool(s), then back to standard (MPOL_DEFAULT) afterwards. The setting is present in builds compiled on a NUMA system from 5.6.27 onwards.

Set vm.swappiness=1 in /etc/sysctl.conf. It is often suggested to use 0 to swap only in an out-of-memory situation, but 1 allows minimal swapping before that point and is probably sufficient. Use whatever works for you, but please use caution with 0. Higher values tend to swap out the InnoDB buffer pool to enlarge the OS disk cache, a really bad idea for a dedicated database server that is doing its own write caching. If using a NUMA system, get the NUMA settings in place before blaming swappiness for swapping: it is known that a large single buffer pool can trigger very high swapping levels if the NUMA settings aren't right, and the fix is to adjust the NUMA settings, not swappiness.

Do not set the setting in this paragraph by default. You must test to see if it is worth doing, getting it wrong can harm performance. Default IO queue size is 128, higher or lower can be useful, you might try experimenting with echo 1000 > /sys/block/[DEVICE]/queue/nr_requests . Not likely to be useful for single spinning disk systems, more likely on RAID setups.

Do not set the setting in this paragraph by default. You must test to see if it is worth doing, getting it wrong can harm performance. The VM subsystem dirty ratios can be adjusted from the defaults of 10 and 20. To set a temporary value for testing maybe use echo 5 > /proc/sys/vm/dirty_background_ratio and echo 60 > /proc/sys/vm/dirty_ratio . After proving what works best you can add these parameters to the /etc/sysctl.conf : vm.dirty_background_ratio = 5 vm.dirty_ratio = 60 .  Please do follow the instruction to test, it is vital not to just change this and 5 and 60 are just examples.

Tools to monitor various parts of a linux system.

 

Linux Filesystems

We recommend ext4 mounted with (rw,noatime,nodiratime,nobarrier,data=ordered) unless ultimate speed is required, because ext4 is somewhat easier to work with. If you do not have a battery-backed write-caching disk controller, you can probably improve your write performance by as much as 50% by using the ext4 option data=journal together with the MySQL option skip-innodb_doublewrite. The ext4 option provides the protection against torn pages that the doublewrite buffer provides, but with less overhead. The benefit with a write-caching controller is likely to be minimal.
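The pairing described above, sketched as fstab and my.cnf fragments (the device and mount point are placeholders; only apply this after verifying torn-page protection on your own stack):

```ini
# /etc/fstab -- ext4 with full data journalling (placeholder device and mount point)
# /dev/sdb1  /var/lib/mysql  ext4  rw,noatime,nodiratime,nobarrier,data=journal  0 0

# my.cnf -- safe only because ext4 data=journal now provides torn-page protection
[mysqld]
skip-innodb_doublewrite
```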

XFS is likely to be faster than ext4, perhaps for fsync speed, but it is more difficult to work with. Use mount options (rw,noatime,nodiratime,nobarrier,logbufs=8,logbsize=32k).

ext3 isn’t too bad but ext4 is better. Avoid ext2, it has significant limits. Best to avoid these two.

NFS in homebrew setups has more reliability problems than NFS in professional SAN or other storage systems, where it works well but may be slower than directly attached or bus-attached SSD. It's a balance of features and performance, with SAN performance possibly boosted by large caches and drive arrays. The most common issue is locked InnoDB log files after a power outage; waiting or switching log files resolves this. The incidence of problems has declined over the last ten years and as of 2016 is now low. If possible, use the NFSv4 or later protocol for its improved locking handling. If concerned about out-of-order application of changes (not a problem normally observed in practice), consider using TCP and the hard,intr mount options.

 

Solaris

Use LD_PRELOAD for one of the multi-threaded oriented mallocs, either mtmalloc or umem.

Use UFS/forcedirectio

Use ZFS.

 

Windows

To support more connections, or connection rates higher than about 32 per second, you may need to raise MaxUserPort and lower TcpTimedWaitDelay for TCP/IP, particularly on Windows Server 2003 and earlier. The defaults are likely to be no more than 4000 ports and a TIME_WAIT of 120 seconds. See Settings that can be Modified to Improve Network Performance. Settings of 32768 ports and a timeout of between 5 and 30 seconds are likely to be appropriate for server usage. The symptom of incorrect settings is a sudden failure to connect once the port limit is reached, resuming at a slow rate as the timeout slowly frees ports.
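As a sketch, the conventional registry locations for these two TCP/IP settings are shown below; treat the paths and values as assumptions to verify against your Windows Server version before applying:

```
Windows Registry Editor Version 5.00

; Sketch only -- verify these locations and values for your Windows version
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters]
"MaxUserPort"=dword:00008000       ; 32768 ephemeral ports
"TcpTimedWaitDelay"=dword:0000001e ; 30-second TIME_WAIT
```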

 

Hardware

Battery-backed write-caching disk controllers are useful for all spinning disk setups and also for SSD. SSD alone is a cheaper way to get faster transaction commits than spinning disks, for lower load systems. Do not trust that the controller disables the hard drive write buffers, test with real power outages. You will probably lose data even with a battery if the controller has not disabled the hard drive write buffers.

It is best to split files across disk types in these general groups:

SSD: data, InnoDB undo logs, maybe temporary tables if not using tmpfs or other RAM-based storage for them.

Spinning disks: Binary logs, InnoDB redo logs, bulk data. Also, large SATA drives are cheap and useful for working and archival space as well as the biggest of bulk data sets.

Bus-attached SSD: only the tables with the very highest change rates in the most highly loaded systems.

You can put individual InnoDB tables on different drives, allowing use of SSD for fast storage and SATA for bulk.

Hyperthreading on is likely to be a good choice in most cases. MySQL 5.6 scales up to somewhere in the range of 32-48 cores with InnoDB and hyperthreading counts as an extra core for this purpose. For 5.5 that would be about 16 and before that about 8 cores. If you have more physical cores either without hyperthreading or more when it is enabled, experiment to determine the optimal number to use for MySQL. There is no fixed answer because it depends on workload properties.

 

Thread pool

Use the thread pool if you routinely run with more than about 128 concurrently active connections. Use it to keep the server at the optimal number of concurrently running operations, which is typically in the range between 32 and 48 threads on high core count servers in MySQL 5.6. If not using the thread pool, use innodb_thread_concurrency if you see that your server has trouble with a build-up of queries above about 128 or so concurrently running operations inside InnoDB. InnoDB shows positive scalability up to an optimal number of running jobs, then negative scalability but innodb_thread_concurrency  = 0 has lower overhead when that regulating is not needed, so there is some trade off in throughput stability vs raw performance. The value for peak throughput depends on the application and hardware. If you see a benchmark that compares MySQL with a thread pool to MySQL without, but which does not set innodb_thread_concurrency, that is an indication that you should not trust the benchmark result: no production 5.6 server should be run with thousands of concurrently running threads and no limit to InnoDB concurrency.
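A hedged my.cnf sketch of the two alternatives described above (the thread pool here is the plugin shipped with MySQL Enterprise; all values are illustrative):

```ini
# Option A: thread pool plugin, for >128 concurrently active connections
[mysqld]
plugin-load-add=thread_pool.so
thread_pool_size = 32                # near the 32-48 sweet spot on high core counts

# Option B: no thread pool -- cap concurrency inside InnoDB instead
# innodb_thread_concurrency = 32     # 0 = unlimited; lower overhead when no regulation is needed
```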

 

Background

Here are more details of why some of these changes should be made.

innodb_stats_persistent = 1

Enables persistent statistics in InnoDB, producing more stable and usually better query optimiser decisions. Very strongly recommended for all servers. With persistent statistics you should run ANALYZE TABLE periodically to update the statistics. Once a week or month is probably sufficient for tables with fairly stable or gradually changing sizes; for tables that are small or have very rapidly changing contents, more frequent runs will be beneficial. The possible disadvantages are minimal, mainly the occasional need for ANALYZE TABLE.
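If you want the database itself to run the periodic ANALYZE TABLE, the event scheduler is one option; a minimal sketch, where the schema and table names are placeholders:

```sql
SET GLOBAL event_scheduler = ON;

CREATE EVENT weekly_analyze_orders
  ON SCHEDULE EVERY 1 WEEK
DO
  ANALYZE TABLE app_db.orders;  -- refresh persistent statistics
```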

innodb_read_io_threads = 16, innodb_write_io_threads = 4

Increases the number of threads used for some types of InnoDB operation, though not the foreground query processing work. That can help the server to keep up with heavy workloads. No significant negative effects for most workloads, though sometimes contention for disk resources between these threads and foreground threads might be an issue if disk utilisation is near 100%.

table_open_cache_instances = 16

Improves the speed of operations involving tables at higher concurrency levels, important for reducing the contention in this area to an insignificant level. No significant disadvantages. This is unlikely to be a bottleneck until 24 cores are in full use but given the lack of cost it is best to set it high enough and never worry about it.

metadata_locks_hash_instances = 256

Reduces the effects of locking during the metadata locking that is used mainly for consistency around DDL. This has been an important bottleneck. As well as the general performance benefit, the hashing algorithm used has been shown to be non-ideal for some situations and that also makes it desirable to increase this value above the default, to reduce the chance of encountering that issue. We’re addressing that hash also but this will still be a useful setting with no significant negatives.

innodb_flush_neighbors = 0

When set to 1, InnoDB will look to flush nearby data pages as an optimisation for spinning disks. That optimisation is harmful for SSDs because it increases the number of writes. Set to 0 for data on SSDs, 1 for spinning disks. If mixed, 0 is probably best.

innodb_log_file_size = 2000M

This is a critical setting for workloads that do lots of data modification and severe adverse performance will result if it is set too small. You must check the amount of log space used and ensure that it never reaches 75%. You must also consider the effect of your adaptive flushing settings and ensure that the percentage of the log space used does not cause excessive flushing. You can do that by using larger log files or having adaptive flushing start at a higher percentage. There is a trade off in this size because the total amount of log file space will usually be cached in operating system caches due to the nature of the read-modify-write operations performed. You must allow for this in the memory budget of the server to ensure that swapping does not occur. On SSD systems you can significantly extend the life of the drive by ensuring that this is set to a suitably high value to allow lots of dirty page caching and write combining before pages are flushed to disk.
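Log space used can be derived from the LOG section of SHOW ENGINE INNODB STATUS as (current LSN − last checkpoint LSN) divided by total log capacity. A sketch using awk over sample status lines (in practice, pipe in real output from mysql -e 'SHOW ENGINE INNODB STATUS\G'; the LSN values here are illustrative):

```shell
#!/bin/sh
# Estimate the % of InnoDB redo log capacity in use. LSNs are byte positions;
# capacity = innodb_log_files_in_group * innodb_log_file_size (2 files of 2000M here).
capacity=$((2 * 2000 * 1024 * 1024))

pct=$(awk -v cap="$capacity" '
  /^Log sequence number/ { lsn = $NF }    # current LSN
  /^Last checkpoint at/  { ckpt = $NF }   # oldest LSN still needed for recovery
  END { printf "%.1f%% of redo log space used", (lsn - ckpt) * 100 / cap }
' <<'EOF'
Log sequence number 849036553
Log flushed up to   849036553
Last checkpoint at  419430400
EOF
)
echo "$pct"
```

If this figure approaches 75% under load, either enlarge the log files or start adaptive flushing earlier.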

table_definition_cache

Reduces the need to open tables to get dictionary information about the table structures. If set too low this can have a severe negative performance effect. There is little negative effect for the size range given on the table definition cache itself. See the Performance Schema portion of the memory notes above for critical memory usage considerations.

table_open_cache

Reduces the need to open tables to access data. If set too low this can have severe negative performance effects. There is little negative effect for the size range given on the table open cache itself. See the Performance Schema portion of the memory notes above for critical memory usage considerations.

sort_buffer_size = 32k

The key cost here is reduced server speed from setting this too high. Many common recommendations to use several megabytes or more have appeared in a wide range of published sources, and these are harmful for OLTP workloads, which normally benefit most from 32k or other small values. Do not set this to significantly larger values such as above 256k unless you see very excessive numbers of Sort_merge_passes: many hundreds or thousands per second on busy servers. Even then, it is far better to adjust the setting only in the connection of the few queries that will benefit from the larger size. Where session-level adjustment is impossible and the workload is mixed, it can be useful to use higher-than-ideal OLTP values to address the needs of the mixture of queries.

 

Other observations

Query cache

The query cache is effectively a single-threaded bottleneck. It can help performance at low query rates and concurrency, perhaps up to 4 cores routinely used. Above that it is likely to become a serious bottleneck. Leave this off unless you want to test it with your workload and have measurements that will tell you whether it is helping or hurting. Ensure that Qcache_free_blocks in global status is not above 10,000; 5,000 is a good action level. Above these levels the CPU time used in scans of the free list can be an issue. Check with FLUSH QUERY CACHE, which defragments the free list; the change in CPU use afterwards is the cost of the free-list size you had. Reducing the size is the most effective way to manage the free-list size. Remember that the query cache was designed for sizes of up to a few tens of megabytes; if you're using hundreds of megabytes you should check performance with great care, as that is well outside its design limitations. Also check for waits with:

SELECT EVENT_NAME AS nm, COUNT_STAR AS cnt, sum_timer_wait, CONCAT(ROUND(sum_timer_wait / 1000000000000, 2), ' s') AS sec
FROM performance_schema.events_stages_summary_global_by_event_name WHERE COUNT_STAR > 0 ORDER BY SUM_TIMER_WAIT DESC LIMIT 20;

Also see MySQL Query Cache Fragmentation Slows Down the Server (Doc ID 1308051.1).
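The free-list checks described above, as concrete statements:

```sql
-- Act if Qcache_free_blocks approaches the 5,000-10,000 range
SHOW GLOBAL STATUS LIKE 'Qcache_free_blocks';

-- Defragment the free list; compare CPU use before and after to see its cost
FLUSH QUERY CACHE;
```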

 

Sync_binlog and innodb_flush_log_at_trx_commit

The 1 setting for these causes one fsync each at every transaction commit in 5.5 and earlier. From 5.6 concurrent commit support helps greatly to reduce that, but you should still use care with these settings. sync_binlog=1 can be expected to cause perhaps a 20% throughput drop with concurrent commit and 30 connections trying to commit, an effect that lessens as the count of actively working connections increases, through to peak throughput at about 100 working connections. To check the effect, set sync_binlog to 0 and observe, then set innodb_flush_log_at_trx_commit = 0 and observe. Try innodb_flush_log_at_trx_commit = 2 as well; it has less overhead than 1 and more than 0. Finally try both at 0. The speed increase from the 0 settings will be greatest on spinning disks at low concurrency and lowest at higher concurrency on fast SSD or with write-caching disk controllers.
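The experiment sequence above, as concrete statements (both variables are dynamic, so this can be tried without a restart; revert both to 1 for full durability):

```sql
-- Relax one setting at a time while watching throughput
SET GLOBAL sync_binlog = 0;                      -- leave binlog fsync timing to the OS
SET GLOBAL innodb_flush_log_at_trx_commit = 2;   -- write at commit, fsync roughly once per second
SET GLOBAL innodb_flush_log_at_trx_commit = 0;   -- write and fsync roughly once per second
```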

Note that innodb_flush_log_at_trx_commit=1 is mandatory for full durability guarantees. Write-caching disk controllers with battery backup are the usual way to combine full durability with a low performance penalty.

 

 

 

Bugs that affect upgrades and usage for 5.6 compared to 5.5 and earlier

http://bugs.mysql.com/bug.php?id=69174

Innodb_max_dirty_pages_pct is effectively broken at present, only working when the server has been idle for a second or more. There are a range of implications:

1. In past versions the limit on innodb_log_file_size sometimes made it necessary to use this setting to avoid hitting 75% of log space use and having a production disruption  incident due to hitting async flushing at 75%. The much more gentle flushing batches from innodb_max_dirty_pages_pct were normally acceptable and it wasn’t uncommon for systems with large buffer pools and high needs to have innodb_max_dirty_pages_pct set to values in the 2-5% range just for this reason. In 5.6 you have two possibilities that should work better:

1a. You can use larger values for innodb_log_file_size. That will let you use more of your buffer pool for write combining and reduce total io operations, instead of being forced to do lots of avoidable ones just to avoid reaching 75% of the log file use. Be sure you allow for the RAM your OS will use for buffering the log files, assume as much RAM use as the total log file space you set. This should greatly increase the value of larger buffer pools for high write load workloads.

1b. You can set innodb_adaptive_flushing_lwm to avoid reaching 75% of log space use. The highest permitted value is 70%, so adaptive flushing will start to increase flushing rate before the server gets to 75% of the log file use. 70% is a good setting for systems with low write rates or very fast disk systems that can easily handle a burst of writes. For others you should adjust to whatever lower value it takes to produce a nice and smooth transition from innodb_io_capacity based level flushing to adaptive flushing. 10% is the default but that is probably too low for most production systems, just what we need for a default that has to handle a wide range of possible cases.

2. You can’t effectively use the normal practice of gradually reducing innodb_max_dirty_pages_pct before a shutdown, to reduce outage duration. The best workaround at present is to set innodb_io_capacity to high values so it will cause more flushing.

3. You can’t use innodb_max_dirty_pages_pct to manage crash recovery time, something it could do with less disruptive writing than the alternative of letting the server hit async flushing at 75% of log file space use, after deliberately setting innodb_log_file_size too low. The workarounds are to use higher than desirable innodb_io_capacity and smaller than desirable innodb_log_file_size. Both cause unnecessary flushing compared to using innodb_max_dirty_pages_pct for this task. Before using a too small innodb_log_file_size, experiment with innodb_io_capacity and innodb_adaptive_flushing_lwm. Also ensure that innodb_io_capacity_max is set to around twice innodb_io_capacity, rarely up to four or more times. This may eliminate the issue with less redundant io than very constrained log file sizes because adaptive flushing will increase the writing rate as the percentage of log space used increases, so you should be able to reach almost any target recovery time limit, though still at the cost of more io than using innodb_max_dirty_pages_pct to do it only when a hard cap is reached.

4. You can’t use innodb_max_dirty_pages_pct to effectively regulate the maximum percentage of dirty pages in the buffer pool, constraining them to a target value. This is likely to be of particular significance during data loading and with well cached workloads where you want to control the split between pages used for caching modified data and pages used for caching data used purely for reads.

 

The workaround for this is to regard innodb_adaptive_flushing_lwm as equivalent to the use of innodb_max_dirty_pages_pct for normal production and set it to something like 60% with a suitable value of innodb_io_capacity for the times when the workload hasn’t reached that amount of log file usage. Start low like 100 and gradually increase so that at medium load times it just about keeps up. Have innodb_io_capacity_max set to a relatively high value so that as soon as the low water mark is passed, lots of extra IO will be done to cap the dirty pages/log space use.
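An illustrative my.cnf sketch of this workaround (the numbers are the examples from the text, not recommendations):

```ini
# Workaround sketch -- treat the adaptive-flushing low water mark as the de facto
# dirty-page cap while innodb_max_dirty_pages_pct is ineffective
[mysqld]
innodb_adaptive_flushing_lwm = 60    # start adaptive flushing at 60% of log space used
innodb_io_capacity           = 100   # start low; raise until medium load just keeps up
innodb_io_capacity_max       = 2000  # relatively high, to cap dirty pages once the LWM is passed
```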

You may then be able to reduce the size of your InnoDB log files if you find that you don’t reach 60% of log space use when you have reached a suitable percentage of dirty pages for the page read/write balance for your server. If you can you should do this because you can reallocate the OS RAM used for caching the bigger log files to the InnoDB buffer pool or other uses.

 

REFERENCES

http://mysqlserverteam.com/removing-scalability-bottlenecks-in-the-metadata-locking-and-thr_lock-subsystems-in-mysql-5-7/
https://bugs.mysql.com/bug.php?id=68487
http://www.brendangregg.com/linuxperf.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-adaptive-hash.html
https://dev.mysql.com/doc/refman/5.7/en/innodb-adaptive-hash.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_read_io_threads
https://dev.mysql.com/doc/refman/5.6/en/server-status-variables.html#statvar_Innodb_page_size
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_buffer_pool_size
NOTE:2014477.1 – MySQL 5.7 Log Messages: page_cleaner: 1000ms intended loop took 8120ms. The settings might not be optimal. (flushed=0 and evicted=25273, during the time.)
NOTE:1308051.1 – MySQL Query Cache Fragmentation Slows Down the Server
https://dev.mysql.com/doc/refman/5.6/en/mysqld-safe.html#option_mysqld_safe_malloc-lib
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_numa_interleave
http://msdn.microsoft.com/en-us/library/ee377084.aspx
https://dev.mysql.com/doc/refman/5.6/en/server-options.html#option_mysqld_init-file
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_log_compressed_pages
http://dimitrik.free.fr/blog/archives/2013/02/mysql-performance-mysql-56-ga-vs-mysql-55-tuning-details.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_stats_persistent
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_stats_persistent_sample_pages
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_buffer_pool_instances
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_neighbors
http://mysqlmusings.blogspot.co.uk/2012/06/binary-log-group-commit-in-mysql-56.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_io_capacity_max
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_write_io_threads
https://dev.mysql.com/doc/refman/5.6/en/server-options.html#option_mysqld_open-files-limit
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_sort_buffer_size
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_checksum_algorithm
NOTE:1604225.1 – Autosized Performance Schema Options in MySQL Server in MySQL 5.6
http://marcalff.blogspot.co.uk/2013/04/on-configuring-performance-schema.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_doublewrite
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_page_size
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_open_cache_instances
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_metadata_locks_hash_instances
https://dev.mysql.com/doc/refman/5.6/en/replication-options-binary-log.html#sysvar_binlog_row_image
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_definition_cache
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_table_open_cache
https://dev.mysql.com/doc/refman/5.6/en/server-system-variables.html#sysvar_max_connections
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_lru_scan_depth
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_io_capacity
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_log_file_size
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_buffer_pool_instances
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_doublewrite
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_doublewrite
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_adaptive_hash_index_parts
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_adaptive_hash_index
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_adaptive_hash_index
http://dimitrik.free.fr/blog/archives/2012/10/mysql-performance-innodb-buffer-pool-instances-in-56.html
https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_buffer_pool_instances
https://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_page_cleaners

Reference: Recommended Settings for MySQL 5.6, 5.7 Server for Online Transaction Processing (OLTP) and Benchmarking (Doc ID 1531329.1)

A quick summary of the 18c new features


Oracle 18c was released on 2018-02-16. True to Oracle's cloud-first philosophy, 18c is for now available only on Oracle Cloud and Engineered Systems; installing it on conventional hardware will probably have to wait until the second half of 2018.

Below is my quick review, from my own perspective, of the 18c new features worth attention (there are no doubt other noteworthy ones):

(1) Availability
1. Oracle Data Guard Multi-Instance Redo Apply Supports Use of Block Change Tracking Files for RMAN Backups
Multi-Instance Redo Apply (also known as MIRA; see page 48 of my earlier write-up on Oracle 12.2 new features) now supports backups that use a Block Change Tracking (BCT) file. For very large databases where primary and standby are both RAC and backups run on the standby, this is a very effective form of incremental backup.

2. Automatic Correction of Non-logged Blocks at a Data Guard Standby Database
Two new standby nologging modes are added (mainly to speed up loading data on the primary):
The first is Standby Nologging for Data Availability: the commit of the load operation is delayed until every standby has applied the data.

SQL> ALTER DATABASE SET STANDBY NOLOGGING FOR DATA AVAILABILITY;

The second is Standby Nologging for Load Performance: similar to the above, but when a network bottleneck is hit during the load the data is simply not sent, preserving load performance at the cost of missing data on the standby; the missing data is fetched again from the primary afterwards.

SQL> ALTER DATABASE SET STANDBY NOLOGGING FOR LOAD PERFORMANCE;

3. Shadow Lost Write Protection
A shadow tablespace (note: a bigfile tablespace) is created to provide the protection. With this in place you no longer need ADG to provide additional lost-write protection.
MySQL readers, notice anything familiar? Doesn't this look a lot like the doublewrite buffer?

4. Backups from non-CDBs are usable after migration to CDB
A non-CDB can be migrated into an existing CDB as a PDB, with its pre-migration backups remaining usable.

5. Support for PDBs as Shards and Catalogs
At last, a shard can be a PDB. But (and it is a big but) only a single PDB per CDB is supported, which frankly amounts to hardly any support at all.

6. User-Defined Sharding Method
This feature existed in the 12.2 beta but was removed from the official release. Now it has been released once again.

7. Consistency Levels for Multi-Shard Queries
A new MULTISHARD_QUERY_DATA_CONSISTENCY initialization parameter can be set before execution to avoid SCN synchronization on cross-shard queries.

8. Manual termination of runaway queries
You can now kill a statement manually without disconnecting its session, using ALTER SYSTEM CANCEL SQL:

ALTER SYSTEM CANCEL SQL 'SID, SERIAL, @INST_ID, SQL_ID';

(2) Big Data and Data Warehousing
9. Approximate Top-N Query Processing
Note: 18c adds APPROX_COUNT and APPROX_SUM to work together with APPROX_RANK.

10. LOB support with IMC, Big Data SQL
LOB objects are now supported in the In-Memory column store.

(3) Database Overall
11. Copying a PDB in an Oracle Data Guard Environment
Two new parameters make creating PDBs in an ADG environment easier.
One is STANDBY_PDB_SOURCE_FILE_DIRECTORY, which locates the standby's data file path automatically (note: before 18c, plugging a PDB into a CDB that has a standby required manually copying the files into the PDB's OMF path on the standby).

The other is STANDBY_PDB_SOURCE_FILE_DBLINK, which locates the ADG file path automatically during a remote clone (note: before 18c a local clone needed no data file copy, but a remote clone required a manual one).

12. PDB Lockdown Profile Enhancements
PDB lockdown profiles can now be created in the application root as well as the CDB root. If you are not yet familiar with the application root or lockdown profiles, see pages 7 and 40 of my earlier write-up on Oracle 12.2 new features.

You can also now create one PDB lockdown profile based on another.

18c ships with three default lockdown profiles: PRIVATE_DBAAS, SAAS and PUBLIC_DBAAS.

13. Refreshable PDB Switchover
PDB refresh一直号称是穷人的ADG,这个特性在18c中也越来越好用了,支持了switchover。switchover分成计划内和计划外两种场景。
计划内的,可以切回去,主要用于平衡CDB的负载。
计划外的,主要用于PDB master失效之后,不用整个CDB做切换。
如果你还没了解什么是PDB refresh,可以参考我之前写的【Oracle 12.2新特性介绍】的第37页)

14. PDB Snapshot Carousel
pdb的snapshot备份转盘,默认保留8份,每24小时备份一次。

ALTER PLUGGABLE DATABASE SNAPSHOT MODE EVERY 24 HOURS;

15. New Default Location of Oracle Database Password File
注意,新的密码文件路径已经在ORACLE_BASE,而不是ORACLE_HOME。

16. Read-Only Oracle Home
可以在dbca中,或者用roohctl -enable来进行read only oracle home的安装。
运行orabasehome命令可以检查当前的Oracle Home是否只读:如果这个命令输出的结果和$ORACLE_HOME一样,则表示Oracle Home是可读写的;如果输出是ORACLE_BASE/homes/HOME_NAME,则表示Oracle Home是只读的。
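上面“比较orabasehome输出与$ORACLE_HOME”的判断逻辑,可以写成下面这个小shell函数做个示意(函数名和路径都是假设的示例值,并非Oracle自带工具):

```shell
# 根据 orabasehome 的输出判断 Oracle Home 是否只读:
# 输出与 $ORACLE_HOME 相同 => 可读写;否则(输出位于 ORACLE_BASE/homes/ 下)=> 只读
is_readonly_home() {
  # $1: orabasehome 命令的输出  $2: 当前的 $ORACLE_HOME
  if [ "$1" = "$2" ]; then
    echo "read-write"
  else
    echo "read-only"
  fi
}

is_readonly_home "/u01/app/oracle/product/18.0.0/dbhome_1" "/u01/app/oracle/product/18.0.0/dbhome_1"   # read-write
is_readonly_home "/u01/app/oracle/homes/OraDB18Home1" "/u01/app/oracle/product/18.0.0/dbhome_1"        # read-only
```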

17.Online Merging of Partitions and Subpartitions
支持在线合并分区。注,需要使用ONLINE关键字。

18. Concurrent SQL Execution with SQL Performance Analyzer
SPA可以并行运行了(默认情况还是串行),帮你更快的完成SPA测试。

(四)Performance
19. Automatic In-Memory
自动In Memory会根据Heat Map,在内存使用紧张的情况下,将不常访问的IM列驱逐出内存。

20. Database In-Memory Support for External Tables
外部表支持IM特性。

21. Memoptimized Rowstore
在SGA中有一块memoptimize pool区域,大小受MEMOPTIMIZE_POOL_SIZE参数设置,当开启fast lookup的时候,就能利用该内存区域,进行快速的查找。
开启fast lookup,需要在建表语句中加上MEMOPTIMIZE FOR READ关键字。

当基于主键查询时,就能使用到fast lookup。

Memoptimized Rowstore将极大的提高物联网中基于主键的高频查询。

(五)RAC and Grid
22. ASM Database Cloning
可以基于ASM做pdb的克隆。基于asm的flex diskgroup来实现。关于flex diskgroup,参考我之前写的【Oracle 12.2新特性介绍】的第17页。


23. Converting Normal or High Redundancy Disk Groups to Flex Disk Groups without Restricted Mount

呵呵,鼓励往flex diskgroup上转型。

(六)Security
24. Integration of Active Directory Services with Oracle Database
和微软的AD结合。在18c之前,需要使用Oracle Enterprise User Security (EUS)进行交互;现在,可以使用centrally managed users (CMU),直接将AD的users和groups与Oracle的users和roles进行mapping。

(七)其他
25. 新增初始化参数:

ADG_ACCOUNT_INFO_TRACKING

FORWARD_LISTENER

INMEMORY_AUTOMATIC_LEVEL

INMEMORY_OPTIMIZED_ARITHMETIC

MEMOPTIMIZE_POOL_SIZE

MULTISHARD_QUERY_DATA_CONSISTENCY

OPTIMIZER_IGNORE_HINTS

OPTIMIZER_IGNORE_PARALLEL_HINTS

PARALLEL_MIN_DEGREE

PRIVATE_TEMP_TABLE_PREFIX

STANDBY_PDB_SOURCE_FILE_DBLINK

STANDBY_PDB_SOURCE_FILE_DIRECTORY

TDE_CONFIGURATION

UNIFIED_AUDIT_SYSTEMLOG

WALLET_ROOT

值得说一下的是OPTIMIZER_IGNORE_PARALLEL_HINTS,在纯OLTP的系统中,你终于可以禁用开发人员不受控制的并发了(往往不写并发度)。:)

26. dbms_session.sleep
现在可以直接这样调用:

exec dbms_session.sleep(3);

终于不用再单独grant dbms_lock的权限了。

看到上面的这些新特性,你怎么想?“DBA将死”、“Oracle将实现自动驾驶”?这些新特性是不是还需要DBA? :)


参考:
1. Oracle Database Database New Features Guide, 18c
2. ORACLE 18C: ORACLE 18C.. NEW FEATURES.. WHAT’S NEWS..
3. Franck Pachot (@FranckPachot)

Outline的部署和使用


Outline是一款突破网络封锁的工具,Jigsaw开发的项目,而Jigsaw是属于alphabet旗下的,而alphabet,是google的母公司。
现在你明白了吧,这是一款google出的工具。

outline的官方网站是:
https://getoutline.org/en/home

outline需要服务器端和客户端。
1. 客户端,已经有各种版本,包括Andriod、iOS等等:

iOS的下载地址是这里 ,目前中国区也还有的下载。

2. 服务器端,你需要在你自己搭建的服务器上安装,安装过程非常简单,但是还是需要在电脑上操作一下,我们需要先下载一个Outline Manager:

我们这里以Mac版为例,Mac版的Outline manager的下载地址是这里

下载后安装:


在launchpad启动outline manager,你可以看到他会叫你如何在你自己搭建的服务器上安装outline的服务器端。


默认是用Digital Ocean这家云服务商的服务器


当然你也可以使用其他任意云端的服务器:


我们以使用其他云端服务器为例,进行说明,点击get started:

看到没? 很简单,只有2步骤。

那么我在我的云端服务器运行如下命令即可:

wget -qO- https://raw.githubusercontent.com/Jigsaw-Code/outline-server/master/src/server_manager/install_scripts/install_server.sh | bash


注意,这要求云端的服务器要已经安装好docker,并且启动docker服务,并且关闭防火墙。如果你没有做到这些,你可以会遇到和我一样的报错:

[root@vultr outline_server]# wget -qO- https://raw.githubusercontent.com/Jigsaw-Code/outline-server/master/src/server_manager/install_scripts/install_server.sh | bash
> Verifying that Docker is installed .......... Docker CE must be installed, please run "curl -sS https://get.docker.com/ | sh" or visit https://docs.docker.com/install/

Sorry! Something went wrong. If you can't figure this out, please copy and paste all this output into the Outline Manager screen, and send it to us, to see if we can help you.
[root@vultr outline_server]#


此时我需要先安装docker:

[root@vultr outline_server]# curl -sS https://get.docker.com/ | sh

# Executing docker install script, commit: e749601
+ sh -c 'yum install -y -q yum-utils'
+ sh -c 'yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo'
Loaded plugins: fastestmirror
adding repo from: https://download.docker.com/linux/centos/docker-ce.repo
grabbing file https://download.docker.com/linux/centos/docker-ce.repo to /etc/yum.repos.d/docker-ce.repo
repo saved to /etc/yum.repos.d/docker-ce.repo
+ '[' edge '!=' stable ']'
+ sh -c 'yum-config-manager --enable docker-ce-edge'
Loaded plugins: fastestmirror
============================================================================================== repo: docker-ce-edge ==============================================================================================
[docker-ce-edge]
async = True
bandwidth = 0
base_persistdir = /var/lib/yum/repos/x86_64/7
baseurl = https://download.docker.com/linux/centos/7/x86_64/edge
cache = 0
cachedir = /var/cache/yum/x86_64/7/docker-ce-edge
check_config_file_age = True
compare_providers_priority = 80
cost = 1000
deltarpm_metadata_percentage = 100
deltarpm_percentage =
enabled = 1
enablegroups = True
exclude =
failovermethod = priority
ftp_disable_epsv = False
gpgcadir = /var/lib/yum/repos/x86_64/7/docker-ce-edge/gpgcadir
gpgcakey =
gpgcheck = True
gpgdir = /var/lib/yum/repos/x86_64/7/docker-ce-edge/gpgdir
gpgkey = https://download.docker.com/linux/centos/gpg
hdrdir = /var/cache/yum/x86_64/7/docker-ce-edge/headers
http_caching = all
includepkgs =
ip_resolve =
keepalive = True
keepcache = False
mddownloadpolicy = sqlite
mdpolicy = group:small
mediaid =
metadata_expire = 21600
metadata_expire_filter = read-only:present
metalink =
minrate = 0
mirrorlist =
mirrorlist_expire = 86400
name = Docker CE Edge - x86_64
old_base_cache_dir =
password =
persistdir = /var/lib/yum/repos/x86_64/7/docker-ce-edge
pkgdir = /var/cache/yum/x86_64/7/docker-ce-edge/packages
proxy = False
proxy_dict =
proxy_password =
proxy_username =
repo_gpgcheck = False
retries = 10
skip_if_unavailable = False
ssl_check_cert_permissions = True
sslcacert =
sslclientcert =
sslclientkey =
sslverify = True
throttle = 0
timeout = 30.0
ui_id = docker-ce-edge/x86_64
ui_repoid_vars = releasever,
   basearch
username =

+ sh -c 'yum makecache'
Loaded plugins: fastestmirror
base                                                                                                                                                                                       | 3.6 kB  00:00:00
docker-ce-edge                                                                                                                                                                             | 2.9 kB  00:00:00
docker-ce-stable                                                                                                                                                                           | 2.9 kB  00:00:00
extras                                                                                                                                                                                     | 3.4 kB  00:00:00
updates                                                                                                                                                                                    | 3.4 kB  00:00:00
(1/14): docker-ce-edge/x86_64/filelists_db                                                                                                                                                 | 8.5 kB  00:00:00
(2/14): docker-ce-edge/x86_64/primary_db                                                                                                                                                   |  15 kB  00:00:00
(3/14): docker-ce-stable/x86_64/primary_db                                                                                                                                                 |  12 kB  00:00:00
(4/14): docker-ce-edge/x86_64/other_db                                                                                                                                                     |  62 kB  00:00:00
(5/14): docker-ce-stable/x86_64/other_db                                                                                                                                                   |  66 kB  00:00:00
(6/14): docker-ce-stable/x86_64/filelists_db                                                                                                                                               | 7.3 kB  00:00:00
(7/14): extras/7/x86_64/prestodelta                                                                                                                                                        | 129 kB  00:00:00
(8/14): extras/7/x86_64/filelists_db                                                                                                                                                       | 709 kB  00:00:00
(9/14): updates/7/x86_64/prestodelta                                                                                                                                                       | 960 kB  00:00:00
(10/14): base/7/x86_64/other_db                                                                                                                                                            | 2.5 MB  00:00:01
(11/14): updates/7/x86_64/other_db                                                                                                                                                         | 734 kB  00:00:00
(12/14): extras/7/x86_64/other_db                                                                                                                                                          | 121 kB  00:00:00
(13/14): updates/7/x86_64/filelists_db                                                                                                                                                     | 4.2 MB  00:00:00
(14/14): base/7/x86_64/filelists_db                                                                                                                                                        | 6.7 MB  00:00:01
Loading mirror speeds from cached hostfile
 * base: repo1.dal.innoscale.net
 * extras: repo1.ash.innoscale.net
 * updates: mirror.nodesdirect.com
Metadata Cache Created
+ sh -c 'yum install -y -q docker-ce'
warning: /var/cache/yum/x86_64/7/docker-ce-edge/packages/docker-ce-18.03.0.ce-1.el7.centos.x86_64.rpm: Header V4 RSA/SHA512 Signature, key ID 621e9f35: NOKEY
Public key for docker-ce-18.03.0.ce-1.el7.centos.x86_64.rpm is not installed
Importing GPG key 0x621E9F35:
 Userid     : "Docker Release (CE rpm) <docker@docker.com>"
 Fingerprint: 060a 61c5 1b55 8a7f 742b 77aa c52f eb6b 621e 9f35
 From       : https://download.docker.com/linux/centos/gpg
If you would like to use Docker as a non-root user, you should now consider
adding your user to the "docker" group with something like:

  sudo usermod -aG docker your-user

Remember that you will have to log out and back in for this to take effect!

WARNING: Adding a user to the "docker" group will grant the ability to run
         containers which can be used to obtain root privileges on the
         docker host.
         Refer to https://docs.docker.com/engine/security/security/#docker-daemon-attack-surface
         for more information.
[root@vultr outline_server]#
[root@vultr outline_server]#


然后启动docker服务:

[root@vultr outline_server]# service docker start
Redirecting to /bin/systemctl start docker.service
[root@vultr outline_server]#


然后关闭防火墙:

[root@vultr outline_server]# service firewalld stop

开始安装:

[root@vultr outline_server]# wget -qO- https://raw.githubusercontent.com/Jigsaw-Code/outline-server/master/src/server_manager/install_scripts/install_server.sh | bash
> Verifying that Docker is installed .......... OK
> Verifying that Docker daemon is running ..... OK
> Creating persistent state dir ............... OK
> Generating secret key ....................... OK
> Generating TLS certificate .................. OK
> Generating SHA-256 certificate fingerprint .. OK
> Starting Shadowbox .......................... Unable to find image 'quay.io/outline/shadowbox:stable' locally
stable: Pulling from outline/shadowbox
605ce1bd3f31: Pulling fs layer
9d1b67fd48b4: Pulling fs layer
f87706f29a6f: Pulling fs layer
b50c2fcde876: Pulling fs layer
e1ecd3c15a4b: Pulling fs layer
f72ac4625f86: Pulling fs layer
98be2229c9b1: Pulling fs layer
5b2bb8abc0c7: Pulling fs layer
3852ab6d98b2: Pulling fs layer
8219c6ace457: Pulling fs layer
88c337662eb5: Pulling fs layer
5ce0d168fc22: Pulling fs layer
170df050f533: Pulling fs layer
b50c2fcde876: Waiting
e1ecd3c15a4b: Waiting
f72ac4625f86: Waiting
98be2229c9b1: Waiting
5b2bb8abc0c7: Waiting
3852ab6d98b2: Waiting
8219c6ace457: Waiting
88c337662eb5: Waiting
5ce0d168fc22: Waiting
170df050f533: Waiting
605ce1bd3f31: Verifying Checksum
605ce1bd3f31: Download complete
f87706f29a6f: Verifying Checksum
f87706f29a6f: Download complete
9d1b67fd48b4: Verifying Checksum
9d1b67fd48b4: Download complete
b50c2fcde876: Verifying Checksum
b50c2fcde876: Download complete
e1ecd3c15a4b: Verifying Checksum
e1ecd3c15a4b: Download complete
605ce1bd3f31: Pull complete
98be2229c9b1: Verifying Checksum
98be2229c9b1: Download complete
5b2bb8abc0c7: Verifying Checksum
5b2bb8abc0c7: Download complete
3852ab6d98b2: Verifying Checksum
3852ab6d98b2: Download complete
f72ac4625f86: Verifying Checksum
f72ac4625f86: Download complete
8219c6ace457: Verifying Checksum
8219c6ace457: Download complete
88c337662eb5: Verifying Checksum
88c337662eb5: Download complete
170df050f533: Verifying Checksum
170df050f533: Download complete
5ce0d168fc22: Verifying Checksum
5ce0d168fc22: Download complete
9d1b67fd48b4: Pull complete
f87706f29a6f: Pull complete
b50c2fcde876: Pull complete
e1ecd3c15a4b: Pull complete
f72ac4625f86: Pull complete
98be2229c9b1: Pull complete
5b2bb8abc0c7: Pull complete
3852ab6d98b2: Pull complete
8219c6ace457: Pull complete
88c337662eb5: Pull complete
5ce0d168fc22: Pull complete
170df050f533: Pull complete
Digest: sha256:ed974a668b0c858781188882cde0c802afa9a36337587884a4e7ff6a5e96ec5b
Status: Downloaded newer image for quay.io/outline/shadowbox:stable
OK
> Starting Watchtower ......................... Unable to find image 'v2tec/watchtower:latest' locally
latest: Pulling from v2tec/watchtower
a5415f98d52c: Pulling fs layer
c3f7208ad77c: Pulling fs layer
169c1e589d74: Pulling fs layer
a5415f98d52c: Verifying Checksum
a5415f98d52c: Download complete
c3f7208ad77c: Verifying Checksum
c3f7208ad77c: Download complete
169c1e589d74: Verifying Checksum
169c1e589d74: Download complete
a5415f98d52c: Pull complete
c3f7208ad77c: Pull complete
169c1e589d74: Pull complete
Digest: sha256:4cb6299fe87dcbfe0f13dcc5a11bf44bd9628a4dae0035fecb8cc2b88ff0fc79
Status: Downloaded newer image for v2tec/watchtower:latest
OK
> Waiting for Outline server to be healthy .... OK
> Creating first user ......................... OK
> Adding API URL to config .................... OK
> Checking host firewall ...................... OK

CONGRATULATIONS! Your Outline server is up and running.

To manage your Outline server, please copy the following text (including curly
brackets) into Step 2 of the Outline Manager interface:

{
  "apiUrl": "https://11.22.33.44:51714/-9w7ZBvaEt88dwpb1dASFD",
  "certSha256": "2349DDF1D15SGDEESE504TSREQQ59060B42044B04A47A32635ASB4EE249HSFES"
}

If have connection problems, it may be that your router or cloud provider
blocks inbound connections, even though your machine seems to allow them.

- If you plan to have a single access key to access your server make sure
  ports 51714 and 50581 are open for TCP and UDP on
  your router or cloud provider.
- If you plan on adding additional access keys, you’ll have to open ports
  1024 through 65535 on your router or cloud provider since the Outline
  Server may allocate any of those ports to new access keys.

[root@vultr outline_server]#

注意,上面那段apiUrl和certSha256就是要填到outline manager中的:


点击done,就会连接到远处的server。然后,在界面中点击ADD Key:


生成key之后点击share


会生成一个分享连接。把连接发给别人或者自己。


点击连接就能看到一个connect to this server,点击之后,可以看到一个ss://开头的地址,将这个地址填写到你的iPhone的客户端中,点击add server:


然后,就可以使用了。

现在可以通畅的访问所有的网络了。

注1,如果你在outline manager remove了某个key,那边你发送给别人或者自己的这个key就失效了。后续iPhone等客户端无法使用这个key连接。
注2,这是全局代理,没法写规则,所以要注意一下流量。
注3,如果你想走规则,其实也很容易。因为这个ss:\\开头的地址,复制到shadowrocket中,就会自动转换成IP、端口、加密方式和密码,你就可以直接在shadowrocket中走规则。
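ss://地址之所以能被客户端自动转换,是因为旧式(legacy)的ss链接本质上就是method:password@host:port整体做一次base64编码。下面用shell做个解码示意(其中加密方式、密码和地址都是编造的示例值):

```shell
# 构造一个旧式 ss:// 链接,再把它解码回明文(内容均为假设的示例值)
plain='aes-256-gcm:mypassword@1.2.3.4:8388'
ss_url="ss://$(printf '%s' "$plain" | base64)"
echo "$ss_url"
# 去掉 ss:// 前缀后做 base64 解码,即可得到 加密方式:密码@主机:端口
printf '%s' "${ss_url#ss://}" | base64 -d
```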

最后,再简单分析一下。
outline其实还是基于shadowsocks协议进行的通信,只不过包了一层docker。即将ss server包在docker里面,然后将docker部署到你的机器上。

centos 7中配置keepalived日志为别的路径


keepalived 安装:

cd 
./configure --prefix=/usr/local/keepalived

make &&  make install

mkdir /etc/keepalived
mkdir /etc/keepalived/scripts
cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
cp /root/keepalived-2.0.6/keepalived/etc/init.d/keepalived  /etc/init.d/
cp /usr/local/keepalived/sbin/keepalived /sbin/keepalived
cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
chmod +x /etc/init.d/keepalived

由于在默认状态下keepalived的日志会写入到/var/log/message中,我们需要将此剥离出来。

在centos 6下可以:

(1)首先修改/etc/sysconfig/keepalived文件,注释掉如下,添加如下: 
#KEEPALIVED_OPTIONS="-D"
KEEPALIVED_OPTIONS="-D -d -S 0" 

(2)其次修改 /etc/rsyslog.conf 文件,添加如下:
local0.* /var/log/keepalived.log

在centos 7 下,还需要修改/lib/systemd/system/keepalived.service 文件:

## centos 7使用。因为centos 7使用systemctl,通过systemctl调用service,所以需要修改/lib/systemd/system/keepalived.service文件。

将里面的:
EnvironmentFile=-/usr/local/keepalived/etc/sysconfig/keepalived
ExecStart=/usr/local/keepalived/sbin/keepalived $KEEPALIVED_OPTIONS
修改成:
EnvironmentFile=/etc/sysconfig/keepalived
ExecStart=/sbin/keepalived $KEEPALIVED_OPTIONS

然后重新加载service:
systemctl daemon-reload

整体的思路就是,

1. 通过systemctl start keepalived去启动;
2. 启动keepalived的时候,会去读service的配置文件:/lib/systemd/system/keepalived.service;
3. 在service的配置文件时:
3.1 启动文件路径ExecStart=/sbin/keepalived $KEEPALIVED_OPTIONS,即启动方式是带环境变量文件中参数来启动;
3.2 读取环境变量参数EnvironmentFile=/etc/sysconfig/keepalived。
4. $KEEPALIVED_OPTIONS参数是在/etc/sysconfig/keepalived的配置;我们配置的是KEEPALIVED_OPTIONS="-D -d -S 0";而-S是syslog的facility,0表示放在local0,在/etc/rsyslog.conf 中配置local0.* /var/log/keepalived.log
5. 所以,写日志就去/var/log/keepalived.log了。


raft协议学习笔记


注,需要注意的是raft是个默认消息可靠,但是不提防消息有害的系统。

(一). 共识机制有2种:
一种是leader-less(对称的),即没有leader,大家都是平等的,客户端可以连接任意的节点。
一种是leader-base(非对称的),即有leader,在任意的某个时间点,只有一个leader,其他的节点接受leader的决定。客户端只和leader 节点发生交互。

raft是属于leader-base的共识机制。注意raft是一种协议,那么它就不是一种公式,而是一种分布式系统达成共识的各种条件的约定。

(二)节点状态:
Server states
节点的状态有3种,一个是叫follower,一个是candidate,一个是叫leader。各自角色的作用,有如下约定:
(1)leader:会处理来自所有客户端的交互请求,会记录复制的进度,同一时刻,集群中只会有一个leader。
(2)follower:完全被动,不发起RPC请求(Remote Procedure Call,即远程过程调用请求),只是回应RPC请求。
(3)candidate:用于选举leader。

开始的时候,所有节点的状态为follower(后续简称为F),节点在规定的时间内,没有收到来自leader(后续简称为L)的RPC请求,发起投票,先是让自己成为candidate(后续简称C)。
在某一时刻,可能存在多个C,这些C是从F变过来的;从F变成C的时间,各个节点不尽相同,有些可能久一点,有些可能短一点;各个C也会在不同的瞬间发起投票;而发给F的路径长度不一样,收到F反馈的时间点也可能不一样。
如果在某个选举的时间单位内,C收到了大部分节点的同意的信息,那么C就变成L;如果没有收到,那么发起下一轮任期的投票。

(三)任期(Term)
raft将时间划分成一个一个的任期(term)。每个任期分成两段:一段是选举期间,一段是常规操作的时间(此时只有一个L)。
有些term是没有L的(选举失败时)。
每个节点都保留着当前的任期值。
raft term
所以,一开始,大家都是F,起初处于选举阶段(蓝色);等选出了L,进入常规操作时间(绿色);当出现超时或者故障的时候,就进入投票阶段,节点们会发起投票;投票结束后,又进入常规操作阶段。
在Term1,需要从F开始选举,所以蓝色部分比较长;在Term2,此时已经有了L,所以只需要确认L和F之间的心跳,所以蓝色比较短;在Term3,L挂掉需要重新投票,新节点获取投票成为L后,进入Term4;Term5和Term2一样。

(四)节点信息持久化
1. 每个节点会以同步的方式,在回应RPC之前,持久化如下信息:
Persistent state
这个可以看出是不是像一个小型的数据库?有日志(记录历史term和command),有数据(current term和votefor)?

2. 不用持久化的信息:
Non-persistent state

(五)心跳和timeout:
1. RPC请求包括两种类型的RPC请求,一种是AppendEntries,如写日志,如发送心跳,是有L发出来的;一种是VOTE,是由C发出来的。
timeout分成两种:一种是election timeout,即F等待多久没有收到L的消息就转为C发起选举;一种是heartbeat timeout,即L发送心跳的间隔(可参考http://thesecretlivesofdata.com/raft/)
2. 初始状态,大家都是F
3. F期望收到的是来自C或L的RPC请求
4. L必须不停的发送AppendEntries来维持自己的领导地位
5. 如果在选举超时时间(通常是100-500ms)内,F没有收到来自L的RPC请求,那么F就认为L已经死掉,F会开始一个新的选举。

(六)选举:
F选举时,会先设定一个超时的时间,是∆election到2倍的∆election时间之间。
会递增当前的Term的值。
F的状态会变成C。并且投票的第一票是给自己。并且给其他所有的节点发送VOTE的RPC请求。然后:
1. 如果在timeout内收到了大部分节点的回应,则自己成为L,并且发送AppendEntries给其他所有节点。
2. 如果别的节点已经成为了L,此时从别的节点收到AppendEntries的请求,则自己降成F。
3. 如果从其他节点没有收到任何消息,timeout,则重新发起选举。
4. 当选举完成之后,如何保证选举正确?
4.1 Safty(允许在一个Term内,最多只有一个L):
(a)每个节点在每次term内,只投出一票。
(b)2个不同的C,不能在同一个的term内累积“大多数”的投票。
4.2 Liveness(一些C必须最终获胜)
(a)election timeouts是随机的,(在∆election到2倍的∆election之间)
(b)在其他节点醒来之前,一个先发起的节点通常会timeout,并且赢得选举。
(c)如果∆election >> broadcast time,这种Liveness的机制将会工作的很好。
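“在∆election到2倍的∆election之间随机取election timeout”这一点,可以用一小段shell做个示意(∆election取150ms只是假设值):

```shell
# 选举超时在 [∆election, 2*∆election] 之间随机取值的示意(单位:ms,150 为假设值)
t_election=150
# RANDOM % (t_election + 1) 落在 [0, 150],加上基数后落在 [150, 300]
election_timeout=$(( t_election + RANDOM % (t_election + 1) ))
echo "$election_timeout"
```

正是因为各节点的timeout是随机错开的,通常总有某个节点先醒来发起选举并胜出,避免了反复的选票瓜分。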

(七)日志的结构:
1. log是被持久化在磁盘上,用于crash恢复
2. commited的意思是,被大多数节点已经写入。
3. 最终一致性
Log structure
可以从上图看到日志的结构,包含2部分,一部分是term的值,一部分是command的历史信息。

(八)常规操作:
1. 客户端将command传给L
2. L将command写日志
3. L将AppendEntries的RPC请求发给F
4. 一旦新条目commit,L将command传给自己的状态机,向客户端返回结果。L在后续AppendEntries的RPC请求中告诉F哪些条目已被commit,F再将commit的命令应用到自己的状态机。
5. 遇到crash或者slow的F,L会一直重试发送直到成功。通常情况下性能是最优的:一个RPC请求只要成功到达大多数节点即可。

(九)一致性
1. 什么是日志的一致性:所有的节点的日志,都有一样的index和term;如果某个给定的条目是已经commit的了,那么前面的所有的条目也是commit的。
如下图所示,数字123456是代表log index,所有的节点,都有一样的index和term,且某个给定的条目,如index=4,T2的条目,已经是commit的,那么之前的条目也都是commit的。
Consistency in logs

2. 什么是AppendEntries的一致性检查:每个AppendEntries的RPC请求,都包含需要处理的新的index和term;F必须包含有相符合的条目,不然就拒绝新来的AppendEntries请求;将上述步骤实施递归步骤,确保一致性。
如下图所示,每个新来的AppendEntries RPC请求,都包含了index和term,且在下面的第二个图中,由于F包含的条目和L不一致,所以会拒绝log index=5的新的AppendEntries的请求。
AppendEntries Consistency
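上面的AppendEntries一致性检查,可以用一个极简的shell函数做个示意(假设F的日志按行存成"index:term"的形式,函数名和文件路径都是编造的示例):

```shell
# AppendEntries 一致性检查示意:日志中必须存在与 (prev_index, prev_term) 完全匹配的条目,
# 否则 F 拒绝这次 AppendEntries 请求
check_prev_entry() {
  log_file="$1"; prev_index="$2"; prev_term="$3"
  if grep -qx "${prev_index}:${prev_term}" "$log_file"; then
    echo accept
  else
    echo reject
  fi
}

printf '1:1\n2:1\n3:2\n4:2\n' > /tmp/follower.log
check_prev_entry /tmp/follower.log 4 2   # accept:index=4 的条目 term 相符
check_prev_entry /tmp/follower.log 5 3   # reject:没有 index=5/term=3 的条目
```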

(十)Leader的产生
1. L产生的开始:
1.1 旧的L可能会留下一些只被部分同步的条目
1.2 新的L只是做“常规操作”,并不会做一些特别的动作。
1.3 L的日志,是“真理”,会以L的日志为准。
1.4 F的日志最终会到达和L一致。最终一致性。
1.5 多次崩溃可能会留下许多无关的日志条目。
At beginning of new leader’s term

2. Safty的要求:
2.1. 如果L已确认某个log条目是已经commit了的,则该条目将出现在所有未来L的日志中
2.2. L不会覆盖写日志的条目:只有在L的日志中的条目,才能被commit;日志条目只有commit之后,才会被同步到其他节点。
2.3. 集群成员数量改变,每次只变一台,不同时变多台。即使有,在内部操作时也是拆成一台一台的改变。如果同时改变多台,可能出现脑裂的情况:同一时刻存在老配置的leader和新配置的leader两个leader。

2.4. 集群成员数量改变,采用两阶段方法变动(即存在同时为C-old和C-new,老配置和新配置同时生效的时刻)。

Azure云MySQL数据库受限功能列表


微软Azure云MySQL功能受限列表(截止2018年4月)

| 分类 | 功能 | Azure云MySQL支持情况 |
| --- | --- | --- |
| 多样性 | 支持的数据库种类 | SQL Server、MySQL、MariaDB、PostgreSQL、CosmosDB(类似MongoDB)、Redis |
| 高可用性 | 支持的区域 | Azure中国账号支持2个区域;Azure全球账号支持22个区域 |
| 高可用性 | 高可用性(RDS本身的高可用性,如multiAZ) | 支持 |
| 备份恢复 | 支持在线复制功能 | 不支持 |
| 备份恢复 | 支持克隆功能 | Azure中国账号支持;Azure全球账号不支持 |
| 备份恢复 | 备份日志 | 不支持 |
| 备份恢复 | 支持恢复到任意时间点(多种存储引擎) | 不支持,未看到恢复到指定时间点的菜单 |
| 同步 | 支持Paas层的专线同步(如中美专线) | 需要后台帮助配置代理才能做专线同步 |
| 同步 | 同步延时界面 | 不支持 |
| 同步 | 有RDS专用的全量迁移工具 | 不支持 |
| 同步 | 有RDS专用的增量迁移工具 | 支持,VNET service tunnel |
| 同步 | 支持MySQL的GTID同步 | 不支持,但是可以开工单让后台开启 |
| 权限管理 | 用户管理的模式(root用户的权限) | 不支持 |
| 安全和审计 | 支持VPC | 支持 |
| 安全和审计 | 支持安全组 | 不支持 |
| 安全和审计 | 支持审计 | 不支持,没有看到数据库审计相关菜单 |
| 安全和审计 | 支持存储加密 | 不支持,没有看到数据落盘加密菜单 |
| 安全和审计 | 支持连接加密(SSL) | 支持 |
| 监控 | 支持常用指标监控及配置告警 | Azure中国账号没有提供监控指标;Azure全球账号有提供监控指标 |
| 监控 | 支持数据库层面的性能监控(等待事件,长事务) | 不支持 |
| 监控 | 查看数据库错误日志 | 不支持 |
| 销毁 | 保留最后快照到云存储上 | 不支持 |
| 扩展性 | 支持的最高CPU、内存、IOPS | 32 v-core,160G内存,100-30000 IOPS |
| 扩展性 | 支持在线扩容、缩容 | 支持 |

MySQL 不显示输出结果


有的时候,想看看语句执行时间有多长,但是有不想看的刷屏的输出,各个数据库可以用下面的方法:
(1)Oracle: set autotrace trace,恢复的话,用set autotrace off
(2)postgresql: EXPLAIN ANALYZE
(3)MySQL: pager cat > /dev/null,恢复的话,直接打pager

MySQL的举例说明一下:

mysql> pager
Default pager wasn't set, using stdout.
mysql> 
mysql> select count(*) from orasup1;
+----------+
| count(*) |
+----------+
|   960896 |
+----------+
1 row in set (0.60 sec)

mysql> pager cat > /dev/null
PAGER set to 'cat > /dev/null'
mysql> 
mysql> select count(*) from orasup1;
1 row in set (0.65 sec)

mysql> pager
Default pager wasn't set, using stdout.
mysql> 
mysql> select count(*) from orasup1;
+----------+
| count(*) |
+----------+
|   960896 |
+----------+
1 row in set (0.63 sec)

mysql>
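顺带一提,pager能接任意外部命令。参考文章里还提到一个玩法:pager md5sum,在不刷屏的情况下对比两次查询的结果集是否一致。其原理就是把结果集通过管道交给外部命令,下面用纯shell示意这个原理(数据为示例值):

```shell
# 模拟 pager md5sum 的效果:相同的结果集得到相同的摘要,不同的结果集摘要不同
r1=$(printf '1\n2\n3\n' | md5sum)
r2=$(printf '1\n2\n3\n' | md5sum)
r3=$(printf '1\n2\n4\n' | md5sum)
[ "$r1" = "$r2" ] && echo "result sets identical"
[ "$r1" = "$r3" ] || echo "result sets differ"
```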

参考: Fun with the MySQL pager command

SQL Server报错The datediff function resulted in an overflow


zabbix的监控有一个报错:

The datediff function resulted in an overflow. The number of dateparts separating two date/time instances is too large. Try to use datediff with a less precise datepart.

经检查,这个报错,调用的是下面的一个监控:

select count(*) as cnt from sys.sysprocesses 
where  DateDiff(ss,last_batch,getDate())>=10 
and lastwaittype Like 'PAGE%LATCH_%' 
And waitresource Like '2:%'

这个监控脚本,是用来监控发生在temp上的pagelatch_up的争用。监控脚本中,包含了datediff函数。datediff的返回值如果overflow,将导致上面的报错。

我们来看看,datediff这个值溢出的情况。在官方文档中,datediff函数定义返回的是int值,int值的取值范围是 (-2,147,483,648 to +2,147,483,647)。所以,第一步的怀疑,是抓取的起始时间和结束时间之差,溢出了。

那么,什么时候会溢出? 如果进程是刚刚发起的,那么之间的差值,应该会很短,不会溢出。那么离目前时间最远的进程,会不会溢出?

在SQL Server中进程分为客户端进程和系统进程,一般情况下,客户端进程都是最近发起的,所以时间差不会溢出。是否是系统进程导致时间差溢出的呢?

因为系统进程不是客户端发起的,所以系统进程的last_batch时间,就是数据库的启动时间,我们检查了一下数据库的启动时间:

SELECT sqlserver_start_time FROM sys.dm_os_sys_info;

发现是2018-04-14 22:02:46.377。这个时间是否有可能导致溢出?

还是根据官方文档:

可以看到,如果是到秒级,即datediff(ss),中间的时间差是可以长达68年19天3小时14分7秒的。而我们的数据库启动时间,远远没有超过68年。
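用GNU date(假设在Linux环境,日期为示例值)可以简单验证:1900-01-01到2018年之间的秒数差,确实远超int上限2147483647(约68年),所以1900年的last_batch会让datediff(ss)溢出:

```shell
# 计算 1900-01-01 到 2018-05-01(示例日期)之间的秒数,并与 int 上限比较
start=$(date -u -d '1900-01-01 00:00:00' +%s)
end=$(date -u -d '2018-05-01 00:00:00' +%s)
diff=$(( end - start ))
echo "$diff"
[ "$diff" -gt 2147483647 ] && echo "datediff(ss) overflow"
```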

去掉where条件之后,重新运行了几次上述的SQL语句,没有发现早于2018-04-14的。

正当束手无策的时候,想起在这个数据库上部署过msawr,会定期snapshot各项性能指标,那么可以从msawr中去找找线索。

确实,我们在msawr中发现了有些进程的last_batch早于数据库启动的时间,这个时间,是1900-01-01 00:00:00.000。

last_batch的含义,在官方文档是这样解释的:

last_batch是个datetime的值,在官方文档的说明中,datetime类型默认值是1900:

而last_batch的这个字段,是not null:

也就是说,在为null的情况下,这个datetime类型的值,将有默认值来填充,所以也就出现了1900-01-01 00:00:00.000。

那么sysprocesses的last_batch为什么会出现空值,进而被替代成1900-01-01 00:00:00.000?网上的很多文章,都归结到微软的这个文章:”INF: Last Batch Date is Seen as 1900-01-01 00:00:00.000″ at http://support.microsoft.com/?kbid=306625 ,但是点进去你会发现,这个文章已经404找不到了。

幸好,还有另外的一个文章启发了我:

它说:

However, it's possible to create a connection to SQL Server without issuing any RPC calls at all. In this case, the value of last_batch will never have been set and master..sysprocesses will display the value as 1900-01-01 00:00:00.000.

也就是说,由非远程调用(RPC,remote procedure call)发起的进程,其last_batch是null值,而null值继而会被1900-01-01 00:00:00.000所替代。

我们进而查看这些进程的lastwaittype,发现其中大部分是CXPACKET的并行等待。

所以,这些应该是并行进程,不是通过RPC远程调用发起的,而是直接在本地发起的。第一次执行时,last_batch没有被更新,留下了null值,进而被替换成了1900年,从而导致了我们的溢出报错。

解决方式也很简单。因为1900年的固定的值,加个条件and last_batch<>‘1900-01-01 00:00:00.000’ 就可以了。

select count(*) as cnt from sys.sysprocesses 
where  DateDiff(ss,last_batch,getDate())>=10 
and lastwaittype Like 'PAGE%LATCH_%' 
And waitresource Like '2:%'
and last_batch<>'1900-01-01 00:00:00.000'

解决openwrt中关于某些域名无法解析的问题


之前刷的一个openwrt的路由,虽然能很方便的登陆google和百度,但是发现不少网站还是登陆不上去,连我自己的博客也无法登陆。

检查连一下,发现是我的博客的域名无法解析。

root@OpenWrt:/etc/dnsmasq.d# dig oracleblog.org

; <<>> DiG 9.9.4 <<>> oracleblog.org
;; global options: +cmd
;; connection timed out; no servers could be reached
root@OpenWrt:/etc/dnsmasq.d# 
root@OpenWrt:/etc/dnsmasq.d#
root@OpenWrt:/etc/dnsmasq.d#
root@OpenWrt:/etc/dnsmasq.d#
root@OpenWrt:/etc/dnsmasq.d# dig youtube.com

; <<>> DiG 9.9.4 <<>> youtube.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21312
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1024
;; QUESTION SECTION:
;youtube.com.                   IN      A

;; ANSWER SECTION:
youtube.com.            613     IN      A       216.58.221.238

;; Query time: 33 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Feb 01 15:05:10 CST 2019
;; MSG SIZE  rcvd: 56

root@OpenWrt:/etc/dnsmasq.d#

那就是一个域名解析的问题了,由于我是通过dnsmasq进行域名解析,在我的/etc/dnsmasq.d目录下,已经有连需要特别解析的配置,那么剩下的就是一般的域名走默认配置。

检查了一下,发现没有配置no-resolv和server。把/etc/dnsmasq.conf添加如下,就可以解决了(见第6行~12行):

# for targets which are names from DHCP or /etc/hosts. Give host
# "bert" another name, bertrand
# The fields are <cname>,<target>
#cname=bertand,bert
conf-dir=/etc/dnsmasq.d
#Add by Jimmy BEGIN HERE 
no-poll
no-resolv
all-servers
cache-size=5000
server=114.114.114.114
#Add by Jimmy END HERE

root@OpenWrt:/etc/dnsmasq.d# dig oracleblog.org

; <<>> DiG 9.9.4 <<>> oracleblog.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24044
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;oracleblog.org.                        IN      A

;; ANSWER SECTION:
oracleblog.org.         300     IN      A       45.76.217.207

;; Query time: 259 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Feb 01 16:24:10 CST 2019
;; MSG SIZE  rcvd: 59

root@OpenWrt:/etc/dnsmasq.d#

aws RDS 版本升级最佳实践的探讨


这篇文章其实在草稿箱中存在了挺长的一段时间,去年10月就已经开始写了,但是由于工作上的其他事情的干扰,一直还没写完。所以你可以看到我画的图中,now其实是指2018年10月(OCT)。趁着过年休假,把这个文章终于写完了。

aws rds被强制升级是个无奈的事情,版本不支持,而被强制升级会影响业务可用性。与其被动强制升级,不如制定主动升级战略。

1. aws RDS 的升级周期说明:
根据亚马逊的文档 Amazon RDS FAQs上的说明,aws RDS的大版本,至少能支持3年,小版本至少会支持1年。

根据和aws的交流得知,一般社区基本版本发布约5个月之后,aws会发布基于aws的RDS。

因此,aws的RDS升级周期是,待社区版本发布后,约5个月,aws发布对应的版本,每个大版本至少支持3年,每个小版本至少支持1年。
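按这个周期,可以用GNU date粗略推算时间点(下面以2018-10-01作为社区版GA日期的假设示例:社区发布约5个月后aws发布对应版本,小版本从aws发布起至少支持12个月):

```shell
community_ga='2018-10-01'                                  # 假设的社区版 GA 日期
aws_ga=$(date -u -d "$community_ga +5 months" +%Y-%m-%d)   # aws 发布约在 5 个月后
minor_eol=$(date -u -d "$aws_ga +12 months" +%Y-%m-%d)     # 小版本至少支持 12 个月
echo "aws release: $aws_ga, earliest minor EOL: $minor_eol"
```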

2. aws RDS的版本过期的后果:
根据亚马逊的文档 Amazon RDS FAQs上的说明,当某个大版本或者小版本,过了亚马逊的服务支持期,亚马逊会提前提醒客户(大版本提前6个月提醒,小版本提前3个月提醒),在提醒期过后,aws会强制自动升级数据库到最新的版本(即使客户选择的是关闭了自动小版本升级)。升级的过程,应用程序无法连接数据库,造成业务影响。

  • 注1:无论大版本,还是小版本,一旦过了亚马逊的服务支持期,都会面临强制升级的过程。
  • 注2:小版本的升级过程,会包含备份,升级,再次备份。经验值是第一次备份和最后一次备份,不影响业务正常访问,升级数据库的过程,影响业务正常访问。整个升级的过程,大约30分钟,其中影响业务访问的时间为3分半钟。但具体的业务影响时间,以实际测试为准。
  • 注3:小版本在提醒期的deadline来之前的一周,已经不能对数据库做任何modify的操作,包括搭建replica或者更改维护窗口。但是可以从备份的snapshot还原出来一个数据库,用于测试升级的时长。
  • 注4:小版本升级步骤是先升级从库,再升级主库。

3. 内部升级步骤解析:

即:
a). 在升级前,做一次快照,注意这个快照的时间,和数据库的大小的有关。
b). 进行slow shutdown,即set global innodb_fast_shutdown=0然后进行shutdown。由于设置了slow shutdown,因此dirty buffer会刷到磁盘上+insert buffer 也会刷到磁盘上(即system tablespace,ibdata1中)+full purge(即清理无用的undo页)
c). 将mysql挂载到新的存储引擎下,并且禁止远程网络访问;
d). 运行mysql_upgrade程序,升级数据字典。
e). 运行RDS特殊的一些脚本,以便升级RDS封装的表和存储过程。
f). 重启实例,开放网络远程连接。

  • 注1,在某些情况下,mysql_upgrade这个步骤会物理的重建表,表的大小会影响升级时间,所以实际升级的时间,需要以测试为准。如 MySQL 5.6.4 升级到5.7版本,因为 5.6.4 版本中的TIME, DATETIME, 和TIMESTAMP类型的存储有改变,升级的时候,需要重建表。
  • 注2,由于不能跨大版本升级,如从MySQL 5.5.46升级到5.7.19,不能直接升级,需要先将5.5.46升级到5.6(如5.6.37),再升级到5.7.19,因此业务受影响的时间是两次升级的时间,而不是一次。故不做跨大版本的直接升级,而是按5.5升5.6、5.6升5.7这样逐级推进。

4.版本发布路线图:
根据社区发布的版本时间,和aws已经发布的版本的时间,我们可以作出下面的发布路线图。
MySQL:

Postgresql:

  • 注1,最开始的浅绿色表示社区版第一版的发布时间,后面的灰色,表示社区版基于第一版之后的小版本GA的时间,而其对应的aws发布的版本是彩色的。
  • 注2,通常情况,aws小版本至少支持一年(即12个月),但是有些小版本,aws已经支持超过了12个月,有可能会随时终止支持,所以我画到了截止当前时间(2018年10月),后面的时间没有继续画。(即没有画的不表示不支持,只是表示aws版本发布超过了12个月,在此之后可能会被终止支持而强制下线)

5.升级最佳实践:
5.1. 大版本升级:
a). 先创建2个replica实例;
b). 升级其中一个实例到高版本,此时,还保持着主从的同步关系;
c) .创建dms实例,配置好源和目标的endpoint,和创建好task,注意创建task时选择changes only,并且取消 Start task on create的勾勾。
d). 业务中断开始,将新建的replica实例提升为主库;
d). 点击dms的task中的start ,等待其完成全量数据库的对比,开始准备同步增量数据;
e). 切换应用连接到高版本的数据库;

  • 注1,从5.6.4以下的版本升级到5.6.4之后,需要alter table table_name force,重建表,才能使用online ddl的方式create index。
  • 注2,大版本升级,需要验证应用程序的性能,需要抓取至少一周的SQL,进行sql replay看性能的变化。
  • 注3,升级之后,为了减少物理读,尽快的将更多的数据加载到内存,可以用mysqldump做prewarm
  • 注4,减少downtime,其中一个步骤是dms点击task的start进行全量数据的校对,如果加大主库的IOPS,有助于提高该步骤的速度。(该步骤是业务停机操作的,因此减少该步骤的时间,等于减少停业务时间)。


aws mysql major version upgrade best practise.pdf

5.2. 小版本升级:

方法一:
a). 先创建replica实例,或直接使用现有的replica实例;
b). 升级replica实例到高版本,此时,还保持着主从的同步关系;
c). 业务中断开始,将高版本的replica实例提升为主库;
d). 切换应用连接到高版本的数据库。应用的连接串配置,可以提前配置好,重启应用即可;

aws mysql minor version upgrade best practise.pdf


方法二:
a). 先升级replica实例到高版本,这是所有aws升级到必要前提,即必须先升级从库;
b). 中断业务和数据库之间的连接,开始升级主库;
c). 将主库升级到高版本;
d). 恢复应用连接;

aws mysql minor version upgrade best practise_2.pdf

 

  • 注1,方法一是aws推荐的方案,但是方案二,对于小系统也是非常合适的。
  • 注2,方法一的应用影响时间,是提升从库为主库的时间+应用重启的时间。根据我们的某个数据库的测试,提升的时间,大约是3分钟02秒。加上应用重启时间,也大约是3分半钟。
  • 注3,方法二,我们的某个数据库测试数据是,整体的升级时间大约是34分钟(因为包含了升级前数据库做backup和升级完成后做backup,这都是升级过程中,aws自己做的),而这34分钟,并不是应用都不可用,在做数据库backup时,数据库还是可以用的,真正业务不能连数据库的时间,是3分32秒。
  • 注4,两个方法,服务不可用的时间都差不多,都是大约3分半钟。但是方法一有个风险,就是如果是因为需要强制升级小版本,已经快到升级的维护时间,且已经是deadline的维护时间,那么虽然我们没有去动主库,但万一失败需要切换回主库,而强制升级的时间又到了,触发强制升级,那么此时就是一个不可控的状态了。因此我们还是选择了方法二。
  • 注5,最终应该选择哪个方法,还是要依赖实际做升级测试的演练情况而定。

6.总结:
因此,我们可以制定如下的主动升级战略:

(1). 禁止所有的小版本自动升级;

(2). 根据上面所述,规定今后MySQL的新安装版本为5.7.23;

(3). 在一年内,对于之前MySQL 5.5版本,小版本统一过渡到5.5.61,MySQL 5.6版本,小版本统一过渡到5.6.41。这个可以避免MySQL的小版本因为不被支持导致强制升级,并且这2个版本的下一次强制升级时间,至少是在2019年9月之后。(pg类似指导思路);

(4). 在一年内,对于之前的MySQL 5.5版本升级到5.6版本;在两年内,对于MySQL 5.6版本,升级到5.7版本;在两到三年内,统一到MySQL 8.0版本。解决由于多版本共存,导致运维难度增加的问题。(pg类似指导思路);

(5). 后续的版本升级,将会按照1年一升小版本,3年一升大版本的进度推进,以符合aws RDS的版本支持规则。



参考文档:

Upgrading the MySQL DB Engine

AWS RDS Blog

AWS RDS forum

What’s New with AWS

What’s New with AWS – RDS/

Changes in MySQL 5.7

MySQL 8.0 Release Notes 

MySQL 5.7 Release Notes 

MySQL 5.6 Release Notes

MySQL 5.5 Release Notes

Check and Upgrade MySQL Tables

Amazon RDS FAQs

Best Practices for Upgrading Amazon RDS for MySQL and Amazon RDS for MariaDB

Innodb三大特性之insert buffer

InnoDB Insert Buffer(插入缓冲)

Upgrading the PostgreSQL DB Engine

PostgreSQL Release Notes

Installing Oracle 19c on Docker


Installing on Docker is very simple.
At its core it is just two commands:

./buildDockerImage.sh -v 19.2.0 -e
docker run --name oracle19c -p 1521:1521 -p 5500:5500 -v /Users/lovehouse/iDocker/dockervolums/oradata/oracle19c:/opt/oracle/oradata oracle/database:19.2.0-ee

We assume you already have Docker installed on your Mac; now let's install Oracle 19c. Installing a database or application on Docker is driven by a Dockerfile. Oracle has not yet published an official Dockerfile for 19c, but we can use one that someone else has already prepared (thanks to Kamus for pointing me to this Dockerfile).

If you don't know how to install Docker on a Mac, see my earlier post 《在Mac上安装docker并部署oracle 12.2》

Let's first try the official Dockerfiles on GitHub:

LoveHousedeiMac:iDocker lovehouse$ pwd
/Users/lovehouse/iDocker/oracle
LoveHousedeiMac:iDocker lovehouse$ git clone https://github.com/oracle/docker-images.git
Cloning into 'docker-images'...
remote: Enumerating objects: 77, done.
remote: Counting objects: 100% (77/77), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 9878 (delta 25), reused 55 (delta 23), pack-reused 9801
Receiving objects: 100% (9878/9878), 10.20 MiB | 2.47 MiB/s, done.
Resolving deltas: 100% (5686/5686), done.
LoveHousedeiMac:iDocker lovehouse$ 
LoveHousedeiMac:iDocker lovehouse$ 
LoveHousedeiMac:iDocker lovehouse$ ls -l
total 0
drwxr-xr-x  31 lovehouse  staff  1054 Feb 16 17:07 docker-images
LoveHousedeiMac:iDocker lovehouse$ cd docker-images/OracleDatabase/SingleInstance/dockerfiles    
LoveHousedeiMac:dockerfiles lovehouse$ ls -l
total 16
drwxr-xr-x   8 lovehouse  staff   272 Feb 16 17:07 11.2.0.2
drwxr-xr-x  18 lovehouse  staff   612 Feb 16 17:07 12.1.0.2
drwxr-xr-x  16 lovehouse  staff   544 Feb 16 17:07 12.2.0.1
drwxr-xr-x  16 lovehouse  staff   544 Feb 16 17:07 18.3.0
drwxr-xr-x   8 lovehouse  staff   272 Feb 16 17:07 18.4.0
-rwxr-xr-x   1 lovehouse  staff  5088 Feb 16 17:07 buildDockerImage.sh
LoveHousedeiMac:dockerfiles lovehouse$

We can see that only 11.2.0.2, 12.1.0.2, 12.2.0.1, 18.3.0 and 18.4.0 are available; 19c has not been published yet.

So we use the Dockerfile prepared by marcelo-ochoa; the details are here. Let's start the installation:
1. First, git clone marcelo-ochoa's dockerfiles:

LoveHousedeiMac:iDocker lovehouse$ mkdir marcelo-ochoa
LoveHousedeiMac:iDocker lovehouse$ cd /Users/lovehouse/iDocker/marcelo-ochoa
LoveHousedeiMac:marcelo-ochoa lovehouse$ 
LoveHousedeiMac:marcelo-ochoa lovehouse$  git clone https://github.com/marcelo-ochoa/docker-images.git
Cloning into 'docker-images'...
remote: Enumerating objects: 24, done.
remote: Counting objects: 100% (24/24), done.
remote: Compressing objects: 100% (20/20), done.
remote: Total 9111 (delta 7), reused 7 (delta 3), pack-reused 9087
Receiving objects: 100% (9111/9111), 10.01 MiB | 1.59 MiB/s, done.
Resolving deltas: 100% (5204/5204), done.
LoveHousedeiMac:marcelo-ochoa lovehouse$

We can see that a 19.2.0 Dockerfile exists; let's also check the installer file name it expects:

LoveHousedeiMac:dockerfiles lovehouse$ cd /Users/lovehouse/iDocker/marcelo-ochoa/docker-images/OracleDatabase/SingleInstance/dockerfiles
LoveHousedeiMac:dockerfiles lovehouse$ ls -l
total 16
drwxr-xr-x   8 lovehouse  staff   272 Feb 16 17:09 11.2.0.2
drwxr-xr-x  18 lovehouse  staff   612 Feb 16 17:09 12.1.0.2
drwxr-xr-x  16 lovehouse  staff   544 Feb 16 17:09 12.2.0.1
drwxr-xr-x  16 lovehouse  staff   544 Feb 16 17:09 18.3.0
drwxr-xr-x   8 lovehouse  staff   272 Feb 16 17:09 18.4.0
drwxr-xr-x  17 lovehouse  staff   578 Feb 16 17:33 19.2.0
-rwxr-xr-x   1 lovehouse  staff  5145 Feb 16 17:09 buildDockerImage.sh
LoveHousedeiMac:dockerfiles lovehouse$ cd 19.2.0
LoveHousedeiMac:19.2.0 lovehouse$ ls -l
total 136
-rw-r--r--  1 lovehouse  staff    49 Feb 16 17:09 Checksum.ee
-rw-r--r--  1 lovehouse  staff  3405 Feb 16 17:09 Dockerfile
-rwxr-xr-x  1 lovehouse  staff  1148 Feb 16 17:09 checkDBStatus.sh
-rwxr-xr-x  1 lovehouse  staff   905 Feb 16 17:09 checkSpace.sh
-rwxr-xr-x  1 lovehouse  staff  3012 Feb 16 17:09 createDB.sh
-rw-r--r--  1 lovehouse  staff  6878 Feb 16 17:09 db_inst.rsp
-rw-r--r--  1 lovehouse  staff  9204 Feb 16 17:09 dbca.rsp.tmpl
-rwxr-xr-x  1 lovehouse  staff  2526 Feb 16 17:09 installDBBinaries.sh
-rwxr-xr-x  1 lovehouse  staff  6526 Feb 16 17:09 runOracle.sh
-rwxr-xr-x  1 lovehouse  staff  1015 Feb 16 17:09 runUserScripts.sh
-rwxr-xr-x  1 lovehouse  staff   758 Feb 16 17:09 setPassword.sh
-rwxr-xr-x  1 lovehouse  staff   932 Feb 16 17:09 setupLinuxEnv.sh
-rwxr-xr-x  1 lovehouse  staff   678 Feb 16 17:09 startDB.sh
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ cat Dockerfile 
LoveHousedeiMac:19.2.0 lovehouse$ cat Dockerfile |grep INSTALL_FILE_1
    INSTALL_FILE_1="V981623-01.zip" \
COPY --chown=oracle:dba $INSTALL_FILE_1 $INSTALL_RSP $INSTALL_DB_BINARIES_FILE $INSTALL_DIR/
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:dockerfiles lovehouse$

As we can see, the installer is named V981623-01.zip, the same name as the database installer downloaded from edelivery.oracle.com, so no renaming is needed.

2. Copy the installer into that directory and start the build:

LoveHousedeiMac:19.2.0 lovehouse$ pwd
/Users/lovehouse/iDocker/marcelo-ochoa/docker-images/OracleDatabase/SingleInstance/dockerfiles/19.2.0
LoveHousedeiMac:19.2.0 lovehouse$ cp /Users/lovehouse/Downloads/V981623-01.zip ./
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ ls -l
total 11528424
-rw-r--r--  1 lovehouse  staff          49 Feb 16 17:09 Checksum.ee
-rw-r--r--  1 lovehouse  staff        3405 Feb 16 17:09 Dockerfile
-rw-r--r--@ 1 lovehouse  staff  3032822863 Feb 16 17:33 V981623-01.zip
-rw-r--r--@ 1 lovehouse  staff  2869657581 Feb 16 17:14 V981627-01.zip
-rwxr-xr-x  1 lovehouse  staff        1148 Feb 16 17:09 checkDBStatus.sh
-rwxr-xr-x  1 lovehouse  staff         905 Feb 16 17:09 checkSpace.sh
-rwxr-xr-x  1 lovehouse  staff        3012 Feb 16 17:09 createDB.sh
-rw-r--r--  1 lovehouse  staff        6878 Feb 16 17:09 db_inst.rsp
-rw-r--r--  1 lovehouse  staff        9204 Feb 16 17:09 dbca.rsp.tmpl
-rwxr-xr-x  1 lovehouse  staff        2526 Feb 16 17:09 installDBBinaries.sh
-rwxr-xr-x  1 lovehouse  staff        6526 Feb 16 17:09 runOracle.sh
-rwxr-xr-x  1 lovehouse  staff        1015 Feb 16 17:09 runUserScripts.sh
-rwxr-xr-x  1 lovehouse  staff         758 Feb 16 17:09 setPassword.sh
-rwxr-xr-x  1 lovehouse  staff         932 Feb 16 17:09 setupLinuxEnv.sh
-rwxr-xr-x  1 lovehouse  staff         678 Feb 16 17:09 startDB.sh
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$

LoveHousedeiMac:19.2.0 lovehouse$ cd ..
LoveHousedeiMac:dockerfiles lovehouse$ ls
11.2.0.2                12.1.0.2                12.2.0.1                18.3.0                  18.4.0                  19.2.0                  buildDockerImage.sh
LoveHousedeiMac:dockerfiles lovehouse$ ./buildDockerImage.sh -v 19.2.0 -e
Ignored MD5 sum, 'md5sum' command not available.
==========================
DOCKER info:
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 18.09.2
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true

......

Step 24/24 : CMD exec $ORACLE_BASE/$RUN_FILE
 ---> Running in f4bddc96e630
Removing intermediate container f4bddc96e630
 ---> 65cdd07a7bc1
Successfully built 65cdd07a7bc1
Successfully tagged oracle/database:19.2.0-ee


  Oracle Database Docker Image for 'ee' version 19.2.0 is ready to be extended: 
    
    --> oracle/database:19.2.0-ee

  Build completed in 574 seconds.
  
LoveHousedeiMac:dockerfiles lovehouse$     
LoveHousedeiMac:dockerfiles lovehouse$

The full log is attached: build19c.log

The image is now built. Note that it also pulls in a slim Oracle Linux image, the same approach used for the 12.2 build:

LoveHousedeiMac:dockerfiles lovehouse$ docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
oracle/database     19.2.0-ee           65cdd07a7bc1        About an hour ago   6.33GB
oraclelinux         7-slim              c3d869388183        4 weeks ago         117MB
LoveHousedeiMac:dockerfiles lovehouse$

3. Now let's create the database instance:
Note the docker run syntax for Oracle Enterprise Edition is as follows (the XE edition differs):

docker run --name <container name> \
-p <host port>:1521 -p <host port>:5500 \
-e ORACLE_SID=<your SID> \
-e ORACLE_PDB=<your PDB name> \
-e ORACLE_PWD=<your database passwords> \
-e ORACLE_CHARACTERSET=<your character set> \
-v [<host mount point>:]/opt/oracle/oradata \
oracle/database:18.3.0-ee

Parameters:
   --name:        The name of the container (default: auto generated)
   -p:            The port mapping of the host port to the container port. 
                  Two ports are exposed: 1521 (Oracle Listener), 5500 (OEM Express)
   -e ORACLE_SID: The Oracle Database SID that should be used (default: ORCLCDB)
   -e ORACLE_PDB: The Oracle Database PDB name that should be used (default: ORCLPDB1)
   -e ORACLE_PWD: The Oracle Database SYS, SYSTEM and PDB_ADMIN password (default: auto generated)
   -e ORACLE_CHARACTERSET:
                  The character set to use when creating the database (default: AL32UTF8)
   -v /opt/oracle/oradata
                  The data volume to use for the database.
                  Has to be writable by the Unix "oracle" (uid: 54321) user inside the container!
                  If omitted the database will not be persisted over container recreation.
   -v /opt/oracle/scripts/startup | /docker-entrypoint-initdb.d/startup
                  Optional: A volume with custom scripts to be run after database startup.
                  For further details see the "Running scripts after setup and on startup" section below.
   -v /opt/oracle/scripts/setup | /docker-entrypoint-initdb.d/setup
                  Optional: A volume with custom scripts to be run after database setup.
                  For further details see the "Running scripts after setup and on startup" section below.

Now we create the instance (note that a password for sys, system and pdbadmin is generated here):

LoveHousedeiMac:dockerfiles lovehouse$ docker run --name oracle19c -p 1521:1521 -p 5500:5500 -v /Users/lovehouse/iDocker/dockervolums/oradata/oracle19c:/opt/oracle/oradata oracle/database:19.2.0-ee
ORACLE PASSWORD FOR SYS, SYSTEM AND PDBADMIN: L40uti33Ojk=1

LSNRCTL for Linux: Version 19.0.0.0.0 - Production on 16-FEB-2019 10:55:17

......

The Oracle base remains unchanged with value /opt/oracle
#########################
DATABASE IS READY TO USE!
#########################
The following output is now a tail of the alert.log:
ORCLPDB1(3):CREATE SMALLFILE TABLESPACE "USERS" LOGGING  DATAFILE  '/opt/oracle/oradata/ORCLCDB/ORCLPDB1/users01.dbf' SIZE 5M REUSE AUTOEXTEND ON NEXT  1280K MAXSIZE UNLIMITED  EXTENT MANAGEMENT LOCAL  SEGMENT SPACE MANAGEMENT  AUTO
ORCLPDB1(3):Completed: CREATE SMALLFILE TABLESPACE "USERS" LOGGING  DATAFILE  '/opt/oracle/oradata/ORCLCDB/ORCLPDB1/users01.dbf' SIZE 5M REUSE AUTOEXTEND ON NEXT  1280K MAXSIZE UNLIMITED  EXTENT MANAGEMENT LOCAL  SEGMENT SPACE MANAGEMENT  AUTO
ORCLPDB1(3):ALTER DATABASE DEFAULT TABLESPACE "USERS"
ORCLPDB1(3):Completed: ALTER DATABASE DEFAULT TABLESPACE "USERS"
2019-02-16T11:06:30.379489+00:00
ALTER SYSTEM SET control_files='/opt/oracle/oradata/ORCLCDB/control01.ctl' SCOPE=SPFILE;
2019-02-16T11:06:30.383959+00:00
ALTER SYSTEM SET local_listener='' SCOPE=BOTH;
   ALTER PLUGGABLE DATABASE ORCLPDB1 SAVE STATE
Completed:    ALTER PLUGGABLE DATABASE ORCLPDB1 SAVE STATE

Note: if the "DATABASE IS READY TO USE!" banner has appeared but the log then sits idle, you can restart the container from another window.
The full log is attached: run19c.log

Log in to the host or the database to work with it:

LoveHousedeiMac:19.2.0 lovehouse$  docker ps -a
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS                    PORTS                                            NAMES
39284f79172b        oracle/database:19.2.0-ee   "/bin/sh -c 'exec $O…"   26 minutes ago      Up 11 minutes (healthy)   0.0.0.0:1521->1521/tcp, 0.0.0.0:5500->5500/tcp   oracle19c
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ 
LoveHousedeiMac:19.2.0 lovehouse$ docker exec -it 39284f79172b /bin/bash
[oracle@39284f79172b ~]$ 
[oracle@39284f79172b ~]$ 
[oracle@39284f79172b admin]$ sqlplus sys/L40uti33Ojk=1@ORCLPDB1 as sysdba

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Feb 16 11:44:29 2019
Version 19.2.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.2.0.0.0

SQL>

LoveHousedeiMac:~ lovehouse$ docker ps   
CONTAINER ID        IMAGE                       COMMAND                  CREATED             STATUS                    PORTS                                            NAMES
39284f79172b        oracle/database:19.2.0-ee   "/bin/sh -c 'exec $O…"   About an hour ago   Up 39 minutes (healthy)   0.0.0.0:1521->1521/tcp, 0.0.0.0:5500->5500/tcp   oracle19c
LoveHousedeiMac:~ lovehouse$ 
LoveHousedeiMac:~ lovehouse$ docker run --rm -ti oracle/database:19.2.0-ee sqlplus pdbadmin/L40uti33Ojk=1@//172.17.0.2:1521/ORCLPDB1

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Feb 16 11:50:21 2019
Version 19.2.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

Last Successful login time: Sat Feb 16 2019 11:50:12 +00:00

Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.2.0.0.0

SQL>

One more thing: after logging in to the container, running sqlplus directly fails with ORA-12162, because the Docker image does not set ORACLE_SID; exporting it fixes the problem:

[oracle@39284f79172b admin]$ sqlplus "/ as sysdba"

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Feb 16 12:09:40 2019
Version 19.2.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.

ERROR:
ORA-12162: TNS:net service name is incorrectly specified


Enter user-name: 
ERROR:
ORA-12162: TNS:net service name is incorrectly specified


Enter user-name: 
ERROR:
ORA-12162: TNS:net service name is incorrectly specified


SP2-0157: unable to CONNECT to ORACLE after 3 attempts, exiting SQL*Plus
[oracle@39284f79172b admin]$ 
[oracle@39284f79172b admin]$ ps -ef |grep SID
[oracle@39284f79172b admin]$ ps -ef |grep ora_smon |grep -v grep
oracle      65     1  0 11:10 ?        00:00:00 ora_smon_ORCLCDB
[oracle@39284f79172b admin]$ 
[oracle@39284f79172b admin]$ export ORACLE_SID=ORCLCDB
[oracle@39284f79172b admin]$ 
[oracle@39284f79172b admin]$ 
[oracle@39284f79172b admin]$ sqlplus "/ as sysdba"

SQL*Plus: Release 19.0.0.0.0 - Production on Sat Feb 16 12:13:14 2019
Version 19.2.0.0.0

Copyright (c) 1982, 2018, Oracle.  All rights reserved.


Connected to:
Oracle Database 19c Enterprise Edition Release 19.0.0.0.0 - Production
Version 19.2.0.0.0

SQL>


A brief note on SCN headroom


A quick note on the SCN headroom issue I looked at a while ago.

1. The maximum SCN. An SCN is represented as SCN_WRAP.SCN_BASE, with a maximum of ffff.ffffffff, i.e. 65535.4294967295: whenever scn_base reaches ffffffff (4294967295), scn_wrap increments by one. The maximum value is therefore 65535*4294967295=281470681677825 (about 281 trillion).
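The wrap/base arithmetic in point 1 can be verified directly (nothing here beyond the numbers already stated above):

```python
# SCN layout: SCN_WRAP.SCN_BASE, a 16-bit wrap on top of a 32-bit base.
SCN_BASE_MAX = 2**32 - 1  # 0xFFFFFFFF = 4294967295
SCN_WRAP_MAX = 2**16 - 1  # 0xFFFF     = 65535

# Each time scn_base wraps past 0xFFFFFFFF, scn_wrap increments by one,
# so the largest representable SCN is:
print(SCN_WRAP_MAX * SCN_BASE_MAX)  # 281470681677825
```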

2. The headroom problem is not about reaching the absolute maximum SCN (281 trillion), but about approaching or reaching the current maximum reasonable SCN.

3. The current maximum reasonable SCN is defined with 1988-01-01 as the origin (value 1), growing at 16384 (16K) per second. That yields the ceiling on what the current SCN may be.

4. If the current SCN (dbms_flashback.get_system_change_number) reaches this ceiling, the database hangs, waiting for the next second to raise the limit.

5. The safety threshold: from the current SCN, growing at 16K per second, it should take more than 62 days to reach the current maximum reasonable SCN.
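The arithmetic in points 3–5 can be sketched as a small helper. The rate and epoch are as stated above; the function names are my own, for illustration only:

```python
from datetime import datetime, timezone

RATE = 16 * 1024  # SCNs per second: 16K here; 32K on 11.2.0.2 and later
EPOCH = datetime(1988, 1, 1, tzinfo=timezone.utc)  # origin of the "reasonable SCN" clock

def max_reasonable_scn(now, rate=RATE):
    """Ceiling on the SCN right now: seconds elapsed since 1988-01-01 times the rate."""
    return int((now - EPOCH).total_seconds()) * rate

def headroom_days(current_scn, now, rate=RATE):
    """Days until current_scn, growing at `rate` per second, hits the ceiling."""
    return (max_reasonable_scn(now, rate) - current_scn) / (rate * 86400.0)

# A database whose SCN is still near zero in mid-2019 has the full elapsed
# clock since 1988 (about 31.5 years) as headroom.
now = datetime(2019, 6, 23, tzinfo=timezone.utc)
print(max_reasonable_scn(now), round(headroom_days(0, now)))
```

A value below the 62-day safety threshold in point 5 would be the signal to investigate.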

Patches from 2012 onward (and newer releases) include three hidden parameters to guard against high SCNs propagating over dblinks:

_external_scn_logging_threshold_seconds -- default 86400 seconds (24 hours): if an external SCN causes a jump of more than 24 hours, a warning is written to the alert log.
_external_scn_rejection_delta_threshold_minutes -- reject external SCNs that change by more than this value per minute. Default 0: no rejection.
_external_scn_rejection_threshold_hours -- default 24 hours: reject external SCNs more than 24 hours ahead of the local maximum reasonable SCN.

Since 11.2.0.2, the default growth rate is 32K per second (_max_reasonable_scn_rate=32768) instead of 16K, which relieves the ceiling described in point 3 above.

The SCN compatibility setting:
On patched databases or 12c:

set serverout on
declare
 EFFECTIVE_AUTO_ROLLOVER_TS date;
 TARGET_COMPAT number;
 IS_ENABLED boolean;
 begin
  dbms_scn.GETSCNAUTOROLLOVERPARAMS(EFFECTIVE_AUTO_ROLLOVER_TS,TARGET_COMPAT,IS_ENABLED);
  dbms_output.put_line('EFFECTIVE_AUTO_ROLLOVER_TS='||to_char(EFFECTIVE_AUTO_ROLLOVER_TS,'yyyy-mm-dd hh24:mi:ss'));
  dbms_output.put_line('TARGET_COMPAT=' || TARGET_COMPAT);
 if(IS_ENABLED)then
  dbms_output.put_line('IS_ENABLED IS TRUE'); 
 else 
  dbms_output.put_line('IS_ENABLED IS FALSE'); 
 end if;
 end;
 /
 
EFFECTIVE_AUTO_ROLLOVER_TS=2019-06-23 00:00:00
TARGET_COMPAT=3
IS_ENABLED IS TRUE
 
PL/SQL procedure successfully completed
 
SQL>

You can run Alter Database Set SCN Compatibility 3; but lowering the compatibility requires a restart. Note that compatibility 3 corresponds to a 96K/sec rate.

References:
《SCN Head Room 原理全解析》
SCN Compatibility问题汇总-2019年6月23日

Implicit conversion checks


Implicit conversions are often performance killers. The two statements below find SQL currently in memory that used implicit conversions, for SQL Server and Oracle respectively:

  • SQL Server implicit conversions:
    DECLARE @dbname SYSNAME  
    SET @dbname = QUOTENAME(DB_NAME());  
    WITH XMLNAMESPACES(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')  
    SELECT stmt.value('(@StatementText)[1]', 'varchar(max)') AS SQL_Text ,  
             t.value('(ScalarOperator/Identifier/ColumnReference/@Schema)[1]', 'varchar(128)') AS SchemaName ,  
             t.value('(ScalarOperator/Identifier/ColumnReference/@Table)[1]', 'varchar(128)') AS TableName ,  
             t.value('(ScalarOperator/Identifier/ColumnReference/@Column)[1]', 'varchar(128)') AS ColumnName ,  
             ic.DATA_TYPE AS ConvertFrom ,  
             ic.CHARACTER_MAXIMUM_LENGTH AS ConvertFromLength ,  
             t.value('(@DataType)[1]', 'varchar(128)') AS ConvertTo ,  
             t.value('(@Length)[1]', 'int') AS ConvertToLength ,  
             query_plan  
    FROM sys.dm_exec_cached_plans AS cp  
    --FROM sys.dm_exec_query_stats qs
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp  
    CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple')AS batch ( stmt )  
    CROSS APPLY stmt.nodes('.//Convert[@Implicit="1"]') AS n ( t )  
    JOIN INFORMATION_SCHEMA.COLUMNS AS ic ON QUOTENAME(ic.TABLE_SCHEMA) = t.value('(ScalarOperator/Identifier/ColumnReference/@Schema)[1]', 'varchar(128)')  
        AND QUOTENAME(ic.TABLE_NAME) = t.value('(ScalarOperator/Identifier/ColumnReference/@Table)[1]','varchar(128)')  
        AND ic.COLUMN_NAME = t.value('(ScalarOperator/Identifier/ColumnReference/@Column)[1]','varchar(128)')  
    WHERE t.exist('ScalarOperator/Identifier/ColumnReference[@Database=sql:variable("@dbname")][@Schema!="[sys]"]') = 1 
    and ic.DATA_TYPE != t.value('(@DataType)[1]', 'varchar(128)')

  • Oracle implicit conversions:
    SELECT sql_id,plan_hash_value FROM v$sql_plan x WHERE  x.FILTER_PREDICATES LIKE '%INTERNAL_FUNCTION%'
    GROUP BY sql_id,plan_hash_value

    Monitoring database sequences


    We need to monitor database sequences and alert before they reach their maximum values. This matters especially for MySQL, where the column's datatype definition and the AUTO_INCREMENT definition often differ, giving each a different effective ceiling.

    Oracle:

    SELECT 
    x.*,
    CASE WHEN increment_by<0
     THEN round(last_number/min_value*100,4)
    WHEN increment_by>0
      THEN round(last_number/max_value*100,4)
     ELSE 0
      END 
    AS percent_usage
    from DBA_SEQUENCES x  WHERE cycle_flag='N'
    ORDER BY percent_usage DESC;
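    The CASE expression above reduces to a simple rule: measure usage against the bound the sequence is heading toward. A sketch (the helper name is mine, for illustration only):

```python
def sequence_usage_percent(last_number, min_value, max_value, increment_by):
    """Mirror of the CASE expression in the query: usage measured against the
    bound the sequence is moving toward (max if ascending, min if descending)."""
    if increment_by > 0:
        return round(last_number / max_value * 100, 4)
    if increment_by < 0:
        return round(last_number / min_value * 100, 4)
    return 0

# An ascending sequence halfway to a ceiling of 100 is at 50% usage.
print(sequence_usage_percent(50, 1, 100, 1))  # 50.0
```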

    SQL Server:

    -- sys.sequences stores its bounds as sql_variant, hence the CASTs.
    SELECT name,
           CAST(current_value AS bigint) AS current_value,
           CASE WHEN CAST(increment AS bigint) < 0
                THEN ROUND(CAST(current_value AS float) / CAST(minimum_value AS bigint) * 100, 4)
                WHEN CAST(increment AS bigint) > 0
                THEN ROUND(CAST(current_value AS float) / CAST(maximum_value AS bigint) * 100, 4)
                ELSE 0
           END AS percent_usage
    FROM sys.sequences
    WHERE is_cycling = 0
    ORDER BY percent_usage DESC;

    pg:

    --(1) Initial deployment:
    --(1.1) Create the table and function in the database (note: if one pg instance has multiple databases to monitor, deploy them in each database):
    drop table oracleblog_pg_sequence;
    create table oracleblog_pg_sequence
    (
     record_time timestamp with time zone,
     sequence_name TEXT   , 
     last_value    bigint , 
     start_value   bigint , 
     increment_by  bigint , 
     max_value     bigint , 
     min_value     bigint , 
     cache_value   bigint , 
     log_cnt       bigint , 
     is_cycled     boolean, 
     is_called     boolean,
     sequence_schema TEXT,
     table_name   text,
     column_name TEXT,
     data_type text,
     datatype_maxval bigint,
     datatype_minval bigint,
     cap_value   bigint 
    );
    
    
    DROP function oracleblog_get_seqval();
    CREATE OR REPLACE FUNCTION oracleblog_get_seqval() RETURNS void AS $sequence_values$
    DECLARE
       nsp_name TEXT;
       seq_name TEXT;
    BEGIN
       EXECUTE 'truncate table oracleblog_pg_sequence';
       FOR nsp_name, seq_name IN
           SELECT nspname::text, relname::text
              FROM pg_class 
              JOIN pg_namespace
              ON pg_class.relnamespace = pg_namespace.oid WHERE relkind='S'
       LOOP
           EXECUTE 'INSERT into oracleblog_pg_sequence(sequence_name,last_value,start_value,increment_by,max_value,min_value,cache_value,log_cnt,is_cycled,is_called,sequence_schema) SELECT x.*,'
                   || ''''
                   || nsp_name
                   || ''''
                   ||' from '
                   || nsp_name
                   ||'.'
                   ||seq_name
                   ||' x';
       END LOOP;
       
    update oracleblog_pg_sequence a
    set 
    table_name=(select table_name from INFORMATION_SCHEMA.COLUMNS b where SPLIT_PART(COLUMN_DEFAULT, '''', 2) = SEQUENCE_NAME and sequence_schema=table_schema),
    column_name=(select column_name from INFORMATION_SCHEMA.COLUMNS b where SPLIT_PART(COLUMN_DEFAULT, '''', 2) = SEQUENCE_NAME and sequence_schema=table_schema),
    data_type=(select data_type from INFORMATION_SCHEMA.COLUMNS b where SPLIT_PART(COLUMN_DEFAULT, '''', 2) = SEQUENCE_NAME and sequence_schema=table_schema),
    datatype_maxval=(select
    CASE lower(DATA_TYPE)
             when 'smallint'  then     32767
             when 'integer'   then     2147483647
             when 'serial'    then     2147483647
             when 'bigint'    then     9223372036854775807
             when 'bigserial' then     9223372036854775807
             else  null 
             end from INFORMATION_SCHEMA.COLUMNS b where SPLIT_PART(COLUMN_DEFAULT, '''', 2) = SEQUENCE_NAME and sequence_schema=table_schema),
    datatype_minval=(select
    CASE lower(DATA_TYPE)
             when 'smallint'   then      -32767
             when 'integer'    then      -2147483647
             when 'serial'     then      1
             when 'bigint'     then      -9223372036854775807
             when 'bigserial'  then      1
             else  null 
             end from INFORMATION_SCHEMA.COLUMNS b where SPLIT_PART(COLUMN_DEFAULT, '''', 2) = SEQUENCE_NAME and sequence_schema=table_schema); 
             
    update oracleblog_pg_sequence 
    set cap_value=(
    case when INCREMENT_BY < 0 then 
              (case  when min_value>=datatype_minval then min_value
                     when min_value<=datatype_minval then datatype_minval
               end)      
         when INCREMENT_BY > 0 then 
              (case  when max_value>=datatype_maxval then datatype_maxval
                     when max_value<=datatype_maxval then max_value
               end)         
    end),
    record_time=now()
    ;
              
    END;
    $sequence_values$
    LANGUAGE plpgsql;
    
    --(1.2) Without pgAgent, use crontab for the periodic refresh (note: for multiple databases in one pg instance, set up one crontab entry per database, varying the -d argument):
    cat /data001/PRD/postgres/9.6.2/home/postgres/crontab_script/fresh_oracleblog_pg_sequence.sh
    #!/bin/bash
    /data/PRD/postgres/base/9.6.2/bin/psql -d dbinfo -c "select oracleblog_get_seqval()"
    chmod +x /data001/PRD/postgres/9.6.2/home/postgres/crontab_script/fresh_oracleblog_pg_sequence.sh
    
    crontab -l
    01 * * * *  /bin/bash /data001/PRD/postgres/9.6.2/home/postgres/crontab_script/fresh_oracleblog_pg_sequence.sh
    
    
    --(2) Grant access to the zabbix user:
    grant select on oracleblog_pg_sequence to zabbix;
    
    --(3) Query the results:
    --(3.1) The monitoring query:
    select 
    max(round((1-(cap_value-last_value)::numeric/(cap_value-start_value)::numeric)*100,4)) as max_usage_percent
    from oracleblog_pg_sequence where is_cycled='f';
    
    
    --(3.2) For routine operational checks:
    select 
    SEQUENCE_NAME,
    last_value,
    start_value,
    increment_by,
    cap_value,
    round((1-(cap_value-last_value)::numeric/(cap_value-start_value)::numeric)*100,4) as usage_percent
    from oracleblog_pg_sequence where  is_cycled='f' order by usage_percent desc;
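    The cap_value UPDATE in the function above implements a simple clamp: the sequence's effective bound is the tighter of its own declared limit and the backing column's datatype range. In Python (the range table and helper name are mine, mirroring the function's CASE lists):

```python
# Datatype bounds mirroring the CASE lists in the PL/pgSQL function above.
PG_INT_RANGES = {
    "smallint": (-32767, 32767),
    "integer": (-2147483647, 2147483647),
    "serial": (1, 2147483647),
    "bigint": (-9223372036854775807, 9223372036854775807),
    "bigserial": (1, 9223372036854775807),
}

def cap_value(increment_by, min_value, max_value, data_type):
    """Effective bound the sequence can actually reach: the tighter of the
    sequence's own declared limit and the backing column's datatype limit."""
    dt_min, dt_max = PG_INT_RANGES[data_type]
    if increment_by < 0:
        return max(min_value, dt_min)  # descending: the higher floor wins
    return min(max_value, dt_max)      # ascending: the lower ceiling wins

# A default bigint-bounded sequence feeding an integer column overflows at
# the column's 2**31-1, not at the sequence's own max_value.
print(cap_value(1, 1, 9223372036854775807, "integer"))  # 2147483647
```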

    MySQL:

    CREATE DEFINER=`root`@`localhost` PROCEDURE `proc_awr_getauto_increment_status`()
    BEGIN
        TRUNCATE TABLE myawr.auto_increment_status;
        INSERT INTO myawr.auto_increment_status (clock, table_schema, table_name,auto_increment_increment,auto_increment_offset,auto_increment_max,auto_increment_used)
        SELECT now() AS clock, b.table_schema, b.table_name
            ,(select VARIABLE_VALUE from performance_schema.global_variables where VARIABLE_NAME = 'auto_increment_increment') as auto_increment_increment
            ,(select VARIABLE_VALUE from performance_schema.global_variables where VARIABLE_NAME = 'auto_increment_offset') as auto_increment_offset
            , CASE 
                WHEN COLUMN_TYPE LIKE 'bigint%' THEN 9223372036854775807
                WHEN COLUMN_TYPE LIKE 'int(%)' THEN 2147483647
                WHEN COLUMN_TYPE LIKE 'int(%) unsigned' THEN 4294967295
                WHEN COLUMN_TYPE LIKE 'MEDIUMINT(%)' THEN 8388607
                WHEN COLUMN_TYPE LIKE 'MEDIUMINT(%) unsigned' THEN 16777215
                WHEN COLUMN_TYPE LIKE 'SMALLINT(%)' THEN 32767
                WHEN COLUMN_TYPE LIKE 'SMALLINT(%) unsigned' THEN 65535
                WHEN COLUMN_TYPE LIKE 'TINYINT(%)' THEN 127
                WHEN COLUMN_TYPE LIKE 'TINYINT(%) unsigned' THEN 255
                ELSE 'other'
            END AS auto_increment_max
            , CASE 
                WHEN COLUMN_TYPE LIKE 'bigint%' THEN format(b.auto_increment / 9223372036854775807 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'int(%)' THEN format(b.auto_increment / 2147483647 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'int(%) unsigned' THEN format(b.auto_increment / 4294967295 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'MEDIUMINT(%)' THEN format(b.auto_increment / 8388607 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'MEDIUMINT(%) unsigned' THEN format(b.auto_increment / 16777215 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'SMALLINT(%)' THEN format(b.auto_increment / 32767 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'SMALLINT(%) unsigned' THEN format(b.auto_increment / 65535 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'TINYINT(%)' THEN format(b.auto_increment / 127 * 100, 0)
                WHEN COLUMN_TYPE LIKE 'TINYINT(%) unsigned' THEN format(b.auto_increment / 255 * 100, 0)
                ELSE '100'
            END AS auto_increment_used
        FROM information_schema.columns a
            JOIN information_schema.tables b
            ON a.table_name = b.table_name
                AND a.table_schema = b.table_schema
        WHERE EXTRA = 'auto_increment'
        ORDER BY auto_increment_used + 0 DESC
        LIMIT 10;
    END;
    
    CREATE DEFINER=`root`@`localhost` EVENT `event_awr_getauto_increment_status` ON SCHEDULE EVERY 1 HOUR STARTS '2019-04-17 11:45:34' ON COMPLETION PRESERVE ENABLE DO call myawr.proc_awr_getauto_increment_status()
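    The procedure's two CASE expressions boil down to a type-to-ceiling lookup plus a percentage. A sketch with hypothetical helper names (note the signed BIGINT ceiling is 2**63 - 1):

```python
# Integer ceilings used by the procedure's CASE expressions above.
MYSQL_INT_MAX = {
    ("tinyint", False): 127,        ("tinyint", True): 255,
    ("smallint", False): 32767,     ("smallint", True): 65535,
    ("mediumint", False): 8388607,  ("mediumint", True): 16777215,
    ("int", False): 2147483647,     ("int", True): 4294967295,
    ("bigint", False): 2**63 - 1,   ("bigint", True): 2**64 - 1,
}

def auto_increment_used(next_value, base_type, unsigned=False):
    """Percent of the AUTO_INCREMENT key space already consumed."""
    return next_value / MYSQL_INT_MAX[(base_type, unsigned)] * 100

# A signed INT AUTO_INCREMENT column that has reached 2 billion is ~93% used.
print(round(auto_increment_used(2_000_000_000, "int")))  # 93
```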

    Analyzing MySQL "Waiting for table metadata lock"


    To work through a "Waiting for table metadata lock" situation, you need to:

  • 1. Have the relevant performance_schema (PS for short) instrument enabled ahead of time.
  • 2. Query PS.metadata_locks and find the rows whose LOCK_STATUS is PENDING, along with the GRANTED rows on the same object.
  • 3. Query PS.threads, joining on the OWNER_THREAD_ID from PS.metadata_locks, to obtain the PROCESSLIST_ID; the PROCESSLIST_ID of the GRANTED row is the metadata lock holder as seen in show processlist.
  • Note that by default the instrument is not enabled:

    mysql> select * from performance_schema.setup_instruments WHERE NAME = 'wait/lock/metadata/sql/mdl';
    +----------------------------+---------+-------+
    | NAME                       | ENABLED | TIMED |
    +----------------------------+---------+-------+
    | wait/lock/metadata/sql/mdl | NO      | NO    |
    +----------------------------+---------+-------+
    1 row in set (0.00 sec)
    
    mysql>

    The instrument can be enabled like this (note that an UPDATE on setup_instruments does not survive a restart; to make it permanent, set performance-schema-instrument='wait/lock/metadata/sql/mdl=ON' in my.cnf):

    mysql> UPDATE performance_schema.setup_instruments SET ENABLED = 'YES' WHERE NAME = 'wait/lock/metadata/sql/mdl';
    Query OK, 1 row affected (0.00 sec)
    Rows matched: 1  Changed: 1  Warnings: 0
    
    mysql>  
    mysql> select * from performance_schema.setup_instruments WHERE NAME = 'wait/lock/metadata/sql/mdl';
    +----------------------------+---------+-------+
    | NAME                       | ENABLED | TIMED |
    +----------------------------+---------+-------+
    | wait/lock/metadata/sql/mdl | YES     | NO    |
    +----------------------------+---------+-------+
    1 row in set (0.00 sec)
    
    mysql>

    Let's test how to use the PS instrument to find the metadata lock holder:

    session 1:

    mysql> begin;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> delete from orasup_test1;
    Query OK, 131072 rows affected (0.48 sec)
    
    mysql>

    session 3:

    mysql> begin;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql> drop table orasup_test1
    --hang

    session 2: (at this point, innodb_lock_waits and related views show nothing, because innodb_lock_waits covers row locks while a metadata lock is a table-level lock.)

    mysql> select * from sys.innodb_lock_waits \G;
    Empty set (0.00 sec)
    
    mysql> 
    mysql> select * from information_schema.INNODB_LOCK_WAITS;
    Empty set (0.00 sec)
    
    mysql> select * from information_schema.INNODB_LOCKS;
    Empty set (0.00 sec)
    
    mysql> 
    mysql>

    session 2: (this time, query PS.metadata_locks)

    mysql> select * from performance_schema.metadata_locks where OBJECT_NAME='orasup_test1' \G;
    *************************** 1. row ***************************
              OBJECT_TYPE: TABLE
            OBJECT_SCHEMA: myawr
              OBJECT_NAME: orasup_test1
    OBJECT_INSTANCE_BEGIN: 140695273259776
                LOCK_TYPE: EXCLUSIVE
            LOCK_DURATION: TRANSACTION
              LOCK_STATUS: PENDING
                   SOURCE: sql_parse.cc:5776
          OWNER_THREAD_ID: 525483
           OWNER_EVENT_ID: 21
    *************************** 2. row ***************************
              OBJECT_TYPE: TABLE
            OBJECT_SCHEMA: myawr
              OBJECT_NAME: orasup_test1
    OBJECT_INSTANCE_BEGIN: 140694337352464
                LOCK_TYPE: SHARED_WRITE
            LOCK_DURATION: TRANSACTION
              LOCK_STATUS: GRANTED
                   SOURCE: sql_parse.cc:5776
          OWNER_THREAD_ID: 523569
           OWNER_EVENT_ID: 194
    2 rows in set (0.00 sec)
    
    ERROR: 
    No query specified
    
    mysql> 
    mysql> 
    mysql> 
    mysql> 
    mysql> show processlist;
    +--------+-----------------+-------------------+----------+-------------+---------+---------------------------------------------------------------+-------------------------+
    | Id     | User            | Host              | db       | Command     | Time    | State                                                         | Info                    |
    +--------+-----------------+-------------------+----------+-------------+---------+---------------------------------------------------------------+-------------------------+
    |      1 | event_scheduler | localhost         | NULL     | Daemon      |       6 | Waiting for next activation                                   | NULL                    |
    |  28368 | dbsync          | 10.10.1.75:13766  | NULL     | Binlog Dump | 1330136 | Master has sent all binlog to slave; waiting for more updates | NULL                    |
    | 521693 | itsm            | 10.10.2.102:45586 | dji_itsm | Sleep       |     284 |                                                               | NULL                    |
    | 522968 | itsm            | 10.10.1.175:55165 | dji_itsm | Sleep       |     166 |                                                               | NULL                    |
    | 523305 | root            | localhost         | myawr    | Sleep       |    5393 |                                                               | NULL                    |
    | 523542 | root            | localhost         | myawr    | Sleep       |      35 |                                                               | NULL                    |
    | 523950 | itsm            | 10.10.2.102:46490 | dji_itsm | Sleep       |     168 |                                                               | NULL                    |
    | 524308 | itsm            | 10.10.1.175:55619 | dji_itsm | Sleep       |     165 |                                                               | NULL                    |
    | 524531 | itsm            | 10.10.2.102:46685 | dji_itsm | Sleep       |      16 |                                                               | NULL                    |
    | 524718 | itsm            | 10.10.1.175:55755 | dji_itsm | Sleep       |      26 |                                                               | NULL                    |
    | 524858 | itsm            | 10.10.1.175:55797 | dji_itsm | Sleep       |       0 |                                                               | NULL                    |
    | 524873 | itsm            | 10.10.1.175:55806 | dji_itsm | Sleep       |     143 |                                                               | NULL                    |
    | 525070 | itsm            | 10.10.2.102:46883 | dji_itsm | Sleep       |      16 |                                                               | NULL                    |
    | 525309 | itsm            | 10.10.2.102:46984 | dji_itsm | Sleep       |       1 |                                                               | NULL                    |
    | 525456 | root            | localhost         | myawr    | Query       |      21 | Waiting for table metadata lock                               | drop table orasup_test1 |
    | 525496 | itsm            | 10.10.2.102:47050 | dji_itsm | Sleep       |     174 |                                                               | NULL                    |
    | 525497 | itsm            | 10.10.2.102:47051 | dji_itsm | Sleep       |       0 |                                                               | NULL                    |
    | 525498 | itsm            | 10.10.2.102:47052 | dji_itsm | Sleep       |     301 |                                                               | NULL                    |
    | 525522 | itsm            | 10.10.1.175:56005 | dji_itsm | Sleep       |       1 |                                                               | NULL                    |
    | 525545 | root            | localhost         | NULL     | Query       |       0 | starting                                                      | show processlist        |
    +--------+-----------------+-------------------+----------+-------------+---------+---------------------------------------------------------------+-------------------------+
    20 rows in set (0.00 sec)
    
    mysql> 
    mysql> 
    mysql> select * from performance_schema.threads where thread_id='523569'\G;
    *************************** 1. row ***************************
              THREAD_ID: 523569
                   NAME: thread/sql/one_connection
                   TYPE: FOREGROUND
         PROCESSLIST_ID: 523542
       PROCESSLIST_USER: root
       PROCESSLIST_HOST: localhost
         PROCESSLIST_DB: myawr
    PROCESSLIST_COMMAND: Sleep
       PROCESSLIST_TIME: 62
      PROCESSLIST_STATE: NULL
       PROCESSLIST_INFO: NULL
       PARENT_THREAD_ID: NULL
                   ROLE: NULL
           INSTRUMENTED: YES
                HISTORY: YES
        CONNECTION_TYPE: Socket
           THREAD_OS_ID: 17276
    1 row in set (0.00 sec)
    
    ERROR: 
    No query specified
    
    mysql> kill 523542;
    Query OK, 0 rows affected (0.00 sec)
    
    mysql>

    session 3:

    mysql> drop table orasup_test1;
    Query OK, 0 rows affected (1 min 3.79 sec)
    
    mysql>

    Alibaba Cloud RDS support for the MyISAM engine in MySQL


    Recently I was working on a cross-account database migration on Alibaba Cloud. The database backs a forum built on the Discuz application, and it contains a MyISAM table that records posts and their floor (reply) numbers.

    mysql> show create table abc_ddxid_ppiy_gdsitnpl \G;
    *************************** 1. row ***************************
           Table: abc_ddxid_ppiy_gdsitnpl
    Create Table: CREATE TABLE `abc_ddxid_ppiy_gdsitnpl` (
      `pid` int(10) unsigned NOT NULL,
      `fid` mediumint(8) unsigned NOT NULL DEFAULT '0',
      `tid` mediumint(8) unsigned NOT NULL DEFAULT '0',
      `first` tinyint(1) NOT NULL DEFAULT '0',
      `floorid` int(8) unsigned NOT NULL AUTO_INCREMENT,
      PRIMARY KEY (`tid`,`floorid`),
      UNIQUE KEY `pid` (`pid`),
      KEY `fid` (`fid`),
      KEY `first` (`tid`,`first`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED

    Here tid is the post ID and floorid is the floor number. The table relies on a MyISAM composite primary key whose second column is the AUTO_INCREMENT column, which yields a per-group auto-increment value.

    See the MySQL reference manual: 3.6.9 Using AUTO_INCREMENT.

    The different behavior of per-group auto-increment between InnoDB and MyISAM can be demonstrated with the following test:

    mysql> show variables like '%engine%';
    +----------------------------+--------+
    | Variable_name              | Value  |
    +----------------------------+--------+
    | default_storage_engine     | InnoDB |
    | default_tmp_storage_engine | InnoDB |
    | storage_engine             | InnoDB |
    +----------------------------+--------+
    3 rows in set (0.00 sec)
    
    mysql> 
    mysql> 
    mysql> CREATE TABLE myisam_animals (
        ->     grp ENUM('fish','mammal','bird') NOT NULL,
        ->     id MEDIUMINT NOT NULL AUTO_INCREMENT,
        ->     name CHAR(30) NOT NULL,
        ->     PRIMARY KEY (grp,id)
        -> ) ENGINE=MyISAM;
    Query OK, 0 rows affected, 2 warnings (0.00 sec)
    
    mysql> 
    mysql> 
    mysql> INSERT INTO myisam_animals (grp,name) VALUES
        ->     ('mammal','dog'),('mammal','cat'),
        ->     ('bird','penguin'),('fish','lax'),('mammal','whale'),
        ->     ('bird','ostrich');
    Query OK, 6 rows affected (0.01 sec)
    Records: 6  Duplicates: 0  Warnings: 0
    
    mysql> 
    mysql> 
    mysql> SELECT * FROM myisam_animals ORDER BY grp,id;
    +--------+----+---------+
    | grp    | id | name    |
    +--------+----+---------+
    | fish   |  1 | lax     |
    | mammal |  1 | dog     |
    | mammal |  2 | cat     |
    | mammal |  3 | whale   |
    | bird   |  1 | penguin |
    | bird   |  2 | ostrich |
    +--------+----+---------+
    6 rows in set (0.01 sec)
    
    mysql> 
    mysql> 
    mysql> 
    mysql> 
    mysql> 
    mysql> 
    mysql> 
    mysql> CREATE TABLE innodb_animals (
        ->     grp ENUM('fish','mammal','bird') NOT NULL,
        ->     id MEDIUMINT NOT NULL AUTO_INCREMENT,
        ->     name CHAR(30) NOT NULL,
        ->     PRIMARY KEY (id,grp)
        -> ) engine=innodb;
    Query OK, 0 rows affected (0.01 sec)
    
    mysql> 
    mysql> INSERT INTO innodb_animals (grp,name) VALUES
        ->     ('mammal','dog'),('mammal','cat'),
        ->     ('bird','penguin'),('fish','lax'),('mammal','whale'),
        ->     ('bird','ostrich');
    Query OK, 6 rows affected (0.01 sec)
    Records: 6  Duplicates: 0  Warnings: 0
    
    mysql> 
    mysql> 
    mysql> SELECT * FROM innodb_animals ORDER BY grp,id;
    +--------+----+---------+
    | grp    | id | name    |
    +--------+----+---------+
    | fish   |  4 | lax     |
    | mammal |  1 | dog     |
    | mammal |  2 | cat     |
    | mammal |  5 | whale   |
    | bird   |  3 | penguin |
    | bird   |  6 | ostrich |
    +--------+----+---------+
    6 rows in set (0.00 sec)
    
    mysql>
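    The difference comes down to where the counter lives: with a composite primary key, MyISAM maintains one AUTO_INCREMENT counter per value of the key prefix, while InnoDB maintains a single table-wide counter. A minimal Python sketch of the two allocation strategies (an illustration of the observable behavior above, not of MySQL internals):

```python
from collections import defaultdict

def myisam_ids(rows):
    # MyISAM: one counter per value of the leading primary-key column,
    # so each group gets its own 1, 2, 3, ... sequence.
    counters = defaultdict(int)
    out = []
    for grp, name in rows:
        counters[grp] += 1
        out.append((grp, counters[grp], name))
    return out

def innodb_ids(rows):
    # InnoDB: a single table-wide counter, regardless of group.
    counter = 0
    out = []
    for grp, name in rows:
        counter += 1
        out.append((grp, counter, name))
    return out

animals = [("mammal", "dog"), ("mammal", "cat"), ("bird", "penguin"),
           ("fish", "lax"), ("mammal", "whale"), ("bird", "ostrich")]

print(myisam_ids(animals))
print(innodb_ids(animals))
```

    Running it reproduces the two SELECT results shown above: the MyISAM strategy numbers each group independently, while the InnoDB strategy hands out 1 through 6 in insertion order.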

    Alibaba Cloud RDS support for the MyISAM engine in MySQL follows these rules:

    (1) MySQL 5.6
    a) CREATE TABLE ... ENGINE=MyISAM with a per-group auto-increment column: no conversion is performed.
    b) CREATE TABLE ... ENGINE=MyISAM without a per-group auto-increment column: the table is converted internally to InnoDB.

    Note: after migrating the whole database with DTS, tables keep their engines on the target side: InnoDB tables stay InnoDB, and MyISAM tables that use a per-group auto-increment column stay MyISAM. If a table was declared MyISAM at creation time but did not use a per-group auto-increment column, it was already converted to InnoDB when it was created, so after the DTS migration it is InnoDB as well.

    (2) MySQL 5.7 and later
    MyISAM is no longer supported: CREATE TABLE ... ENGINE=MyISAM raises an error, so such tables can no longer be created.
    If the table was created with InnoDB, whose auto-increment is not per-group, and per-group auto-increment behavior is still needed, the query can be rewritten as follows:

    mysql> SELECT (@I := CASE
        ->           WHEN @GRP = GRP THEN
        ->            @I + 1
        ->           ELSE
        ->            1
        ->         END) ROWNUM,
        ->        innodb_animals.NAME,
        ->        (@GRP := GRP)
        ->   FROM innodb_animals, (SELECT @I := 0, @GRP := '') AS A
        ->  GROUP BY GRP, ID;
    +--------+---------+---------------+
    | ROWNUM | NAME    | (@GRP := GRP) |
    +--------+---------+---------------+
    |      1 | lax     | fish          |
    |      1 | dog     | mammal        |
    |      2 | cat     | mammal        |
    |      3 | whale   | mammal        |
    |      1 | penguin | bird          |
    |      2 | ostrich | bird          |
    +--------+---------+---------------+
    6 rows in set (0.00 sec)
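    Since MySQL 8.0, window functions make the user-variable trick unnecessary: ROW_NUMBER() OVER (PARTITION BY grp ORDER BY id) computes the same per-group row number. The sketch below runs the same SQL pattern against Python's built-in sqlite3 (which also supports window functions) so it can be tried without a MySQL server; note that SQLite orders the grp strings alphabetically rather than by MySQL's ENUM member order.

```python
import sqlite3

# ROW_NUMBER() OVER (PARTITION BY ...) is the modern replacement for the
# @I/@GRP user-variable trick. Demonstrated on sqlite3 for convenience;
# the same SELECT works on MySQL 8.0+.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE innodb_animals (grp TEXT, id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO innodb_animals VALUES (?, ?, ?)",
    [("mammal", 1, "dog"), ("mammal", 2, "cat"), ("bird", 3, "penguin"),
     ("fish", 4, "lax"), ("mammal", 5, "whale"), ("bird", 6, "ostrich")])

rows = conn.execute("""
    SELECT ROW_NUMBER() OVER (PARTITION BY grp ORDER BY id) AS rownum,
           name, grp
      FROM innodb_animals
     ORDER BY grp, id
""").fetchall()
for rownum, name, grp in rows:
    print(rownum, name, grp)
```

    Each group is numbered independently starting from 1, matching the ROWNUM column produced by the user-variable query above.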
