DataGuard

DataGuard

Data Guard Cheatsheet

Terminology
Primary databaseA production database
Standby databaseA database that can become the primary database, should the primary fail
EOREnd Of Redo
LWGRLog Writer process
LNSLog Network Server
ORLOnline Redo Log
RFSRemote File Server
SRLStandby Redo Log file
SYNC and ASYNCSynchronous and Asynchronous
Log Files
DG alert Logdrc<db_unique_name>.log
Alert Logalert_<SID>.log
Logfile locations# change the instance name to reflect the one you have choosen and the path you installed oracle

prod1 (alert log): /u01/app/oracle/diag/rdbms/prod1/PROD1/trace/alert_PROD1.log
prod1 (DG log): /u01/app/oracle/diag/rdbms/prod1/PROD1/trace/drcPROD1.logprod1dr (alert log): /u01/app/oracle/diag/rdbms/prod1dr/PROD1DR/trace/alert_PROD1DR.log
prod1dr (DG log): /u01/app/oracle/diag/rdbms/prod1dr/PROD1DR/trace/drcPROD1DR.log
identify log files## You can get the log locations from the below view

col name for a25
col value for a65;
select name, value from v$diag_info;
Data Guard Broker
Create base configuration# Primary Database server

DGMGRL> create configuration prod1 as
> primary database is prod1
> connect identifier is prod1;

Configuration "prod1" created with primary database "prod1"
Add the standby database
# Primary Database server - if you have setup db_unique_name, tnsname and log_archive_dest_n

DGMGRL> add database prod1dr;

# Primary Database server - the full command set

DGMGRL> connect sys/password
DGMGRL> add database prod1dr

> as connect identifier is prod1dr
> maintained as physical;
Database "prod1dr" added
Display configurationDGMGRL> show configuration
Display DatabaseDGMGRL> show database verbose prod1
Enabling the configuration# Primary Database server

DGMGRL> enable configuration
Enabled.
Edit configurationEDIT CONFIGURATION SET PROPERTY <name>=<value>
EDIT DATABASE <db_name> SET PROPERTY <name>=<value>
EDIT INSTANCE <in_name> SET PROPERTY <name>=value>
There are many options see the broker section for more information
Troubleshooting (Monitoring commands and log files)
configurationDGMGRL> show configuration;
databaseDGMGRL> show database prod1;
DGMGRL> show database prod1dr;

# There are a number of specific information commands, here are the most used
DGMGRL> show database prod1 statusreport;
DGMGRL> show database prod1 inconsistentProperties;
DGMGRL> show database prod1 inconsistentlogxptProps;
DGMGRL> show database prod1 logxptstatus;
DGMGRL> show database prod1 latestlog;
Logfiles# change the instance name to reflect the one you have choosen

prod1 (alert log): /u01/app/oracle/diag/rdbms/prod1/PROD1/trace/alert_PROD1.log
prod1 (DG log): /u01/app/oracle/diag/rdbms/prod1/PROD1/trace/drcPROD1.logprod1dr (alert log): /u01/app/oracle/diag/rdbms/prod1dr/PROD1DR/trace/alert_PROD1DR.log
prod1dr (DG log): /u01/app/oracle/diag/rdbms/prod1dr/PROD1DR/trace/drcPROD1DR.log
There are a number of commands that you can use to change the state of the database
turn off/on the redo transport service for all standby databases
Primary
DGMGRL> edit database prod1 set state=transport-off;

DGMGRL> edit database prod1 set state=transport-on;
turn off/on the apply state
Standby
DGMGRL> edit database prod1dr set state=apply-off;

DGMGRL> edit database prod1dr set state=apply-on;
put a database into a real-time query mode
Standby
DGMGRL> edit database prod1dr set state=apply-off;
sql> alter database open read only;
DGMGRL> edit database prod1dr set state=apply-on;
change the protection mode
Primary
# Choose what level of protection you require
sql> alter database set standby to maximize performance;
sql> alter database set standby to maximize availability;
sql> alter database set standby to maximize protection;
# display the configuration
DGMGRL> show configuration 
Redo Processing
Redo Processes (Primary and Standby Databases)
ProcessesThere are a number of Oracle background processes that play a key role, first the primary database
  • LGWR - log writer process flushes from the SGA to the ORL files
  • LNS - LogWriter Network Service reads redo being flushed from the redo buffers by the LGWR and performs a network send of the redo to the standby
  • ARCH - archives the ORL files to archive logs, that also used to fulfill gap resolution requests, one ARCH processes is dedicated to local redo log activity only and never communicates with a standby database
The standby database will also have key processes
  • RFS - Remote File Server process performs a network receive of redo transmitted from the primary and writes the network redo to the standby redo log (SRL) files.
  • ARCH - performs the same as the primary but on the standby
  • MRP - Managed Recover Process coordinates media recovery management, recall that a physical standby is in perpetual recovery mode
  • LSP - Logical Standby Process coordinates SQL apply, this process only runs in a logical standby
  • PR0x - recovery server process reads redo from the SRL or archive log files and apply this redo to the standby database.
Real-Time Apply
Enable real-time applysql> alter database recover managed standby database using current logfile disconnect;
Determine if real-time apply is enabledsql> select recovery_mode from v$archive_dest_status where dest_id = 2;

RECOVERY_MODE
--------------------------
MANAGED REAL-TIME APPLY
Tools and views to monitor redo
Background processesselect process, client_process, thread#, sequence#, status from v$managed_standby;
## primary (example)

PROCESS   CLIENT_P THREAD#    SEQUENCE#  STATUS
--------- -------- ---------- ---------- ------------
ARCH      ARCH     1          58         CLOSING
ARCH      ARCH     0          0          CONNECTED
ARCH      ARCH     1          59         CLOSING
ARCH      ARCH     1          56         CLOSING
LNS       LNS      1          60         WRITING
LNS       LNS      1          60         WRITING

## physical standby
 (example)


PROCESS   CLIENT_P THREAD#    SEQUENCE#  STATUS
--------- -------- ---------- ---------- ------------
ARCH      ARCH     0          0          CONNECTED
ARCH      ARCH     1          55         CLOSING
ARCH      ARCH     0          0          CONNECTED
ARCH      ARCH     1          59         CLOSING
RFS       N/A      0          0          IDLE
RFS       UNKNOWN  0          0          IDLE
RFS       UNKNOWN  0          0          IDLE
RFS       LGWR     1          60         IDLE
MRP0      N/A      1          60         APPLYING_LOG

## Logical standby (example)

PROCESS   CLIENT_P THREAD#    SEQUENCE#  STATUS
--------- -------- ---------- ---------- ------------
ARCH      ARCH     1          55         CLOSING
ARCH      ARCH     1          10         CLOSING
ARCH      ARCH     0          0          CONNECTED
ARCH      ARCH     0          0          CONNECTED
RFS       UNKNOWN  0          0          IDLE
RFS       LGWR     1          60         IDLE
RFS       UNKNOWN  0          0          IDLE
RFS       UNKNOWN  0          0          IDLE
Information on Redo Dataselect * from v$dataguard_stats;

Note: this indirectly shows how much redo data could be lost if the primary db crashes
Redo apply rateselect to_char(snapshot_time, 'dd-mon-rr hh24:mi:ss') snapshot_time,
       thread#, sequence#, applied_scn, apply_rate
       from v$standby_apply_snapshot;
Note: this command can only run when the database is open
Recovery operationsselect to_char(start_time, 'dd-mon-rr hh24:mi:ss') start_time,
       item, round(sofar/1024,2) "MB/Sec"
       from v$recovery_progress
       where (item='Active Apply Rate' or item='Average Apply Rate');
Logical Standby
schema that are not maintained by SQL applyselect owner from dba_logstdby_skip where statement_opt = 'INTERNAL SCHEMA' order by owner;
Note: system and sys schema are not replicated so don't go creating tables in these schemas, the above command should return about 17 schemas (Oracle 11g) that are replicated.
Check tables with unsupported data typesselect distinct owner, table_name from dba_logstdby_unsupported;
select owner, table_name from logstdby_unsupported_tables;
skip replication of tables## Syntax

dbms_logstdby.skip (
  stmt in varchar2,
  schema_name in varchar2 default null,
  object_name in varchar2 default null,
  proc_name in varchar2 default null,
  use_like in boolean default true,
  esc in char1 default null
);

## Examples
execute dbms_logstdby.skip(stmt => 'DML', schema_name => 'HR', object_name => 'EMPLOYEE');
execute dbms_logstdby.skip(stmt => 'SCHEMA_DDL', schema_name => 'HR', object_name => 'EMPLOYEE');

# skip all DML operations
execute dbms_logstdby.skip(stmt => 'DML', schema_name => 'HR', object_name => '%');
revoke a skipped tablestop SQL apply
execute dbms_logstdby.instantiate_table(schema_name => 'HR', table_name => 'EMPLOYEE', DBLINK => 'INSTANTIATE_TABLE_LINK');
execute dbms_logstdby.skip(stmt => 'DML', schema_name => 'HR', object_name => 'EMPLOYEE');
start SQL apply
Note: the dblink should point to the primary database, we have to stop SQL apply as the instantiate table procedure uses Oracle's data pump network interface to lock the source table to obtain the SCN at the primary database, it then releases the lock and gets a consistent snapshot of the table from the primary database, it remembers the SCN associated with the consistent snapshot.
display what tables are being skippedselect owner, name, use_like, esc from dba_logstdby_skip where statement_opt = 'DML';
setting the guard on a databasealter database guard standby;
Inside SQL Apply
List the above processesselect * from v$logstdby_process
Increase the LCR cache size# Set the cache size to 200MB
execute dbms_logstdby.apply_set('MAX_SGA', 200);
How much LCR cache is being usedselect used_memory_size from v$logmnr_session where session_id = (select value from v$logstdby_stats where name = 'SESSION_ID');
setting SQL apply mode for the applicationexecute dbms_logstdby.apply_set (name => 'PRESERVE_COMMIT_ORDER', value => FALSE);
Determine the number of DDL statements since the last restartselect name, value from v$logstdby_stats where name = 'DDL TXNS DELIVERED';

NAME                  VALUE
------------------------------------------------------------------------
DDL TXNS DELIVERED    510
displaying the barrierselect status_code as sc, status from v$logstdby_process where type = 'BUILDER';

sc     status
-------------------------------------------------------------------------------------
44604  BARRIER SYNCHRONIZATION ON DDL WITH XID 1.15.256 (WAITING ON 17 TRANSACTIONS)
Tuning SQL Apply
MAX_SERVERS# Set the MAX_SERVERS to 8 x the number of cores
execute dbms_logstdby.apply_set ('MAX_SERVERS', 64);
MAX_SGA# Set the MAX_SGA to 200MB
execute dbms_logstdby.apply_set ('MAX_SGA', 200);
_HASH_TABLE_SIZE# Set the Hash table size to 10 million
execute dbms_logstdby.apply_set ('_HASH_TABLE_SIZE', 10000000);
DDLdefer DDLs to off-peak hours
Preserve commit order# Set the PERSERVE_COMMIT_ORDER to false
execute dbms_logstdby.apply_set (name => 'PRESERVE_COMMIT_ORDER', value => FALSE);
lagging SQL Apply# apply lag: indicates how current the replicated data at the logical standby is
# transport lag: indicates how much redo data that has already been generated is missing at the logical #                standby in term of redo records

select name, value, unit from v$dataguard_stats;
SQL Apply component bottleneckselect name, value from v$logstdby_stats where name like 'TRASNACTIONS%';

Name                              Value
-----------------------------------------------------------------------------------------------------
TRANSACTIONS APPLIED     3764
TRANSACTIONS MINED       4985

The mined transactions should be about twice the applied transaction, if this decreases or staying at a low value you need to start looking at the mining engine.
Make sure all preparers are busyselect count(1) as idle_preparers from v$logstdby_process where type = 'PREPARER' and STATUS_CODE = 16166;
IDLE_PREPARER
----------------------------
0
Make sure the peak size is well below the amount allocatedselect used_memory_size from v$logstdby_session where session_id = (select value from v$logstdby_stats where name = 'LOGMINER SESSION ID');

USED_MEMORY_SIZE
----------------------------
32522244
verify that the preparer does not have enough work for the applier processesselect (available_txn - pinned_txn) as pipleline_depth from v$logstdby_session where session_id (select value from v$lostdby_stats where name = 'LOGMINER SESSION ID');
PIPELINE_DEPTH
----------------------------
8

select count(*) as applier_count from v$logstdby_process where type = 'APPLIER';

APPLIER_COUNT
----------------------------
20
Setting max_servers and preparersexecute dbms_logstdby.apply_set('MAX_SERVERS', 36);
execute dbms_logstdby.apply_set('PREPARE_SERVERS', 3);
display the pageout activity## Run this first
select name, value from v$logstdby_stats where name line '%PAGE%' or name like '%UPTIME' or name like '%IDLE%';

## Run the second time about 10 mins later
select name, value from v$logstdby_stats where name line '%PAGE%' or name like '%UPTIME' or name like '%IDLE%';

Now subtract one from the other and work out the percentage rate, if pageout has increase above 5% then increase the MAX_SERVERS
unassigned large transactions## By default SQL apply should be one-sixth of the number of applier processes

select (available_txn - pinned_txn) as pipleline_depth from v$logstdby_session where session_id (select value from v$lostdby_stats where name = 'LOGMINER SESSION ID');
PIPELINE_DEPTH
----------------------------
256
select count(1) as idle_applier from v$logstdby_process where type = 'APPLIER' and statuscode = 16166;

IDLE_APPLIER
---------------------------
12

## Now look for the unassigned large transactions

select value from v$logstdby_stats where name = 'LARGE TXNS WAITING TO BE ASSIGNED';
VALUE
---------------------------
12
Monitoring
archive gap logs# Use the thread# when using RAC an detect missing sequences

select thread#, low_sequence#, high_sequence# from v$archive_gap;
delays in redo transportselect max(sequence#), thread# from v$archived_log group by thread#;
## you can use the dg_archivelog_monitor.sh script, which accepts three parameters, primary, physical
## and the archive log threshold (# of archive logs)

dg_archivelog_monitor.sh <primary> <standby> <threshold>
Identify the missing logs on the primary## On the primary run the below
select L.thread#, L.sequence#
from
  (select thread#, sequence# from v$archived_log where dest_id=1) L
    where L.sequence# not in
       (select sequence# from v$archived_log where dest_id=2 and thread# = L.thread#);
apply rate and active monitoringselect to_char(start_time, 'DD-MON-RR HH24:MI:SS') start_time, item , sofar from v$recovery_progress
  where item in ('Active Apply Rate', 'Average Apply Rate', 'Redo Applied');

Note: the redo applied is measured in megabytes, while the average apply rate and the active apply rate is measured in kilobytes.
transport and apply lagcol name for a13
col value for a13
col unit for a30
set lines 132

select name, value, unit, time_computed from v$dataguard_stats where name in ('transport lag', 'apply lag');

## use the dg_time_lag.ksh script
dg_time_lag.ksh
Viewing the status of the managed recovery processcol client_pid for a10;

select pid, process, status, client_process, client_pid, thread#, sequence#, block#, blocks from v$managed_standby;
Switchover, Failover and FSFO
Complete Switchover## Start the switcover on the original primary
alter database commit to switchover to standby;

## On the new primary complete the switchover
alter database commit to switchover to primary;

## Now open the database on the new primary
alter database open;
Complete Failover## Start the failover
alter database commit to switchover to primary;
# Change the level of protection that you require
sql> alter database set standby to maximize performance;
sql> alter database set standby to maximize availability;
sql> alter database set standby to maximize protection;
Broker switchoverDGMGRL> switchover to prod1lr

Action
Step
Commands
check redo has been received
1
## check the syn status, it should say yes (run on the standby)
sql> select db_unique_name, protection_mode, synchronization_status, synchronized from v$archive_dest_status
## if it says NO then lets make further checks (run on the standby)
sql> select client_process, process, sequence#, status from v$managed_standby;

## now check on the primary we should be one in front (run on the primary)
sql> select thread#, sequence#, status from v$log;
Note: if using a RAC environment make sure you check each instance 
check that redo has been applied (physical)
2
## check that MRP (applying_log) matches the RFS process, if the MRP line is missing then you need to
## start the apply process, you also may see the status of wait_for_gap so wait until the gap have been
## resolved first

sql> select client_process, process, sequence#, status from v$managed_standby;
check that redo has been applied (logical)
3
## if you are using a logical standby then you need to check the following to confirm the redo has been
## applied

sql> select applied_scn, latest_scn, mining_scn from v$logstdby_progress;

## if the mining scn is behind you may have a gap check this by using the following

sql> select status from v$logstdby_process where type = 'READER';
show any running jobs or backups
4
sql> select process, operation, r.status, mbytes_processed pct, s.status from v$rman_status r, v$session s where r.sid=s.sid;
increase logging level (if required)
5
sql> alter system set log_archive_trace=8129;

## to turn it off again
sql> alter system set log_archive_trace=0;
check for active sessions
6
## Display the active sessions
sql> select program, type from v$session where type='USER';
check the switchover status
7
## make sure the status is "to standby", if you get "sessions active", then stop those sessions (see above command to identify
## sessions)
sql> select switchover_status from v$database;
tail the log alert log file
8
tail alert??.log
switchover (primary)
9
## on the primary, after this command completes you will have two physical standbys

sql> alter database commit to switchover to physical standby with session shutdown;

Note: at this point if you want to rollback this switchover see my troubleshooting section to get ot back to normal
check the switchover status
10
sql> select switchover_status from v$database;
complete the switchover (physical)
11
sql> alter database commit to switchover to primary with session shutdown;
open the new primary
12
sql> alter database open;
finish off the old primary
13
sql> shutdown immediate;
sql> startup mount;
sql> alter database recover managed standby database using current logfile disconnect;

Action
Step
Commands
check redo has been received
1
## check the syn status, it should say yes (run on the standby)
sql> select db_unique_name, protection_mode, synchronization_status, synchronized from v$archive_dest_status
## if it says NO then lets make further checks (run on the standby)
sql> select client_process, process, sequence#, status from v$managed_standby;

## now check on the primary we should be one in front (run on the primary)
sql> select thread#, sequence#, status from v$log;
Note: if using a RAC environment make sure you check each instance 
check that redo has been applied (physical)
2
## check that MRP (applying_log) matches the RFS process, if the MRP line is missing then you need to
## start the apply process, you also may see the status of wait_for_gap so wait until the gap have been
## resolved first

sql> select client_process, process, sequence#, status from v$managed_standby;
check that redo has been applied (logical)
3
## if you are using a logical standby then you need to check the following to confirm the redo has been
## applied

sql> select applied_scn, latest_scn, mining_scn from v$logstdby_progress;

## if the mining scn is behind you may have a gap check this by using the following

sql> select status from v$logstdby_process where type = 'READER';
show any running jobs or backups
4
sql> select process, operation, r.status, mbytes_processed pct, s.status from v$rman_status r, v$session s where r.sid=s.sid;
increase logging level (if required)
5
sql> alter system set log_archive_trace=8129;

## to turn it off again
sql> alter system set log_archive_trace=0;
check for active sessions
6
## Display the active sessions
sql> select program, type from v$session where type='USER';
check the switchover status
7
## make sure the status is "to standby", if you get "sessions active", then stop those sessions (see above command to identify
## sessions)
sql> select switchover_status from v$database;
tail the log alert log file
8
tail alert??.log
Prepare the primary standby
9
sql> alter database prepare to switchover to logical standby;

## confirm that the prepare has started to happen, you should now see "preparing switchover"
sql> select switchover_status from v$database;
Prepare the logical standby
10
sql> alter database prepare to switchover to primary;

## confirm that the prepare has started to happen, you should see "preparing dictionary"
sql> select switchover_status from v$database;

## wait a while until the dictionary is built and sent and you should see "preparing switchover"
sql> select switchover_status from v$database;
Check primary database state
11
## you should now see its in the state of "to logical standby"

sql> select switchover_status from v$database;
the last chance to CANCEL the switchover (no going back after this)
12
## On the primary
sql> alter database prepare to switchover cancel;

## on the logical
sql> alter database prepare to switchover cancel;
switchover the primary to a logical standby
13
sql> alter database commit to switchover to logical standby;
switchover the logical standby to a primary
14
## check that its ready to become the primary, you should see "to primary"

sql> select switchover_status from v$database

## Complete the switchover
sql> alter database commit to standby to primary;
start the apply process
15
sql> alter database start logical standby apply immediate;

Action
Step
Commands
Check redo applied
1
## This will tell you the lag time

select name, value, time_computed from v$dataguard_stats where name like '%lag%';

## You can also use the SCN number

select thread#, sequence#, last_change#, last_time from v$standby_log; 
the failover process (physical standby)
2
## Start by telling the apply process that this standby is going to be the new primary, and to apply all
## the redo that it has

alter database recover managed standby database cancel;
alter database recover managed standby database finish;
## At this point the protection mode is lowered

select protection_mode from v$database;

## Now issue the switchover command and then open the database

alter database commit to switchover to primary with session shutdown;
alter database open;

## Startup the other RAC instances if using RAC
## You can then raise the protection mode (if desired)

set standby database to maximum protection;
the failover process (logical standby)
2
alter database activate logical standby database finish apply;

Action
Step
Commands
bring back the old primary (physical standby)
1
## Since redo is applied by SCN we need he failover SCN from the new primary

select to_char(standby_became_primary_scn) failover_scn from v$database;

FAILOVER_SCN
-----------------------------------------------
7658841

## Now flashback the old primary to this SCN and start in mount mode
startup mount;
flashback database to scn 7658841;
alter database convert to physical standby;
shutdown immediate;
startup mount;

## hopefully the old primary will start to resolve any gap issues at the next log switch, which means we can start the MRP
## process to get this standby going to catchup as fast as possible

alter database recover managed standby database using current logfile disconnect;

## eventually the missing redos will be sent to the standby and applied, bring us back to synchronization again.
bring back the old primary (logical standby)
2
## again we need to obtained the SCN
select merge_change# as flashback_scn, processed_change# as recovery_scn from dba_logstdby_history where stream_sequence# = (select max(stream_sequence#)-1 from dba_logstdby_history);

flashback_scn      recovery_scn
---------------------------------------------------------
         7658941              7659568

## Now flashback the old primary to this SCN and start in mount mode
startup mount;
flashback database to scn 7658841;
alter database convert to physical standby;
shutdown immediate;
startup mount;

## Now we need to hand feed the archive logs from the primary to the standby (old primary) into the MRP
## process, so lets get those logs (run on the primary)

select file_name from dba_logstdby_log where first_changed# <= recovery_scn and next_change# > flashback_scn;

## Now you will hopefully have a short list of the files you need, now you need to register them with
## the standby database (old primary)

alter database register logfile '<files from above list>';

## Now you can recover up to the SCN but not including the one you specify
recover managed standby database until change 7659568;

## Now the standby database becomes a logical standby as up to this point it has been a physical one.
alter database active standby database;
## Lastly you need tell your new logical standby to ask the primary for a new copy of the dictionary and
## all the redo in between. The SQL Apply will connect to the new primary using the database link and
## retrieve the LogMiner dictionary, once the dictionary has been built, SQL Apply will apply all the
## redo sent from the new primary and get itself synchronized

create public database link reinstatelogical connect to system identified by password using 'service_name_of_new_primary_database';

alter database start logical standby apply new primary reinstatelogical;
Use the Broker to bring back the old Primary
Use the broker to do it all for you
n/a
DGMGRL> failover to prod1dr;
DGMGRL> reinstate database prod1;

Fast Start Failover (FSFO)
Monitor a specific condition via the BrokerDGMGRL> enable fast_start failover condition "Corrupted Controlfile";
DGMGRL> enable fast_start failover condition "Datafile Offline";
Display conditions that are be monitoredDGMGRL> show fast_start failover;
Select the standby to become the primaryDGMGRL> edit database prod1 set property FastStartFailoverTarget = 'prod1dr';
DGMGRL> edit database prod1dr set property FastStartFailoverTarget = 'prod1';
change thresholdDGMGRL> edit configuration set property FastStartFailoverTargetThreshold = 45;
lag limitDGMGRL> edit configuration set property FastStartFailoverLagLimit = 60;
abort primary if in a hung stateDGMGRL>edit configuration set property FastStartFailoverPmyShutdown = true;
reinstate primary after a failoverDGMGRL>edit configuration set property FastStartFailoverAutoReinstate = true;
Enable FSFODGMGRL> enable fast_start failover;

## Display the configuration

DGMGRL> show fast_start failover;

No comments: