Log Gap Detection and Resolution

by Arif Malik, Sr. Database Administrator

Introduction

An archive gap is a range of archived redo logs that were generated by the primary database at a time when the standby database was not available to receive them. Archive gaps occur most often during a period of network unavailability between the primary and standby databases: while the network is down, the standby database fails to receive any redo data from the primary database. When the network failure has been resolved, automatic transmission of redo data from the primary to the standby database resumes. The missing archived redo logs constitute the gap.

Primary Database Heartbeat

In general, it is the primary database that first discovers there is a gap. Every minute, the primary database polls each of its standby databases in turn to determine if there is a gap in the sequence of archived redo logs. This polling between the primary and standby databases is often referred to as the heartbeat, and it is performed by the primary database serially. The DBA can always determine if there is a gap in the archived redo logs by querying the V$ARCHIVE_GAP view (from the physical standby database) as described in the next section.
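The effect of the heartbeat can also be observed from the primary database by checking the status of each archive destination. As a quick sketch (V$ARCHIVE_DEST is a standard dynamic performance view; destination 2 is the standby destination in the configuration used later in this article):

SQL> SELECT dest_id, status, error
  2  FROM   v$archive_dest
  3  WHERE  dest_id IN (1, 2);

When the heartbeat cannot reach the standby, the ERROR column for that destination will typically show the underlying Oracle Net error (for example, ORA-12541 when the remote listener is down).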

Determining if an Archive Gap Exists

As mentioned in the previous section, the DBA can determine if there is a gap in the archived redo logs by querying the V$ARCHIVE_GAP view from the physical standby database. Let's walk through a quick example of how to manually check and resolve a gap in the archived redo logs.
We start with the following assumptions:
  1. The Data Guard configuration consists of a physical primary database and one physical standby database.
  2. The Data Guard configuration is set for Maximum Performance.
  3. The name of the physical primary database is TESTDB.IDEVELOPMENT.INFO and resides on the node vmlinux3.idevelopment.info.
  4. The name of the physical standby database is TESTDB.IDEVELOPMENT.INFO and resides on the node vmlinux4.idevelopment.info.
  5. The primary database is configured to send redo to the physical standby database by defining: log_archive_dest_2='service=TESTDB_VMLINUX4 optional reopen=15'
  6. Neither FAL_CLIENT nor FAL_SERVER is defined on the physical standby database.
  7. The standby database is in managed recovery mode and is current (all archived redo logs have been applied) with the primary database. The current log sequence on the primary database is 16.
With the above assumptions, perform a query of the V$ARCHIVE_GAP view from the physical standby database:
SQL> select * from v$archive_gap;

no rows selected
From the output of the above query, we can conclude that there are no gaps in the archived redo logs on the physical standby database!
Now, let's simulate a network failure by shutting down the Oracle TNS Listener process and physical standby database on vmlinux4.idevelopment.info:
$ lsnrctl stop

$ sqlplus "/ as sysdba" <<EOF
> shutdown abort
> EOF

SQL> ORACLE instance shut down.
At this point, the physical standby database is completely down and unavailable to the primary database. One of the first things to notice is a series of RFS errors in the alert.log for the primary database.
Errors in alert.log from the Primary Database
...
Sat Jul 22 23:19:04 2006
ARC1: Evaluating archive   log 3 thread 1 sequence 16
ARC1: Beginning to archive log 3 thread 1 sequence 16
Creating archive destination LOG_ARCHIVE_DEST_2: 'TESTDB_VMLINUX4'
Creating archive destination LOG_ARCHIVE_DEST_1: '/u02/oraarchive/TESTDB/arch_t1_s16.dbf'
ARC1: Completed archiving  log 3 thread 1 sequence 16
Sat Jul 22 23:21:25 2006
Errors in file /u01/app/oracle/admin/TESTDB/bdump/testdb_arc1_3495.trc:
ORA-12541: TNS:no listener
Sat Jul 22 23:22:25 2006
Errors in file /u01/app/oracle/admin/TESTDB/bdump/testdb_arc1_3495.trc:
ORA-12541: TNS:no listener
Sat Jul 22 23:23:25 2006
Errors in file /u01/app/oracle/admin/TESTDB/bdump/testdb_arc1_3495.trc:
ORA-12541: TNS:no listener
...
We can then further investigate the trace file (testdb_arc1_3495.trc) referenced in the alert.log above.
Errors in testdb_arc1_3495.trc from the Primary Database
/u01/app/oracle/admin/TESTDB/bdump/testdb_arc1_3495.trc

Oracle9i Enterprise Edition Release 9.2.0.7.0 - Production
With the Partitioning, Oracle Label Security, OLAP and Oracle Data Mining options
JServer Release 9.2.0.7.0 - Production
ORACLE_HOME = /u01/app/oracle/product/9.2.0
System name:    Linux
Node name:      vmlinux3
Release:        2.6.9-22.EL
Version:        #1 Sat Oct 8 17:48:27 CDT 2005
Machine:        i686
Instance name: TESTDB
Redo thread mounted by this instance: 0 
Oracle process number: 11
Unix process pid: 3495, image: oracle@vmlinux3 (ARC1)

*** SESSION ID:(10.1) 2006-07-22 23:18:25.890
Destination LOG_ARCHIVE_DEST_2 is in CLUSTER CONSISTENT mode
Destination LOG_ARCHIVE_DEST_2 is in MAXIMUM PERFORMANCE mode
*** 2006-07-22 23:20:25.902
RFS network connection lost at host 'TESTDB_VMLINUX4'
Fail to ping standby 'TESTDB_VMLINUX4', error = 3113
Error 3113 when pinging standby TESTDB_VMLINUX4.
*** 2006-07-22 23:20:25.903
kcrrfail: dest:2 err:3113 force:0
*** 2006-07-22 23:21:25.908
Error 12541 connecting to destination LOG_ARCHIVE_DEST_2 standby host 'TESTDB_VMLINUX4'
Error 12541 attaching to destination LOG_ARCHIVE_DEST_2 standby host 'TESTDB_VMLINUX4'
Heartbeat failed to connect to standby 'TESTDB_VMLINUX4'. Error is 12541.
*** 2006-07-22 23:21:25.909
kcrrfail: dest:2 err:12541 force:0
ORA-12541: TNS:no listener
...
Now let's perform a few log switches from the primary database:
SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.

SQL> alter system switch logfile;

System altered.
Next, bring up the Oracle TNS Listener process and mount the physical standby database:
$ lsnrctl start

$ sqlplus "/ as sysdba"

SQL> startup nomount

SQL> alter database mount standby database;

SQL> alter database recover managed standby database disconnect from session;
After a minute has passed, the primary database (the ARC1 process) will wake up and attempt to ping the physical standby database. Once it determines that the physical standby database is mounted (and the TNS Listener process is up), it will start to push any missing archived redo logs to the standby database. This can be seen in the alert.log from the primary database:
RFS Process Recovery in alert.log from the Primary Database
...
Sat Jul 22 23:28:26 2006
ARC1: Begin FAL archive (thread 1 sequence 17 destination TESTDB_VMLINUX4)
Creating archive destination LOG_ARCHIVE_DEST_2: 'TESTDB_VMLINUX4'
ARC1: Complete FAL archive (thread 1 sequence 17 destination TESTDB_VMLINUX4)
ARC1: Begin FAL archive (thread 1 sequence 18 destination TESTDB_VMLINUX4)
Creating archive destination LOG_ARCHIVE_DEST_2: 'TESTDB_VMLINUX4'
ARC1: Complete FAL archive (thread 1 sequence 18 destination TESTDB_VMLINUX4)
ARC1: Begin FAL archive (thread 1 sequence 19 destination TESTDB_VMLINUX4)
Creating archive destination LOG_ARCHIVE_DEST_2: 'TESTDB_VMLINUX4'
ARC1: Complete FAL archive (thread 1 sequence 19 destination TESTDB_VMLINUX4)
...
Finally, let's query the V$ARCHIVE_GAP view from the physical standby database to determine if there are any gaps:
SQL> select * from v$archive_gap;

no rows selected
As we can see, there are no gaps in the archived redo logs being reported!
 
When the DBA queries the V$ARCHIVE_GAP view and has a record returned, this indicates a gap in the archived redo logs as illustrated below and may require manual intervention by the DBA:
SQL> select * from v$archive_gap;

THREAD#  LOW_SEQUENCE#  HIGH_SEQUENCE#
-------- -------------- --------------
       1            24              28
From the output above, the physical standby database is currently missing logs from sequence 24 to sequence 28 for thread 1. Note that this view only returns the next gap that is currently blocking managed recovery from continuing. After resolving the identified gap and starting managed recovery, the DBA should query the V$ARCHIVE_GAP view again on the physical standby database to determine the next (if any) gap sequence. This process should be repeated until there are no more gaps.
After identifying a gap (as shown above), the DBA will need to query the primary database to locate the archived redo logs on the primary database. The following query assumes the local archive destination on the primary database is LOG_ARCHIVE_DEST_1:
SELECT name
FROM v$archived_log
WHERE thread# = 1
  AND dest_id = 1
  AND sequence# BETWEEN 24 and 28;

NAME
--------------------------------------
/u02/oraarchive/TESTDB/arch_t1_s24.dbf
/u02/oraarchive/TESTDB/arch_t1_s25.dbf
/u02/oraarchive/TESTDB/arch_t1_s26.dbf
/u02/oraarchive/TESTDB/arch_t1_s27.dbf
/u02/oraarchive/TESTDB/arch_t1_s28.dbf
Copy the above redo log files to the physical standby database and register them using the ALTER DATABASE REGISTER LOGFILE ... SQL statement on the physical standby database. For example:
SQL> ALTER DATABASE REGISTER LOGFILE '/u02/oraarchive/TESTDB/arch_t1_s24.dbf';
SQL> ALTER DATABASE REGISTER LOGFILE '/u02/oraarchive/TESTDB/arch_t1_s25.dbf';
SQL> ALTER DATABASE REGISTER LOGFILE '/u02/oraarchive/TESTDB/arch_t1_s26.dbf';
SQL> ALTER DATABASE REGISTER LOGFILE '/u02/oraarchive/TESTDB/arch_t1_s27.dbf';
SQL> ALTER DATABASE REGISTER LOGFILE '/u02/oraarchive/TESTDB/arch_t1_s28.dbf';
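The copy step mentioned above can be sketched with scp (this assumes the standby host uses the same archive destination path, /u02/oraarchive/TESTDB, as the primary; adjust the path and any other transport tool — ftp, rcp, rsync — to suit your environment):

$ scp /u02/oraarchive/TESTDB/arch_t1_s2[4-8].dbf \
      oracle@vmlinux4:/u02/oraarchive/TESTDB/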
After the redo logs have been registered on the physical standby database, the DBA can restart the managed recovery operations. For example, to put the physical standby database into automatic recovery managed mode:
SQL> alter database recover managed standby database disconnect from session;
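To confirm that the gap has been resolved and that redo apply is progressing, re-query V$ARCHIVE_GAP and check the highest applied log sequence on the physical standby database. One common sketch uses the APPLIED column of V$ARCHIVED_LOG:

SQL> select * from v$archive_gap;

SQL> select max(sequence#) from v$archived_log where applied = 'YES';

If V$ARCHIVE_GAP returns another row, resolve that gap in the same manner and repeat until no rows are returned.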

Using FAL_CLIENT and FAL_SERVER

As mentioned throughout this article, Data Guard provides automatic gap detection and resolution through functionality built into the primary database: the primary polls its standby databases with a heartbeat every minute. No extra configuration settings are required and no intervention is needed by the DBA to detect and resolve these gaps. In addition to this built-in functionality, however, it is also possible to configure log apply services to take part in automatically resolving archive gaps as they occur, from the physical standby database side.
FAL_CLIENT and FAL_SERVER are initialization parameters used to configure log gap detection and resolution on the standby side of a physical standby configuration. This functionality is provided by log apply services and is used by the physical standby database to manage the detection and resolution of gaps in the archived redo logs.
 
Note that FAL_CLIENT and FAL_SERVER only need to be defined in the initialization parameter file for the standby database(s). It is possible, however, to define these two parameters in the initialization parameter file for the primary database as well, to reduce the work required should the primary database ever need to transition to the standby role.
  • FAL_CLIENT
    Specifies the Oracle Net (TNS) service name of the standby database. This value is sent to the FAL server process by log apply services and is used by the FAL server to connect back to the standby database. The syntax is:
    FAL_CLIENT=net_service_name
  • FAL_SERVER
    Specifies the Oracle Net (TNS) service name that the standby database should use to connect to the FAL server process. The syntax is:
    FAL_SERVER=net_service_name
    Note that the value provided for FAL_SERVER is not limited to a single FAL server; it can define multiple FAL servers (possibly a FAL server running on another standby database), separated by commas, as in the following example:
    FAL_SERVER='primary_db','standby_db1','standby_db2','standby_db3'
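Putting the two parameters together, the initialization parameter file for the standby database in this article's configuration might contain entries such as the following (TESTDB_VMLINUX3 is an assumed TNS alias for the primary database on vmlinux3; TESTDB_VMLINUX4 is the standby alias used earlier in log_archive_dest_2):

fal_server = 'TESTDB_VMLINUX3'
fal_client = 'TESTDB_VMLINUX4'

With these set, the physical standby database can itself request any missing archived redo logs from the FAL server rather than waiting solely on the primary's heartbeat.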

FAL server is a background process that generally resides on the primary database server (although it can also run on another standby database) and is responsible for servicing incoming requests from the FAL client.
