Channel: RAC issues – Kamran Agayev's Oracle Blog

[INS-20802] Creating Container Database for Oracle Grid Infrastructure Management Repository failed


After dealing with the root.sh script to configure a 3-node clusterware environment, I finally succeeded, but during the post-configuration step OUI returned the following error and was unable to create the container database for the Oracle Grid Infrastructure Management Repository:

[Screenshot: OUI post-configuration error INS-20802, Creating Container Database for Oracle Grid Infrastructure Management Repository failed]
There was no information in the log file mentioned in the error message; however, tracing of the database creation job was enabled, and I found a long trace file under the log directory where I saw the following messages:

set newname for datafile 1 to new;

set newname for datafile 3 to new;

set newname for datafile 4 to new;

restore datafile 1;

restore datafile 3;

restore datafile 4; }
[Thread-159] [ 2017-07-24 04:24:07.299 EDT ] [RMANEngine.executeImpl:1321] Notify reader to start reading
[Thread-177] [ 2017-07-24 04:24:07.300 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=echo set off
[Thread-177] [ 2017-07-24 04:24:07.300 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.305 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN> 2> 3> 4> 5> 6> 7> 8> 9> 10> 11> 12> 13>
[Thread-177] [ 2017-07-24 04:24:07.305 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=executing command: SET NEWNAME
[Thread-177] [ 2017-07-24 04:24:07.546 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.547 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=executing command: SET NEWNAME
[Thread-177] [ 2017-07-24 04:24:07.562 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.563 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=executing command: SET NEWNAME
[Thread-177] [ 2017-07-24 04:24:07.578 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:07.585 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=Starting restore at 24-JUL-17
[Thread-177] [ 2017-07-24 04:24:07.792 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=allocated channel: ORA_DISK_1
[Thread-177] [ 2017-07-24 04:24:07.797 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: SID=18 device type=DISK
[Thread-177] [ 2017-07-24 04:24:08.051 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:08.383 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: starting datafile backup set restore
[Thread-177] [ 2017-07-24 04:24:08.385 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: specifying datafile(s) to restore from backup set
[Thread-177] [ 2017-07-24 04:24:08.386 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: restoring datafile 00001 to +DATA
[Thread-177] [ 2017-07-24 04:24:08.387 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=channel ORA_DISK_1: reading from backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-00571: ===========================================================
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-00571: ===========================================================
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN-03002: failure of restore command at 07/24/2017 04:24:23
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=ORA-19870: error while restoring backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb
[Thread-177] [ 2017-07-24 04:24:23.487 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=ORA-19872: Unexpected end of file at block 4800 while decompressing backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb
[Thread-177] [ 2017-07-24 04:24:23.495 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=
[Thread-177] [ 2017-07-24 04:24:23.496 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=RMAN>
[Thread-177] [ 2017-07-24 04:24:23.496 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=echo set on
[Thread-177] [ 2017-07-24 04:24:23.504 EDT ] [RMANEngine.readSqlOutput:988] Log RMAN Output=set echo off;
[Thread-177] [ 2017-07-24 04:24:23.504 EDT ] [RMANEngine.readSqlOutput:1031] hasError is true
[Thread-177] [ 2017-07-24 04:24:23.504 EDT ] [RMANEngine.readSqlOutput:1037] ERROR TRACE DETECTED

 

So, in order to create the container database, the installer was trying to restore a seed database, failed to do so, and hit the following error:

ORA-19872: Unexpected end of file at block 4800 while decompressing backup piece /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb

So the problem was with the backup piece of the MGMT database. Permissions were OK, so I compared the size of the extracted backup piece with the one inside the downloaded zip file:

[root@oratest01 ~]# cd /u01/app/12.2.0.1/grid/assistants/dbca/templates/
[root@oratest01 templates]# ll
total 131452
-rw-r--r-- 1 oracle oinstall 5734 Jan 26 10:48 DomainServicesCluster_GIMR.dbc
-rw-r----- 1 oracle oinstall 18628608 Jan 26 10:46 MGMTSeed_Database.ctl
-rw-r----- 1 oracle oinstall 5177 Jan 26 10:48 MGMTSeed_Database.dbc
-rw-r----- 1 oracle oinstall 39321600 Jan 26 10:46 MGMTSeed_Database.dfb
-rw-r----- 1 oracle oinstall 10578 Jun 10 2016 New_Database.dbt
-rw-r----- 1 oracle oinstall 76619776 Jan 26 10:11 pdbseed.dfb
-rw-r----- 1 oracle oinstall 6579 Jan 26 10:11 pdbseed.xml

 

It was 39MB in the extracted folder, but 104MB inside the zip file itself.

[Screenshot: size of MGMTSeed_Database.dfb inside the downloaded installation zip]
Somehow it had not been unzipped correctly. I moved all the files to a backup folder, copied the backup pieces from the downloaded installation zip file into the same folder on the first node and restarted the configuration, and it succeeded.
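For reference, the mismatch can be spotted and repaired without re-copying the whole home. The sketch below is only an illustration and assumes the downloaded Grid image is still available under a staging directory (the zip name and the /stage path are examples, not from this system):

# Compare the size stored in the archive with the extracted copy (no extraction needed)
[root@oratest01 ~]# unzip -l /stage/linuxx64_12201_grid_home.zip "*MGMTSeed_Database.dfb"
[root@oratest01 ~]# ls -l /u01/app/12.2.0.1/grid/assistants/dbca/templates/MGMTSeed_Database.dfb

# Re-extract only the seed templates into the Grid home, overwriting the truncated copies
[root@oratest01 ~]# cd /u01/app/12.2.0.1/grid
[root@oratest01 grid]# unzip -o /stage/linuxx64_12201_grid_home.zip "assistants/dbca/templates/*"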

[Screenshot: Grid Infrastructure configuration completing successfully after the restart]
Good Luck!


Using a deprecated ASM parameter might prevent your cluster from starting


A few days ago, I was testing some ASM parameters in my 3-node 12.2 Clusterware environment and used the ASM_PREFERRED_READ_FAILURE_GROUPS parameter to see how I could force ASM to read from a specific failure group. The tests were successful, but I didn't know that this parameter is deprecated in 12.2, and beyond that, I didn't imagine that it might cause me downtime and prevent the Clusterware from starting.

Here's the scenario that you can try in your test environment. First of all, I set this parameter to a failure group and then reset it back to an empty value:

SQL> alter system set ASM_PREFERRED_READ_FAILURE_GROUPS='';

System altered.

SQL> 
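For context, when this parameter is actually used (and again, it is deprecated as of 12.2), its value is a list of DISKGROUP.FAILGROUP pairs, normally set per instance so that each node reads from its local storage. A sketch with made-up failure group names:

SQL> alter system set asm_preferred_read_failure_groups='DATA.FG_SITE1' sid='+ASM1';
SQL> alter system set asm_preferred_read_failure_groups='DATA.FG_SITE2' sid='+ASM2';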

 

Then I made some hardware changes to my nodes and rebooted them. After the nodes were rebooted, I checked the status of the clusterware, and it was down on all nodes.

 

[oracle@oratest01 ~]$ crsctl check crs

CRS-4638: Oracle High Availability Services is online

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4529: Cluster Synchronization Services is online

CRS-4534: Cannot communicate with Event Manager

 

 

[oracle@oratest01 ~]$ crsctl check cluster -all

**************************************************************

oratest01:

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4529: Cluster Synchronization Services is online

CRS-4534: Cannot communicate with Event Manager

**************************************************************

oratest02:

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4530: Communications failure contacting Cluster Synchronization Services daemon

CRS-4534: Cannot communicate with Event Manager

**************************************************************

oratest03:

CRS-4535: Cannot communicate with Cluster Ready Services

CRS-4530: Communications failure contacting Cluster Synchronization Services daemon

CRS-4534: Cannot communicate with Event Manager

**************************************************************

 

Next, I checked whether the ohasd and crsd background processes were up:

[root@oratest01 oracle]# ps -ef|grep init.ohasd|grep -v grep

root      1252     1  0 02:49 ?        00:00:00 /bin/sh /etc/init.d/init.ohasd run >/dev/null 2>&1 </dev/null

[root@oratest01 oracle]#

 

[root@oratest01 oracle]# ps -ef|grep crsd|grep -v grep

[root@oratest01 oracle]#

 

OHAS was up and running, but CRSD was not. The ASM instance has to be up in order to bring up CRSD, so I checked whether the ASM instance was running, but it was down as well:

[oracle@oratest01 ~]$ ps -ef | grep smon

oracle    5473  3299  0 02:50 pts/0    00:00:00 grep --color=auto smon

[oracle@oratest01 ~]$
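Another quick check at this point is to ask OHAS itself about the lower-stack resources it manages; the -init flag queries the resources owned by ohasd rather than crsd. A sketch, assuming the standard 12.2 resource names:

[root@oratest01 oracle]# crsctl stat res ora.asm -init -t
[root@oratest01 oracle]# crsctl stat res ora.crsd -init -t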

 

 

 

Next, I decided to check the log files. I logged in to adrci to find the centralized Clusterware log folder:

 

[oracle@oratest01 ~]$ adrci

ADRCI: Release 12.2.0.1.0 – Production on Fri Oct 20 02:51:59 2017

Copyright (c) 1982, 2017, Oracle and/or its affiliates.  All rights reserved.

ADR base = “/u01/app/oracle”

adrci> show home

ADR Homes:

diag/rdbms/_mgmtdb/-MGMTDB

diag/rdbms/proddb/proddb1

diag/asm/user_root/host_4288267646_107

diag/asm/user_oracle/host_4288267646_107

diag/asm/+asm/+ASM1

diag/crs/oratest01/crs

diag/clients/user_root/host_4288267646_107

diag/clients/user_oracle/host_4288267646_107

diag/tnslsnr/oratest01/asmnet1lsnr_asm

diag/tnslsnr/oratest01/listener_scan1

diag/tnslsnr/oratest01/listener_scan2

diag/tnslsnr/oratest01/listener_scan3

diag/tnslsnr/oratest01/listener

diag/tnslsnr/oratest01/mgmtlsnr

diag/asmtool/user_root/host_4288267646_107

diag/asmtool/user_oracle/host_4288267646_107

diag/apx/+apx/+APX1

diag/afdboot/user_root/host_4288267646_107

adrci> exit

[oracle@oratest01 ~]$ cd /u01/app/oracle/diag/crs/oratest01/crs

[oracle@oratest01 crs]$cd trace

 

[oracle@oratest01 trace]$ tail -f evmd.trc

2017-10-20 02:54:26.533 :  CRSOCR:2840602368:  OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

2017-10-20 02:54:27.552 :  CRSOCR:2840602368:  OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

2017-10-20 02:54:28.574 :  CRSOCR:2840602368:  OCR context init failure.  Error: PROC-32: Cluster Ready Services on the local node is not running Messaging error [gipcretConnectionRefused] [29]

 

From the evmd.trc file it can be seen that the OCR was not initialized. Then I checked the alert.log file:

 

[oracle@oratest01 trace]$ tail -f alert.log

2017-10-20 02:49:49.613 [OCSSD(3825)]CRS-1605: CSSD voting file is online: AFD:DATA1; details in /u01/app/oracle/diag/crs/oratest01/crs/trace/ocssd.trc.

2017-10-20 02:49:49.627 [OCSSD(3825)]CRS-1672: The number of voting files currently available 1 has fallen to the minimum number of voting files required 1.

2017-10-20 02:49:58.812 [OCSSD(3825)]CRS-1601: CSSD Reconfiguration complete. Active nodes are oratest01 .

2017-10-20 02:50:01.154 [OCTSSD(5351)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 5351

2017-10-20 02:50:01.161 [OCSSD(3825)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.

2017-10-20 02:50:02.099 [OCTSSD(5351)]CRS-2403: The Cluster Time Synchronization Service on host oratest01 is in observer mode.

2017-10-20 02:50:03.233 [OCTSSD(5351)]CRS-2407: The new Cluster Time Synchronization Service reference node is host oratest01.

2017-10-20 02:50:03.235 [OCTSSD(5351)]CRS-2401: The Cluster Time Synchronization Service started on host oratest01.

2017-10-20 02:50:10.454 [ORAAGENT(3362)]CRS-5011: Check of resource “ora.asm” failed: details at “(:CLSN00006:)” in “/u01/app/oracle/diag/crs/oratest01/crs/trace/ohasd_oraagent_oracle.trc”

2017-10-20 02:50:18.692 [ORAROOTAGENT(3198)]CRS-5019: All OCR locations are on ASM disk groups [DATA], and none of these disk groups are mounted. Details are at “(:CLSN00140:)” in “/u01/app/oracle/diag/crs/oratest01/crs/trace/ohasd_orarootagent_root.trc”.

 

CRS didn't start because ASM was not up and running. Checking why ASM hadn't started upon server boot seemed like a good starting point for the investigation, so I logged in and tried to start the ASM instance:

 

[oracle@oratest01 ~]$ sqlplus / as sysasm

SQL*Plus: Release 12.2.0.1.0 Production on Fri Oct 20 02:55:12 2017

Copyright (c) 1982, 2016, Oracle.  All rights reserved.

Connected to an idle instance.

SQL> startup

ORA-01078: failure in processing system parameters

SQL> startup

ORA-01078: failure in processing system parameters

SQL> startup

ORA-01078: failure in processing system parameters

SQL>

 

I checked the ASM alert.log file, but it didn't provide enough information about why ASM didn't start:

NOTE: ASM client -MGMTDB:_mgmtdb:clouddb disconnected unexpectedly.
NOTE: check client alert log.
NOTE: Trace records dumped in trace file /u01/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_ufg_20658_-MGMTDB__mgmtdb.trc
NOTE: cleaned up ASM client -MGMTDB:_mgmtdb:clouddb connection state (reg:2993645709)
2017-10-20T02:47:20.588256-04:00
NOTE: client +APX1:+APX:clouddb deregistered
2017-10-20T02:47:21.201319-04:00
NOTE: detected orphaned client id 0x10004.
2017-10-20T02:48:49.613505-04:00
WARNING: Write Failed, will retry. group:2 disk:0 AU:9067 offset:151552 size:4096
path:AFD:DATA1
incarnation:0xf0a9ba5e synchronous result:’I/O error’
subsys:/opt/oracle/extapi/64/asm/orcl/1/libafd12.so krq:0x7f8fced52240 bufp:0x7f8fc9262000 osderr1:0xfffffff8 osderr2:0xc28
IO elapsed time: 0 usec Time waited on I/O: 0 usec
ERROR: unrecoverable error ORA-15311 raised in ASM I/O path; terminating process 20200

 

The problem seemed to be in the ASM parameter file, so I decided to start the instance with default parameters first and then investigate. For this, I searched for the string "parameters" in the ASM alert.log file to get the list of non-default parameters and the parameter file location:

[oracle@oratest01 trace]$ more +ASM1_alert.log

Using parameter settings in server-side spfile +DATA/clouddb/ASMPARAMETERFILE/registry.253.949654249

System parameters with non-default values:

  large_pool_size          = 12M

  remote_login_passwordfile= "EXCLUSIVE"

  asm_diskstring           = "/dev/sd*"

  asm_diskstring           = "AFD:*"

  asm_diskgroups           = "NEW"

  asm_diskgroups           = "TESTDG"

  asm_power_limit          = 1

  _asm_max_connected_clients= 4

NOTE: remote asm mode is remote (mode 0x202; from cluster type)

2017-08-11T10:22:24.834431-04:00

Cluster Communication is configured to use IPs from: GPnP
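Based on the non-default values listed above, a minimal pfile is enough to bring the instance up. A rough sketch (the values mirror the listing above; INSTANCE_TYPE marks it as an ASM instance, and depending on the environment additional parameters such as cluster_database may also be needed):

*.instance_type='asm'
*.large_pool_size=12M
*.remote_login_passwordfile='EXCLUSIVE'
*.asm_diskstring='/dev/sd*','AFD:*'
*.asm_diskgroups='NEW','TESTDG'
*.asm_power_limit=1

Any disk group not listed in ASM_DISKGROUPS can still be mounted manually with ALTER DISKGROUP ... MOUNT once the instance is started.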

 

Then I created a parameter file (/home/oracle/pfile_asm.ora) with those values and started the instance:

SQL> startup pfile='/home/oracle/pfile_asm.ora';

ASM instance started

 

Total System Global Area 1140850688 bytes

Fixed Size                                8629704 bytes

Variable Size                      1107055160 bytes

ASM Cache                            25165824 bytes

ASM diskgroups mounted

SQL> exit

 

Great! ASM is up. Now I can extract my original parameter file from the spfile and try to start ASM with it:

 

[oracle@oratest01 ~]$ sqlplus / as sysasm

SQL> create pfile='/home/oracle/pfile_orig.ora' from spfile='+DATA/clouddb/ASMPARAMETERFILE/registry.253.957837377';

File created.

SQL> 
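If you are not sure of the exact spfile path inside ASM, asmcmd can print the location registered in the GPnP profile; a quick sketch, run with the Grid Infrastructure environment set, which should return the same +DATA registry path used above:

[oracle@oratest01 ~]$ asmcmd spget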

 

And here are the contents of my original ASM parameter file:

[oracle@oratest01 ~]$ more /home/oracle/pfile_orig.ora

+ASM1.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from in memory value

+ASM2.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from in memory value

+ASM3.__oracle_base='/u01/app/oracle'#ORACLE_BASE set from in memory value

+ASM3._asm_max_connected_clients=5

+ASM2._asm_max_connected_clients=8

+ASM1._asm_max_connected_clients=5

*.asm_diskgroups='DATA','ACFSDG'#Manual Mount

*.asm_diskstring='/dev/sd*','AFD:*'

*.asm_power_limit=1

*.asm_preferred_read_failure_groups=''

*.large_pool_size=12M

*.remote_login_passwordfile='EXCLUSIVE'

 

Good. Now let’s start ASM with it:

SQL> shut abort

ASM instance shutdown

SQL> startup pfile='/home/oracle/pfile_orig.ora';

ORA-32006: ASM_PREFERRED_READ_FAILURE_GROUPS initialization parameter has been deprecated

 

ORA-01078: failure in processing system parameters

SQL>

 

Whoa. ASM failed to start because of a deprecated parameter?! Let's remove it from the pfile and start ASM without the ASM_PREFERRED_READ_FAILURE_GROUPS parameter:

[oracle@oratest01 ~]$ sqlplus / as sysasm

Connected to an idle instance.

SQL> startup pfile='/home/oracle/pfile_orig.ora';

ASM instance started

 

Total System Global Area 1140850688 bytes

Fixed Size                                8629704 bytes

Variable Size                      1107055160 bytes

ASM Cache                            25165824 bytes

ASM diskgroups mounted

SQL> 

 

It started! Next, I created an ASM server parameter file based on this pfile and restarted the instance:

SQL> create spfile='+DATA' from pfile='/home/oracle/pfile_orig.ora';

File created.

 

SQL> shut immediate

ASM diskgroups dismounted

ASM instance shutdown

 

SQL> startup

ASM instance started

Total System Global Area 1140850688 bytes

Fixed Size                                8629704 bytes

Variable Size                      1107055160 bytes

ASM Cache                            25165824 bytes

ASM diskgroups mounted

SQL> 
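Before restarting the whole stack, it is worth confirming that the clusterware configuration now points at the freshly created spfile; a quick sketch, where the Spfile line should show the new registry file under +DATA:

[oracle@oratest01 ~]$ srvctl config asm | grep -i spfile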

 

After getting ASM up and running, I restarted the clusterware on all nodes and checked the status:

[root@oratest01 ~]$  crsctl stop cluster -all

[root@oratest01 ~]$ crsctl start cluster -all

[oracle@oratest01 ~]$ crsctl check cluster -all

**************************************************************

oratest01:

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

**************************************************************

CRS-4404: The following nodes did not reply within the allotted time:

oratest02, oratest03

 

The first node was up, but I wasn't able to get the clusterware status of the other nodes and got a CRS-4404 error. To solve it, I killed the gpnpd process on all nodes and ran the command again:

 

[oracle@oratest01 ~]$ ps -ef | grep gpn

oracle    3418     1  0 02:49 ?        00:00:15 /u01/app/12.2.0.1/grid/bin/gpnpd.bin

[oracle@oratest01 ~]$ kill -9 3418

[oracle@oratest01 ~]$ ps -ef | grep gpn

oracle   16169     1  3 06:52 ?        00:00:00 /u01/app/12.2.0.1/grid/bin/gpnpd.bin

 

[oracle@oratest01 ~]$ crsctl check cluster -all

**************************************************************

oratest01:

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

**************************************************************

oratest02:

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

**************************************************************

oratest03:

CRS-4537: Cluster Ready Services is online

CRS-4529: Cluster Synchronization Services is online

CRS-4533: Event Manager is online

**************************************************************

[oracle@oratest01 ~]$

 

From this blog post you can learn step-by-step clusterware startup troubleshooting, and why you should not use deprecated ASM parameters.
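As a final takeaway, it may be worth checking whether any explicitly set parameters are already deprecated in your release before the next reboot; a quick sketch against V$PARAMETER:

SQL> select name, value from v$parameter
  2  where isdeprecated = 'TRUE' and isdefault = 'FALSE';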

How to pass Oracle Database 12c: RAC and Grid Infrastructure Administration exam – 1Z0-068 and become Oracle Certified Expert


In this post I will talk about how I prepared for and passed the 12c RAC and Grid Infrastructure Administration exam.

 

About the exam

Check the following link to get more information about the exam from the Oracle University page:

https://education.oracle.com/pls/web_prod-plq-dad/db_pages.getpage?page_id=5001&get_params=p_exam_id:1Z0-068

 

The exam consists of 3 parts:

– Oracle 12c ASM Administration
– Oracle 12c Grid Infrastructure Installation and Administration
– Oracle 12c RAC Administration

 

I don't want to scare you, but the exam is hard enough. The bad thing is that you fail the entire exam if you fail one of the sections. This means you have to be well prepared for all 3 parts. I was good at ASM and RAC Administration, but not comfortable with the Grid Infrastructure Installation and Administration part, which I barely passed.

You may be an Oracle high availability expert and still fail the exam. You might have plenty of experience but fail because of rarely used (or maybe uncommon) features and topics that you didn't practice, didn't read, or read only superficially, because most of the questions test theoretical knowledge rather than practical experience. I have managed highly available cluster databases for the last 8 years, and it was really hard to answer some of the questions about things I have never faced and never saw a reason to try.
There were a lot of questions like "Choose four options, where blah blah blah…", and you have to choose 4 options out of 7. You might know 3 correct answers, but because of that 1 wrong option you might fail the question.

Next, you have to achieve a minimum score in all 3 sections in order to pass the entire exam. You might complete 2 sections with 100%, fail the remaining one, and end up failing the entire exam.

 

How to prepare for the exam?

You have to read the documentation and play with ASM, RAC databases and Grid Infrastructure A LOT!

If you want to learn Oracle 12c Grid Infrastructure installation, check the following video tutorial:

http://www.oraclevideotutorials.com/video/installing-oracle-12cr2-grid-infrastructure

 

Check the videos section on oraclevideotutorials.com to find some clusterware-related hands-on practices:

http://www.oraclevideotutorials.com/videos

 

The only available book related to the exam (mostly the RAC part) is the following one, written by friends of mine Syed Jaffar, Kai Yu and Riyaj Shamsudden, and it is worth reading:

Expert Oracle RAC 12c
https://www.amazon.com/Expert-Oracle-RAC-Experts-Voice/dp/1430250445/

 

In my OCM preparation book, I have two chapters that can help you during the preparation:

Chapter 7 – Grid Infrastructure and ASM

Chapter 8 – Real Application Clusters.

 

To get a free trial PDF copy of the book, go to www.ocmguide.com, or purchase it from the following link:

https://www.amazon.com/Oracle-Certified-Master-Exam-Guide/dp/1536800791/

 

During the exam, I regretted having skipped some chapters in the documentation and having read others only superficially. I highly recommend checking the ASM, RAC and Grid Infrastructure documentation and making sure you go through all of it at least once. Here are the links to the documentation:

 

Real Application Clusters Administration and Deployment Guide

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/racad/toc.htm

 

Clusterware Administration and Deployment Guide

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/cwadd/toc.htm

 

Automatic Storage Management Administrator’s Guide

https://docs.oracle.com/en/database/oracle/oracle-database/12.2/ostmg/toc.htm

 

Setting deadlines and booking the exam

Most of us (including me) postpone the exam and don't set deadlines for the preparation or for the exam itself. My advice: set an approximate date for the exam and make a plan for each month, week and day. Then set a date and book the exam! Yes, book it, because you still have a chance to rebook if you don't feel ready, as long as it's more than 24 hours before the exam. Registering for the exam weeks before the exam date will push you to complete your preparation on time.

 

I booked the exam for Tuesday, rebooked it to Wednesday, then to Thursday, and then to Friday :). On Wednesday I decided to reschedule it to the next Monday, and in the evening I was shocked to see that I hadn't actually rescheduled it to Friday. It would happen tomorrow (on Thursday), just in a few hours! :)

 

I didn't feel that I was ready and still had a few incomplete sections where I felt weak; I was even about to cancel the exam and not attend, but then I decided to push hard and try. And if I lost, I decided to lose like a champ :)

 

So I stayed awake till 3am, took a nap till 6am and made my last preparations till 9am. I attended the exam at 10am, completely exhausted, overworked and sleepy.

Fortunately, I passed the exam, and I wish you the same.

[Image: Oracle Certified Expert, Oracle Database 12c RAC and Grid Infrastructure Administrator badge]
This was my experience with the Oracle Database 12c: RAC and Grid Infrastructure Administration exam (1Z0-068). Let me know if you plan to take the exam, so I can guide you through it in more detail.

Good luck!

PRCR-1079 : Failed to start resource oranode1-vip. CRS-2680 Clean failed. CRS-5804: Communication error with agent process


Last week we had a clusterware issue on one of our critical 3-node RAC environments. On the first node, the network resource was restarted, which ended up killing all sessions on that node abnormally. The Oracle VIP that was running on that node failed over to the third node. The first node was up and running, but it didn't accept connections, because it was trying to register the instance using the LOCAL_LISTENER parameter, which pointed to oranode1-vip, and that VIP was no longer running on the node. We tried to relocate the VIP back to the first node, but that failed because it couldn't be stopped. Every time we tried to stop or relocate it, the clean process started and failed after a few minutes.
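To confirm that the instance on the first node really was trying to register against the relocated VIP, checks along these lines can be run on that node (a sketch; the parameter names are the standard ones, the VIP name is the one from this incident):

SQL> show parameter local_listener
SQL> show parameter remote_listener
db-bash-$ lsnrctl status

If LOCAL_LISTENER resolves to an address on oranode1-vip while that VIP is plumbed on another node, the instance cannot register its services with the local listener.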

Neither support nor we could find any useful information in the clusterware log files. Although there were still 2 instances up and running, the load was so high that they could barely handle all the connections. Ping to oranode1-vip succeeded, but we weren't able to stop the resource even in force mode. We couldn't start it either, because it had neither stopped nor been cleaned up successfully. The status was "enabled" and "not running", yet ping was OK:

db-bash-$ srvctl status vip -i oranode1-vip
VIP oranode1-vip is enabled 
VIP oranode1-vip is not running 
db-bash-$

From the crsctl stat res command we could see that it was OFFLINE and had failed over to node03:

 

db-bash-$ crsctl stat res -t
oranode1-vip  1 OFFLINE UNKNOWN node03

 

And it failed when we tried to start it:

db-bash-$ srvctl start vip -i oranode1-vip   
PRCR-1079 : Failed to start resource oranode1-vip  
CRS-2680: Clean of 'oranode1-vip  ' on 'node03' failed 
CRS-5804: Communication error with agent process

 

We cleared the socket files of the first node from the /var/tmp/.oracle folder, restarted the CRS and checked whether the VIP failed back, but it didn't. Support asked us to stop the second node, clear its socket files and start it again to see if anything changed, but we didn't do that, because a single node wouldn't have been able to handle all the connections.

In the end, we checked the interface of the virtual IP at the OS level and found it still plumbed on node03:

db-bash-$ netstat -win
lan900:805 1500 #### #### 2481604 0 51 0 0

 

Instead of restarting the CRS of the production database (which takes 10 minutes), we decided to bring that interface down at the OS level. For HP-UX, that's the ifconfig … down command.

Before running this command in the production environment, we tried it in the test environment and realized that the down parameter alone is not enough: we had to provide the 0.0.0.0 IP address along with the down parameter to bring the interface down. So we ran the following command:

ifconfig lan900:805 0.0.0.0 down

And it disappeared from the list. Next, we started the VIP using the srvctl start vip command and it succeeded!
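Putting the fix together, the sequence on node03 looked roughly like this (a sketch reconstructed from the steps above; the interface and VIP names are the ones from this incident):

# on node03, as root: unplumb the stale VIP interface and verify it is gone
ifconfig lan900:805 0.0.0.0 down
netstat -win | grep "lan900:805"

# then start and check the VIP through the clusterware
srvctl start vip -i oranode1-vip
srvctl status vip -i oranode1-vip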

Lessons learned:

  • Perform all actions in a test environment first (if you are not sure what can happen) before trying them in the production environment
  • Don’t rush to “restart” or “reboot” the instance, cluster or node. Sometimes it just doesn’t solve your problem. Even after a restart, the system may not start up correctly (because of changed parameters, configurations, etc.)
  • Within 24 hours, our severity #1 SR was assigned to 6 different engineers. It takes a lot of time to gather log files, submit them and have them reviewed by an Oracle engineer before his/her shift changes. Sometimes you simply don’t have time to wait for an answer from Oracle; you have to act on your own and take all the risks. That requires experience.

Investigation on why database doesn’t start after successfully dropping a diskgroup


A few months ago, while performing a storage migration, I faced an interesting issue which could have led to downtime if I hadn't noticed a hidden warning in the log file.

The plan was to create a new ASM diskgroup with normal redundancy using 2 disks from different storage arrays, then test a disk crash and confirm that there would be no data loss if one of the storage arrays failed. After creating the diskgroup, creating a test tablespace on it and corrupting the header of one of the disks, everything was OK, and we decided to drop the diskgroup and start adding the new disks as failgroups to the other diskgroups.
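The test disk group from that plan would have been created along these lines (a sketch with made-up disk group and disk names; one failure group per storage array):

SQL> create diskgroup NEWDG normal redundancy
  2    failgroup fg_storage1 disk '/dev/asm-storage1-disk1'
  3    failgroup fg_storage2 disk '/dev/asm-storage2-disk1';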

 

Below is a scenario I created in my test environment that reproduces the same problem.

  • First of all, I get the location of the controlfiles and datafiles (and of course the redo log files as well) to see which diskgroups contain physical files:

 

SQL> show parameter control
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
control_files                        string      +CFILE2/TESTDB/CONTROLFILE/current.256.1046097231

SQL> select name from v$datafile;
NAME
--------------------------------------------------------------------------------
+DATA/TESTDB/DATAFILE/system.278.1046088963
+DATA/TESTDB/DATAFILE/sysaux.277.1046088795
+DATA/TESTDB/DATAFILE/undotbs1.280.1046089213
+DATA/TESTDB/DATAFILE/undotbs2.288.1046089391
+DATA/TESTDB/DATAFILE/users.279.1046089209
SQL>

 

As you can see, we have 2 diskgroups involved: +CFILE2 and +DATA. Next, I run the srvctl config database command and grep the list of diskgroups used by this database. We see the same two: +CFILE2 and +DATA.

 

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2
-bash-4.1$

 

  • Next, I query the V$ASM_DISKGROUP view to get the list of all diskgroups available in ASM:
SQL> col name format a40
SQL> set linesize 150

 

SQL> select group_number, name, state, type, total_mb, free_mb from v$asm_diskgroup;
GROUP_NUMBER NAME                                     STATE       TYPE     TOTAL_MB    FREE_MB
------------ ---------------------------------------- ----------- ------ ---------- ----------
           4 TESTDG                                   MOUNTED     EXTERN       1019        923
           3 DATA                                     CONNECTED   EXTERN      15342       8239
           1 CFILE2                                   CONNECTED   EXTERN       1019        892
SQL>

 

  • We have three diskgroups: +CFILE2, +DATA and +TESTDG. Next, I will create a new tablespace in the +TESTDG diskgroup so that it becomes part of the database configuration:

 

SQL> create tablespace mytbs datafile '+TESTDG' size 10m;
Tablespace created.
SQL>

 

  • Once I create a tablespace in the new diskgroup, it becomes part of the database configuration and a dependency is established between the database and the diskgroup, which can be seen in the database alert.log file:

 

Alert.log file
Wed Jul 29 09:02:47 2020
create tablespace mytbs datafile '+TESTDG' size 10m
Wed Jul 29 09:02:48 2020
NOTE: ASMB mounting group 4 (TESTDG)
NOTE: Assigning number (4,0) to disk (/dev/asm-disk5)
SUCCESS: mounted group 4 (TESTDG)
NOTE: grp 4 disk 0: TESTDG_0000 path:/dev/asm-disk5
Wed Jul 29 09:02:50 2020
NOTE: dependency between database testdb and diskgroup resource ora.TESTDG.dg is established
Completed: create tablespace mytbs datafile '+TESTDG' size 10m

 

 

  • Output of the ASM alert.log file:

 

Wed Jul 29 09:02:48 2020
NOTE: client testdb1:testdb:rac-scan mounted group 4 (TESTDG)
Wed Jul 29 09:02:49 2020
NOTE: Advanced to new COD format for group TESTDG

 

  • From the output of the crsd.trc file it can be seen that there is a hard dependency between the diskgroup and the database:

 

2020-07-29 09:02:50.015412 :UiServer:204928768: {1:32997:407} Container [ Name: UI_REGISTER
                API_HDR_VER:
                TextMessage[3]
                API_REGUPDATE_TAG:
                TextMessage[1]
                ASYNC_TAG:
                TextMessage[1]
                ATTR_LIST:
TextMessage[MANAGEMENT_POLICY=AUTOMATICSTART_DEPENDENCIES=+hard(ora.TESTDG.dg)+pullup(ora.TESTDG.dg)STOP_DEPENDENCIES=+hard(shutdown:ora.TESTDG.dg)]
                CLIENT:
                TextMessage[]
                CLIENT_NAME:
                TextMessage[Unknown process]
                CLIENT_PID:
                TextMessage[8543]
                CLIENT_PRIMARY_GROUP:
                TextMessage[oinstall]
                LOCALE:
                TextMessage[AMERICAN_AMERICA.AL32UTF8]
                NO_WAIT_TAG:
                TextMessage[1]
                QUEUE_TAG:
                TextMessage[1]
                RESOURCE:
                TextMessage[ora.testdb.db]
                UPDATE_TAG:
                TextMessage[1]
]
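The same dependency information can be read more conveniently from the resource profile, without digging through crsd.trc; a quick sketch (resource name from this example):

-bash-4.1$ crsctl stat res ora.testdb.db -p | grep -E "START_DEPENDENCIES|STOP_DEPENDENCIES"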

 

– Now, to see the new list of diskgroups which are part of the database configuration, we run the following command:

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2,TESTDG
-bash-4.1$

As you can see, diskgroup +TESTDG is now also part of the database configuration. Next, to simulate a storage failure or disk crash, I corrupt the disk of the +TESTDG diskgroup using the dd command as follows:

 

-bash-4.1$ dd if=/dev/zero of=/dev/asm-disk5 bs=1024 count=10000
10000+0 records in
10000+0 records out
10240000 bytes (10 MB) copied, 0.125557 s, 81.6 MB/s
-bash-4.1$

– And check the alert.log file. Once it is detected that the disk of the external-redundancy diskgroup is corrupted, the database instance crashes:

 

Wed Jul 29 09:19:45 2020
USER (ospid: 27939): terminating the instance
Wed Jul 29 09:19:47 2020
Instance terminated by USER, pid = 27939

 

  • And from the alert.log file of the ASM instance, it can be seen that ASM attempts to offline the disk but fails, because the diskgroup has external redundancy:

 

Wed Jul 29 09:19:49 2020
NOTE: SMON did instance recovery for group DATA domain 3
NOTE: SMON detected lock domain 4 invalid at system inc 6 07/29/20 09:19:49
NOTE: SMON starting instance recovery of group TESTDG domain 4 inc 6 (mounted) at 07/29/20 09:19:49
NOTE: SMON will attempt offline of disk 0 - no header
NOTE: cache initiating offline of disk 0 group TESTDG
NOTE: process _smon_+asm1 (5245) initiating offline of disk 0.3916011317 (TESTDG_0000) with mask 0x7e in group 4 (TESTDG) with client assisting
NOTE: initiating PST update: grp 4 (TESTDG), dsk = 0/0xe9699735, mask = 0x6a, op = clear
Wed Jul 29 09:19:49 2020
GMON updating disk modes for group 4 at 14 for pid 18, osid 5245
ERROR: disk 0(TESTDG_0000) in group 4(TESTDG) cannot be offlined because the disk group has external redundancy.
Wed Jul 29 09:19:49 2020
ERROR: too many offline disks in PST (grp 4)

 

  • Now, we try to start the database

 

-bash-4.1$ srvctl start database -d testdb
PRCR-1079 : Failed to start resource ora.testdb.db
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/node1/crs/trace/crsd_oraagent_oracle.trc".

 

It fails because it can't access the datafile located in the failed diskgroup. Here's the output of the trace file:

 

CRS-2674: Start of 'ora.testdb.db' on 'node1' failed
CRS-2632: There are no more servers to try to place resource 'ora.testdb.db' on that would satisfy its placement policy
CRS-5017: The resource action "ora.testdb.db start" encountered the following error:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/node2/crs/trace/crsd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.testdb.db' on 'node2' failed

 

  • Output of alert.log file:

 

Wed Jul 29 09:22:15 2020
Errors in file /u01/app/oracle/diag/rdbms/testdb/testdb1/trace/testdb1_ora_28674.trc:
ORA-01157: cannot identify/lock data file 2 - see DBWR trace file
ORA-01110: data file 2: '+TESTDG/TESTDB/DATAFILE/mytbs.256.1047027769'
ORA-1157 signalled during: ALTER DATABASE OPEN /* db agent *//* {1:32997:676} */...
Wed Jul 29 09:22:17 2020
License high water mark = 1
Wed Jul 29 09:22:17 2020
USER (ospid: 28854): terminating the instance
Wed Jul 29 09:22:18 2020
Instance terminated by USER, pid = 28854

 

  • Next, we offline the datafile and restart the database:

 

SQL> alter database datafile 2 offline;
Database altered.
SQL>

 

-bash-4.1$ srvctl stop database -d testdb -stopoption abort
-bash-4.1$ srvctl start database -d testdb

 

The database is UP! Great! But... we have only solved the physical file dependency problem which was preventing the database from starting. We still have the failed diskgroup in the configuration of the database resource:

 

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2,TESTDG
-bash-4.1$

 

It means that once we restart the clusterware stack, the database resource will NOT start, because it has a hard dependency on a diskgroup that is part of its configuration and is FAILED.

Let’s restart the crs and check the status of the database:

 

-bash-4.1# crsctl stop crs
-bash-4.1# crsctl start crs

 

  • From the output of the ASM alert.log file, it can be seen that ASM tried to mount the diskgroup and failed:

 

Wed Jul 29 09:41:09 2020
ERROR: ALTER DISKGROUP TESTDG MOUNT  /* asm agent *//* {1:42096:2} */
Wed Jul 29 09:41:09 2020

WARNING: Disk Group DATA containing voting files is not mounted
ORA-15032: not all alterations performed
ORA-15017: diskgroup "TESTDG" cannot be mounted
ORA-15040: diskgroup is incomplete
ORA-15017: diskgroup "DATA" cannot be mounted
ORA-15013: diskgroup "DATA" is already mounted

 

  • CRS is up
[root@node1 oracle]# crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@node1 oracle]#

 

  • As we restarted CRS on the first node, the instance is not running on the first node; it is still up on the second node, but it will go down upon the next CRS or node restart.

 

[root@node1 oracle]# srvctl status database -d testdb
Instance testdb1 is not running on node node1
Instance testdb2 is running on node node2
[root@node1 oracle]#

 

  • If we try to start the instance on the first node, it fails:

 

-bash-4.1$ srvctl start instance -d testdb -i testdb1
PRCR-1013 : Failed to start resource ora.testdb.db
PRCR-1064 : Failed to start resource ora.testdb.db on node node1
CRS-2674: Start of 'ora.TESTDG.dg' on 'node1' failed
-bash-4.1$

 

A message appears in the ASM trace file when we try to start the instance:

 

Wed Jul 29 09:44:28 2020
ERROR: ALTER DISKGROUP ALL MOUNT FOR testdb /* asm agent *//* {1:42096:192} *//* incarnation::1*/

 

It's scary! You have a failed diskgroup which doesn't contain ANY physical files, and it will prevent you from starting the database instance, because the database resource depends on it. The only way out is to modify the database resource configuration and remove the diskgroup as follows:

 

-bash-4.1$ srvctl modify database -d testdb -diskgroup DATA,CFILE2

  • Now if we check the crsd.log file, we can see that we have only two diskgroups, +DATA and +CFILE2, with a hard dependency:

 

2020-07-29 09:46:09.329870 :UiServer:2822174464: {1:42096:285} Container [ Name: UI_REGISTER
                API_HDR_VER:
                TextMessage[3]
                API_REGUPDATE_TAG:
                TextMessage[1]
                ATTR_LIST:
                TextMessage[START_DEPENDENCIES=hard(ora.DATA.dg,ora.CFILE2.dg) weak(type:ora.listener.type,global:type:ora.scan_listener.type,uniform:ora.ons,global:ora.gns) pullup(ora.DATA.dg,ora.CFILE2.dg)STOP_DEPENDENCIES=hard(intermediate:ora.asm,shutdown:ora.DATA.dg,shutdown:ora.CFILE2.dg)]
                CLIENT:
                TextMessage[]
                CLIENT_NAME:
                TextMessage[/usr/bin/java]
                CLIENT_PID:
                TextMessage[13981]
                CLIENT_PRIMARY_GROUP:
                TextMessage[oinstall]
                LOCALE:
                TextMessage[AMERICAN_AMERICA.US7ASCII]
                QUEUE_TAG:
                TextMessage[1]
                RESOURCE:
                TextMessage[ora.testdb.db]
                UPDATE_TAG:
                TextMessage[1]
]

 

To make sure it’s successfully modified, run the following command and check the output:

 

-bash-4.1$ srvctl config database -d testdb | grep Disk
Disk Groups: DATA,CFILE2

 

– Now we should be able to start the instance:

-bash-4.1$ srvctl start instance -d testdb -i testdb1
-bash-4.1$

 

  • Output of the alert.log file
Wed Jul 29 09:47:30 2020
AQPC started with pid=55, OS id=14549
Starting background process CJQ0
Completed: ALTER DATABASE OPEN /* db agent *//* {1:42096:364} */

 

What I faced that night was that the diskgroup had been dropped successfully from ASMCA, but in the crsd.log file the hard dependency had not been removed from the clusterware configuration, and I decided not to restart the CRS, suspecting that the database would not start because of this dependency. The diskgroup was already empty, containing no physical datafiles; it was dismounted and dropped successfully, but its hard dependency from the database resource was not changed, probably because of a bug. This means that if we had rebooted both nodes or restarted the CRS after dropping the diskgroup, the database wouldn't have started, and that would have led to downtime.

Lessons learned:

  • Make sure to check the alert.log files of the database and the ASM instance, as well as the cluster log and trace files, whenever you perform any change (even dropping a diskgroup in the production environment, and even if it succeeded)
  • After making a cluster-level change, make sure to restart the CRS or even perform a node reboot to verify that everything is still OK after the change.
  • Don’t stop the entire database. Restart the CRS or the database instances in a rolling fashion, and make sure you have at least one instance available at all times.

 
