Wednesday, August 17, 2016

Recovering the OCR by using Physical Backups

21) Recovering the OCR by using Physical Backups

1. Locate a physical backup
$ocrconfig -showbackup

2. Stop the Oracle Clusterware stack on all nodes
#crsctl stop cluster -all

3. Stop the Oracle High Availability Services daemon on all nodes
#crsctl stop crs

4. Restore the physical OCR backup
#ocrconfig -restore /u01/app/11.2.0/../cdata/cluster01/day.ocr

5. Restart Oracle High Availability Services on all nodes
#crsctl start crs

6. Check the OCR integrity
$cluvfy comp ocr -n all
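As an extra sanity check after the restore (a minimal sketch; run as root where the # prompt is shown), the restored OCR and the stack on each node can also be verified with:
#ocrcheck
#crsctl check crs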

Wednesday, August 10, 2016

NODE EVICTION TROUBLESHOOTING

https://www.youtube.com/watch?v=S1zqpD8IwvY
##################################################################
# 1) Which steps should be taken to troubleshoot the node eviction?
# 2) What are the most common causes of node evictions?
# 3) What are the best practices to avoid node evictions?
# 4) Resources & References Links:
##################################################################
References:
- Troubleshooting 11.2 Clusterware Node Evictions (Reboots) (Doc ID 1050693.1)
- Top 5 Issues That Cause Node Reboots or Evictions or Unexpected Recycle of CRS (Doc ID 1367153.1)
- Node Eviction due to OLOGGERD High CPU (Doc ID 1636942.1)
- RAC Node Eviction Troubleshooting Tool (Doc ID 1549954.1)
- ORA-15038 On Diskgroup Mount After Node Eviction (Doc ID 555918.1)
- Frequent Instance Eviction in 9i or Node Eviction in 10g/11g (Doc ID 461662.1)
- Bug 16562733 - CSSD crash / node eviction due to failed IO against the voting disk (Doc ID 16562733.8)
- Database Instance Hung and One Of RAC Nodes Rebooted as Membership Kill Escalated (Doc ID 1493110.1)
- 11gR2 CSS Terminates/Node Eviction After Unplugging one Network Cable in Redundant Interconnect Environment (Doc ID 1481481.1)
- Unnecessary Host entries in Oracle Solaris Cluster cause Oracle RAC node evictions (Doc ID 1011142.1)
=======
STEPS:
=======
1. Look at the ocssd.log files on both nodes; usually we will get more information on the surviving node when the other node is evicted.
Also take a look at the crsd.log file.
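For example, on 11gR2 these logs normally live under the Grid Infrastructure home (a sketch; GRID_HOME and the hostname directory are assumptions about your layout):
$cd $GRID_HOME/log/`hostname -s`
$less cssd/ocssd.log
$less crsd/crsd.log
$grep -iE "eviction|fatal|reboot" cssd/ocssd.log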

2. The evicted node will usually have a core dump file generated and reboot information in the system logs.

3. Find out whether there was a node reboot and whether it was caused by CRS or something else; check the system reboot time.
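On Linux, for instance, the reboot time can be confirmed with (platform-specific commands, shown only as a sketch):
$uptime
$who -b
$last reboot | head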

4. If you see "Polling" messages with decreasing heartbeat percentage values in the ocssd.log file, the eviction is probably due to the network (missed network heartbeats).
If you see disk ping timeouts or other DISK-related messages, the eviction is because of a voting disk timeout.

5. After narrowing it down to a network or a disk issue, start going deeper into that area.

6. Now it's time to collect NMON/OSWatcher/RDA reports to confirm whether it was a disk issue or a network issue.
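If OSWatcher is not already running, it can be started beforehand so the data exists for the next occurrence (a sketch using the usual startOSWbb.sh arguments of a 30-second interval and 48 hours of archive; the install path is an assumption):
$cd /opt/oswbb
$nohup ./startOSWbb.sh 30 48 &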

7. If the reports show heavy memory contention/paging, collect an AWR report to see what load/SQL was running during that period.
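The AWR report for the snapshots around the eviction time can be generated from SQL*Plus with the standard script (it prompts for the begin/end snapshot IDs):
$sqlplus / as sysdba
SQL>@?/rdbms/admin/awrrpt.sql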

8. If the network was the issue, check whether any NIC cards went down or a link failover happened, and check that the private interconnect is working between the nodes.
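A few commands that help confirm the interconnect definition and link state (the interface name eth1 is only an example; use whatever oifcfg reports as the cluster_interconnect, and ping the other node's private IP):
$oifcfg getif
#ethtool eth1
$ping -c 5 <private IP of the other node>
$netstat -s | grep -iE "reassembl|retrans"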

9. Sometimes an eviction can also be due to an OS problem where the system sits in a hung/halted state for a while, memory is overcommitted, or CPU is at 100%.

10. Check the OS/system log files to get more information.
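On Linux that typically means something like the following (adjust paths for AIX/Solaris; saXX is the sysstat file for the day of the eviction):
#grep -iE "error|panic|oom|eviction" /var/log/messages
#dmesg | tail -50
$sar -u -f /var/log/sa/saXX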

11. Find out what got changed recently. Ask a coworker to open a ticket (SR) with Oracle Support and upload the logs.

12. Check the health of the clusterware, DB instances, and ASM instances, the uptime of all hosts, and all the logs: ASM logs, Grid logs, CRS and ocssd.log,
HAS logs, EVM logs, DB instance logs, OS logs, and SAN logs for that particular timestamp.
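A minimal health-check pass could look like this (ORCL is a placeholder database name):
$crsctl check cluster -all
$crsctl stat res -t
$olsnodes -s -t
$srvctl status asm
$srvctl status database -d ORCL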

13. Check the health of the interconnect if the error logs guide you in that direction.

14. Check the OS memory and CPU usage if the error logs guide you in that direction.

15. Check the storage error logs if the other logs guide you in that direction.

16. Run TFA and OSWatcher, and capture NETSTAT and IFCONFIG output, etc., based on the error messages during your RCA.
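For example (assuming TFA is installed, as it is with 11.2.0.4 and later Grid Infrastructure; a plain diagcollect gathers the default time window from all nodes):
#tfactl diagcollect
$netstat -s > netstat_`hostname -s`.txt
$ifconfig -a > ifconfig_`hostname -s`.txt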

17. We have seen node eviction because iptables had been enabled; after iptables was turned off, everything went back to normal.
Avoid enabling firewalls between the nodes.
An ACL can open the ports on the interconnect, as we did, but we still experienced all kinds of issues
(unable to start CRS, unable to stop CRS, and node eviction).
We also had a problem with the voting disk caused by presenting LDEVs using business copies / ShadowImage, which made RAC less than happy.
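On RHEL/OEL 5 and 6 the firewall state can be checked and disabled like this (follow your own change control; other platforms differ):
#service iptables status
#service iptables stop
#chkconfig iptables off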

18. Verify user equivalence (password-less SSH) between the cluster nodes; see the sketch after this list.
19. Verify the switch is used only for the interconnect. DO NOT USE the same switch for other network operations.
20. Verify all nodes have 100% the same configuration; sometimes there are network or configuration differences that are not obvious.
Look for hangs in the logs and in monitoring tools like Nagios to see whether a node ran out of RAM or became unresponsive.
21. A major reason for node evictions on our cluster, however, was the patch levels not being equal across the two nodes.
Nodes sometimes completely died, without any error whatsoever. It turned out to be a bug in the installer of the 11.1.0.7.1 PSU,
which patched only the local node. After we patched both nodes to 11.1.0.7.4, these problems did not recur on IBM AIX RAC 11gR1 (comparing patch levels is sketched below).
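A quick sketch for items 18 and 21 above (racnode2 is a placeholder node name): user equivalence should allow a password-less ssh as the Grid/Oracle owner, and the patch inventory should be compared on every node.
$ssh racnode2 date
$cluvfy comp admprv -n all -o user_equiv
$cd $GRID_HOME/OPatch
$./opatch lsinventory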
=====================================
SOME COMMON CAUSES OF NODE EVICTIONS:
=====================================
1. High CPU/resource consumption on a node, which prevents processes like ocssd, oprocd (10.2/11.1), the cssdagent, or cssdmonitor from executing.
2. Too little memory available at the OS level.
3. Bad HugePages settings.
4. Bad network or interconnect delays, heartbeat failures, loss of network connectivity between nodes.
5. Storage hang (failed I/O against the voting disk).
6. An Oracle bug hit while performing a CRS upgrade (unhealthy CSS); check MOS first for possible bugs with your particular version.
7. All NICs configured in the same subnet.
8. Unnecessary host entries.
9. Sometimes after a driver update.
10. The loss of a majority of the voting disks required to maintain quorum.
11. An increase in the size of the control file caused instability and instance/node evictions on 10.2.0.4.
12. Instances waiting on the control file enqueue were timing out, triggering instance and, in some cases, node evictions.
<<<**Note**>>: The control file in this case was very large, with a large number of records.
The fix of recreating the control file, setting the control_file_record_keep_time parameter to 1, and
increasing the online log file size (fewer log switches, fewer archivelog records and RMAN backupset records, etc.) helped reduce occurrences of this problem; checking the record sections is sketched below.
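A sketch of how to check the control file record sections and apply the keep-time fix mentioned above (test the parameter change in your own environment first):
$sqlplus / as sysdba
SQL>select type, records_used, records_total from v$controlfile_record_section order by records_used desc;
SQL>alter system set control_file_record_keep_time=1 scope=both;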