Solaris Fault Manager

July 1st, 2019, posted in Solaris
Fault Manager is part of the Solaris self-healing functionality that provides fault isolation and component restart, in this case for hardware components (SMF takes care of software components).

Make sure that the fmd service is running and that the required packages are installed.
# pkginfo |grep fmd
system      SUNWfmd         Fault Management Daemon and Utilities
system      SUNWfmdr        Fault Management Daemon and Utilities (Root) 
# svcs fmd
STATE          STIME    FMRI
online         Jun_29   svc:/system/fmd:default
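
The service check above can also be scripted. This is a minimal sketch, assuming the default `svcs fmd` column layout shown above (state first, FMRI last) and a POSIX-style shell such as ksh or bash; `fmd_online` is a hypothetical helper name, not a Solaris command.

```shell
# fmd_online: hypothetical helper that reads `svcs fmd` output on stdin.
# Succeeds (exit 0) only if svc:/system/fmd:default is listed as online.
fmd_online() {
    while read -r state _stime fmri; do
        if [ "$state" = "online" ] && [ "$fmri" = "svc:/system/fmd:default" ]; then
            return 0
        fi
    done
    return 1
}

# Usage on a live system (not run here):
#   svcs fmd | fmd_online && echo "fmd is online"
```

Reading the listing from stdin keeps the helper easy to try against saved output before pointing it at a real system.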



Display Fault Manager Configuration:
# fmadm config
MODULE                   VERSION STATUS  DESCRIPTION
cpumem-diagnosis         1.6     active  CPU/Memory Diagnosis
cpumem-retire            1.1     active  CPU/Memory Retire Agent
eft                      1.16    active  eft diagnosis engine
fmd-self-diagnosis       1.0     active  Fault Manager Self-Diagnosis
io-retire                1.0     active  I/O Retire Agent
sysevent-transport       1.0     active  SysEvent Transport Agent
syslog-msgs              1.0     active  Syslog Messaging Agent
zfs-diagnosis            1.0     active  ZFS Diagnosis Engine
zfs-retire               1.0     active  ZFS Retire Agent



For example, the kernel sends an error to FMD, and FMD forwards the error to a module. There are two types of modules:
1. Diagnosis engines : provide a diagnosis based on symptoms
2. Agents : respond to a given diagnosis and take action, say take a faulty CPU offline

The Fault Manager maintains two log files:
1. error log - list of errors sent to the Fault Manager daemon
2. fault log - list of diagnosed and repaired problems

See fault log with:
# fmdump

See error log with:
# fmdump -e

Tips:
-u - limits the output to a specific UUID
-T - displays events that occurred at or BEFORE a specific time (yyyy-mm-dd)
-t - displays events that occurred at or AFTER a specific time (yyyy-mm-dd)
-V - verbose output

Run the command below to see whether Fault Manager reports any failed resources. In this example we see that memory module DIMM 3 failed.


# fmadm faulty
--------------- ------------------------------------  -------------- ---------
TIME            EVENT-ID                              MSG-ID         SEVERITY
--------------- ------------------------------------  -------------- ---------
Jun 23 02:30:30 2578e639-38cd-4cd8-9c16-87e96116f41e  AMD-8000-2F    Major

Fault class : fault.memory.dimm_sb
Affects     : mem:///motherboard=0/chip=1/memory-controller=0/dimm=3/rank=0
                  degraded but still in service
FRU         : "CPU 1 DIMM 3" (hc://:product-id=Sun-Fire-X4200-Server:chassis-id=0000000000:server-id=oryx/motherboard=0 
		/chip=1/memory-controller=0/dimm=3)

Description : The number of errors associated with this memory module has
              exceeded acceptable levels.  Refer to
              http://sun.com/msg/AMD-8000-2F for more information.

Response    : Pages of memory associated with this memory module are being
              removed from service as errors are reported.

Impact      : Total system memory capacity will be reduced as pages are
              retired.

Action      : Schedule a repair procedure to replace the affected memory
              module.  Use fmdump -v -u <EVENT_ID> to identify the module.
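
To pass the event ID to `fmdump` without copying it by hand, the UUID can be pulled out of the `fmadm faulty` listing. This is a minimal sketch, assuming the table layout shown above (month, day, time, then the UUID as the fourth field of each event row) and a POSIX-style shell; `faulty_event_ids` is a hypothetical helper name.

```shell
# faulty_event_ids: hypothetical helper that reads `fmadm faulty` output on
# stdin and prints one EVENT-ID per line. A row is treated as an event row
# only if its fourth field has the 8-4-4-4-12 shape of a UUID.
faulty_event_ids() {
    while read -r _mon _day _time id _rest; do
        case "$id" in
            ????????-????-????-????-????????????) printf '%s\n' "$id" ;;
        esac
    done
}

# Usage on a live system (not run here):
#   fmadm faulty | faulty_event_ids | while read -r id; do
#       fmdump -v -u "$id"
#   done
```

Matching on the shape of the field rather than on a line number means the header, separator, and description lines are skipped without special handling.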



Note that the description includes a link with more information (a knowledge base article); follow it for the resolution steps. Say you are replacing the DIMM now. Once the DIMM is replaced, you need to update the resource cache to indicate that the issue is resolved.
# fmadm repair 2578e639-38cd-4cd8-9c16-87e96116f41e
fmadm: recorded repair to 2578e639-38cd-4cd8-9c16-87e96116f41e



Reset the Fault Manager module. If you don't know which one, the web link mentioned earlier will tell you.
# fmadm reset eft
fmadm: eft module has been reset



Verify that there are no more faulty resources.
# fmadm faulty

No output, super! That means there is no h/w issue!
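
The empty-output check can be automated, for example in a periodic health check. This is a minimal sketch, assuming that `fmadm faulty` prints nothing when no faults are outstanding, as observed above; `no_faults` is a hypothetical helper name.

```shell
# no_faults: hypothetical helper that reads `fmadm faulty` output on stdin.
# Succeeds (exit 0) only if the input contains no non-blank lines.
no_faults() {
    # The `|| [ -n "$line" ]` part also catches a final line with no newline.
    while read -r line || [ -n "$line" ]; do
        if [ -n "$line" ]; then
            return 1
        fi
    done
    return 0
}

# Usage on a live system (not run here):
#   fmadm faulty | no_faults && echo "no outstanding faults"
```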