VXVM – Some general troubleshooting


“No valid disk found containing disk group” message. vxdisk -o alldgs list shows all disks but you can’t import it – what could be the issue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Things to check:
1. Check the udid of the disk as Veritas sees it (vxdisk list fabric_0 | grep udid) and compare it with the actual udid on the array. If they differ, reboot the system to pick up the new disks.
2. Check the number of enabled config copies on each disk in the disk group. If no disk has a config copy with state=enabled, the disk group has no valid configuration to import. Set nconfig=all on the disk group.
3. Try importing after clearing the import lock:
# vxdg -C -f import testdg
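
The escalation in step 3 can be sketched as a small wrapper that tries a plain import first and only then adds the lock-clearing and force flags. This is a minimal sketch, assuming a POSIX shell such as ksh or bash; the function name try_import and the VXDG override variable are illustrative and not part of VxVM.

```shell
#!/bin/sh
# Hypothetical helper: try progressively more forceful imports.
# VXDG is an illustrative override so the logic can be dry-run with a
# stand-in command; it defaults to the real vxdg.
try_import() {
    dg=$1
    for opts in "" "-C" "-C -f"; do
        echo "trying: vxdg $opts import $dg"
        # $opts is intentionally unquoted so "-C -f" splits into two flags
        if ${VXDG:-vxdg} $opts import "$dg"; then
            echo "imported $dg"
            return 0
        fi
    done
    echo "import of $dg failed with all options" >&2
    return 1
}
```

On a live system this would be called as "try_import testdg". Stopping at the first option that works avoids forcing an import that did not need it.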

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
vxdisk list shows dgdisabled and other disks in the same disk group are not imported – how to resolve this?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unmount the file systems within this DG, then deport and import the disk group. If this doesn't work, the only option is to reboot the system.

# vxdisk -o alldgs list
DEVICE       TYPE            DISK         GROUP        STATUS
fabric_6     auto:sliced     c90t53d3     dg_test1     online dgdisabled
fabric_7     auto:sliced     -            (dg_test1)   online
# vxdg deport dg_test1
# vxdisk -o alldgs list
fabric_6     auto            -            -            error
fabric_7     auto:sliced     -            (dg_test1)   online
# vxdg import dg_test1
VxVM vxdg ERROR V-5-1-10978 Disk group dg_test1: import failed:
No valid disk found containing disk group
# reboot -- -r
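
To spot affected disks quickly, the listing can be filtered for the dgdisabled flag. The sketch below is illustrative only: list_dgdisabled is a hypothetical helper name, and it assumes the five-column layout shown in the sample output above.

```shell
#!/bin/sh
# List disks whose disk group is in the dgdisabled state.
# stdin: output of `vxdisk -o alldgs list`; prints "device group".
list_dgdisabled() {
    awk '/dgdisabled/ { print $1, $4 }'
}

# Demo on the sample output from this section (no VxVM needed).
list_dgdisabled <<'EOF'
DEVICE       TYPE            DISK         GROUP        STATUS
fabric_6     auto:sliced     c90t53d3     dg_test1     online dgdisabled
fabric_7     auto:sliced     -            (dg_test1)   online
EOF
```

On a live system the same filter would be fed with "vxdisk -o alldgs list | list_dgdisabled".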

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to remove the ghost entry of a removed disk “failed was:c1t1d1s2”?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If the disk has recovered (for example, it has been re-presented from the array), you can recover it using “vxreattach Disk_0”.
If the disk does not reattach and reports a Serial Split Brain condition, advising you to run with -o overridessb, that is the bad case. The only way to recover is to remove the disk and add it back as below:

# /etc/vx/bin/vxreattach -b fabric_6
VxVM vxdg ERROR V-5-1-10127 associating disk-media c90t10d4 with fabric_6:
Serial Split Brain detected. Use -ooverridessb to reattach the disk/site or run vxsplitlines to import 
the diskgroup

Remove the subdisks/plexes from disk and remove the disk from dg.

Disassociate the disabled plex
# vxplex -g dg_smsprd1 dis smsprd1_log
Remove the plex and subdisks
# vxedit -g dg_smsprd1 -rf rm smsprd1_log_4-02
Remove the disk from diskgroup
# vxdg -g dg_smsprd1 rmdisk c90t60d1
Initialise the disk
# vxdisk -f init fabric_4 privoffset=1 privlen=81663 puboffset=0 publen=10354688 format=sliced
Add the disk back into diskgroup
# vxdg -g dg_smsprd1 adddisk c90t60d1=fabric_4

OR

If the disk has failed completely and you have removed it – Remove the disk from DG. If there are no objects, it should succeed.

OR

Find the list of disks which needs to be removed:

# vxprint -g dg_name -d -F "%{name} %{assoc}"
c90t10d1 -
c90t50d2 c3t2d10s2
c90t60d2 c3t2d11s2
c90t70d1 -

Remove the disks which have no access name
# vxdg -g dg_name -o override rmdisk

If it says it has the volumes associated, run above command with -k option:
# vxdg -g dg_name -o override -k rmdisk

After this, vxdisk list will show them “removed was:c1t1d1s2”
Remove it now using vxdiskadm, option 3.
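
Picking out the disks with no access name can be scripted from the vxprint listing above. This is a sketch under assumptions: plan_rmdisk is a hypothetical helper, it only prints the commands rather than running them, and it assumes the disk media name from the first column is the argument rmdisk expects.

```shell
#!/bin/sh
# Print a dry-run rmdisk command for every disk-media record that has
# no access name ("-" in the second column).
# stdin: output of  vxprint -g DG -d -F "%{name} %{assoc}"
# $1:    disk group name
plan_rmdisk() {
    dg=$1
    awk '$2 == "-" { print $1 }' | while read -r dm; do
        echo "vxdg -g $dg -o override rmdisk $dm"
    done
}

# Demo on the sample listing from this section.
plan_rmdisk dg_name <<'EOF'
c90t10d1 -
c90t50d2 c3t2d10s2
c90t60d2 c3t2d11s2
c90t70d1 -
EOF
```

Review the printed commands before running any of them; add -k only if VxVM reports associated volumes.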

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to disable boot from vxvm and start it manually?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Boot into single user mode
2. Edit /etc/system and comment out the VxVM parameters as follows:

      *rootdev:/pseudo/vxio@0:0
      *set vxio:vol_rootdev_is_volume=1

3. cd /etc/vx/reconfig.d/state.d/; rm *; touch install-db
(This should remove root-done; and prevent vxvm from starting)
4. cp -p /etc/vfstab /etc/vfstab.bak; cp -p /etc/vfstab.prevm /etc/vfstab
(restore original vfstab)
5. init 6
6. After the system is up, start the Volume Manager service manually as follows
    # vxiod set 10
    # ps -ef | grep vxconfigd    (if vxconfigd is not running, run "/usr/sbin/vxconfigd -m disable")
    # vxdctl mode                (should report disabled mode)
    # vxdctl init
    # vxdctl enable
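
The /etc/system edit in step 2 can be done with sed instead of by hand. A minimal sketch, assuming the two entries appear exactly as shown in step 2; comment_vx_rootdev is a hypothetical helper name, and it writes to stdout so the original file is left untouched.

```shell
#!/bin/sh
# Comment out the two VxVM root-device entries in an /etc/system-style
# file. "*" starts a comment in /etc/system. The result goes to stdout
# so the original file is never modified in place.
# $1: path of the file to read
comment_vx_rootdev() {
    sed -e 's|^rootdev:/pseudo/vxio@0:0|*&|' \
        -e 's|^set vxio:vol_rootdev_is_volume=1|*&|' "$1"
}
```

For example, "comment_vx_rootdev /etc/system > /tmp/system.new" produces a copy that can be reviewed before it replaces the original. Lines that are already commented out do not match and are passed through unchanged.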

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to recreate diskgroup info?~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. # vxprint -mpvsh -g DISKGROUP >DISKGROUP.out
2. Destroy the diskgroup
3. Create the diskgroup with the same disk names
4. Edit DISKGROUP.out and change the disknames manually if needed
5. # vxmake -g DISKGROUP -d DISKGROUP.out (to rebuild the config in one go)
6. All volumes should now be defined and in DISABLED/EMPTY state; plexes should be in DISABLED/EMPTY state; subdisks should be in ENABLED/ACTIVE state
7. Init and start the volume as below:
# vxvol -g dg_dodgeprd4 init active dodgeprd4_data_1
This command will init the volume to active (start the plexes and volumes)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Metasave and file system corruption
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

DISCLAIMER: I might have copied steps mentioned below from some site while googling a long time ago. I don’t
want to take any credit for steps mentioned below.

When file system complains of corruption, try the following steps.
1. Run file system check utility once on the affected file system and verify if that resolves the issue using –
# fsck -F vxfs /dev/vx/rdsk//
If the above command doesn’t resolve the issue, try:
# fsck -F vxfs -y -o full,nolog /dev/vx/rdsk//
2. Unmount and mount the file system again and verify.
3. Verify if you are able to see the VXFS file system header using –
# /opt/VRTS/bin/fstyp -v /dev/vx/rdsk//
4. Verify if you can see the “lost+found” folder and its content as expected:
# cd /
# cd lost+found
# ls -l

5. When a file system becomes corrupted and the reasons for corruption are unknown, collect a metadata image of a corrupted file system to investigate why corruption happened. The metadata can be captured using a tool called metasave. Metasave is included in the VRTSspt package, which comes with the product CDs and is also available from ftp.veritas.com. The /opt/VRTSspt/FS/MetaSave directory may contain more than one metasave binary, depending on the operating system. For example, on Solaris there are:
metasave_5.8
metasave_5.9
metasave_5.10

To save metadata from a file system, the corrupted file system needs to be unmounted (if it is still mounted). Run the appropriate metasave binary, such as on Sun Solaris 10 systems:
# metasave_5.10 -f /dev/vx/rdsk//

The file created by this command can be quite large (depending on the size of the file system and the number of files), so it should be written to a file system with enough free space. Afterwards, compress the file; metasave output usually compresses very well.
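
Choosing the right metasave binary can be tied to the OS release string. The following is a sketch, not a Veritas tool: pick_metasave is a hypothetical helper, and it assumes the binaries are named metasave_<release> as in the list above, where <release> matches the output of "uname -r" (5.10 on Solaris 10).

```shell
#!/bin/sh
# Pick the metasave binary that matches an OS release string.
# $1: directory holding the binaries (e.g. /opt/VRTSspt/FS/MetaSave)
# $2: OS release as printed by `uname -r` (e.g. 5.10)
# Prints the full path if an executable exists, otherwise returns 1.
pick_metasave() {
    bin="$1/metasave_$2"
    if [ -x "$bin" ]; then
        echo "$bin"
    else
        echo "no metasave binary for release $2 in $1" >&2
        return 1
    fi
}
```

A typical call would be: pick_metasave /opt/VRTSspt/FS/MetaSave "$(uname -r)"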

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to recover from splitbrain error while trying to import?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Depending on the configuration, one, many, or all disks in a disk group store the disk group configuration. When these copies disagree, a split brain condition is reported on import. Try the following steps:
– Decide which disk has the valid config. If you can’t decide yet, you can decide after running vxsplitlines with different disk IDs
– Run vxsplitlines -g DG to find out the problem
– Run vxdisk list on the good disk and note down its Disk ID
– Run vxsplitlines -g DG -c DISKID to get the exact mismatch
– Import the disk group with
# vxdg -o overridessb -o selectcp=DISKID import DG

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to fix a volume that has a plex in DISABLED/RECOVER state?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One of the plexes is in DISABLED/RECOVER state and the other one is ENABLED/ACTIVE.


v  testvol -         ENABLED  ACTIVE   32768000 SELECT    -        gen
pl testvol-01   testvol  ENABLED ACTIVE 32768000 CONCAT -     RW
sd c92t52d1-01  testvol-01 c92t52d1 0  32768000 0         Disk_17  ENA
pl testvol-02   testvol DISABLED RECOVER 32768000 CONCAT -   RW
sd c92t55d1-57  testvol-02 c92t55d1 1707622400 32768000 0 Disk_12  ENA

Force the plex into OFFLINE state:

# vxmend -g testdg -o force off testvol-02 (DISABLED/OFFLINE)

v  testvol -         ENABLED  ACTIVE   32768000 SELECT    -        gen
pl testvol-01 testvol ENABLED ACTIVE 32768000 CONCAT -     RW
sd c92t52d1-01  testvol-01 c92t52d1 0  32768000 0         Disk_17  ENA
pl testvol-02 testvol DISABLED OFFLINE 32768000 CONCAT -   RW
sd c92t55d1-57  testvol-02 c92t55d1 1707622400 32768000 0 Disk_12  ENA

Place into STALE state:

# vxmend -g testdg on testvol-02 (DISABLED/STALE)

v  testvol -         ENABLED  ACTIVE   32768000 SELECT    -        gen
pl testvol-01 testvol ENABLED ACTIVE 32768000 CONCAT -     RW
sd c92t52d1-01  testvol-01 c92t52d1 0  32768000 0         Disk_17  ENA
pl testvol-02 testvol DISABLED STALE 32768000 CONCAT -     RW
sd c92t55d1-57  testvol-02 c92t55d1 1707622400 32768000 0 Disk_12  ENA

If there are other ACTIVE or CLEAN plexes in the volume, reattach the STALE plex to the volume (even though it is still associated with the volume). If the volume is already ENABLED, resynchronisation of the plex starts immediately, but the command does not return until synchronisation completes.

# vxplex -g testdg att testvol testvol-02
# vxprint testvol
v  testvol gen       ENABLED  32768000 -        ACTIVE   -       -
pl testvol-01 testvol ENABLED 32768000 - ACTIVE  -       -
sd c92t52d1-01  testvol-01 ENABLED 32768000 0   -        -       -
pl testvol-02 testvol ENABLED 32768000 - ACTIVE  -       -
sd c92t55d1-57  testvol-02 ENABLED 32768000 0   -        -       -

If there are no other ACTIVE or CLEAN plexes in the volume, make the plex CLEAN

# vxmend -g testdg fix clean testvol-02 (DISABLED/CLEAN)

If the volume is not ENABLED, use the following command to start it and perform any resynchronisation of the plexes in the background:
# vxvol -g testdg -o bg start testvol
(If the data in the plex was corrupted, and the volume has no ACTIVE or CLEAN redundant plexes from which its contents can be resynchronized, it must be restored from a backup or from a snapshot image)
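
The choice between reattaching the plex and marking it CLEAN depends on whether another ACTIVE or CLEAN plex exists. The sketch below encodes that decision; suggest_plex_fix is a hypothetical helper that only prints a suggestion. It assumes the "vxprint -ht" column order shown in this section, input for a single volume, and an awk that supports -v (nawk on Solaris).

```shell
#!/bin/sh
# Suggest the next command for a STALE plex.
# stdin: `vxprint -ht` records for ONE volume
# $1: disk group, $2: name of the STALE plex
suggest_plex_fix() {
    awk -v dg="$1" -v target="$2" '
        # remember which volume the target plex belongs to
        $1 == "pl" && $2 == target { vol = $3 }
        # count the other plexes that hold good data
        $1 == "pl" && $2 != target && ($5 == "ACTIVE" || $5 == "CLEAN") { good++ }
        END {
            if (vol == "") { print "plex " target " not found"; exit 1 }
            if (good > 0)
                print "vxplex -g " dg " att " vol " " target
            else
                print "vxmend -g " dg " fix clean " target
        }'
}

# Demo on the STALE listing from this section.
suggest_plex_fix testdg testvol-02 <<'EOF'
v  testvol -         ENABLED  ACTIVE   32768000 SELECT    -        gen
pl testvol-01 testvol ENABLED ACTIVE 32768000 CONCAT -     RW
pl testvol-02 testvol DISABLED STALE 32768000 CONCAT -     RW
EOF
```

Because testvol-01 is ACTIVE in the sample, the suggestion is the attach command; with no good plex it falls back to "fix clean".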

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to get a volume working if it is in “DETACHED DETACH” state?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The volume went into this state because the underlying plexes went offline, putting the volume into maintenance mode (no I/O). This gives you a chance to “enable active” an individual plex to figure out which plex is clean. If you know for sure which plex is clean, you can recover using “vxvol start”.

# vxvol -g testdg start testvol

Different scenarios where volumes were in different state before they were recovered using vxvol start

Scenario 1


# vxprint testvol
v  testvol fsgen      DETACHED 1048444928 -      DETACH   -       -
pl testvol-01 testvol ENABLED 1048444928 - ACTIVE  -       -
sd c92t72d1-01  testvol-01 ENABLED 1048444928 0  -        -       -
pl testvol-02 testvol DISABLED 1048444928 - IOFAIL -       -
sd c90t70d1-01  testvol-02 ENABLED 1048444928 0  RELOCATE -       -
# vxplex -g testdg  dis  testvol-02
# vxvol -g testdg start testvol
# vxprint testvol
v  testvol fsgen      ENABLED  1048444928 -      ACTIVE   -       -
pl testvol-01 testvol ENABLED 1048444928 - ACTIVE  -       -
sd c92t72d1-01  testvol-01 ENABLED 1048444928 0  -        -       -

Now attach the plex back to volume. It should start synchronising again.

Scenario 2


# vxprint testvol
v  testvol gen       DETACHED 409600   -        DETACH   -       -
pl testvol-01 testvol DISABLED 409600 - RECOVER  -       -
sd c92t58d1-25  testvol-01 ENABLED 409600 0     -        -       -
pl testvol-02 testvol ENABLED 409600 -  ACTIVE   -       -
sd c92t52d1-66  testvol-02 ENABLED 409600 0     -        -       -
# vxvol -g testdg start testvol
# vxprint testvol
v  testvol gen       ENABLED  409600   -        ACTIVE   -       -
pl testvol-01 testvol ENABLED 409600 -  ACTIVE   -       -
sd c92t58d1-25  testvol-01 ENABLED 409600 0     -        -       -
pl testvol-02 testvol ENABLED 409600 -  ACTIVE   -       -
sd c92t52d1-66  testvol-02 ENABLED 409600 0     -        -       -

Scenario 3


# vxprint testvol_3
TY NAME         ASSOC        KSTATE   LENGTH   PLOFFS   STATE    TUTIL0  PUTIL0
v  testvol_3 gen     DETACHED 409600   -        DETACH   -       -
pl testvol_3-01 testvol_3 DISABLED 409600 - IOFAIL -     -
sd c92t58d1-27  testvol_3-01 ENABLED 409600 0   -        -       -
pl testvol_3-02 testvol_3 ENABLED 409600 - ACTIVE -      -
sd c92t52d1-68  testvol_3-02 ENABLED 409600 0   -        -       -
# vxvol -g testdg start testvol_3
# vxprint testvol_3
v  testvol_3 gen     ENABLED  409600   -        ACTIVE   -       -
pl testvol_3-01 testvol_3 ENABLED 409600 - ACTIVE -      -
sd c92t58d1-27  testvol_3-01 ENABLED 409600 0   -        -       -
pl testvol_3-02 testvol_3 ENABLED 409600 - ACTIVE -      -
sd c92t52d1-68  testvol_3-02 ENABLED 409600 0   -        -       -
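
Before running vxvol start it helps to see the plex states of every detached volume side by side. A minimal sketch, assuming the default vxprint column order used in the scenarios above (TY NAME ASSOC KSTATE LENGTH PLOFFS STATE); detached_plexes is a hypothetical helper name.

```shell
#!/bin/sh
# For every volume in DETACHED kernel state, print one line per plex:
#   volume plex kstate state
# stdin: default-format vxprint output
detached_plexes() {
    awk '
        $1 == "v"  { detached = ($4 == "DETACHED"); next }
        $1 == "pl" && detached { print $3, $2, $4, $7 }'
}

# Demo on the Scenario 1 listing.
detached_plexes <<'EOF'
v  testvol fsgen      DETACHED 1048444928 -      DETACH   -       -
pl testvol-01 testvol ENABLED 1048444928 - ACTIVE  -       -
sd c92t72d1-01  testvol-01 ENABLED 1048444928 0  -        -       -
pl testvol-02 testvol DISABLED 1048444928 - IOFAIL -       -
sd c90t70d1-01  testvol-02 ENABLED 1048444928 0  RELOCATE -       -
EOF
```

On a live system this would be fed with "vxprint -g testdg | detached_plexes".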

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A volume with 2 plexes – one plex in RECOVER state and the other in STALE state. How do you recover?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Plex P1 in RECOVER state indicates that it was in the ACTIVE state prior to the failure.
Plex P2 in STALE state indicates that it was not participating in I/O and has stale data.
Run following commands:
# vxmend fix stale P1
# vxmend fix stale P2
# vxmend fix clean P1
# vxrecover -s V1

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A volume is disabled and not startable. No CLEAN plexes. Good plex is not known. How do you recover?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

* Take all but one plex offline and set that plex to CLEAN
* Run vxrecover -s
* Verify data on the volume
* Run vxvol stop
* Repeat this for all plexes until you identify the plex with good data.
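
The trial-and-error loop above can be written out as a printed plan, one block per plex. This is a sketch only: plan_plex_trials is a hypothetical helper that prints commands and runs nothing. The "vxmend on" step that brings the other plexes back between trials is my assumption, since the steps above do not spell out how to move to the next plex.

```shell
#!/bin/sh
# Print (not run) the trial sequence for each plex of a disabled volume.
# $1: disk group, $2: volume, remaining args: plex names
plan_plex_trials() {
    dg=$1; vol=$2; shift 2
    for try in "$@"; do
        echo "# trial: $try"
        # take every other plex offline
        for other in "$@"; do
            [ "$other" = "$try" ] || echo "vxmend -g $dg off $other"
        done
        echo "vxmend -g $dg fix clean $try"
        echo "vxrecover -g $dg -s $vol"
        echo "# verify the data on $vol now"
        echo "vxvol -g $dg stop $vol"
        # assumption: bring the other plexes back before the next trial
        for other in "$@"; do
            [ "$other" = "$try" ] || echo "vxmend -g $dg on $other"
        done
    done
}
```

For example, "plan_plex_trials testdg testvol testvol-01 testvol-02" prints two blocks. Stop at the first trial where the data verifies as good rather than running the remaining blocks.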

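~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How to remove disabled paths from Veritas?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~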

Run vxdctl enable to make sure Veritas has released its grip on the device.
# vxdctl enable

Make sure the device is offlined from Solaris’s view.
# luxadm -e offline /dev/rdsk/c2t5006048452A83978d206s2

Clear out the device from Solaris’s view.
# cfgadm -o unusable_FCP_dev -c unconfigure c2::5006048452a83978
# devfsadm -Cv

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Boot time related issues ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Problem: License keys are corrupted, missing, or expired

Causes: The files in /etc/vx/licenses/lic have become corrupted, OR an evaluation license was installed and not updated to a full license.
Solution:
Save /etc/vx/licenses/lic/* to a backup device. If the license files are removed or corrupted, you can copy the files back.
# vxlicinst (install a new license)
# vxiod set 10 (start i/o daemons)
# vxconfigd (start config daemon)

Problem: Boot device can’t be opened

Causes:
Boot disk is not powered on
Boot disk has failed
SCSI bus is not terminated
Controller failure has occurred
Disk is failing and locking the bus
Solution:
Check scsi bus connections: probe-scsi-all
Boot from alternate boot disk

Problem: VxVM startup scripts exit without initialisation

Causes:
/etc/vx/reconfig.d/state.d/install-db exists – indicates that the VxVM software packages have been installed,
but VxVM has not been initialised with vxinstall. Therefore vxconfigd is not started.
/VXVM#.#.#-UPGRADE/.start_runed – indicates that a VxVM upgrade has been started but not completed.
Therefore vxconfigd is not started.
Solution:
Remove the files and take appropriate actions

Problem: A conflicting host ID exists in /etc/vx/volboot file

volboot file contains the host ID that was on the system when vxvm was installed.
Solution:
Change the host ID in the volboot file: vxdctl hostid
Recreate the volboot file: vxdctl init

Problem: /var/vxvm/tempdb directory is missing, misnamed, or corrupted

It stores configuration information about imported disk groups. The contents are recreated after a reboot.
Causes: Directory is missing, misnamed, or corrupted
Solution:
To remove and recreate this directory:
# vxconfigd -k -x cleartempdir

How to run vxconfigd in debug mode?

# vxconfigd -k -m enable -x debug_level
(0 – no debugging, 9 – highest debugging)
-x log – log all console output to the /var/vxvm/vxconfigd.log file
-x logfile=name – use the specified log file instead
-x syslog – Direct all console output through the syslog interface
-x timestamp – Attach a date and time-of-day timestamp to all messages
-x tracefile=name – log all possible tracing information in the given file

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Useful commands
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To remove “online failing” in vxdisk list output for a good disk

# vxedit -g dgname set failing=off diskname

To check volume sizes

# vxprint -q -v -g <DG> -F "%{name} %{len}" | awk '{printf "%s %d\n", $1,$2/2048}'
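
The divisor in the command above reflects the unit conversion: assuming 512-byte sectors, 2048 sectors make 1 MB. A small sketch of the same arithmetic on its own, with sectors_to_mb as a hypothetical helper name:

```shell
#!/bin/sh
# Convert "name length-in-sectors" pairs to megabytes.
# Assumes 512-byte sectors, so 2048 sectors = 1 MB.
sectors_to_mb() {
    awk '{ printf "%s %d\n", $1, $2 / 2048 }'
}

# The 32768000-sector testvol used elsewhere in this post is 16000 MB.
echo "testvol 32768000" | sectors_to_mb
```

On platforms where VxVM reports lengths in a different sector size, the divisor would need to change accordingly.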

To find out the volumes on a particular disk

# vxprint -g datadg -e 'any v_plex.pl_sd.sd_disk="datadisk01"'

To find out the volumes in DISABLED state

# vxprint -g dg_dodgepre5 -e 'v_kstate!=V_ENABLED'

To find out the plexes in DISABLED state

# vxprint -g dg_dodgepre5 -e 'pl_kstate!=PL_ENABLED'

To remove multiplex plexes which are DISABLED and have no devices

# vxprint -p -e 'pl_kstate!=PL_ENABLED' -g $dg -F '%{name}' | while read i; do
vxplex -g $dg dis $i; vxedit -g $dg -fr rm $i;
done

To remove the license

Remove the files in /etc/vx/licenses/lic and run vxdctl license init to pick up the new license

To change disk to “sliced”?


EVA80003_10 auto:none - - online invalid
# /etc/vx/bin/vxdisksetup -i EVA80003_10 format=sliced
EVA80003_10 auto:sliced - - online

To start volume without recovery?


# vxrecover -sn

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Useful links
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Removing thin provisioned disk from DG:

http://www.symantec.com/connect/articles/automating-thin-storage-reclamation-veritas-storage-foundation

Veritas MAN pages

http://sfdoccentral.symantec.com/index.html


sanaswati