ZFS

From DaqWiki

Documentation

Files

  • /proc/spl/kstat/zfs/arcstats
  • /sys/module/zfs/parameters

Important commands

  • zfs set relatime=on pool ### ensure relatime is enabled; otherwise each file access generates a write to the filesystem (to update the "last accessed" timestamp).

isdaq00 tuning

Increase zfs cache to allow "cd /zssd/home1; du -ks *" to run completely from cache without any disk access.

echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 50 > /sys/module/zfs/parameters/zfs_arc_dnode_limit_percent
echo 90 > /sys/module/zfs/parameters/zfs_arc_meta_limit_percent
echo 20000000000 > /sys/module/zfs/parameters/zfs_arc_max
echo 2 > /proc/sys/vm/drop_caches
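
Values written under /sys/module/zfs/parameters are lost at reboot. A minimal sketch for keeping them, assuming the usual modprobe.d mechanism is used for the zfs module; the helper name zfs_modprobe_line and the file name zfs.conf are only illustrative:

```shell
# Print a modprobe "options" line for the zfs module from name=value pairs.
# Sketch only: the helper name is hypothetical.
zfs_modprobe_line() {
    printf 'options zfs'
    for kv in "$@"; do
        printf ' %s' "$kv"
    done
    printf '\n'
}
```

Possible use: zfs_modprobe_line zfs_arc_max=20000000000 zfs_arc_dnode_limit_percent=50 zfs_arc_meta_limit_percent=90 > /etc/modprobe.d/zfs.conf. If the zfs module is loaded from the initramfs, the initramfs may need to be regenerated for the file to take effect.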

Note:

  • "memory_free_bytes" is the same as free memory reported by "top"
  • "memory_available_bytes" is "memory_free_bytes" minus an adjustable safety margin (shown as "avail" in arcstat)
  • "arc_meta_max" is "arc_meta_used" + "memory_available_bytes"
  • "arc_meta_limit" should be set much bigger than that, set by zfs_arc_max and zfs_arc_meta_limit_percent
  • "arc_meta_used" is "size" in arcstat
  • "arc_dnode_limit" should be set much bigger than "dnode_size", set by zfs_arc_dnode_limit_percent
  • all the data should end up in the MFU (not MRU), "mfu_size" should be huge, "mru_size" much smaller.
  • isdaq00 with 24 GB of RAM is just about big enough to fit all of /zssd/home1, arc_meta_used is about 10 GB, arc_meta_max is about 12 GB.
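
The checks in the notes above can be scripted. This is a rough sketch, assuming the "name type data" column layout of /proc/spl/kstat/zfs/arcstats and that the fields named above exist in the installed ZFS version. The helper names and the 90% threshold are made up for illustration:

```shell
# Print the value of one field from an arcstats-style file.
# $1 = field name, $2 = file (defaults to the live kstat file).
arcstat_field() {
    awk -v name="$1" '$1 == name { print $3 }' "${2:-/proc/spl/kstat/zfs/arcstats}"
}

# Report whether the dnode cache is close to its limit.
# Prints "ok" if dnode_size is below 90% of arc_dnode_limit, "near-limit"
# otherwise, and "unknown" if either field is missing.
# The 90% threshold is an arbitrary assumption.
dnode_headroom() {
    file="${1:-/proc/spl/kstat/zfs/arcstats}"
    used=$(arcstat_field dnode_size "$file")
    limit=$(arcstat_field arc_dnode_limit "$file")
    if [ -z "$used" ] || [ -z "$limit" ]; then
        echo unknown
        return 1
    fi
    if [ "$((used * 10))" -lt "$((limit * 9))" ]; then
        echo ok
    else
        echo near-limit
    fi
}
```

For example, "arcstat_field mfu_size" and "arcstat_field mru_size" print the two sizes to compare, and "dnode_headroom" gives a quick hint whether zfs_arc_dnode_limit_percent should be raised.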

Misc commands

  • zpool status
  • zpool get all
  • zpool iostat 1
  • zpool iostat -v 1
  • zpool history
  • zpool scrub data14
  • zpool events
  • arcstat.py 1
  • cat /proc/spl/kstat/zfs/arcstats
  • echo 30000000000 > /sys/module/zfs/parameters/zfs_arc_meta_limit
  • echo 32000000000 > /sys/module/zfs/parameters/zfs_arc_max
  • zfs get all
  • zfs set dedup=verify zssd/nfsroot
  • zpool create data14 raidz2 /dev/sd[b-h]1
  • zfs create z8tb/data
  • zfs destroy z8tb/data
  • zpool add z10tb cache /dev/disk/by-id/ata-ADATA_SP550_2F4320041688
  • parted /dev/sdx mklabel GPT
  • blkid
  • zpool iostat -v -q 1
  • watch -d -n 1 "cat /proc/spl/kstat/zfs/arcstats | grep l2"
  • zfs set primarycache=metadata tank/datab
  • zfs set secondarycache=metadata tank/datab
  • zfs userspace -p -H zssd/home1
  • zfs groupspace ...
  • zdb -vvv -O pool/gobackup/titan00__home1 data/home1/titan/packages/elog/logbooks/titan/2017
  • zdb -C pool | grep ashift ### find the real value of ashift
  • zfs snapshot -r pool_A@migrate
  • zfs send -R pool_A@migrate | zfs receive -F pool_B
  • echo 1 > /sys/module/zfs/parameters/zfs_send_corrupt_data # zfs send should not stop on i/o errors
  • zpool create test raidz2 `ls -1 /dev/disk/by-id/ata-WDC_WD40EZRX-00SPEB0_WD* | grep -v part`
  • zpool add -f test special mirror /dev/disk/by-id/ata-WDC_WDS120G2G0A-00JH30_1843A2802212 /dev/disk/by-id/ata-KINGSTON_SV300S37A120G_50026B77630CCB2C
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: pool and new vdev with different redundancy, raidz and mirror vdevs, 2 vs. 1 (2-way)
  • zpool iostat -l 1 ### queue latencies
  • zpool iostat -w 1 ### latency distribution
  • zpool iostat -q -v 1 ### active and queued requests
  • zpool iostat -r 1 ### size of IO
  • echo 2 > /proc/sys/vm/drop_caches ### clear zfs cache and system cache, all memory is free after this
  • watch -n1 -d "cat /proc/spl/kstat/zfs/arcstats | grep -v l2 | tail -52"

Create raid1 (mirror) volume

echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
dracut -vf
zpool create zssd mirror /dev/sdaX /dev/sdbX
zpool set cachefile=none zssd
zpool set failmode=continue zssd
zpool status
zpool events
zpool get all
df /zssd
ls -l /zssd

Use whole disk for zfs mirror (RAID1)

echo USE_DISK_BY_ID=\'yes\' >> /etc/default/zfs
[root@daq13 ~]# parted /dev/sdb
(parted) mklabel GPT
(parted) q                                                                
[root@daq13 ~]# parted /dev/sdc
(parted) mklabel GPT                                                      
(parted) q                                                                
[root@daq13 ~]# blkid                                                     
/dev/sda1: UUID="ab920e4b-40ae-4551-aab8-f3e893d38830" TYPE="xfs" 
/dev/sdb: PTTYPE="gpt" 
/dev/sdc: PTTYPE="gpt" 
[root@daq13 ~]# zpool create z10tb mirror /dev/sdb /dev/sdc
[root@daq13 ~]# zpool status
  pool: z10tb
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        z10tb       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0

errors: No known data errors
[root@daq13 ~]# 
[root@daq13 ~]# zfs create z10tb/emma
[root@daq13 ~]# df -kl
Filesystem      1K-blocks     Used  Available Use% Mounted on
pool           9426697856        0 9426697856   0% /pool
pool/daqstore  9426697856        0 9426697856   0% /pool/daqstore
[root@daq13 ~]# 

Enable ZFS at boot

systemctl enable zfs-import-cache
systemctl enable zfs-import-scan
systemctl enable zfs-mount
systemctl enable zfs-import.target
systemctl enable zfs.target

Replace failed disk

  • pull failed disk out
  • zpool status # identify failed disk zfs label (it should be labeled FAULTED or OFFLINE)
  • safe to reboot here
  • install new disk
  • partition new disk, i.e. "gdisk /dev/sdh", use "o" to create new partition table, use "n" to create new partition, accept all default answers, use "w" to save and exit
  • safe to reboot here
  • run tests on new disk (smart, diskscrub), if unhappy go back to "install new disk"
  • safe to reboot here
  • identify serial number of new disk, i.e. "smartctl -a /dev/sdh | grep -i serial" yields "Serial Number: WD-WCAVY0893313"
  • identify linux id of new disk by "ls -l /dev/disk/by-id | grep -i WD-WCAVY0893313" yields "ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1"
  • zpool replace data11 zfs-label-of-failed-disk ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1
  • zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Apr 29 11:51:03 2016
    24.7G scanned out of 795G at 32.3M/s, 6h46m to go
    3.00G resilvered, 3.11% done
config:

        NAME                                                   STATE     READ WRITE CKSUM
        data11                                                 DEGRADED     0     0     0
          raidz2-0                                             DEGRADED     0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1     ONLINE       0     0     0
            replacing-2                                        DEGRADED     0     0     0
              17494865033746374811                             FAULTED      0     0     0  was /dev/sdi1
              ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1  ONLINE       0     0     0  (resilvering)
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1     ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1     ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1    ONLINE       0     0     0

errors: No known data errors
  • wait for raid rebuild ("resilvering") to complete
  • zpool status should look like this:
[root@daq11 ~]# zpool status
  pool: data11
 state: ONLINE
  scan: resilvered 96.2G in 1h44m with 0 errors on Fri Apr 29 13:35:40 2016
config:

        NAME                                                 STATE     READ WRITE CKSUM
        data11                                               ONLINE       0     0     0
          raidz2-0                                           ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA3872943-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973466-part1   ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0893313-part1  ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WCAZA1973369-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0858733-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0819555-part1   ONLINE       0     0     0
            ata-WDC_WD20EARS-00MVWB0_WD-WMAZA0857075-part1   ONLINE       0     0     0
            ata-WDC_WD2002FYPS-01U1B0_WD-WCAVY0347413-part1  ONLINE       0     0     0

errors: No known data errors
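
Waiting for the resilver can be automated. A minimal sketch, assuming the "resilver in progress" wording of the "scan:" line shown above stays the same on the installed ZFS version; both function names are hypothetical:

```shell
# Succeeds (exit 0) if the "zpool status" text on stdin reports a running resilver.
resilver_running() {
    grep -q 'resilver in progress'
}

# Poll "zpool status" until the resilver is no longer reported.
# $1 = pool name, $2 = seconds between polls (default 60).
wait_for_resilver() {
    while zpool status "$1" | resilver_running; do
        sleep "${2:-60}"
    done
}
```

For example, "wait_for_resilver data11 300 && zpool status data11" checks every five minutes and prints the final status once the rebuild is done.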

Replace failed disk (whole disk zfs)

  • roughly same as above
  • parted /dev/sdi, mklabel GPT, q
  • zpool replace pool 5050168421842479357 /dev/disk/by-id/ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944C2AP
  • here "pool" is the zfs pool name
  • first number is the failed "was" disk from "zpool status"
  • second /dev/disk/by-id is the replacement disk from ./smart-status.perl
  • zpool status
[root@tigstore01 ~]# zpool status
  pool: pool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Sep  8 16:06:41 2022
	9.36T scanned at 829M/s, 7.27T issued at 644M/s, 18.0T total
	625G resilvered, 40.48% done, 04:50:08 to go
config:

	NAME                                            STATE     READ WRITE CKSUM
	pool                                            DEGRADED     0     0     0
	  raidz2-0                                      DEGRADED     0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX21D9421VHA    ONLINE       0     0     0  (resilvering)
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944C8Y6    ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944CKDK    ONLINE       0     0     0
	    replacing-3                                 DEGRADED     0     0     0
	      5050168421842479357                       UNAVAIL      0     0     0  was /dev/disk/by-id/ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944CPDP-part1
	      ata-WDC_WD60EFRX-68L0BN1_WD-WX11DA5NDTPS  UNAVAIL      3     0     0  (resilvering)
	      ata-WDC_WD60EFRX-68MYMN1_WD-WX31D944C2AP  ONLINE       0     0     0  (resilvering)
	    ata-WDC_WD60EFRX-68L0BN1_WD-WX21DA5FNE28    ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNHS4    ONLINE       0     0     0  (resilvering)
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNT2A    ONLINE       0     0     0
	    ata-WDC_WD60EFRX-68MYMN1_WD-WX41D94RNZJ0    ONLINE       0     0     0

errors: No known data errors
  • wait for resilver to complete
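
The identifier of the failed disk (the first argument to "zpool replace" after the pool name) can be picked out of the status listing. A sketch that assumes the NAME/STATE column layout shown above; the function name is made up:

```shell
# Read "zpool status" text on stdin and print the name of every device whose
# STATE column is FAULTED, UNAVAIL, OFFLINE or REMOVED.
# The pool-level "state:" line is skipped.
failed_vdevs() {
    awk '$1 != "state:" && $2 ~ /^(FAULTED|UNAVAIL|OFFLINE|REMOVED)$/ { print $1 }'
}
```

Typical use would be "zpool status pool | failed_vdevs". The output should still be checked by eye before it is passed to "zpool replace".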

Expand zfs pool

replacing 250GB mirrored SSDs with 1TB mirrored SSDs:
zpool scrub zssd ### ensure both mirror halves are consistent and have good data
# confirm have backups of pool contents (amanda and daqbackup)
# pull one 250GB SSD
# insert replacement 1TB SSD
# follow instructions for replacing failed disk:
parted /dev/sda ...
ls -l /dev/disk/by-id/...
zpool replace zssd sda1 ata-WDC_WDS100T2B0A_192872803056
# wait for resilvering to complete
zpool scrub zssd # confirm resilver was ok
# do the same with the second 1TB disk
parted /dev/sdb
ls -l /dev/disk/by-id/...
zpool replace zssd sdb1 ata-WDC_WDS100T2B0A_192872802193
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v ### observe EXPANDSZ is now non-zero
# wait for resilver to finish
zpool online -e zssd ata-WDC_WDS100T2B0A_192872803056
zpool list -v ### observe EXPANDSZ is now zero, but SIZE and FREE have changed
[root@alpha00 ~]# zpool list -v zssd
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zssd   222G   202G  20.1G      706G    56%    90%  1.00x  DEGRADED  -
  mirror   222G   202G  20.1G      708G    56%    90%
    ata-WDC_WDS100T2B0A_192872803056      -      -      -         -      -      -
    replacing      -      -      -      708G      -      -
      sdb1      -      -      -      708G      -      -
      ata-WDC_WDS100T2B0A_192872802193      -      -      -         -      -      -
[root@alpha00 ~]# zpool list -v zssd
NAME   SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
zssd   930G   202G   728G         -    13%    21%  1.00x  ONLINE  -
  mirror   930G   202G   728G         -    13%    21%
    ata-WDC_WDS100T2B0A_192872803056      -      -      -         -      -      -
    ata-WDC_WDS100T2B0A_192872802193      -      -      -         -      -      -
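
A small sketch for spotting pools that still have unclaimed space after a disk swap. It assumes the tab-separated output of "zpool list -H -o name,expandsize", where a pool with nothing to expand shows "-". The helper name is made up for this example:

```shell
# Read "name<TAB>expandsize" lines on stdin and print the pools that
# report a non-empty EXPANDSZ, i.e. pools where "zpool online -e" may help.
pools_with_expandsz() {
    awk -F '\t' '$2 != "-" && $2 != "0" && $2 != "" { print $1 }'
}
```

Possible use: zpool list -H -o name,expandsize | pools_with_expandsz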

Convert pool from single to mirror

  • we will convert a single-disk pool to a mirrored pool
  • initial state:
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:

	NAME                                    STATE     READ WRITE CKSUM
	bpool                                   ONLINE       0     0     0
	  489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: none requested
config:

	NAME                                    STATE     READ WRITE CKSUM
	rpool                                   ONLINE       0     0     0
	  d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0

errors: No known data errors
root@daq13:~# zpool attach rpool d870d08b-5bba-f441-b486-6e4975a384f2 /dev/sda2
  • status
root@daq13:~# zpool status
  pool: bpool
 state: ONLINE
  scan: none requested
config:

	NAME                                    STATE     READ WRITE CKSUM
	bpool                                   ONLINE       0     0     0
	  489bdda8-989a-f748-95b2-c1041aceed65  ONLINE       0     0     0

errors: No known data errors

  pool: rpool
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Jan  3 16:17:48 2021
	8.94G scanned at 2.98G/s, 620M issued at 207M/s, 8.94G total
	637M resilvered, 6.78% done, 0 days 00:00:41 to go
config:

	NAME                                      STATE     READ WRITE CKSUM
	rpool                                     ONLINE       0     0     0
	  mirror-0                                ONLINE       0     0     0
	    d870d08b-5bba-f441-b486-6e4975a384f2  ONLINE       0     0     0
	    sda2                                  ONLINE       0     0     0  (resilvering)

errors: No known data errors

Rename zfs pool

zpool export oldname
zpool import oldname z6tb

Quotas and disk use

  • zfs userspace -s used zssd/home1
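
To list only the biggest users, the machine-readable output of "zfs userspace -p -H" can be sorted by the shell. A sketch, assuming the default columns are type, name, used, quota in that order and are tab-separated; the function name is hypothetical:

```shell
# Read "zfs userspace -p -H" output on stdin and print "name used" for the
# N largest users (default 10), largest first.
top_users() {
    tab=$(printf '\t')
    sort -t "$tab" -k3,3nr | head -n "${1:-10}" | awk -F '\t' '{ print $2, $3 }'
}
```

Possible use: zfs userspace -p -H zssd/home1 | top_users 5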

Misc

ZFS tunable parameters for hopefully speeding up resilvering:

https://www.reddit.com/r/zfs/comments/4192js/resilvering_raidz_why_so_incredibly_slow/
echo 0 > /sys/module/zfs/parameters/zfs_resilver_delay
echo 512 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 5000 > /sys/module/zfs/parameters/zfs_resilver_min_time_ms
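
The set of tunables under /sys/module/zfs/parameters differs between ZFS versions, so some of the files above may not exist on a given machine. A cautious sketch that writes a value only when the parameter file is there; the function name and the optional directory argument are assumptions for illustration:

```shell
# Write a value to a ZFS module parameter only if that parameter exists.
# $1 = parameter name, $2 = value,
# $3 = parameter directory (defaults to the live sysfs path).
set_zfs_param() {
    param_file="${3:-/sys/module/zfs/parameters}/$1"
    if [ -w "$param_file" ]; then
        echo "$2" > "$param_file"
    else
        echo "skipping $1: not present or not writable" >&2
        return 1
    fi
}
```

For example, "set_zfs_param zfs_resilver_min_time_ms 5000" sets the value where the tunable exists, and otherwise prints a message and returns non-zero instead of failing on a missing path.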

Enable periodic scrub:

cd ~/git/scripts
git pull
cd zfs
make install

Working with ZFS snapshots:

If a ZFS filesystem becomes 100% full, "rm" stops working, but space can still be freed by truncating a large file in place with "echo > bigfile"; afterwards "rm" works again.
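
The truncation trick can be wrapped in a tiny helper. A sketch only, with a made-up function name. It uses ": >" instead of "echo >", so the file ends up with zero bytes rather than a single newline:

```shell
# Truncate a file to zero length in place, without unlinking it.
# $1 = path of the file to empty.
truncate_in_place() {
    : > "$1"
}
```

Possible use on a full filesystem: truncate_in_place bigfile && rm bigfile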