At my company we recently installed a new 3-node Vertica cluster, but when testing backup/restore procedures we encountered some issues.
Our Vertica cluster consists of three identical physical hosts running Vertica 8.1 on CentOS 7.3 and connected through 10 Gbit/s Ethernet. Each host has two network interfaces, one for private interconnect and one for public connectivity.
The backup target directory is an NFS mount point which is shared by an external host, an EMC Data Domain server which provides redundancy and data deduplication.
The NFS share was initially mounted with all the default options.
First issue: NFS locking
The first issue we encountered was a locking error when running Vertica's vbr backup task:

$ /opt/vertica/bin/vbr.py -t backup -c /home/dbadmin/vertica_backup/mybackup.ini
Error: Error locking backup location. Another vbr task is currently running: unknown.
Backup FAILED.
The backup succeeded, however, when the target directory was on the local file system, so we immediately suspected an NFS issue.
According to EMC Support Solution 304322, "NFS Best Practices for Data Domain and client OS", Data Domain does not support NFS locking, so the nolock option must be added to the NFS mount options on the clients.
Thus, I modified the /etc/fstab entry on the three Vertica nodes to include all the options recommended by the EMC note:

[dbadmin@vertica01 ~]$ tail -n1 /etc/fstab
datadomain.example.com:/data/col1/vertica /media/backup nfs hard,intr,nolock,nfsvers=3,tcp,timeo=1200,rsize=1048600,wsize=1048600,bg
After remounting the NFS directory on all three nodes, the backup was successful.
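Since the fix only works if every node carries the nolock option, a small sanity check can save a failed backup run. The following is a minimal sketch (the check_nolock helper is hypothetical, not part of vbr; the fstab line and mount point are the ones from this post):

```shell
# Sketch: verify that the fstab entry for a given mount point includes the
# "nolock" option recommended by EMC. Adjust paths for your environment.
check_nolock() {
  fstab="$1"; mountpoint="$2"
  # Field 4 of a non-comment fstab line holds the mount options.
  opts=$(awk -v mp="$mountpoint" '$1 !~ /^#/ && $2 == mp {print $4}' "$fstab")
  case ",$opts," in
    *,nolock,*) echo "OK: $mountpoint has nolock" ;;
    *)          echo "WARNING: nolock missing for $mountpoint"; return 1 ;;
  esac
}

# Example against a copy of the fstab line from this post:
cat > /tmp/fstab.test <<'EOF'
datadomain.example.com:/data/col1/vertica /media/backup nfs hard,intr,nolock,nfsvers=3,tcp,timeo=1200,rsize=1048600,wsize=1048600,bg
EOF
check_nolock /tmp/fstab.test /media/backup
```

Running this against /etc/fstab on each of the three nodes before launching vbr confirms the option is in place everywhere.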
Increasing the verbosity of vbr's log files can help with troubleshooting. This can be accomplished by adding the --debug 3 parameter to the vbr invocation, which generates additional logging in vbr's log directory.
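As a sketch of what such an invocation looks like, here is a tiny hypothetical helper (build_vbr_cmd is not part of vbr; the config path is the one used throughout this post) that assembles the command line with a configurable debug level:

```shell
# Hypothetical helper: build a vbr command line with a --debug level (default 3).
build_vbr_cmd() {
  task="$1"; config="$2"; debug="${3:-3}"
  printf '/opt/vertica/bin/vbr.py -t %s -c %s --debug %s\n' "$task" "$config" "$debug"
}

# Example: the backup task from this post, with verbose logging enabled.
build_vbr_cmd backup /home/dbadmin/vertica_backup/mybackup.ini
```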
Second issue: SSH concurrency
After successfully completing the backup, I wanted to verify its integrity before testing a full restore.
However, while the quick-check tasks were successful, the full-check task failed:

[dbadmin@vertica01 ~]$ /opt/vertica/bin/vbr.py -t full-check -c /home/dbadmin/vertica_backup/mybackup.ini
Checking backup consistency.
List all snapshots in backup location:
Snapshot name and restore point: mybackup_20170524_140413, nodes:['v_example0001', 'v_example0002', 'v_example0003'].
Error: Error accessing remote storage: failed to get remote files: ssh_exchange_identification: Connection closed by remote host
rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
rsync error: unexplained error (code 255) at io.c(601) [Receiver=3.0.7]
: returncode=255
Full-check FAILED.
The solution was provided by HPE Vertica Support and consists of raising the MaxStartups configuration parameter of the SSH daemon, which specifies the maximum number of concurrent unauthenticated connections allowed before new ones are dropped:

[root@vertica01 ~]# grep MaxStartups /etc/ssh/sshd_config
MaxStartups 50
[root@vertica01 ~]# systemctl reload sshd
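To catch this on other clusters before a full-check fails, the setting can be checked programmatically. This is a minimal sketch (the check_maxstartups helper is hypothetical; 50 is the value suggested by Vertica Support above, and 10 is OpenSSH's default start value when the directive is absent; MaxStartups may also take the "start:rate:full" form, so only the first field is compared):

```shell
# Sketch: warn if sshd's MaxStartups is below the value needed for vbr's
# concurrent rsync-over-ssh sessions.
check_maxstartups() {
  config="$1"; required="${2:-50}"
  current=$(awk '$1 == "MaxStartups" {print $2}' "$config" | cut -d: -f1)
  current="${current:-10}"   # OpenSSH default when the directive is absent
  if [ "$current" -lt "$required" ]; then
    echo "MaxStartups=$current is below $required; full-check may fail"
    return 1
  fi
  echo "MaxStartups=$current OK"
}

# Example against the setting applied in this post:
printf 'MaxStartups 50\n' > /tmp/sshd_config.test
check_maxstartups /tmp/sshd_config.test
```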
After this configuration change, both the full-check and the restore tasks were successful.