Migrating Data Between Local File Systems
Note
Large-scale data migration can be challenging and time-consuming. Please review the following guidelines and tools to minimize the time it takes to move your data and to ensure a successful and complete migration.
General Guidelines
Size the dataset and prune unneeded data. Use tools such as du and tree on the directories to understand the data volumes. Ensure there are no duplicate data sets, temporary working files, or other unneeded content. The most efficient way to move data is to reduce the data to move. Use tar or zip archiving tools to collapse directories into a single file. As appropriate, archive directories to the site-specific HSMS and delete from scratch file systems.
Start early and leave plenty of time for migration. Be aware that everyone on the filesystems will be moving data. Even with data sizes in hand, with limited insight into the data structure of individual directories, it is hard to predict exactly how long a transfer might take. Be sure to plan far ahead and leave yourself plenty of time to complete a migration! Note that transferring many small files is often worse than a few large files because performance is more strongly related to the time it takes to access a file, not transfer it.
Make sure that the user performing the copy has permissions to read all data in the directory to be transferred. If a directory has files or sub-directories which are restricted, you will need to split it up into multiple transfers as multiple users, or change ownership on the source data first.
Disable all batch and cron jobs that may be modifying the directories to be transferred! Any create/modify/delete changes can result in errors for any data transfer tool. For transfer of a large directory it may be OK to perform an initial copy interactively, but definitely quiesce access before performing a final sync.
Use a synchronization tool (NOT just cp or mv) and don’t rely on a one-time transfer completing perfectly. This is important because you will most likely have to run the process more than once, and tools such as rsync will skip already copied files. Then go back and delete the source data once you have confirmed the copy is complete.
For small data volumes, use an interactive session on an HPCS head node. In the unlikely event that the volume of data to move is less than a terabyte (TB), or 1,000 gigabytes (GB), it is appropriate to use a head node to do an ‘ad-hoc’ data transfer using a tool such as rsync.
For larger data volumes, submit a batch job to a ‘dtn’ or similar queue.
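The sizing and archiving advice above can be sketched as follows. This is a minimal illustration, assuming a shell with find, du, awk, and tar available. The directory and archive names are hypothetical, and the script builds a throwaway directory of small files so that it does not touch real data.

```shell
#!/bin/bash
# Sketch: size a directory, count its files, and collapse it into one archive.
# All paths are hypothetical; a temporary directory stands in for real data.
WORK=$(mktemp -d)
mkdir -p "$WORK/project/run1"
for i in 1 2 3 4 5; do
    echo "sample $i" > "$WORK/project/run1/small_$i.txt"
done

# Total size in 1 KiB blocks and number of regular files.
SIZE_KB=$(du -sk "$WORK/project" | awk '{print $1}')
FILE_COUNT=$(find "$WORK/project" -type f | wc -l | tr -d ' ')
echo "project: ${SIZE_KB} KiB in ${FILE_COUNT} files"

# Collapse the directory into a single tar file; -C keeps member paths relative.
tar -cf "$WORK/project.tar" -C "$WORK" project

# Directory entries in a tar listing end with "/"; count everything else.
ARCHIVED_COUNT=$(tar -tf "$WORK/project.tar" | grep -vc '/$')
echo "archive holds ${ARCHIVED_COUNT} files"
```

Comparing the file count before and after archiving is a cheap sanity check. It does not verify file contents.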
Suggested Tools
du
An original part of Unix, the du disk usage tool will be found on every HPCS. It can provide a simple overview of the usage of a file or directory. Output can be easily sorted by piping the output through sort. One example command is:
du -sk DIRECTORY/* | sort -n
-s will summarize sub directory usage
-k will output in 1024-byte (1 KiB) blocks
| sort -n pipes the output through sort, sorted numerically
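As a small worked example of the command above, the following sketch picks out the largest entry from the sorted du report. The TARGET directory and the file names are hypothetical stand-ins created in a temporary location.

```shell
#!/bin/bash
# Sketch: list entries under a directory by size, largest last.
# TARGET is a hypothetical stand-in built in a temporary location.
TARGET=$(mktemp -d)
mkdir -p "$TARGET/small" "$TARGET/large"
head -c 1024 /dev/urandom > "$TARGET/small/a.dat"       # about 1 KiB
head -c 1048576 /dev/urandom > "$TARGET/large/b.dat"    # about 1 MiB

# -s summarizes each argument, -k reports 1 KiB blocks, sort -n orders numerically.
REPORT=$(du -sk "$TARGET"/* | sort -n)
echo "$REPORT"

# du separates size and path with a tab, so field 2 of the last line
# is the path of the largest entry.
LARGEST=$(echo "$REPORT" | tail -n 1 | cut -f 2)
echo "largest: $LARGEST"
```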
tree
A highly useful but optional part of Linux systems that should be installed on all NOAA RDHPCS, the tree tool provides tree-structured output about a directory with the option to summarize and calculate usage. One example command is:
tree --du -h -d -L 2 --sort=size DIRECTORY
--du will calculate disk usage on directories
-h will display human-readable (K,M,G,T) volumes
-d will list directories only
-L 2 will only show two levels of directories
--sort=size will sort output by size
% tree --du -h -d --sort=size -L 2 .
[8.8K] .
├── [6.3K] source
│ ├── [2.6K] images
│ ├── [ 416] data
│ ├── [ 416] systems
│ ├── [ 288] software
│ ├── [ 224] slurm
│ ├── [ 192] _templates
│ ├── [ 192] accounts
│ ├── [ 160] _downloads
│ ├── [ 160] files
│ ├── [ 128] _search
│ ├── [ 128] _static
│ ├── [ 128] contributing
│ ├── [ 128] help
│ ├── [ 128] logging_in
│ ├── [ 96] FAQ
│ ├── [ 96] compilers
│ ├── [ 96] connecting
│ └── [ 96] queue_policy
├── [1.7K] build
│ ├── [ 992] html
│ └── [ 608] doctrees
└── [ 96] utils
15K used in 24 directories
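Because tree is optional, a roughly similar two-level summary can be produced with du alone. The sketch below assumes a du that accepts -d DEPTH (GNU coreutils and BSD du both do). The directory layout is a hypothetical one created in a temporary location.

```shell
#!/bin/bash
# Sketch: two-level directory usage summary without tree.
# Assumes a du that accepts -d DEPTH; the layout below is hypothetical.
TOP=$(mktemp -d)
mkdir -p "$TOP/source/images" "$TOP/source/data" "$TOP/build/html"
head -c 262144 /dev/urandom > "$TOP/source/images/figure.bin"
head -c 4096 /dev/urandom > "$TOP/build/html/index.html"

# -k: 1 KiB blocks; -d 2: descend at most two levels; sort -rn: largest first.
SUMMARY=$(cd "$TOP" && du -k -d 2 . | sort -rn)
echo "$SUMMARY"

# One line per directory: the top level plus the two levels below it.
DIR_LINES=$(echo "$SUMMARY" | wc -l | tr -d ' ')
echo "$DIR_LINES directories reported"
```

Unlike tree, this prints a flat list rather than an indented hierarchy, so it is a fallback for sizing rather than a replacement.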
rsync
For basic migration, it is recommended to use the rsync tool to transfer the files and directories. One example command is:
rsync --archive --verbose --one-file-system /full/path/to/source/directory/ /full/path/to/destination/directory
Warning
It is very important that you have a trailing slash after the source directory: /full/path/to/source/directory/. If you do not, rsync copies the directory itself rather than its contents, and the command will attempt to retransfer all of the data into a subdirectory of the destination, for example: /full/path/to/destination/directory/directory.
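The effect of the trailing slash can be seen on a small scale. The following sketch uses hypothetical temporary directories and skips the demonstration if rsync is not installed.

```shell
#!/bin/bash
# Sketch: effect of the trailing slash on the rsync source path.
# Paths are hypothetical temporary directories, not real project data.
BASE=$(mktemp -d)
mkdir -p "$BASE/source/directory" "$BASE/with_slash" "$BASE/without_slash"
echo "payload" > "$BASE/source/directory/file.txt"

if command -v rsync >/dev/null 2>&1; then
    # Trailing slash: the CONTENTS of the source directory are copied.
    rsync --archive "$BASE/source/directory/" "$BASE/with_slash"
    # No trailing slash: the directory ITSELF is copied, adding one level.
    rsync --archive "$BASE/source/directory" "$BASE/without_slash"
    # Show where file.txt landed in each case.
    find "$BASE/with_slash" "$BASE/without_slash" -type f
else
    echo "rsync not found; skipping demonstration"
fi
```

With the slash, file.txt lands directly in with_slash. Without it, the file lands in without_slash/directory.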
--archive (-a) will ensure all ownership and dates are preserved in the transfer.
--verbose (-v) will display details of every file being transferred. If you have lots of small files, this will slow down the transfer process.
--one-file-system (-x) restricts the transfer to the source filesystem. This is important when symlinks are used to point to data that exists on other filesystems.
To keep the two directories exactly the same, use --delete, which removes any file from the destination that does not exist in the source:
--delete means to remove files from the destination that are not in the source directory. If a file is removed from the source after a completed rsync, the next rsync with the --delete option will remove that file from the destination. It may be preferable to clean up the source only after confirming that all the files have been transferred.
Warning
Do not use the --delete option if you do not want data in the destination directory to be removed.
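One way to reduce the risk is to preview the deletions first. The sketch below combines --delete with rsync's --dry-run (-n) option, which reports what would be done without changing anything. The directories and file names are hypothetical, and the demonstration is skipped if rsync is not installed.

```shell
#!/bin/bash
# Sketch: preview what --delete would remove before running it for real.
# All paths are hypothetical temporary directories.
BASE=$(mktemp -d)
mkdir -p "$BASE/src" "$BASE/dest"
echo "keep" > "$BASE/src/keep.txt"
echo "keep" > "$BASE/dest/keep.txt"
echo "stale" > "$BASE/dest/stale.txt"    # exists only in the destination

PREVIEW=""
if command -v rsync >/dev/null 2>&1; then
    # --dry-run makes no changes; --verbose lists each pending deletion.
    PREVIEW=$(rsync --archive --verbose --delete --dry-run "$BASE/src/" "$BASE/dest")
    echo "$PREVIEW"
else
    echo "rsync not found; skipping demonstration"
fi
```

Lines beginning with "deleting" in the preview name the files that a real run would remove from the destination.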
xsync
On Jet and Hera, an additional data synchronization tool, xsync, is available in /apps/local/bin. It is an unsupported wrapper around rsync, find, and xargs that performs multi-threaded transfers. Usage of xsync is almost identical to rsync as described above.
Note
xsync does not support the --include and --exclude rsync options. To view additional parameters to tune threading and depth for better performance, run xsync --help. In most cases they should not be needed.
A sample batch script to transfer data
Here is a sample batch script that can be used as a template, then submitted to the batch system to perform the data movement:
#!/bin/bash
#SBATCH --job-name=data-transfer
#SBATCH --partition=PARTITION_GOES_HERE
#SBATCH --time=08:00:00
#SBATCH --nodes=1
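# NOTE: sbatch may not expand shell variables such as $HOME in #SBATCH lines;
# if the output file does not appear there, use an absolute path instead.
#SBATCH --output=$HOME/data-transfer-job-%j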
set -x
SRC=/path/to/source/directory/ # Note trailing slash
DEST=/path/to/destination/directory
echo "$(date) : Starting sync from $SRC to $DEST"
rsync -ax "$SRC" "$DEST"   # quoted so paths containing spaces are handled
echo "$(date) : Ending sync from $SRC to $DEST"
Before using this template, replace PARTITION_GOES_HERE with the appropriate partition for the HPCS being used. Refer to the system-specific pages for that information.
After updating the template and saving it locally as a batch job, submit it to the batch system. Watch for the exit status: if the job does not finish in 8 hours, resubmit it. Once it finishes successfully, add -v to the rsync line and submit it one more time. Examine the output file carefully to make sure there are no errors.
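To make the exit status easier to spot in the job output, the rsync status can be captured and turned into a short message. The sketch below is one possible approach, not part of the template above. The describe_rsync_status helper is hypothetical, and the messages paraphrase the exit values documented in the rsync manual (0 for success, 23 and 24 for partial transfers).

```shell
#!/bin/bash
# Sketch: turn an rsync exit status into a short message for the job log.
# describe_rsync_status is a hypothetical helper, not part of rsync or Slurm.
describe_rsync_status() {
    case "$1" in
        0)  echo "complete: no errors reported" ;;
        23) echo "partial: some files could not be transferred" ;;
        24) echo "partial: some source files vanished during the transfer" ;;
        *)  echo "failed: rsync exit status $1" ;;
    esac
}

# In the batch script this would follow the rsync line, for example:
#   rsync -ax "$SRC" "$DEST"
#   STATUS=$?
#   echo "$(date) : $(describe_rsync_status "$STATUS")"
#   exit "$STATUS"

# Stand-alone demonstration with two sample status values.
describe_rsync_status 0
describe_rsync_status 23
```

If the job is stopped because it hit its time limit, the script may never reach this point, so a missing status message is itself a sign that the job should be resubmitted.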
If, after several tries, the transfer still hasn’t completed and the errors are not obvious from reading the batch job output, refer to the getting help pages and ask for assistance. Be sure to include the file paths of the output files of your transfer jobs for the best assistance.
Known Issues
My job runs to completion but the files are not transferred
Look at the job output for obvious errors. It will be in your home directory in a file starting with data-transfer-job-. If your job completes and the files appear not to have transferred, read that file for clues.
If you are not a regular user of the batch system, it is likely that your initialization files are printing messages (typically with the echo command in the initialization files) that are causing the jobs to fail.
If this happens, you could temporarily rename your initialization files (.cshrc, .tcshrc, .bashrc, .login, .profile, .bash_profile, etc.) and try again. A better solution is to address the problems caused by these initialization files.
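One common way to address the problem in bash is to print messages only when the shell is interactive. The fragment below is a minimal sketch for a bash initialization file. The function name and the message are hypothetical, and csh-family files need a different test (for example, checking whether $prompt is set).

```shell
#!/bin/bash
# Sketch for ~/.bashrc: print a greeting only in interactive shells.
# greet_if_interactive and its message are hypothetical examples.
greet_if_interactive() {
    # "$-" holds the current shell option flags; "i" means interactive.
    case "$-" in
        *i*) echo "Login shell ready" ;;
        *)   : ;;    # batch jobs and other non-interactive shells stay silent
    esac
}

greet_if_interactive
```

When this runs inside a batch job or any other non-interactive shell, it produces no output.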
Were all my files transferred?
Look at the job output. It will be in your home directory in a file starting with data-transfer-job-. When the job completes, read that file for clues and any errors. You can ignore WARNings and other messages, but any message with the string “FATAL” suggests an incomplete transfer. This can happen because you ran out of time, or there may be other problems. If your job exited because it ran out of time, you should be able to resubmit the job, but be sure to add the --resume option.
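As an additional check that does not depend on the job output, the file lists of the source and destination can be compared directly. The sketch below uses hypothetical temporary directories and a hypothetical list_files helper. It compares file names only, not contents, so it is a quick first check rather than proof of a complete copy.

```shell
#!/bin/bash
# Sketch: report files that exist in the source but not in the destination.
# Directories and the list_files helper are hypothetical; names only are compared.
BASE=$(mktemp -d)
mkdir -p "$BASE/src/sub" "$BASE/dest/sub"
echo "one" > "$BASE/src/a.txt"
echo "two" > "$BASE/src/sub/b.txt"
echo "one" > "$BASE/dest/a.txt"          # sub/b.txt is deliberately missing

# Print a sorted list of regular files, relative to the given directory.
list_files() {
    (cd "$1" && find . -type f | sort)
}

list_files "$BASE/src" > "$BASE/src.list"
list_files "$BASE/dest" > "$BASE/dest.list"

# comm -23 keeps lines found only in the first sorted list.
MISSING=$(comm -23 "$BASE/src.list" "$BASE/dest.list")
if [ -n "$MISSING" ]; then
    echo "missing from destination:"
    echo "$MISSING"
else
    echo "all source files are present in the destination"
fi
```

For a more thorough comparison, rerunning rsync with --dry-run and --itemize-changes lists any files it would still transfer.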