file_transfer(35) Language: English


About file copying and synchronization

----------------------------------------------------------------------[ Meta ]--

name		file_transfer
section		35
description	About file copying and synchronization
tags		theory d5os ftp copy
encoding	utf8
compliance	public
lang		en
creation	2014/07/18 17:57:14
version		1.0.0.0
copyright	Copyright (c) 2014 Ma_Sys.ma.
		For further info send an e-mail to Ma_Sys.ma@web.de.

-------------------------------------------------------------------[ Problem ]--

Even with today's fast computers, fast network and internet connections and
giant storage, it is still a serious problem to do one thing quickly: file
synchronization. There are two major issues with current synchronization
mechanisms:

 1. Reading and writing many files takes a long time compared to reading or
    writing one large file. Unless you are reading from and writing to an SSD,
    this is _always_ true, especially when slower media (SD cards, optical
    storage or network file transfer like FTP or WebDAV) are involved.
 2. Synchronization between two file systems (FTP and WebDAV may be accessed via
    FUSE file systems and are also means of accessing files, hence we use
    ``file systems'' here for them as well) is not easily possible. Most means
    of synchronizing directory trees rely on the modification date of files,
    which is bad in three ways: First, computer clocks are not always
    synchronized via NTP and are therefore often incorrect. Second, renaming a
    file or transferring it across file systems often corrupts this metadata.
    Third, accessing the metadata on slow file systems can be very time
    consuming as well. Slowness rarely means a slow transfer rate but often only
    a slow access time, e.g. with CD-ROMs.

------------------------------------------------------------------[ Solution ]--

To solve these problems, two different solutions have to be considered.

 1. For efficient synchronization to happen between two points of which one is
    slow to read from or write to or slow to respond, it is important to be
    able to quickly and reliably access the metadata for all files stored at the
    foreign location. A simple solution could be to create a simple checksum
    table consisting, e.g., of the SHA-256 checksum and the name of each file.
    This table is generated at the ``faster'' side and compared with the table
    of the other side, so that only the files which differ in checksum are
    sent. This only supports unidirectional synchronization, which is
    sufficient for most cases. Consider, e.g., a static website which is only
    synchronized from the development machine to the FTP storage, or some
    WebDAV based backup system.
 2. For efficient transfer to happen between two points of which one is slow to
    respond, it is helpful to pack files together so that they are not
    transferred individually but rather as a continuous stream of data. This
    could be as simple as sending all files through `tar` or `cpio` (possibly
    compressing them on the way for even faster transfer across slow internet
    upload connections).
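
The checksum table from the first item can be sketched in a few lines of
Python. This is only a minimal illustration under assumptions: the function
names `checksum_table` and `diff_tables` are hypothetical, and the table is
kept as a plain dictionary from relative path to SHA-256 hex digest. How the
remote table is obtained is left open here.

```python
import hashlib
import os


def checksum_table(root):
    """Map each file path (relative to root, '/'-separated) to its SHA-256."""
    table = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as handle:
                # Read in chunks so large files do not have to fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            table[rel] = digest.hexdigest()
    return table


def diff_tables(local, remote):
    """Return (changed, deleted) as sorted lists of relative paths.

    changed: files whose local checksum is missing or different remotely.
    deleted: files listed remotely which no longer exist locally.
    """
    changed = sorted(p for p, h in local.items() if remote.get(p) != h)
    deleted = sorted(p for p in remote if p not in local)
    return changed, deleted
```

Because the comparison only looks at checksums and names, it does not depend
on modification dates or synchronized clocks at all.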

While the first idea can be implemented in current systems, the second idea
needs a compatible client. SSH or Netcat, for example, can be used in
conjunction with `tar` or `cpio` to send files as streams. FTP, however, cannot
be used that way unless the data is ``extracted'' on the server by a separate
program. The common PHP + FTP combination could benefit from this by using FTP
to upload one large archive containing the changed files and extracting it
using PHP's ZIP or TAR extraction capabilities.
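
As a rough sketch of the streaming idea, Python's standard `tarfile` module
can write a compressed archive to any non-seekable output such as a pipe or a
socket. The function names `pack_stream` and `unpack_names` are hypothetical
and only serve this illustration.

```python
import io
import os
import tarfile


def pack_stream(root, rel_paths, out):
    """Write the given files as one gzip-compressed tar stream to `out`.

    `out` is any writable binary file object; it need not be seekable.
    """
    # Mode "w|gz" produces a pure stream which never seeks backwards.
    with tarfile.open(fileobj=out, mode="w|gz") as archive:
        for rel in rel_paths:
            archive.add(os.path.join(root, rel), arcname=rel, recursive=False)


def unpack_names(data):
    """List the member names of a stream produced by pack_stream."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r|gz") as archive:
        return [member.name for member in archive]
```

If a script calling `pack_stream(".", paths, sys.stdout.buffer)` were saved as
`pack.py` (a made-up name), its output could be piped through SSH into
`tar -xzf -` on the receiving side, so that only one connection and one
stream are needed for all files.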

---------------------------------------------------[ Theoretical Application ]--

Client
 * Calculates the checksum table for an FS tree.
 * Asks the server for its current checksum table.
 * Computes the difference.
 * Creates a virtual `_deleted` file and adds it to an archive together with
   all of the changed files and a virtual `_checksum` file.
 * Sends the archive to the server.
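
The archive creation step of the client could look roughly like the following
Python sketch. The file formats are assumptions and not fixed by this page:
`_deleted` is taken to hold one path per line and `_checksum` to hold lines of
the form `<sha256>  <path>`. The names `add_virtual` and `build_update` are
hypothetical.

```python
import io
import os
import tarfile


def add_virtual(archive, name, text):
    """Add an in-memory text file to the archive under the given name."""
    data = text.encode("utf-8")
    info = tarfile.TarInfo(name)
    info.size = len(data)
    archive.addfile(info, io.BytesIO(data))


def build_update(root, table, changed, deleted, out):
    """Write the update archive for one synchronization run to `out`.

    table:   dict of relative path -> SHA-256 hex digest (new state)
    changed: relative paths of the files to transfer
    deleted: relative paths of the files to remove on the server
    """
    with tarfile.open(fileobj=out, mode="w|gz") as archive:
        # Assumed format: one path per line.
        add_virtual(archive, "_deleted", "".join(p + "\n" for p in deleted))
        # Assumed format: "<sha256>  <path>" per line.
        add_virtual(archive, "_checksum",
                    "".join(f"{table[p]}  {p}\n" for p in sorted(table)))
        for rel in changed:
            archive.add(os.path.join(root, rel), arcname=rel, recursive=False)
```

Note that in this sketch a real file named `_deleted` or `_checksum` in the
root of the tree would collide with the virtual files.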

Server
 * Answers the client's query for the checksum table if available. Otherwise,
   a table with the current FS tree but no checksums is returned
   (which is slow!).
 * Can either run on the real server or provide an FTP interface.
 * The ``real'' server takes the archive, extracts it over the old files,
   deletes all files listed in `_deleted` and writes a new checksum table.
 * The ``FTP interface'' server takes the archive, extracts it sending the
   files over FTP, deletes the files listed in `_deleted` from the FTP server
   and writes a new checksum table to the FTP server.
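
For the ``real'' server variant, applying an archive could be sketched as
follows in Python. Several details are assumptions made for this example: the
archive is a gzip-compressed tar stream, `_deleted` holds one path per line,
and the new checksum table is stored in a file called `.checksums` (a made-up
name). The functions `safe_join` and `apply_update` are hypothetical.

```python
import os
import tarfile


def safe_join(root, rel):
    """Join rel onto root, refusing paths which would escape root."""
    base = os.path.realpath(root)
    full = os.path.realpath(os.path.join(base, rel))
    if os.path.commonpath([base, full]) != base:
        raise ValueError(f"path escapes target directory: {rel}")
    return full


def apply_update(stream, root, table_name=".checksums"):
    """Apply one update archive read from a binary stream to root."""
    deleted = []
    with tarfile.open(fileobj=stream, mode="r|gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            # Simplification: each file is read into memory as a whole.
            data = archive.extractfile(member).read()
            if member.name == "_deleted":
                deleted = data.decode("utf-8").splitlines()
            elif member.name == "_checksum":
                with open(os.path.join(root, table_name), "wb") as handle:
                    handle.write(data)
            else:
                target = safe_join(root, member.name)
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with open(target, "wb") as handle:
                    handle.write(data)
    # Deletions run after extraction, as in the description above.
    for rel in deleted:
        target = safe_join(root, rel)
        if os.path.isfile(target):
            os.remove(target)
```

The path check is there because the archive comes from the network and member
names like `../x` must not be able to write outside of the target tree.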

Configuration: server address, server port, list of files and directories
(directories are not processed recursively by default, so that the list
integrates well with the output of `find`) and an option to allow deleting
files on the server which no longer exist on the client.

Also, the ``FTP part'' of the server could be extended to work with other
remote (or even local) file systems as well.

--------------------------------------------------[ Optical Media is Special ]--

Reading from a DVD would still not be any faster. For this, it is easier to
create an image first and extract the files from the image file rather than
copying them directly. A cleaner approach is not known so far.

