file_transfer(35)

About file copying and synchronization


Even with today's fast computers, fast network and internet connections, and vast storage, one task remains surprisingly slow: file synchronization. Two major issues affect current synchronization mechanisms:

  1. Reading and writing many small files takes much longer than reading or writing one large file of the same total size. Unless you are reading from and writing to an SSD, this is always true, and especially so when slower media are involved (SD cards, network file transfer such as FTP or WebDAV, or optical storage).
  2. Synchronization between two file systems (FTP and WebDAV can be accessed via FUSE file systems and are also means of accessing files, hence they are treated as file systems here as well) is not easily possible: most means of synchronizing directory trees rely on file modification dates, which is problematic in two ways. First, computer clocks are not always synchronized via NTP and are therefore often incorrect, and renaming a file or transferring it across file systems often loses this metadata. Second, accessing the metadata on slow file systems can be very time-consuming as well; such slowness rarely means a slow transfer rate but often merely a slow access time, e.g. with CD-ROMs.


To address these problems, two different solutions have to be considered.

  1. For efficient synchronization between two endpoints of which one is slow to read from, slow to write to, or slow to respond, it is important to be able to quickly and reliably access the metadata of all files stored at the remote location. A simple solution could be a checksum table consisting, e.g., of the SHA-256 hash and the file name of every file, generated on the faster side and synchronized to the other side by sending only the files whose checksums differ (a minimal sketch follows after this list). This only supports unidirectional synchronization, which covers most cases: consider, e.g., a static website which is only ever synchronized from the development machine to the FTP storage, or some WebDAV-based backup system.
  2. For efficient transfer between two endpoints of which one is slow to respond, it is helpful to pack files together so that they are transferred not individually but as one continuous stream of data. This could be as simple as sending all files through tar or cpio (probably compressing them on the way, for even faster transfer across slow internet upload connections).
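
As a minimal sketch of the first idea, the following Python snippet generates such a checksum table by walking a directory tree and printing one "hash  path" line per file. The table format and the two-space separator are assumptions for illustration, not an existing standard:

    import hashlib
    import os
    import sys

    def checksum_table(root):
        """Yield (sha256, relative path) pairs for every file below root."""
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    # Hash in chunks to keep memory use constant.
                    for chunk in iter(lambda: f.read(1 << 16), b""):
                        digest.update(chunk)
                yield digest.hexdigest(), os.path.relpath(path, root)

    # Write the table for the tree given on the command line, e.g.
    #   python3 mktable.py public_html > _checksum
    for hexdigest, relpath in checksum_table(sys.argv[1]):
        print(f"{hexdigest}  {relpath}")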

While the first idea can be implemented on current systems, the second idea needs a compatible client. SSH or netcat, for example, can be used in conjunction with tar or cpio to send files as streams. FTP, however, cannot be used that way unless the data is extracted on the server by a separate program. The common PHP + FTP combination could benefit from this by uploading one large file containing all files to change via FTP and extracting it using PHP's ZIP or TAR extraction capabilities.
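
A minimal sketch of such a streamed transfer over SSH, in Python; the host, the remote directory and the local tree name are placeholders:

    import subprocess
    import tarfile

    HOST = "user@example.org"      # hypothetical host
    REMOTE_DIR = "/srv/www"        # hypothetical target directory

    # Start the remote unpacker first; its stdin receives the tar stream.
    ssh = subprocess.Popen(
        ["ssh", HOST, "tar -xzf - -C " + REMOTE_DIR],
        stdin=subprocess.PIPE,
    )

    # Pack the local tree into one compressed stream instead of
    # transferring each file individually.
    with tarfile.open(fileobj=ssh.stdin, mode="w|gz") as tar:
        tar.add("public_html", arcname=".")

    ssh.stdin.close()
    ssh.wait()

This is the programmatic equivalent of the classic tar -czf - public_html | ssh user@example.org 'tar -xzf - -C /srv/www' pipeline.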

Theoretical Application

Client:

  • Calculates the checksum table for a FS tree.
  • Asks the server for its current checksum table.
  • Computes the difference.
  • Creates a _deleted virtual file and adds it to an archive together with all of the changed files and a virtual _checksum file (a sketch of these steps follows after this list).
  • Sends the archive to the server.

Server:

  • Answers the client's query for the checksum table if available. Otherwise, a table with the current FS tree but no checksums is returned (which is slow!).
  • Can either run on the real server or provide an FTP interface.
  • The real server takes the archive, extracts it over the old files, deletes all files listed in _deleted and writes a new checksum table.
  • The FTP interface server takes the archive, extracts it by sending the files over FTP, deletes the _deleted files from the FTP server and writes a new checksum table to the FTP server.
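
A rough sketch of the client side, assuming the checksum tables have already been parsed into dictionaries mapping relative paths to SHA-256 digests; the _deleted and _checksum names follow the list above, everything else is hypothetical:

    import io
    import tarfile

    def diff_tables(local, remote):
        """Return (changed, deleted) paths relative to the remote table."""
        changed = [p for p, h in local.items() if remote.get(p) != h]
        deleted = [p for p in remote if p not in local]
        return changed, deleted

    def build_update_archive(local, remote, out_path="update.tar.gz"):
        changed, deleted = diff_tables(local, remote)
        with tarfile.open(out_path, "w:gz") as tar:
            for path in changed:
                tar.add(path)      # paths are relative to the tree root
            # Virtual files: one listing the paths to delete, one holding
            # the new checksum table for the server to store.
            for name, text in (
                ("_deleted", "\n".join(deleted)),
                ("_checksum", "\n".join(f"{h}  {p}" for p, h in local.items())),
            ):
                data = text.encode()
                info = tarfile.TarInfo(name)
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))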
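The real server's side could then look like the following sketch, again with all names hypothetical:

    import os
    import tarfile

    def apply_update(archive_path, root):
        """Extract an update archive over root and honour its _deleted list."""
        with tarfile.open(archive_path) as tar:
            # NB: extractall trusts the archive; a real tool would have to
            # sanitize member paths before extracting.
            tar.extractall(root)   # changed files, _deleted, _checksum
        deleted_list = os.path.join(root, "_deleted")
        with open(deleted_list) as f:
            for line in f:
                path = line.strip()
                if path:
                    os.remove(os.path.join(root, path))
        os.remove(deleted_list)    # the new _checksum table stays in place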

Configuration: server address, server port, a list of files and directories (deliberately not recursive, so that it integrates well with find) and an option to allow deleting files on the server which no longer exist locally.
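
Expressed as a minimal sketch (every name and the default port are hypothetical), such a configuration might look as follows, with the file list read from standard input so that find can drive the selection:

    import sys

    config = {
        "server": "example.org",   # hypothetical server address
        "port": 9422,              # hypothetical port
        "allow_delete": True,      # delete remote files which no longer exist
        # Deliberately not recursive: the file list comes from stdin, e.g.
        #   find public_html -type f | python3 sync.py
        "files": [line.rstrip("\n") for line in sys.stdin if line.strip()],
    }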

Also, the FTP part of the server could be generalized to work with other remote (or even local) file systems.

Optical Media is Special

Reading from a DVD would still not be any faster. Here it is easier to create an image first and extract the files from the image file rather than copying them directly: the image can be read in one sequential pass, whereas per-file copying pays the medium's high access time for every single file. A cleaner approach is not known so far.
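
A minimal sketch, assuming a Linux system where the drive appears as /dev/sr0 (the device node and image name are placeholders):

    import shutil

    # Read the disc in one sequential pass into an image file; random
    # per-file access on optical media pays the high seek time repeatedly.
    shutil.copyfile("/dev/sr0", "disc.iso")

    # The files can then be extracted from the image, e.g. by
    # loop-mounting it:  mount -o loop,ro disc.iso /mnt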
