Copy-on-write (CoW)
Copy-on-write (CoW) is the technique used by all of the storage drivers: copy only when a write is needed. It applies to modifying existing files. For example, when multiple containers are started from the same image, giving each container its own full copy of the image's file system would waste a great deal of disk space. With CoW, all containers share the image's file system and read all data directly from the image. Only when a file is about to be written is it copied from the image into the container's own file system and modified there. No matter how many containers share the image, writes always go to the replica in each container's own file system, and the image's source file is never modified. If multiple containers modify the same file, each container gets its own replica in its own file system; the replicas are isolated from each other and do not interact. CoW therefore makes much better use of disk space.
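The sharing described above can be sketched in a few lines (a toy model, not Docker code): several "containers" share one read-only image, and a file is copied into a container's private layer only on the first write.

```python
# Toy model of copy-on-write sharing between containers.
image = {"/etc/hosts": "127.0.0.1 localhost\n"}  # shared read-only layer

class Container:
    def __init__(self, image):
        self.image = image      # shared with every other container
        self.layer = {}         # private writable layer, empty at start

    def read(self, path):
        # Reads fall through to the image unless the file was copied up.
        return self.layer.get(path, self.image.get(path))

    def write(self, path, data):
        # The write lands in the private layer; the image copy is untouched.
        self.layer[path] = data

a, b = Container(image), Container(image)
a.write("/etc/hosts", "10.0.0.1 app\n")

print(a.read("/etc/hosts"))   # container a sees its own replica
print(b.read("/etc/hosts"))   # container b still sees the image copy
print(image["/etc/hosts"])    # the image source file is unmodified
```

Each container's replica is independent, which is exactly the isolation property described above.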
Allocate on demand
Allocate-on-demand applies when a file does not yet exist: space is allocated only when a new file is actually written, which improves storage utilization. For example, starting a container does not pre-allocate disk space for it; new space is allocated as needed whenever a new file is written.
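A concrete way to see allocate-on-demand is a sparse file: the apparent size is claimed up front, but disk blocks are only allocated for regions actually written. This sketch assumes a file system with sparse-file support (e.g. ext4, xfs, btrfs):

```python
# Create a file with a large apparent size but write only 1 byte;
# on a sparse-capable file system, blocks are allocated on demand.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as f:
    f.seek(64 * 1024 * 1024)   # claim a 64 MiB apparent size...
    f.write(b"x")              # ...but actually write only 1 byte
    path = f.name

st = os.stat(path)
print("apparent size:", st.st_size)               # 64 MiB + 1 byte
print("allocated on disk:", st.st_blocks * 512)   # far smaller if sparse
os.unlink(path)
```

The gap between `st_size` and `st_blocks * 512` is space the file system never had to allocate.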
AUFS (Another Union FS) is a union file system and a file-level storage driver. AUFS can transparently overlay one or more existing file systems, merging multiple layers into a single-layer view. Put simply, it mounts different directories under the same virtual file system, and files can be overlaid and modified layer by layer. However many read-only layers sit below, only the topmost file system is writable. When a file needs to be modified, AUFS uses CoW to copy it from a read-only layer into the writable layer and modifies it there; the result is also saved in the writable layer. In Docker, the lower read-only layers are the image and the writable layer is the container. The structure is shown in the figure below:
Overlay has been supported by the Linux kernel since 3.18 and is also a union file system. Unlike AUFS, overlay has only two layers: an upper file system and a lower file system, corresponding to Docker's container layer and image layer respectively. When a file needs to be modified, CoW copies it from the read-only lower layer to the writable upper layer, and the result is saved in the upper layer. In Docker, the read-only lower layer is the image and the writable upper layer is the container. The structure is shown in the figure below:
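Overlay's two-layer model can be sketched as two directories merged into one view (a simplified model, not the kernel's implementation): the merged view prefers the upper layer, and modifying a lower-layer file first copies it up in full.

```python
# Simplified model of overlay's lower (image) and upper (container) layers.
lower = {"app.conf": "threads=4\n", "app.bin": "ELF..."}  # read-only image
upper = {}                                                 # writable container

def merged_view():
    view = dict(lower)
    view.update(upper)    # upper entries shadow lower ones
    return view

def modify(path, change):
    if path not in upper:           # copy-up: the whole file is copied,
        upper[path] = lower[path]   # even for a one-byte change
    upper[path] = change(upper[path])

modify("app.conf", lambda s: s.replace("4", "8"))
print(merged_view()["app.conf"])   # the modified copy from upper
print(lower["app.conf"])           # lower layer unchanged
```

Note that copy-up works at whole-file granularity, a point that matters in the overlay vs. device mapper comparison later.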
Device mapper has been supported by the Linux kernel since 2.6.9. It provides a mapping framework from logical devices to physical devices, on top of which users can implement storage resource management strategies to suit their own needs. AUFS and OverlayFS, discussed earlier, are file-level storage, whereas device mapper is block-level storage: all operations act directly on blocks, not files. The device mapper driver first creates a resource pool on a block device, then creates a base device with a file system on that pool. Images are snapshots of the base device, and a container is a snapshot of an image, so the file system in a container is a snapshot of the base device's file system in the pool, and no space is pre-allocated for the container. When a new file is written, a new block is allocated in the container's snapshot and the data is written there; this is allocate-on-demand. When an existing file is modified, CoW allocates block space in the container's snapshot, copies the data to be modified into the new blocks, and modifies it there. By default the device mapper driver creates a 100 GB file holding all images and containers, and each container is limited to a 10 GB volume; both can be configured and adjusted. The structure is shown in the figure below:
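The block-level behavior can be sketched as follows (a toy model of snapshot semantics, not device mapper internals): a snapshot records only the blocks that were rewritten, and reads fall through to the base device for everything else.

```python
# Toy model of block-level copy-on-write snapshots.
BLOCK = 4096
base = {0: b"A" * BLOCK, 1: b"B" * BLOCK, 2: b"C" * BLOCK}  # base device

class Snapshot:
    def __init__(self, base):
        self.base = base
        self.delta = {}            # only modified blocks are stored here

    def read(self, n):
        # Unmodified blocks are read straight from the base device.
        return self.delta.get(n, self.base[n])

    def write(self, n, data):
        # Only the touched block is allocated and written, not a whole file.
        self.delta[n] = data

snap = Snapshot(base)
snap.write(1, b"X" * BLOCK)
print(len(snap.delta))             # 1: a single block lives in the snapshot
print(snap.read(0) == base[0])     # True: unmodified blocks are shared
```

Chaining such snapshots (base device → image → container) gives the layering described above without ever duplicating unmodified blocks.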
Btrfs, billed as the next-generation copy-on-write file system, is included in the Linux kernel. It is also file-level storage, but like device mapper it can operate directly on the underlying device. Btrfs configures part of the file system as a complete sub-file-system, called a subvolume. Using subvolumes, one large file system can be divided into multiple sub-file-systems; these share the underlying device and allocate space from it as needed, much like an application calling malloc() to allocate memory. To use device space flexibly, Btrfs divides disk space into chunks, and each chunk can use a different allocation policy; for example, some chunks store only metadata and others only data. This model has many advantages. For instance, Btrfs supports dynamically adding devices: after adding a new disk to the system, you can use a Btrfs command to add it to the file system. Btrfs treats a large file system as a resource pool configured into multiple complete sub-file-systems, and new sub-file-systems can be added to the pool. The base image is a snapshot of a subvolume, and each child image and container has its own snapshot; these are all subvolume snapshots.
When a new file is written, a new data block is allocated in the container's snapshot and the file is written into that space; this is allocate-on-demand. When an existing file is modified, CoW allocates new space, writes the changed data there, and then updates the relevant data structures to point to the new data and snapshot. The original data and snapshot, no longer pointed to, can be overwritten.
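The pointer-update step can be sketched like this (a toy model, not Btrfs internals): file metadata points at extents, a modification writes a new extent and redirects the pointer, and the old extent becomes unreferenced and reclaimable.

```python
# Toy model of CoW via metadata pointer update.
extents = {}           # extent id -> data ("disk" space)
next_id = 0

def alloc(data):
    """Allocate a fresh extent and return its id."""
    global next_id
    extents[next_id] = data
    next_id += 1
    return next_id - 1

# File metadata maps a name to the extent currently holding its data.
file_meta = {"report.txt": alloc(b"draft v1")}

old = file_meta["report.txt"]
file_meta["report.txt"] = alloc(b"draft v2")   # new extent, pointer updated
del extents[old]                               # old extent is reclaimable

print(extents[file_meta["report.txt"]])        # b'draft v2'
```

The original data is never modified in place; only the pointer swap makes the new version visible.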
The ZFS file system is a revolutionary new file system that fundamentally changes how file systems are managed. ZFS abandons "volume management" entirely and no longer creates virtual volumes; instead, it pools all devices into a single storage pool and uses the "storage pool" concept to manage physical storage space. In the past, file systems were built on top of physical devices, and the concept of "volume management" was introduced to manage those devices and provide data redundancy by presenting the image of a single device. ZFS is built on virtual storage pools called zpools. Each pool consists of several virtual devices (vdevs), which can be raw disks, RAID 1 mirror devices, or multi-disk groups with non-standard RAID levels. The file systems on a zpool can then use the total storage capacity of these virtual devices.
Now let's look at how Docker uses ZFS. First, a ZFS file system is allocated from the zpool for the base layer of the image; the other image layers are clones of snapshots of that file system. A snapshot is read-only, while a clone is writable. When a container is started, a writable layer is created on top of the image. As shown in the figure below:
When a new file is written, allocate-on-demand is used: a new data block is allocated from the zpool, the new data is written into that block, and the new space belongs to the container (the ZFS clone).
When an existing file is modified, copy-on-write allocates new space, copies the original data into it, and the modification is made there.
AUFS vs. Overlay
Both AUFS and overlay are union file systems, but AUFS has many layers while overlay has only two, so when a file is large and sits in a lower layer, copy-up in AUFS may be slower. Moreover, overlay has been merged into the Linux kernel mainline and AUFS has not, which may also make overlay faster. However, overlay is still young and should be used cautiously in production. AUFS, Docker's first storage driver, has a long history, is relatively stable, has seen extensive production use, and has strong community support. The open-source DC/OS currently specifies overlay.
Overlay vs. Device mapper
Overlay is file-level storage and device mapper is block-level storage. When a file is very large and the modification is small, overlay copies the whole file regardless of how little changed, and copying a large file takes longer than a small one. Block-level storage copies only the blocks that need to be modified, not the whole file, so in this scenario device mapper is clearly faster. Because block-level storage accesses the logical disk directly, it suits I/O-intensive scenarios. Overlay performs better in scenarios with complex program internals and high concurrency but little I/O.
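A back-of-the-envelope calculation makes the difference concrete: changing 4 KiB inside a 1 GiB file. File-level CoW copies the whole file; block-level CoW copies only the affected blocks. The 64 KiB block size below is an assumption for illustration, not a driver default.

```python
# Compare copy-up cost: file-level vs. block-level CoW.
FILE = 1 * 1024**3          # 1 GiB file
CHANGE = 4 * 1024           # 4 KiB of it is modified
BLOCK = 64 * 1024           # assumed block size (illustrative)

file_level_copy = FILE                    # whole file is copied up
blocks_touched = -(-CHANGE // BLOCK)      # ceiling division
block_level_copy = blocks_touched * BLOCK

print(file_level_copy // block_level_copy)   # 16384x more data copied
```

For small edits to large files, file-level CoW moves orders of magnitude more data, which is exactly why device mapper wins in that scenario.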
Device mapper vs. Btrfs vs. ZFS
Both device mapper and Btrfs operate directly on blocks and do not support shared storage, which means that when multiple containers read the same file, multiple copies must be kept, so these storage drivers are not suitable for high-density container PaaS platforms. Moreover, frequently starting and stopping many containers may fill the disk and leave the host unable to work, so device mapper is not recommended for production use. Btrfs, on the other hand, can be very efficient for docker build.
ZFS was originally designed for Solaris servers with large amounts of memory, and it consumes significant memory in use, so it suits environments with plenty of memory. ZFS's CoW makes fragmentation worse: for a large file produced by sequential writes, if parts of it are later changed randomly, the file's physical blocks on disk are no longer contiguous and subsequent sequential reads perform poorly. ZFS supports multiple containers sharing a cache block, which suits PaaS and other high-density container scenarios.
Test tool: iozone (a file system benchmark tool that can test the read and write performance of file systems on different operating systems).
Test scenario: sequential and random I/O performance on file sizes from 4 KB to 1 GB.
Test method: start containers with different storage drivers, install iozone in each container, and run the iozone command.
Definition and interpretation of test items
Write: tests the performance of writing a new file.
Rewrite: tests the performance of writing to an existing file.
Read: tests the performance of reading an existing file.
Reread: tests the performance of reading a recently read file.