Oracle ZFS: Difference between revisions
Tomskyhaha (talk | contribs) separate paragraph for trademark |
m Moving Category:Formerly free software to Category:Formerly open-source or free software per Wikipedia:Categories for discussion/Log/2023 November 23#Category:Open-source software converted to a proprietary license |
||
(41 intermediate revisions by 18 users not shown) | |||
Line 1: | Line 1: | ||
{{About|the now proprietary filesystem|its open-source successor|OpenZFS|other uses}} |
|||
{{short description|Proprietary file system and logical volume manager by Oracle}} |
||
{{About|the proprietary filesystem|its open-source alternative|OpenZFS}} |
|||
{{Use mdy dates|date=May 2012}} |
||
{{Infobox |
{{Infobox software |
||
| programming language = C |
|||
| full_name = ZFS <!-- It is not an initialism; see the "History" section for an explanation of why --> |
|||
| |
| released = {{Start date and age|2005|11}}, part of [[OpenSolaris]] |
||
| latest_release_version = 11.4 SRU53 (Solaris OS)<ref>{{cite web | title = Announcing Oracle Solaris 11.4 SRU53 | url = https://blogs.oracle.com/solaris/post/announcing-oracle-solaris-114-sru53 | date = January 18, 2023 | access-date = January 18, 2023}}</ref> |
|||
| developer = [[Sun Microsystems]] originally, [[Oracle Corporation]] since 2010. |
|||
| latest_release_date = {{Start date and age|2023|01|18}} |
|||
| introduction_os = [[OpenSolaris]] |
|||
| |
| operating system = [[Oracle Solaris]] |
||
| license |
| license = Proprietary |
||
| website = {{URL|https://docs.oracle.com/cd/E23824_01/html/821-1448/zfsover-1.html}} |
|||
| partition_id = |
|||
}} |
|||
| directory_struct = Extensible [[hash table]] |
|||
| file_struct = |
|||
| bad_blocks_struct = |
|||
| max_filename_size = 255 [[ASCII]] characters (fewer for multibyte character standards such as [[Unicode]]) |
|||
| max_files_no = {{ubl|Per directory: 2<sup>48</sup>|Per file system: unlimited<ref name="scalability">{{cite web|title=What Is ZFS?|url=http://docs.oracle.com/cd/E23823_01/html/819-5461/zfsover-2.html#gayou|website=Oracle Solaris ZFS Administration Guide|publisher=Oracle|accessdate=29 December 2015|archive-url=https://web.archive.org/web/20160304210957/http://docs.oracle.com/cd/E23823_01/html/819-5461/zfsover-2.html#gayou|archive-date=March 4, 2016|url-status=live}}</ref>}} |
|||
| max_volume_size = 256 trillion [[yobibyte]]s (2<sup>128</sup> bytes)<ref name="scalability"/> |
|||
| max_capacity = 256 [[UB]] (2<sup>128</sup> bytes) |
|||
| max_file_size = 16 [[exbibyte]]s (2<sup>64</sup> bytes) |
|||
| filename_character_set = |
|||
| dates_recorded = |
|||
| date_range = |
|||
| date_resolution = |
|||
| forks_streams = Yes (called "extended attributes", but they are full-fledged streams) |
|||
| attributes = [[POSIX]] |
|||
| file_system_permissions = POSIX, NFSv4 ACLs |
|||
| compression = Yes |
|||
| data_deduplication = Yes |
|||
| encryption = Yes<ref name="encryption"/> |
|||
| copy_on_write = Yes |
|||
| repo = https://github.com/zfsonlinux/zfs |
|||
| OS = [[Solaris (operating system)|Solaris]], [[OpenSolaris]], [[illumos]] distributions, [[OpenIndiana]], [[FreeBSD]], [[macOS Server#Mac OS X Server 10.5 (Leopard Server)|Mac OS X Server 10.5]] (limited to read-only), [[NetBSD]], [[Linux]] via third-party [[Loadable Kernel Module|kernel module]] ("ZFS on Linux")<ref>{{cite web | url = http://zfsonlinux.org/faq.html#WhatAboutTheLicensingIssue | title = 1.1 What about the licensing issue? | accessdate = November 18, 2010 | archive-url = https://web.archive.org/web/20100926104451/http://zfsonlinux.org/faq.html#WhatAboutTheLicensingIssue | archive-date = September 26, 2010 | url-status = live }}</ref> or ZFS-[[Filesystem in Userspace|FUSE]], [[OSv]] |
|||
|variants=}} |
|||
'''Oracle ZFS''' is [[Oracle Corporation|Oracle]]'s proprietary implementation of the [[ZFS]] [[file system]] and [[Logical volume management|logical volume manager]] for [[Oracle Solaris]]. ZFS is a registered trademark belonging to Oracle.<ref>{{cite web | url =https://tsdrapi.uspto.gov/ts/cd/casestatus/sn85901629/content | title =Status Information for Serial Number 85901629 (ZFS) | publisher =United States Patent and Trademark Office | access-date =October 21, 2013 | archive-url =https://web.archive.org/web/20131021232022/https://tsdrapi.uspto.gov/ts/cd/casestatus/sn85901629/content | archive-date =October 21, 2013 | url-status =live }}</ref> |
|||
'''ZFS''' is a combined proprietary [[file system]] and [[Logical volume management|logical volume manager]]. Its [[Open-source software|open-source]] successor [[OpenZFS]] and the proprietary implementation by [[Oracle]] are extremely similar, making ZFS widely available within [[Unix-like]] systems. |
|||
ZFS is a registered trademark belonging to Oracle.<ref>{{cite web | url =https://tsdrapi.uspto.gov/ts/cd/casestatus/sn85901629/content | title =Status Information for Serial Number 85901629 (ZFS) | publisher =United States Patent and Trademark Office | accessdate =October 21, 2013 | archive-url =https://web.archive.org/web/20131021232022/https://tsdrapi.uspto.gov/ts/cd/casestatus/sn85901629/content | archive-date =October 21, 2013 | url-status =live }}</ref> |
|||
==History== |
||
{{See also|ZFS#Implementations|OpenZFS#Implementations}} |
|||
{{Main|History and implementations of ZFS}} |
|||
===Solaris 10=== |
||
In update 2 and later, ZFS is part of Sun's own Solaris 10 operating system and is thus available on both [[SPARC]] and [[x86]]-based systems. |
|||
{{Main|History and implementations of ZFS#Implementations}} |
|||
===Solaris 11=== |
|||
==Overview and design goals== |
|||
After Oracle's Solaris 11 Express release, the OS/Net consolidation (the main OS code) was made proprietary and closed-source,<ref>{{cite web |
|||
{{More citations needed section|date=August 2018}} |
|||
| url = http://techie-buzz.com/foss/oracle-has-killed-opensolaris.html |
|||
| title = Oracle Has Killed OpenSolaris |
|||
===ZFS compared to other file systems=== |
|||
| publisher = Techie Buzz |
|||
The management of stored data generally involves two aspects: the physical [[volume management]] of one or more [[Device file|block storage devices]] such as [[hard drive]]s and [[SD card]]s and their organization into [[Logical disk|logical block devices]] as seen by the [[operating system]] (often involving a [[volume manager]], [[RAID controller]], [[array manager]], or suitable [[device driver]]), and the management of data and files that are stored on these logical block devices (a [[file system]] or other data storage). |
|||
| date =August 14, 2010 |
|||
| access-date =July 17, 2013 |
|||
: Example: A [[RAID]] array of 2 hard drives and an SSD caching disk is controlled by [[Intel Rapid Storage Technology|Intel's RST system]], part of the [[chipset]] and [[firmware]] built into a desktop computer. The [[Windows]] user sees this as a single volume, containing an NTFS-formatted drive of their data, and NTFS is not necessarily aware of the manipulations that may be required (such as reading from/writing to the cache drive or [[RAID rebuilding|rebuilding the RAID array]] if a disk fails). The management of the individual devices and their presentation as a single device is distinct from the management of the files held on that apparent device. |
|||
| archive-url = https://web.archive.org/web/20131015165133/http://techie-buzz.com/foss/oracle-has-killed-opensolaris.html |
|||
| archive-date =October 15, 2013 |
|||
ZFS is unusual because, unlike most other storage systems, it unifies both of these roles and ''acts as both the volume manager and the file system''. Therefore, it has complete knowledge of both the physical disks and volumes (including their condition and status, their logical arrangement into volumes, and also of all the files stored on them). ZFS is designed to ensure (subject to suitable [[Computer hardware|hardware]]) that data stored on disks cannot be lost due to physical errors or misprocessing by the hardware or [[operating system]], or [[Data rot|bit rot]] events and [[data corruption]] which may happen over time, and its complete control of the storage system is used to ensure that every step, whether related to file management or [[Disk Management|disk management]], is verified, confirmed, corrected if needed, and optimized, in a way that storage controller cards and separate volume and file managers cannot achieve. |
|||
ZFS also includes a mechanism for dataset and pool-level [[snapshot (computer storage)|snapshots]] and [[replication (computing)|replication]], including snapshot [[Disk cloning|cloning]] which is described by the [[FreeBSD]] documentation as one of its "most powerful features", having features that "even other file systems with snapshot functionality lack".<ref name="freebsd1">{{cite web|url=https://www.freebsd.org/doc/handbook/zfs-zfs.html|title=19.4. zfs Administration|website=www.freebsd.org|access-date=February 22, 2017|archive-url=https://web.archive.org/web/20170223045940/https://www.freebsd.org/doc/handbook/zfs-zfs.html|archive-date=February 23, 2017|url-status=live}}</ref> Very large numbers of snapshots can be taken, without degrading performance, allowing snapshots to be used prior to risky system operations and software changes, or an entire production ("live") file system to be fully snapshotted several times an hour, in order to mitigate data loss due to user error or malicious activity. Snapshots can be rolled back "live" or previous file system states can be viewed, even on very large file systems, leading to savings in comparison to formal backup and restore processes.<ref name="freebsd1"/> Snapshots can also be cloned to form new independent file systems. A pool level snapshot (known as a "checkpoint") is available which allows rollback of operations that may affect the entire pool's structure, or which add or remove entire [[Datasets.load|datasets]]. |
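The reason snapshots are cheap can be illustrated with a toy copy-on-write store. This is only a sketch: the class `ToyDataset` and its methods are hypothetical names invented for illustration, and a real implementation shares unchanged blocks through a block-pointer tree rather than copying a dictionary on every write.

```python
class ToyDataset:
    """Toy copy-on-write store: existing state is never modified in place."""

    def __init__(self):
        self.live = {}        # file name -> immutable bytes
        self.snapshots = {}   # snapshot name -> mapping frozen at that moment

    def write(self, name, data):
        # Build a new mapping instead of mutating the old one, so any
        # snapshot that still references the old mapping is unaffected.
        self.live = {**self.live, name: bytes(data)}

    def snapshot(self, snap):
        # Taking a snapshot only records a reference to the current state.
        self.snapshots[snap] = self.live

    def rollback(self, snap):
        # Rolling back makes the snapshotted state the live one again.
        self.live = self.snapshots[snap]


ds = ToyDataset()
ds.write("a.txt", b"v1")
ds.snapshot("before-upgrade")
ds.write("a.txt", b"v2")
print(ds.live["a.txt"], ds.snapshots["before-upgrade"]["a.txt"])  # b'v2' b'v1'
ds.rollback("before-upgrade")
print(ds.live["a.txt"])  # b'v1'
```

Because a write never touches state that a snapshot references, taking the snapshot costs nothing beyond remembering a reference, which is the property that makes frequent snapshots practical.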
|||
==Features== |
|||
{{Advert section|date=August 2018}} |
|||
===Summary=== |
|||
ZFS is scalable, and includes extensive protection against [[data corruption]], support for high storage capacities, efficient [[data compression]], integration of the concepts of [[File system|filesystem]] and [[volume (computing)|volume management]], [[Snapshot (computer storage)|snapshots]] and [[copy-on-write]] clones, continuous integrity checking and automatic repair, [[#RAID-Z|RAID-Z]], native [[NFSv4]] [[Access control lists|ACLs]], and can be very precisely configured. |
|||
Examples of features specific to ZFS include: |
|||
:* Designed for long-term storage of data, and indefinitely scaled datastore sizes with zero data loss, and high configurability. |
|||
:* Hierarchical [[checksum]]ming of all data and [[metadata]], ensuring that the entire storage system can be verified on use, and confirmed to be correctly stored, or remedied if corrupt. Checksums are stored with a block's parent [[Block (data storage)|block]], rather than with the block itself. This contrasts with many file systems where checksums (if held) are stored with the data so that if the data is lost or corrupt, the checksum is also likely to be lost or incorrect. |
|||
:* Can store a user-specified number of copies of data or metadata, or selected types of data, to improve the ability to recover from data corruption of important files and structures. |
|||
:* Automatic rollback of recent changes to the file system and data, in some circumstances, in the event of an error or inconsistency. |
|||
:* Automated and (usually) silent self-healing of data inconsistencies and write failure when detected, for all errors where the data is capable of reconstruction. Data can be reconstructed using all of the following: error detection and correction checksums stored in each block's parent block; multiple copies of data (including checksums) held on the disk; write intentions logged on the SLOG (ZIL) for writes that should have occurred but did not occur (after a power failure); parity data from RAID/RAIDZ disks and volumes; copies of data from mirrored disks and volumes. |
|||
:* Native handling of standard RAID levels and additional ZFS RAID layouts ("[[RAIDZ]]"). The RAIDZ levels stripe data across only the disks required, for efficiency (many RAID systems stripe indiscriminately across all devices), and checksumming allows rebuilding of inconsistent or corrupted data to be minimised to those blocks with defects; |
|||
:* Native handling of tiered storage and caching devices, which is usually a volume related task. Because ZFS also understands the file system, it can use file-related knowledge to inform, integrate and optimize its tiered storage handling which a separate device cannot; |
|||
:* Native handling of snapshots and backup/[[replication (computing)|replication]] which can be made efficient by integrating the volume and file handling. Relevant tools are provided at a low level and require external scripts and software for utilization. |
|||
:* Native [[data compression]] and [[Data deduplication|deduplication]], although the latter is largely handled in [[RAM]] and is memory hungry. |
|||
:* Efficient rebuilding of RAID arrays—a RAID controller often has to rebuild an entire disk, but ZFS can combine disk and file knowledge to limit any rebuilding to data which is actually missing or corrupt, greatly speeding up rebuilding; |
|||
:* Unaffected by RAID hardware changes which affect many other systems. On many systems, if self-contained RAID hardware such as a RAID card fails, or the data is moved to another RAID system, the file system will lack information that was on the original RAID hardware, which is needed to manage data on the RAID array. This can lead to a total loss of data unless near-identical hardware can be acquired and used as a "stepping stone". Since ZFS manages RAID itself, a ZFS pool can be migrated to other hardware, or the operating system can be reinstalled, and the RAIDZ structures and data will be recognized and immediately accessible by ZFS again. |
|||
:* Ability to identify data that would have been found in a cache but has been discarded recently instead; this allows ZFS to reassess its caching decisions in light of later use and facilitates very high cache-hit levels (ZFS cache hit rates are typically over 80%); |
|||
:* Alternative caching strategies can be used for data that would otherwise cause delays in data handling. For example, synchronous writes which are capable of slowing down the storage system can be converted to asynchronous writes by being written to a fast separate caching device, known as the SLOG (sometimes called the ZIL – ZFS Intent Log). |
|||
:* Highly tunable—many internal parameters can be configured for optimal functionality. |
|||
:* Can be used for [[high availability]] clusters and computing, although not fully designed for this use. |
|||
===Data integrity=== |
|||
{{See also|Hard disk error rates and handling|Silent data corruption}} |
|||
One major feature that distinguishes ZFS from other [[file system]]s is that it is designed with a focus on data integrity by protecting the user's data on disk against [[silent data corruption]] caused by [[data degradation]], [[electric current|current]] spikes, bugs in disk [[firmware]], phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc. |
|||
A 1999 study showed that neither the then-major and widespread filesystems (such as [[Unix File System|UFS]], [[Extended file system|Ext]],<ref>The Extended file system (Ext) has [[metadata]] structure copied from UFS.
|||
{{cite web |
|||
| url = http://www.april.org/groupes/entretiens/remy_card.html |
|||
| title = Rémy Card (Interview, April 1998) |
|||
| date = April 19, 1999 |
|||
| publisher = April Association |
|||
| accessdate = 2012-02-08 |
|||
| archive-url = https://web.archive.org/web/20120204082557/http://www.april.org/groupes/entretiens/remy_card.html |
|||
| archive-date = February 4, 2012 |
|||
| url-status = dead |
|||
}} (In French)</ref> [[XFS]], [[JFS (file system)|JFS]], or [[NTFS]]), nor hardware RAID (which has [[RAID#Weaknesses|some issues with data integrity]]) provided sufficient protection against data corruption problems.<ref>{{cite web |
|||
|title=IRON FILE SYSTEMS |
|||
|url=http://pages.cs.wisc.edu/~vijayan/vijayan-thesis.pdf |
|||
|work=Doctor of Philosophy in Computer Sciences |
|||
|publisher=University of Wisconsin-Madison |
|||
|accessdate=9 June 2012 |
|||
|author=Vijayan Prabhakaran |
|||
|year=2006 |
|||
|archive-url=https://web.archive.org/web/20110429011617/http://pages.cs.wisc.edu/~vijayan/vijayan-thesis.pdf |
|||
|archive-date=April 29, 2011 |
|||
|url-status=live |
|||
}}</ref><ref>{{cite web |
|||
|title=Parity Lost and Parity Regained |
|||
|url=http://www.cs.wisc.edu/adsl/Publications/parity-fast08.html |
|||
|access-date=November 29, 2010 |
|||
|archive-url=https://web.archive.org/web/20100615101314/http://www.cs.wisc.edu/adsl/Publications/parity-fast08.html |
|||
|archive-date=June 15, 2010 |
|||
|url-status=live |
|||
}}</ref><ref>{{cite web |title=An Analysis of Data Corruption in the Storage Stack |url=http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf |access-date=November 29, 2010 |archive-url=https://web.archive.org/web/20100615111630/http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf |archive-date=June 15, 2010 |url-status=live }}</ref><ref>{{cite web |
|||
|title=Impact of Disk Corruption on Open-Source DBMS |
|||
|url=http://www.cs.wisc.edu/adsl/Publications/corrupt-mysql-icde10.pdf |
|||
|access-date=November 29, 2010 |
|||
|archive-url=https://web.archive.org/web/20100615090935/http://www.cs.wisc.edu/adsl/Publications/corrupt-mysql-icde10.pdf |
|||
|archive-date=June 15, 2010 |
|||
|url-status=live |
|||
}}</ref> Initial research indicates that ZFS protects data better than earlier efforts.<ref>{{cite web|url=http://pages.cs.wisc.edu/~kadav/zfs/zfsrel.pdf|title=Reliability Analysis of ZFS|first1=Asim|last1=Kadav|first2=Abhishek|last2=Rajimwale|access-date=September 19, 2013|archive-url=https://web.archive.org/web/20130921054610/http://pages.cs.wisc.edu/~kadav/zfs/zfsrel.pdf|archive-date=September 21, 2013|url-status=live}}</ref><ref>{{cite web |url=http://www.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf |title=End-to-end Data Integrity for File Systems: A ZFS Case Study |author1=Yupu Zhang |author2=Abhishek Rajimwale |author3=Andrea C. Arpaci-Dusseau |author4=Remzi H. Arpaci-Dusseau |publisher=Computer Sciences Department, University of Wisconsin |location=Madison |page=14 |accessdate=December 6, 2010 |archive-url=https://web.archive.org/web/20110626130632/http://www.cs.wisc.edu/wind/Publications/zfs-corruption-fast10.pdf |archive-date=June 26, 2011 |url-status=live }}</ref> It is also faster than UFS<ref name=ufs-zfs-ext4-btrfs_bench>{{cite web|last=Larabel|first=Michael|title=Benchmarking ZFS and UFS On FreeBSD vs. EXT4 & Btrfs On Linux|url=https://www.phoronix.com/scan.php?page=article&item=zfs_ext4_btrfs&num=2|publisher=Phoronix Media 2012|accessdate=21 November 2012|archive-url=https://web.archive.org/web/20161129093628/https://www.phoronix.com/scan.php?page=article&item=zfs_ext4_btrfs&num=2|archive-date=November 29, 2016|url-status=live}}</ref><ref name=zfs-hammer_bench>{{cite web|last=Larabel|first=Michael|title=Can DragonFlyBSD's HAMMER Compete With Btrfs, ZFS?|url=https://www.phoronix.com/scan.php?page=article&item=dragonfly_hammer&num=3|publisher=Phoronix Media 2012|accessdate=21 November 2012|archive-url=https://web.archive.org/web/20161129033518/https://www.phoronix.com/scan.php?page=article&item=dragonfly_hammer&num=3|archive-date=November 29, 2016|url-status=live}}</ref> and can be seen as its replacement. |
|||
Within ZFS, data integrity is achieved by using a [[Fletcher's checksum|Fletcher-based]] checksum or a [[SHA-256]] hash throughout the file system tree.<ref name="endtoend">{{cite web |
|||
| url = https://blogs.oracle.com/bonwick/entry/zfs_end_to_end_data |
|||
| title = ZFS End-to-End Data Integrity |
|||
| date = 2005-12-08 |
|||
| accessdate = 2013-09-19 |
|||
| first = Jeff |
|||
| last = Bonwick |
|||
| website = blogs.oracle.com |
|||
| archive-url = https://web.archive.org/web/20120403015447/https://blogs.oracle.com/bonwick/entry/zfs_end_to_end_data |
|||
| archive-date = April 3, 2012 |
|||
| url-status = live |
||
}}</ref> and further ZFS upgrades and implementations inside Solaris (such as encryption) are not compatible with other non-proprietary implementations which use previous versions of ZFS. |
|||
}}</ref> Each block of data is checksummed and the checksum value is then saved in the pointer to that block—rather than at the actual block itself. Next, the block pointer is checksummed, with the value being saved at ''its'' pointer. This checksumming continues all the way up the file system's data hierarchy to the root node, which is also checksummed, thus creating a [[Merkle tree]].<ref name="endtoend"/> In-flight data corruption or phantom reads/writes (the data written/read checksums correctly but is actually wrong) are undetectable by most filesystems as they store the checksum with the data. ZFS stores the checksum of each block in its parent block pointer so the entire pool self-validates.<ref name="endtoend"/> |
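The parent-held checksum scheme can be sketched in a few lines. This is a simplified illustration rather than the on-disk format: `Block`, `BlockPointer`, `point_to` and `verify` are hypothetical names, and SHA-256 stands in for whichever checksum is configured.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(payload: bytes) -> bytes:
    """SHA-256 is used here as a stand-in for the configured checksum."""
    return hashlib.sha256(payload).digest()


@dataclass
class BlockPointer:
    """Hypothetical pointer: refers to a child block and stores its checksum."""
    block: "Block"
    expected: bytes


@dataclass
class Block:
    data: bytes = b""
    children: list = field(default_factory=list)

    def serialize(self) -> bytes:
        # A parent's content includes its children's checksums, so the
        # checksums form a tree that is anchored at the root.
        return self.data + b"".join(ptr.expected for ptr in self.children)


def point_to(block: Block) -> BlockPointer:
    return BlockPointer(block, checksum(block.serialize()))


def verify(ptr: BlockPointer) -> bool:
    """Check a block against the checksum held by its parent, then recurse."""
    if checksum(ptr.block.serialize()) != ptr.expected:
        return False
    return all(verify(child) for child in ptr.block.children)


leaf_a = Block(b"file contents A")
leaf_b = Block(b"file contents B")
indirect = Block(b"indirect", [point_to(leaf_a), point_to(leaf_b)])
root = point_to(Block(b"root", [point_to(indirect)]))

print(verify(root))                 # True
leaf_b.data = b"file contents X"    # simulate silent corruption of one block
print(verify(root))                 # False
```

The corrupted leaf cannot vouch for itself, because the value it is compared against lives one level up in a block that was verified independently.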
|||
When a block is accessed, regardless of whether it is data or metadata, its checksum is calculated and compared with the stored checksum value of what it "should" be. If the checksums match, the data is passed up the programming stack to the process that asked for it; if the values do not match, then ZFS can heal the data if the storage pool provides [[data redundancy]] (such as with internal [[Disk mirroring|mirroring]]), assuming that the redundant copy is undamaged and matches its checksum.<ref>{{cite web
|||
| url = https://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing |
|||
| title = Demonstrating ZFS Self-Healing |
|||
| date = November 16, 2009 |
|||
| accessdate = 2015-02-01 |
|||
| first = Tim |
|||
| last = Cook |
|||
| website = blogs.oracle.com |
|||
| archive-url = https://web.archive.org/web/20110812031213/http://blogs.oracle.com/timc/entry/demonstrating_zfs_self_healing |
|||
| archive-date = August 12, 2011 |
|||
| url-status = live |
|||
}}</ref> Additional in-pool redundancy can optionally be provided by specifying {{Mono|copies{{=}}2}} (or {{Mono|copies{{=}}3}}), which means that data will be stored twice (or three times) on the disk, effectively halving (or, for {{Mono|copies{{=}}3}}, reducing to one third) the storage capacity of the disk.<ref>{{cite web
|||
| url = https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection |
|||
| title = ZFS, copies, and data protection |
|||
| date = 2007-05-04 |
|||
| accessdate = 2015-02-02 |
|||
| first = Richard |
|||
| last = Elling
|||
| website = blogs.oracle.com |
|||
| archive-url = https://web.archive.org/web/20160818143115/https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection |
|||
| archive-date = August 18, 2016 |
|||
| url-status = dead |
|||
}}</ref> Additionally some kinds of data used by ZFS to manage the pool are stored multiple times by default for safety, even with the default copies=1 setting. |
|||
If other copies of the damaged data exist or can be reconstructed from checksums and [[Parity bit|parity]] data, ZFS will use a copy of the data (or recreate it via a RAID recovery mechanism), and recalculate the checksum—ideally resulting in the reproduction of the originally expected value. If the data passes this integrity check, the system can then update all faulty copies with known-good data and redundancy will be restored. |
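A minimal sketch of this read path for a two-way mirror follows. The function `read_with_self_heal` is a hypothetical name, the mirror sides are modelled as in-memory byte arrays, and CRC-32 is only a stand-in for the real checksum.

```python
import zlib


def read_with_self_heal(copies: list[bytearray], expected: int) -> bytes:
    """Return verified data from mirrored copies, repairing damaged ones.

    `copies` models the same block on each side of a mirror; `expected` is
    the checksum held in the parent block pointer.
    """
    good = next((bytes(c) for c in copies if zlib.crc32(c) == expected), None)
    if good is None:
        raise IOError("no copy matches the checksum; block is unrecoverable")
    for c in copies:
        if zlib.crc32(c) != expected:
            c[:] = good  # overwrite the damaged copy with known-good data
    return good


original = b"important data"
expected = zlib.crc32(original)
mirror = [bytearray(original), bytearray(original)]
mirror[0][0] ^= 0xFF  # flip bits on one side to simulate silent corruption

data = read_with_self_heal(mirror, expected)
print(data == original, bytes(mirror[0]) == original)  # True True
```

The caller receives correct data, and redundancy is restored as a side effect of the read.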
|||
Consistency of data held in memory, such as cached data in the ARC, is not checked by default, as ZFS is expected to run on enterprise-quality hardware with [[ECC memory|error correcting RAM]], but the capability to check in-memory data exists and can be enabled using "debug flags". |
|||
===RAID ("RaidZ"){{Anchor|RAID-Z|RaidZ}}=== |
|||
For ZFS to be able to guarantee data integrity, it needs multiple copies of the data, usually spread across multiple disks. Typically this is achieved by using either a [[RAID]] controller or so-called "soft" RAID (built into a [[file system]]). |
|||
====Avoidance of hardware RAID controllers==== |
|||
While ZFS ''can'' work with hardware [[RAID]] devices, ZFS will usually work more efficiently and with greater protection of data, if it has raw access to all storage devices, and disks are not connected to the system using a hardware, firmware or other "soft" RAID, or any other controller which modifies the usual ZFS-to-disk [[I/O]] path. This is because ZFS relies on the disk for an honest view, to determine the moment data is confirmed as safely written, and it has numerous [[algorithm]]s designed to optimize its use of [[cache (computing)|caching]], [[cache flush]]ing, and disk handling. |
|||
If a third-party device performs caching or presents drives to ZFS as a single system, or without the [[low level]] view ZFS relies upon, there is a much greater chance that the system will perform ''less'' optimally, and that ZFS will be unable to prevent a failure or to recover from it as quickly or as fully. For example, if a hardware RAID card is used, ZFS may not be able to determine the condition of disks or whether the RAID array is degraded or rebuilding; it may not know of all data corruption; it cannot place data optimally across the disks, make selective repairs only, or control how repairs are balanced with ongoing use; and it may not be able to make repairs that it could otherwise perform, because the hardware RAID card will interfere. RAID controllers also usually add controller-dependent data to the drives, which prevents software RAID from accessing the user data. Although the data can be read with a compatible hardware RAID controller, one is not always available: if the controller card develops a fault, a replacement may not be obtainable, and other cards may not understand the manufacturer's custom data which is needed to manage and restore an array on a new card.
|||
Therefore, unlike most other systems, where RAID cards or similar are used to [[offloading|offload]] resources and processing and enhance performance and reliability, with ZFS it is strongly recommended these methods ''not'' be used as they typically ''reduce'' the system's performance and reliability. |
|||
If disks must be connected through a RAID or other controller, it is recommended to use a plain [[Host adapter|HBA (host adapter)]] or [[fanout]] card, or configure the card in [[JBOD]] mode (i.e. turn off RAID and caching functions), to allow devices to be attached but the ZFS-to-disk I/O pathway to be unchanged. A RAID card in JBOD mode may still interfere, if it has a cache or depending upon its design, and may detach drives that do not respond in time (as has been seen with many energy-efficient consumer-grade hard drives), and as such, may require [[Time-Limited Error Recovery]] (TLER)/CCTL/ERC-enabled drives to prevent drive dropouts, so not all cards are suitable even with RAID functions disabled.<ref>{{cite web |
|||
| url = http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-%28enterprise%29-edition-drives |
|||
| title = Difference between Desktop edition and RAID (Enterprise) edition drives |
|||
| author = wdc.custhelp.com |
|||
| access-date = September 8, 2011 |
|||
| archive-url = https://web.archive.org/web/20150105040018/http://wdc.custhelp.com/app/answers/detail/a_id/1397/~/difference-between-desktop-edition-and-raid-(enterprise)-edition-drives |
|||
| archive-date = January 5, 2015 |
|||
| url-status = live |
|||
}}</ref> |
|||
====ZFS' approach: RAID-Z and mirroring==== |
|||
Instead of hardware RAID, ZFS employs "soft" RAID, offering ''RAID-Z'' ([[Parity bit|parity]] based like [[RAID 5]] and similar) and [[disk mirror]]ing (similar to [[RAID 1]]). The schemes are highly flexible. |
|||
RAID-Z is a data/parity distribution scheme like [[RAID-5]], but uses dynamic stripe width: every block is its own RAID stripe, regardless of blocksize, resulting in every RAID-Z write being a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, eliminates the [[RAID#WRITE-HOLE|write hole error]]. RAID-Z is also faster than traditional RAID 5 because it does not need to perform the usual [[read-modify-write]] sequence.<ref name="RAID-Z">{{cite web |url= https://blogs.oracle.com/bonwick/en_US/entry/raid_z |title= RAID-Z |website= Jeff Bonwick's Blog |publisher= [[Oracle Corporation|Oracle]] Blogs |date= 2005-11-17 |accessdate= 2015-02-01 |first= Jeff |last= Bonwick |archive-url= https://web.archive.org/web/20141216015058/https://blogs.oracle.com/bonwick/en_US/entry/raid_z |archive-date= December 16, 2014 |url-status= dead }}</ref> |
|||
As all stripes are of different sizes, RAID-Z reconstruction has to traverse the filesystem metadata to determine the actual RAID-Z geometry. This would be impossible if the filesystem and the RAID array were separate products, whereas it becomes feasible when there is an integrated view of the logical and physical structure of the data. Going through the metadata means that ZFS can validate every block against its 256-bit checksum as it goes, whereas traditional RAID products usually cannot do this.<ref name="RAID-Z"/> |
|||
In addition to handling whole-disk failures, RAID-Z can also detect and correct [[silent data corruption]], offering "self-healing data": when reading a RAID-Z block, ZFS compares it against its checksum, and if the data disks did not return the right answer, ZFS reads the parity and then figures out which disk returned bad data. Then, it repairs the damaged data and returns good data to the requestor.<ref name="RAID-Z"/> |
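How a checksum allows the faulty disk to be identified can be sketched for the single-parity case. This is an illustration under simplifying assumptions, not the actual RAID-Z layout: the parity is a plain XOR of equally sized columns, and `xor` and `read_block` are hypothetical helper names.

```python
import hashlib
from functools import reduce


def xor(columns):
    """Byte-wise XOR of equally sized columns."""
    return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*columns))


def read_block(data_cols, parity, expected):
    """Return (data, index of the repaired column, or None if none was bad)."""
    if hashlib.sha256(b"".join(data_cols)).digest() == expected:
        return b"".join(data_cols), None
    # Checksum mismatch: assume each column in turn is the bad one, rebuild it
    # from parity plus the other columns, and keep the candidate that verifies.
    for i in range(len(data_cols)):
        others = data_cols[:i] + data_cols[i + 1:]
        candidate = data_cols[:i] + [xor(others + [parity])] + data_cols[i + 1:]
        if hashlib.sha256(b"".join(candidate)).digest() == expected:
            return b"".join(candidate), i
    raise IOError("more damage than single parity can repair")


cols = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor(cols)
expected = hashlib.sha256(b"".join(cols)).digest()

damaged = [cols[0], b"BxBB", cols[2]]  # one disk silently returns bad data
data, repaired = read_block(damaged, parity, expected)
print(data, repaired)  # b'AAAABBBBCCCC' 1
```

Parity alone can show that a stripe is inconsistent but not which column is wrong; it is the independent checksum that lets each candidate reconstruction be tested.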
|||
RAID-Z and mirroring do not require any special hardware: they do not need NVRAM for reliability, and they do not need write buffering for good performance or data protection. With RAID-Z, ZFS provides fast, reliable storage using cheap, commodity disks.<ref name="RAID-Z"/> |
|||
There are five different RAID-Z modes: ''RAID-Z0'' (similar to RAID 0, offers no redundancy), ''RAID-Z1'' (similar to RAID 5, allows one disk to fail), ''RAID-Z2'' (similar to RAID 6, allows two disks to fail), ''RAID-Z3'' (a RAID 7{{Efn|name="RAID7"|While RAID 7 is not a standard RAID level, it has been proposed as a catch-all term for any >3 parity RAID configuration<ref name="Triple-Parity RAID and Beyond">{{cite journal |last1=Leventhal |first1=Adam |title=Triple-Parity RAID and Beyond |journal=Queue |date=2009-12-17 |volume=7 |issue=11 |page=30 |doi=10.1145/1661785.1670144 |url=https://queue.acm.org/detail.cfm?id=1670144 |accessdate=12 April 2019 |doi-broken-date=2020-01-16 |archive-url=https://web.archive.org/web/20190315183030/https://queue.acm.org/detail.cfm?id=1670144 |archive-date=March 15, 2019 |url-status=live }}</ref>}} configuration, allows three disks to fail), and mirroring (similar to RAID 1, allows all but one disk to fail).<ref>{{cite web|title=ZFS Raidz Performance, Capacity and integrity|url=https://calomel.org/zfs_raid_speed_capacity.html|website=calomel.org|accessdate=23 June 2017|archive-url=https://web.archive.org/web/20171127225445/https://calomel.org/zfs_raid_speed_capacity.html|archive-date=November 27, 2017|url-status=dead}}</ref> |
|||
When creating a new ZFS pool, to retain the ability to access the pool from other non-proprietary Solaris-based distributions, it is recommended to upgrade to Solaris 11 Express from OpenSolaris (snv_134b), and thereby stay at ZFS version 28.
|||
The need for RAID-Z3 arose in the early 2000s as multi-terabyte capacity drives became more common. This increase in capacity, without a corresponding increase in throughput speeds, meant that rebuilding an array due to a failed drive could take "weeks or even months" to complete.<ref name="Triple-Parity RAID and Beyond"/> During this time, the older disks in the array will be stressed by the additional workload, which could result in data corruption or drive failure. With its additional parity, RAID-Z3 reduces the chance of data loss during such a rebuild.<ref>{{cite web |url= https://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805 |title= Why RAID 6 stops working in 2019 |date= February 22, 2010 |work= [[ZDNet]] |accessdate= October 26, 2014 |archive-url= https://web.archive.org/web/20141031164950/http://www.zdnet.com/blog/storage/why-raid-6-stops-working-in-2019/805 |archive-date= October 31, 2014 |url-status= live }}</ref> |
|||
===Future development=== |
|||
====Resilvering and scrub (array syncing and integrity checking)==== |
|||
On September 2, 2017, [[Simon Phipps (programmer)|Simon Phipps]] reported that Oracle had laid off virtually all <!-- "~ all" being tech shorthand for "approximately all" --> of its Solaris core development staff, interpreting it as a sign that Oracle no longer intends to support future development of the platform.<ref>{{Cite web |
|||
ZFS has no tool equivalent to [[fsck]] (the standard Unix and Linux data checking and repair tool for file systems).<ref>"No fsck utility equivalent exists for ZFS. This utility has traditionally served two purposes, those of file system repair and file system validation." {{cite web |
|||
|url=https://www.itwire.com/open-sauce/79738-bye,-bye-solaris,-it-was-a-nice-ride-while-it-lasted.html |
|||
|title=Checking ZFS File System Integrity |
|||
|title=Bye, bye Solaris, it was a nice ride while it lasted |
|||
|url=http://docs.oracle.com/cd/E23823_01/html/819-5461/gbbwa.html |
|||
|last=Varghese |
|||
|publisher=Oracle |
|||
|first=Sam |
|||
|accessdate=25 November 2012 |
|||
|website=ITWire |
|||
|archive-url=https://web.archive.org/web/20130131040337/http://docs.oracle.com/cd/E23823_01/html/819-5461/gbbwa.html |
|||
| |
|date = 2017-09-04 |
||
|access-date=2019-07-21 |
|||
|url-status=live |
|||
}}</ref> Instead, ZFS has a built-in [[Data scrubbing|scrub]] function which regularly examines all data and repairs silent corruption and other problems. Some differences are: |
|||
* fsck must be run on an offline filesystem, which means the filesystem must be unmounted and is not usable while being repaired, while scrub is designed to be used on a mounted, live filesystem, and does not need the ZFS filesystem to be taken offline. |
|||
* fsck usually only checks metadata (such as the journal log) but never checks the data itself. This means, after an fsck, the data might still not match the original data as stored. |
|||
* fsck cannot always validate and repair data when checksums are stored with data (often the case in many file systems), because the checksums may also be corrupted or unreadable. ZFS always stores checksums separately from the data they verify, improving reliability and the ability of scrub to repair the volume. ZFS also stores multiple copies of data—metadata, in particular, may have upwards of 4 or 6 copies (multiple copies per disk and multiple disk mirrors per volume), greatly improving the ability of scrub to detect and repair extensive damage to the volume, compared to fsck. |
|||
* scrub checks everything, including metadata and the data. The effect can be observed by comparing fsck to scrub times: an fsck on a large RAID sometimes completes in a few minutes, which means only the metadata was checked. Traversing all metadata and data on a large RAID takes many hours, which is exactly what scrub does. |
|||
The official recommendation from Sun/Oracle is to scrub enterprise-level disks once a month, and cheaper commodity disks once a week.<ref name=freenas-zfs-scrubs>{{cite web|title=ZFS Scrubs |url=http://doc.freenas.org/index.php/ZFS_Scrubs |publisher=freenas.org |accessdate=25 November 2012 |url-status=dead |archiveurl=https://web.archive.org/web/20121127160745/http://doc.freenas.org/index.php/ZFS_Scrubs |archivedate=November 27, 2012 }}</ref><ref name=solaris-zfs-scrub>"You should also run a scrub prior to replacing devices or temporarily reducing a pool's redundancy to ensure that all devices are currently operational." {{cite web |
|||
|title=ZFS Best Practices Guide |
|||
|url=http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide |
|||
|publisher=solarisinternals.com |
|||
|accessdate=25 November 2012 |
|||
|url-status=dead |
|||
|archiveurl=https://web.archive.org/web/20150905142644/http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide |
|||
|archivedate=September 5, 2015 |
|||
|df=mdy |
|||
}}</ref> |
}}</ref> |
||
== |
==Version history== |
||
{{for|earlier history|ZFS#Version history}} |
|||
ZFS is a [[128-bit]] file system,<ref>{{cite web |url=https://blogs.oracle.com/bonwick/entry/128_bit_storage_are_you |title=128-bit storage: are you high? |author=Jeff Bonwick |website=oracle.com |accessdate=May 29, 2015 |archive-url=https://web.archive.org/web/20150529160107/https://blogs.oracle.com/bonwick/entry/128_bit_storage_are_you |archive-date=May 29, 2015 |url-status=live }}</ref><ref name="last word">{{cite web |url= https://blogs.oracle.com/bonwick/en_US/entry/zfs_the_last_word_in |title= ZFS: The Last Word in Filesystems |first= Jeff |last= Bonwick |website= blogs.oracle.com |date= October 31, 2005 |accessdate= June 22, 2013 |archive-url= https://web.archive.org/web/20130619165135/https://blogs.oracle.com/bonwick/en_US/entry/zfs_the_last_word_in |archive-date= June 19, 2013 |url-status= dead }}</ref> so it can address 1.84 × 10<sup>19</sup> times more data than 64-bit systems such as [[Btrfs]]. The maximum limits of ZFS are designed to be so large that they should never be encountered in practice. For instance, fully populating a single zpool with 2<sup>128</sup> bits of data would require 3×10<sup>24</sup> TB hard disk drives.<ref>{{cite web|url=https://blogs.oracle.com/dcb/entry/zfs_boils_the_ocean_consumes|title=ZFS: Boils the Ocean, Consumes the Moon (Dave Brillhart's Blog)|accessdate=December 19, 2015|archive-url=https://web.archive.org/web/20151208192725/https://blogs.oracle.com/dcb/entry/zfs_boils_the_ocean_consumes|archive-date=December 8, 2015|url-status=dead}}</ref> |
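The factor quoted above is simply the ratio between a 128-bit and a 64-bit address space, which can be checked with exact integer arithmetic. The variable names are illustrative only.

```python
# Ratio between a 128-bit and a 64-bit address space.
ratio = 2**128 // 2**64          # equals 2**64

# 16 exbibytes expressed in bytes (1 EiB = 2**60 bytes).
sixteen_exbibytes = 16 * 2**60   # also equals 2**64

print(f"128-bit / 64-bit address space ratio: {ratio:.2e}")
```

Both quantities come out to 2<sup>64</sup>, roughly 1.84 × 10<sup>19</sup>.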
|||
{| class="wikitable" |
|||
|+ Legend: |
|||
| {{no|Old release}} |
|||
|- |
|||
| {{Proprietary|Latest Proprietary stable release}} |
|||
|} |
|||
{| class="wikitable" |
|||
Some theoretical limits in ZFS are: |
|||
|- |
|||
* 2<sup>48</sup>: number of entries in any individual directory<ref>{{cite web | url = http://download.oracle.com/docs/cd/E19253-01/819-5461/zfsover-2/index.html | title = Solaris ZFS Administration Guide | publisher = Oracle Corporation | accessdate =February 11, 2011}}</ref> |
|||
! ZFS Filesystem Version Number |
|||
* 16 [[exbibyte]]s (2<sup>64</sup> bytes): maximum size of a single file |
|||
! OS Release |
|||
* 16 exbibytes: maximum size of any attribute |
|||
! Significant changes |
|||
* 256 quadrillion [[zebibyte]]s (2<sup>128</sup> bytes): maximum size of any zpool |
|||
|- |
|||
* 2<sup>56</sup>: number of attributes of a file (actually constrained to 2<sup>48</sup> for the number of files in a directory) |
|||
! {{no|6}} |
|||
* 2<sup>64</sup>: number of devices in any zpool |
|||
| style="white-space:nowrap;" | Solaris 11.1 |
|||
* 2<sup>64</sup>: number of zpools in a system |
|||
| Multilevel file system support<ref name="fs-versions-2022">{{cite web | url = https://docs.oracle.com/en/operating-systems/solaris/oracle-solaris/11.4/manage-zfs/zfs-file-system-versions.html | title = ZFS File System Versions | access-date = Jan 1, 2023 | publisher = Oracle Corporation | year = 2022 | archive-url = https://web.archive.org/web/20230102032220/https://docs.oracle.com/en/operating-systems/solaris/oracle-solaris/11.4/manage-zfs/zfs-file-system-versions.html | archive-date = January 2, 2023 | url-status = live}}</ref> |
|||
* 2<sup>64</sup>: number of file systems in a zpool |
|||
|- |
|||
! {{no|7}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 45 |
|||
| File retention support<ref name="fs-versions-2022"/> |
|||
|- |
|||
! {{Proprietary|8}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 51 |
|||
| Unicode versioning support<ref name="fs-versions-2022"/> |
|||
|} |
|||
{| class="wikitable" |
|||
===Encryption=== |
|||
|- |
|||
With Oracle Solaris, the encryption capability in ZFS<ref>{{cite web | url=http://download.oracle.com/docs/cd/E19963-01/html/821-1448/gkkih.html | title=Encrypting ZFS File Systems | access-date=May 2, 2011 | archive-url=https://web.archive.org/web/20110623190612/http://download.oracle.com/docs/cd/E19963-01/html/821-1448/gkkih.html | archive-date=June 23, 2011 | url-status=live }}</ref> is embedded into the I/O pipeline. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order. The policy for encryption is set at the dataset level when datasets (file systems or ZVOLs) are created. The wrapping keys provided by the user/administrator can be changed at any time without taking the file system offline. The default behaviour is for the wrapping key to be inherited by any child data sets. The data encryption keys are randomly generated at dataset creation time. Only descendant datasets (snapshots and clones) share data encryption keys.<ref>{{cite web | url=https://blogs.oracle.com/darren/entry/compress_encrypt_checksum_deduplicate_with | title=Having my secured cake and Cloning it too (aka Encryption + Dedup with ZFS) | access-date=October 9, 2012 | archive-url=https://web.archive.org/web/20130529061709/https://blogs.oracle.com/darren/entry/compress_encrypt_checksum_deduplicate_with | archive-date=May 29, 2013 | url-status=live }}</ref> A command to switch to a new data encryption key for the clone or at any time is provided—this does not re-encrypt already existing data, instead utilising an encrypted master-key mechanism. |
|||
! ZFS Pool Version Number |
|||
! OS Release |
|||
As of 2019 the encryption feature is also fully integrated into OpenZFS 0.8.0 available for Debian and Ubuntu Linux distributions.<ref>{{Cite web|url=https://wiki.debian.org/ZFS#Encryption|title=ZFS - Debian Wiki|website=wiki.debian.org|access-date=2019-12-10|archive-url=https://web.archive.org/web/20190908104724/https://wiki.debian.org/ZFS#Encryption|archive-date=September 8, 2019|url-status=live}}</ref> |
|||
! Significant changes |
|||
=== Read/write efficiency === |
|||
ZFS will automatically allocate data storage across all vdevs in a pool (and all devices in each vdev) in a way that generally maximises the performance of the pool. ZFS will also update its write strategy to take account of new disks added to a pool, when they are added. |
|||
As a general rule, ZFS allocates writes across vdevs based on the free space in each vdev, so vdevs that hold proportionately less data receive more of the new writes. As the pool fills, this helps prevent some vdevs from becoming full while others are not, which would force writes onto a limited number of devices. It also means that when data is read (and reads are much more frequent than writes in most uses), different parts of the data can be read from as many disks as possible at the same time, giving much higher read performance. Pools and vdevs should therefore be managed, and new storage added, so that a pool does not end up with some vdevs almost full and others almost empty, as this makes the pool less efficient. |
|||
===Other features=== |
|||
====Storage devices, spares, and quotas==== |
|||
Pools can have hot spares to compensate for failing disks. When mirroring, block devices can be grouped according to physical chassis, so that the filesystem can continue in the case of the failure of an entire chassis. |
|||
Storage pool composition is not limited to similar devices, but can consist of ad-hoc, heterogeneous collections of devices, which ZFS seamlessly pools together, subsequently doling out space to {{Clarify|reason=I thought ZFS *was* a filesystem?|text=diverse filesystems|date=October 2016}} as needed. Arbitrary storage device types can be added to existing pools to expand their size.<ref>{{cite web |url=http://download.intel.com/design/flash/nand/SolarisZFS_SolutionBrief.pdf |title=Solaris ZFS Enables Hybrid Storage Pools—Shatters Economic and Performance Barriers |publisher=Sun.com |date=September 7, 2010 |accessdate=November 4, 2011 |archive-url=https://web.archive.org/web/20111017204544/http://download.intel.com/design/flash/nand/SolarisZFS_SolutionBrief.pdf |archive-date=October 17, 2011 |url-status=live }}</ref> |
|||
The storage capacity of all vdevs is available to all of the file system instances in the zpool. A [[Disk quota|quota]] can be set to limit the amount of space a file system instance can occupy, and a [[disk reservation|reservation]] can be set to guarantee that space will be available to a file system instance. |
|||
====Caching mechanisms: ARC, L2ARC, Transaction groups, ZIL, SLOG, Special VDEV==== |
|||
ZFS uses different layers of disk cache to speed up read and write operations. Ideally, all data should be stored in RAM, but that is usually too expensive. Therefore, data is automatically cached in a hierarchy to optimize performance versus cost;<ref>{{cite web | url = http://dtrace.org/blogs/brendan/2008/07/22/zfs-l2arc/ | first = Brendan | last = Gregg | work = Brendan's blog | title = ZFS L2ARC | publisher = Dtrace.org | date = | accessdate = 2012-10-05 | archive-url = https://web.archive.org/web/20111106031228/http://dtrace.org/blogs/brendan/2008/07/22/zfs-l2arc/ | archive-date = November 6, 2011 | url-status = live }}</ref> these are often called "hybrid storage pools".<ref>{{cite web | url = http://dtrace.org/blogs/brendan/2009/10/08/hybrid-storage-pool-top-speeds/ | date = 2009-10-08 | first = Brendan | last = Gregg | work = Brendan's blog | title = Hybrid Storage Pool: Top Speeds | publisher = Dtrace.org | access-date = August 15, 2017 | archive-url = https://web.archive.org/web/20160405120351/http://dtrace.org/blogs/brendan/2009/10/08/hybrid-storage-pool-top-speeds/ | archive-date = April 5, 2016 | url-status = live }}</ref> Frequently accessed data will be stored in RAM, and less frequently accessed data can be stored on slower media, such as [[solid state drive]]s (SSDs). Data that is not often accessed is not cached and left on the slow hard drives. If old data is suddenly read a lot, ZFS will automatically move it to SSDs or to RAM. |
|||
ZFS caching mechanisms include one each for reads and writes, and in each case, two levels of caching can exist, one in computer memory (RAM) and one on fast storage (usually [[solid state drive]]s (SSDs)), for a total of four caches. |
|||
{| class=wikitable |
|||
|- |
|- |
||
! {{no|29}} |
|||
! |
|||
| style="white-space:nowrap;" | Solaris Nevada b148 |
|||
! Where stored |
|||
| RAID-Z/mirror hybrid allocator |
|||
! Read cache |
|||
! Write cache |
|||
|- |
|- |
||
! {{no|30}} |
|||
| First level cache |
|||
| style="white-space:nowrap;" | Solaris Nevada b149 |
|||
| In RAM |
|||
| ZFS encryption |
|||
| Known as '''ARC''', due to its use of a variant of the [[adaptive replacement cache]] (ARC) algorithm. RAM will always be used for caching, thus this level is always present. The efficiency of the ARC [[algorithm]] means that disks will often not need to be accessed, provided the ARC size is sufficiently large. If RAM is too small there will hardly be any ARC at all; in this case, ZFS always needs to access the underlying disks which impacts performance considerably. |
|||
| Handled by means of '''"transaction groups"''' – writes are collated over a short period (typically 5 – 30 seconds) up to a given limit, with each group being written to disk ideally while the next group is being collated. This allows writes to be organized more efficiently for the underlying disks at the risk of minor data loss of the most recent transactions upon power interruption or hardware fault. In practice the power loss risk is avoided by ZFS write [[journaling file system|journaling]] and by the SLOG/ZIL second tier write cache pool (see below), so writes will only be lost if a write failure happens at the same time as a total loss of the second tier SLOG pool, and then only when settings related to synchronous writing and SLOG use are set in a way that would allow such a situation to arise. If data is received faster than it can be written, data receipt is paused until the disks can catch up. |
|||
|- |
|- |
||
! {{no|31}} |
|||
| Second level cache |
|||
| style="white-space:nowrap;" | Solaris Nevada b150 |
|||
| On fast storage devices (which can be added or removed from a "live" system without disruption in current versions of ZFS, although not always in older versions) |
|||
| Improved 'zfs list' performance |
|||
| Known as '''L2ARC''' ("Level 2 ARC"), optional. ZFS will cache as much data in L2ARC as it can, which can be tens or hundreds of [[gigabyte]]s in many cases. L2ARC will also considerably speed up [[Data deduplication|deduplication]] if the entire deduplication table can be cached in L2ARC. It can take several hours to fully populate the L2ARC from empty (before ZFS has decided which data are "hot" and should be cached). If the L2ARC device is lost, all reads will go out to the disks which slows down performance, but nothing else will happen (no data will be lost). |
|||
|- |
|||
| Known as '''SLOG''' or '''ZIL''' ("ZFS Intent Log") – the terms are often used incorrectly. A SLOG (secondary log device) is an optional dedicated cache on a separate device, for recording writes, in the event of a system issue. If a SLOG device exists, it will be used for the ZFS Intent Log as a second level log, and if no separate cache device is provided, the ZIL will be created on the main storage devices instead. The SLOG thus, technically, refers to the dedicated disk to which the ZIL is offloaded, in order to speed up the pool. Strictly speaking, ZFS does not use the SLOG device to cache its disk writes. Rather, it uses SLOG to ensure writes are captured to a permanent storage medium as quickly as possible, so that in the event of power loss or write failure, no data which was acknowledged as written will be lost. The SLOG device allows ZFS to speedily store writes and quickly report them as written, even for storage devices such as [[Hard disk drive|HDDs]] that are much slower. In the normal course of activity, the SLOG is never referred to or read, and it does not act as a cache; its purpose is to safeguard [[data in transit|data in flight]] during the few seconds taken for collation and "writing out", in case the eventual write were to fail. If all goes well, then the storage pool will be updated at some point within the next 5 to 60 seconds, when the current transaction group is written out to disk (see above), at which point the saved writes on the SLOG will simply be ignored and overwritten. If the write eventually fails, or the system suffers a crash or fault preventing its writing, then ZFS can identify all the writes that it has confirmed were written, by reading back the SLOG (the only time it is read from), and use this to completely repair the data loss. |
|||
! {{no|32}} |
|||
| style="white-space:nowrap;" | Solaris Nevada b151 |
|||
This becomes crucial if a large number of synchronous writes take place (such as with [[ESXi]], [[Network File System|NFS]] and some [[database]]s),<ref>{{cite web |url=http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained |title=Solaris ZFS Performance Tuning: Synchronous Writes and the ZIL |publisher=Constantin.glez.de |date=2010-07-20 |accessdate=2012-10-05 |archive-url=https://web.archive.org/web/20120623100347/http://constantin.glez.de/blog/2010/07/solaris-zfs-synchronous-writes-and-zil-explained |archive-date=June 23, 2012 |url-status=live }}</ref> where the client requires confirmation of successful writing before continuing its activity; the SLOG allows ZFS to confirm writing is successful much more quickly than if it had to write to the main store every time, without the risk involved in misleading the client as to the state of data storage. If there is no SLOG device then part of the main data pool will be used for the same purpose, although this is slower. |
|||
| One MB block support |
|||
|- |
|||
If the log device itself is lost, it is possible to lose the latest writes, therefore the log device should be mirrored. In earlier versions of ZFS, loss of the log device could result in loss of the entire zpool, although this is no longer the case. Therefore, one should upgrade ZFS if planning to use a separate log device. |
|||
! {{no|33}} |
|||
| style="white-space:nowrap;" | Solaris Nevada b163 |
|||
| Improved share support |
|||
|- |
|||
! {{no|34}} |
|||
| style="white-space:nowrap;" | Solaris 11.1 (0.5.11-0.175.1.0.0.24.2) |
|||
| Sharing with inheritance |
|||
|- |
|||
! {{no|35}} |
|||
| style="white-space:nowrap;" | Solaris 11.2 (0.5.11-0.175.2.0.0.42.0) |
|||
| Sequential resilver |
|||
|- |
|||
! {{no|36}} |
|||
| style="white-space:nowrap;" | Solaris 11.3 |
|||
| Efficient log block allocation |
|||
|- |
|||
! {{no|37}} |
|||
| style="white-space:nowrap;" | Solaris 11.3 |
|||
| [[LZ4 (compression algorithm)|LZ4 compression]] |
|||
|- |
|||
! {{no|38}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 |
|||
|xcopy with encryption |
|||
|- |
|||
! {{no|39}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 |
|||
|reduce resilver restart |
|||
|- |
|||
! {{no|40}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 |
|||
|Deduplication 2 |
|||
|- |
|||
! {{no|41}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 |
|||
|Asynchronous dataset destroy |
|||
|- |
|||
! {{no|42}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 |
|||
|Reguid: ability to change the pool guid |
|||
|- |
|||
! {{no|43}} |
|||
| style="white-space:nowrap;" | Solaris 11.4, Oracle ZFS Storage Simulator 8.7<ref>{{cite web | url = http://www.oracle.com/technetwork/server-storage/sun-unified-storage/downloads/sun-simulator-1368816.html | title = Oracle ZFS Storage Simulator download | access-date =January 12, 2018 | publisher = Oracle Corporation | year = 2017 | archive-url = https://web.archive.org/web/20180113043800/http://www.oracle.com/technetwork/server-storage/sun-unified-storage/downloads/sun-simulator-1368816.html | archive-date =January 13, 2018 | url-status = live}}</ref> |
|||
|RAID-Z improvements and cloud device support.<ref name="ZFS Pool Versions">{{cite web | url = https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | title = ZFS Pool Versions | access-date =December 18, 2018 | publisher = Oracle Corporation | year =2018 | archive-url = https://web.archive.org/web/20181218194040/https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | archive-date =December 18, 2018 | url-status = live}}</ref> |
|||
|- |
|||
! {{no|44}} |
|||
| style="white-space:nowrap;" | Solaris 11.4<ref name="ZFS Pool Versions"/> |
|||
|Device removal |
|||
|- |
|||
! {{no|45}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 11<ref>{{cite web | url = https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | title = ZFS Pool Versions | access-date =July 24, 2019 | publisher = Oracle Corporation | year = 2019 | archive-url = https://web.archive.org/web/20181218194040/https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | archive-date =December 18, 2018 | url-status = live}}</ref> |
|||
|Lazy deadlists |
|||
|- |
|||
! {{no|46}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 12<ref>{{cite web | url = https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | title = ZFS Pool Versions | access-date =August 20, 2019 | publisher = Oracle Corporation | year = 2019 | archive-url = https://web.archive.org/web/20181218194040/https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | archive-date =December 18, 2018 | url-status = live}}</ref> |
|||
|Compact file metadata for encryption |
|||
|- |
|||
! {{no|47}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 21<ref>{{cite web | url = https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | title = ZFS Pool Versions | access-date =May 23, 2020 | publisher = Oracle Corporation | year = 2020 | archive-url = https://web.archive.org/web/20181218194040/https://docs.oracle.com/cd/E37838_01/html/E61017/gjxle.html | archive-date =December 18, 2018 | url-status = live}}</ref> |
|||
|Property Support for ZVOLs |
|||
|- |
|||
! {{no|48}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 45 |
|||
| File retention support<ref name="zpool-2022">{{cite web | url = https://docs.oracle.com/en/operating-systems/solaris/oracle-solaris/11.4/manage-zfs/zfs-pool-versions.html | title = ZFS Pool Versions | access-date = Jan 1, 2023 | publisher = Oracle Corporation | year = 2022 | archive-url = https://web.archive.org/web/20221221174928/https://docs.oracle.com/en/operating-systems/solaris/oracle-solaris/11.4/manage-zfs/zfs-pool-versions.html | archive-date = December 21, 2022 | url-status = live}}</ref> |
|||
|- |
|||
! {{no|49}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 51 |
|||
| Unicode versioning support<ref name="zpool-2022"/> |
|||
|- |
|||
! {{no|50}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 57 |
|||
| Raw crypto replication<ref name="zpool-2023">{{cite web | url = https://docs.oracle.com/en/operating-systems/solaris/oracle-solaris/11.4/manage-zfs/zfs-pool-versions.html | title = ZFS Pool Versions | access-date = Nov 17, 2023 | publisher = Oracle Corporation | year = 2023 }}</ref> |
|||
|- |
|||
! {{Proprietary|51}} |
|||
| style="white-space:nowrap;" | Solaris 11.4 SRU 63 |
|||
| 'onexpiry' options for file retention<ref name="zpool-2023"/> |
|||
|} |
|} |
||
A number of other caches, cache divisions, and queues also exist within ZFS. For example, each VDEV has its own data cache, and the ARC cache is divided between data stored by the user and [[metadata]] used by ZFS, with control over the balance between these. |
|||
=====Special VDEV Class===== |
|||
In ZFS 0.8 and later, it is possible to configure a Special VDEV class to preferentially store filesystem metadata, and optionally the Data Deduplication Table (DDT) and small filesystem blocks. This makes it possible, for example, to create a Special VDEV on fast solid-state storage to hold the metadata while the regular file data is stored on spinning disks. This speeds up metadata-intensive operations such as filesystem traversal, scrub, and resilver, without the expense of storing the entire filesystem on solid-state storage. |
|||
====Copy-on-write transactional model==== |
|||
ZFS uses a [[copy-on-write]] [[Transaction processing|transactional]] [[object model]]. All block pointers within the filesystem contain a 256-bit [[checksum]] or 256-bit [[Cryptographic hash function|hash]] (currently a choice between [[Fletcher's checksum|Fletcher-2]], [[Fletcher's checksum|Fletcher-4]], or [[SHA-256]])<ref>{{cite web|title=ZFS On-Disk Specification |url=http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf |publisher=Sun Microsystems, Inc. |year=2006 |url-status=dead |archiveurl=https://web.archive.org/web/20081230170058/http://www.opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf |archivedate=December 30, 2008 }} See section 2.4.</ref> of the target block, which is verified when the block is read. Blocks containing active data are never overwritten in place; instead, a new block is allocated, modified data is written to it, then any [[metadata]] blocks referencing it are similarly read, reallocated, and written. To reduce the overhead of this process, multiple updates are grouped into transaction groups, and ZIL ([[intent log]]) write cache is used when synchronous write semantics are required. The blocks are arranged in a tree, as are their checksums (see [[Merkle signature scheme]]). |
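The idea of block pointers that carry the checksum of their target, combined with never overwriting live blocks, can be sketched as follows. This is a toy two-level tree under stated assumptions: the <code>BlockStore</code> class, the JSON encoding of the parent block, and SHA-256 as the only checksum are all hypothetical simplifications, not the ZFS on-disk format.

```python
import hashlib
import json


class BlockStore:
    """Append-only store: a written block is never overwritten in place."""

    def __init__(self):
        self.blocks = {}
        self.next_addr = 0

    def write(self, payload: bytes):
        """Allocate a new block and return its pointer (address, checksum)."""
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = payload
        return addr, hashlib.sha256(payload).hexdigest()

    def read(self, pointer):
        """Read a block and verify it against the checksum in the pointer."""
        addr, expected = pointer
        payload = self.blocks[addr]
        if hashlib.sha256(payload).hexdigest() != expected:
            raise IOError(f"checksum mismatch at block {addr}")
        return payload


def write_tree(store, leaves):
    """Write leaf blocks, then a parent block that holds their pointers."""
    pointers = [store.write(leaf) for leaf in leaves]
    return store.write(json.dumps(pointers).encode()), pointers


def update_leaf(store, pointers, index, new_data):
    """Copy-on-write update: new leaf, then a new parent referencing it."""
    new_pointers = list(pointers)
    new_pointers[index] = store.write(new_data)
    return store.write(json.dumps(new_pointers).encode()), new_pointers


def read_tree(store, root):
    """Verify the parent, then verify and return every leaf it points to."""
    return [store.read(p) for p in json.loads(store.read(root))]


store = BlockStore()
old_root, ptrs = write_tree(store, [b"alpha", b"beta"])
new_root, _ = update_leaf(store, ptrs, 1, b"BETA")
```

After the update both roots remain readable: <code>old_root</code> still describes the original contents, while <code>new_root</code> shares the unchanged leaf and points at the newly allocated one. Because each checksum lives in the parent rather than next to the data, damage to a leaf is detected when it is read through its pointer.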
|||
====Snapshots and clones==== |
|||
{{Unreferenced section|date=January 2017}} |
|||
An advantage of copy-on-write is that, when ZFS writes new data, the blocks containing the old data can be retained, allowing a [[snapshot (computer storage)|snapshot]] version of the file system to be maintained. ZFS snapshots are consistent (they reflect the entire data as it existed at a single point in time) and can be created extremely quickly, since all the data composing the snapshot is already stored; an entire storage pool is often snapshotted several times per hour. They are also space efficient, since any unchanged data is shared among the file system and its snapshots. Snapshots are inherently read-only, ensuring they will not be modified after creation, although they should not be relied on as a sole means of backup. Entire snapshots can be restored, as can individual files and directories within them. |
|||
Writeable snapshots ("clones") can also be created, resulting in two independent file systems that share a set of blocks. As changes are made to any of the clone file systems, new data blocks are created to reflect those changes, but any unchanged blocks continue to be shared, no matter how many clones exist. This is an implementation of the [[Copy-on-write]] principle. |
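Block sharing between a file system, its snapshots, and its clones can be modelled with a small sketch. It assumes each file system is just a map from logical block number to an immutable block id; the <code>FileSystem</code> class and helper names are hypothetical and ignore everything else a real dataset contains.

```python
import itertools
from types import MappingProxyType

_block_ids = itertools.count()
blocks = {}  # block id -> data; an entry is never modified once written


def allocate(data):
    """Store data in a freshly allocated block and return its id."""
    block_id = next(_block_ids)
    blocks[block_id] = data
    return block_id


class FileSystem:
    def __init__(self, block_map=None):
        # logical block number -> block id
        self.block_map = dict(block_map or {})

    def write(self, block_no, data):
        # Copy-on-write: point at a new block, leave the old one untouched.
        self.block_map[block_no] = allocate(data)

    def read(self, block_no):
        return blocks[self.block_map[block_no]]

    def snapshot(self):
        # Only the map is copied, not the data; the result is read-only.
        return MappingProxyType(dict(self.block_map))


def clone(snapshot):
    """A writeable file system that starts out sharing the snapshot's blocks."""
    return FileSystem(snapshot)


def shared_blocks(map_a, map_b):
    """Number of physical blocks referenced by both block maps."""
    return len(set(map_a.values()) & set(map_b.values()))


fs = FileSystem()
for n, data in enumerate([b"one", b"two", b"three"]):
    fs.write(n, data)

snap = fs.snapshot()
fs.write(1, b"TWO")          # live file system diverges from the snapshot

c = clone(snap)
c.write(2, b"THREE")         # clone diverges independently
```

Taking the snapshot copies no data, and each later write costs only the blocks that actually changed.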
|||
====Sending and receiving snapshots==== |
|||
{{Unreferenced section|date=January 2017}} |
|||
ZFS file systems can be moved to other pools, including pools on remote hosts reached over the network, because the send command creates a stream representation of the file system's state. This stream can either describe the complete contents of the file system at a given snapshot, or it can be a delta between snapshots. Computing the delta stream is very efficient, and its size depends on the number of blocks changed between the snapshots. This provides an efficient strategy for, e.g., synchronizing offsite backups or high availability mirrors of a pool. |
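The difference between a full stream and a delta stream can be sketched with snapshots modelled as plain dictionaries from block number to contents. The stream layout and function names are hypothetical, and the sketch finds changes by comparing contents for simplicity; how ZFS itself identifies changed blocks is not shown here.

```python
def full_stream(snapshot):
    """Stream describing the complete contents of one snapshot."""
    return {"changed": dict(snapshot), "deleted": []}


def incremental_stream(old, new):
    """Stream describing only what differs between two snapshots."""
    changed = {k: v for k, v in new.items() if old.get(k) != v}
    deleted = [k for k in old if k not in new]
    return {"changed": changed, "deleted": deleted}


def receive(target, stream):
    """Apply a stream to a copy of the target and return the result."""
    result = dict(target)
    for block_no in stream["deleted"]:
        result.pop(block_no, None)
    result.update(stream["changed"])
    return result


monday = {0: b"a", 1: b"b", 2: b"c"}
tuesday = {0: b"a", 1: b"B", 3: b"d"}

backup = receive({}, full_stream(monday))        # initial full copy
delta = incremental_stream(monday, tuesday)      # only blocks 1, 2 and 3
backup = receive(backup, delta)                  # backup now matches tuesday
```

The delta carries two changed blocks and one deletion rather than the whole file system, which is why its size tracks the amount of change between the snapshots.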
|||
====Dynamic striping==== |
|||
Dynamic [[Data striping|striping]] across all devices to maximize throughput means that as additional devices are added to the zpool, the stripe width automatically expands to include them; thus, all disks in a pool are used, which balances the write load across them.{{Citation needed|date=January 2017}} |
|||
====Variable block sizes==== |
|||
ZFS uses variable-sized blocks, with 128 KB as the default size. Available features allow the administrator to tune the maximum block size which is used, as certain workloads do not perform well with large blocks. If [[data compression]] is enabled, variable block sizes are used. If a block can be compressed to fit into a smaller block size, the smaller size is used on the disk to use less storage and improve IO throughput (though at the cost of increased CPU use for the compression and decompression operations).<ref>{{cite web |
|||
| url = http://www.slideshare.net/esproul/zfs-nuts-and-bolts |
|||
| title = ZFS Nuts and Bolts |
|||
| date = 2009-05-21 |
|||
| accessdate = 2014-06-08 |
|||
| author = Eric Sproul |
|||
| publisher = slideshare.net |
|||
| pages = 30–31 |
|||
| archive-url = https://web.archive.org/web/20140622215818/http://www.slideshare.net/esproul/zfs-nuts-and-bolts |
|||
| archive-date = June 22, 2014 |
|||
| url-status = live |
|||
}}</ref> |
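The effect of compression on the space a block occupies can be sketched as follows. The sketch assumes a 512-byte allocation unit and uses zlib as a stand-in compressor; both are assumptions for illustration, and <code>on_disk_size</code> is a hypothetical helper rather than a ZFS interface.

```python
import os
import zlib

SECTOR = 512                 # assumed smallest allocation unit, in bytes
RECORD_SIZE = 128 * 1024     # the default block size mentioned above


def on_disk_size(record: bytes) -> int:
    """Bytes allocated for a record: compressed if that helps, else raw."""
    compressed = zlib.compress(record)
    payload = min(len(compressed), len(record))
    # Round up to a whole number of allocation units.
    return -(-payload // SECTOR) * SECTOR


compressible = b"\0" * RECORD_SIZE        # highly repetitive data
incompressible = os.urandom(RECORD_SIZE)  # random data does not compress

small = on_disk_size(compressible)
large = on_disk_size(incompressible)
```

The repetitive record is stored in far less than 128 KB, while the random record gains nothing from compression and is stored at its original size.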
|||
====Lightweight filesystem creation==== |
|||
In ZFS, filesystem manipulation within a storage pool is easier than volume manipulation within a traditional filesystem; the time and effort required to create or expand a ZFS filesystem is closer to that of making a new directory than it is to volume manipulation in some other systems.{{Citation needed|date=January 2017}} |
|||
====Adaptive endianness==== |
|||
{{Unreferenced section|date=January 2017}} |
|||
Pools and their associated ZFS file systems can be moved between different platform architectures, including systems implementing different byte orders. The ZFS block pointer format stores filesystem metadata in an [[endianness|endian]]-adaptive way; individual metadata blocks are written with the native byte order of the system writing the block. When reading, if the stored endianness does not match the endianness of the system, the metadata is byte-swapped in memory. |
|||
This does not affect the stored data; as is usual in [[POSIX]] systems, files appear to applications as simple arrays of bytes, so applications creating and reading data remain responsible for doing so in a way independent of the underlying system's endianness. |
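The write-native, swap-on-read approach can be sketched in a few lines. The layout below is hypothetical and is not the ZFS block pointer format: a single flag byte records the writer's byte order, followed by 64-bit integers. The reader decodes in its own native order and byte-swaps in memory only when the flag shows a mismatch.

```python
import struct
import sys

NATIVE = "<" if sys.byteorder == "little" else ">"


def write_metadata(values, order=NATIVE):
    """Serialize 64-bit integers in the writer's byte order, tagged with a flag byte."""
    flag = b"L" if order == "<" else b"B"
    return flag + struct.pack(f"{order}{len(values)}Q", *values)


def byteswap64(value):
    """Reverse the byte order of one unsigned 64-bit integer."""
    return int.from_bytes(value.to_bytes(8, "little"), "big")


def read_metadata(blob):
    """Decode in the reader's native order, swapping only if the writer's order differs."""
    stored_order = "<" if blob[:1] == b"L" else ">"
    count = (len(blob) - 1) // 8
    values = list(struct.unpack(f"{NATIVE}{count}Q", blob[1:]))
    if stored_order != NATIVE:
        values = [byteswap64(v) for v in values]
    return values


sample = [1, 0x0102030405060708, 2**64 - 1]
from_little = read_metadata(write_metadata(sample, "<"))
from_big = read_metadata(write_metadata(sample, ">"))
```

Whichever byte order the writer used, the reader recovers the same values, and no swap is paid for when writer and reader agree.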
|||
====Deduplication==== |
|||
[[Data deduplication]] capabilities were added to the ZFS source repository at the end of October 2009,<ref>{{cite web |url= https://blogs.oracle.com/bonwick/zfs-deduplication-v2 |title= ZFS Deduplication |website= blogs.oracle.com |access-date= November 25, 2019 |archive-url= https://web.archive.org/web/20191224020451/https://blogs.oracle.com/bonwick/zfs-deduplication-v2 |archive-date= December 24, 2019 |url-status= live }}</ref> and relevant OpenSolaris ZFS development packages have been available since December 3, 2009 (build 128). |
|||
Effective use of deduplication may require large RAM capacity; recommendations range between 1 and 5 GB of RAM for every TB of storage.<ref>{{cite web|title=Building ZFS Based Network Attached Storage Using FreeNAS 8|url=http://www.trainsignal.com/blog/zfs-nas-setup-guide|work=TrainSignal Training|publisher=TrainSignal, Inc|accessdate=9 June 2012|author=Gary Sims|format=Blog|date=4 January 2012|archive-url=https://web.archive.org/web/20120507220120/http://www.trainsignal.com/blog/zfs-nas-setup-guide|archive-date=May 7, 2012|url-status=dead}}</ref><ref>{{cite web | url = http://mail.opensolaris.org/pipermail/zfs-discuss/2011-May/048159.html | archiveurl=https://web.archive.org/web/20120425142508/http://mail.opensolaris.org/pipermail/zfs-discuss/2011-May/048159.html | title=[zfs-discuss] Summary: Deduplication Memory Requirements | author=Ray Van Dolson | date=May 2011 | archivedate = 2012-04-25 | publisher=zfs-discuss mailing list }}</ref><ref name="wiki.freebsd.org">{{cite web | url=http://wiki.freebsd.org/ZFSTuningGuide#Deduplication | title=ZFSTuningGuide | access-date=January 3, 2012 | archive-url=https://web.archive.org/web/20120116113648/http://wiki.freebsd.org/ZFSTuningGuide#Deduplication | archive-date=January 16, 2012 | url-status=live }}</ref> An accurate assessment of the memory required for deduplication is made by referring to the number of unique blocks in the pool, and the number of bytes on disk and in RAM ("core") required to store each record—these figures are reported by inbuilt commands such as <tt>zpool</tt> and <tt>zdb</tt>. Insufficient physical memory or lack of ZFS cache can result in virtual memory [[Thrashing (computer science)|thrashing]] when using deduplication, which can cause performance to plummet, or result in complete memory starvation.{{Citation needed|date=January 2013}} Because deduplication occurs at write-time, it is also very CPU-intensive and this can also significantly slow down a system. |
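The sizing rule described above (number of unique blocks multiplied by the in-core bytes needed per record) can be written as a short calculation. The figure of 320 bytes per entry used in the example is an illustrative placeholder, not a value taken from the sources cited here; the real per-entry size and block count should be read from the output of the tools mentioned above. The function names are hypothetical.

```python
def estimate_unique_blocks(unique_data_bytes, average_block_bytes):
    """Approximate how many unique blocks the pool holds."""
    return unique_data_bytes // average_block_bytes


def dedup_table_ram(unique_blocks, bytes_per_entry_in_core):
    """RAM needed to keep one deduplication record per unique block in memory."""
    return unique_blocks * bytes_per_entry_in_core


TIB = 2**40
GIB = 2**30
ASSUMED_ENTRY_BYTES = 320  # placeholder only; use the value reported for the real pool

# 1 TiB of unique data at two different average block sizes.
ram_128k = dedup_table_ram(estimate_unique_blocks(TIB, 128 * 1024), ASSUMED_ENTRY_BYTES)
ram_64k = dedup_table_ram(estimate_unique_blocks(TIB, 64 * 1024), ASSUMED_ENTRY_BYTES)

print(f"128 KiB blocks: {ram_128k / GIB:.1f} GiB per TiB")
print(f" 64 KiB blocks: {ram_64k / GIB:.1f} GiB per TiB")
```

Under these assumptions the estimate comes to 2.5 GiB and 5.0 GiB per TiB respectively, which shows why the requirement rises sharply as the average block size falls.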
|||
Other storage vendors use modified versions of ZFS to achieve very high [[data compression ratio]]s. Two examples in 2012 were GreenBytes<ref>{{Cite news |title= GreenBytes brandishes full-fat clone VDI pumper |author= Chris Mellor |date= October 12, 2012 |work= The Register |url= https://www.theregister.co.uk/2012/10/12/greenbytes_chairman/ |accessdate= August 29, 2013 |archive-url= https://web.archive.org/web/20130324085407/http://www.theregister.co.uk/2012/10/12/greenbytes_chairman/ |archive-date= March 24, 2013 |url-status= live }}</ref> and Tegile.<ref>{{Cite news |title= Newcomer gets out its box, plans to sell it cheaply to all comers |author= Chris Mellor |date= June 1, 2012 |work= The Register |url= https://www.theregister.co.uk/2012/06/01/tegile_zebi/ |accessdate= August 29, 2013 |archive-url= https://web.archive.org/web/20130812033031/http://www.theregister.co.uk/2012/06/01/tegile_zebi/ |archive-date= August 12, 2013 |url-status= live }}</ref> In May 2014, Oracle bought GreenBytes for its ZFS deduplication and replication technology.<ref>{{cite web | url = https://www.theregister.co.uk/2014/12/11/oracle_improving_zfs_dedupe/ | title = Dedupe, dedupe... dedupe, dedupe, dedupe: Oracle polishes ZFS diamond | date = 2014-12-11 | accessdate = 2014-12-17 | author = Chris Mellor | publisher = [[The Register]] | archive-url = https://web.archive.org/web/20170707155821/https://www.theregister.co.uk/2014/12/11/oracle_improving_zfs_dedupe/ | archive-date = July 7, 2017 | url-status = live }}</ref>
|||
As described above, deduplication is usually ''not'' recommended due to its heavy resource requirements (especially RAM) and impact on performance (especially when writing), other than in specific circumstances where the system and data are well-suited to this space-saving technique. |
|||
====Additional capabilities==== |
|||
* Explicit I/O priority with deadline scheduling.{{Citation needed|date=January 2017}} |
|||
* Claimed globally optimal I/O sorting and aggregation.{{Citation needed|date=January 2017}} |
|||
* Multiple independent prefetch streams with automatic length and stride detection.{{Citation needed|date=January 2017}} |
|||
* Parallel, constant-time directory operations.{{Citation needed|date=January 2017}} |
|||
* End-to-end checksumming, using a kind of "[[Data Integrity Field]]", allowing data corruption to be detected (and repaired if the pool has redundancy). Three hashes are available: one optimized for speed (fletcher), one chosen for standardization and security ([[SHA256]]), and a salted hash ([[Skein (hash function)|Skein]]).<ref name="zfschecksums">{{cite web |title= Checksums and Their Use in ZFS |website= github.com |date= Sep 2, 2018 |accessdate= July 11, 2019 |url= https://github.com/zfsonlinux/zfs/wiki/Checksums |archive-url= https://web.archive.org/web/20190719225739/https://github.com/zfsonlinux/zfs/wiki/Checksums |archive-date= July 19, 2019 |url-status= live }}</ref>
|||
* Transparent filesystem compression. Supports [[LZJB]], [[gzip]]<ref>{{cite web|title=Solaris ZFS Administration Guide |work=Chapter 6 Managing ZFS File Systems |accessdate=March 17, 2009 |url=http://download.oracle.com/docs/cd/E19963-01/821-1448/gavwq/index.html |url-status=dead |archiveurl=https://web.archive.org/web/20110205111337/http://download.oracle.com/docs/cd/E19963-01/821-1448/gavwq/index.html |archivedate=February 5, 2011 }}</ref> and [[LZ4 (compression algorithm)|LZ4]]. |
|||
* Intelligent [[Data scrubbing|scrubbing]] and [[Disk mirroring|resilvering]] (resyncing).<ref name="smokinmirrors">{{cite web |title= Smokin' Mirrors |website= blogs.oracle.com |date= May 2, 2006 |accessdate= February 13, 2012 |url= https://blogs.oracle.com/bonwick/entry/smokin_mirrors |archive-url= https://web.archive.org/web/20111216163425/http://blogs.oracle.com/bonwick/entry/smokin_mirrors |archive-date= December 16, 2011 |url-status= live }}</ref> |
|||
* Load and space usage sharing among disks in the pool.<ref>{{cite web | title = ZFS Block Allocation | work = Jeff Bonwick's Weblog | date = November 4, 2006 | accessdate = February 23, 2007 | url = https://blogs.oracle.com/bonwick/entry/zfs_block_allocation | archive-url = https://web.archive.org/web/20121102073644/https://blogs.oracle.com/bonwick/entry/zfs_block_allocation | archive-date = November 2, 2012 | url-status = live }}</ref> |
|||
* Ditto blocks: Configurable data replication per filesystem, with zero, one or two extra copies requested per write for user data, and with that same base number of copies plus one or two for metadata (according to metadata importance).<ref>{{cite web | title = Ditto Blocks — The Amazing Tape Repellent | work = Flippin' off bits Weblog | date = May 12, 2006 | accessdate = March 1, 2007 | url = https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape | archive-url = https://web.archive.org/web/20130526084314/https://blogs.oracle.com/bill/entry/ditto_blocks_the_amazing_tape | archive-date = May 26, 2013 | url-status = dead }}</ref> If the pool has several devices, ZFS tries to replicate over different devices. Ditto blocks are primarily an additional protection against corrupted sectors, not against total disk failure.<ref name="ditto-block-behavior">{{cite web|url=http://opensolaris.org/jive/thread.jspa?messageID=417776 |title=Adding new disks and ditto block behaviour |accessdate=October 19, 2009 |url-status=dead |archiveurl=https://web.archive.org/web/20110823190119/http://opensolaris.org/jive/thread.jspa?messageID=417776 |archivedate=August 23, 2011 }}</ref> |
|||
* ZFS design (copy-on-write + uberblocks) is safe when using disks with write cache enabled, if they honor the write barriers.{{Citation needed|date=January 2017}} This feature provides safety and a performance boost compared with some other filesystems.{{According to whom|date=January 2017}}
|||
* On Solaris, when entire disks are added to a ZFS pool, ZFS automatically enables their write cache. This is not done when ZFS only manages discrete slices of the disk, since it does not know if other slices are managed by non-write-cache safe filesystems, like [[Unix File System|UFS]].{{Citation needed|date=January 2017}} The FreeBSD implementation can handle disk flushes for partitions thanks to its [[GEOM]] framework, and therefore does not suffer from this limitation.{{Citation needed|date=January 2017}} |
|||
* Per-user, per-group, per-project, and per-dataset quota limits.<ref name="per-user-quotas">{{cite web|url=http://www.opensolaris.org/os/community/zfs/version/15/ |title=OpenSolaris.org |publisher=Sun Microsystems |accessdate=May 22, 2009 |url-status=dead |archiveurl=https://web.archive.org/web/20090508081240/http://www.opensolaris.org/os/community/zfs/version/15/ |archivedate=May 8, 2009 }}</ref> |
|||
* Filesystem encryption since Solaris 11 Express<ref name="encryption">{{cite web | url = http://www.oracle.com/technetwork/server-storage/solaris11/documentation/solaris-express-whatsnew-201011-175308.pdf | title = What's new in Solaris 11 Express 2010.11 | publisher = Oracle | accessdate = November 17, 2010 | archive-url = https://web.archive.org/web/20101116073641/http://www.oracle.com/technetwork/server-storage/solaris11/documentation/solaris-express-whatsnew-201011-175308.pdf | archive-date = November 16, 2010 | url-status = live }}</ref> (on some other systems ZFS can utilize encrypted disks for a similar effect; [[Geli (software)|GELI]] on FreeBSD can be used this way to create fully encrypted ZFS storage). |
|||
* Pools can be imported in read-only mode. |
|||
* It is possible to recover data by rolling back entire transactions at the time of importing the zpool.{{Citation needed|date=January 2017}} |
|||
* ZFS is not a [[clustered filesystem]]; however, clustered ZFS is available from third parties.{{citation needed|reason=Citation needed|date=October 2014}}
|||
* Snapshots can be taken manually or automatically. The older versions of the stored data that they contain can be exposed as full read-only file systems. They can also be exposed as historic versions of files and folders when used with [[CIFS]] (also known as [[Server Message Block|SMB, Samba]] or [[file share]]s); this is known as "Previous versions", "VSS shadow copies", or "File history" on [[Windows]], or [[Apple Filing Protocol|AFP]] and "Apple Time Machine" on Apple devices.<ref>{{cite web|url=http://doc.freenas.org/9.3/freenas_sharing.html|title=10. Sharing — FreeNAS User Guide 9.3 Table of Contents|website=doc.freenas.org|access-date=February 23, 2017|archive-url=https://web.archive.org/web/20170107211538/http://doc.freenas.org/9.3/freenas_sharing.html|archive-date=January 7, 2017|url-status=live}}</ref> |
|||
* Disks can be marked as 'spare'. A data pool can be set to automatically and transparently handle disk faults by activating a spare disk and beginning to resilver the data that was on the suspect disk onto it, when needed. |
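The end-to-end checksumming item in the list above can be illustrated with a toy two-way mirror. This is a sketch, not ZFS code: the expected checksum is assumed to be stored apart from the data it covers, SHA-256 is used as the hash, and <code>read_with_self_heal</code> is a hypothetical name. A copy that fails verification is detected, and, because the pool has redundancy, it is rewritten from a copy that passes.

```python
import hashlib


def checksum(data):
    """SHA-256 digest of a block."""
    return hashlib.sha256(data).digest()


def read_with_self_heal(copies, expected):
    """Return verified data and the number of bad copies repaired in place.

    `copies` is a list of bytearrays, one per mirrored device.
    """
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        # No redundancy left: corruption is detected but cannot be repaired.
        raise OSError("all copies failed checksum verification")
    repaired = 0
    for copy in copies:
        if checksum(bytes(copy)) != expected:
            copy[:] = good  # overwrite the damaged copy with verified data
            repaired += 1
    return good, repaired


original = b"payload"
expected = checksum(original)                      # kept separately from the data
mirror = [bytearray(b"pAyload"), bytearray(original)]  # first copy is silently corrupted

data, repaired = read_with_self_heal(mirror, expected)
```

Without a second copy the same check would still detect the corruption, but the read would fail instead of being repaired.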
|||
==Limitations== |
|||
===Limitations in preventing data corruption=== |
|||
The authors of a 2010 study that examined the ability of file systems to detect and prevent data corruption, with particular focus on ZFS, observed that ZFS itself is effective in detecting and correcting data errors on storage devices, but that it assumes data in [[RAM]] is "safe", and not prone to error. The study comments that ''"a single bit flip in memory causes a small but non-negligible percentage of runs to experience a failure"'', with the probability of committing bad data to disk varying from 0% to 3.6% (according to the workload), and that when ZFS [[Page cache|caches]] pages or stores copies of metadata in RAM, or holds data in its "dirty" cache for writing to disk, no test is made whether the checksums still match the data at the point of use.<ref name="zhang2010">{{cite web|url=http://dl.acm.org/citation.cfm?id=1855511.1855514|title=End-to-end Data Integrity for File Systems: A ZFS Case Study|first1=Yupu|last1=Zhang|first2=Abhishek|last2=Rajimwale|first3=Andrea C.|last3=Arpaci-Dusseau|first4=Remzi H.|last4=Arpaci-Dusseau|date=January 2, 2018|publisher=USENIX Association|pages=3|via=ACM Digital Library}}</ref> Much of this risk can be mitigated in one of two ways:
|||
:* According to the authors, by using [[ECC RAM]]; however, the authors considered that adding [[error detection]] related to the page cache and heap would allow ZFS to handle certain classes of error more robustly.<ref name="zhang2010"/> |
|||
:* One of the main architects of ZFS, Matt Ahrens, explains there is an option to enable checksumming of data in memory by using the ZFS_DEBUG_MODIFY flag (zfs_flags=0x10) which addresses these concerns.<ref>{{cite web|url=https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271|title=Ars walkthrough: Using the ZFS next-gen filesystem on Linux|website=arstechnica.com|access-date=June 19, 2017|archive-url=https://web.archive.org/web/20170210103308/https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=26303271#p26303271|archive-date=February 10, 2017|url-status=live}}</ref> |
|||
===Other limitations specific to ZFS=== |
|||
* Capacity expansion is normally achieved by adding groups of disks as a top-level vdev: simple device, [[#RAID-Z|RAID-Z]], RAID Z2, RAID Z3, or mirrored. Newly written data will dynamically start to use all available vdevs. It is also possible to expand the array by iteratively swapping each drive in the array with a bigger drive and waiting for ZFS to self-heal; the heal time will depend on the amount of stored information, not the disk size. |
|||
* As of Solaris 10 Update 11 and Solaris 11.2, it was neither possible to reduce the number of top-level vdevs in a pool, nor to otherwise reduce pool capacity.<ref>{{cite web|url=http://bugs.opensolaris.org/view_bug.do?bug_id=4852783 |title=Bug ID 4852783: reduce pool capacity |publisher=OpenSolaris Project |accessdate=March 28, 2009 |url-status=dead |archiveurl=https://web.archive.org/web/20090629081219/http://bugs.opensolaris.org/view_bug.do?bug_id=4852783 |archivedate=June 29, 2009 }}</ref> This functionality was said to be in development in 2007.<ref>{{cite mailing list |url=http://mail.opensolaris.org/pipermail/zfs-discuss/2007-April/010356.html |title=Permanently removing vdevs from a pool |mailinglist=zfs-discuss |date=April 19, 2007 |first=Mario |last=Goebbels }}{{Dead link|date=January 2020 |bot=InternetArchiveBot |fix-attempted=yes }} [https://marc.info/?l=zfs-discuss&m=122362857630617&w=1 archive link]</ref> Enhancements to allow reduction of vdevs are under development in OpenZFS.<ref name="removal">Chris Siebenmann [https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSPoolShrinkingIsComing Information on future vdev removal] {{Webarchive|url=https://web.archive.org/web/20160811202352/https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSPoolShrinkingIsComing |date=August 11, 2016 }}, Univ Toronto, blog, quote: [https://twitter.com/awreece/status/555533793700765696 informal Twitter announcement by Alex Reece] {{Webarchive|url=https://web.archive.org/web/20160811220752/https://twitter.com/awreece/status/555533793700765696 |date=August 11, 2016 }}</ref>
|||
* As of 2008 it was not possible to add a disk as a column to a RAID Z, RAID Z2 or RAID Z3 vdev. However, a new RAID Z vdev can be created instead and added to the zpool.<ref>{{cite web | url = https://blogs.oracle.com/ahl/entry/expand_o_matic_raid_z | title = Expand-O-Matic RAID Z | publisher = Adam Leventhal | date = April 7, 2008 | access-date = April 16, 2012 | archive-url = https://web.archive.org/web/20111228072550/http://blogs.oracle.com/ahl/entry/expand_o_matic_raid_z | archive-date = December 28, 2011 | url-status = live }}</ref>
|||
* Some traditional nested RAID configurations, such as RAID 51 (a mirror of RAID 5 groups), are not configurable in ZFS. Vdevs can only be composed of raw disks or files, not other vdevs. However, a ZFS pool effectively creates a stripe (RAID 0) across its vdevs, so the equivalent of a RAID 50 or RAID 60 is common. |
|||
* Reconfiguring the number of devices in a top-level vdev requires copying data offline, destroying the pool, and recreating the pool with the new top-level vdev configuration. There are two exceptions: extra redundancy can be added to an existing mirror at any time, and, if all top-level vdevs are mirrors with sufficient redundancy, the <code>zpool split</code><ref>{{cite web|url=http://download.oracle.com/docs/cd/E19253-01/816-5166/zpool-1m/?l=en&n=1&a=view |title=zpool(1M) |publisher=Download.oracle.com |date=June 11, 2010 |accessdate=November 4, 2011}}</ref> command can be used to remove a vdev from each top-level vdev in the pool, creating a second pool with identical data.
|||
* [[IOPS]] performance of a ZFS storage pool can suffer if the ZFS RAID is not appropriately configured. This applies to all types of RAID, in one way or another. If the zpool consists of only one group of disks configured as, say, eight disks in RAID Z2, then the IOPS performance will be that of a single disk (write speed will be equivalent to 6 disks, but random read speed will be similar to a single disk). However, there are ways to mitigate this IOPS performance problem, for instance by adding SSDs as L2ARC cache, which can boost IOPS into the hundreds of thousands.<ref>{{cite web |url=https://blogs.oracle.com/brendan/entry/a_quarter_million_nfs_iops |title=A quarter million NFS IOPS |publisher=Oracle Sun |date=December 2, 2008 |accessdate=January 28, 2012 |author=brendan |archive-url=https://web.archive.org/web/20111217073127/http://blogs.oracle.com/brendan/entry/a_quarter_million_nfs_iops |archive-date=December 17, 2011 |url-status=dead }}</ref> In short, a zpool should consist of several vdevs, each consisting of 8–12 disks, if using RAID Z. It is not recommended to create a zpool with a single large vdev, say 20 disks, because IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).
|||
* Online shrinking with <code>zpool remove</code> was not supported until Solaris 11.4, released in August 2018.<ref>{{Cite web |url=https://docs.oracle.com/cd/E37838_01/html/E60974/dmgmt.html#scrolltoc |title=Data Management Features - What's New in Oracle® Solaris 11.4 |access-date=October 9, 2019 |archive-url=https://web.archive.org/web/20190924101556/https://docs.oracle.com/cd/E37838_01/html/E60974/dmgmt.html#scrolltoc |archive-date=September 24, 2019 |url-status=live }}</ref>
|||
* Resilver (repair) of a failed disk in a ZFS RAID can take a long time. This is not unique to ZFS; it applies to all types of RAID, in one way or another. Very large volumes can take several days to repair or to be brought back to full redundancy after severe data corruption or failure, and during this time a second disk failure may occur, especially as the repair puts additional stress on the system as a whole. Configurations that only allow for recovery from a single disk failure, such as RAID Z1 (similar to RAID 5), should therefore be avoided; with large disks, one should use RAID Z2 (which tolerates two disk failures) or RAID Z3 (which tolerates three).<ref>{{cite web|last=Leventhal|first=Adam|title=Triple-Parity RAID Z|url=http://dtrace.org/blogs/ahl/2009/07/21/triple-parity-raid-z|work=Adam Leventhal's blog|accessdate=19 December 2013|archive-url=https://web.archive.org/web/20110416063604/http://dtrace.org/blogs/ahl/2009/07/21/triple-parity-raid-z/|archive-date=April 16, 2011|url-status=live}}</ref> ZFS RAID differs from conventional RAID by only reconstructing live data and metadata when replacing a disk, not the entirety of the disk including blank and garbage blocks, which means that replacing a member disk on a ZFS pool that is only partially full will take proportionally less time compared to conventional RAID.<ref name="smokinmirrors"/>
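The layout advice in the list above can be turned into a rough comparison. The sketch below rests on two simplifying assumptions: each RAID Z vdev delivers the random IOPS of about one disk, as stated above, and the pool stripes across vdevs so that random IOPS scale with the vdev count. The per-disk figure of 100 IOPS, the 4 TB disk size, and the function name are illustrative only.

```python
def raidz_pool_estimate(total_disks, disks_per_vdev, parity, disk_iops, disk_tb):
    """Rough estimate for a pool built from identical RAID Z vdevs."""
    vdevs = total_disks // disks_per_vdev
    return {
        "vdevs": vdevs,
        # Assumption: one disk's worth of random IOPS per vdev, summed over vdevs.
        "random_iops": vdevs * disk_iops,
        # Parity disks in each vdev do not contribute usable capacity.
        "usable_tb": vdevs * (disks_per_vdev - parity) * disk_tb,
        "tolerated_failures_per_vdev": parity,
    }


# Twenty disks arranged as one wide RAID Z2 vdev or as two 10-disk RAID Z2 vdevs.
one_wide = raidz_pool_estimate(20, 20, parity=2, disk_iops=100, disk_tb=4)
two_narrow = raidz_pool_estimate(20, 10, parity=2, disk_iops=100, disk_tb=4)

for name, layout in (("1 x 20 disks", one_wide), ("2 x 10 disks", two_narrow)):
    print(name, layout)
```

Under these assumptions, splitting the same disks into two vdevs doubles the estimated random IOPS (200 against 100) at the cost of some usable capacity (64 TB against 72 TB).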
|||
== Data recovery == |
|||
Historically, ZFS has not shipped with tools such as [[fsck]] to repair damaged file systems, because the file system itself was designed to self-repair, so long as it had been built with sufficient attention to the design of storage and redundancy of data. If the pool was compromised because of poor hardware, inadequate design or redundancy, or unfortunate mishap, to the point that ZFS was unable to [[mount (computing)|mount]] the pool, traditionally there were no tools which allowed an end-user to attempt partial salvage of the stored data. This led to threads in online forums where ZFS developers sometimes tried to provide ad-hoc help to home and other small scale users, facing loss of data due to their inadequate design or poor system management.<ref name="delphix2018">{{Cite web |url=https://www.delphix.com/blog/openzfs-pool-import-recovery |title=Turbocharging ZFS Data Recovery |access-date=November 29, 2018 |archive-url=https://web.archive.org/web/20181129054344/https://www.delphix.com/blog/openzfs-pool-import-recovery |archive-date=November 29, 2018 |url-status=live }}</ref> |
|||
Modern ZFS has improved considerably on this situation over time, and continues to do so: |
|||
:* Removal or abrupt failure of caching devices no longer causes pool loss. (At worst, loss of the ZIL may lose very recent transactions, but the ZIL does not usually store more than a few seconds' worth of recent transactions. Loss of the L2ARC cache does not affect data.) |
|||
:* If the pool is unmountable, modern versions of ZFS will attempt to identify the most recent consistent point at which the pool can be recovered, at the cost of losing some of the most recent changes to the contents. [[Copy on write]] means that older versions of data, including top-level records and metadata, may still exist even though they are superseded, and if so, the pool can be wound back to a consistent state based on them. The older the data, the more likely it is that at least some blocks have been overwritten and that some data will be irrecoverable, so there is a limit, at some point, on the ability of the pool to be wound back.
|||
:* Informally, tools exist to probe the reason why ZFS is unable to mount a pool, and guide the user or a developer as to manual changes required to force the pool to mount. These include using ''zdb'' (ZFS debug) to find a valid importable point in the pool, using [[dtrace]] or similar to identify the issue causing mount failure, or manually bypassing health checks that cause the mount process to abort, so that the damaged pool can be mounted.
|||
:* As of March 2018, a range of significantly enhanced methods are gradually being rolled out within OpenZFS. These include:<ref name="delphix2018"/> |
|||
::* Code refactoring, and more detailed diagnostic and debug information on mount failures, to simplify diagnosis and fixing of corrupt pool issues; |
|||
::* The ability to trust or distrust the stored pool configuration. This is particularly powerful, as it allows a pool to be mounted even when top-level vdevs are missing or faulty, when top level data is suspect, and also to rewind ''beyond'' a pool configuration change if that change was connected to the problem. Once the corrupt pool is mounted, readable files can be copied for safety, and it may turn out that data can be rebuilt even for missing vdevs, by using copies stored elsewhere in the pool. |
|||
::* The ability to fix the situation where a disk needed in one pool was accidentally removed and added to a different pool, causing it to lose metadata related to the first pool, which becomes unreadable.
|||
==See also== |
|||
{{Portal|Free and open-source software}} |
|||
* [[Comparison of file systems]] |
|||
* [[List of file systems]] |
|||
* [[Versioning file system]]s – List of versioning file systems |
|||
==Notes== |
|||
{{notelist}} |
|||
==References== |
{{Reflist|30em}} |
==Bibliography== |
|||
{{Refbegin}} |
|||
* {{Cite book|date=November 23, 2009 |title=Solaris ZFS Essentials |url=http://www.informit.com/store/product.aspx?isbn=0137000103 |edition=1st |publisher=[[Prentice Hall]] |page=256 |isbn=978-0-13-700010-4 |first1=Scott |last1=Watanabe |url-status=dead |archiveurl=https://web.archive.org/web/20121001091103/http://www.informit.com/store/product.aspx?isbn=0137000103 |archivedate=October 1, 2012 }} |
|||
{{Refend}} |
|||
==External links==

{{Div col|colwidth=30em}}
* [http://www.open-zfs.org/ The OpenZFS Project]
* [https://mauteam.org/blog/infrastructure/40-the-best-cloud-file-system-was-created-before-the-cloud-existed/ The best cloud File System was created before the cloud existed]
* [http://www.i-justblog.com/2009/08/zfs-tip-comparison-of-svm-mirroring-and.html Comparison of SVM mirroring and ZFS mirroring]
* [http://sites.google.com/site/eonstorage/ EON ZFS Storage (NAS) distribution]
* [http://www.zfsonlinux.org/ ZFS on Linux Homepage]
* [http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf End-to-end Data Integrity for File Systems: A ZFS Case Study]
* [https://web.archive.org/web/20130228192209/http://academy.inseptra.com/featured/zfs-the-zettabyte-file-system ZFS – The Zettabyte File System], archived from the original on February 28, 2013
* [http://pages.cs.wisc.edu/~remzi/Classes/736/Fall2007/Projects/BrianKynan/paper.pdf ZFS and RAID-Z: The Über-FS?]
* [http://wiki.illumos.org/download/attachments/1146951/zfs_last.pdf ZFS: The Last Word In File Systems], by Jeff Bonwick and Bill Moore
* [https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/ Visualizing the ZFS intent log (ZIL)], April 2013, by Aaron Toponce
{{Div col end}}
{{Filesystem}} |
|||
{{Sun Microsystems}} |
{{Solaris}} |
{{FreeBSD}} |
|||
{{MacOS}} |
|||
{{Filesystem}} |
|||
{{DEFAULTSORT:Zfs}} |
|||
[[Category:2005 software]] |
|||
[[Category:Compression file systems]] |
[[Category:Disk file systems]] |
[[Category:Formerly open-source or free software]] |
||
[[Category:Formerly free software]] |
|||
[[Category:Oracle software]] |
[[Category:RAID]] |
[[Category:Software using the CDDL license]] |
|||
[[Category:Sun Microsystems software]] |
|||
[[Category:Volume manager]] |
Latest revision as of 00:49, 1 December 2023
Initial release | November 2005, part of OpenSolaris |
---|---|
Stable release | 11.4 SRU53 (Solaris OS)[1] / January 18, 2023 |
Written in | C |
Operating system | Oracle Solaris |
License | Proprietary |
Website | docs |
Oracle ZFS is Oracle's proprietary implementation of the ZFS file system and logical volume manager for Oracle Solaris. ZFS is a registered trademark belonging to Oracle.[2]
History
Solaris 10
In update 2 and later, ZFS is part of Sun's own Solaris 10 operating system and is thus available on both SPARC and x86-based systems.
Solaris 11
After Oracle's Solaris 11 Express release, the OS/Net consolidation (the main OS code) was made proprietary and closed-source,[3] and further ZFS upgrades and implementations inside Solaris (such as encryption) are not compatible with other non-proprietary implementations which use previous versions of ZFS.
When creating a new ZFS pool, to retain the ability to access the pool from other non-proprietary Solaris-based distributions, it is recommended to upgrade to Solaris 11 Express from OpenSolaris (snv_134b), and thereby stay at ZFS version 28.
Future development
On September 2, 2017, Simon Phipps reported that Oracle had laid off virtually all of its Solaris core development staff, interpreting it as a sign that Oracle no longer intends to support future development of the platform.[4]
Version history
Old release |
Latest Proprietary stable release |
ZFS Filesystem Version Number | OS Release | Significant changes |
---|---|---|
6 | Solaris 11.1 | Multilevel file system support[5] |
7 | Solaris 11.4 SRU 45 | File retention support[5] |
8 | Solaris 11.4 SRU 51 | Unicode versioning support[5] |
ZFS Pool Version Number | OS Release | Significant changes |
---|---|---|
29 | Solaris Nevada b148 | RAID-Z/mirror hybrid allocator |
30 | Solaris Nevada b149 | ZFS encryption |
31 | Solaris Nevada b150 | Improved 'zfs list' performance |
32 | Solaris Nevada b151 | One MB block support |
33 | Solaris Nevada b163 | Improved share support |
34 | Solaris 11.1 (0.5.11-0.175.1.0.0.24.2) | Sharing with inheritance |
35 | Solaris 11.2 (0.5.11-0.175.2.0.0.42.0) | Sequential resilver |
36 | Solaris 11.3 | Efficient log block allocation |
37 | Solaris 11.3 | LZ4 compression |
38 | Solaris 11.4 | xcopy with encryption |
39 | Solaris 11.4 | reduce resilver restart |
40 | Solaris 11.4 | Deduplication 2 |
41 | Solaris 11.4 | Asynchronous dataset destroy |
42 | Solaris 11.4 | Reguid: ability to change the pool guid |
43 | Solaris 11.4, Oracle ZFS Storage Simulator 8.7[6] | RAID-Z improvements and cloud device support.[7] |
44 | Solaris 11.4[7] | Device removal |
45 | Solaris 11.4 SRU 11[8] | Lazy deadlists |
46 | Solaris 11.4 SRU 12[9] | Compact file metadata for encryption |
47 | Solaris 11.4 SRU 21[10] | Property Support for ZVOLs |
48 | Solaris 11.4 SRU 45 | File retention support[11] |
49 | Solaris 11.4 SRU 51 | Unicode versioning support[11] |
50 | Solaris 11.4 SRU 57 | Raw crypto replication[12] |
51 | Solaris 11.4 SRU 63 | 'onexpiry' options for file retention[12] |
References
- ^ "Announcing Oracle Solaris 11.4 SRU53". January 18, 2023. Retrieved January 18, 2023.
- ^ "Status Information for Serial Number 85901629 (ZFS)". United States Patent and Trademark Office. Archived from the original on October 21, 2013. Retrieved October 21, 2013.
- ^ "Oracle Has Killed OpenSolaris". Techie Buzz. August 14, 2010. Archived from the original on October 15, 2013. Retrieved July 17, 2013.
- ^ Varghese, Sam (September 4, 2017). "Bye, bye Solaris, it was a nice ride while it lasted". ITWire. Retrieved July 21, 2019.
- ^ a b c "ZFS File System Versions". Oracle Corporation. 2022. Archived from the original on January 2, 2023. Retrieved January 1, 2023.
- ^ "Oracle ZFS Storage Simulator download". Oracle Corporation. 2017. Archived from the original on January 13, 2018. Retrieved January 12, 2018.
- ^ a b "ZFS Pool Versions". Oracle Corporation. 2018. Archived from the original on December 18, 2018. Retrieved December 18, 2018.
- ^ "ZFS Pool Versions". Oracle Corporation. 2019. Archived from the original on December 18, 2018. Retrieved July 24, 2019.
- ^ "ZFS Pool Versions". Oracle Corporation. 2019. Archived from the original on December 18, 2018. Retrieved August 20, 2019.
- ^ "ZFS Pool Versions". Oracle Corporation. 2020. Archived from the original on December 18, 2018. Retrieved May 23, 2020.
- ^ a b "ZFS Pool Versions". Oracle Corporation. 2022. Archived from the original on December 21, 2022. Retrieved January 1, 2023.
- ^ a b "ZFS Pool Versions". Oracle Corporation. 2023. Retrieved November 17, 2023.