Monday, August 22, 2022

How macOS tracks file metadata on non-Mac storage

This article is based on a discussion thread on TidBITS Talk: Finding Type/Creator Tags in Old Mac Document Files.

In this thread, one reader asked how Macintosh file metadata (e.g. a file's type and creator codes) is preserved when a file is copied to non-Macintosh storage (e.g. a Windows file server or a FAT-formatted hard drive). He observed that he can copy a file to a Windows server, then copy/move the file to several different locations from Windows, and then copy the file back to a Mac with the metadata still intact.

Here is the result of my analysis.

Since the first days of the Macintosh, files were never as simple as they are on other computers. In addition to "standard" metadata like a file name, size and basic permissions (e.g. read-only, hidden), Mac files almost always contain additional metadata including:

  • File type. A 4-byte number (usually expressed as four text characters) identifying the type of data stored in the file. For example, "TEXT" for text files, "WXBN" for Microsoft Word XML files (aka "docx"), "FMP7" for FileMaker Pro version 7 and "MPG3" for MP3 audio.
  • Creator. A 4-byte number (usually expressed as four text characters) identifying the application that created the file. For example, "ttxt" for the old "TeachText" text editor and "MSWD" for Microsoft Word.
  • Dates. For creation and last modification.
  • Icon. How the document is presented on your desktop. An icon will normally be taken from its associated application, but it can be overridden on a per-document basis.
  • Locked. Makes the file read-only. On modern versions of macOS, this is distinct from the Unix-level file permissions.
  • Stationery pad. If this flag is set then the document is treated as a template. Opening it will cause the application to create a new document with the file's content.
  • Comments. Users can type arbitrary text, which is stored with the file.
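
Because type and creator codes are really 4-byte numbers, converting between the four-character form and the numeric form is only a matter of byte packing. The following Python sketch illustrates the idea. The function names are hypothetical, and it assumes the codes are stored big-endian and use the Mac OS Roman character set.

```python
def fourcc_to_int(code: str) -> int:
    """Convert a four-character type/creator code to its 32-bit value."""
    raw = code.encode("mac_roman")  # assumption: codes use Mac OS Roman
    if len(raw) != 4:
        raise ValueError("a type/creator code must be exactly four bytes")
    return int.from_bytes(raw, "big")


def int_to_fourcc(value: int) -> str:
    """Convert a 32-bit value back to its four-character form."""
    return value.to_bytes(4, "big").decode("mac_roman")


print(hex(fourcc_to_int("TEXT")))   # 0x54455854
print(int_to_fourcc(0x4D535744))    # MSWD
```

Note that the codes are case-sensitive: "ttxt" and "TTXT" are different numbers.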

Additionally, Mac file systems support "forks". These are additional streams of data that are stored along with the file's normal content. Originally, macOS only supported two forks: the data fork and the resource fork. The data fork is considered a file's main content (and on other operating systems that don't support forks, it is the entire file content).

The resource fork is a sort of database to store "resources" - objects like bitmaps, icons, fonts, string tables, and other kinds of support data. Resources are indexed by a type (a 4-byte number represented as four text characters) and a numeric index. Classic MacOS (versions 0 through 9) even used resources to store an application's executable code, as a set of "CODE" resources stored in the application's resource fork.

Today, much of this is vestigial. Modern Mac applications don't use the resource fork (they use a completely different mechanism for accessing resources, which is beyond the scope of this article).

Even the type/creator codes are barely used today. Modern Mac systems identify a file's type based on an extension to its filename, just as other operating systems do. The system associates an application with a file type, and that application is launched when the user double-clicks on a document's icon, unless the user manually overrides that default using the Finder. But the codes are still used if the file name has no extension or if the extension is unknown. In that case, the default application will be the one that created the file (identified by its creator code), and if that application is unavailable, any other application that advertises support for the file type.

But Finder metadata isn't completely unused. MacOS stores other kinds of metadata (e.g. a "quarantine" flag to identify files that should be checked for malware before opening) with files, and it does this using "extended attributes". These attributes are stored as Finder metadata and in secondary forks (not the legacy resource fork, but other ones created for the purpose). Because this metadata is still important, it should be preserved when the file is copied to non-Mac systems, and if you try it, you will see that it is preserved. But how is this done?

Disk images, archive files and special encodings

In the early days of the Macintosh, it was the user's responsibility to make sure that Finder metadata, the resource fork and other similar data would be preserved when copying files to non-Mac systems. Simply copying a file to a non-Mac system would typically result in the loss of this data, the result of which could be catastrophic (e.g. for an application, whose content consists almost entirely of resources in the resource fork). For this reason, various text and binary encodings were invented in order to preserve this data, and Mac users (especially those exchanging files over BBSs and the Internet) were expected to run application software in order to encode/decode these formats. Some popular examples include:

  • BinHex (.hqx). A text-based format that encodes all of a Mac file's data in a text-only format suitable for exchanging via e-mail and other 7-bit communication interfaces (e.g. the early CompuServe network).
  • MacBinary. A binary format that combines a file's data fork, resource fork and Finder metadata into a single "normal" file that can be copied to/from non-Mac computers.
  • AppleSingle. Another binary format for combining files. This one was invented by Apple and solves some technical problems with the MacBinary format.
  • AppleDouble. A binary format where the file is stored as two files. The data fork content is stored in one and all other data (Finder info, resource fork, etc.) is stored in another. It was invented for Apple's A/UX Unix platform, to allow Mac files to be stored on a Unix file system in a way they are usable by Unix applications, but without losing Mac-specific content.
  • StuffIt. A very popular commercial (originally shareware) data compression system designed for MacOS, and therefore fully supporting Mac file system data.

Mac software has also been (and continues to be) frequently distributed in the form of disk images. These are (data-fork-only) files that contain an image of a complete file system. Images of Mac file systems (e.g. HFS and HFS+) can natively support all of a Mac file's forks and metadata. These are very popular for distributing collections of documents and application installers.

Finally, since Mac OS X 10.3 ("Panther"), Apple has supported the use of Zip archives for Mac files, including easy integration with the Finder. If Mac files are zipped and unzipped using the Finder, Mac metadata should be archived along with the rest of the content.

Copying Mac files to non-Mac storage devices

With the above background, we can now understand what modern versions of macOS do when you copy a Mac file to a storage device that doesn't support Mac file structures (e.g. a FAT-formatted storage volume).

MacOS uses a variation on the legacy AppleDouble format. When a file is copied to a non-Mac volume (or if a Mac app creates/edits a file on such a volume), two files are created. The first (having the file's normal name) contains its data fork. The second (having a name identical to the first, but prefixed with ._) contains Finder info and all other forks (resource and otherwise).
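
The naming rule for the companion file is simple enough to express in a few lines. This is a minimal Python sketch of that rule only; the helper names are hypothetical and it does not touch the file system.

```python
from pathlib import PurePosixPath


def sidecar_path(path: str) -> str:
    """Return the path of the ._ companion file for a given file."""
    p = PurePosixPath(path)
    return str(p.with_name("._" + p.name))


def is_sidecar(path: str) -> bool:
    """True if the file name looks like a ._ companion file."""
    return PurePosixPath(path).name.startswith("._")


print(sidecar_path("/Volumes/FAT/foo.docx"))  # /Volumes/FAT/._foo.docx
```

A helper like this could be useful in a backup or sync script that needs to keep each file and its companion together.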

If you use the Mac GetFileInfo command from a Terminal session, you can see the file's Finder info. If you run it against a file on a Mac storage volume, you might see something like:

$ GetFileInfo foo.docx 
file: "/Users/.../foo.docx"
type: "WXBN"
creator: "MSWD"
attributes: avbstclinmedz
created: 08/21/2022 18:24:31
modified: 08/21/2022 18:24:31

And if you copy that file to a FAT-formatted volume, you will see the same thing:

$ cd "/Volumes/FAT"

$ GetFileInfo foo.docx 
file: "/Volumes/FAT/foo.docx"
type: "WXBN"
creator: "MSWD"
attributes: avbstclinmedz
created: 08/21/2022 18:24:31
modified: 08/21/2022 18:24:31

So the metadata is being preserved, and it is being stored in the file's corresponding ._ file. You can see this if you perform a hex dump of its content:

$ ls -la ._*
-rwxrwxrwx  1 ...  staff  4096 Aug 21 18:29 ._foo.docx
  
$ hexdump -C ._foo.docx
00000000  00 05 16 07 00 02 00 00  4d 61 63 20 4f 53 20 58  |........Mac OS X|
00000010  20 20 20 20 20 20 20 20  00 02 00 00 00 09 00 00  |        ........|
00000020  00 32 00 00 0e b0 00 00  00 02 00 00 0e e2 00 00  |.2..............|
00000030  01 1e 57 58 42 4e 4d 53  57 44 00 00 00 00 00 00  |..WXBNMSWD......|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000050  00 00 00 00 41 54 54 52  00 00 00 01 00 00 0e e2  |....ATTR........|
00000060  00 00 01 34 00 00 00 b3  00 00 00 00 00 00 00 00  |...4............|
00000070  00 00 00 00 00 00 00 04  00 00 01 34 00 00 00 20  |...........4... |
00000080  00 00 15 63 6f 6d 2e 61  70 70 6c 65 2e 71 75 61  |...com.apple.qua|
00000090  72 61 6e 74 69 6e 65 00  00 00 01 54 00 00 00 10  |rantine....T....|
000000a0  00 00 1a 63 6f 6d 2e 61  70 70 6c 65 2e 6c 61 73  |...com.apple.las|
000000b0  74 75 73 65 64 64 61 74  65 23 50 53 00 00 00 00  |tuseddate#PS....|
000000c0  00 00 01 64 00 00 00 2a  00 00 24 63 6f 6d 2e 61  |...d...*..$com.a|
000000d0  70 70 6c 65 2e 6d 65 74  61 64 61 74 61 3a 5f 6b  |pple.metadata:_k|
000000e0  4d 44 49 74 65 6d 55 73  65 72 54 61 67 73 00 00  |MDItemUserTags..|
000000f0  00 00 01 8e 00 00 00 59  00 00 37 63 6f 6d 2e 61  |.......Y..7com.a|
00000100  70 70 6c 65 2e 6d 65 74  61 64 61 74 61 3a 6b 4d  |pple.metadata:kM|
00000110  44 4c 61 62 65 6c 5f 6f  66 66 32 74 33 34 64 33  |DLabel_off2t34d3|
00000120  75 74 70 35 6f 37 76 73  77 70 70 66 61 72 68 72  |utp5o7vswppfarhr|
00000130  79 00 00 00 30 30 38 32  3b 36 33 30 32 62 30 39  |y...0082;6302b09|
00000140  66 3b 4d 69 63 72 6f 73  6f 66 74 5c 78 32 30 57  |f;Microsoft\x20W|
00000150  6f 72 64 3b 9f b0 02 63  00 00 00 00 66 9a 4f 18  |ord;...c....f.O.|
00000160  00 00 00 00 62 70 6c 69  73 74 30 30 a0 08 00 00  |....bplist00....|
00000170  00 00 00 00 01 01 00 00  00 00 00 00 00 01 00 00  |................|
00000180  00 00 00 00 00 00 00 00  00 00 00 00 00 09 f2 8a  |................|
00000190  e8 14 f6 73 bd 4a 7d d8  21 e3 ac e3 1c 5d ff 2b  |...s.J}.!....].+|
000001a0  b6 18 5c 7e 64 9f bf 7a  8a 7b 7a 8f 03 ff 38 03  |..\~d..z.{z...8.|
000001b0  d6 b5 40 7f c9 68 33 2c  8f f6 35 80 70 77 42 5f  |..@..h3,..5.pwB_|
000001c0  0d ae 68 66 f7 f1 fe 6e  0b c5 eb 43 7a 50 93 95  |..hf...n...CzP..|
000001d0  bb 40 65 df 61 ee 12 82  f5 77 79 1d a8 ed 86 a7  |.@e.a....wy.....|
000001e0  fa 4d 2c d2 2d 7a 4b 00  00 00 00 00 00 00 00 00  |.M,.-zK.........|
000001f0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000ee0  00 00 00 00 01 00 00 00  01 00 00 00 00 00 00 00  |................|
00000ef0  00 1e 54 68 69 73 20 72  65 73 6f 75 72 63 65 20  |..This resource |
00000f00  66 6f 72 6b 20 69 6e 74  65 6e 74 69 6f 6e 61 6c  |fork intentional|
00000f10  6c 79 20 6c 65 66 74 20  62 6c 61 6e 6b 20 20 20  |ly left blank   |
00000f20  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000fe0  00 00 00 00 01 00 00 00  01 00 00 00 00 00 00 00  |................|
00000ff0  00 1e 00 00 00 00 00 00  00 00 00 1c 00 1e ff ff  |................|
00001000

Although this is a big blob of binary data, there are some interesting things you can notice:

  • The type and creator codes are here (at offset 0x32): WXBN is the file type and MSWD is the creator.
  • You can see the names of four extended attributes:
    • com.apple.quarantine (name starting at offset 0x83)
    • com.apple.lastuseddate#PS (name starting at offset 0xa3)
    • com.apple.metadata:_kMDItemUserTags (name starting at offset 0xcb)
    • com.apple.metadata:kMDLabel_off2t34d3utp5o7vswppfarhry (name starting at offset 0xfb)
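
The start of the dump can also be decoded programmatically. The sketch below assumes the standard AppleDouble header layout (a 4-byte magic number, a 4-byte version, 16 filler bytes, a 2-byte entry count, then 12-byte entry descriptors holding an ID, an offset and a length, all big-endian), with entry ID 9 meaning Finder info and ID 2 meaning the resource fork. The function name is hypothetical, and the sample is the first 58 bytes of the hex dump above.

```python
import struct

APPLEDOUBLE_MAGIC = 0x00051607
ENTRY_NAMES = {2: "resource fork", 9: "Finder info"}  # the two IDs in the dump


def parse_appledouble_header(blob: bytes) -> dict:
    """Parse the fixed header and entry table of a ._ file."""
    magic, version, filler, count = struct.unpack_from(">II16sH", blob, 0)
    if magic != APPLEDOUBLE_MAGIC:
        raise ValueError("not an AppleDouble file")
    entries = {}
    for i in range(count):
        # Each descriptor is three 32-bit fields: ID, offset, length.
        entry_id, offset, length = struct.unpack_from(">III", blob, 26 + 12 * i)
        entries[entry_id] = (offset, length)
    info = {"version": version, "filler": filler.decode("ascii"), "entries": entries}
    if 9 in entries:
        # The Finder info entry begins with the type and creator codes.
        file_type, creator = struct.unpack_from(">4s4s", blob, entries[9][0])
        info["type"] = file_type.decode("mac_roman")
        info["creator"] = creator.decode("mac_roman")
    return info


# First 58 bytes of the hex dump above.
SAMPLE = bytes.fromhex(
    "00051607000200004d6163204f532058"
    "20202020202020200002000000090000"
    "003200000eb00000000200000ee20000"
    "011e5758424e4d535744"
)

info = parse_appledouble_header(SAMPLE)
print(info["type"], info["creator"])
for entry_id, (offset, length) in info["entries"].items():
    print(ENTRY_NAMES.get(entry_id, entry_id), hex(offset), length)
```

The two entries account for the whole file: the Finder info entry runs from 0x32 to 0xee2, and the resource fork entry runs from 0xee2 to 0x1000, which matches the 4096-byte size shown by ls.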

As long as the file's corresponding ._ file always accompanies the original file when it is moved/copied, the metadata will move/copy with it. If, however, this file should get lost (e.g. the original file moved without it, or if it should be deleted), then the metadata will be lost. For example:

$ rm ._foo.docx 

$ GetFileInfo foo.docx 
file: "/Volumes/FAT/foo.docx"
type: "\0\0\0\0"
creator: "\0\0\0\0"
attributes: avbstclinmedz
created: 08/21/2022 18:24:31
modified: 08/21/2022 18:24:31

Notice how the type/creator information is no longer available. That's because it was stored in the ._ file (along with other metadata), so when that file gets deleted, so does its content.
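
The "other metadata" includes the extended attribute table that is visible in the earlier hex dump. The sketch below lists its entries. The layout it assumes is inferred from that dump and should be treated as an assumption rather than a documented format: an ATTR header at offset 0x54, a 2-byte attribute count at offset 0x76, and then entries consisting of a 4-byte data offset, a 4-byte data length, 2 flag bytes, a 1-byte name length and a NUL-terminated name, each entry padded to a 4-byte boundary. The function name is hypothetical, and the sample is the first 0x134 bytes of the dump.

```python
import struct

ATTR_HEADER_OFFSET = 0x54  # assumption: header follows Finder info plus 2 pad bytes


def list_xattrs(blob: bytes) -> list:
    """Return (name, data_offset, data_length) for each attribute entry."""
    magic = blob[ATTR_HEADER_OFFSET:ATTR_HEADER_OFFSET + 4]
    if magic != b"ATTR":
        return []
    # Skip magic, tag, three size fields (20 bytes) and 12 reserved bytes.
    _flags, count = struct.unpack_from(">HH", blob, ATTR_HEADER_OFFSET + 32)
    pos = ATTR_HEADER_OFFSET + 36
    found = []
    for _ in range(count):
        offset, length, _eflags, name_len = struct.unpack_from(">IIHB", blob, pos)
        raw_name = blob[pos + 11:pos + 11 + name_len]
        found.append((raw_name.rstrip(b"\0").decode("utf-8"), offset, length))
        pos += (11 + name_len + 3) & ~3  # round entry size up to 4 bytes
    return found


# First 0x134 bytes of the hex dump shown earlier.
SAMPLE = bytes.fromhex(
    "00051607000200004d6163204f532058"
    "20202020202020200002000000090000"
    "003200000eb00000000200000ee20000"
    "011e5758424e4d535744000000000000"
    "00000000000000000000000000000000"
    "00000000415454520000000100000ee2"
    "00000134000000b30000000000000000"
    "00000000000000040000013400000020"
    "000015636f6d2e6170706c652e717561"
    "72616e74696e65000000015400000010"
    "00001a636f6d2e6170706c652e6c6173"
    "74757365646461746523505300000000"
    "000001640000002a000024636f6d2e61"
    "70706c652e6d657461646174613a5f6b"
    "4d444974656d55736572546167730000"
    "0000018e00000059000037636f6d2e61"
    "70706c652e6d657461646174613a6b4d"
    "444c6162656c5f6f6666327433346433"
    "757470356f3776737770706661726872"
    "79000000"
)

for name, offset, length in list_xattrs(SAMPLE):
    print(f"{name}: {length} bytes at {hex(offset)}")
```

Under these assumptions, the sketch reports four attributes with data lengths of 32, 16, 42 and 89 bytes.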

Copying to non-Mac file systems that support forks

Interestingly enough, it doesn't always work this way. For example, if you copy the file to an NTFS volume that is shared by a modern version of Windows (e.g. Windows 10), you will not find any ._ file stored alongside the original file. But you can copy/move it all over the Windows file system and when you copy it back to the Mac (or open it via the Windows file share), the metadata will present itself.

So what's going on here?

The answer is that the NTFS file system (used by most Windows installations these days) supports multiple forks, just like Apple's file systems do. Microsoft calls these forks Alternate Data Streams or ADS. When macOS copies a file to a network volume and the server reports that it supports ADS, macOS will store the file's metadata (Finder info, extended attributes and forks) as alternate data streams associated with the original file. These streams are typically kept hidden from users, but you can see them if you know where to look.

If you use the DIR command without any special options, you will not see them:

C:\Users\...\tmp> dir
 Volume in drive C has no label.
 Volume Serial Number is DA9A-72B5

 Directory of C:\Users\...\tmp

08/22/2022  11:27    <DIR>          .
08/22/2022  11:27    <DIR>          ..
08/22/2022  11:26            11,878 foo.docx
               1 File(s)         11,878 bytes
               2 Dir(s)  56,669,786,112 bytes free

But if you use the /R option, then you will be able to see five alternate data streams, in addition to the original file:

C:\Users\...\tmp> dir /r
 Volume in drive C has no label.
 Volume Serial Number is DA9A-72B5

 Directory of C:\Users\...\tmp

08/22/2022  11:27    <DIR>          .
08/22/2022  11:27    <DIR>          ..
08/22/2022  11:26            11,878 foo.docx
                                 60 foo.docx:AFP_AfpInfo:$DATA
                                 16 foo.docx:com.apple.lastuseddate#PS:$DATA
                                 89 foo.docx:com.apple.metadatakMDLabel_off2t34d3utp5o7vswppfarhry:$DATA
                                 42 foo.docx:com.apple.metadata_kMDItemUserTags:$DATA
                                 32 foo.docx:com.apple.quarantine:$DATA
               1 File(s)         11,878 bytes
               2 Dir(s)  56,669,786,112 bytes free

Applications that are ADS-aware can open these alternate data streams and read them as if they were separate files. The name is what you see in the directory listing, but without the :$DATA suffix. One application bundled with Windows that is ADS-aware (and can therefore read these streams) is Notepad:

C:\Users\...\tmp> notepad foo.docx:AFP_AfpInfo

And if you do this, you will see that the AFP_AfpInfo stream contains the Finder info, including the type and creator codes.

IMPORTANT: Do not save this file! These alternate data streams contain binary data and Notepad is a text editor. If you try to save the stream, you will probably corrupt its content.

The other named streams correspond to the four extended attributes we saw in the hex dump of the ._ file. The only difference between each stream's name and its attribute's name is that each : character has been replaced with a private-use Unicode character (U+F022), because colons are illegal characters in Windows file names. That character may not display at all, which is why the two com.apple.metadata stream names in the listing above appear to have no separator.
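
The name mapping described above can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the colon substitution and the file:stream path syntax used in the Notepad example. It does not read or write any streams.

```python
PRIVATE_USE_COLON = "\uf022"  # stands in for ':' in stream names


def xattr_to_stream_name(attr_name: str) -> str:
    """Map an extended attribute name to its alternate data stream name."""
    return attr_name.replace(":", PRIVATE_USE_COLON)


def stream_to_xattr_name(stream_name: str) -> str:
    """Map a stream name back to the extended attribute name."""
    return stream_name.replace(PRIVATE_USE_COLON, ":")


def stream_path(file_name: str, attr_name: str) -> str:
    """Build the file:stream path an ADS-aware application would open."""
    return f"{file_name}:{xattr_to_stream_name(attr_name)}"


name = xattr_to_stream_name("com.apple.metadata:_kMDItemUserTags")
print(ascii(name))  # 'com.apple.metadata\uf022_kMDItemUserTags'
print(stream_path("foo.docx", "com.apple.quarantine"))  # foo.docx:com.apple.quarantine
```

Attribute names without a colon, such as com.apple.quarantine, pass through unchanged.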
