File Format
The Hash Cache is stored in a binary file format for efficiency. The file structure uses a structure of arrays approach for better compression.
Notation
Assume any bit packed values are sequential, i.e. if u4
then u4
is specified, first u4
is the upper 4 bits
.
The file, including all packed fields is little-endian
. It is written out when total number of
bits aligns with a power of 2.
u6
+u12
is 2 byteslittle-endian
u15
+u17
is 4 byteslittle-endian
u26
+u22
+ u16 is 8 byteslittle-endian
u6
+u11
+ u17 is 4 byteslittle-endian
, not 2+2
File Layout
[Header]
[Partial Hashes Section]
[Full Hashes Section]
[Path Hashes Section]
[File Write Times Section]
[Paths Section] [Optional]
-
- Describes the rest of the data in the file.
-
Partial Hashes Section
- An array of
u64
values, each representing the XXH3 hash of the first 4096 bytes (or whole file if smaller) for each file.
- An array of
-
Full Hashes Section
- An array of
u64
values, each representing the XXH3 hash of the entire file for each file.
- An array of
-
Path Hashes Section
- An array of
u64
values, each representing the XXH3 hash of the file path. - File paths are sanitized.
- An array of
-
Modify Times Section
- An array of
u64
values, each representing last write time of a file. - This uses the Windows' FILETIME as the time stamp.
- An array of
-
Paths Section
Header
The File Hash Cache begins with the following header.
Header (8 bytes):
u3
: Version Numberu5
: Flagsu24
: Number of Entriesu32
: Padding (Reserved)
Current version is 0
Bit Flags
Bits are in order A0000
A
: True if theFile Paths
section is present.
Other bits are currently unused.
Modify Time Format
This uses the Windows' FILETIME structure on all platforms.
This is the time in 100ns intervals since 12:00 A.M. January 1, 1601 Coordinated Universal Time (UTC).
The reasoning for using this on nix
is a bit subtle. The native type on most nix
systems is timespec,
however this type can have inconsistent size. For example on 32-bit Linux systems it would likely be 8 bytes
,
and susceptible to the 2038 problem. On 64-bit systems it would be 16 bytes
; which is inconvenient for
space efficiency. etc.
Therefore we default to Windows' definition on all platforms.
// Constants for epoch difference and conversion
// EPOCH_DIFFERENCE is the number of seconds between the Windows epoch (1601-01-01)
// and the Unix epoch (1970-01-01)
const EPOCH_DIFFERENCE: i64 = 11_644_473_600;
const NANOS_PER_SEC: i64 = 1_000_000_000;
const WINDOWS_TICK: i64 = 100; // 100 nanoseconds
/// Converts a Unix timespec (seconds and nanoseconds) to a Windows timestamp
/// (100-nanosecond intervals since January 1, 1601 UTC)
///
/// This function handles both positive and negative timestamps, accounting for the difference
/// between Unix epoch (1970-01-01) and Windows epoch (1601-01-01).
///
/// Note: This function does not perform explicit overflow checks for performance reasons.
/// It should handle most practical timestamp values, but extreme values may cause overflow.
///
/// # Arguments
/// The unix timespec components.
///
/// * `seconds` - Signed seconds since Unix epoch (January 1, 1970)
/// * `nanoseconds` - Additional nanoseconds (always non-negative)
///
/// # Returns
///
/// The Windows timestamp as an i64 (FILETIME)
fn timespec_to_windows_timestamp(seconds: i64, nanoseconds: u32) -> i64 {
// Convert seconds to Windows ticks
let second_ticks = seconds * (NANOS_PER_SEC / WINDOWS_TICK);
// Convert nanoseconds to Windows ticks
let nano_ticks = (nanoseconds as i64) / WINDOWS_TICK;
// Add epoch difference in ticks
let epoch_diff_ticks = EPOCH_DIFFERENCE * (NANOS_PER_SEC / WINDOWS_TICK);
// Combine all components
second_ticks + nano_ticks + epoch_diff_ticks
}
On x86, this compiles down to around 7 instructions, without division. And around 11 on aarch64.
Above code is untested.
But it should not overflow provided that the input values are within the bounds of Windows' timestamp format.
Paths Section
This section has the following layout
[Header]
[Name Lengths Section]
[Names Section]
The paths follow the file path sanitization rules. They have no null terminators.
Header
4 bytes:
u3
: Version Numberu5
: Flagsu24
: Compressed Size
The flags (A0000
) for this section are:
A
: True if the Name Lengths section usesu16
instead ofu8
.
Name Lengths Section
This is an array of path lengths in bytes expressed as u8
(s).
(Or u16
if the A
flag in header is set)
Each array entry corresponds to a file at same index in previous sections.
So a value of 255 means the path is 255 bytes long.
Names Section
This is a contiguous block of UTF-8
encoded strings representing the relative paths of all files.
This block is compressed with ZStd
, using single pass function, e.g. ZSTD_compress
so decompressed
size is available in compressed payload.
File Path Sanitization
Each file path is sanitized using the following rules before being hashes.
- Any trailing slashes are removed.
- File paths are converted to upper case.
- All slashes are forward slashes.
File paths are assumed to be case sensitive.