This post was initially written to keep FerrisCraft users apprised of the work toward the launch of FC5, but expanded for the Writing Gaggle, which you should check out.
Also, my personal blog can be found here: https://www.catmonad.xyz/blog.
Consider also reading yesterday’s devlog.
Apparently I’m doing these daily now. (Not really. Don’t expect another one tomorrow.)
While working on the initial design and outline of this tool, I used the fastanvil crate to parse Minecraft region files (the “Anvil format”). It was good enough to let me examine the internal structure of region files, but it does not provide a way to read the already-compressed chunk data, instead only providing a method which creates a buffer of decompressed data.
As I discussed yesterday, I am not going to be implementing delta encoding of chunks in the near future. So, decompressing every chunk is a waste of time, especially when I’m just going to compress it again right after, when I save it in the database.
Instead of looking into fastanvil’s source and perhaps forking the crate in order to add a method which provides direct access to the compressed chunk data, I took the opportunity today to nail down my understanding of the Anvil file format by writing my own parser. I’ll be needing to regenerate region files from the database soon anyway, and so this knowledge was necessary from the start.
This parser is very lazy, and largely assumes that the region file given to it is well-formed. If that assumption is violated, we’ll end up with a panic, which is better than other Bad Things I could name, but probably not relevant anyway.
It works by first running length assertions on a given byte slice, and then casting the reference. And that’s it. Boom:
pub fn parse_ref(region: &[u8]) -> Result<&Self, ()> {
    let Ok(bytes): Result<&[u8; (PAGE_SIZE * 2) as usize], _> =
        region[..(PAGE_SIZE * 2) as usize].try_into()
    else {
        return Err(());
    };

    // Safety: LocationTable and TimestampTable both have the same layout as a [u8; PAGE_SIZE].
    // We set Meta to repr(C), which results in it having the same layout as a
    // [[u8; PAGE_SIZE]; 2], which has the same layout as a [u8; PAGE_SIZE * 2].
    // These two types are therefore reference compatible.
    Ok(unsafe { core::mem::transmute(bytes) })
}
…okay, that’s not the whole story.
I’ve alluded to it already, but we should take a few minutes to examine the format we’re talking about here. Officially titled the “Anvil format”, it dictates how chunk data is saved to permanent storage.
It is not a complicated format, with two main segments: a header made up of the two tables described below (chunk locations and last-modification timestamps), followed by the chunk data itself. All this to save a total of 32×32, or 1024, Minecraft chunks.
Each table is an array of 4-byte entries, with 1024 entries total, matching the number of chunks saved in the file.
#[repr(transparent)]
struct TableEntry {
    bytes: [u8; 4],
}

#[repr(transparent)]
struct Table {
    entries: [TableEntry; 1024],
}
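These two structs are also what the transmute in parse_ref earlier lands on, wrapped up in the Meta type named by its safety comment. The type names below come from that comment; the field names and exact shapes are my reconstruction rather than the actual code:

#[repr(transparent)]
struct LocationTable(Table);

#[repr(transparent)]
struct TimestampTable(Table);

// repr(C) keeps the two tables in declaration order, so Meta has the same
// layout as [[u8; PAGE_SIZE]; 2], matching the front of a region file.
#[repr(C)]
struct Meta {
    locations: LocationTable,
    timestamps: TimestampTable,
}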
The two tables are laid out directly next to each other, making the front of the file look like this: [Table; 2]. But the important thing is how each TableEntry is interpreted.
In the first table, the first three bytes of an entry are used to encode the offset at which a chunk is found in the file, in 4096-byte multiples:
pub fn offset(&self) -> u32 {
    self.page_offset() * PAGE_SIZE
}

pub fn page_offset(&self) -> u32 {
    let [a, b, c, _] = self.entry.bytes;
    u32::from_be_bytes([0, a, b, c])
}
And the fourth byte of each entry is used to encode the amount of space allocated in the file for that chunk, also in 4096-byte multiples:
pub fn size(&self) -> u32 {
    self.page_size() * PAGE_SIZE
}

pub fn page_size(&self) -> u32 {
    let [_, _, _, x] = self.entry.bytes;
    u32::from(x)
}
But which chunk is a TableEntry talking about? That is determined by this mapping, which converts in-world chunk coordinate offsets from the origin of a region to indexes into our Tables:
fn compute_chunk_table_index(x: u8, z: u8) -> usize {
    assert!(x < 32 && z < 32);
    let (x, z) = (x as usize, z as usize);
    x + (z * 32)
}
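For example, the chunk at relative coordinates (x, z) = (2, 3) within a region lands at index 2 + 3 × 32 = 98 in both tables.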
A TableEntry can indicate that there is no data saved for its corresponding chunk by being set to all zero bytes.
Both Tables use this same mapping, and so the last piece of information needed is that entries in the second table are 4-byte big-endian integers which represent modification times:
pub fn time(&self) -> u32 {
    let bytes = self.entry.bytes;
    u32::from_be_bytes(bytes)
}
It occurs to me that all of these accessor methods could be defined directly on the TableEntry struct, but they are instead defined on Location and Timestamp structs which wrap TableEntry, to ensure I don’t mistakenly interpret a member of the location table as a timestamp or vice-versa.
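Those wrappers are as thin as wrappers get; roughly this, though the repr attributes are my assumption rather than something copied from the real code (the entry field name matches the accessors above):

#[repr(transparent)]
struct Location {
    entry: TableEntry,
}

#[repr(transparent)]
struct Timestamp {
    entry: TableEntry,
}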
You might have noticed that each table is 4096 bytes, and offsets into an Anvil file are measured in multiples of 4096 bytes. I have dubbed this unit a “page”, because that word is often used to describe large units of data when people speak of binary formats like this. “Sector” is another common one, but “page” is closer to the front of my mind right now because it’s what SQLite’s documentation uses.
So, the data for chunks starts at a page index of 2, or 8192 bytes into the file.
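For completeness, the PAGE_SIZE constant that keeps showing up in these snippets is just that 4096-byte page; I’m writing it as a u32 here since that matches the accessors above, but the exact type isn’t important:

// One page: the granularity everything in an Anvil region file is measured in.
const PAGE_SIZE: u32 = 4096;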
Each chunk starts with a header with this information: a 4-byte big-endian length (counting the compression byte plus the compressed data that follow), and a single byte identifying the compression scheme. 1 represents gzip, 2 represents zlib, and 3 represents that it is not compressed. Reportedly, only zlib compression is seen in the wild.

/// Get the length of the chunk body.
pub fn len(&self) -> u32 {
    let bytes = self.bytes[..4].try_into().unwrap();
    // The stored length includes the compression scheme byte, so drop it.
    u32::from_be_bytes(bytes) - 1
}

pub fn compression_scheme(&self) -> CompressionScheme {
    let scheme = self.bytes[4];
    match scheme {
        1 => CompressionScheme::Gzip,
        2 => CompressionScheme::Zlib,
        3 => CompressionScheme::Uncompressed,
        _ => panic!("unknown compression scheme"),
    }
}

pub fn data(&self) -> &[u8] {
    &self.bytes[5..]
}
Armed with that knowledge of the Anvil format, and those accessor methods I wrote above, parsing is trivial: cast a byte slice, and then use the table accessor methods to determine chunk indexes and create a ChunkRef which gives us access to whichever chunk it is we’re examining.
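As an illustration of how those pieces compose, here is a hypothetical helper; chunk_slice is not a function from my actual code, just the shape of the lookup:

// Hypothetical glue: given a chunk's Location entry, hand back the pages
// reserved for that chunk. The chunk header accessors from earlier then read
// the length, compression scheme, and data off the front of this slice.
fn chunk_slice<'a>(region: &'a [u8], location: &Location) -> Option<&'a [u8]> {
    // An all-zero entry means no data was ever saved for this chunk.
    if location.entry.bytes == [0; 4] {
        return None;
    }
    let start = location.offset() as usize;
    let end = start + location.size() as usize;
    Some(&region[start..end])
}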
I’ve also defined an iterator for chunks in an Anvil file, which works by walking the first table in the file and handing out chunk references for those chunks which have data associated with them. That will be used in the process of dumping chunks into the content-addressed store I mentioned last time.
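Sketched in the same hypothetical style, yielding plain byte slices instead of the ChunkRefs my real iterator hands out, the idea looks like this:

// Walk the location table and yield the reserved pages for every chunk that
// actually has data saved. (The real iterator wraps these in ChunkRefs.)
fn chunks<'a>(region: &'a [u8], locations: &'a [Location; 1024]) -> impl Iterator<Item = &'a [u8]> {
    locations
        .iter()
        .filter_map(move |location| chunk_slice(region, location))
}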
One step at a time. This is just one part, but we’re that much closer to having a fully usable backup tool. Maybe next devlog I’ll tell you about the object storage format. :)