Morphing the Divina Commedia into byte tokens with Zig
Nothing fancy. I just dumped the Divina Commedia, one token per byte, into a
contiguous u16 slice.
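// tok is the tokenizer module (its import is not shown here).
const path = "commedia.txt";
const buf = try tok.tokenizeFile(allocator, path);
// buf.data was allocated with `allocator`, so free it when the scope exits.
defer allocator.free(buf.data);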
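The body of tok.tokenizeFile is not shown, so here is only a minimal sketch of the byte-to-u16 step it presumably performs. The name tokenizeBytes is hypothetical and not part of the original module. It takes bytes already in memory instead of a path, and it assumes Zig 0.11 or later for the multi-object for loop.

```zig
const std = @import("std");

/// Hypothetical stand-in for the widening step inside tok.tokenizeFile:
/// every input byte becomes one u16 token. The caller owns the result.
fn tokenizeBytes(allocator: std.mem.Allocator, bytes: []const u8) ![]u16 {
    const tokens = try allocator.alloc(u16, bytes.len);
    // u8 widens to u16 implicitly, so no cast is needed.
    for (bytes, tokens) |b, *t| t.* = b;
    return tokens;
}

pub fn main() !void {
    const allocator = std.heap.page_allocator;

    // Ten sample bytes instead of the whole file.
    const tokens = try tokenizeBytes(allocator, "\n  Nel mez");
    defer allocator.free(tokens);

    std.debug.print("tokens: {d}\n", .{tokens.len}); // prints "tokens: 10"
}
```

Since every byte maps to exactly one token, the token count equals the byte count of the input.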
Running it:
$ zig run src/main.zig -- commedia.txt
tokens: 300682 (expected 300682)
head: { 10, 32, 32, 78, 101, 108, 32, 109, 101, 122 }
Just 300682 u16s waiting for an embedding matrix :)
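A quick sanity check on the head printed above is to narrow those ten tokens back to bytes and read them as text. This is just a sketch: tokensToBytes is a hypothetical helper, not something from the original code, and it assumes every token fits in a byte (true for the ten values shown).

```zig
const std = @import("std");

/// Hypothetical helper: narrows u16 tokens back to bytes, writing into `out`.
/// Assumes every token is below 256; @intCast trips a safety check otherwise.
fn tokensToBytes(tokens: []const u16, out: []u8) []const u8 {
    for (tokens, out[0..tokens.len]) |t, *c| c.* = @intCast(t);
    return out[0..tokens.len];
}

pub fn main() void {
    // The ten head tokens copied from the run output.
    const head = [_]u16{ 10, 32, 32, 78, 101, 108, 32, 109, 101, 122 };
    var buf: [head.len]u8 = undefined;

    std.debug.print("{s}\n", .{tokensToBytes(&head, &buf)});
}
```

It prints a newline, two spaces, and "Nel mez", the start of "Nel mezzo del cammin di nostra vita".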