The tux3 code is starting to get a little more interesting now. Another significant part of the kernel vfs machinery is now emulated: the page cache. In userland it does not have to care about pages, so it works in buffers instead, and instead of a special index field we just reinterpret the block field as an index. This is the way the Linux kernel does it anyway (actually, I dimly recall that change came about in Linux as a response to the way I handled directory block access in htree), but this is obscured by the confusion between block and page units of granularity.

The kernel uses a page cache with a direct device mapping to implement the buffer cache, while my code uses the buffer cache to implement the equivalent of the kernel page cache. The latter is a much cleaner way to do things, but is not an option for the kernel until we get around to generalizing page size. Anyway, both in kernel and in my emulation, the block interface is just the venerable BSD standard getblk and bread, with writing done by marking buffers dirty and flushing them out en masse periodically. Easy to understand and get right.

This page cache emulation effort is about making directory operations happen. We want to scan sequentially through an entire directory file looking for an entry or a place to create a new one. So we need a notion of what sequentially means. One way is to walk an inode btree, but that is not the way the kernel does things, and that would be a problem when it comes time to port. What the kernel does is read each block of a directory file into the page cache the first time somebody accesses it, and after that, the cached block can be accessed very efficiently without needing to read it again or having to go messing around in the filesystem metadata. This is really, really powerful.
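
To make the getblk/bread interface concrete, here is a minimal sketch of a userland buffer cache in that style. It is not the tux3 code: the in-memory fake_dev array, the fixed-size hash, and the names mark_dirty and flush_buffers are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_BYTES 512
#define HASH_BUCKETS 64
#define DEV_BLOCKS 16

/* Hypothetical stand-in for a block device: just an array in memory. */
static unsigned char fake_dev[DEV_BLOCKS][BLOCK_BYTES];

struct buffer {
	unsigned long index;	/* block number, reinterpreted as an index */
	int uptodate, dirty;
	struct buffer *next;	/* hash chain */
	unsigned char data[BLOCK_BYTES];
};

static struct buffer *buckets[HASH_BUCKETS];

/* getblk: find the buffer for index, creating an empty one on a miss. */
struct buffer *getblk(unsigned long index)
{
	assert(index < DEV_BLOCKS);
	struct buffer **head = &buckets[index % HASH_BUCKETS];
	for (struct buffer *b = *head; b; b = b->next)
		if (b->index == index)
			return b;
	struct buffer *b = calloc(1, sizeof *b);
	if (!b)
		return NULL;
	b->index = index;
	b->next = *head;
	*head = b;
	return b;
}

/* bread: like getblk, but fill the buffer from the device on first use. */
struct buffer *bread(unsigned long index)
{
	struct buffer *b = getblk(index);
	if (b && !b->uptodate) {
		memcpy(b->data, fake_dev[index], BLOCK_BYTES);
		b->uptodate = 1;
	}
	return b;
}

/* Writing is just: change the data, then mark the buffer dirty. */
void mark_dirty(struct buffer *b)
{
	b->uptodate = 1;
	b->dirty = 1;
}

/* Write every dirty buffer back in one pass; returns how many were written. */
int flush_buffers(void)
{
	int count = 0;
	for (int i = 0; i < HASH_BUCKETS; i++)
		for (struct buffer *b = buckets[i]; b; b = b->next)
			if (b->dirty) {
				memcpy(fake_dev[b->index], b->data, BLOCK_BYTES);
				b->dirty = 0;
				count++;
			}
	return count;
}
```

In this sketch a caller that does bread(3), modifies the data and calls mark_dirty leaves the device untouched until the next flush_buffers pass, and a second getblk(3) hands back the same buffer without going near the device.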
You do not want to go trying to reimplement that kind of caching at the filesystem level, because you will just end up with a lot of code that does not do nearly as good a job as the kernel page cache.

The way the kernel gets a block into the page cache for the first time is to call the filesystem's get_block method to do the logical to physical mapping for each block backing a page cache page. More precisely, it invokes a filesystem method to read a page, which calls a kernel library function that does a callback to the filesystem's supplied get_block function. Pretty well all Linux filesystems use that library function, so we might as well think of this as the vfs calling the filesystem get_block method by a twisty path.

Tux3 is going to do things a little differently. Instead of using that twisty library function, it will just go get the page that the vfs is asking for, eliminating a whole mess of calls back and forth, and in theory doing things somewhat more efficiently by being able to look up all the blocks for a page at once instead of doing a separate get_block for each one. In practice, Linux filesystem blocksize almost always matches the hardware page size (or else performance will suck), so there is only one get_block call per page. If we ever get around to properly supporting huge pages then this will matter a lot. For now it just feels clean.

Now I need tux3_readblock, which gets called for any file cache miss, and tux3_writeblock, which flushes dirty blocks to disk. These are nearly there. In the case of readblock, it is just a rearrangement of the responsibilities of filemap_readblock and the existing tuxread. The filemap_readblock method will probe the file btree to find the physical block before calling diskread, and tuxread, instead of directly doing the btree probe as it does now, will just call bread on the inode mapping. Therefore, tuxread is about to get a whole lot more efficient, because only the first access hits the filesystem metadata.
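
The split of responsibilities on the read side can be sketched roughly as below. This is only an illustration under stated assumptions, not the actual tux3 functions: the names carry a _sketch suffix because their real signatures are not given here, a small array stands in for the file btree, another array stands in for the per-inode buffer hash, and the probes counter exists only to show how often the metadata is consulted.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK_BYTES 512
#define FILE_BLOCKS 4
#define DEV_BLOCKS 32

/* Hypothetical disk: an array of blocks in memory. */
static unsigned char fake_disk[DEV_BLOCKS][BLOCK_BYTES];

struct buffer {
	int uptodate;
	unsigned char data[BLOCK_BYTES];
};

struct inode;
typedef int readblock_fn(struct inode *, unsigned long index, struct buffer *);

struct inode {
	unsigned long physical[FILE_BLOCKS];	/* stand-in for the file btree */
	struct buffer *cache[FILE_BLOCKS];	/* stand-in for the inode mapping */
	readblock_fn *readblock;		/* called only on a cache miss */
	int probes;				/* counts metadata lookups */
};

/* Miss handler: logical to physical lookup, then the device read. */
int filemap_readblock_sketch(struct inode *inode, unsigned long index,
			     struct buffer *b)
{
	inode->probes++;
	unsigned long block = inode->physical[index];
	if (block >= DEV_BLOCKS)
		return -1;
	memcpy(b->data, fake_disk[block], BLOCK_BYTES);
	return 0;
}

/* bread on the inode mapping: only a miss reaches the readblock method. */
struct buffer *file_bread(struct inode *inode, unsigned long index)
{
	if (index >= FILE_BLOCKS)
		return NULL;
	struct buffer *b = inode->cache[index];
	if (!b) {
		b = calloc(1, sizeof *b);
		if (!b)
			return NULL;
		inode->cache[index] = b;
	}
	if (!b->uptodate) {
		if (inode->readblock(inode, index, b) < 0)
			return NULL;
		b->uptodate = 1;
	}
	return b;
}

/* Read path: no btree probe here, just bread and copy. Returns bytes read. */
long tuxread_sketch(struct inode *inode, void *dest, size_t len,
		    unsigned long pos)
{
	unsigned char *out = dest;
	size_t done = 0;
	while (done < len) {
		unsigned long index = (pos + done) / BLOCK_BYTES;
		size_t offset = (pos + done) % BLOCK_BYTES;
		size_t chunk = BLOCK_BYTES - offset;
		if (chunk > len - done)
			chunk = len - done;
		struct buffer *b = file_bread(inode, index);
		if (!b)
			break;
		memcpy(out + done, b->data + offset, chunk);
		done += chunk;
	}
	return (long)done;
}
```

With an inode whose readblock field points at filemap_readblock_sketch, the first read of a block bumps probes once, and a repeated read of the same block leaves the counter unchanged, since the second access is served from the cached buffer.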
Just like in kernel.

The new behavior for tuxwrite will be even nicer: it is now just going to do a getblk in the filemap hash, transfer data onto it and mark the buffer dirty. No filesystem metadata will be touched until it is time to flush dirty blocks to disk. This is "delayed allocation", which is usually a big feature that gets added to a filesystem some time late in its life, if ever, but it just comes for free with the tux3 approach.

Regards,

Daniel
