The Tux3 free block map is a dynamically allocated tree, and this introduces a recursion: what happens when we try to allocate a block but the leaf of the tree that maps the block does not exist yet, has to be allocated, and we want to allocate it on that same free map leaf that does not yet exist? Answer: infinite recursion ending with stack overflow, unless the recursion is limited in some way. I considered two obvious approaches: * Do not record allocations of free tree leaves or nodes in the free tree itself, but log them using a logical log entry as already planned for the atomic commit mechanism. * Use bitmaps for now, not extents. Preallocate the whole free map without using an allocator, deciding where to place the blocks by an ad hoc method. Once fully preallocated, enter the btree block addresses into the free map "by hand" as ddsnap does. Then I thought: since I am going to allocate with bitmaps at first anyway, why not put the bitmaps in a file instead of a special purpose btree. Then the bitmap blocks can be accessed through the file cache (aka page cache in kernel, or logical block cache as I have sometimes called it). The really cool thing about this strategy is, it unexpectedly solves the above recursion. Changes to the bitmap are made in the cache, which is a simple logically indexed array. This naturally defers the actual allocations until the cache is flushed to disk. Woohoo, this is nice. Now what about when we go to extent based free space mapping? Ok, that is easy. There will be a separate free map btree with extents at the leaves. That btree could be mapped into a file just like an htree directory index, but that would not avoid the recursion. This is where the logical logging method described above can be used. (It is not yet clear to me whether it is best to map the free extent btree into a file or have it separate, at the same level in the Tux3 structure as the inode table.) An extent based free map is efficient only for large allocations, which are possible only when free space is not fragmented. Sometimes free space will be so fragmented in a given region that it would be more efficient to use a bitmap for that region. So I plan to implement both methods and combine them in Tux3. There will be running statistics to keep track of fragmentation per 128 MB region (the size that can be mapped by a single 4K bitmap block). If fragmentation rises above some threshold then several blocks of the free extent map will be replaced by a specially marked pointer that is the logical block number of a bitmap block in the bitmap file. Likewise, if it falls below some threshold across several regions then some bitmap blocks will be replaced by a single extent leaf. What do we mean by "large allocations"? First consider a maximally fragmented situation where every second block in an 8 block region is free. That can be represented in one byte with a bitmap, but requires four 12 byte extents (8 byte pointer + count, 4 byte logical index overhead per the dleaf design) for a total of 48 bytes, plus some per leaf overhead. A factor of 50 difference in storage efficiency, and that is huge. Now where is the crossover point where extents become more efficient that a bitmap? If every 128th block is free in a 512 block region, that also requires 48 bytes to represent with extents, but 64 bytes with bitmaps. The crossover point is somewhere between average run size of 64 and 128 free blocks in a given region. The same logic applies to runs of used blocks. By keeping statistics of run size per 128 MB region we know that a region can be represented more efficiently by extents if the run size is well above 128 and by a bitmap if it is well below 64. In between it does not matter much, which provides a lazy band to keep the representation from changing too frequently. Now it is time to implement the bitmap version of Tux3 free map. Regards, Daniel _______________________________________________ Tux3 mailing list Tux3@tux3.org http://tux3.org/cgi-bin/mailman/listinfo/tux3
