William Lee Irwin III [interview] recently announced on the lkml that he'd successfully gotten Linux running on a 64GB x86 server. His posts included two different boot message logs, one without his page clustering patch, and one with. In the latter case, his patch works around the 1GB kernel virtual address space limit on 32-bit x86 servers; without it, mem_map overruns the space available to the kernel.
Bill's current efforts are based upon Hugh Dickins' earlier page clustering patches for the 2.4.x kernel. Hugh's efforts were actually focused on allowing larger filesystem block sizes, prompting Bill to say, "The fact it resolves the horror of mem_map[] overrunning kernel virtualspace on i386 PAE is really an obscure coincidence." Bill's patch is still a work in progress, but with time will offer a number of additional benefits beyond the support of 64GB x86 servers. For example, utilizing the entire software page in fault handlers results in prefaulting benefits, and increasing the physical contiguity of data results in I/O throughput benefits. For now, however, Bill warns that "until it is done it will have severe performance problems on small memory machines (say, less than 16GB)."
I approached Bill, asking questions to better understand what he was working toward. He replied with a wealth of information, including several ASCII diagrams and lengthy explanations. To summarize, he offered:
"Without pgcl, 64GB is a doorstop, because in /proc/meminfo LowTotal: was a mere 176MB and so incapable of supporting any significant loads. With pgcl, 64GB functions quite nicely, because LowTotal is 750MB and has room for all the kernel bloat that should be there (but things that shouldn't still need to be fixed)."
For the complete details, read on.
"The difference on meminfo is very simple. Page clustering, as the vendors of large x86 systems have an interest in it, is a method of reducing the space consumed by mem_map. In general (say, on 64-bit cpus) this isn't a problem, because the kernel can address as much physical memory as anyone cares to burn on mem_map. _But_ 32-bit systems are limited to 1GB of virtualspace. There are three ways to fix it. One is to make kernelspace and userspace totally disjoint, which involves so much TLB overhead it's not worth it. Another is to change the "split" between kernelspace and userspace, which is an ABI violation and (as the ABI violation implies) unacceptable to various important userspace apps. The third is page clustering, which by reducing the number of pieces of memory the kernel keeps track of shrinks the array of pages to fit nicely into a 1GB kernel virtualspace without taking anything away from userspace. Page clustering has many other uses beyond 64GB PAE, like prefaulting and larger fs blocksize. The 64GB PAE stuff is merely the focus of my funding source, IBM, which produces such machines.
"Or, put very simply, instead of shrinking the size of struct page, I shrank the number of struct pages. And so I've defeated the issue with the size of struct page for all time by being able to shrink the total space used by all the struct pages by any constant factor I choose.
"It's important to note that Hugh Dickins already did this once for 2.4.x."
To better understand how page clustering works, here is a pictorial representation (PTE = Page Table Entry):
-----------------------------------------   page clustering turns the
|              struct page              |   relationship between base
-----------------------------------------   pages and ptes into 1:N.
  ^   ^   ^   ^   ^   ^   ^   ^   ^   ^     struct pages remain of the
  |   |   |   |   |   |   |   |   |   |     same size, but track a
-----------------------------------------   larger area and are fewer
|PTE|PTE|PTE|PTE|PTE|PTE|PTE|PTE|PTE|PTE|   in number. ptes still point
-----------------------------------------   to the same size areas.
Currently, Bill notes that his code is just a prototype and has the following problem:
"When a fault is taken, a 64KB (or whatever, the factor's arbitrary) chunk of memory is handed out. But only 4KB was "asked for". So in the prototype code, the following situation arises:
-------------------------------------------------------------
                            page
-------------------------------------------------------------
piece | piece | piece | piece | piece | piece | piece | piece
-------------------------------------------------------------
   \
    \
     \
      \
       \
        \
         \
          \
           \
            \
             \
              \
               \
                \
-------------------------------------------------------------
 PTE  |  PTE  |  PTE  |  PTE  |  PTE  |  PTE  |  PTE  |  PTE
-------------------------------------------------------------
"So you only use one 4KB "piece" out of every 64KB (or 32KB or whatever -- it's compile-time configurable, I even made a config option for it) that is asked for when taking a page fault. This is not the way the patch is supposed to work. It is a result of not being done writing the code.
"The thing that has to happen to turn this into "production" code (which hugh's 2.4.x code did) is something like this:
-------------------------------
             page
-------------------------------
piece | piece | piece | piece |
-------------------------------
   \       \       \       \
    \       \       \       \
     \       \       \       \
      \       \       \       \
       \       \       \       \
        \       \       \       \
         \       \       \       \
-------------------------------------------------------------
 PTE  |  PTE  |  PTE  |  PTE  |  PTE  |  PTE  |  PTE  |  PTE
-------------------------------------------------------------
"This is the "fragmentation" issue. I'm not 100% done fixing it. Hugh's 2.4.6 and 2.4.7 code to do page clustering, which can be found at ftp://ftp.veritas.com/linux/ did this properly and it's my goal to do this properly by the time of a "final release". Until it is done it will have severe performance problems on small memory machines (say, less than 16GB)."
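A toy model makes the cost of the unfinished path concrete. The sketch below is not Bill's or Hugh's code: the function names and the 64KB/4KB sizes are illustrative assumptions. It counts how many software pages are consumed when each fault wastes the rest of its page (the prototype behaviour described above) versus when the pieces from successive faults are packed into shared pages, which is an idealised best case.

```python
# Toy model of the "fragmentation" issue. All names here are hypothetical.
PAGE_SIZE = 64 * 1024        # software page (assumed clustering factor of 16)
MMUPAGE_SIZE = 4 * 1024      # piece mapped by a single PTE
PIECES_PER_PAGE = PAGE_SIZE // MMUPAGE_SIZE


def pages_used_prototype(faults):
    """Each fault takes a whole software page but maps only one piece of it."""
    return faults


def pages_used_packed(faults):
    """Pieces from successive faults share software pages (ceiling division)."""
    return -(-faults // PIECES_PER_PAGE)


def utilization(faults, pages):
    """Fraction of the allocated memory that is actually mapped by PTEs."""
    return (faults * MMUPAGE_SIZE) / (pages * PAGE_SIZE)


faults = 1000
for name, fn in (("prototype", pages_used_prototype), ("packed", pages_used_packed)):
    pages = fn(faults)
    print(f"{name:9s}: {pages:4d} pages, {utilization(faults, pages):.1%} of allocated memory in use")
```

With a factor of 16, the prototype path leaves fifteen sixteenths of every allocated page unused, which is consistent with Bill's warning that machines without memory to spare will suffer until the packing is implemented.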
The following thread from the lkml offers more detail on Bill's current page clustering efforts. His final response below is a must-read, providing a thorough explanation:
From: Zwane Mwaikambo
Subject: Re: 64GB NUMA-Q after pgcl
Date: Fri, 28 Mar 2003 02:45:30 -0500 (EST)
before:
Memory: 65306956k/67100672k available (1724k kernel code, 98252k reserved, 781k data, \
284k init, 65134592k highmem)
after:
Memory: 65946144k/67100672k available (1956k kernel code, 15936k reserved, 667k data, \
300k init, 65198080k highmem)
Would you mind explaining the details as to what would cause that
discrepancy in reserved memory size?
Zwane
--
function.linuxpower.ca
From: William Lee Irwin III
Subject: Re: 64GB NUMA-Q after pgcl
Date: Thu, 27 Mar 2003 23:57:30 -0800
Sure. On NUMA-Q mem_map[] is not allocated using bootmem except for
node 0. Various other bootmem allocations are also proportional to
memory as measured in units of PAGE_SIZE, but not all.
So all we're seeing here is node 0's mem_map[] with "miscellaneous"
bootmem allocations thrown in, whether reduced or increased.
This is not very reflective of what's going on as the majority of mem_map[]
is allocated through a custom reservation mechanism as opposed to bootmem.
-- wli
From: Zwane Mwaikambo
Subject: Re: 64GB NUMA-Q after pgcl
Date: Fri, 28 Mar 2003 03:05:42 -0500 (EST)
On Thu, 27 Mar 2003, William Lee Irwin III wrote:
> Sure. On NUMA-Q mem_map[] is not allocated using bootmem except for
> node 0. Various other bootmem allocations are also proportional to
> memory as measured in units of PAGE_SIZE, but not all.
>
> So all we're seeing here is node 0's mem_map[] with "miscellaneous"
> bootmem allocations thrown in, whether reduced or increased.
>
> This is not very reflective of what's going on as the majority of mem_map[]
> is allocated through a custom reservation mechanism as opposed to bootmem.
Thanks, nice work btw, although the core guts of this stuff is somewhat of
a mystery to some of us ;)
Zwane
--
function.linuxpower.ca
From: William Lee Irwin III
Subject: Re: 64GB NUMA-Q after pgcl
Date: Fri, 28 Mar 2003 02:14:33 -0800
On Thu, 27 Mar 2003, William Lee Irwin III wrote:
>> Sure. On NUMA-Q mem_map[] is not allocated using bootmem except for
>> node 0. Various other bootmem allocations are also proportional to
>> memory as measured in units of PAGE_SIZE, but not all.
>> So all we're seeing here is node 0's mem_map[] with "miscellaneous"
>> bootmem allocations thrown in, whether reduced or increased.
>> This is not very reflective of what's going on as the majority of mem_map[]
>> is allocated through a custom reservation mechanism as opposed to bootmem.
On Fri, Mar 28, 2003 at 03:05:42AM -0500, Zwane Mwaikambo wrote:
> Thanks, nice work btw, although the core guts of this stuff is somewhat of
> a mystery to some of us ;)
The code is still very much of prototype quality, so I'm actually being
somewhat deliberately obscure so those who aren't specifically
interested in hacking or very early testing don't accidentally burn
themselves or otherwise get the impression of a patchkit gone horribly
wrong. And even worse than that, so no one reviews the code before I've
cleaned it up.
The concept is really very simple, although the consequences are far
reaching. The kernel ties together its basic unit of allocation and
accounting, the PAGE_SIZE area and its associated struct page, together
with the notion of a pagetable entry and the size of the area mapped by
a pagetable entry (also called PAGE_SIZE in mainline, made into a
distinct notion of MMUPAGE_SIZE by the patch).
Page clustering is named for the view of the arrangement that a set of
hardware pages is a "cluster" represented by the software accounting
unit. In truth it's closer to symmetry apart from the constraint that
the software unit must be larger than the hardware unit. The net result
of it is that you go around figuring out which of the two units various
bits of code really meant, and for pagetable walks and so on the code
must be taught that it's referring to only a piece of a software page,
or to hand callers the piece they need when they need them.
The fact it resolves the horror of mem_map[] overrunning kernel
virtualspace on i386 PAE is really an obscure coincidence. AIUI Hugh's
2.4.x patch was actually intended to enable larger filesystem block
sizes, and the BSD implementation for the VAX was simply meant to deal
with the fact that even 16B for every 512B hardware page is too large a
fraction of physical memory (not virtual) for page-granularity
accounting to be memory-efficient. For BSD's purposes a relatively
small constant factor sufficed; for i386 a much larger one is required
for workload feasibility as virtualspace approaches the precise
fraction of physical memory that the coremap would otherwise consume.
Various other odd goodnesses are supposed to come of it, for instance,
prefaulting benefits as a side effect of trying to utilize the entire
software page in fault handlers, and io throughput benefits from
increased physical contiguity. My codebase is not prepared for
performance analysis yet, as the fragmentation issues are only
partially resolved. The real point of the posting is to show that this
thing actually makes 64GB work and, of course, to get first the post
on 64GB i386 PAE. =)
With this in hand, we can say "Yes, this solves the problem without
turning critical userspace apps into doorstops by stealing address
space from them" and I can resume coding up the final stretch of
functionality and move on to cleanups and maintenance of the patch
until the devel cycle comes to the point where it's ready for a merge.
I'd not be surprised if some vendor and/or distro interest is provoked,
and I'll do my best to help them along (if desired) once the patch is in
good enough shape wrt. functionality and clean enough to deliver to them.
-- wli
From: John Levon
Subject: Re: 64GB NUMA-Q after pgcl
Date: Fri, 28 Mar 2003 17:38:01 +0000
On Fri, Mar 28, 2003 at 02:14:33AM -0800, William Lee Irwin III wrote:
> Various other odd goodnesses are supposed to come of it, for instance,
> prefaulting benefits as a side effect of trying to utilize the entire
> ...
> thing actually makes 64GB work and, of course, to get first the post
> on 64GB i386 PAE. =)
Thanks for the explanation, and congratulations :)
regards,
john
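The split between the two units that Bill describes in his long reply can be illustrated with a little address arithmetic. The sketch borrows the names PAGE_SIZE and MMUPAGE_SIZE from his message; the helper function, the other constant names, the shift values and the 32KB page size are assumptions made for the example and are not taken from the patch.

```python
# Which software page, and which piece of it, does a physical address fall in?
MMUPAGE_SHIFT = 12                       # 4KB hardware page on i386
PAGE_MMUSHIFT = 3                        # ASSUMED: 8 pieces per software page
PAGE_SHIFT = MMUPAGE_SHIFT + PAGE_MMUSHIFT

MMUPAGE_SIZE = 1 << MMUPAGE_SHIFT        # area mapped by one PTE
PAGE_SIZE = 1 << PAGE_SHIFT              # area tracked by one struct page
PAGE_MMUCOUNT = 1 << PAGE_MMUSHIFT       # pieces per software page


def locate(phys_addr):
    """Return (struct page index, piece index within that page, byte offset within the piece)."""
    page_index = phys_addr >> PAGE_SHIFT
    piece = (phys_addr >> MMUPAGE_SHIFT) & (PAGE_MMUCOUNT - 1)
    offset = phys_addr & (MMUPAGE_SIZE - 1)
    return page_index, piece, offset


# Two addresses 4KB apart: different PTE-sized pieces, same struct page.
print(locate(0x12345678))
print(locate(0x12345678 + MMUPAGE_SIZE))
```

Code that walks pagetables cares about the piece index and MMUPAGE_SIZE, while allocation and accounting code cares only about the struct page index and PAGE_SIZE. Sorting out which of the two a given piece of kernel code "really meant" is the bulk of the work Bill describes.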
I want one....
Daddy, when I grow up, I want to have his computer!
Re: I want one.....
You probably want another one; these things are a royal pain in the butt to run Linux on, for highmem, physical, and other reasons.
Me neither...
Can you say "stop-gap measure"? Given that a number of 64-bit processors are already on the market and that Opteron is just around the corner, why bother? If you need that much memory right now, use a cluster.
Re: me neither
As nice as all the 64-bit machines are, one machine existing does not make another work.
64GB i386 machines have been shipping and selling for about 5 years. The one I used for verification was almost that old itself. I didn't put my head in the sand. I had something to get working, and I got it working. The 64GB boxen are already out there and I'd rather they run Linux than certain other OS's.
And clusters don't perform well for the workloads these machines are meant to support. The shared memory model is crucial to the performance of the workloads meant to run on the things. If it weren't crucial, price/performance alone would have gravitated the users to clusters already.
don't worry...
I talked to wli on irc and he said he was working on making the patch useable for small systems like yours. Soon you won't need that kind of hardware to take advantage of page clustering!
woohoo!
Re: don't worry
It should be usable and useful on small systems when done, but there are limits as to what is useful to do. akpm warns me, for instance, that the buffer bitblitting in prepare_write() will grow in cost with PAGE_SIZE, and that various filesystems are not prepared to do the tail packing necessary for very large blocksizes yet. Truly large PAGE_SIZE values may only ever be useful for PAE; smaller systems may want something larger than they have now, but not very much larger. It's difficult to guess at the exact numbers, but some experimentation with Hugh's 2.4.7 patch should give a decent idea of which PAGE_SIZE values are worthwhile. I've at least run it on my laptop with 32KB PAGE_SIZE and it passed the "touchy feely" tests there, but that's not a particularly rigorous performance analysis. I've not used it on the larger machines as there is version skew between arch support and 2.4.7 and neither page clustering nor the arch support port easily.
There may very well be some overlap between the PAE and small system cases, though, since current PAE PAGE_SIZE values are actually not very large. PAGE_SIZE is easily configurable in my patch, so once it's more stable and performant in general, it should be easy to experiment and determine which values of PAGE_SIZE are best for performance.
I am a kernel newbie but
From what I understand, this limit is for x86-32 systems like Intel Xeons. From what I see, it is very complex and not as clean as x86-64 from AMD. No wonder Linus wants x86-64 to be built by Intel (Yamhill, anyone?) too. Transmeta has the license to produce the emulation, and VIA (Cyrix) has that too.
Re: I am a kernel newbie but
x86-64 is an extension of IA32, so by definition it can be no less complex than what it's an extension of. =) This isn't an effective forum for carrying out architectural debates, though, so let's not get too far into it.
There isn't really a specific limitation, in truth it's theoretically possible (i.e. with hardware changes) to just widen the PTE's beyond reason until it'd take more funding than the space program and the entire defense budget combined to get an operating system to run on it. There's enough room left in IA32 PTE's for a lot more. But current cpus thankfully only do 36-bit physical addressing, and I hope they stay that way until 64-bit virtualspace is available.
64GB doesn't seem to require special handling other than page clustering (which is useful for other things besides making 64GB work), stable page replacement, and shoving small amounts of per-process data into highmem, so I'd say it's within reason.
Doesn't x86-64 provide 64-bit virtual space?
Doesn't x86-64 provide 64-bit virtual space? If so, then wouldn't x86-64 simplify some of the highmem issues? (I say /some/ because you still have to deal w/ grotty PC hardware, so you run into magic boundaries at 1MB, 16MB, and 4GB.)
I guess it's still somewhat of an issue for legacy apps running in their own 32-bit memory space, but then we're not trying to expose > 4GB at a time to them in their private address spaces anyway, are we?
Re: Doesn't x86-64 provide 64-bit virtual space?
Of course, none of these virtualspace exhaustion issues occur for x86-64, but this is not x86-64. All of the SS5's in the world will not make my Sun3/60 boot... analogously, all the x86-64's in the world won't do a thing for these IA32 boxen.
And actually the mem_map[] space overhead is worth some trimming down even on x86-64 as it has 64-bit pointers which make mem_map[] about twice as large as on IA32. I've been in regular contact with various x86-64 people about my progress on this so it can later be used for that purpose on x86-64.