Erik Jacobson recently posted a proposal to better support NUMA systems by implementing node-level round-robin memory allocation for the large hash tables that the kernel allocates. In his email he offers a concise description of the problem, explaining that the portion of the kernel responsible for allocating these hash tables is not NUMA aware:
"The end result is that the first node in the system is hit harder in terms of memory usage than other nodes. On a very large system (32 or more nodes with 4g of memory per node for example), the first node in the system can have less than half of its total memory available."
This can lead to poor performance for large computational jobs when the memory is not spread equally among all nodes. 2.6 kernel maintainer Andrew Morton commented, "the patch seems a reasonable way of implementing it, but I think your above comment lies at the heart of the issue: those tables are just too darn big." He goes on to point out that there are only three large hash tables left in 2.6, for dentry, inode and TCP data, each requiring "stern examination and benchmarking to decide whether we really are appropriately sizing them on large machines."
From: Erik Jacobson [email blocked]
To: linux-kernel
Subject: available memory imbalance on large NUMA systems
Date: Wed, 12 Nov 2003 09:22:06 -0600

Summary:
--------
We wish to implement node round-robin memory allocation for certain large kernel hash tables allocated during kernel startup on NUMA systems. We are interested in getting a community-accepted solution into the 2.6 kernel.

Background:
-----------
NUMA systems are made of multiple nodes connected together by a fast interconnect to make one large system. Each node has its own set of processors and memory. There is a notion of memory that is close to a node (perhaps memory on the node itself) and memory further away (perhaps located on a different node separated by a router).

When the kernel starts up, certain hash tables are allocated. The routines that allocate these hashes don't know about NUMA systems. They see a large amount of memory on the system and allocate a chunk of it, sometimes based on the size of overall memory available on the system.

The end result is that the first node in the system is hit harder in terms of memory usage than other nodes. On a very large system (32 or more nodes with 4g of memory per node for example), the first node in the system can have less than half of its total memory available.

This imbalance is not desirable for folks wishing to run large computational jobs that depend on memory being available on all nodes. For example, certain large MPI programs may be negatively impacted if they expect to be able to get equal amounts of memory from all nodes.

Example Fix:
------------
To fix this problem, we propose implementing a round-robin memory allocation scheme. We have included an example implementation as a patch to 2.4.21 (attached).

In it, we create a new function in vmalloc.h named alloc_big_struct. It is based on vmalloc (so the resulting memory does go through the page table).
This function can be used to allocate certain kernel hashes such as the page table or the dentry table in place of __get_free_pages().

Now, I understand that this patch would not be accepted by the community as it stands right now. So think of the patch as an example to illustrate my point rather than a polished proposal. In fact, this patch may not cleanly apply to kernel.org 2.4 as-is.

The example makes heavy use of vmalloc for NUMA systems and I understand (from yesterday :) that this isn't necessarily desirable. I think the example patch does still illustrate what we're trying to do.

I guess I'm hoping to be pointed in a direction that will have a fair chance of being accepted into the 2.6 kernel if proposed. Depending on what direction this takes, I or someone else will attempt to implement something.

Here is a detailed list of changes and what they do for the 2.4 example.

mm.h:
    Add a new action modifier, GFP_ROUND_ROBIN. This modifier is used by alloc_area_pte in vmalloc.c. If the bit is set, round-robin allocation is used.

    Add a function called alloc_pages_round_robin and a macro alloc_page_round_robin that calls it. These are meant to mirror alloc_page and alloc_pages.

vmalloc.h:
    Add function named alloc_big_struct. This function takes the place of __get_free_pages when we wish to do round-robin allocation. It takes an order number as input but converts it to a number of bytes as is needed by __vmalloc. When it calls __vmalloc, it ORs GFP_ROUND_ROBIN to the gfpmask so alloc_area_pte knows to do round-robin allocation of memory.

    If the system isn't NUMA, a macro named alloc_big_struct simply calls __get_free_pages.

vmalloc.c:
    alloc_area_pte is adjusted to look for GFP_ROUND_ROBIN. If it's set, alloc_page_round_robin is called. Otherwise, alloc_page is called like before.

numa.c:
    page_cache_alloc is modified (inside an ifdef CONFIG_NUMA) to use the new alloc_page_round_robin support instead of alloc_pages_node. This avoids code duplication.
tcp.c, buffer.c, inode.c, dcache.c:
    When allocating large hash tables, the __get_free_pages call is replaced with alloc_big_struct to spread the memory use across nodes.

Testing
-------
On Altix systems of various sizes, we ran aim7 and compared results. We found almost no difference in performance between the round-robin-enabled kernels and kernels without this fix implemented.

Big Hash Tables
---------------
As a side point, some of the hash tables allocated during startup get very large on large-memory systems (systems with a terabyte of memory for example). Someone may wish to consider implementing a cap on the size of some of these tables. My example doesn't address this issue - it just spreads the load. In fact, I don't have an idea as to what reasonable caps would be on these tables, if any.

The example patch is attached.

--
Erik Jacobson - Linux System Software - Silicon Graphics - Eagan, Minnesota

[patch]
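The round-robin idea described in the email above can be illustrated with a small user-space sketch. This is not the attached patch and does not call any kernel API: the names rr_cursor, rr_next_node, rr_place_pages and order_to_bytes are hypothetical, and the 4096-byte page size is an assumption. It only shows the two mechanical steps the email mentions, converting an allocation order to a byte count and handing out pages to nodes in rotation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size, for illustration only. */
#define SKETCH_PAGE_SIZE 4096UL

/* An allocation of order n covers 2^n pages; convert that to bytes,
 * as the email says alloc_big_struct must do before calling __vmalloc. */
static unsigned long order_to_bytes(unsigned int order)
{
    return SKETCH_PAGE_SIZE << order;
}

/* Hypothetical cursor that remembers which node gets the next page. */
struct rr_cursor {
    unsigned int next_node;
    unsigned int nr_nodes;
};

/* Return the node for the next page and advance, wrapping after the
 * last node so every node is visited in turn. */
static unsigned int rr_next_node(struct rr_cursor *c)
{
    unsigned int node;

    assert(c->nr_nodes > 0);
    node = c->next_node;
    c->next_node = (c->next_node + 1) % c->nr_nodes;
    return node;
}

/* Simulate placing the 2^order pages of one table, one page at a time,
 * and count how many pages land on each node. */
static void rr_place_pages(struct rr_cursor *c, unsigned int order,
                           unsigned long *pages_per_node)
{
    unsigned long nr_pages = 1UL << order;
    unsigned long i;

    for (i = 0; i < nr_pages; i++)
        pages_per_node[rr_next_node(c)]++;
}
```

With four nodes and an order-3 table (eight pages), each node ends up with two pages instead of the first node holding all eight, which is the balance the proposal is after.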
From: Andrew Morton [email blocked]
Subject: Re: available memory imbalance on large NUMA systems
Date: Wed, 12 Nov 2003 13:09:24 -0800

Erik Jacobson [email blocked] wrote:
>
> As a side point, some of the hash tables allocated during startup get very
> large on large-memory systems (systems with a terabyte of memory for example).
> Someone may wish to consider implementing a cap on the size of some of these
> tables.

The patch seems a reasonable way of implementing it, but I think your above comment lies at the heart of the issue: those tables are just too darn big.

Both the pagecache hash table and the buffer_head hash tables were removed from 2.6 (but I suspect the structures which replaced them are all still crammed into the zeroth node?).

That leaves the dentry, inode and TCP hash tables. These need stern examination and benchmarking to decide whether we really are appropriately sizing them on large machines.

If we can get away with just making these sanely sized then the remaining issue is the node-round-robining of pagecache allocations. I don't have an opinion on the desirability of this for NUMA machines in general.
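The sizing concern raised in this reply can be made concrete with a small sketch. It is only an illustration of "size by total memory, but cap it": the function names hash_entries and round_down_pow2, the bytes_per_entry ratio and the max_entries cap are all hypothetical, and neither email says what the real ratios or reasonable caps would be.

```c
#include <assert.h>

/* Round n down to a power of two. Hash tables are commonly sized this
 * way so a bucket index can be taken with a mask instead of a division. */
static unsigned long round_down_pow2(unsigned long n)
{
    unsigned long p = 1;

    assert(n > 0);
    while (p <= n / 2)
        p *= 2;
    return p;
}

/* Hypothetical sizing rule: one entry per bytes_per_entry bytes of
 * system memory, but never more than max_entries. */
static unsigned long hash_entries(unsigned long long mem_bytes,
                                  unsigned long long bytes_per_entry,
                                  unsigned long max_entries)
{
    unsigned long long n;

    assert(bytes_per_entry > 0 && max_entries > 0);
    n = mem_bytes / bytes_per_entry;
    if (n < 1)
        n = 1;
    if (n > max_entries)
        n = max_entries;   /* the cap: stop growing with memory size */
    return round_down_pow2((unsigned long)n);
}
```

Under these made-up numbers, a 1 GB machine with one entry per 16 KB gets 65536 entries, while a 1 TB machine would ask for 2^26 entries and is held at the cap instead.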
NUMA support in general
NUMA support in the kernel is interesting, because with the Opteron there may be a lot of NUMA systems in the future.
Does the kernel allocate memory on the processor that needs it? For example: processor A runs process C, so is the memory of process C allocated in processor A's memory?
I read that Linux has problems on NUMA systems due to the locality issue mentioned above.
Is this issue addressed in the 2.6 kernel?
Linux on NUMA.
This particular issue was addressed *way* back in the dark ages of time. Kernel 2.2 at least. There are a lot of other NUMA issues outstanding, though. Kernel 2.6 will have a simplistic NUMA-aware scheduler for the first time. This will try to keep all of the nodes at about the same load, by migrating new processes around the system as needed. Strong attention to NUMA systems is just ramping up, and for the next few revisions, the kernel will be picking up more advanced features, I'm sure. A whole lot of the new SMP scaling stuff will be just as applicable to NUMA systems, too.