Currently the per cpu subsystem is not able to use the atomic capabilities that are provided by many of the available processors. This patch adds new functionality that allows the optimizing of per cpu variable handling. In particular it provides a simple way to exploit atomic operations in order to avoid having to disable interrupts or performing address calculation to access per cpu data. F.e. Using our current methods we may do unsigned long flags; struct stat_struct *p; local_irq_save(flags); /* Calculate address of per processor area */ p = CPU_PTR(stat, smp_processor_id()); p->counter++; local_irq_restore(flags); The segment can be replaced by a single atomic CPU operation: CPU_INC(stat->counter); Most processors have instructions to perform the increment using a a single atomic instruction. Processors may have segment registers, global registers or per cpu mappings of per cpu areas that can be used to generate atomic instructions that combine the following in a single operation: 1. Adding of an offset / register to a base address 2. Read modify write operation on the address calculated by the instruction. If 1+2 are combined in an instruction then the instruction is atomic vs interrupts. This means that percpu atomic operations do not need to disable interrupts to increments counters etc. The existing methods in use in the kernel cannot utilize the power of these atomic instructions. local_t is not really addressing the issue since the offset calculation performed before the atomic operation. The operation is therefor not atomic. Disabling interrupt or preemption is required in order to use local_t. local_t is also very specific to the x86 processor. The solution here can utilize other methods than just those provided by the x86 instruction set. On x86 the above CPU_INC translated into a single instruction: inc %%gs:(&stat->counter) This instruction is interrupt safe since it can either be completed or not. Both adding of ...
hm, I guess this _has_ to be implemented as a macro. ho hum. But Your terminology is totally confusing here. To me, an "atomic operation" is one which is atomic wrt other CPUs: atomic_t, for example. Here we're talking about atomic-wrt-this-cpu-only, yes? If so, we should invent a new term for that different concept and stick to it like glue. How about "self-atomic"? Or "locally-atomic" in And alpha, m32r, mips and powerpc, methinks. Probably others, but I think I'll need to come back another time to understand all that ;) I wonder if all this stuff should be in a new header file. If you move this functionality into a new cpu_alloc.h then the below code goes into include/asm-generic/cpu_alloc.h and most architectures' include/asm/cpu_alloc.h will include asm-generic/cpu_alloc.h. include/linux/percpu.h can still include linux/cpu_alloc.h (which includes asm/cpu_alloc.h) if needed. But it would be better to just --
No local_t does not do the relocation of the address to the correct percpu area. It requies disabling of interrupts etc. Its not atomic (wrt Well this stuff is so large in scope that I have difficulties keeping But then its related to percpu operations and relies extensively on the various percpu.h files in asm-generic and asm-arch and include/linux --
No it doesn't. Look:
static inline void local_inc(local_t *l)
{
asm volatile(_ASM_INC "%0"
: "+m" (l->a.counter));
Well that should be fixed. We should never have mixed the
alloc_percpu() and DEFINE_PER_CPU things inthe same header. They're
different.
otoh as you propose removing the old alloc_percpu() I guess the end
result is no worse than what we presently have.
--
No its not! In order to increment a per cpu value you need to calculate the per cpu pointer address in the current per cpu segment. local_t cannot do that in an atomic (wrt interrupt/preempt fashion) fashion. cpu ops can use a segment prefix and thus the insructions can calculate the per cpu adress and perform the atomic inc without disabling preempt or interrupts. local_t is only useful when you disable interrupt or premption otherwise. With cpu_alloc they are the same. They allocate from the same per cpu area. --
Christoph, you just missed it, that's all. Look at cpu_local_read et al in include/asm-i386/local.h (ie. before the x86 mergers chose the lowest common denominator one). Cheers, Rusty. --
There is no doubt that local_t does perform an atomic vs. interrupt inc for example. But its not usable. Because you need to determine the address of the local_t belonging to the current processor first. As soon as you have loaded a processor specific address you can no longer be preempted because that may change the processor and then the wrong address may be increment (and then we have a race again since now we are incrementing counters belonging to other processors). So local_t at mininum requires disabling preempt. Believe me I have tried to use local_t repeatedly for vm statistics etc. It always fails on that issue. See f.e. the patch that converts vmstat to cpu alloc and compare with my initial local_t based implementation 2 years ago that bombed out because I assumed that local_t would work right. cpu ops does both 1. The determination of the address of the object belonging to the local processor. and 2. The RMW in one instruction. That avoids having to disable preemption or interrupts and it shortens the instructions significantly. --
Christoph! STOP typing, and START reading. cpu_local_inc() does all this: it takes the name of a local_t var, and is expected to increment this cpu's version of that. You ripped this out and called it CPU_INC(). Think for a moment. What are the chances that I didn't understand this when I wrote the multiple implementations of local_t? Frankly, I am finding it increasingly easy to believe that you failed. But you are blaming the wrong thing. There are three implementations of local_t which are obvious. The best is for architectures which can locate and increment a per-cpu var in one instruction (eg. x86). Otherwise, using atomic_t/atomic64_t for local_t provides a general solution. The other general solution would involve local_irq_disable()/increment/local_irq_enable(). My (fading) hope is that this idiocy is an abberation, Rusty. --
Hi,
I'm attempting to test both approaches to compare the object generated in order
to understand the issues involved here. Here's my code:
void test_cpu_inc(int *s)
{
__CPU_INC(s);
}
void test_local_inc(local_t *t)
{
__local_inc(THIS_CPU(t));
}
void test_cpu_local_inc(local_t *t)
{
__cpu_local_inc(t);
}
But I don't know how I can use cpu_local_inc because the pointer to the object
is not &__get_cpu_var(l):
#define __cpu_local_inc(l) cpu_local_inc((l))
#define cpu_local_inc(l) cpu_local_wrap(local_inc(&__get_cpu_var((l))))
At the minimum, we would need a new local_t op to get the correct CPU_ALLOC'd
pointer value for the increment. These new local_t ops for CPU_ALLOC'd variables
could use CPU_XXX primitives to implement them, or just a base val_to_ptr primitive
to replace __get_cpu_var().
I did notice this in local.h:
* X86_64: This could be done better if we moved the per cpu data directly
* after GS.
... which it now is, so true per_cpu variables could be optimized better as well.
Also, the above cpu_local_wrap(...) adds:
#define cpu_local_wrap(l) \
({ \
preempt_disable(); \
(l); \
preempt_enable(); \
}) \
... and there isn't a non-preemption version that I can find.
Here are the objects.
0000000000000000 <test_cpu_inc>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 48 83 ec 08 sub $0x8,%rsp
8: 48 89 7d f8 mov %rdi,0xfffffffffffffff8(%rbp)
c: 65 48 ff 45 f8 incq %gs:0xfffffffffffffff8(%rbp)
11: c9 leaveq
12: c3 retq
0000000000000013 <test_local_inc>:
13: 55 ...Yes. Because the only true per-cpu vars are the static ones, cpu_local_inc() only works on identifiers, not arbitrary pointers. Once this is fixed, we should be enhancing the infrastructure to allow that (AFAICT it's not too Yes, this should be fixed. I thought i386 had optimized versions pre-merge, but I was wrong (%gs for per-cpu came later, and noone cleaned up these naive versions). Did you want me to write them? I actually think that using local_t == atomic_t is better than You might be right, but note that local_t is 64 bit on 64-bit platforms. And speculation of possible use cases isn't a good reason to rip out working infrastructure :) Cheers, --
How can that be fixed? You have no atomic instruction that calculates the per cpu address in one go. And as long as that is the case you need to disable preempt. Otherwise you may increment the per cpu variable of another processor because the process was rescheduled after the address Its fundamentally broken since because of the preemption issue. This is also why local_t is rarely used. --
But of course, that is not a problem. You make local_t an atomic_t, and then it doesn't matter which CPU you incremented. By definition if the caller cared, they would have had premption disabled. Hope that clarifies, Rusty. --
Right that is what the cpu alloc patches do. So you could implement But then the whole point of local_t is gone. Why not use atomic_t in the There are numerous instances where the caller does not care about preemption. Its just important that one per cpu counter is increment in the least intrusive way. See f.e. the VM event counters. --
Yes, and that's exactly the point. The VM event counters are exactly a case where you should have used cpu_local_inc. Rusty. --
You need at least the zero basing to enable the use of the segment
The argument does not make any sense. First you want to use atomic_t then
I tried it and did not give any benefit except first failing due to bugs
because local_t did not disable preempt6... This led to Andi fixing
local_t.
But with the preempt disabling I could not discern what the benefit
would be.
CPU_INC does not require disabling of preempt and the cpu alloc patches
shorten the code sequence to increment a VM counter significantly.
Here is the header from the patch. How would cpu_local_inc be able to do
better unless you adopt this patchset and add a shim layer?
Subject: VM statistics: Use CPU ops
The use of CPU ops here avoids the offset calculations that we used to
have to do with per cpu operations. The result of this patch is that event
counters are coded with a single instruction the following way:
incq %gs:offset(%rip)
Without these patches this was:
mov %gs:0x8,%rdx
mov %eax,0x38(%rsp)
mov xxx(%rip),%eax
mov %eax,0x48(%rsp)
mov varoffset,%rax
incq 0x110(%rax,%rdx,1)
--
You're being obtuse. See previous mail about the three possible implementations of local_t, and the comment in asm-generic/local.h. The paths forward are clear: 1) Improve x86 local_t (mostly orthogonal to the others, but useful). 2) Implement extensible per-cpu areas. 3) Generalize per-cpu accessors. 4) Extend or replace the module.c per-cpu allocator to alloc from the other areas. 5) Convert alloc_percpu et al. to use the new code. Hope that clarifies, Rusty. --
OK. I hope I responded in the other email in a more intelligent Not sure about that. Its rarely used and the more general cpu alloc stuff can be used in lots of places as evident by the rest of the patchset. But Yes thanks. We are mostly on the same wavelength. --
Note also that the address calculation occurs before the incq. That is why disabling preemption is required otherwise the processor may change between the determination of the per cpu area address and the increment. The local_t operations could be modified to avoid the preemption issues with the zero based patches applied. Then there would still be the inflexbility of not being able to increment an arbitrary variable. I think it is also bad to treat a per cpu variable like an atomic. Its not truly atomic nor are strictly atomic accesses used. It is fine to use regular operations on the per cpu variable provided one has either disabled preemption or interrupts. The per cpu atomic wrt interrupt ops are only useful when preemption and/or interrupts are off. --
1. The x86 implementation does not exist because the segment register has so far not been available on x86_64. So you could not do the solution. You need the zero basing. Then you can use per_xxx_add in cpu_inc. 2. The general solution created overhead that is often not needed. If we would have done vm event counters with local_t then we would have atomic overhead for each increment on f.e. IA64. That was not acceptable. cpu_alloc never falls back to atomic operations. 3. local_t is based on the atomic logic. But percpu handling is fundamentally different in that accesses without the special macros are okay provided you are in a non preemptible or irq context! A local_t declaration makes such accesses impossible. 4. The modeling of local_t on atomic_t limits it to 32bit! There is no way to use this with pointers or 64 bit entities. Adding that would duplicate the API for each type added. --
You can implement it either way. I've said that three times now. The current Again, untrue. The interface is already there. So feel free to implement __cpu_local_inc et al in terms of preempt enable and disable so it doesn't Again wrong. And adding an exclamation mark doesn't make it true. Rusty. --
Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What about pointers? --
sizeof(long) == sizeof(void *) in Linux, right? If you were to support just a single data type, long would probably be the most useful. Still, it might be more consistent to support int and long, same as atomic. --
Sure, but in practice these tend to be simple counters: that could well change when dynamic percpu allocs become first class citizens, but let's not put the cart before the horse... Per-cpu seems to be particularly prone to over-engineering: see commit 7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago. Grepping here reveals that this infrastructure is still not used. Cheers, Rusty. --
Hmm. Something like that needs the question asked "who uses this?" before it is merged I guess. If it were a trivial patch maybe not, but something like this that sits untested for so long is almost broken by definition ;) --
Some code of mine which didn't make it beyond -mm used this small per-cpu extension. So the commit you refer to was tested. --
Right, but it can easily rot after initial testing if it isn't continually used. Maybe this isn't the best example because maybe it still works fine. But in general, unused, non-trivial code isn't good just to leave around "just in case" IMO. --
Populating the per cpu areas on demand is a good thing especially for configurations with a large number of processors. If we really go to support 4k processor by default then we need to allocate the smallest amount of per cpu structures necessary. Maybe ACPI or so can tell us how many processors are possible and we only allocate those. But it would be best if the percpu structures are only allocated for actually active processors. --
cpu_possible_map should definitely be minimal, but your point is well made: dynamic percpu could actually cut memory allocation. If we go for a hybrid scheme where static percpu is always allocated from the initial chunk, however, we still need the current pessimistic overallocation. Mike's a clever guy, I'm sure he'll think of something :) Rusty. --
The initial chunk would mean that the percpu areas all come from the same NUMA node. We really need to allocate from the node that is nearest to a processor (not all processors have processor local memory!). It would be good to standardize the way that percpu areas are allocated. We have various ways of allocation now in various arches. init/main.c:setup_per_cpu_ares() needs to be generalized: 1. Allocate the per cpu areas in a NUMA aware fashions. 2. Have a function for instantiating a single per cpu area that can be used during cpu hotplug. 3. Some hooks for arches to override particular behavior as needed. F.e. IA64 allocates percpu structures in a special way. x86_64 Hopefully. Otherwise he will ask me =-). --
Yes, this is where it gets nasty. We shouldn't even allocate the initial chunk in a non-NUMA aware way (I'm using the term chunk loosely, it's a chunk Definitely. We also need to reserve virtual address space to create more areas with congruent mappings; that's the fun part. Unfortunately this breaks the current percpu semantics: that if you iterate over all possible cpus you can access percpu vars. This means you don't need to have hotplug CPU notifiers for simple percpu counters. We could do this IA64 is going to need some work, since dynamic percpu addresses won't be able And as always, lkml will offer feedback; useful and otherwise :) Cheers, Rusty. --
The ia64 hook could simply return the address of percpu area that was reserved when the per node memory layout was generated (which happens very early during node bootstrap). --
Apologies, this time I read the code. I thought IA64 used the pinned TLB area to access per-cpu vars under some circumstances, but they only do that via an arch-specific macro. So creating new congruent mappings to expand the percpu area(s) is our main concern now? Rusty. --
The concern here was just consolidating the setup code for the per cpu areas. --
Not exactly. Getting the system to not panic early in the boot (before x86_64_start_kernel()) is the primary problem right now. This happens in the tip tree with the change to use zero-based percpu offsets. It gets much farther on the linux-next tree. Thanks, Mike --
No, the worst thing is that this is a great deal of churn which doesn't actually fix the "running out of per-cpu memory" problem. It can, and should, be fixed, before changing dynamic percpu alloc to use the same percpu pool. Rusty. --
Yes, but this is local_t for dynamically allocated per-cpu vars. You've lost potential symmetry and invented a whole new nomenclature :( local_ptr_inc() etc would be far preferable IMHO. Cheers, Rusty. --
