Re: [patch 04/41] cpu ops: Core piece for generic atomic per cpu operations

Previous thread: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access by Christoph Lameter on Thursday, May 29, 2008 - 8:56 pm. (24 messages)

Next thread: [patch 05/41] cpu alloc: Percpu_counter conversion by Christoph Lameter on Thursday, May 29, 2008 - 8:56 pm. (3 messages)
From: Christoph Lameter
Date: Thursday, May 29, 2008 - 8:56 pm

Currently the per cpu subsystem is not able to use the atomic capabilities
that are provided by many of the available processors.

This patch adds new functionality that allows the optimizing of per cpu
variable handling. In particular it provides a simple way to exploit
atomic operations in order to avoid having to disable interrupts or
performing address calculation to access per cpu data.

F.e. Using our current methods we may do

	unsigned long flags;
	struct stat_struct *p;

	local_irq_save(flags);
	/* Calculate address of per processor area */
	p = CPU_PTR(stat, smp_processor_id());
	p->counter++;
	local_irq_restore(flags);

The segment can be replaced by a single atomic CPU operation:

	CPU_INC(stat->counter);

Most processors have instructions to perform the increment using a
a single atomic instruction. Processors may have segment registers,
global registers or per cpu mappings of per cpu areas that can be used
to generate atomic instructions that combine the following in a single
operation:

1. Adding of an offset / register to a base address
2. Read modify write operation on the address calculated by
   the instruction.

If 1+2 are combined in an instruction then the instruction is atomic
vs interrupts. This means that percpu atomic operations do not need
to disable interrupts to increments counters etc.

The existing methods in use in the kernel cannot utilize the power of
these atomic instructions. local_t is not really addressing the issue
since the offset calculation performed before the atomic operation. The
operation is therefor not atomic. Disabling interrupt or preemption is
required in order to use local_t.

local_t is also very specific to the x86 processor. The solution here can
utilize other methods than just those provided by the x86 instruction set.



On x86 the above CPU_INC translated into a single instruction:

	inc %%gs:(&stat->counter)

This instruction is interrupt safe since it can either be completed
or not. Both adding of ...
From: Andrew Morton
Date: Thursday, May 29, 2008 - 9:58 pm

hm, I guess this _has_ to be implemented as a macro.  ho hum.  But

Your terminology is totally confusing here.

To me, an "atomic operation" is one which is atomic wrt other CPUs:
atomic_t, for example.

Here we're talking about atomic-wrt-this-cpu-only, yes?

If so, we should invent a new term for that different concept and stick
to it like glue.  How about "self-atomic"?  Or "locally-atomic" in

And alpha, m32r, mips and powerpc, methinks.  Probably others, but

I think I'll need to come back another time to understand all that ;)


I wonder if all this stuff should be in a new header file.


If you move this functionality into a new cpu_alloc.h then the below
code goes into include/asm-generic/cpu_alloc.h and most architectures'
include/asm/cpu_alloc.h will include asm-generic/cpu_alloc.h.

include/linux/percpu.h can still include linux/cpu_alloc.h (which
includes asm/cpu_alloc.h) if needed.  But it would be better to just

--

From: Christoph Lameter
Date: Thursday, May 29, 2008 - 10:17 pm

No local_t does not do the relocation of the address to the correct percpu 
area. It requies disabling of interrupts etc. Its not atomic (wrt 

Well this stuff is so large in scope that I have difficulties keeping 

But then its related to percpu operations and relies extensively on the 
various percpu.h files in asm-generic and asm-arch and include/linux

--

From: Andrew Morton
Date: Thursday, May 29, 2008 - 10:38 pm

No it doesn't.  Look:

static inline void local_inc(local_t *l)
{
	asm volatile(_ASM_INC "%0"
		     : "+m" (l->a.counter));


Well that should be fixed.  We should never have mixed the
alloc_percpu() and DEFINE_PER_CPU things inthe same header.  They're
different.

otoh as you propose removing the old alloc_percpu() I guess the end
result is no worse than what we presently have.

--

From: Christoph Lameter
Date: Thursday, May 29, 2008 - 11:12 pm

No its not! In order to increment a per cpu value you need to calculate 
the per cpu pointer address in the current per cpu segment. local_t 
cannot do that in an atomic (wrt interrupt/preempt fashion) fashion. cpu 
ops can use a segment prefix and thus the insructions can calculate the 
per cpu adress and perform the atomic inc without disabling preempt or 
interrupts.

local_t is only useful when you disable interrupt or premption otherwise. 

With cpu_alloc they are the same. They allocate from the same per cpu 
area.

--

From: Rusty Russell
Date: Friday, May 30, 2008 - 12:08 am

Christoph, you just missed it, that's all.  Look at cpu_local_read et al in 
include/asm-i386/local.h (ie. before the x86 mergers chose the lowest common 
denominator one).

Cheers,
Rusty.
--

From: Christoph Lameter
Date: Friday, May 30, 2008 - 11:00 am

There is no doubt that local_t does perform an atomic vs. interrupt inc 
for example. But its not usable. Because you need to determine the address 
of the local_t belonging to the current processor first. As soon as you 
have loaded a processor specific address you can no longer be preempted 
because that may change the processor and then the wrong address may be 
increment (and then we have a race again since now we are incrementing 
counters belonging to other processors). So local_t at mininum requires 
disabling preempt.
 
Believe me I have tried to use local_t repeatedly for vm statistics etc. 
It always fails on that issue. See f.e. the patch that converts vmstat to 
cpu alloc and compare with my initial local_t based implementation 2 years 
ago that bombed out because I assumed that local_t would work right.

cpu ops does both

1. The determination of the address of the object belonging to the local 
processor.

and

2. The RMW

in one instruction. That avoids having to disable preemption or interrupts 
and it shortens the instructions significantly.

--

From: Rusty Russell
Date: Sunday, June 1, 2008 - 7:00 pm

Christoph!

STOP typing, and START reading.

cpu_local_inc() does all this: it takes the name of a local_t var, and is 
expected to increment this cpu's version of that.  You ripped this out and 
called it CPU_INC().


Think for a moment.  What are the chances that I didn't understand this when I 
wrote the multiple implementations of local_t?


Frankly, I am finding it increasingly easy to believe that you failed.  But 
you are blaming the wrong thing.

There are three implementations of local_t which are obvious.  The best is for 
architectures which can locate and increment a per-cpu var in one instruction 
(eg. x86).  Otherwise, using atomic_t/atomic64_t for local_t provides a 
general solution.  The other general solution would involve 
local_irq_disable()/increment/local_irq_enable().

My (fading) hope is that this idiocy is an abberation,
Rusty.
--

From: Mike Travis
Date: Wednesday, June 4, 2008 - 11:18 am

Hi,

I'm attempting to test both approaches to compare the object generated in order
to understand the issues involved here.  Here's my code:

        void test_cpu_inc(int *s)
        {
                __CPU_INC(s);
        }

        void test_local_inc(local_t *t)
        {
                __local_inc(THIS_CPU(t));
        }

        void test_cpu_local_inc(local_t *t)
        {
                __cpu_local_inc(t);
        }

But I don't know how I can use cpu_local_inc because the pointer to the object
is not &__get_cpu_var(l):

	#define __cpu_local_inc(l)      cpu_local_inc((l))
	#define cpu_local_inc(l)     cpu_local_wrap(local_inc(&__get_cpu_var((l))))

At the minimum, we would need a new local_t op to get the correct CPU_ALLOC'd
pointer value for the increment.  These new local_t ops for CPU_ALLOC'd variables
could use CPU_XXX primitives to implement them, or just a base val_to_ptr primitive
to replace __get_cpu_var().

I did notice this in local.h:

	 * X86_64: This could be done better if we moved the per cpu data directly
	 * after GS.

... which it now is, so true per_cpu variables could be optimized better as well.

Also, the above cpu_local_wrap(...) adds:

	#define cpu_local_wrap(l)               \
	({                                      \
	        preempt_disable();              \
	        (l);                            \
	        preempt_enable();               \
	})                                      \

... and there isn't a non-preemption version that I can find.

Here are the objects.  

0000000000000000 <test_cpu_inc>:
   0:   55                      push   %rbp
   1:   48 89 e5                mov    %rsp,%rbp
   4:   48 83 ec 08             sub    $0x8,%rsp
   8:   48 89 7d f8             mov    %rdi,0xfffffffffffffff8(%rbp)
   c:   65 48 ff 45 f8          incq   %gs:0xfffffffffffffff8(%rbp)
  11:   c9                      leaveq
  12:   c3                      retq

0000000000000013 <test_local_inc>:
  13:   55            ...
From: Rusty Russell
Date: Thursday, June 5, 2008 - 4:59 pm

Yes.  Because the only true per-cpu vars are the static ones, cpu_local_inc() 
only works on identifiers, not arbitrary pointers.  Once this is fixed, we 
should be enhancing the infrastructure to allow that (AFAICT it's not too 



Yes, this should be fixed.  I thought i386 had optimized versions pre-merge, 
but I was wrong (%gs for per-cpu came later, and noone cleaned up these naive 
versions).  Did you want me to write them?

I actually think that using local_t == atomic_t is better than 

You might be right, but note that local_t is 64 bit on 64-bit platforms.  And 
speculation of possible use cases isn't a good reason to rip out working 
infrastructure :)

Cheers,


--

From: Christoph Lameter
Date: Monday, June 9, 2008 - 12:00 pm

How can that be fixed? You have no atomic instruction that calculates the 
per cpu address in one go. And as long as that is the case you need to 
disable preempt. Otherwise you may increment the per cpu variable of 
another processor because the process was rescheduled after the address 

Its fundamentally broken since because of the preemption issue. This is 
also why local_t is rarely used.


--

From: Rusty Russell
Date: Monday, June 9, 2008 - 4:27 pm

But of course, that is not a problem.  You make local_t an atomic_t, and then 
it doesn't matter which CPU you incremented.

By definition if the caller cared, they would have had premption disabled.

Hope that clarifies,
Rusty.
--

From: Christoph Lameter
Date: Monday, June 9, 2008 - 4:54 pm

Right that is what the cpu alloc patches do. So you could implement 

But then the whole point of local_t is gone. Why not use atomic_t in the 

There are numerous instances where the caller does not care about 
preemption. Its just important that one per cpu counter is increment in 
the least intrusive way. See f.e. the VM event counters.

--

From: Rusty Russell
Date: Monday, June 9, 2008 - 7:56 pm

Yes, and that's exactly the point.  The VM event counters are exactly a case 
where you should have used cpu_local_inc.

Rusty.
--

From: Christoph Lameter
Date: Monday, June 9, 2008 - 8:18 pm

You need at least the zero basing to enable the use of the segment 

The argument does not make any sense. First you want to use atomic_t then 

I tried it and did not give any benefit except first failing due to bugs 
because local_t did not disable preempt6... This led to Andi fixing 
local_t.

But with the preempt disabling I could not discern what the benefit 
would be.

CPU_INC does not require disabling of preempt and the cpu alloc patches 
shorten the code sequence to increment a VM counter significantly.

Here is the header from the patch. How would cpu_local_inc be able to do 
better unless you adopt this patchset and add a shim layer?

Subject: VM statistics: Use CPU ops

The use of CPU ops here avoids the offset calculations that we used to 
have to do with per cpu operations. The result of this patch is that event 
counters are coded with a single instruction the following way:

        incq   %gs:offset(%rip)

Without these patches this was:

        mov    %gs:0x8,%rdx
        mov    %eax,0x38(%rsp)
        mov    xxx(%rip),%eax
        mov    %eax,0x48(%rsp)
        mov    varoffset,%rax
        incq   0x110(%rax,%rdx,1)


--

From: Rusty Russell
Date: Tuesday, June 10, 2008 - 5:03 pm

You're being obtuse.  See previous mail about the three possible 
implementations of local_t, and the comment in asm-generic/local.h.

The paths forward are clear:
1) Improve x86 local_t (mostly orthogonal to the others, but useful).
2) Implement extensible per-cpu areas.
3) Generalize per-cpu accessors.
4) Extend or replace the module.c per-cpu allocator to alloc from the other
   areas.
5) Convert alloc_percpu et al. to use the new code.

Hope that clarifies,
Rusty.
--

From: Christoph Lameter
Date: Tuesday, June 10, 2008 - 5:15 pm

OK. I hope I responded in the other email in a more intelligent 

Not sure about that. Its rarely used and the more general cpu alloc stuff 
can be used in lots of places as evident by the rest of the patchset. But 

Yes thanks. We are mostly on the same wavelength.
--

From: Christoph Lameter
Date: Monday, June 9, 2008 - 4:09 pm

Note also that the address calculation occurs before the incq. That is why 
disabling preemption is required otherwise the processor may change 
between the determination of the per cpu area address and the increment.

The local_t operations could be modified to avoid the preemption issues 
with the zero based patches applied. Then there would still be the 
inflexbility of not being able to increment an arbitrary variable.

I think it is also bad to treat a per cpu variable like an atomic. Its not 
truly atomic nor are strictly atomic accesses used. It is fine to use 
regular operations on the per cpu variable provided one has either 
disabled preemption or interrupts. The per cpu atomic wrt interrupt ops 
are only useful when preemption and/or interrupts are off.

--

From: Christoph Lameter
Date: Tuesday, June 10, 2008 - 10:42 am

1. The x86 implementation does not exist because the segment register has 
   so far not been available on x86_64. So you could not do the solution.
   You need the zero basing. Then you can use per_xxx_add in cpu_inc.

2. The general solution created overhead that is often not needed. If we
   would have done vm event counters with local_t then we would have
   atomic overhead for each increment on f.e. IA64. That was not
   acceptable. cpu_alloc never falls back to atomic operations.

3. local_t is based on the atomic logic. But percpu handling is 
   fundamentally different in that accesses without the special macros
   are okay provided you are in a non preemptible or irq context!
   A local_t declaration makes such accesses impossible.

4. The modeling of local_t on atomic_t limits it to 32bit! There is no
   way to use this with pointers or 64 bit entities. Adding that would 
   duplicate the API for each type added.
--

From: Rusty Russell
Date: Wednesday, June 11, 2008 - 4:10 am

You can implement it either way.  I've said that three times now.  The current 

Again, untrue.  The interface is already there.  So feel free to implement 
__cpu_local_inc et al in terms of preempt enable and disable so it doesn't 

Again wrong.  And adding an exclamation mark doesn't make it true.

Rusty.
--

From: Christoph Lameter
Date: Wednesday, June 11, 2008 - 4:39 pm

Ewww ... Its atomic_long_t ahh. Ok then there no 32 bit support. What about pointers?

--

From: Nick Piggin
Date: Wednesday, June 11, 2008 - 5:58 pm

sizeof(long) == sizeof(void *) in Linux, right?

If you were to support just a single data type, long would probably
be the most useful. Still, it might be more consistent to support
int and long, same as atomic.

--

From: Rusty Russell
Date: Wednesday, June 11, 2008 - 7:44 pm

Sure, but in practice these tend to be simple counters: that could well change 
when dynamic percpu allocs become first class citizens, but let's not put the 
cart before the horse...

Per-cpu seems to be particularly prone to over-engineering: see commit 
7ff6f08295d90ab20d25200ef485ebb45b1b8d71 from almost two years ago.  Grepping 
here reveals that this infrastructure is still not used.

Cheers,
Rusty.
--

From: Nick Piggin
Date: Wednesday, June 11, 2008 - 8:40 pm

Hmm. Something like that needs the question asked "who uses this?"
before it is merged I guess. If it were a trivial patch maybe not,
but something like this that sits untested for so long is almost
broken by definition ;)
--

From: Martin Peschke
Date: Thursday, June 12, 2008 - 2:37 am

Some code of mine which didn't make it beyond -mm used this small
per-cpu extension. So the commit you refer to was tested.

--

From: Nick Piggin
Date: Thursday, June 12, 2008 - 4:21 am

Right, but it can easily rot after initial testing if it isn't
continually used.

Maybe this isn't the best example because maybe it still works
fine. But in general, unused, non-trivial code isn't good just
to leave around "just in case" IMO.
--

From: Christoph Lameter
Date: Thursday, June 12, 2008 - 10:19 am

Populating the per cpu areas on demand is a good thing especially for 
configurations with a large number of processors. If we really go to 
support 4k processor by default then we need to allocate the smallest 
amount of per cpu structures necessary. Maybe ACPI or so can tell us how 
many processors are possible and we only allocate those. But it would be 
best if the percpu structures are only allocated for actually active 
processors.

--

From: Rusty Russell
Date: Thursday, June 12, 2008 - 5:38 pm

cpu_possible_map should definitely be minimal, but your point is well made: 
dynamic percpu could actually cut memory allocation.  If we go for a hybrid 
scheme where static percpu is always allocated from the initial chunk, 
however, we still need the current pessimistic overallocation.

Mike's a clever guy, I'm sure he'll think of something :)
Rusty.
--

From: Christoph Lameter
Date: Thursday, June 12, 2008 - 7:27 pm

The initial chunk would mean that the percpu areas all come from the same 
NUMA node. We really need to allocate from the node that is nearest to a 
processor (not all processors have processor local memory!).

It would be good to standardize the way that percpu areas are allocated. 
We have various ways of allocation now in various arches. 
init/main.c:setup_per_cpu_ares() needs to be generalized:

1. Allocate the per cpu areas in a NUMA aware fashions.

2. Have a function for instantiating a single per cpu area that 
   can be used during cpu hotplug.

3. Some hooks for arches to override particular behavior as needed.
   F.e. IA64 allocates percpu structures in a special way. x86_64

Hopefully. Otherwise he will ask me =-).

--

From: Rusty Russell
Date: Sunday, June 15, 2008 - 3:33 am

Yes, this is where it gets nasty.  We shouldn't even allocate the initial 
chunk in a non-NUMA aware way (I'm using the term chunk loosely, it's a chunk 

Definitely.  We also need to reserve virtual address space to create more 
areas with congruent mappings; that's the fun part.


Unfortunately this breaks the current percpu semantics: that if you iterate 
over all possible cpus you can access percpu vars.  This means you don't need 
to have hotplug CPU notifiers for simple percpu counters.  We could do this 

IA64 is going to need some work, since dynamic percpu addresses won't be able 

And as always, lkml will offer feedback; useful and otherwise :)

Cheers,
Rusty.
--

From: Christoph Lameter
Date: Monday, June 16, 2008 - 7:52 am

The ia64 hook could simply return the address of percpu area that 
was reserved when the per node memory layout was generated (which happens 
very early during node bootstrap).


--

From: Rusty Russell
Date: Monday, June 16, 2008 - 5:24 pm

Apologies, this time I read the code.  I thought IA64 used the pinned TLB area 
to access per-cpu vars under some circumstances, but they only do that via an 
arch-specific macro.

So creating new congruent mappings to expand the percpu area(s) is our main 
concern now?

Rusty.
--

From: Christoph Lameter
Date: Monday, June 16, 2008 - 7:29 pm

The concern here was just consolidating the setup code for the per cpu 
areas.

--

From: Mike Travis
Date: Tuesday, June 17, 2008 - 7:21 am

Not exactly.  Getting the system to not panic early in the boot (before
x86_64_start_kernel()) is the primary problem right now.  This happens
in the tip tree with the change to use zero-based percpu offsets.  It
gets much farther on the linux-next tree.

Thanks,
Mike
--

From: Rusty Russell
Date: Friday, May 30, 2008 - 12:05 am

No, the worst thing is that this is a great deal of churn which doesn't 
actually fix the "running out of per-cpu memory" problem.

It can, and should, be fixed, before changing dynamic percpu alloc to use the 
same percpu pool.

Rusty.
--

From: Rusty Russell
Date: Thursday, May 29, 2008 - 11:32 pm

Yes, but this is local_t for dynamically allocated per-cpu vars.  You've lost 
potential symmetry and invented a whole new nomenclature :(

local_ptr_inc() etc would be far preferable IMHO.

Cheers,
Rusty.
--

Previous thread: [patch 00/41] cpu alloc / cpu ops v3: Optimize per cpu access by Christoph Lameter on Thursday, May 29, 2008 - 8:56 pm. (24 messages)

Next thread: