Re: New Address Family: Inter Process Networking (IPN)

From: Renzo Davoli
Date: Wednesday, December 5, 2007 - 9:40 am

Inter Process Networking: 
a kernel module (and some simple kernel patches) to provide 
AF_IPN: a new address family for process networking, i.e. multipoint,
multicast/broadcast communication among processes (and networks).

WHAT IS IT?
-----------
Berkeley sockets were designed for client-server or point-to-point
communication. All existing Address Families implement this idea.
IPN is a new address family designed for one-to-many, many-to-many and 
peer-to-peer communication among processes.
IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.
On IPN, processes can interoperate using real networking protocols 
(e.g. ethernet) but also using application defined protocols (maybe 
just sending ascii strings, video or audio frames, etc).
IPN provides networking (in the broadest definition you can imagine) to
the processes. Processes can be ethernet nodes, run their own TCP-IP stacks
if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.
IPN networks can be interconnected with real networks or IPN networks
running on different computers can interoperate (can be connected by
virtual cables).
IPN is part of the Virtual Square Project (vde, lwipv6, view-os, 
umview/kmview, see wiki.virtualsquare.org).

WHY?
----
Many applications can benefit from IPN.
First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.
IPN can be useful for applications where one or some processes feed their 
data to several consuming processes (maybe joining the stream at run time).
IPN sockets can be also connected to tap (tuntap) like interfaces or
to real interfaces (like "brctl addif").
There are specific ioctls to define a tap interface or grab an existing
one.
Several existing services could be implemented (and often could have extended
features) on the top of IPN:
- kernel bridge
- tuntap
- macvlan
IPN could be used (IMHO) to provide multicast services to ...
From: Stephen Hemminger
Date: Wednesday, December 5, 2007 - 2:55 pm

On Wed, 5 Dec 2007 17:40:55 +0100

Post complete source code for kernel part to netdev@vger.kernel.org.
If you want the hooks, you need to include the full source code for inclusion
in mainline. All the Documentation/SubmittingPatches rules apply;
you can't just ask for "facilitators" and expect to keep your stuff out of tree.

--

From: Renzo Davoli
Date: Wednesday, December 5, 2007 - 10:38 pm

I am sorry if I was misunderstood.
I did not want any "facilitator", nor did I want to keep my code outside
the kernel; on the contrary.
It is perfectly okay for me to provide the entire code for inclusion.
The purposes of my message were the following:
- I wanted to introduce the idea and say to the linux kernel community
  that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF?
Is there an "AF registry" somewhere? (like the device major/minor
registry or the well-known port assignment for TCP-IP).
- Hook: we have two different options. We can add another grabbing
inline function like those used by the bridge and macvlan or we can
design a grabbing service registration facility. Which one is preferable?
The former is simpler, the latter is more elegant but it requires some 
changes in the kernel bridge code.
So the choice is between the former (less invasive, safer, inelegant) and
the latter (more invasive, less safe, elegant).

We need a bit of time to stabilize the code: deeply testing the existing
features and implementing some more ideas we have on it.
In the meantime we would be grateful if the community could kindly answer the
questions above.

renzo
--

From: Stephen Hemminger
Date: Wednesday, December 5, 2007 - 11:04 pm

On Thu, 6 Dec 2007 06:38:21 +0100


The usual process is to just add the value as part of the patchset.
You then need to tell the glibc maintainers so it gets included appropriately

The problems with making it a registration facility are:
 * risk of making it easier for non-GPL out of tree abuse
 * possible ordering issues: ie. by hardcoding each hook, the
    behaviour is defined in the case of multiple usages on the same


I am a believer in review early and often. It is easier to just deal with
the nuisance issues (style, naming, configuration) at the beginning rather
than the final stage of the project.
--

From: Andi Kleen
Date: Wednesday, December 5, 2007 - 4:39 pm

Netlink is multicast/broadcast by default for once. And BC/MC certainly

Sounds like netlink. See also RFC 3549

Haven't read further I admit.

-Andi
--

From: Renzo Davoli
Date: Wednesday, December 5, 2007 - 10:30 pm

RFC 3549 says:
"This document describes Linux Netlink, which is used in Linux both as
   an intra-kernel messaging system as well as between kernel and user
   space."

We know AF_NETLINK, our user-space stack lwipv6 supports it.

AF_IPN is different. 
AF_IPN is the broadcast and peer-to-peer extension of AF_UNIX.
It supports communication among *user* processes. 

Example:

Qemu, User-Mode Linux, Kvm, our umview machines can use IPN as an
Ethernet Hub and communicate among themselves, with the hosting computer,
and with the world through a tap-like interface.

You can also grab an interface (say eth1) and use eth0 for your hosting
computer and eth1 for the IPN network of virtual machines.

If you load the kvde_switch submodule IPN can be a virtual Ethernet switch.

This example is already working using the svn versions of ipn and
vdeplug.

Another Example:

You have a continuous stream of data packets generated by a process,
and you want to send this data to many processes.
Maybe the set of processes is not known in advance, you want to send the
data to any interested process. Some kind of publish&subscribe
communication service (among unix processes not on TCP-IP).
Without IPN you need a server. With IPN the sender creates the socket,
connects to it and feeds it with data packets. All the interested 
receivers connect to it and start reading. That's all.

I hope that this message can give a better understanding of what IPN is.

	renzo
--

From: Kyle Moffett
Date: Wednesday, December 5, 2007 - 11:19 pm

Ok, you say it's different, but then you describe how IP unicast and  
broadcast work.  Both are frequently used for communication among  
"*user* processes".  Please provide significantly more details about  

You say "tap like" interface, but people do this already with  
existing infrastructure.  You can connect Qemu, UML, and KVM to a  
standard Linux "tap" interface, and then use the standard Linux  
bridging code to connect the "tap" interface to your existing network  
interfaces.  Alternatively you could use the standard and well-tested  
IP routing/firewalling/NAT code to move your packets around.  None of  
this requires new network infrastructure in the slightest.  If you  
have problems with the existing code, please improve it instead of  
creating a slightly incompatible replacement which has different bugs  


As I described above, this can be done with the existing bridging and  

This is already done frequently in userspace.  Just register a port  
number with IANA on which to implement a "registration" server and  
write a little daemon to listen on 127.0.0.1:${YOUR_PORT}.  Your  
interconnecting programs then use either unicast or multicast sockets  
to bind, then report to the registration server what service you are  
offering and what port it's on.  Your "receivers" then connect to the  
registration server, ask what port a given service is on, and then  
multicast-listen or unicast-connect to access that service.  The best  
part is that all of the performance implications are already  
thoroughly understood.  Furthermore, if you want to extend your  
communication protocol to other hosts as well, you just have to  
replace the 127.0.0.1 bind with a global bind.  This is exactly how  
the standard-specified multiple-participant "SIP" protocol works, for  
example.
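
A minimal sketch of the registration scheme described above, assuming a
made-up text protocol ("REG <name> <port>" and "GET <name>") over UDP on
127.0.0.1. The function names, the message format and the use of an
OS-assigned port instead of an IANA-registered one are illustrative
assumptions, not something proposed in this thread:

```python
import socket

def open_registry(port=0):
    """Bind the registry to loopback; port 0 lets the OS pick (assumption)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("127.0.0.1", port))
    return sock

def handle_one(registry_sock, services):
    """Process one request datagram and reply to its sender.

    'REG <name> <port>' records where a service lives.
    'GET <name>' returns the recorded port, or UNKNOWN.
    """
    data, addr = registry_sock.recvfrom(512)
    parts = data.decode("ascii", "replace").split()
    if len(parts) == 3 and parts[0] == "REG" and parts[2].isdigit():
        services[parts[1]] = int(parts[2])
        reply = "OK"
    elif len(parts) == 2 and parts[0] == "GET":
        port = services.get(parts[1])
        reply = str(port) if port is not None else "UNKNOWN"
    else:
        reply = "ERROR"
    registry_sock.sendto(reply.encode("ascii"), addr)
```

A producer would send "REG video 5004" to the registry address and a
consumer would send "GET video" before connecting to the returned port.
Replacing the 127.0.0.1 bind with a global bind is the step that extends
the scheme to other hosts.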


So if you really think this is something that belongs in the kernel  
you need to provide much more detailed descriptions and use-cases for  
why it cannot be implemented in user-space or ...
From: David Newall
Date: Wednesday, December 5, 2007 - 11:59 pm

Renzo also described something new (in the socket() arena): the 
multi-reader, multi-writer is just not available in IP.


I would strengthen this sentiment: If you think something belongs in the 
kernel, you need to argue your case (provide much more detailed 
descriptions and use-cases.)
--

From: Andi Kleen
Date: Thursday, December 6, 2007 - 9:34 am

How is that different from a multicast group?

-Andi
--

From: David Newall
Date: Thursday, December 6, 2007 - 3:21 pm

Good question.  I don't know much about multicast IP.  It's a bit new 
for me.  I knew it uses Martian addresses!  After a little reading, I 
now know that it does allow many to many communication.

Renzo's IPN is a local protocol--you can't multicast to localhost.
--

From: Andi Kleen
Date: Thursday, December 6, 2007 - 3:42 pm

> Renzo's IPN is a local protocol--you can't multicast to localhost.

You don't need to. All local clients can join the same group without
using localhost.
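
A rough sketch of what joining a common group on one host could look
like with ordinary UDP multicast sockets. The group address, the port
and the helper names are hypothetical, and whether this works on a given
machine depends on it having a multicast-capable interface and route:

```python
import socket

GROUP = "239.255.0.1"   # hypothetical administratively scoped group
PORT = 5007             # hypothetical port

def membership_request(group, interface="0.0.0.0"):
    """Pack the ip_mreq structure expected by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(interface)

def open_receiver(group=GROUP, port=PORT):
    """Every local receiver binds the same port and joins the same group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock

def open_sender(ttl=0):
    """TTL 0 keeps datagrams on this host; loopback delivers them locally."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
    return sock
```

A sender would then call open_sender().sendto(b"data", (GROUP, PORT)),
and each process that called open_receiver() would read its own copy.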

-Andi
--

From: Andi Kleen
Date: Thursday, December 6, 2007 - 9:35 am

It can be used between user space daemons as well. In fact it is.
e.g. they often listen to each other's messages.

-Andi

--

From: Chris Friesen
Date: Thursday, December 6, 2007 - 1:36 pm

One problem we ran into was that there are only 32 multicast groups per 
netlink protocol family.

We had a situation where we could have used netlink, but we needed the 
equivalent of thousands of multicast groups.  Latency was very 
important, so we ended up doing essentially a multicast unix socket 
rather than taking the extra penalty for UDP multicast.

Chris
--

From: Andi Kleen
Date: Thursday, December 6, 2007 - 2:26 pm

What extra penalty? Local UDP shouldn't be much more expensive than Unix.

-Andi
--

From: Andi Kleen
Date: Thursday, December 6, 2007 - 3:07 pm

UDP doesn't really have much stack. IP is also very little assuming
cached route (connect called first) 

I would expect the copies to dominate in both cases.

-Andi
--

From: Renzo Davoli
Date: Thursday, December 6, 2007 - 3:18 pm

Some more explanations trying to describe what IPN is and what it is
useful for.  We are writing the complete patch....

Summary:
* IPN is for inter-process communication. It is *not* directly related 
to TCP-IP or Ethernet.
* IPN itself is a *level 1* virtual physical network.
* IPN services (like AF_UNIX) do not require root privileges.
* TAP and GRAB are just extra features for IPN delivering Ethernet frames.
----

* IPN is for inter-process communication. It is *not* directly related 
to TCP-IP or Ethernet.

If you want you can call it Inter Process Bus Communication.  It is an
extension of AF_UNIX.  Comments saying that some services can be
implemented by using TCP-IP multicast protocols are unrelated to IPN.
All AF_UNIX services could be implemented as TCP-IP services on
127.0.0.1. Do we abolish AF_UNIX, then?  The problem is that to use
TCP-IP, you'd need to wrap the packets with TCP or UDP, IP and Ethernet
headers, and the stack would waste time managing useless protocols.  If
you just want to send strings to a set of local processes, TCP-IP is
overkill.  Even X-Window uses AF_UNIX sockets to talk with local
clients; it is a performance issue... I think Chris is right.

* IPN itself is a *level 1* virtual physical network.

Like any physical network you can run higher level protocols on it, thus
Ethernet, and then TCP-IP can be services you can run on IPN, but there
can be IPN networks running neither TCP-IP nor Ethernet.

* IPN services (like AF_UNIX) do not require root privileges.

There are many communication services where the user needs broadcast or
p2p among user processes.  If a user (not root) wants to run several
User-Mode Linux, Qemu, Kvm VMs, the only way to have them connected
together is our Virtual Distributed Ethernet.  (For this reason VDE
exists in almost all the distros, it has been ported to other OSs, and
is already supported in the Linux Kernel for User-Mode Linux).  VDE is a
userland daemon, hence requires two context switches ...
From: Andi Kleen
Date: Thursday, December 6, 2007 - 3:38 pm

No ethernet headers on localhost. Just to give you a perspective:
IP+TCP headers are 50 bytes (with timestamps) and IP+UDP is 28 bytes.
On the other hand the sk_buff+skb_shared_info headers, which are used for 
all socket communication in Linux and mostly have to be set up every time,
are 192+312 bytes on 64bit [part of the 312 bytes is an array that is 
typically only partly used] or 156+236 bytes on 32bit. So the internal
data structures dwarf the network headers.

There might be other reasons why TCP/IP is slower, but arguing 
with the size of the headers is just bogus.
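
The arithmetic behind this comparison can be checked with a short
script. The byte counts are the ones quoted above; the variable and
function names are only for illustration:

```python
# Byte counts as quoted in this message.
HEADER_BYTES = {"IP+TCP (with timestamps)": 50, "IP+UDP": 28}
SKB_BYTES = {"64-bit": 192 + 312, "32-bit": 156 + 236}

def header_share(header_bytes, skb_bytes):
    """Fraction of per-packet bytes (headers + sk_buff state) that is headers."""
    return header_bytes / (header_bytes + skb_bytes)

for arch, skb in SKB_BYTES.items():
    for name, hdr in HEADER_BYTES.items():
        print(f"{arch}: {name} is {hdr} B against {skb} B of sk_buff state "
              f"({header_share(hdr, skb):.1%} of the total)")
```

On either architecture the protocol headers come out as a small fraction
of the bytes the kernel sets up per packet anyway.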

My personal feeling would be that if TCP/IP is too slow for something
it is better to just improve the stack than to add a completely
new socket family. That will benefit much more applications without
requiring to change them.

About the only good reason to use UNIX sockets is when you need to use

IP Multicast when properly set up also doesn't need root.


They could easily just tunnel over a local multicast group for example.

-Andi

--

From: Renzo Davoli
Date: Thursday, December 6, 2007 - 5:18 pm

I have done some raw tests.
(you can read the code here: http://www.cs.unibo.it/~renzo/rawperftest/)

The programs are quite simple. The sender sends "Hello World" as fast as it
can, while the receiver prints time() for every 1 million messages
received.
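
For readers who want to try something similar, here is a rough sketch of
a message-rate measurement over a local AF_UNIX datagram pair. It is not
the test code linked above: it runs sender and receiver in one process
and alternates send and receive, so its numbers are not comparable with
the figures in this message. The function name is made up:

```python
import socket
import time

def measure_rate(count, payload=b"Hello World"):
    """Push `count` datagrams through a socketpair; return (received, msgs/sec)."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    received = 0
    start = time.perf_counter()
    try:
        for _ in range(count):
            sender.send(payload)
            # Read each datagram right away so the queue never fills up.
            if receiver.recv(64) == payload:
                received += 1
    finally:
        sender.close()
        receiver.close()
    elapsed = time.perf_counter() - start
    rate = received / elapsed if elapsed > 0 else float("inf")
    return received, rate
```

Calling measure_rate(1_000_000) gives a figure for this machine only; a
fair comparison would need separate sender and receiver processes.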

On my laptop, tests on 20000000 "Hello World" packets, 

One receiver:
multicast       244,000 msg/sec
IPN             333,000 msg/sec  (36% faster)

Two receivers:
multicast       174,000 msg/sec
IPN             250,000 msg/sec  (43% faster)

Apart from this, how could I implement policies over a multicast socket,
e.g. how does a Kernel VDE_switch work on multicast sockets?

If I send an ethernet packet over a multicast socket it can emulate just a
hub (although it seems quite unnatural to me to have TCP-UDP 
over IP over Ethernet over UDP over IP - okay we can skip the Ethernet 
on localhost, long ethernet frames get fragmented but... details).

On a multicast socket you cannot use policies. I mean an IPN network (or
bus or group) can have a policy that reads some info in the packet to
decide the set of recipients.
For a vde_switch it is the destination mac address, when found in the
MAC hash table, that selects the recipient port. For midi communication it 
could be the channel number....
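
To make the idea of a recipient policy concrete, here is a small sketch
of a MAC-learning switch policy in user space. It is not the kvde_switch
code; the class and method names are invented, and it only assumes the
standard Ethernet layout (6-byte destination followed by 6-byte source):

```python
class LearningSwitch:
    """Pick recipient ports from the destination MAC, like an Ethernet switch."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}          # source MAC -> port it was last seen on

    def recipients(self, in_port, frame):
        dst, src = frame[0:6], frame[6:12]
        self.mac_table[src] = in_port        # learn where the sender lives
        out_port = self.mac_table.get(dst)
        if out_port is None:
            # Unknown (or broadcast) destination: behave like a hub.
            return self.ports - {in_port}
        if out_port == in_port:
            return set()                     # destination is on the same port
        return {out_port}
```

A hub is the degenerate policy that always returns every port except the
incoming one, which is all a plain multicast group can offer.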

With the switching fabric moved to userland, the performance figures are
quite different.

renzo

--

From: Chris Friesen
Date: Thursday, December 6, 2007 - 4:02 pm

I just reran on a 3.2GHz P4 running 2.6.11 (Fedora Core 4).  42% latency 
increase.

For stream sockets, unix gives approximately a 62% bandwidth increase 
over tcp.   (Although tcp could probably be tuned to do better than this.)

Chris
--

From: Andi Kleen
Date: Thursday, December 6, 2007 - 4:06 pm

Sounds like something that should be looked into. I know of no

How long a stream did you test? You might be measuring slow start.

-Andi
--

From: Chris Friesen
Date: Thursday, December 6, 2007 - 4:42 pm

No idea.  These are just the standard local networking tests in lmbench 
v2.  In our case the latency was the big concern and we were using 
datagrams anyway.

Chris
--

From: David Miller
Date: Thursday, December 6, 2007 - 8:41 pm

From: "Chris Friesen" <cfriesen@nortel.com>

I'm pretty sure we've removed this limitation.
--

From: Chris Friesen
Date: Thursday, December 6, 2007 - 9:21 pm

As of 2.6.23 nl_groups is a 32-bit bitmask with one bit per group. 
Also, it appears that only root is allowed to use multicast netlink.

Chris
--

From: David Miller
Date: Thursday, December 6, 2007 - 11:40 pm

From: "Chris Friesen" <cfriesen@nortel.com>

The kernel supports much more than 32 groups, see nlk->groups which is
a bitmap which can be sized to arbitrary sizes.  nlk->nl_groups is
for backwards compatibility only.

netlink_change_ngroups() does the bitmap resizing when necessary.
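
As an illustration of the two ways of naming groups discussed here, the
sketch below builds the legacy 32-bit nl_groups mask (one bit per group,
group n on bit n-1) and, for any group number, joins through the
NETLINK_ADD_MEMBERSHIP socket option. The helper names are made up and
the numeric constants are copied from the Linux headers as an
assumption, since Python may not export them:

```python
import socket

SOL_NETLINK = 270            # assumed value from <linux/socket.h>
NETLINK_ADD_MEMBERSHIP = 1   # assumed value from <linux/netlink.h>

def legacy_group_mask(groups):
    """Build the 32-bit nl_groups mask; only groups 1..32 fit in it."""
    mask = 0
    for group in groups:
        if not 1 <= group <= 32:
            raise ValueError(f"group {group} does not fit in nl_groups")
        mask |= 1 << (group - 1)
    return mask

def join_group(sock, group):
    """Join one group by number; not limited to the first 32 groups."""
    sock.setsockopt(SOL_NETLINK, NETLINK_ADD_MEMBERSHIP, group)
```

With the legacy form the mask is passed at bind time, for example
sock.bind((0, legacy_group_mask([1, 3]))) on an AF_NETLINK socket.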

The root multicast listening restriction can be relaxed in some
circumstances, whatever is needed to fill your needs.

Stop making excuses, with minor adjustments we have the facilities to
meet your needs.  There is no need for yet-another-protocol to do what
you're trying to do, we already have too much duplicated
functionality.
--

From: Andi Kleen
Date: Friday, December 7, 2007 - 3:03 am

I suspect they would be better off just using IP multicast. But the localhost 
latency penalty vs Unix Chris was talking about probably needs to be investigated.

-Andi
--

From: Renzo Davoli
Date: Friday, December 7, 2007 - 2:18 pm

Andi, David,

I disagree. If you suspect we would be better off using IP multicast, I think
your suspicions are not supported.
Try the following exercises, please.... Can you provide better solutions
without IPN?

	renzo

Exercise #1.
I am a user (NOT ROOT), I like kvm, qemu etc. I want an efficient network
between my VM.

My solution:
I create an IPN socket, with protocol IPN_VDESWITCH and all the VM can
communicate.

Your solution:
- I am condemned by two kernel developers to run the switch in the userland
- I beg the sysadm to give me some pre-allocated taps connected together
by a kernel bridge.
- I create a multicast socket limited to this host (TTL=0) and I use it
like a hub. It cannot switch the packets.

Exercise #2.
I am a sysadm (maybe a lab administrator). I want my users (not root)
of the group "vmenabled" to run their VM connected to a network. 
I have hundreds of users in vmenabled (say, students).

My Solution:
I create an IPN socket, with protocol IPN_VDESWITCH, connected to a virtual
interface, say ipn0. I give the socket permission 760, owner
root:vmenabled.

Your solution:
- I am condemned by two kernel developers to run the switch in the userland
- I create a multicast socket connected to a tap and then I define iptables
filters to prevent unauthorized users from joining the net.
- I create hundreds of preallocated tap interfaces, at least one per user.

Exercise #3.
I am a user (NOT ROOT) and I have a heavy stream of *very private data* 
generated by some processes that must be received by several processes.
I am looking for an efficient solution.
Data can be ASCII strings, or a binary stream.
It is not a "networking" issue, it is just IPC.

My solution.
I create an IPN socket with permission 700, IPN_BROADCAST protocol. All 
the processes connect to the socket either for writing or for reading (or both).

Your solution:
- I am condemned by two kernel developers to use userland inefficient
solutions like named pipes, tee, ...
From: David Miller
Date: Friday, December 7, 2007 - 7:07 pm

From: renzo@cs.unibo.it (Renzo Davoli)

I personally have not purely advocated IP, although the performance
differences between UDP and AF_UNIX should be investigated.

Instead I advocated using AF_NETLINK with some minor multicast
permission modifications to suit your needs.
--

From: Chris Friesen
Date: Monday, December 10, 2007 - 9:05 am

Thanks for the explanation.  Given that it's a bitmap doesn't that 
result in a cost of O(number of groups) when processing messages?  In 


You may have confused me with the OP...I just chimed in because of some 
of the limitations we found when we wanted to do similar things.  In our 
case we created a new unix-like protocol to allow multicast, and have 
been using it for a few years.

However, if we could use netlink instead in our next release that would 
be a good thing.  A couple questions:

1) Is it possible to register to receive all netlink messages for a 
particular netlink family?  This is useful for debugging--it allows a 
tcpdump equivalent.

2) Is there any up-to-date netlink programming guide?  I found this one:

http://people.redhat.com/nhorman/papers/netlink.pdf

but it's three years old now.


Thanks,

Chris
--
