Inter Process Networking: a kernel module (and some simple kernel patches) to provide AF_IPN: a new address family for process networking, i.e. multipoint, multicast/broadcast communication among processes (and networks).

WHAT IS IT?
-----------

Berkeley sockets have been designed for client-server or point-to-point communication. All existing Address Families implement this idea. IPN is a new address family designed for one-to-many, many-to-many and peer-to-peer communication among processes.

IPN is an Inter Process Communication paradigm where all the processes appear as if they were connected by a networking bus. On IPN, processes can interoperate using real networking protocols (e.g. ethernet) but also using application-defined protocols (maybe just sending ascii strings, video or audio frames, etc). IPN provides networking (in the broadest definition you can imagine) to the processes. Processes can be ethernet nodes, run their own TCP-IP stacks if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.

IPN networks can be interconnected with real networks, or IPN networks running on different computers can interoperate (can be connected by virtual cables).

IPN is part of the Virtual Square Project (vde, lwipv6, view-os, umview/kmview, see wiki.virtualsquare.org).

WHY?
----

Many applications can benefit from IPN. First of all VDE (Virtual Distributed Ethernet): one service of IPN is a kernel implementation of VDE.

IPN can be useful for applications where one or some processes feed their data to several consuming processes (maybe joining the stream at run time).

IPN sockets can also be connected to tap (tuntap) like interfaces or to real interfaces (like "brctl addif"). There are specific ioctls to define a tap interface or grab an existing one.

Several existing services could be implemented (and often could have extended features) on top of IPN:
- kernel bridge
- tuntap
- macvlan

IPN could be used (IMHO) to provide multicast services to ...
On Wed, 5 Dec 2007 17:40:55 +0100
Post complete source code for kernel part to netdev@vger.kernel.org. If you want the hooks, you need to include the full source code for inclusion in mainline. All the Documentation/SubmittingPatches rules apply; you can't just ask for "facilitators" and expect to keep your stuff out of tree.
--
I am sorry if I was misunderstood. I did not want any "facilitator", nor did I want to keep my code outside the kernel; on the contrary. It is perfectly okay for me to provide the entire code for inclusion.

The purposes of my message were the following:
- I wanted to introduce the idea and tell the linux kernel community that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF? Is there an "AF registry" somewhere? (like the device major/minor registry or the well-known port assignment for TCP-IP).
- Hook: we have two different options. We can add another grabbing inline function like those used by the bridge and macvlan, or we can design a grabbing service registration facility. Which one is preferable? The former is simpler, the latter is more elegant but it requires some changes in the kernel bridge code. So the choice is between the former (less invasive, safer, inelegant) and the latter (more invasive, less safe, elegant).

We need a bit of time to stabilize the code: deeply testing the existing features and implementing some more ideas we have on it. In the meantime we would be grateful if the community could kindly answer the questions above.

renzo
--
On Thu, 6 Dec 2007 06:38:21 +0100
The usual process is to just add the value as part of the patchset.
You then need to tell the glibc maintainers so it gets included appropriately
The problems with making it a registration facility are:
* risk of making it easier for non-GPL out of tree abuse
* possible ordering issues: ie. by hardcoding each hook, the
behaviour is defined in the case of multiple usages on the same
I am a believer in review early and often. It is easier to just deal with
the nuisance issues (style, naming, configuration) at the beginning rather
than the final stage of the project.
--
Netlink is multicast/broadcast by default for once. And BC/MC certainly
Sounds like netlink. See also RFC 3549
Haven't read further I admit.
-Andi
--
RFC 3549 says: "This document describes Linux Netlink, which is used in Linux both as an intra-kernel messaging system as well as between kernel and user space."

We know AF_NETLINK; our user-space stack lwipv6 supports it. AF_IPN is different. AF_IPN is the broadcast and peer-to-peer extension of AF_UNIX. It supports communication among *user* processes.

Example: Qemu, User-Mode Linux, Kvm and our umview machines can use IPN as an Ethernet Hub and communicate among themselves, with the hosting computer and with the world through a tap-like interface. You can also grab an interface (say eth1) and use eth0 for your hosting computer and eth1 for the IPN network of virtual machines. If you load the kvde_switch submodule, IPN can be a virtual Ethernet switch. This example is already working using the svn versions of ipn and vdeplug.

Another example: You have a continuous stream of data packets generated by a process, and you want to send this data to many processes. Maybe the set of processes is not known in advance, and you want to send the data to any interested process. Some kind of publish&subscribe communication service (among unix processes, not on TCP-IP). Without IPN you need a server. With IPN the sender creates the socket, connects to it and feeds it with data packets. All the interested receivers connect to it and start reading. That's all.

I hope that this message can give a better understanding of what IPN is.

renzo
--
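For comparison, the server-based approach that the message above says is needed without IPN can be sketched in user space. This is only an illustration under stated assumptions, not IPN code: the FanOutServer class and its method names are hypothetical, and ordinary AF_UNIX datagram socket pairs stand in for whatever transport a real relay would use.

```python
import socket


def make_link():
    """Return a connected pair of AF_UNIX datagram sockets."""
    return socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)


class FanOutServer:
    """Hypothetical user-space relay: one producer, many receivers."""

    def __init__(self):
        self.subscribers = []  # server-side ends, one per receiver

    def subscribe(self):
        """Register a receiver and return the end it reads from."""
        server_end, receiver_end = make_link()
        self.subscribers.append(server_end)
        return receiver_end

    def relay_one(self, source):
        """Read one datagram from the producer and copy it to every receiver."""
        packet = source.recv(65536)
        for out in self.subscribers:
            out.send(packet)  # one extra copy per receiver
        return packet


def demo():
    """Send one packet through the relay to three receivers."""
    producer_end, server_in = make_link()
    server = FanOutServer()
    receivers = [server.subscribe() for _ in range(3)]
    producer_end.send(b"Hello World")
    server.relay_one(server_in)
    received = [r.recv(65536) for r in receivers]
    for s in [producer_end, server_in, *receivers, *server.subscribers]:
        s.close()
    return received
```

Every packet crosses the relay process once on the way in and once per receiver on the way out, which is the extra hop the message argues a kernel-side bus would avoid.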
Ok, you say it's different, but then you describe how IP unicast and
broadcast work. Both are frequently used for communication among
"*user* processes". Please provide significantly more details about
You say "tap like" interface, but people do this already with
existing infrastructure. You can connect Qemu, UML, and KVM to a
standard Linux "tap" interface, and then use the standard Linux
bridging code to connect the "tap" interface to your existing network
interfaces. Alternatively you could use the standard and well-tested
IP routing/firewalling/NAT code to move your packets around. None of
this requires new network infrastructure in the slightest. If you
have problems with the existing code, please improve it instead of
creating a slightly incompatible replacement which has different bugs
As I described above, this can be done with the existing bridging and
This is already done frequently in userspace. Just register a port
number with IANA on which to implement a "registration" server and
write a little daemon to listen on 127.0.0.1:${YOUR_PORT}. Your
interconnecting programs then use either unicast or multicast sockets
to bind, then report to the registration server what service you are
offering and what port it's on. Your "receivers" then connect to the
registration server, ask what port a given service is on, and then
multicast-listen or unicast-connect to access that service. The best
part is that all of the performance implications are already
thoroughly understood. Furthermore, if you want to extend your
communication protocol to other hosts as well, you just have to
replace the 127.0.0.1 bind with a global bind. This is exactly how
the standard-specified multiple-participant "SIP" protocol works, for
example.
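The registration-server scheme described above can be sketched as follows. This is a minimal, single-threaded illustration: the Registry class, the text commands REGISTER and LOOKUP, and the use of a kernel-chosen port instead of an IANA-registered one are all assumptions made for the example.

```python
import socket


class Registry:
    """Hypothetical registration daemon mapping service names to ports."""

    def __init__(self):
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.bind(("127.0.0.1", 0))  # port 0: let the kernel pick one
        self.sock.settimeout(2.0)
        self.addr = self.sock.getsockname()
        self.services = {}

    def handle_one(self):
        """Serve one request: 'REGISTER name port' or 'LOOKUP name'."""
        data, client = self.sock.recvfrom(1024)
        words = data.decode().split()
        if len(words) == 3 and words[0] == "REGISTER":
            self.services[words[1]] = int(words[2])
            reply = "OK"
        elif len(words) == 2 and words[0] == "LOOKUP":
            reply = str(self.services.get(words[1], 0))  # 0 means unknown
        else:
            reply = "ERR"
        self.sock.sendto(reply.encode(), client)


def ask(registry, message):
    """Send one request and return the reply text."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(2.0)
        s.sendto(message.encode(), registry.addr)
        registry.handle_one()  # demo only: serve the queued request inline
        return s.recvfrom(1024)[0].decode()
```

A sender would call something like ask(registry, "REGISTER stream 5004") and a receiver ask(registry, "LOOKUP stream"), then connect or join on the returned port. A real daemon would run its own receive loop rather than being driven by the client.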
So if you really think this is something that belongs in the kernel
you need to provide much more detailed descriptions and use-cases for
why it cannot be implemented in user-space or ...
--
Renzo also described something new (in the socket() arena): the multi-reader, multi-writer is just not available in IP.

I would strengthen this sentiment: If you think something belongs in the kernel, you need to argue your case (provide much more detailed descriptions and use-cases.)
--
How is that different from a multicast group? -Andi --
Good question. I don't know much about multicast IP. It's a bit new for me. I knew it uses Martian addresses! After a little reading, I now know that it does allow many to many communication. Renzo's IPN is a local protocol--you can't multicast to localhost. --
> Renzo's IPN is a local protocol--you can't multicast to localhost.

You don't need to. All local clients can join the same group without using localhost.
-Andi
--
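As an illustration of local clients joining one group, the standard socket options involved look roughly like this. The group address 239.255.0.1 and port 5007 are arbitrary choices for the example, the helper names are hypothetical, and whether delivery works on a given host depends on its interface and routing setup.

```python
import socket
import struct

GROUP = "239.255.0.1"  # example address from the administratively scoped range
PORT = 5007            # arbitrary example port


def membership_request(group, interface="0.0.0.0"):
    """Pack the 8-byte ip_mreq structure: group address, then interface."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(interface))


def open_receiver(group=GROUP, port=PORT):
    """Bind a UDP socket and join the multicast group."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # many local readers
    s.bind(("", port))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                 membership_request(group))
    return s


def open_sender(ttl=0):
    """Create a sending socket; TTL 0 keeps datagrams on this host."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)  # loop back locally
    return s
```

A sender would then call sendto(data, (GROUP, PORT)) and every local process that called open_receiver() would be expected to see the datagram.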
It can be used between user space daemons as well. In fact it is. e.g. they often listen to each other's messages. -Andi --
One problem we ran into was that there are only 32 multicast groups per netlink protocol family. We had a situation where we could have used netlink, but we needed the equivalent of thousands of multicast groups. Latency was very important, so we ended up doing essentially a multicast unix socket rather than taking the extra penalty for UDP multicast. Chris --
What extra penalty? Local UDP shouldn't be much more expensive than Unix. -Andi --
UDP doesn't really have much stack. IP is also very little assuming cached route (connect called first). I would expect the copies to dominate in both cases.
-Andi
--
Some more explanations trying to describe what IPN is and what it is useful for. We are writing the complete patch....

Summary:
* IPN is for inter-process communication. It is *not* directly related to TCP-IP or Ethernet.
* IPN itself is a *level 1* virtual physical network.
* IPN services (like AF_UNIX) do not require root privileges.
* TAP and GRAB are just extra features for IPN delivering Ethernet frames.

----

* IPN is for inter-process communication. It is *not* directly related to TCP-IP or Ethernet.

If you want you can call it Inter Process Bus Communication. It is an extension of AF_UNIX. Comments saying that some services can be implemented by using TCP-IP multicast protocols are unrelated to IPN. All AF_UNIX services could be implemented as TCP-IP services on 127.0.0.1. Do we abolish AF_UNIX, then? The problem is that to use TCP-IP, you'd need to wrap the packets with TCP or UDP, IP and Ethernet headers, and the stack would waste time managing useless protocols. If you just want to send strings to a set of local processes, TCP-IP is overkill. Even X-Window uses AF_UNIX sockets to talk with local clients; it is a performance issue... I think Chris is right.

* IPN itself is a *level 1* virtual physical network.

Like any physical network you can run higher level protocols on it, thus Ethernet, and then TCP-IP, can be services you can run on IPN, but there can be IPN networks running neither TCP-IP nor Ethernet.

* IPN services (like AF_UNIX) do not require root privileges.

There are many communication services where the user needs broadcast or p2p among user processes. If a user (not root) wants to run several User-Mode Linux, Qemu or Kvm VMs, the only way to have them connected together is our Virtual Distributed Ethernet. (For this reason VDE exists in almost all the distros, it has been ported to other OSs, and is already supported in the Linux Kernel for User-Mode Linux). VDE is a userland daemon, hence requires two context switches ...
No ethernet headers on localhost.

Just to give you a perspective: IP+TCP headers are 50 bytes (with timestamps) and IP+UDP is 28 bytes. On the other hand the sk_buff+skb_shared_info headers, which are used for all socket communication in Linux and mostly have to be set up every time, are 192+312 bytes on 64bit [part of the 312 bytes is an array that is typically only partly used] or 156+236 bytes on 32bit. So the internal data structures dwarf the network headers.

There might be other reasons why TCP/IP is slower, but arguing with the size of the headers is just bogus.

My personal feeling would be that if TCP/IP is too slow for something it is better to just improve the stack than to add a completely new socket family. That will benefit many more applications without requiring changes to them.

About the only good reason to use UNIX sockets is when you need to use

IP Multicast when properly set up also doesn't need root. They could easily just tunnel over a local multicast group for example.
-Andi
--
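The size comparison in the message above can be checked with a few lines of arithmetic. The byte counts are the ones quoted in the message; the constant and function names are only labels chosen for this sketch.

```python
# Byte counts as quoted in the message above.
IP_TCP_HEADERS = 50    # IP + TCP headers, with timestamps
IP_UDP_HEADERS = 28    # IP + UDP headers
SKB_64BIT = 192 + 312  # sk_buff + skb_shared_info on 64-bit
SKB_32BIT = 156 + 236  # sk_buff + skb_shared_info on 32-bit


def report():
    """Return rows of (arch, protocol, skb bytes, header bytes, ratio)."""
    rows = []
    for arch, skb in (("64-bit", SKB_64BIT), ("32-bit", SKB_32BIT)):
        for proto, hdr in (("TCP", IP_TCP_HEADERS), ("UDP", IP_UDP_HEADERS)):
            rows.append((arch, proto, skb, hdr, round(skb / hdr, 1)))
    return rows


if __name__ == "__main__":
    for arch, proto, skb, hdr, ratio in report():
        print(f"{arch} {proto}: {skb} bytes of skb bookkeeping "
              f"vs {hdr} bytes of headers ({ratio}x)")
```

On these figures the per-packet bookkeeping is between roughly 8 and 18 times the size of the protocol headers, which is the point being made about header size not being the dominant cost.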
I have done some raw tests. (You can read the code here: http://www.cs.unibo.it/~renzo/rawperftest/)

The programs are quite simple. The sender sends "Hello World" as fast as it can, while the receiver prints time() for every 1 million messages received.

On my laptop, tests on 20000000 "Hello World" packets:

One receiver:
  multicast  244,000 msg/sec
  IPN        333,000 msg/sec (36% faster)
Two receivers:
  multicast  174,000 msg/sec
  IPN        250,000 msg/sec (43% faster)

Apart from this, how could I implement policies over a multicast socket, e.g. how does a Kernel VDE_switch work on multicast sockets? If I send an ethernet packet over a multicast socket it can emulate just a hub. (Although it seems to me quite unnatural to have to have TCP-UDP over IP over Ethernet over UDP over IP - okay we can skip the Ethernet on localhost, long ethernet frames get fragmented but... details).

On a multicast socket you cannot use policies. I mean an IPN network (or bus or group) can have a policy that reads some info from the packet to decide the set of recipients. For a vde_switch it is the destination mac address, when found in the MAC hash table, that selects the recipient port. For midi communication it could be the channel number....

Moving the switching fabric to the userland, the performance figures are quite different.

renzo
--
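The difference between a hub and a switching policy mentioned above can be illustrated with a toy model. This is not the kvde_switch code: the Hub and LearningSwitch classes are hypothetical, and the policy shown is the usual learning-switch scheme (remember which port each source MAC came from, flood when the destination is unknown).

```python
class Hub:
    """Broadcast policy: every frame goes to all ports except the sender."""

    def __init__(self, ports):
        self.ports = set(ports)

    def recipients(self, in_port, frame):
        return self.ports - {in_port}


class LearningSwitch(Hub):
    """Policy that picks the recipient port from a MAC table."""

    def __init__(self, ports):
        super().__init__(ports)
        self.mac_table = {}  # source MAC (6 bytes) -> port it was seen on

    def recipients(self, in_port, frame):
        dst, src = frame[0:6], frame[6:12]  # Ethernet: destination, then source
        self.mac_table[src] = in_port       # learn where the sender lives
        if dst[0] & 1:                      # group bit set: broadcast/multicast
            return self.ports - {in_port}
        out_port = self.mac_table.get(dst)
        if out_port is None:                # unknown destination: flood
            return self.ports - {in_port}
        return {out_port} - {in_port}
```

With ports 1, 2 and 3, the first frame from A on port 1 to an unknown B is flooded to ports 2 and 3; once B has answered from port 2, later frames from A to B go to port 2 only.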
I just reran on a 3.2GHZ P4 running 2.6.11 (Fedora Core 4). 42% latency increase. For stream sockets, unix gives approximately a 62% bandwidth increase over tcp. (Although tcp could probably be tuned to do better than this.) Chris --
Sounds like something that should be looked into. I know of no
How long a stream did you test? You might be measuring slow start.
-Andi
--
No idea. These are just the standard local networking tests in lmbench v2. In our case the latency was the big concern and we were using datagrams anyway. Chris --
From: "Chris Friesen" <cfriesen@nortel.com>
I'm pretty sure we've removed this limitation.
--
As of 2.6.23 nl_groups is a 32-bit bitmask with one bit per group. Also, it appears that only root is allowed to use multicast netlink. Chris --
From: "Chris Friesen" <cfriesen@nortel.com>
The kernel supports many more than 32 groups, see nlk->groups, which is a bitmap that can be resized to arbitrary sizes. nlk->nl_groups is for backwards compatibility only. netlink_change_ngroups() does the bitmap resizing when necessary.

The root multicast listening restriction can be relaxed in some circumstances, whatever is needed to fill your needs.

Stop making excuses, with minor adjustments we have the facilities to meet your needs. There is no need for yet-another-protocol to do what you're trying to do, we already have too much duplicated functionality.
--
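The difference between a fixed 32-bit group mask and a resizable group bitmap can be modelled in a few lines. This is a toy model for illustration only: legacy_group_mask and GroupBitmap are hypothetical names, and the sketch does not reproduce the kernel's data layout or its netlink_change_ngroups() logic.

```python
def legacy_group_mask(groups):
    """Build a 32-bit mask with one bit per group, for groups 1..32."""
    mask = 0
    for group in groups:
        if not 1 <= group <= 32:
            raise ValueError(f"group {group} does not fit in a 32-bit mask")
        mask |= 1 << (group - 1)
    return mask


class GroupBitmap:
    """Toy resizable membership bitmap, grown on demand."""

    def __init__(self):
        self.bits = bytearray()

    def join(self, group):
        if group < 1:
            raise ValueError("group numbers start at 1")
        byte, bit = divmod(group - 1, 8)
        if byte >= len(self.bits):
            # Grow the bitmap with zero bytes so that the new bit fits.
            self.bits.extend(bytes(byte + 1 - len(self.bits)))
        self.bits[byte] |= 1 << bit

    def is_member(self, group):
        if group < 1:
            return False
        byte, bit = divmod(group - 1, 8)
        return byte < len(self.bits) and bool(self.bits[byte] & (1 << bit))
```

In this model a membership test for a known group number is a single byte lookup regardless of how many groups exist, while the mask form simply cannot represent group 33 or higher.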
I suspect they would be better off just using IP multicast. But the localhost latency penalty vs Unix that Chris was talking about probably needs to be investigated.
-Andi
--
Andi, David,

I disagree. If you suspect we would be better off using IP multicast, I think your suspicions are not supported. Try the following exercises, please.... Can you provide better solutions without IPN?

renzo

Exercise #1.
I am a user (NOT ROOT), I like kvm, qemu etc. I want an efficient network between my VMs.
My solution: I create an IPN socket, with protocol IPN_VDESWITCH, and all the VMs can communicate.
Your solution:
- I am condemned by two kernel developers to run the switch in the userland
- I beg the sysadm to give me some pre-allocated taps connected together by a kernel bridge.
- I create a multicast socket limited to this host (TTL=0) and I use it like a hub. It cannot switch the packets.

Exercise #2.
I am a sysadm (maybe a lab administrator). I want my users (not root) of the group "vmenabled" to run their VMs connected to a network. I have hundreds of users in vmenabled (say students).
My solution: I create an IPN socket, with protocol IPN_VDESWITCH, connected to a virtual interface, say ipn0. I give the socket permission 760, owner root:vmenabled.
Your solution:
- I am condemned by two kernel developers to run the switch in the userland
- I create a multicast socket connected to a tap and then I define iptables filters to prevent unauthorized users from joining the net.
- I create hundreds of preallocated tap interfaces, at least one per user.

Exercise #3.
I am a user (NOT ROOT) and I have a heavy stream of *very private data* generated by some processes that must be received by several processes. I am looking for an efficient solution. Data can be ASCII strings, or a binary stream. It is not a "networking" issue, it is just IPC.
My solution: I create an IPN socket with permission 700, IPN_BROADCAST protocol. All the processes connect to the socket either for writing or for reading (or both).
Your solution:
- I am condemned by two kernel developers to use userland inefficient solutions like named pipes, tee, ...
From: renzo@cs.unibo.it (Renzo Davoli)
I personally have not purely advocated IP, although the performance differences between UDP and AF_UNIX should be investigated.

Instead I advocated using AF_NETLINK with some minor multicast permission modifications to suit your needs.
--
Thanks for the explanation. Given that it's a bitmap doesn't that result in a cost of O(number of groups) when processing messages? In

You may have confused me with the OP...I just chimed in because of some of the limitations we found when we wanted to do similar things. In our case we created a new unix-like protocol to allow multicast, and have been using it for a few years. However, if we could use netlink instead in our next release that would be a good thing.

A couple questions:
1) Is it possible to register to receive all netlink messages for a particular netlink family? This is useful for debugging--it allows a tcpdump equivalent.
2) Is there any up-to-date netlink programming guide? I found this one: http://people.redhat.com/nhorman/papers/netlink.pdf but it's three years old now.

Thanks,
Chris
--
