How to manage shared persistent local caching (FS-Cache) with NFS?

Previous thread: New Address Family: Inter Process Networking (IPN) by Renzo Davoli on Wednesday, December 5, 2007 - 9:40 am. (28 messages)

Next thread: [PATCH] fix group stop with exit race by Oleg Nesterov on Wednesday, December 5, 2007 - 10:30 am. (3 messages)
From: David Howells
Date: Wednesday, December 5, 2007 - 10:11 am

Okay...  I'm getting to the point where I want to release my local caching
patches again and have NFS work with them.  This means making NFS mounts share
or not share appropriately - something that's engendered a fair bit of
argument.

So I'd like to solicit advice on how best to deal with this problem.

Let me explain the problem in more detail.


================
CURRENT PRACTICE
================

As the kernel currently stands, coherency is ignored for mounts that have
slightly different combinations of parameters, even if these parameters just
affect the properties of the network "connection" used or just mark a superblock
as being read-only.

Consider the case of a file remotely available by NFS.  Imagine the client sees
three different views of this file (they could be by three overlapping mounts,
or by three hardlinks or some combination thereof).

This is how NFS currently operates without any superblock sharing:

				+---------+
    Object on server --->	|	  |
				|  inode  |
				|	  |
				+---------+
				    /|\
				   / | \
				  /  |	\
				 /   |	 \
				/    |	  \
			       /     |	   \
			      /	     |	    \
			     /	     |	     \
			    /	     |	      \
			   /	     |	       \
			  /	     |		\
			 |	     |		 |
			 |	     |		 |
 :::::::::::::NFS::::::::|:::::::::::|:::::::::::|:::::::::::::::::::::::::::::
			 |	     |		 |
			 |	     |		 |
			 |	     |		 |
   +---------+	    +---------+	     |		 |
   |	     |	    |	      |	     |		 |
   | mount 1 |----->| super 1 |	     |		 |
   |	     |	    |	      |	     |		 |
   +---------+	    +---------+	     |		 |
				     |		 |
				     |		 |
   +---------+			+---------+	 |
   |	     |			|	  |	 |
   | mount 2 |----------------->| super 2 |	 |
   |	     |			|	  |	 |
   +---------+			+---------+	 |
						 |
						 |
   +---------+				    +---------+
   |	     |				    |	      |
   | mount 3 |----------------------------->| super 3 |
   |	     |				    |	      |
   +---------+				    ...
From: Jon Masters
Date: Wednesday, December 5, 2007 - 10:49 am

[Empty message]
From: David Howells
Date: Wednesday, December 5, 2007 - 11:03 am

I don't have figures on that, but I do know people have complained about it.

My point was meant to be that the presence and coverage of a cache is more
likely to reflect the client machine than would the NIS map for the NFS
automounts.  You wouldn't necessarily want to store this table in NIS.

David
--

From: Chuck Lever
Date: Wednesday, December 5, 2007 - 12:54 pm

I don't see how persistent local caching means we can no longer  
ignore (a) and (b) above.  Can you amplify this a bit?  Nothing you  
say in the rest of your proposal convinces me that having multiple  
caches for the same export is really more than a theoretical issue.

Frankly, the reason why admins mount exports multiple times is  
precisely because they want different applications to access the  
files in different ways.  Admins *want* one mount point to be  
available ro, and another rw.  They *want* one mount point to use  
'noac' and another not to.  They *want* multiple sockets, more RPC  
slots, and unique caches for different applications.  No one would go  
to the trouble of mounting an export again, using different options,  
unless that's precisely the behavior that they wanted.

This is actually a feature of NFS.  It's used as a standard part of  
production environments, for example, when running Oracle databases  
on NFS.  One mount point is rw and is used by the database engine.   
Another mount point is ro and is used for back-up utilities, like RMAN.

Another example is local software distribution.  One mount point is  
ro, and is accessed by normal users.  Another mount point accesses  
the same export rw, and is used by administrators who provide updates  
for the software.

As useful as the feature is, one can also argue that mounting the  
same export multiple times is infrequent in most normal use cases.   
Practically speaking, why do we really need to worry about it?

The real problem here is that the NFS protocol itself does not  
support strong cache coherence.  I don't see why the Linux kernel  
must fix that problem.

The only real problem with the first scenario is that you may have  
more than one copy of a file in the persistent cache.  How often will  
that be the case?  Since the local persistence cache is probably disk- 
based and thus large relative to memory, what's the problem with  
using a little extra space?

The problems ...
From: David Howells
Date: Wednesday, December 5, 2007 - 6:22 pm

How about I put it like this.  There are two principal problems to be dealt
with:

 (1) Reconnection.

     Imagine that the administrator requests a mount that uses part of a cache.
     The client machine is at some time later rebooted and the administrator
     requests the same mount again.

     Since the cache is meant to be persistent, the administrator is at liberty
     to expect that the second mount immediately begins to use the data that
     the first mount left in the cache.

     For this to occur, the second mount has to be able to determine which part
     of the cache the first mount was using and request to use the same piece
     of cache.

     To aid with this, FS-Cache has the concept of a 'key'.  Each object in the
     cache is addressed by a unique key.  NFS currently builds a key to the
     cache object for a file from: "NFS", the server IP address, port and NFS
     version and the file handle for that file.

 (2) Cache coherency.

     Imagine that the administrator requests a mount that uses part of a
     cache.  The administrator then makes a second mount that overlaps the
     first, maybe because it's a different part of the same server export or
     maybe it uses the same part, but with different parameters.

     Imagine further that a particular server file is accessible through both
     mountpoints.  This means that the kernel, and therefore the user, has two
     views of the one file.

     If the kernel maintains these two views of the files as totally separate
     copies, then coherency is mostly not a kernel problem, it's an application
     problem - as it is now.

     However, if these two views are shared at any level - such as if they
     share an FS-Cache cache object - then coherency can be a problem.

     The two simplest solutions to the coherency problem are (a) to enforce
     sharing at all levels (superblocks, inodes, cache objects), (b) to enforce
     non-sharing.  In-between states are possible, but ...
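
As a rough illustration of the key described under (1), here is a minimal
userspace sketch in C.  It only shows how a key could be assembled from "NFS",
the server IP address, port, NFS version and file handle; the struct, the
function name and the comma-separated text encoding are assumptions made for
this example and are not taken from the FS-Cache patches.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the components named in the message. */
struct nfs_cache_key_parts {
	const char *server_ip;		/* textual server address */
	uint16_t port;			/* server port */
	unsigned int nfs_version;	/* 2, 3 or 4 */
	const unsigned char *fh;	/* opaque file handle bytes */
	size_t fh_len;			/* file handle length in bytes */
};

/*
 * Build a printable key of the form "NFS,<ip>,<port>,<version>,<fh in hex>".
 * Returns the key length, or -1 if the buffer is too small.
 * The encoding is an assumption chosen for readability only.
 */
int build_cache_key(const struct nfs_cache_key_parts *p,
		    char *buf, size_t buflen)
{
	int n = snprintf(buf, buflen, "NFS,%s,%u,%u,",
			 p->server_ip, (unsigned int)p->port, p->nfs_version);
	if (n < 0 || (size_t)n >= buflen)
		return -1;

	size_t pos = (size_t)n;

	/* Two hex digits per file handle byte plus the trailing NUL. */
	if (buflen - pos < p->fh_len * 2 + 1)
		return -1;

	for (size_t i = 0; i < p->fh_len; i++) {
		snprintf(buf + pos, 3, "%02x", p->fh[i]);
		pos += 2;
	}
	return (int)pos;
}
```

Two mounts that name the same server address, port, NFS version and file
handle produce the same key in this sketch, which is what lets a mount made
after a reboot find the earlier data, and also why two overlapping mounts
would end up addressing the same cache object.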
From: Chuck Lever
Date: Thursday, December 6, 2007 - 11:28 am

Hi David-


Why not use the fsid as well?  The NFS client already uses the fsid  
to detect when it is crossing a server-side mount point.  Fsids are  
supposed to be stable over server reboots (although sometimes they  
aren't, it could be made a condition of supporting FS-cache on clients).

I also note the inclusion of server IP address in the key.  For multi- 
homed servers, you have the same unavoidable cache aliasing issues if  
the client mounts the same server and export via different server  

Is it a problem because, if there are multiple copies of the same  
remote file in its cache, then FS-cache doesn't know, upon  
reconnection, which item to match against a particular remote file?

I think that's actually going to be a fairly typical situation --  
you'll have conditions where some cache items will become orphaned,  
for example, so you're going to have to deal with that ambiguity as a  
part of normal operation.

For example, if the FS-caching client is disconnected or powered off  
when a remote rename occurs that replaces a file it has cached, the  
client will have an orphaned item left over.  Maybe this use case is  

How do you propose to do that?

First, clearly, FS-cache has to know that it's the same object, so  
fsid and filehandle have to be the same (you refer to that as the  
"reconnection problem", but it may generally be a "cache aliasing  
problem").

I assume FS-cache has a record of the state of the remote file when  
it was last connected -- mtime, ctime, size, change attribute (I'll  
refer to this as the "reconciliation problem")?  Does it, for  
instance, checksum both the cache item and the remote file to detect  
data differences?

You have the same problem here as we have with file system search  
tools such as Beagle.  Reconciling file contents after a reconnection  
event may be too expensive to consider for NFS, especially if a file  

Do you allow administrators to select whether the FS-cache is  
persistent?  Or is it ...
From: David Howells
Date: Thursday, December 6, 2007 - 1:00 pm

Why use the FSID at all?  The file handles are supposed to be unique per

I'm aware of this, but unless there's:

 (a) a way to specify a logical server group to the kernel, and

 (b) a guarantee that the file handles of each member of the logical group are
     common across the group

there's nothing I can do about it.

AFS deals with these by making servers second class citizens, and defining
"file handles" to be a set within the cell space.

Besides, I can use the IP address of the server as a key.  I just have to hope
that the IP address doesn't get transferred to a different server because, as

There are multiple copies of the same remote file that are described by the
same remote parameters.  Same IP address, same port, same NFS version, same

Orphaned stuff in the cache is eventually culled by cachefilesd when there's

Rename isn't a problem provided the FH doesn't change.  NFS effectively caches
inodes, not files.  If the remote file is deleted, then either NFS will try
opening it, will fail and will tell the cache to evict it; or the remote file
will never be opened again and the garbage in the cache will be culled
eventually.  It may even hang around for ever, but if the FH is re-used, the
cache object will be evicted based on mtime + ctime + filesize being
different.

If someone tries hard enough, they can probably muck up the cache, but there's

For NFS, check mtime + ctime + filesize upon opening.  It's in the patch
already.
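
The check mentioned above can be sketched as follows.  This is a simplified
userspace illustration, not the code from the patch: the struct and function
names are hypothetical, and timestamps are reduced to whole seconds for
brevity.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the attributes saved alongside a cache object. */
struct cached_attrs {
	int64_t mtime_sec;	/* modification time, seconds */
	int64_t ctime_sec;	/* change time, seconds */
	uint64_t size;		/* file size in bytes */
};

/*
 * Compare the attributes stored with the cache object against those the
 * server reports when the file is opened.  Any mismatch means the cached
 * data is treated as stale and the object should be evicted.
 */
bool cache_object_is_valid(const struct cached_attrs *stored,
			   const struct cached_attrs *server)
{
	return stored->mtime_sec == server->mtime_sec &&
	       stored->ctime_sec == server->ctime_sec &&
	       stored->size == server->size;
}
```

A comparison of this kind is cheap because it only needs the attributes
returned at open time.  It cannot detect a change that leaves mtime, ctime and
size all untouched, which is the limitation discussed for NFS v2 and v3.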



mtime + ctime + size, yes.  I should add the change attribute if it's present,

No.  That would be horrendously inefficient.  Besides, if we're going to
checksum the remote file each time, what's the point in having a persistent

Because NFS v2 and v3 don't support proper coherency, there's a limited amount
we can do without being silly about it.  You just have to hope someone doesn't
wind back the clock on the server in order to fudge the ctime to give your
cache conniptions.  But if someone's willing to go to such lengths, ...