Proposal: let us simplify
PTRACE_SYSCALL/PTRACE_SINGLESTEP/PTRACE_SYSEMU/PTRACE_SYSEMU_SINGLESTEP,
and now PTRACE_BLOCKSTEP (which will soon require a
PTRACE_SYSEMU_BLOCKSTEP), my PTRACE_SYSVM... etc. etc.

Summary of the solution:
Use tags in the "addr" parameter of existing
PTRACE_SYSCALL/PTRACE_SINGLESTEP/PTRACE_CONT/PTRACE_BLOCKSTEP calls
to skip the current call (PTRACE_VM_SKIPCALL) or skip the second upcall
to the VM/debugger after the syscall execution (PTRACE_VM_SKIPEXIT).

Note: The patch is against linux-2.6.26-rc6; it applies with some line
offset warnings to git2, too.

Motivation:
The ptrace tag PTRACE_SYSEMU is a feature mainly used for User-Mode
Linux, or at most for other virtual machines aiming to virtualize *all*
the syscalls (total virtual machines).
In fact:
ptrace(PTRACE_SYSEMU, pid, 0, 0)
means that the *next* system call will not be executed.
PTRACE_SYSEMU AFAIK has been implemented only for x86_32.

I already proposed some time ago a different tag: PTRACE_SYSVM
(and I maintain a patch for it) where:
ptrace(PTRACE_SYSVM, pid, XXX, 0)
1* is the same as PTRACE_SYSCALL when XXX==0,
2* skips the call (and stops before entering the next syscall) when
XXX==PTRACE_VM_SKIPCALL | PTRACE_VM_SKIPEXIT
3* skips the ptrace call after the system call if XXX==PTRACE_VM_SKIPEXIT.

PTRACE_SYSVM has been implemented for x86_32, powerpc_32, um+x86_32.
(x86_64 and ppc64 exist too, but are less tested).

The main difference between SYSEMU and SYSVM is that with SYSVM it is
possible to decide if *this* system call should be executed or not
(instead of the next one).
SYSVM can be used also for partial virtual machines (some syscall gets
virtualized and some others do not), like our umview.

PTRACE_SYSVM above can be used instead of PTRACE_SYSEMU in user-mode
linux and in all the other total virtual machines. In fact, provided
user-mode linux skips *all* the syscalls it does not matter if the
upcall happens just after (SYSEMU) or just before (SYSVM) having skipped
the ...
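As a rough illustration of the summary above, this is how a tracer might pass a tag through the "addr" argument of an existing request. It is only a sketch: the numeric values given to PTRACE_VM_SKIPCALL and PTRACE_VM_SKIPEXIT are placeholders, not taken from the patch, and vm_tag() and resume_tracee() are hypothetical helpers. On a kernel without the patch the addr argument of PTRACE_SYSCALL is ignored, so the tag has no effect there.

```c
#include <sys/ptrace.h>
#include <sys/types.h>

/* Placeholder values for this sketch only; the real ones come from the patch. */
#ifndef PTRACE_VM_SKIPCALL
#define PTRACE_VM_SKIPCALL 1UL
#endif
#ifndef PTRACE_VM_SKIPEXIT
#define PTRACE_VM_SKIPEXIT 2UL
#endif

/* Build the tag word that would travel in the "addr" argument. */
unsigned long vm_tag(int skip_call, int skip_exit)
{
    unsigned long tag = 0;

    if (skip_call)
        tag |= PTRACE_VM_SKIPCALL;
    if (skip_exit)
        tag |= PTRACE_VM_SKIPEXIT;
    return tag;
}

/* Hypothetical helper: resume a stopped tracee until its next syscall stop,
 * carrying the tag in addr instead of using a dedicated request. */
long resume_tracee(pid_t pid, unsigned long tag)
{
    return ptrace(PTRACE_SYSCALL, pid, (void *)tag, (void *)0);
}
```

The point of the design is that no new request number is needed: the same PTRACE_SYSCALL (or SINGLESTEP, CONT, BLOCKSTEP) request is reused and only the otherwise unused addr word changes.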
On the whole, I'm in favor of generalizing ptrace, especially if it ...

There's a symmetry implied in the PTRACE_VM_SKIPCALL and
PTRACE_VM_SKIPEXIT names which doesn't exist in reality. SKIPEXIT (as
you note later) merely omits the notification on system call return.
SKIPCALL keeps the notification, but omits the system call execution, so
the effects are very different from each other. I think this is just a
naming issue - we don't want the names to fake ...

BTW, if performance is the issue here (and I don't see any other
compelling reasons for it), there are other possibilities which provide
much better performance. Any PTRACE_* variant will have at least one
notification. While there is a noticeable gain over two notifications,
that's marginal compared to no notifications at all.

If you know ahead of time what system calls you want to trace, a system
call tracing mask lets you avoid those notifications totally. I wrote
up a patch a couple of years ago -
http://marc.info/?l=user-mode-linux-devel&m=114495242202954&w=2
but the interface implemented there isn't very good.

Jeff

--
Work email - jdike at linux dot intel dot com
--
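The tracing mask idea can be sketched as a bitmap indexed by system call number, where a stop is reported only for calls whose bit is set. This is a user-space illustration of the data structure, not the interface from the patch linked above; MAX_SYSCALLS and all the names below are made up for the sketch.

```c
#include <limits.h>
#include <string.h>

#define MAX_SYSCALLS 512 /* assumed upper bound, for illustration only */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS ((MAX_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct syscall_mask {
    unsigned long bits[MASK_WORDS];
};

/* Start with no system call traced. */
void mask_clear_all(struct syscall_mask *m)
{
    memset(m->bits, 0, sizeof(m->bits));
}

/* Request notifications for system call number nr. */
void mask_trace(struct syscall_mask *m, unsigned int nr)
{
    if (nr < MAX_SYSCALLS)
        m->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/* Nonzero if a stop should be reported for nr; other calls are not reported. */
int mask_is_traced(const struct syscall_mask *m, unsigned int nr)
{
    if (nr >= MAX_SYSCALLS)
        return 0;
    return (int)((m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL);
}
```

A mask like this is keyed on the system call number alone, so it only helps when the set of interesting calls is known before the call is made.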
Maybe we can find out better tag names.
In the patch I submitted PTRACE_VM_SKIPCALL implies PTRACE_VM_SKIPEXIT
as it is useless to have a notification after nothing has been done.
So, there are three behaviors after the first notification:
0 -> do the syscall and notify after it
PTRACE_VM_SKIPEXIT -> do the syscall and do not notify after it
PTRACE_VM_SKIPCALL -> do not do the syscall and do not notify after it
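The behaviors after the first notification can be written down as a small decoding function. A sketch, assuming placeholder bit values for the two tags (the real values are defined by the patch) and a hypothetical struct vm_action:

```c
/* Placeholder values for this sketch only. */
#ifndef PTRACE_VM_SKIPCALL
#define PTRACE_VM_SKIPCALL 1UL
#endif
#ifndef PTRACE_VM_SKIPEXIT
#define PTRACE_VM_SKIPEXIT 2UL
#endif

struct vm_action {
    int run_syscall; /* execute the system call? */
    int notify_exit; /* stop the tracee again after the call? */
};

/* Decode the tag received in "addr" at the first notification. */
struct vm_action vm_decode(unsigned long tag)
{
    struct vm_action a = { 1, 1 }; /* tag 0: plain PTRACE_SYSCALL behavior */

    if (tag & PTRACE_VM_SKIPEXIT)
        a.notify_exit = 0;
    if (tag & PTRACE_VM_SKIPCALL) {
        a.run_syscall = 0;
        a.notify_exit = 0; /* SKIPCALL implies SKIPEXIT */
    }
    return a;
}
```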
There is a misunderstanding about what I meant by "some syscall gets
virtualized and some others do not". Obviously it is a fault of mine: it
was poorly explained. Let me briefly describe our partial virtual
machines to explain one possible application for these tags.
(the complete documentation of the project can be found here:
wiki.virtualsquare.org).
umview (and now kmview using a kernel module based on utrace) decides if
a syscall must be virtualized or not depending on the value of its
arguments, not on the syscall number. With "system call" I mean "call of
a system call", a "system call call";-)
For example, *mview {umview,kmview} can virtualize just a subtree of the
file system, thus an "open" system call gets virtualized only if the path
refers to a file in the subtree. Consequently a system call like "read"
becomes virtual if the file descriptor was created by a virtualized
open, otherwise the process executes the standard read provided by the
kernel.
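A minimal sketch of that decision logic, assuming a single virtualized subtree and a fixed-size table of file descriptors. The names (virt_root, on_open, read_is_virtual) and the path "/mnt/virtual" are invented for this illustration and are not the *mview implementation; relative paths and per-process tables are left out.

```c
#include <string.h>

#define MAX_FDS 1024

static const char *virt_root = "/mnt/virtual"; /* assumed subtree */
static unsigned char fd_is_virtual[MAX_FDS];

/* A path is virtualized if it is the subtree root or lies below it. */
int path_is_virtual(const char *path)
{
    size_t n = strlen(virt_root);

    return strncmp(path, virt_root, n) == 0 &&
           (path[n] == '\0' || path[n] == '/');
}

/* Record, when an open returns fd, whether that open was virtualized. */
void on_open(int fd, const char *path)
{
    if (fd >= 0 && fd < MAX_FDS)
        fd_is_virtual[fd] = (unsigned char)path_is_virtual(path);
}

/* read(fd, ...) is virtual only if fd came from a virtualized open. */
int read_is_virtual(int fd)
{
    return fd >= 0 && fd < MAX_FDS && fd_is_virtual[fd];
}
```

The decision for "read" depends on the value of its fd argument and on earlier calls, never on the syscall number alone, which is why it has to be taken at the first notification of each call.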
In this way users can (virtually) mount file system images just for the
processes running inside a *mview instance, or run user-level network
stacks, virtual devices, define their own perspective on everything
(uid, gid, system name). We have virtualized even the pace of the time
flowing.
We do not "boot" a different kernel, there are just modules that users
can combine to virtualize different entities:
- umfuse for the file system
- umnet for networking
- umdev for devices
- umtime, umbinfmt, umtime, umname...
We need all the different behaviors listed above.
PTRACE_VM_SKIPCALL -> for the system calls we virtualize.
PTRACE_VM_SKIPEXIT -> for the non virtualized system ...

To be more precise -
don't do the syscall or return notification
Looking at things this way, it seems like you might want three flags,
since the asymmetry is caused by two things being bundled into
SKIPCALL.
If you have
PTRACE_VM_SKIPEXIT - skip the return notification
PTRACE_VM_SKIPCALL - skip the syscall
PTRACE_VM_SKIPSTART - skip the call notification
this makes the meaning make more sense to me.
The downside of this is that you end up with at least one combination that
doesn't make too much sense, like PTRACE_VM_SKIPCALL (do both ...

OK, if you're looking at the arguments in order to decide what to do,
then you can't just mask out the notifications.
Jeff
--
Work email - jdike at linux dot intel dot com
--
Jeff,
There are three events for a syscall:
START - call notification
CALL - run the SYSCALL
EXIT - return notification.
I think it is nonsense to write code for useless cases.
Let us see all the combinations of doing/skipping each one of the three
phases:
0- DOSTART - DOCALL - DOEXIT - Standard PTRACE_SYSCALL (new option 0)
1- DOSTART - DOCALL - SKIPEXIT - PTRACE_VM_SKIPEXIT of my proposal
2- DOSTART - SKIPCALL - DOEXIT - useless, nothing has changed between
the two notifications
3- DOSTART - SKIPCALL - SKIPEXIT - PTRACE_VM_SKIPCALL in my proposal
4- SKIPSTART - DOCALL - DOEXIT - is this useful? (case 4, see below)
5- SKIPSTART - DOCALL - SKIPEXIT - simply don't use PTRACE_SYSCALL
6- SKIPSTART - SKIPCALL - DOEXIT - this is the old PTRACE_SYSEMU (case 6)
7- SKIPSTART - SKIPCALL - SKIPEXIT - nullify completely the syscalls
(case 7).
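The numbering in this list is the binary encoding of the three choices, with SKIPSTART as bit 2, SKIPCALL as bit 1 and SKIPEXIT as bit 0. A short sketch that regenerates each row from its case number (the macro and function names exist only for this illustration):

```c
#include <stdio.h>

#define SKIP_EXIT  1u
#define SKIP_CALL  2u
#define SKIP_START 4u

/* Write the row for case number n (0..7) into buf. */
void describe_case(unsigned int n, char *buf, size_t len)
{
    snprintf(buf, len, "%s - %s - %s",
             (n & SKIP_START) ? "SKIPSTART" : "DOSTART",
             (n & SKIP_CALL)  ? "SKIPCALL"  : "DOCALL",
             (n & SKIP_EXIT)  ? "SKIPEXIT"  : "DOEXIT");
}
```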
case 4: a vm or debugging monitor receives just the return value of a
syscall. On many architectures it is not even possible to read the parameters
of the call (e.g. powerpc where the first argument and the return value
use the same register). This choice must be made a priori, so without
actually knowing which will be the next system call.
case 6: this makes sense just for applications which virtualize *all* the
system calls; the current PTRACE_SYSEMU works exactly in this way.
My patch shows that for these applications it does not matter whether the
virtualization takes place before skipping the call or after having just
skipped the call. So PTRACE_VM_SKIPCALL can be used instead.
case 7: skip the next syscall and give no information about it; there is no
way to virtualize or trace what is going on.
Who could ever be interested in an option like this?
It seems that the combinations that really make sense are those skipping
a trailing part of the sequence.
DOSTART - DOCALL - DOEXIT my option 0
DOSTART - DOCALL - SKIPEXIT my option ...

I can see this being useful - this is kind of what strace wants, except
that it wouldn't be able to see that a system call is about to sleep.

This could be implemented by just stashing any trashed ...

Seems reasonable. In this case, they should be numbered 0, 1, 2 rather
than having masks or-ed together. This happens to produce the ...

Maybe. How about PTRACE_VM_TRACESTART? Makes the naming somewhat
non-orthogonal, but shorter and descriptive.

Jeff

--
Work email - jdike at linux dot intel dot com
--
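The numbering suggestion could look like the following sketch, where the value is simply the number of trailing phases (EXIT first, then CALL) that are skipped. The enum and its member names are hypothetical and not taken from any patch.

```c
enum vm_mode {
    VM_MODE_FULL     = 0, /* do the call, notify on return */
    VM_MODE_SKIPEXIT = 1, /* do the call, no return notification */
    VM_MODE_SKIPCALL = 2  /* no call, no return notification */
};

/* The call runs unless the mode skips it. */
int mode_runs_syscall(enum vm_mode m)
{
    return m < VM_MODE_SKIPCALL;
}

/* The return notification is sent only in the full mode. */
int mode_notifies_exit(enum vm_mode m)
{
    return m < VM_MODE_SKIPEXIT;
}

/* With plain numbers, rejecting a bad "addr" value is a single range check. */
int mode_is_valid(unsigned long v)
{
    return v <= (unsigned long)VM_MODE_SKIPCALL;
}
```

Compared with or-ed masks, a numbered mode cannot express a combination such as "skip the call but still notify on return", so no code is needed to reject or define it.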
