X Gesture Extension protocol - draft proposal v1

Wed Aug 18 14:02:57 PDT 2010

On Wed, 2010-08-18 at 21:54 +0200, Simon Thum wrote:
> Hi Chase,
> 
> I'll just quickly note some things not covered by Peter.
> 
> Am 16.08.2010 17:13, schrieb Chase Douglas:
> > Each touch event location occurs within a hierarchy of windows from the child
> > window, the top-most window the touch occurred in, and the root window of the
> > screen in which the touch event occurred. The common ancestry of all touch
> > events is used for propagation.
> That's very single-user. What about big multi-touch screens with multiple
> people around? I'm not arguing you should cater to that case, but at least
> don't rule it out right in the spec. IOW, why not let the special client
> decide propagation constraints?

I should restate the last sentence here as: "The common ancestry of the
touch events comprising the gesture is used for propagation." I'll fix
that up in the next revision.

We did try to craft a protocol that works in multi-user (or two-handed
single-user :) environments. The input events are sent to the gesture
engine for recognition. The GE can split the touches into separate
groupings as it feels appropriate. It might see a cluster of two fingers
on the right side of the screen as one gesture, and a cluster of three
fingers on the left side of the screen as another gesture. It then sends
gesture events for these gestures separately.
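
To make the grouping idea concrete, here is a minimal sketch in Python.
It is not the algorithm the gesture engine uses; it only assumes that
touches close to each other belong together. The function name
group_touches and the max_gap threshold are made up for illustration.

```python
from math import hypot

def group_touches(touches, max_gap):
    """Group touches so that touches within max_gap of each other share a group.

    touches: dict mapping a touch ID to an (x, y) position.
    Returns a list of sets of touch IDs (simple single-linkage clustering).
    """
    groups = []
    for tid, (x, y) in touches.items():
        # Find every existing group that has a member close to this touch.
        near = [g for g in groups
                if any(hypot(x - touches[o][0], y - touches[o][1]) <= max_gap
                       for o in g)]
        # Merge this touch with all nearby groups into one group.
        merged = {tid}
        for g in near:
            merged |= g
            groups.remove(g)
        groups.append(merged)
    return groups

# Two fingers on the right side of the screen, three on the left.
touches = {1: (900, 400), 2: (950, 420),
           3: (100, 300), 4: (140, 330), 5: (120, 380)}
clusters = group_touches(touches, max_gap=150)
```

Each resulting cluster would then be recognized and reported as its own
gesture.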

This is another reason why I like to decouple the GE from X: it allows
people to play with touch-to-gesture assignment. In Maverick, we will
only be supporting single gestures at a time (I think that's the case;
Henrik Rydberg may whip something together before release to enable more
than one gesture :). However, picking which touches are part of which
groupings is tricky, so I can see someone come along with a better
algorithm than what is initially implemented.

> > The common ancestry is traversed from child windows to parent windows to find
> > the first window with a client selecting for initiation of the gesture primitive
> > comprising the touches. The first window meeting these criteria is the normal
> > event window.
> I think that normal window deserves a definition of its own.

I'm not sure I follow? The normal event window is just a regular window
that meets the criteria set above: the first window from the child
window to the root window that contains all the touches and has at least
one client selecting for initiation of the gesture primitive. I just
called it the "normal event window" because I needed some name for it. If
you can think of a better name let me know :).
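
The lookup can be sketched as follows. This is only an illustration of
the rule as stated above, not server code; the window names, the parent
map, and the function names are all hypothetical.

```python
def ancestry(window, parent):
    """Return the chain from `window` up to the root, child first."""
    chain = [window]
    while window in parent:
        window = parent[window]
        chain.append(window)
    return chain

def normal_event_window(touch_windows, parent, selecting):
    """Find the first common ancestor, child to root, with a selecting client.

    touch_windows: the child window under each touch of the gesture.
    parent: dict mapping each window to its parent (the root has no entry).
    selecting: windows with at least one client selecting for initiation
               of the gesture primitive.
    Returns None if no window qualifies.
    """
    chains = [ancestry(w, parent) for w in touch_windows]
    # Walk the first touch's chain child to root; a window is in the common
    # ancestry only if it also appears in every other touch's chain.
    for window in chains[0]:
        if all(window in c for c in chains[1:]) and window in selecting:
            return window
    return None

parent = {"button": "panel", "canvas": "panel",
          "panel": "toplevel", "toplevel": "root"}
selecting = {"button", "toplevel"}

split = normal_event_window(["button", "canvas"], parent, selecting)
together = normal_event_window(["button", "button"], parent, selecting)
```

Here split is "toplevel": "panel" contains both touches but has no
selecting client, and "button" selects but contains only one touch.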

> > 4. Gesture Primitive Events
> > 
> > Gesture primitive events provide a complete picture of the gesture and the
> > touches that comprise it. Each gesture provides typical X data such as the
> > window ID of the window the gesture occurred within, and the time when the event
> > occurred. The gesture specific data includes information such as the focus point
> > of the event, the location and IDs of each touch comprising the gesture, the
> > properties of the gesture, and the status of the gesture.
> Is the focus point the coordinate in GestureRecognized? If yes, why do
> you call it gesture-specific? It's present in all gestures recognized.

It's gesture specific because it may vary even among the same gesture
type. Think of a rotation gesture. It may be performed by moving two
fingers around a pivot point halfway between the two. The focus point
would thus be that pivot point. A rotation may also be performed by
rotating one finger around a stationary finger. The focus point would be
under the stationary finger in this case.

The point of the focus coordinates is to give context to the clients
about the gesture. For rotations, a client will need to know at what
point to pivot. For pinches, a client will need to know at what point to
zoom.
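
As a rough sketch of the two rotation cases, assuming a two-finger
rotation and a made-up stillness threshold (the function name and the
threshold are not part of the protocol, and a real engine may choose the
focus point differently):

```python
from math import hypot

def rotation_focus(start, end, still_threshold=2.0):
    """Pick a focus point for a two-finger rotation.

    start, end: dicts mapping touch ID to (x, y) before and after the motion.
    If exactly one finger moved less than still_threshold, the focus is
    under that finger. Otherwise it is the midpoint between the fingers.
    """
    ids = sorted(start)
    moved = {i: hypot(end[i][0] - start[i][0], end[i][1] - start[i][1])
             for i in ids}
    still = [i for i in ids if moved[i] < still_threshold]
    if len(still) == 1:
        return end[still[0]]
    (x1, y1), (x2, y2) = end[ids[0]], end[ids[1]]
    return ((x1 + x2) / 2, (y1 + y2) / 2)

# One finger rotates around a stationary finger.
pivot_under_finger = rotation_focus({1: (100, 100), 2: (200, 100)},
                                    {1: (100, 100), 2: (150, 150)})
# Both fingers move around a point halfway between them.
pivot_midpoint = rotation_focus({1: (100, 100), 2: (200, 100)},
                                {1: (150, 50), 2: (150, 150)})
```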

> > coordinates. The status of the gesture defines the state of the gesture through
> > its lifetime.
> This sentence is defined by these words until the .

Heh, I can try to clean that up. The point I'm trying to get across is
that a gesture primitive has a lifetime, and the status informs the
client about the beginning, continuation, and ending of a primitive.
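
In code terms, the lifetime might look like the sketch below. The status
names BEGIN, UPDATE, and END are placeholders I picked for this example,
not names taken from the draft.

```python
from enum import Enum

class Status(Enum):
    BEGIN = 1   # the primitive was just recognized
    UPDATE = 2  # the primitive is continuing
    END = 3     # the primitive has finished

def is_valid_lifetime(statuses):
    """True if the sequence is one BEGIN, any number of UPDATEs, then one END."""
    if len(statuses) < 2:
        return False
    return (statuses[0] is Status.BEGIN
            and statuses[-1] is Status.END
            and all(s is Status.UPDATE for s in statuses[1:-1]))

pinch = [Status.BEGIN, Status.UPDATE, Status.UPDATE, Status.END]
```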

> > When the engine recognizes a gesture primitive, it sends a gesture event to the
> > server with the set of input event sequence numbers that comprise the gesture
> > primitive. The server then selects and propagates the gesture event to clients.
> > If clients were selected for propagation, the input events comprising the
> > gesture primitive are discarded. Otherwise, the input events are released to
> > propagate through to clients as normal XInput events.
> I understand what you want to achieve, but I'd argue that apps shouldn't
> be listening to xinput events when they register for gestures. Or at
> least, only in properly constrained areas. Think of a mouse/pad gesture
> detection - how to avoid the latency implied by that approach?

I'm not sure I understand your last sentence, but I'll address the rest.

It may be true that a client should choose whether to receive only
gestures or only XInput events on a given window, but I'm not sure. X
was designed to be as flexible as possible, and leave policy up to the
toolkits and libraries that sit on top of X (or so Wikipedia tells
me :). This mechanism is following that spirit.

Beyond that, it sounds like we're of the same mindset here. You "argue
that apps shouldn't be listening to xinput events when they register for
gestures." This protocol discards XInput events when a gesture is
recognized and an event is sent to a client. I hope I haven't misread
anything :).
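
The discard-or-release rule can be sketched like this. It is a toy model
of the behavior described in the draft, not server code; the function
name, the held_events queue, and the return shape are assumptions made
for the example.

```python
def dispatch_gesture(gesture_event_seqs, selected_clients, held_events):
    """Decide what happens to held input events once a primitive is recognized.

    gesture_event_seqs: sequence numbers of the input events comprising
                        the gesture primitive.
    selected_clients: clients the server selected for the gesture event.
    held_events: dict mapping sequence number to a held input event.
    Returns (gesture_recipients, released_events).
    """
    # Take the events that make up the primitive out of the held queue.
    members = [held_events.pop(seq) for seq in gesture_event_seqs
               if seq in held_events]
    if selected_clients:
        # A client gets the gesture event, so the input events are discarded.
        return list(selected_clients), []
    # Nobody wants the gesture: release the input events as normal XInput events.
    return [], members

held = {10: "touch-begin", 11: "touch-update", 12: "unrelated-motion"}
recipients, released = dispatch_gesture([10, 11], ["client-a"], held)

held2 = {10: "touch-begin", 11: "touch-update"}
recipients2, released2 = dispatch_gesture([10, 11], [], held2)
```

Either way, a client never sees both the gesture event and the XInput
events that comprised it.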

I don't fully understand your last sentence, but I will try to address
latency concerns. I think our current gesture recognition code is less
than 500 lines of code (maybe closer to 300 lines; Henrik wrote it and
has more details if you are interested). Obviously, even a small amount
of code can ruin latency, but I think Henrik has crafted a
tight and fast algorithm. I was skeptical about latency at first too,
but human interfaces being what they are, we should have plenty of cpu
cycles to do all the gesture primitive recognition we need (please don't
read this and assume we're pegging the processors either :). So far in
testing, I haven't seen any noticeable delay, but it's still rather
early in development.

> > The timeout value is implementation specific.
> (Having 5 ms here doesn't count...)

Can you give more detail? I'm not sure what you are getting at.

Thanks Simon for reviewing the protocol!

-- Chase