X Gesture Extension protocol - draft proposal v1

Peter Hutterer peter.hutterer at who-t.net
Wed Aug 18 16:10:39 PDT 2010


On Wed, Aug 18, 2010 at 05:02:57PM -0400, Chase Douglas wrote:
> On Wed, 2010-08-18 at 21:54 +0200, Simon Thum wrote:
> > Hi Chase,
> > 
> > I'll just quickly note some things not covered by Peter.
> > 
> > Am 16.08.2010 17:13, schrieb Chase Douglas:
> > > Each touch event location occurs within a hierarchy of windows from the child
> > > window, the top-most window the touch occurred in, and the root window of the
> > > screen in which the touch event occurred. The common ancestry of all touch
> > > events is used for propagation.
> > That's very single-user. What about big multi-touch screens with
> > multiple people around? I'm not arguing you should cater to that case,
> > but at least don't rule it out right in the spec. IOW, why not let the
> > special client decide propagation constraints?
> 
> I should restate the last sentence here as: "The common ancestry of the
> touch events comprising the gesture is used for propagation." I'll fix
> that up in the next revision.
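
Just to make sure we mean the same thing, here's roughly how I read
"common ancestry" -- hypothetical structs, obviously not the server's
actual window types:

typedef struct HypoWindow {
    struct HypoWindow *parent;   /* NULL for the root window */
} HypoWindow;

static int is_ancestor_or_self(HypoWindow *anc, HypoWindow *win)
{
    for (; win; win = win->parent)
        if (win == anc)
            return 1;
    return 0;
}

/* Walk up from the first touch's window until we reach a window whose
 * subtree contains every other touch window as well. */
static HypoWindow *common_ancestor(HypoWindow **touch_windows, int ntouches)
{
    HypoWindow *cand = touch_windows[0];

    for (; cand; cand = cand->parent) {
        int i, ok = 1;
        for (i = 1; i < ntouches; i++)
            if (!is_ancestor_or_self(cand, touch_windows[i]))
                ok = 0;
        if (ok)
            return cand;
    }
    return NULL;   /* can't happen if all touches share a root */
}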
> 
> We did try to craft a protocol that works in multi-user (or two-handed
> single-user :) environments. The input events are sent to the gesture
> engine for recognition. The GE can split the touches into separate
> groupings as it feels appropriate. It might see a cluster of two fingers
> on the right side of the screen as one gesture, and a cluster of three
> fingers on the left side of the screen as another gesture. It then sends
> gesture events for these gestures separately.
> 
> This is another reason why I like to decouple the GE from X: it allows
> people to play with touch to gesture assignment. 

But aren't you coupling the GE and X with this protocol? Apps now must go
through the GE picked by the server* and take it or leave it.

* yeah, not 100% true but you get the point

> In Maverick, we will only be supporting a single gesture at a time (I
> think that's the case, Henrik Rydberg may whip something together before
> release to enable more than one gesture :). However, picking which
> touches are part of which groupings is tricky, so I can see someone
> coming along with a better algorithm than what is initially implemented.
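
Agreed, the grouping is the tricky part. Just to sketch the idea for the
archives, the naive version of such a grouping is a distance-threshold
clustering along these lines -- made-up example, nothing to do with
Henrik's actual code:

#include <math.h>

#define MAX_TOUCHES 32

typedef struct {
    double x, y;
} Touch;

static int find(int *parent, int i)
{
    while (parent[i] != i)
        i = parent[i] = parent[parent[i]];   /* path halving */
    return i;
}

/* After this, touches i and j belong to the same gesture candidate iff
 * find(parent, i) == find(parent, j). */
static void group_touches(const Touch *t, int n, double max_dist,
                          int parent[MAX_TOUCHES])
{
    int i, j;

    for (i = 0; i < n; i++)
        parent[i] = i;

    for (i = 0; i < n; i++)
        for (j = i + 1; j < n; j++) {
            double dx = t[i].x - t[j].x, dy = t[i].y - t[j].y;
            if (hypot(dx, dy) <= max_dist)
                parent[find(parent, j)] = find(parent, i);
        }
}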
>  
> > > The common ancestry is traversed from child windows to parent windows to find
> > > the first window with a client selecting for initiation of the gesture primitive
> > > comprising the touches. The first window meeting this criterion is the
> > > normal event window.
> > I think that normal window deserves a definition of its own.
> 
> I'm not sure I follow? The normal event window is just a regular window
> that meets the criteria set above: the first window from the child
> window to the root window that contains all the touches and has at least
> one client selecting for initiation of the gesture primitive. I just
> called it the "normal event window" because I needed some name for it. If
> you can think of a better name let me know :).
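
"Normal event window" works for me, fwiw. To check I have the lookup
right, it's essentially the walk below -- again hypothetical structs and a
made-up selection mask, not a patch:

typedef struct SelWindow {
    struct SelWindow *parent;           /* NULL for the root window */
    unsigned int gesture_select_mask;   /* OR of what clients selected here */
} SelWindow;

/* Start at the deepest window common to all touches and walk towards the
 * root; the first window where some client selected for this primitive's
 * initiation is the normal event window. */
static SelWindow *normal_event_window(SelWindow *common_child,
                                      unsigned int primitive_mask)
{
    SelWindow *win;

    for (win = common_child; win; win = win->parent)
        if (win->gesture_select_mask & primitive_mask)
            return win;

    return NULL;   /* nobody selected, touches go out as plain XI events */
}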
> 
> > > 4. Gesture Primitive Events
> > > 
> > > Gesture primitive events provide a complete picture of the gesture and the
> > > touches that comprise it. Each gesture provides typical X data such as the
> > > window ID of the window the gesture occurred within, and the time when the event
> > > occurred. The gesture specific data includes information such as the focus point
> > > of the event, the location and IDs of each touch comprising the gesture, the
> > > properties of the gesture, and the status of the gesture.
> > Is the focus point the coordinate in GestureRecognized? If yes, why do
> > you call it gesture-specific? It's present in all gestures recognized.
> 
> It's gesture specific because it may vary even among the same gesture
> type. Think of a rotation gesture. It may be performed by moving two
> fingers around a pivot point halfway between the two. The focus point
> would thus be that pivot point. A rotation may also be performed by
> rotating one finger around a stationary finger. The focus point would be
> under the stationary finger in this case.

I think you may be arguing about the wording here only, not about the
content. The focus point is present in every gesture, right? If so,
maybe a better wording would be:
"Each gesture provides generic data such as the window the gesture occurred
within, the time of the event and the focus point of the gesture. Data
specific to a particular type of gesture includes the location and IDs
of each touch, ..."


> The point of the focus coordinates is to give context to the clients
> about the gesture. For rotations, a client will need to know at what
> point to pivot. For pinches, a client will need to know what point to
> zoom at.
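
Fair enough. Just to illustrate for the archives: both of those focus
points fall out of something as simple as weighting each touch by how
little it has moved -- a made-up heuristic, not what your GE actually does:

#include <math.h>

typedef struct {
    double x, y;     /* current position */
    double dx, dy;   /* motion since the previous event */
} TouchSample;

/* A stationary finger gets a huge weight and pulls the focus under
 * itself; two equally moving fingers give the midpoint. */
static void focus_point(const TouchSample *t, int n, double *fx, double *fy)
{
    double wsum = 0.0, x = 0.0, y = 0.0;
    int i;

    if (n <= 0) {
        *fx = *fy = 0.0;
        return;
    }

    for (i = 0; i < n; i++) {
        double w = 1.0 / (hypot(t[i].dx, t[i].dy) + 1e-6);
        x += w * t[i].x;
        y += w * t[i].y;
        wsum += w;
    }
    *fx = x / wsum;
    *fy = y / wsum;
}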
> 
> > > coordinates. The status of the gesture defines the state of the gesture through
> > > its lifetime.
> > This sentence is defined by these words until the .
> 
> Heh, I can try to clean that up. The point I'm trying to get across is
> that a gesture primitive has a lifetime, and the status informs the
> client about the beginning, continuation, and ending of a primitive.
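
Something like the below is what I picture when you say lifetime -- names
made up, just to make the begin/continue/end idea concrete:

typedef enum {
    GESTURE_PRIMITIVE_BEGIN,    /* primitive recognized, first event */
    GESTURE_PRIMITIVE_UPDATE,   /* touches still active, primitive continues */
    GESTURE_PRIMITIVE_END       /* touches lifted or primitive broken up */
} GesturePrimitiveStatus;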
> 
> > > When the engine recognizes a gesture primitive, it sends a gesture event to the
> > > server with the set of input event sequence numbers that comprise the gesture
> > > primitive. The server then selects and propagates the gesture event to clients.
> > > If clients were selected for propagation, the input events comprising the
> > > gesture primitive are discarded. Otherwise, the input events are released to
> > > propagate through to clients as normal XInput events.
> > I understand what you want to achieve, but I'd argue that apps shouldn't
> > be listening to xinput events when they register for gestures. Or at
> > least, only in properly constrained areas. Think of mouse/pad gesture
> > detection - how do you avoid the latency implied by that approach?
> 
> I'm not sure I understand your last sentence, but I'll address the rest.
> 
> It may be true that a client should choose whether to receive only
> gestures or only XInput events on a given window, but I'm not sure. X
> was designed to be as flexible as possible, leaving policy up to the
> toolkits and libraries that sit on top of X (or so Wikipedia tells
> me :). This mechanism is following that spirit.
> 
> Beyond that, it sounds like we're of the same mindset here. You "argue
> that apps shouldn't be listening to xinput events when they register for
> gestures." This protocol discards XInput events when a gesture is
> recognized and an event is sent to a client. I hope I haven't misread
> anything :).

You need to clearly define whether this is a "shouldn't" or a "mustn't",
because only the latter is something you can safely work with. In
particular, you need to define what happens to the XI events that make up a
gesture if a client has a grab on a particular device.
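
For reference, my reading of the intended flow boils down to the toy model
below (hypothetical types and names, grabs deliberately ignored -- which is
exactly the case the spec needs to spell out):

#include <stdbool.h>

#define MAX_QUEUED 64

struct touch_seq {
    int queued_xi_events[MAX_QUEUED];   /* stand-ins for the held XI events */
    int nqueued;
    bool gesture_recognized;            /* set from the GE's answer */
    bool client_selected;               /* someone selected the primitive */
};

/* Returns true if a gesture event was delivered and the held XI events
 * discarded, false if the XI events are to be replayed as normal input. */
static bool flush_touch_sequence(struct touch_seq *seq)
{
    if (seq->gesture_recognized && seq->client_selected) {
        /* ... deliver the gesture event to the selecting client(s) ... */
        seq->nqueued = 0;               /* held XI events are dropped */
        return true;
    }

    /* ... otherwise replay seq->queued_xi_events as normal XI events ... */
    return false;
}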

> I don't fully understand your last sentence, but I will try to address
> latency concerns. I think our current gesture recognition code is less
> than 500 lines of code (maybe closer to 300 lines, Henrik wrote it and
> has more details if you are interested). Obviously, you can do a lot in
> a small amount of code to kill latency, but I think Henrik has crafted a
> tight and fast algorithm. I was skeptical about latency at first too,
> but human interfaces being what they are, we should have plenty of cpu
> cycles to do all the gesture primitive recognition we need (please don't
> read this and assume we're pegging the processors either :). So far in
> testing, I haven't seen any noticeable delay, but it's still rather
> early in development.

I don't think the algorithm is what's holding you back anyway, it's the
nature of gestures and human input in general. Even if your GE is
instantaneous in its recognition, you may not know for N milliseconds
whether the given input will even translate into a gesture. Example: middle
mouse button emulation code - you can't solve it without a timeout.
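
The middle button emulation analogy in code form, roughly -- a toy state
machine with milliseconds as plain integers, no real timers:

enum state { IDLE, WAITING, GESTURE, POINTER };

struct recognizer {
    enum state state;
    int first_touch_ms;   /* time of the first touch down */
    int timeout_ms;       /* how long we are willing to hold events back */
};

static void touch_down(struct recognizer *r, int now_ms)
{
    if (r->state == IDLE) {
        r->state = WAITING;        /* hold events, start the clock */
        r->first_touch_ms = now_ms;
    } else if (r->state == WAITING) {
        r->state = GESTURE;        /* second finger in time: likely a gesture */
    }
}

static void tick(struct recognizer *r, int now_ms)
{
    if (r->state == WAITING &&
        now_ms - r->first_touch_ms >= r->timeout_ms)
        r->state = POINTER;        /* timed out: replay as pointer events */
}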

Cheers,
  Peter

