X Gesture Extension protocol - draft proposal v1

Chase Douglas chase.douglas at canonical.com
Fri Aug 27 09:21:40 PDT 2010


On Fri, 2010-08-20 at 13:20 +1000, Peter Hutterer wrote:
> On Wed, Aug 18, 2010 at 10:29:20AM -0400, Chase Douglas wrote:
> > First, I see multitouch and gestures as two separate concepts, even
> > though they're closely linked. Multitouch is defined by the ability to
> > send raw data about each touch wherever they are on the screen, while
> > gestures are a grouping of multitouch touches as a higher order event.
> > There are times when an application wants only multitouch events, and
> > when an application only wants gesture events. For example, google maps
> > may just want pan and zoom gestures, while inkscape only wants
> > multitouch events for drawing.
> 
> I have to disagree very strongly here. IMO, gestures are merely an
> interpretation of input events conveying a specific meaning, much in the
> same manner as a doubleclick conveys a meaning.
> 
> And quite frankly, I'm pretty sure that most multitouch applications will
> use _both_ gestures and multitouch in the future. That's an assumption based
> on my beergut feeling, feel free to prove me wrong in a few years time ;)

I don't think we are in disagreement here. Perhaps you are saying that a
client should receive both gesture events and the underlying multitouch
events at the same time? I tried to capture that in the proposal by
sending all the MT event data as properties of the gesture event, but it
could be represented differently.
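
To illustrate, here's a rough sketch in C of what a gesture event that
also carries its constituent touch data might look like on the client
side. The field names are mine and purely illustrative, not finalized
protocol:

#include <stdint.h>

/* Hypothetical client-side view of a gesture event that also carries
 * the raw multitouch data it was recognized from. All names here are
 * illustrative only; none of this is finalized protocol. */
typedef struct {
    uint32_t touch_id;      /* stable ID for this touch point */
    float    x, y;          /* screen coordinates of the touch */
} GestureTouch;

typedef struct {
    uint16_t      gesture_id;        /* e.g. pinch, pan, rotate */
    uint16_t      device_id;         /* slave device the touches came from */
    float         focus_x, focus_y;  /* gesture focal point */
    uint32_t      num_touches;       /* touches making up the gesture */
    GestureTouch *touches;           /* per-touch data, as gesture "properties" */
} GestureEvent;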

> > The strict dependency here is the event stream to the GE that I didn't
> > define as fully as I should have. The GE receives MT events through XI
> > 2.1 as its raw input stream to perform recognition. It only makes sense
> > to me to send these events using the XI 2.1 protocol instead of defining
> > a new protocol just to send events to the GE.
> 
> Yeah, that makes sense. But why only allow touch events to be interpreted as
> gesture events? Opera I think was the first browser to come out with mouse
> gestures and I used to love it. There is a use-case for other gestures.

We are not trying to enable gestures in general in the X gesture
protocol, only gesture primitives. By that I mean we aren't going to
recognize movements that trace out a shape and then send a gesture for
it. If you want to do something fancy when someone draws a circle on
the screen, do that recognition client-side. We only want to recognize
gesture primitives for which we can send events as they occur. Anything
that holds up event propagation, such as waiting for a full circle to
complete before recognition finishes and an event is sent, is ill-suited
for gesture recognition in X.
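
As an illustration of the kind of shape recognition that belongs on the
client side, here is a minimal sketch in C (all names and thresholds are
made up) that decides whether a finished stroke roughly traced a circle
by checking how far the sampled points deviate from their mean distance
to the centroid:

#include <math.h>
#include <stddef.h>

/* Returns 1 if the sampled points of a completed stroke look roughly
 * circular: every point lies within `tolerance` (as a fraction of the
 * mean radius) of the mean distance from the centroid. This is the
 * sort of after-the-fact recognition that should stay client-side. */
static int stroke_is_circle(const float *xs, const float *ys, size_t n,
                            float tolerance)
{
    if (n < 8)
        return 0;

    float cx = 0.0f, cy = 0.0f;
    for (size_t i = 0; i < n; i++) {
        cx += xs[i];
        cy += ys[i];
    }
    cx /= n;
    cy /= n;

    float mean_r = 0.0f;
    for (size_t i = 0; i < n; i++)
        mean_r += hypotf(xs[i] - cx, ys[i] - cy);
    mean_r /= n;

    if (mean_r <= 0.0f)
        return 0;

    for (size_t i = 0; i < n; i++) {
        float r = hypotf(xs[i] - cx, ys[i] - cy);
        if (fabsf(r - mean_r) > tolerance * mean_r)
            return 0;
    }
    return 1;
}

The point is that this kind of check can only run once the stroke has
finished, which is exactly the kind of latency we want to keep out of
the server.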

> > > * If this supplements the XI2.1 MT proposal, how does gesture recognition work
> > > when there's a grab active on any device?
> > 
> > Good question. I'll admit that I haven't fully worked out this issue. In
> > Maverick we are actually siphoning off events inside the evdev input
> > module, so gestures would supersede input device grabs. However, my goal
> > is to move the gesture event handling inside the server where we could
> > reverse the ordering. If I had to make an educated guess as to what
> > should happen, I would say that an active grab should override gesture
> > support for the grabbed device.
> > 
> > > * What kind of event stream is sent to the gesture engine (GE)? it just says
> > >   "stream of events" but doesn't specify anything more.
> > 
> > XI 2.1 events. I'll be sure to make that more clear in the next
> > revision.
> 
> Make sure to be clear _what_ type of events. XI 2.1 is not finished yet, the
> proposed bits cover the touch parts of it but there are several other ideas
> floating around that may (or more likely may not) go in.
> 
> Also, you realise that the draft I sent out is just that - a draft? You're
> building on top a potentially moving target here (just a warning).

Oh yes! Don't worry. My number one priority for the next Ubuntu
development cycle is to do whatever I can to help implement multitouch
support through X. Without that support, we are left with what we have
in Maverick, which is very hackish. If XI 2.1 isn't ready by the time
Natty ships, we'll likely keep what we've done in Maverick. However, our
goal is to have multitouch through X and gestures built on top of it.
We'll see how that works out :).

> I think my point may have been ambiguous. What I wanted to say is: I'm for X
> not having _anything_ at all to do with the GE. X forwards the events to the
> client, the client then may or may not pass it to the GE (e.g. over dbus). X
> just sends the raw events, contextual interpretation of these events is done
> purely client-side.

I think we should continue this discussion in the thread for the email I
just sent out about client-side vs. server-side recognition.

> > > I'll be upfront again and re-state what I've said in the past - gestures do
> > > IMO not belong into the X server, so consider my view biased in this regard.
> > > The X protocol is not the ideal vehicle for gesture recognition and
> > > inter-client communication, especially given its total lack of knowledge of
> > > user and/or session.
> > 
> > Believe me, we tried our hardest not to throw all this into the X
> > server :). It's not that gesture recognition needs to be inside the
> > server. The issue is correct event propagation. Gestures occur in
> > specific regions of the screen, and as such they must be propagated with
> > full knowledge of the X window environment.
> > 
> > Say you have one application with a parent window and a child window.
> > Both windows select for MT input events. The application wants to
> > receive gesture events on the parent window. I then make a gesture with
> > some fingers in the parent window and some in the child window. Without
> > the gesture recognition and propagation inside the X server, some of the
> > input events would be sent to the parent window and some to the child
> > window. It becomes very difficult to assimilate all the data properly
> > for gestures if the raw inputs are spread out among various child
> > windows.
> 
> OTOH, something that may look like a gesture when viewed from the parent
> window may indeed be independent interaction in two different windows.
> Humans are (un)surprisingly adept at using two hands independently and
> ruling out this use-case by assuming gestures by default is inhibiting.
> 
> The much better approach here is to teach users not to do wrong gestures.
> I recommend a read of "Ripples: Utilizing Per-Contact Visualizations to
> Improve User Interaction with Touch Displays" by Wigdor et al.

You might be right here, and I think we'll just have to try things out
to figure out what works best.

> > If your question is geared more towards why we call these primitives
> > instead of just gestures, it comes from an idea we have that primitives
> > may be strung together at a high level (maybe a toolkit?) to have some
> > predefined meaning. For example, a DJ application may define a gesture
> > sequence as two finger down, release one finger, drag one finger, then
> > tap the second finger again. This gesture may be defined to have a
> > specific meaning when it occurs over a mixer level control. This is very
> > much a new idea though, it's not been implemented and tested yet.
> 
> IMO, this is _way_ too complicated to be regarded as a gesture by the
> engine. And this is my main grief here, by having a single engine you
> require that engine to be a catchall one. Having a completely client-side
> engine allows you to do crazy gestures in one app but have other apps use a
> simpler engine.

I should have been clearer: what I noted above is not supposed to be
implemented in the gesture engine. It's supposed to be implemented in a
client-side layer that aggregates gesture primitive events into a stream
that makes up an even higher-order intention. This higher-order
intention is very application-specific, so it should only be done
client-side.
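
To make that concrete, here is a minimal client-side sketch in C
(hypothetical primitive names, nothing from the actual proposal) of how
such a layer might string primitives together into the DJ-app sequence
described above: two fingers down, release one, drag the remaining
finger, then tap the second finger again.

/* Hypothetical primitive events delivered by the lower layer. */
enum primitive {
    PRIM_TWO_FINGER_DOWN,
    PRIM_ONE_FINGER_RELEASE,
    PRIM_ONE_FINGER_DRAG,
    PRIM_TAP,
    PRIM_OTHER
};

enum dj_state {
    DJ_IDLE,
    DJ_TWO_DOWN,
    DJ_ONE_DOWN,
    DJ_DRAGGING,
    DJ_MATCHED
};

/* Feed each primitive into the state machine; DJ_MATCHED means the
 * full application-specific sequence has been seen. Any unexpected
 * primitive resets the recognizer. */
static enum dj_state dj_feed(enum dj_state s, enum primitive p)
{
    switch (s) {
    case DJ_IDLE:
        return p == PRIM_TWO_FINGER_DOWN ? DJ_TWO_DOWN : DJ_IDLE;
    case DJ_TWO_DOWN:
        return p == PRIM_ONE_FINGER_RELEASE ? DJ_ONE_DOWN : DJ_IDLE;
    case DJ_ONE_DOWN:
        return p == PRIM_ONE_FINGER_DRAG ? DJ_DRAGGING : DJ_IDLE;
    case DJ_DRAGGING:
        if (p == PRIM_ONE_FINGER_DRAG)
            return DJ_DRAGGING;      /* stay in the drag */
        return p == PRIM_TAP ? DJ_MATCHED : DJ_IDLE;
    default:
        return DJ_IDLE;
    }
}

Whether the matched sequence ends up controlling a mixer level is then
entirely up to the application.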

> Use-case of Firefox again: How does FF deal with gestures on other OS? does
> it take what the OS provides or does it have its own engine? If the latter,
> it's likely that FF does not want to rely on any other engine to keep the
> behaviour consistent across platforms.
> (I might be wrong here, please let me know if this is the case)
> 
> Substitute FF with any app, just to be sure.

It is my understanding that FF and Qt rely on system-provided gestures.
In the case of Qt, if the system does not provide gestures, it has an
internal gesture recognizer that may be used instead. That would be
client-side recognition, of course. I don't know the propagation and
selection algorithms of Windows or OS X.

> > Of course, if FF only listened to XI 2.1 events instead of using X
> > Gesture, then this simplifies greatly.
> 
> This is a problem then. As soon as something simplifies something greatly,
> you've just found what most developers will want to use, be it out of
> convenience, laziness, or any other reason.

Yes, which is why we are trying to make our solution as simple as we
can. In most cases I think we can succeed in this regard. However, I
won't argue that using the X Gesture extension is the best solution in
every case.

> > > > 3. Data types
> > > > 
> > > > DEVICE { DEVICEID, AllDevices }
> > > >         A DEVICE specifies either an X Input DEVICEID or AllDevices.
> > > 
> > > AllMasterDevices is missing here, and in the other mentions below.
> > 
> > This is intentional. Gestures are tied to absolute input devices, and
> > properties are given in screen coordinates. Thus, you should only be
> > listening for input on individual devices themselves, not the aggregates
> > that are master devices.
> > 
> > I'm not 100% convinced of this approach though, it just feels right to
> > me. I'd be happy to add in master devices if it makes sense.
> 
> I think it's best to add them here. While aggregate master devices may not
> make a lot of sense for direct touch devices (well, maybe in clone mode,
> but...) I don't see why we shouldn't just pass them through.

I can't think of a specific reason to leave them out, so we can add them
in.
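
If it helps, here's how that could look if the DEVICE field follows the
XI2 convention. XIAllDevices and XIAllMasterDevices are the existing
XI2 constants; everything else below is illustrative and not from the
proposal:

#include <X11/extensions/XI2.h>  /* XIAllDevices, XIAllMasterDevices */
#include <stdint.h>

/* Hypothetical gesture event selection, modelled on XISelectEvents.
 * The deviceid field would accept a specific device, XIAllDevices,
 * or (per the discussion above) XIAllMasterDevices. */
typedef struct {
    uint16_t       deviceid;   /* DEVICEID, XIAllDevices, or XIAllMasterDevices */
    uint16_t       mask_len;   /* length of mask in bytes */
    unsigned char *mask;       /* gesture event mask, one bit per gesture ID */
} GestureEventMask;

static void select_on_all_masters(GestureEventMask *m)
{
    m->deviceid = XIAllMasterDevices;  /* aggregate master devices */
}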

> > Currently we  do all the event processing in the same thread as the rest
> > of X, and I think there's minimal latency such that it doesn't impact
> > performance. The gesture recognition code is only a few hundred lines.
> > 
> > Since we haven't split it out into a separate client yet, we haven't had
> > to deal with acceptable GE latency. I think there's some research that
> > says any UI latency above 100 ms can become an issue, so perhaps a
> > timeout value based on that would be a good starting point?
> 
> as said in the other email, the GE implementation is the least of your
> worries. It's the nature of gestures and how quickly they can be identified.

I think this comes back to a misunderstanding of the types of gestures
we are aiming to recognize. The gesture events we are trying to
recognize use a very small recognition window (in time or space). For
example, a pinch is recognized once the two fingers have moved just a
few pixels towards each other within a small amount of time. If it takes
too long for a pinch to materialize, it's not regarded as one and the
events are passed on as-is. This way latency stays minimal.
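
As a rough illustration in C (thresholds and names are made up, not
from the actual recognizer), the check can be as small as comparing the
distance between the two touches at the start and end of a short
window:

#include <math.h>

/* Decide whether two touches that appeared within a short window are
 * pinching: the distance between them only needs to shrink by a few
 * pixels within the window for a pinch to begin; otherwise the queued
 * raw events are replayed unchanged. Thresholds are illustrative. */
static int is_pinch(float x0a, float y0a, float x1a, float y1a,  /* at start */
                    float x0b, float y0b, float x1b, float y1b,  /* now */
                    unsigned elapsed_ms)
{
    const float    min_delta_px = 4.0f;  /* a few pixels of convergence */
    const unsigned window_ms    = 80;    /* small recognition window */

    if (elapsed_ms > window_ms)
        return 0;  /* took too long: pass the raw events on as-is */

    float d_start = hypotf(x1a - x0a, y1a - y0a);
    float d_now   = hypotf(x1b - x0b, y1b - y0b);

    return (d_start - d_now) >= min_delta_px;  /* fingers moved together */
}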
 
> > > As I read this at the moment, this approach means that _any_ touch event is
> > > delayed until the server gets the ok from the GE. The passive grab approach
> > > for gesture recognition also means that any event is delayed if there is at
> > > least one client that wants gestures on the root window. What's the impact
> > > on the UI here?
> > 
> > I don't think I understand :). The gesture engine does its recognition
> > and then hands gesture events off to the server or tells the server to
> > allow XI 2.1 events it's queued up. At this point, it's just a matter of
> > propagation and selection. Maybe your argument is that you have to check
> > the full lineage from the child window to the root window to find if
> > anyone is listening for the gesture event, but that shouldn't take very
> > long.
> 
> my argument here is that you get an input event, send it to the GE, then
> wait for the GE to return with some value before you either send the raw
> events or the gesture event. Given a timeout of e.g. 50 ms, how much does
> this accumulate to before the actual event will arrive at the client?
> 
> I don't know how the GE works, but if you send multiple subsequent events to
> the GE, does the timeout accumulate or reset on each event?
> e.g. finger 1 sets timeout to 50ms, but 40ms into it another finger
> arrives. you now need to wait another 50ms before you can give the go. For a
> 4 finger gesture, you're up to 200ms already before the GE can give the go
> or no-go for the gesture. By the time the event actually arrives, a delay is
> surely noticeable.

I think this is something that needs to be figured out during
implementation. I don't want to dwell on it here because I don't think
it's that hard a problem; it's just not worth worrying about while there
are higher-order architectural issues to hammer out :).
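
One way to keep the worst case bounded, offered only as a sketch of a
possible policy (nothing like this is implemented): anchor a single
recognition deadline to the first touch of a potential gesture rather
than resetting it for every additional finger.

#include <stdint.h>

/* Possible timeout policy, purely illustrative: the recognition
 * deadline is fixed when the first touch arrives and is NOT extended
 * by later touches, so a 4-finger gesture still resolves within one
 * window instead of accumulating 4 x 50 ms. */
typedef struct {
    uint64_t deadline_ms;  /* absolute time to give the go/no-go */
    int      active;
} recog_window;

static void touch_arrived(recog_window *w, uint64_t now_ms)
{
    const uint64_t window_ms = 50;

    if (!w->active) {
        w->active = 1;
        w->deadline_ms = now_ms + window_ms;  /* set once, on first touch */
    }
    /* subsequent touches do not move the deadline */
}

static int must_decide(const recog_window *w, uint64_t now_ms)
{
    return w->active && now_ms >= w->deadline_ms;
}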

> > When a GE registers with the server, the type is assigned to an ID,
> > which is also used as the bitmask position for selecting events. I think
> > I can make that clearer in a second revision.
> 
> Or you could just supply a list of atoms in the SelectEvents request and
> convert this to bitmasks internally. I don't really see the benefit the
> bitmasks provide to the client and we're not strapped for bandwidth here
> either.

It could be easier that way. I tried to follow the XInput spec as
closely as reasonable, since it has worked well enough. However, it may
make sense to handle gestures differently.
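
For comparison, the two shapes of the selection request could look
something like this in client code. XISetMask and XInternAtom are the
real XI2/Xlib calls; the gesture IDs and atom names are hypothetical:

#include <X11/Xlib.h>
#include <X11/extensions/XI2.h>   /* XISetMask */
#include <string.h>

/* (a) XI2-style bitmask: the client sets one bit per gesture ID it
 *     obtained when the gesture types were registered. */
static void select_by_mask(unsigned char *mask, int mask_len,
                           int pinch_id, int pan_id)
{
    memset(mask, 0, mask_len);
    XISetMask(mask, pinch_id);   /* real XI2 macro: mask[id >> 3] |= 1 << (id & 7) */
    XISetMask(mask, pan_id);
}

/* (b) Atom list: the client just names the gesture types it wants and
 *     the server converts the list to its internal bitmask. Atom names
 *     here are made up. */
static void select_by_atoms(Display *dpy, Atom *out, int *n_out)
{
    out[0] = XInternAtom(dpy, "Gesture Pinch", False);
    out[1] = XInternAtom(dpy, "Gesture Pan", False);
    *n_out = 2;
}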

> > > >     init_mask
> > > >         Gesture mask for initiation. A gesture mask for an event type T is
> > > >         defined as (1 << T).
> > > 
> > > Don't do this (1 << Y) thing, this wasn't one of the smarter decisions in
> > > XI2. Simply define the masks as they are, don't bind them to event types.
> > > Though it hasn't become a problem yet, I already ran into a few proposals
> > > where this would either be too inflexible or would create holes in the mask
> > > sets (latter not really a problem, but...).
> > 
> > This was lazy copy and paste from me :). The proposal should read:
> > 
> >     init_mask
> >         Gesture mask for initiation. A gesture mask for an event ID I is
> >         defined as (1 << I).
> > 
> > Does the distinction between types and IDs resolve your issue, or are
> > you referring to some other issue?
> 
> I think the specification as (1 << I) may cause issues long term.

Ok.
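
For what it's worth, the distinction could look like this (illustrative
C, no names from the proposal):

/* Mask bound to the event ID, as in the current draft: the mask layout
 * is forever tied to how IDs are assigned. */
#define GESTURE_MASK(id)   (1 << (id))

/* Decoupled masks, as suggested: each mask value is defined on its
 * own, so bits can be reserved or reordered without touching the IDs. */
#define GesturePinchMask   (1 << 0)
#define GesturePanMask     (1 << 1)
#define GestureRotateMask  (1 << 2)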
 
> > > >             gesture_id:                 CARD16
> > > >             gesture_instance:           CARD16
> > > >             device_id:                  CARD16
> > > >             root_x:                     Float
> > > >             root_y:                     Float
> > > >             event_x:                    Float
> > > >             event_y:                    Float
> > > 
> > > probably better to use the same type as in the XI2 spec.
> > 
> > I want to provide a protocol using XCB, and I couldn't figure out an
> > easy way to do so with an FP1616 type. If there's a way, then that would
> > be fine with me. If not, which would be easier? Fixing up XCB to provide
> > a way or just using IEEE 754 floats instead?
> > 
> > Admittedly, I didn't spend a large amount of time looking for an FP1616
> > solution in XCB since I don't understand the appeal of FP1616 :).
> 
> it was added to avoid a required format for floats on the protocol and
> as alternative for devices without a useful FPU. Mixing datatypes for
> extensions so similar (or the same if this is just folded into XI2) is IMO a
> bad idea, it gives us very little benefit.

Ok.
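
For reference, converting between XI2's FP1616 fixed-point format and
native floats is only a couple of lines, so using FP1616 here shouldn't
cost clients much (the helper names are mine):

#include <stdint.h>

/* XI2's FP1616: a signed 32-bit value with 16 integer bits and 16
 * fractional bits. */
typedef int32_t FP1616;

static double fp1616_to_double(FP1616 v)
{
    return (double)v / 65536.0;          /* 65536 == 1 << 16 */
}

static FP1616 double_to_fp1616(double d)
{
    return (FP1616)(d * 65536.0);        /* truncates toward zero */
}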

Thanks,

-- Chase


