[cairo] PDF Text Extraction: Future

Thu Mar 6 06:34:35 PST 2008

Behdad Esfahbod wrote:
> PDF Implementation:
>
> Before I started research that led to this thread, I wrote some
> stuff about this, which I now see does not work.  Specifically,
> ActualText is not supported in poppler (and possibly other
> extractors) at all, so that cannot be part of a portable
> solution.

ActualText is supported in poppler since version 0.7. It also works fine
in acroread.

If a viewer does not support ActualText, file a bug report. Not all PDF
viewers support all of the PDF functionality that cairo supports.
However, this is not a reason to target the lowest common denominator.
Support for additional PDF features has been added to poppler as a
result of cairo requiring these features.

Your proposal adds an unacceptable level of complexity to the font
subsetting code simply to avoid using ActualText.

The ActualText entry is already used by other PDF generators. For
example, I have seen it used to add "\r\n" to the end of each line of
text in a table to preserve the table formatting when copied and pasted
from the PDF.
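
As a rough sketch of what such a span looks like in a content stream,
the following wraps text-showing operators in a /Span marked-content
sequence carrying ActualText. The helper names are made up for
illustration and are not cairo API, and the literal-string form assumes
the replacement text is plain ASCII.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not cairo API: write text as a PDF literal string
 * "(...)", escaping CR, LF, parentheses and backslash. Returns the number
 * of bytes written (excluding the NUL), or -1 if out is too small. */
static int
pdf_literal_string (char *out, size_t size, const char *text)
{
    size_t n = 0;
    const char *p;

    assert (out != NULL && text != NULL);
    if (size < 3)
        return -1;
    out[n++] = '(';
    for (p = text; *p; p++) {
        char esc = 0;

        if (*p == '\r')
            esc = 'r';
        else if (*p == '\n')
            esc = 'n';
        else if (strchr ("()\\", *p))
            esc = *p;
        /* Keep room for the closing ')' and the terminating NUL. */
        if (n + (esc ? 2 : 1) + 2 > size)
            return -1;
        if (esc) {
            out[n++] = '\\';
            out[n++] = esc;
        } else {
            out[n++] = *p;
        }
    }
    out[n++] = ')';
    out[n] = '\0';
    return (int) n;
}

/* Hypothetical helper: wrap show_ops (emitted unchanged) in a
 * marked-content span whose ActualText is actual_text. */
static int
write_actual_text_span (char *out, size_t size,
                        const char *actual_text, const char *show_ops)
{
    char literal[256];
    int n;

    if (pdf_literal_string (literal, sizeof literal, actual_text) < 0)
        return -1;
    n = snprintf (out, size, "/Span << /ActualText %s >> BDC\n%s\nEMC\n",
                  literal, show_ops);
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

Calling write_actual_text_span() with the replacement text "Total\r\n"
and the operators "(Total) Tj" gives a span that still draws "Total" but
asks the extractor to append a line ending. Non-ASCII replacement text
would need a UTF-16BE string instead of this literal form.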

Also, ActualText support is required for PDF/A. PDF/A support is
something I would like to add to cairo.

With the proposed show_text_glyphs API, how do we handle the case of an
application wanting to change the font inside a cluster?

> Discussion:
>
>   - It's crucial for the above algorithm to work that a ToUnicode
>     entry mapping a glyph to an empty string works.  That is, a
>     glyph that maps to zero Unicode characters.  Nowhere in the
>     CMap spec I found whether this is allowed or not.  I don't
>     remember if Acrobat handles it correctly, but poppler has
>     some bugs: when you select such a character, it's not
>     rendered correctly and it outputs a U+FFFD in the extracted
>     text instead of nothing.

Mapping to an empty string is not permitted. It does not work in
acroread. For example, copying and pasting a line of text where only one
glyph maps to an empty string results in no characters being pasted for
the entire line.
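
To make the constraint concrete, here is a minimal sketch of formatting
one ToUnicode bfchar entry that refuses a glyph-to-empty-string mapping.
format_bfchar() is a hypothetical name, not cairo API, and the sketch
assumes 16-bit glyph codes and text already available as UTF-16 code
units.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not cairo API: format one bfchar line of the form
 * "<glyph> <utf16 string>". Returns the length written, or -1 for an
 * empty mapping or a buffer that is too small. */
static int
format_bfchar (char *out, size_t size, uint16_t glyph,
               const uint16_t *utf16, size_t len)
{
    size_t n, i;
    int r;

    assert (out != NULL);
    /* Refuse glyph -> empty string, which does not work in acroread. */
    if (len == 0)
        return -1;

    r = snprintf (out, size, "<%04X> <", (unsigned) glyph);
    if (r < 0 || (size_t) r >= size)
        return -1;
    n = (size_t) r;

    /* Each UTF-16 code unit becomes four hex digits. */
    for (i = 0; i < len; i++) {
        r = snprintf (out + n, size - n, "%04X", (unsigned) utf16[i]);
        if (r < 0 || (size_t) r >= size - n)
            return -1;
        n += (size_t) r;
    }

    if (size - n < 3)
        return -1;
    memcpy (out + n, ">\n", 3);    /* copies the terminating NUL too */
    return (int) (n + 2);
}
```

For an "fi" ligature at glyph code 3, passing the code units 0x0066 and
0x0069 yields the line "<0003> <00660069>", which belongs between
beginbfchar and endbfchar in the CMap.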

>     A nasty hack to relax this is to output an "ignorable"
>     Unicode character instead.  Something like U+2060 WORD
>     JOINER, or if that affects Indic shaping, a nastier one like
>     U+2063 INVISIBLE SEPARATOR, or something equally useless and
>     ignorable (general category Cf).

Yuck.

>     Another way around it if many viewers don't support it is to
>     add new composite glyphs to the font that combine multiple
>     glyphs into one, so we don't have to use 0->N clusters.  Some
>     font formats may not support composite glyphs however.

There are many problems with joining glyphs:

- It may exceed the maximum glyph size. While this may be unlikely we
have to support any possible input to the show_text_glyphs API.

- Merging TrueType glyphs with hinting is not trivial.

- It increases the size of the embedded font. This is particularly a
problem when printing, as all of the embedded fonts have to fit into
printer memory.

- Font format specific code would have to be written for each type of
font that cairo can embed in PDFs (TrueType, CFF, Type 1, Type 1
fallback, CFF Fallback, and Type 3).

- This does not work well with non-subsetted fonts.

One of the things I want to do with the cairo PDF backend is to make the
PDF files useful as a vector format for importing into other
applications for editing. For example, Inkscape's use of poppler for PDF
import and cairo for PDF export allows PDFs to be round-trip edited in
Inkscape. GNUpdf will also allow PDF files to be edited. There have
previously been requests to make cairo PDFs able to be imported into
other applications for editing text.

This means we need to do a much better job of preserving the original
content than we would need just for viewing or printing. At a minimum,
API support for selectively enabling full font embedding instead of
subsetting is required.

If we start joining glyphs, then every time the PDF is loaded, edited,
and saved, the number of joined glyphs will multiply. There would also
be the problem of exceeding the maximum number of glyphs in a font.
This is particularly a problem for Type 1 fonts (or if using the
standard encoding suggested below).

> Other Issues:
>
> Some other random thoughts about PDF text output that are not
> necessarily related to the text extraction problem, but passed
> through my mind during these experiments:
>
>   - Shall we use standard encodings if all the used glyphs in a
>     subset are in a well-supported standard encoding?  May be
>     worth the slight optimization.  Also may make generated
>     PS/PDF more readable for the case of simple ASCII text.

This is something I have been thinking of doing as I have seen other PDF
files do this.
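
A minimal sketch of the check this would need, assuming each glyph in
the subset maps to a single Unicode code point and taking printable
ASCII (0x20 to 0x7E) as the "well-supported" range. The function name is
hypothetical and not part of cairo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not cairo API: returns 1 if every code point used
 * by the subset is printable ASCII, so a standard encoding could be used
 * instead of a custom one; returns 0 otherwise. */
static int
subset_fits_printable_ascii (const uint32_t *codepoints, size_t count)
{
    size_t i;

    assert (codepoints != NULL || count == 0);
    for (i = 0; i < count; i++) {
        if (codepoints[i] < 0x20 || codepoints[i] > 0x7E)
            return 0;
    }
    return 1;
}
```

A subset that fails the check would simply keep the custom encoding, so
this only costs one pass over the subset.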

>   - It also occurred to me that in PDF almost all objects can come
>     after they are referenced.  Does this mean we can write out
>     pages as we go and avoid writing to a temp file as we
>     currently do?

The PDF backend has never used a temp file.
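
For reference, forward references are cheap in PDF because objects are
located through the cross-reference table written at the end of the
file. The following is a minimal sketch of that idea only, not a
description of how cairo's PDF backend is structured. All names are
hypothetical, and it writes to a fixed-size memory buffer to stay
self-contained.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define MAX_OBJECTS 8

typedef struct {
    char   data[2048];
    size_t len;
    long   offset[MAX_OBJECTS + 1];   /* byte offset of object N */
} pdf_buffer_t;

static void
pdf_printf (pdf_buffer_t *pdf, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start (ap, fmt);
    n = vsnprintf (pdf->data + pdf->len, sizeof pdf->data - pdf->len,
                   fmt, ap);
    va_end (ap);
    assert (n >= 0 && (size_t) n < sizeof pdf->data - pdf->len);
    pdf->len += (size_t) n;
}

/* Start object num and remember where it begins for the xref table. */
static void
pdf_begin_object (pdf_buffer_t *pdf, int num)
{
    assert (num >= 1 && num <= MAX_OBJECTS);
    pdf->offset[num] = (long) pdf->len;
    pdf_printf (pdf, "%d 0 obj\n", num);
}

static void
write_example (pdf_buffer_t *pdf)
{
    const char *content = "10 10 100 100 re f";
    long xref_pos;
    int i;

    memset (pdf, 0, sizeof *pdf);
    pdf_printf (pdf, "%%PDF-1.4\n");

    /* Page 3 refers to objects 2 and 4 before either has been written. */
    pdf_begin_object (pdf, 3);
    pdf_printf (pdf, "<< /Type /Page /Parent 2 0 R /Resources << >>\n"
                     "   /MediaBox [0 0 612 792] /Contents 4 0 R >>\n"
                     "endobj\n");
    pdf_begin_object (pdf, 4);
    pdf_printf (pdf, "<< /Length %d >>\nstream\n%s\nendstream\nendobj\n",
                (int) strlen (content), content);

    /* The page tree and catalog come last, once the pages are known. */
    pdf_begin_object (pdf, 2);
    pdf_printf (pdf, "<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n");
    pdf_begin_object (pdf, 1);
    pdf_printf (pdf, "<< /Type /Catalog /Pages 2 0 R >>\nendobj\n");

    /* The xref table resolves every reference by recorded offset. */
    xref_pos = (long) pdf->len;
    pdf_printf (pdf, "xref\n0 5\n0000000000 65535 f \n");
    for (i = 1; i <= 4; i++)
        pdf_printf (pdf, "%010ld 00000 n \n", pdf->offset[i]);
    pdf_printf (pdf, "trailer\n<< /Size 5 /Root 1 0 R >>\n"
                     "startxref\n%ld\n%%%%EOF\n", xref_pos);
}
```

Only the offsets need to be kept in memory until the end, so page
content can be streamed out as soon as it is generated.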

>   - TaggedPDF allows for a lot more, but it's very hard to, for
>     example, mark all paragraphs and pages.  There's also
>     a ReversedText tag in there.  In most cases though, seems
>     like we can do the same without it as far as text extraction
>     is concerned.  Some cairo API may be added to allow TaggedPDF
>     marking from higher level.  Something like:
>
> 	cairo_pdf_marked_content_sequence_t
> 	cairo_pdf_surface_begin/end_marked_content()

Tagged PDF is something I plan on supporting, most likely using the API
discussed below. This will provide a much more interesting text
extraction capability. I am very interested in supporting text reflow in
poppler. Reflow is supported by the Windows version of Adobe Reader but
is not available in the Linux version, and it requires tagged PDF for
best results.

Details of the reflow feature can be seen at (search for "reflow"):

http://www.friendsofed.com/web-accessibility/chapter12.html

>   - Other possibly useful PDF APIs will be for embedding
>     arbitrary objects, for marking ActualText around arbitrary
>     drawing operations, and to get object number for any embedded
>     fonts, patterns, etc.  We are still waiting for someone more
>     familiar with PDF and higher-level use-cases to tell us what
>     exactly they need from cairo...

The alternate description entry "Alt" is likely the more useful entry
for marking drawing objects. It is described in the PDF Reference as
being for alternate descriptions of images for use with text-to-speech
readers.

The replacement text entry "ActualText" is described in the PDF
Reference as being for content that translates to text but is
represented in a non-standard way, such as glyphs for ligatures or
custom characters.

I am working on an API for inserting objects into the generated PDF file
and inserting text into the content stream. This would allow most of the
non-graphics functionality, i.e. the PDF Reference chapters on
"Interactive Features", "Multimedia", and "Document Interchange", to be
supported via a minimal set of API functions.

This would give applications the flexibility to either call this cairo
API directly for fine-grained control over these features or to use some
helper library that provides a higher level of abstraction without
requiring the application to understand the PDF format.