Paragraph flowing as a fold

A simple, useful, plain-text variant of Donald Knuth’s paragraph-filling algorithm can be expressed as a fold. This algorithm runs in time linear to the length of its input and usually produces better results than the classic greedy text-flow algorithm.

We first specify the function flow-paragraph and partially define its helpers:

(: flow-paragraph (→ (list Word) (list (list Word))))
{ flow-paragraph = build-lines ∘ flow }

(: flow (→ (list Word) (list Line-Info)))
(define (flow words)
  (match words
    (() ())
    ((,w . ,ws)
     (let ((ls (flow ws)))
       (cons (best-break w ls) ls)))))

(: build-lines (→ (list Line-Info) (list (list Word))))

where the types are given by

(define-type Word String)  ; but could be more complex

(define-type Line-Info (list Word Offset Cost))

For simplicity, we represent a Word by its text as a String. The type could include more data, e.g. a string containing any following whitespace, or syntactic context information (does the word end with punctuation? Does it end a sentence? etc.). We will ignore syntax and assume that each word is followed by a single space.

We will spend most of our time on flow, which expresses the algorithm by taking a list of words to a list of values which indicate the “best” way to break a line beginning with a given word. To see how this data is encoded, consider the line

"It was the best of times, it was the worst of times,".

This is represented by the Word list

("It" "was" "the" "best" "of" "times," "it" "was" "the"
 "worst" "of" "times,")

If we were trying to break this into lines of about 25 characters, the optimal first line might be "It was the best of times,", which would be encoded by the triple ("It" 5 c).

The interpretation: This is a line consisting of the word "It" followed by 5 words taken from the paragraph-suffix ("was the worst of times,"). c is a positive integer indicating the relative quality (“badness”, in TeX terminology) of this line break—the lower the cost, the better the choice.

The Flow algorithm flows a paragraph (w . ws) by finding the best line beginning with w that also results in the best flow for the suffix ws. This is crucially different from the greedy algorithm, which makes only local choices; it may find a very nice break for w’s line at the cost of choosing awful breaks for the rest of the paragraph.

Those familiar with the list folds will immediately recognize a right fold in the equations for flow:

(define (flow ws)
  (fold-right (λ (w ls)
                (cons (best-break w ls) ls))
              '()
              ws))

We need a definition for best-break, which returns a Line-Info structure describing the best way to break the line beginning with word w. If we can enumerate all legal ways of breaking after w, then it should suffice to choose the minimum-cost solution [TODO: more here about the retention of break data for the entire suffix]:

;; (: best-break (→ Word (list Line-Info) (list Line-Info)))
(define (best-break w ls)
  (minimum-cost
   (enumerate-breaks (cons (word->line-info w) ls))))

;; (: minimum-cost (→ (list Line-Info) Line-Info))
(define (minimum-cost ls)
  (minimum-by (λ (l1 l2)
                (< (cost l1) (cost l2)))
              ls)

;; (: word->line-info (→ Word Line-Info))
(define (word->line-info w) (list w 1 0))

word->line-info constructs a singleton line, i.e. one consisting of a single word. Adding this to the paragraph suffix avoids some inelegance during length computation. We do not use the cost associated with this sentinel value.

;; (: enumerate-breaks (→ (list Line-Info) (list Line-Info)))
(define (enumerate-breaks suffix)
  (assert (pair? suffix))
  (line-candidates 0 0 suffix))

;; (: line-candidates (→ ℕ Offset (list Line-Info) (list Line-Info)))
(define (line-candidates len run-len suffix)
  (match suffix
    ((,_)             ; last word of paragraph
     `((,run-len 0)))
    (((,w ,_ ,cost) . ,suffix*)
     (if (>= len longest-allowed-line)
         '()          ; halt
         (cons (list run-len (+ cost (length-cost len)))
               (let ((len* (+ len space-length (word-length w))))
                 (line-candidates len* (+ run-len 1) suffix*)))))))

The helper function line-candidates is the most complex part of the entire algorithm. It traverses a prefix of suffix, extending the previous break by one word at each step, until it hits the maximum line length. The list of candidate breaks is returned.

A few explanations are in order. The last line of a paragraph is always assigned a cost of 0, since it’s customary to allows these lines to be very short. This is handled by the first, singleton case of line-candidates. space-length—a constant equal to 1 in this version of the algorithm—is added to the length of each candidate; this is the number of spaces following the current word. (These are deliberate simplifications. See the Enhancements section for discussion.)

length-cost computes the cost of a line as a function of the square of the difference between its length and the goal length:

;; (: length-cost (→ ℕ Cost))
(define (length-cost len)
  (short-cost (- goal-length len)))

;; (: short-cost (→ Integer Cost))
(define (short-cost z)
  (expt (* z 10) 2))

(In GNU fmt, the goal length is 7% less than the maximum allowed width.)

Much more complex heuristics could be used to compute the cost of a line, but the version of length-cost above is simple and pretty good in practice.

An elegant fact about the line-candidates fold is that the number of words it must examine to find the optimal way of breaking a given line is bounded by longest-allowed-line / 2. To see this, consider a maximally “fine” (as opposed to “coarse”) paragraph made up of one-character words, e.g.

A B C D E F G …

line-candidates will produce many candidates for a suffix of such a paragraph, because the cumulative length will approach longest-allowed-line slowly. How slowly? At a rate of 1 + space-length per word. For normal prose space-constant ≥ 1, so the line-candidates process will terminate after producing at most longest-allowed-line / 2 candidates.

line-candidates thus runs in O(1) time. And since the rest of the algorithm simply processes the candidates produced by this function, it seems likely that flow-paragraph runs in time linear in the length of its input.

Enhancements

TODO

© 2024 Wolfgang Corcoran-Mathe. Released under the terms of the Creative Commons Attribution 4.0 International license.

SIGWINCH Home