A simple, useful, plain-text variant of Donald Knuth’s paragraph-filling algorithm can be expressed as a fold. This algorithm runs in time linear to the length of its input and usually produces better results than the classic greedy text-flow algorithm.
We first specify the function flow-paragraph
and
partially define its helpers:
(: flow-paragraph (→ (list Word) (list (list Word))))
{ flow-paragraph = build-lines ∘ flow }
(: flow (→ (list Word) (list Line-Info)))
(define (flow words)
(match words
(() ())
((,w . ,ws)
(let ((ls (flow ws)))
(cons (best-break w ls) ls)))))
(: build-lines (→ (list Line-Info) (list (list Word))))
where the types are given by
(define-type Word String) ; but could be more complex
(define-type Line-Info (list Word Offset Cost))
For simplicity, we represent a Word by its text as a String. The type could include more data, e.g. a string containing any following whitespace, or syntactic context information (does the word end with punctuation? Does it end a sentence? etc.). We will ignore syntax and assume that each word is followed by a single space.
We will spend most of our time on flow
, which expresses
the algorithm by taking a list of words to a list of values which indicate
the “best” way to break a line beginning with a given word. To see how
this data is encoded, consider the line
"It was the best of times, it was the worst of times,".
This is represented by the Word list
("It" "was" "the" "best" "of" "times," "it" "was" "the"
"worst" "of" "times,")
If we were trying to break this into lines of about 25 characters, the
optimal first line might be "It was the best of times,",
which would be encoded by the triple ("It" 5 c).
The interpretation: This is a line consisting of the word
"It"
followed by 5 words taken from the
paragraph-suffix ("was the worst of times,"
).
c is a positive integer indicating the relative quality
(“badness”, in TeX terminology) of this line break—the lower the
cost, the better the choice.
The Flow algorithm flows a paragraph
(
w .
ws)
by finding the best line beginning with w that also
results in the best flow for the suffix ws. This is
crucially different from the greedy algorithm, which makes only local
choices; it may find a very nice break for w’s line at the
cost of choosing awful breaks for the rest of the paragraph.
Those familiar with the list folds will immediately recognize a
right fold in the equations for flow
:
(define (flow ws)
(fold-right (λ (w ls)
(cons (best-break w ls) ls))
'()
ws))
We need a definition for best-break
, which returns a
Line-Info
structure describing the best way to break the line beginning with
word w. If we can enumerate all legal ways of breaking
after w, then it should suffice to choose the minimum-cost
solution [TODO: more here about the retention of break data for the
entire suffix]:
;; (: best-break (→ Word (list Line-Info) (list Line-Info)))
(define (best-break w ls)
(minimum-cost
(enumerate-breaks (cons (word->line-info w) ls))))
;; (: minimum-cost (→ (list Line-Info) Line-Info))
(define (minimum-cost ls)
(minimum-by (λ (l1 l2)
(< (cost l1) (cost l2)))
ls)
;; (: word->line-info (→ Word Line-Info))
(define (word->line-info w) (list w 1 0))
word->line-info
constructs a singleton line, i.e.
one consisting of a single word. Adding this to the paragraph suffix
avoids some inelegance during length computation. We do not use the
cost associated with this sentinel value.
;; (: enumerate-breaks (→ (list Line-Info) (list Line-Info)))
(define (enumerate-breaks suffix)
(assert (pair? suffix))
(line-candidates 0 0 suffix))
;; (: line-candidates (→ ℕ Offset (list Line-Info) (list Line-Info)))
(define (line-candidates len run-len suffix)
(match suffix
((,_) ; last word of paragraph
`((,run-len 0)))
(((,w ,_ ,cost) . ,suffix*)
(if (>= len longest-allowed-line)
'() ; halt
(cons (list run-len (+ cost (length-cost len)))
(let ((len* (+ len space-length (word-length w))))
(line-candidates len* (+ run-len 1) suffix*)))))))
The helper function line-candidates
is the most
complex part of the entire algorithm. It traverses a prefix of
suffix, extending the previous break by one word at each
step, until it hits the maximum line length. The list of candidate
breaks is returned.
A few explanations are in order. The last line of a paragraph is
always assigned a cost of 0, since it’s customary to allows these
lines to be very short. This is handled by the first, singleton case
of line-candidates
. space-length
—a constant
equal to 1 in this version of the algorithm—is added to the length of
each candidate; this is the number of spaces following the current word.
(These are deliberate simplifications. See the Enhancements section for
discussion.)
length-cost
computes the cost of a line as a function
of the square of the difference between its length and the goal
length:
;; (: length-cost (→ ℕ Cost))
(define (length-cost len)
(short-cost (- goal-length len)))
;; (: short-cost (→ Integer Cost))
(define (short-cost z)
(expt (* z 10) 2))
(In GNU fmt, the goal length is 7% less than the maximum allowed width.)
Much more complex heuristics could be used to compute the cost of
a line, but the version of length-cost
above is simple
and pretty good in practice.
An elegant fact about the line-candidates
fold is
that the number of words it must examine to find the optimal way of
breaking a given line is bounded by longest-allowed-line
/ 2. To see this, consider a maximally “fine” (as opposed to “coarse”)
paragraph made up of one-character words, e.g.
A B C D E F G …
line-candidates
will produce many candidates
for a suffix of such a paragraph, because the cumulative
length will approach longest-allowed-line
slowly. How
slowly? At a rate of 1 + space-length
per
word. For normal prose space-constant
≥ 1, so the
line-candidates
process will terminate after producing at
most longest-allowed-line
/ 2 candidates.
line-candidates
thus runs in O(1) time. And since the
rest of
the algorithm simply processes the candidates produced by this function,
it seems likely that flow-paragraph
runs in time linear
in the length of its input.
TODO
© 2024 Wolfgang Corcoran-Mathe. Released under the terms of the Creative Commons Attribution 4.0 International license.