(lispkit char-set)

Last updated 6 months ago

(lispkit char-set)

Library (lispkit char-set) implements efficient means to represent and manipulate sets of characters. Its design is based on SRFI 14 but the implementation is specific to the definition of characters in LispKit; i.e. library (lispkit char-set) assumes that characters are UTF-16 code units.

As opposed to SRFI 14, it can be assumed that the update procedures ending with "!" are mutating the corresponding character set. This means that clients of these procedures may rely on these procedures performing their functionality in terms of side effects.

In the procedure specifications below, the following conventions are used:

A cs parameter is a character set.
A s parameter is a string.
A char parameter is a character.
A char-list parameter is a list of characters.
A pred parameter is a unary character predicate procedure, returning either #t or #f when applied to a character.
An obj parameter may be any value at all.

Passing values to procedures with these parameters that do not satisfy these types is an error.

Unless otherwise noted in the specification of a procedure, procedures always return character sets that are distinct from the parameter character sets (unless the procedure mutates a character set and its name ends with "!"). For example, char-set-adjoin is guaranteed to provide a fresh character set, even if it is not given any character parameters.

Library (lispkit char-set) supports both mutable as well as immutable character sets. Character sets are assumed to be mutable unless they are explicitly specified to be immutable.

Constants

char-set:lower-case char-set:upper-case char-set:title-case char-set:letter char-set:digit char-set:letter+digit char-set:graphic char-set:printing char-set:whitespace char-set:newlines char-set:iso-control char-set:punctuation char-set:symbol char-set:hex-digit char-set:blank char-set:ascii char-set:empty char-set:full

Library (lispkit char-set) predefines these frequently used immutable character sets.

Note that there may be characters in char-set:letter that are neither upper or lower case. The char-set:whitespaces character set contains whitespace and newline characters. char-set:blanks only contains whitespace (i.e. "blank") characters. char-set:newlines only contains newline characters.

Predicates

Returns #t if obj is a character set, otherwise returns #f.

Returns #t if the character set cs does not contain any characters, otherwise returns #f.

Returns #t if all the provided character sets cs ... contain the exact same characters; returns #f otherwise. For both corner cases, (char-set=?) and (char-set=? cs), char-set=? returns #t.

Returns #t if every character set cs-i is a subset of character set cs-i+1; returns #f otherwise. For both corner cases, (char-set<=?) and (char-set<=? cs), char-set<=? returns #t.

Returns #t if character sets cs1 and cs2 are disjoint, i.e. they do not share a single character; returns #f otherwise.

Returns #t if character char is contained in character set cs; returns #f otherwise.

The char-set-every? procedure returns #t if predicate pred returns #t for every character in the character set cs. Likewise, char-set-any? applies pred to every character in character set cs, and returns #t if there is at least one character for which pred returns #t. If no character produces a #t value, it returns #f. The order in which these procedures sequence through the elements of cs is not specified.

Constructors

Return a newly allocated mutable character set containing the given characters.

Return a newly allocated immutable character set containing the given characters.

Returns a newly allocated copy of the character set cs. The copy is mutable by default unless parameter mutable? is provided and set to #f.

Return a newly allocated mutable character set containing the characters in the list of characters char-list. If character set base-cs is provided, the characters from base-cs are added to it as well.

Return a newly allocated mutable character set containing the characters of the string s. If character set base-cs is provided, the characters from base-cs are added to it as well.

Returns a newly allocated mutable character set containing every character whose ISO/IEC 10646 UCS-4 code lies in the half-open range [lower,upper). lower and upper are exact non-negative integers where lower <= upper <= limit is required to hold. limit is either an exact non-negative integer specifying the maximum upper limit, or it is #t which specifies the maximum UTF-16 code unit value. If limit is not provided, a very large default is assumed (equivalent to limit being #f).

This signature is compatible with the SRFI 16 specification which states that if the requested range includes unassigned UCS values, these are silently ignored. If the requested range includes "private" or "user space" codes, these are passed through transparently. If any code from the requested range specifies a valid, assigned UCS character that has no corresponding representative in the implementation's character type, then

an error is raised if limit is #t, and
the code is ignored if limit is #f (the default).

If character set base-cs is provided, the characters of base-cs are included in the newly allocated mutable character set.

Returns a new character set containing every character c in character set cs such that (pred c) returns true. If character set base-cs is provided, the characters specified by base-cs are added to it.

Coerces object x into a character set. x may be a string, character or character set. A string is converted to the set of its constituent characters; a character is converted to a singleton character set; a character set is returned as is. This procedure is intended for use by other procedures that want to provide "user-friendly", wide-spectrum interfaces to their clients.

Querying character sets

Returns the number of elements in character set cs.

Apply pred to the characters of character set cs, and return the number of characters that caused the predicate to return #t.

This procedure returns a list of the characters of character set cs. The order in which cs's characters appear in the list is not defined, and may be different from one call to another.

This procedure returns a string containing the characters of character set cs. The order in which cs's characters appear in the string is not defined, and may be different from one call to another.

Compute a hash value for the character set cs. bound is a non-negative exact integer specifying the range of the hash function. A positive value restricts the return value to the range [0, bound). If bound is either zero or not given, a default value is used, chosen to be as large as it is efficiently practical.

Character set algebra

Return a newly allocated mutable copy of cs into which the characters char ... were inserted.

Return a newly allocated mutable copy of cs from which the characters char ... were removed.

Return a newly allocated character set containing all characters that are not contained in cs.

These procedures implement set complement, union, intersection, difference, and exclusive-or for character sets. The union, intersection and xor operations are n-ary. The difference function is also n-ary, associates to the left (that is, it computes the difference between its first argument and the union of all the other arguments), and requires at least one argument.

Boundary cases:

(char-set-union)          ⇒  char-set:empty
(char-set-intersection)   ⇒  char-set:full
(char-set-xor)            ⇒  char-set:empty
(char-set-difference cs)  ⇒  cs

char-set-diff+intersection returns both the difference and the intersection of the arguments, i.e. it partitions its first parameter. It is equivalent to (values (char-set-difference cs1 cs2 ...) (char-set-intersection cs1 (char-set-union cs2 ...))) but can be implemented more efficiently.

Mutating character sets

Insert the characters char ... into the character set cs.

Remove the characters char ... from the character set cs.

Complement the character set cs by including all characters that were not contained in cs previously and by removing all previously contained characters.

These are update variants of the set-algebra functions mutating the first character set cs1 instead of creating a new one. char-set-diff+intersection! will perform a side-effect on both of its two required parameters cs1 and cs2.

Adds every character c in cs for which (pred c) returns #t to the given character set base-cs.

Add the characters from the character list char-list to character set base-cs and return the mutated character set base-cs.

Add the characters of the string s to character set base-cs and return the mutated character set base-cs.

Mutates the mutable character set base-cs including every character whose ISO/IEC 10646 UCS-4 code lies in the half-open range [lower,upper). lower and upper are exact non-negative integers where lower <= upper <= limit is required to hold. limit is either an exact non-negative integer specifying the maximum upper limit, or it is #t which specifies the maximum UTF-16 code unit value. If limit is not provided, a very large default is assumed (equivalent to limit being #f).

This is a fundamental constructor for character sets.

g is used to generate a series of "seed" values from the initial seed: seed, (g seed), (g2 seed), (g3 seed), ...
p tells us when to stop by returning #t when applied to one of these seed values.
f maps each seed value to a character. These characters are added to the base character set base-cs to form the result. char-set-unfold! adds the characters by mutating base-cs as a side effect.

Iterating over character sets

Cursors are a low-level facility for iterating over the characters in a character set cs. A cursor is a value that indexes a character in a character set. char-set-cursor produces a new cursor for a given character set. The set element indexed by the cursor is fetched with char-set-ref. A cursor index is incremented with char-set-cursor-next; in this way, code can step through every character in a character set. Stepping a cursor "past the end" of a character set produces a cursor that answers true to end-of-char-set?. It is an error to pass such a cursor to char-set-ref or to char-set-cursor-next.

A cursor value may not be used in conjunction with a different character set; if it is passed to char-set-ref or char-set-cursor-next with a character set other than the one used to create it, the results and effects are undefined. These primitives are necessary to export an iteration facility for character sets to loop macros.

(define cs (char-set #\G #\a #\T #\e #\c #\h))

;; Collect elts of CS into a list.
(let lp ((cur (char-set-cursor cs)) (ans '()))
  (if (end-of-char-set? cur) ans
      (lp (char-set-cursor-next cs cur)
          (cons (char-set-ref cs cur) ans))))
  ⇒  (#\G #\T #\a #\c #\e #\h)

;; Equivalently, using a list unfold (from SRFI 1):
(unfold-right end-of-char-set? 
              (curry char-set-ref cs)
	          (curry char-set-cursor-next cs)
	          (char-set-cursor cs))
  ⇒  (#\G #\T #\a #\c #\e #\h)

This is the fundamental iterator for character sets. Applies the function kons across the character set cs using initial state value knil. That is, if cs is the empty set, the procedure returns knil. Otherwise, some element c of cs is chosen; let cs' be the remaining, unchosen characters. The procedure returns (char-set-fold kons (kons c knil) cs').

; CHAR-SET-MEMBERS
(lambda (cs) (char-set-fold cons '() cs))
; CHAR-SET-SIZE
(lambda (cs) (char-set-fold (lambda (c i) (+ i 1)) 0 cs))
; How many vowels in the char set?
(lambda (cs) 
  (char-set-fold (lambda (c i) (if (vowel? c) (+ i 1) i)) 0 cs))

This is a fundamental constructor for character sets.

g is used to generate a series of "seed" values from the initial seed: seed, (g seed), (g2 seed), (g3 seed), ...
p tells us when to stop, when it returns #t when applied to one of these seed values.
f maps each seed value to a character. These characters are added to a mutable copy of the base character set base-cs to form the result; base-cs defaults to an empty set.

More precisely, the following definitions hold, ignoring the optional-argument issues:

(define (char-set-unfold p f g seed base-cs) 
  (char-set-unfold! p f g seed (char-set-copy base-cs)))

(define (char-set-unfold! p f g seed base-cs)
  (let lp ((seed seed) (cs base-cs))
    (if (p seed) cs                            ; P says we are done
        (lp (g seed)                           ; Loop on (G SEED)
            (char-set-adjoin! cs (f seed)))))) ; Add (F SEED) to set

Examples:

(port->char-set p) = (char-set-unfold eof-object?
                                      values
                                      (lambda (x) (read-char p))
                                      (read-char p))
(list->char-set lis) = (char-set-unfold null? car cdr lis)

Apply procedure proc to each character in the character set cs. Note that the order in which proc is applied to the characters in the set is not specified, and may even change from one procedure application to another.

proc is a procedure mapping characters to characters. It will be applied to all the characters in the character set cs, and the results will be collected in a newly allocated mutable character set which will be returned by char-set-map.

Last updated 6 months ago