string->utf8, string->utf16, string->utf32, utf8->string, utf16->string, utf32->string - convert between strings and bytevectors
LIBRARY
(import (rnrs)) ;R6RS
(import (rnrs bytevectors)) ;R6RS
(import (scheme base)) ;R7RS
SYNOPSIS
(string->utf8 string)
(string->utf8 string start) ;R7RS
(string->utf8 string start end) ;R7RS
(utf8->string bytevector)
(utf8->string bytevector start) ;R7RS
(utf8->string bytevector start end) ;R7RS
;; The following procedures are in R6RS and are absent from R7RS.
(string->utf16 string)
(string->utf16 string endianness)
(string->utf32 string)
(string->utf32 string endianness)
(utf16->string bytevector endianness)
(utf16->string bytevector endianness endianness-mandatory?)
(utf32->string bytevector endianness)
(utf32->string bytevector endianness endianness-mandatory?)
DESCRIPTION
These procedures convert between strings and bytevectors that hold the
same text in a Unicode encoding (UTF-8, UTF-16, or UTF-32).
The
string->utf8,
string->utf16,
and
string->utf32
procedures return a bytevector that contains an encoding of
string
(with no byte-order mark).
The
utf8->string,
utf16->string,
and
utf32->string
procedures return a string whose character sequence is encoded by
bytevector.
- utf8->string, string->utf8
-
These procedures use the UTF-8 encoding.
- string->utf16
-
This procedure encodes according to UTF-16BE (default) or UTF-16LE.
- string->utf32
-
This procedure encodes according to UTF-32BE (default) or UTF-32LE.
- utf16->string
-
This procedure decodes according to UTF-16, UTF-16BE, UTF-16LE, or
a fourth encoding scheme that differs from all of those, as in the
description of
endianness-mandatory?
below.
- utf32->string
-
This procedure decodes according to UTF-32, UTF-32BE, UTF-32LE, or
a fourth encoding scheme that differs from all of those, as in the
description of
endianness-mandatory?
below.
- Endianness
-
If
endianness
is specified, it must be the symbol
big
or the symbol
little.
This differs from other bytevector procedures that can support
additional implementation-defined endianness values. See
native-endianness(3scm)
for a definition of endianness.
The default endianness for the
string->
procedures is
big.
The endianness concept is not applicable to UTF-8.
- Byte-order marks
-
A UTF-16 BOM is either the sequence of bytes #xFE, #xFF specifying
big
and UTF-16BE, or #xFF, #xFE specifying
little
and UTF-16LE.
A UTF-32 BOM is either the sequence of bytes #x00, #x00, #xFE, #xFF
specifying
big
and UTF-32BE, or #xFF, #xFE, #x00, #x00 specifying
little
and UTF-32LE.
A UTF-8 BOM is the sequence of bytes #xEF, #xBB, #xBF. Neither R6RS
nor R7RS mentions the UTF-8 BOM. It does not specify an endianness,
but is sometimes used as a magic string to mark UTF-8 text.
- The endianness-mandatory? argument (the fourth encoding scheme)
-
If endianness-mandatory? is absent or
#f,
then
utf16->string
and
utf32->string
determine the endianness according to a BOM at
the beginning of bytevector if a BOM is present; in this
case, the BOM is not decoded as a character.
Also in this case, if no BOM is present,
endianness
specifies the endianness of the encoding.
If
endianness-mandatory?
is a true value,
endianness
specifies the endianness of the encoding, and any BOM in the encoding
is decoded as a regular character.
- Decoding errors
-
If an invalid or incomplete character
encoding is encountered, then the replacement character U+FFFD is
appended to the string being generated, an appropriate number of bytes
are ignored, and decoding continues with the following bytes.
- R7RS
-
R7RS adds the optional start and end arguments, which restrict the
conversion to the part of the input from start (inclusive) to end
(exclusive). R7RS provides only the UTF-8 procedures, and an
implementation may support only a subset of the characters (see the
errors section).
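As a minimal sketch of the R7RS start and end arguments, the expressions
below stay within 7-bit ASCII, which every R7RS implementation must
support. The indices count characters for string->utf8 and bytes for
utf8->string. The name abc-bytes is illustrative only and is not part of
R7RS.

```scheme
(import (scheme base))

;; Illustrative name (an assumption): the UTF-8 encoding of "ABC".
(define abc-bytes (bytevector #x41 #x42 #x43))

;; Decoding: start and end are byte indices into the bytevector.
(utf8->string abc-bytes 1)      ; => "BC"
(utf8->string abc-bytes 1 2)    ; => "B"

;; Encoding: start and end are character indices into the string.
(string->utf8 "ABC" 1)          ; => #u8(#x42 #x43)
(string->utf8 "ABC" 0 2)        ; => #u8(#x41 #x42)
```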
IMPLEMENTATION NOTES
- Chez Scheme
-
There is a single empty bytevector object and a single empty string
object. If these are returned then they are not newly allocated.
Chez Scheme removes an initial UTF-8 BOM.
- Loko Scheme
-
Same notes as for Chez Scheme.
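Because implementations appear to differ on whether utf8->string keeps
an initial UTF-8 BOM, portable code may prefer to drop it explicitly.
The following is a sketch in R6RS; utf8->string/no-bom is a hypothetical
helper, not a standard procedure. It assumes that an implementation
which keeps the BOM decodes it as the character U+FEFF.

```scheme
(import (rnrs))

;; Hypothetical helper (not in R6RS or R7RS): decode UTF-8 and drop a
;; leading U+FEFF if the implementation left one in the result.
(define (utf8->string/no-bom bv)
  (let ((s (utf8->string bv)))
    (if (and (> (string-length s) 0)
             (char=? (string-ref s 0) #\xFEFF))
        (substring s 1 (string-length s))
        s)))

;; #xEF #xBB #xBF is the UTF-8 BOM, followed by "A".
(utf8->string/no-bom #vu8(#xEF #xBB #xBF #x41))   ; => "A"
(utf8->string/no-bom #vu8(#x41))                  ; => "A"
```

The result is the same whether or not the implementation has already
removed the BOM, since the check is made on the decoded string.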
RETURN VALUES
Returns a single value: a newly allocated bytevector or string object.
An empty result is not necessarily newly allocated.
EXAMPLES
;; The #vu8() syntax is used in R6RS. R7RS uses #u8() instead.
(utf8->string #vu8(#x41))
=> "A"
(string->utf8 "λ")
=> #vu8(#xCE #xBB)
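The endianness and byte-order mark rules from the description can be
illustrated with the R6RS UTF-16 and UTF-32 procedures. This is a sketch
that passes the symbols big and little directly, as described above;
the names le-bom-a and be-bom-a are illustrative only.

```scheme
(import (rnrs))

;; Encoding: big is the default and no BOM is written.
(string->utf16 "A")             ; => #vu8(#x00 #x41)
(string->utf16 "A" 'little)     ; => #vu8(#x41 #x00)
(string->utf32 "A" 'little)     ; => #vu8(#x41 #x00 #x00 #x00)

;; Illustrative inputs (assumptions): "A" preceded by a UTF-16 BOM.
(define le-bom-a #vu8(#xFF #xFE #x41 #x00))   ; little-endian BOM
(define be-bom-a #vu8(#xFE #xFF #x00 #x41))   ; big-endian BOM

;; endianness-mandatory? absent: the BOM overrides the endianness
;; argument and is not decoded as a character.
(utf16->string le-bom-a 'big)       ; => "A"

;; endianness-mandatory? true: the argument is used as given and the
;; BOM is decoded as the character U+FEFF.
(utf16->string be-bom-a 'big #t)    ; => "\xFEFF;A"
```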
APPLICATION USAGE
These procedures are used when interfacing with external systems or
other processes, and wherever strings are encoded in one of the supported
encodings. File operations are usually handled better using a
transcoded port, except in cases where the file structure as such is
binary and only some parts represent strings.
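As a sketch of the binary case, consider a hypothetical record field
made of one length byte followed by that many bytes of UTF-8 text. The
layout and the names encode-field and decode-field are assumptions made
for illustration; they do not come from any standard. The code uses the
R6RS argument order for bytevector-copy!.

```scheme
(import (rnrs))

;; Hypothetical layout: [length byte][UTF-8 payload].
(define (encode-field str)
  (let* ((payload (string->utf8 str))
         (n (bytevector-length payload)))
    (assert (<= n 255))                     ; length must fit in one byte
    (let ((out (make-bytevector (+ n 1))))
      (bytevector-u8-set! out 0 n)
      ;; R6RS order: source, source-start, target, target-start, count.
      (bytevector-copy! payload 0 out 1 n)
      out)))

(define (decode-field bv)
  (let* ((n (bytevector-u8-ref bv 0))
         (payload (make-bytevector n)))
    (bytevector-copy! bv 1 payload 0 n)
    (utf8->string payload)))

;; "\x3BB;" is the character λ, which takes two bytes in UTF-8.
(encode-field "\x3BB;")                     ; => #vu8(2 #xCE #xBB)
(decode-field (encode-field "\x3BB;"))      ; => "λ"
```

The length byte counts bytes rather than characters, which is why the
string is encoded before the length is taken.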
COMPATIBILITY
The UTF-8 variants of these procedures are present in both R6RS and R7RS,
but R6RS is missing the start and end arguments.
The number of bytes skipped when decoding an invalid or incomplete
character differs between implementations. Relying on the precise
number of bytes skipped, or the number of replacement characters used,
is not portable.
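One way to avoid depending on the number of replacement characters is to
check whether the input survives a round trip. The following R6RS sketch
assumes the replacement behavior described under decoding errors;
valid-utf8? is a hypothetical helper, not a standard procedure.

```scheme
(import (rnrs))

;; Hypothetical helper: true if decoding and re-encoding reproduces the
;; input, which fails whenever a replacement character was substituted.
(define (valid-utf8? bv)
  (bytevector=? bv (string->utf8 (utf8->string bv))))

(valid-utf8? #vu8(#x41 #xCE #xBB))   ; => #t
(valid-utf8? #vu8(#x41 #xFF #x42))   ; => #f, #xFF never occurs in UTF-8
```

On an implementation that removes an initial UTF-8 BOM, input that
starts with a BOM would also be reported as not surviving the round
trip.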
Some implementations do not return newly allocated strings or
bytevectors if they are empty, as they have a single copy of each.
For R7RS, also see the note below in the errors section.
ERRORS
These procedures can raise exceptions with the following condition types:
- &assertion (R6RS)
-
The wrong number of arguments was passed or an argument was outside its domain.
Somewhat unusually, the
endianness-mandatory?
argument can be any object.
- Unsupported characters (R7RS)
-
It is an error to pass
utf8->string
a character in UTF-8 encoded form which the implementation does not
support. 7-bit ASCII (except #\null) must be supported. Any
other character is optional and potentially an error. You can use the
full-unicode
feature identifier in
cond-expand(7scm)
to check if all of Unicode 6.0 is supported.
- R7RS
-
The conditions described under &assertion above are errors in R7RS.
Implementations may signal an error, extend the procedure's
domain of definition to include such arguments,
or fail catastrophically.
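The full-unicode check mentioned above might be used as in the following
R7RS sketch. The name decode-lambda and the "?" fallback are assumptions
made for illustration. The bytes are built with the bytevector procedure
so that no non-ASCII character appears in the source text.

```scheme
(import (scheme base))

;; Hypothetical helper: decode the UTF-8 bytes for U+03BB only when the
;; implementation declares support for all of Unicode 6.0.
(define (decode-lambda)
  (cond-expand
    (full-unicode (utf8->string (bytevector #xCE #xBB)))
    (else "?")))                 ; ASCII fallback, always supported

(decode-lambda)                  ; result depends on the implementation
```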
SEE ALSO
string->bytevector(3scm),
bytevector->string(3scm),
transcoded-port(3scm)
STANDARDS
R6RS,
R7RS
AUTHORS
This page is part of the
scheme-manpages
project.
It includes materials from the RnRS documents.
More information can be found at
https://github.com/schemedoc/manpages/
.
Markup created by unroff 1.0sc, March 04, 2023.