The Charset module supports a wide variety of character sets, and it is
flexible about the character set names it accepts. Character case is
ignored, as are the most common non-alphanumeric characters that appear
in character set names. For example, "iso-8859-1" works just as well as
"ISO_8859_1". All encodings specified in RFC 1345 are supported.
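As a rough illustration of this name flexibility, the sketch below decodes the same bytes under two spellings of one charset name and compares the results. The helper name same_decoding is hypothetical and not part of the module.

```pike
// Hypothetical helper: decode the same bytes under two charset names
// and report whether the decoded strings are identical.
int(0..1) same_decoding(string(8bit) data, string name_a, string name_b)
{
  string a = Charset.decoder(name_a)->feed(data)->drain();
  string b = Charset.decoder(name_b)->feed(data)->drain();
  return a == b;
}
```

Calling same_decoding("\xe5", "iso-8859-1", "ISO_8859_1") should report a match, since both spellings name the same charset.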
First of all, the Charset module can handle the following encodings of Unicode:
Universal Coded Character Set encodings.
Unicode Transformation Format (aka UTF) encodings.
Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the wanted codec:
These may be prefixed with "cp", "ibm" or "ms".
These may be prefixed with "cp", "ibm", "ms" or "windows".
The default charset in MySQL, similar to cp1252.
+359 more.
In Pike 7.8 and earlier this module was named Locale.Charset.
void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
Throws a DecodeError exception. See DecodeError.create for
details about the arguments. If args is given, the error
reason is formatted using sprintf(reason, @args).
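A minimal sketch of how code outside the module might raise and inspect such an error. The helper name report_decode_failure, the charset name "x-example-charset", and the input string are all made up for illustration.

```pike
// Hypothetical helper: raise a DecodeError for a made-up input and
// return the fields recorded on the caught error object.
array report_decode_failure()
{
  mixed err = catch {
    // Extra arguments follow the reason, so it is run through sprintf().
    Charset.decode_error("abc\xff", 3, "x-example-charset",
                         "Byte 0x%02x is not valid here.\n", 0xff);
  };
  if (!objectp(err)) return ({});
  return ({ err->charset, err->err_pos, err->err_str });
}
```

The returned array holds the charset name, the failing position, and the offending string, in that order.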
Decoder decoder(string|zero name)
Returns a charset decoder object.
name: The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
If the requested name is not supported, an error is thrown.
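Because an unsupported name makes decoder() throw, callers that accept charset names from untrusted input may want to guard the lookup. The following is a sketch under that assumption; decode_or_zero is a hypothetical helper name.

```pike
// Hypothetical helper: decode data with the named charset, or return 0
// if Charset.decoder() rejects the name.
string|zero decode_or_zero(string(8bit) data, string charset_name)
{
  Charset.Decoder dec;
  // decoder() throws when the name is not supported.
  if (catch { dec = Charset.decoder(charset_name); })
    return 0;
  return dec->feed(data)->drain();
}
```

For example, decode_or_zero("\xc3\xa5", "utf-8") decodes the two UTF-8 bytes into the single character U+00E5, while an unknown charset name yields 0.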
Decoder decoder_from_mib(int mib)
Returns a decoder for the encoding scheme denoted by MIB mib.
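A hedged sketch of a lookup by MIB number. In the IANA character set registry, MIBenum 4 is ISO_8859-1:1987 and 106 is UTF-8; which MIB values this module resolves is not stated here, so the hypothetical helper below treats any failure as "not available".

```pike
// Hypothetical helper: look up a decoder by IANA MIBenum, returning 0
// instead of throwing when the lookup fails.
Charset.Decoder|zero try_decoder_from_mib(int mib)
{
  Charset.Decoder dec;
  if (catch { dec = Charset.decoder_from_mib(mib); })
    return 0;
  return dec;
}
```

A caller would check the result before use, for instance by falling back to Charset.decoder() with a name when 0 is returned.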
void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
Throws an EncodeError exception. See EncodeError.create for
details about the arguments. If args is given, the error
reason is formatted using sprintf(reason, @args).
Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)
Returns a charset encoder object.
name: The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
replacement: The string to use for characters that cannot be represented in
the charset. It's used when repcb is not given or when it returns
zero. If no replacement string is given then an error is thrown
instead.
repcb: A function to call for every character that cannot be
represented in the charset. If specified it's called with one
argument - a string containing the character in question. If it
returns a string then that one will replace the character in the
output. If it returns something else then the replacement
argument will be used to decide what to do.
If the requested name is not supported, an error is thrown.
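The interplay between replacement and repcb can be sketched as follows. The callback ascii_fallback, its lookup table, and the helper to_latin1_with_fallback are hypothetical names chosen for this example.

```pike
// Hypothetical callback: map a few characters to ASCII stand-ins.
// Returning 0 for anything else lets the encoder fall back to the
// replacement string.
string ascii_fallback(string ch)
{
  mapping(string:string) table = ([
    "\x20ac": "EUR",   // EURO SIGN
    "\x2026": "...",   // HORIZONTAL ELLIPSIS
  ]);
  return table[ch];
}

// Encode text as ISO-8859-1, using the callback first and "?" as the
// last resort for characters the charset cannot represent.
string(8bit) to_latin1_with_fallback(string text)
{
  return Charset.encoder("iso-8859-1", "?", ascii_fallback)
    ->feed(text)->drain();
}
```

With this setup a euro sign is written as "EUR", while a character missing from the table, such as U+2603, is written as "?".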
Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)
Returns an encoder for the encoding schema denoted by MIB mib.
string|zero normalize(string|zero in)
All character set names are normalized through this function before being compared.
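One possible use of normalize() outside the module is as a cache key, so that different spellings of a name share one entry. This is only a sketch; decoder_cache and cached_decoder are hypothetical names.

```pike
// Hypothetical cache keyed on the normalized charset name.
mapping(string:Charset.Decoder) decoder_cache = ([]);

// Return a reset decoder for the given name, reusing one instance per
// normalized name.
Charset.Decoder cached_decoder(string name)
{
  string key = Charset.normalize(name);
  if (!decoder_cache[key])
    decoder_cache[key] = Charset.decoder(name);
  Charset.Decoder dec = decoder_cache[key];
  dec->clear();  // Drop any state left over from an earlier use.
  return dec;
}
```

Here cached_decoder("UTF-8") and cached_decoder("utf_8") are expected to return the same object, since both names normalize to the same key.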
void set_decoder(string name, program decoder)
Adds a custom defined character set decoder. The name is
normalized through the use of normalize.
void set_encoder(string name, program encoder)
Adds a custom defined character set encoder. The name is
normalized through the use of normalize.
Base class for errors thrown by the Charset module.
inherit Error.Generic : Generic
Error thrown when decode fails (and no replacement char or replacement callback has been registered).
This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
inherit CharsetGenericError : CharsetGenericError
string Charset.DecodeError.charset
The decoding charset, typically as known to
Charset.decoder.
Other code may produce errors of this type. In that case this
name is something that Charset.decoder does not accept
(unless it implements exactly the same charset), and it should
be reasonably certain that Charset.decoder never accepts that
name in the future (unless it is extended to implement exactly
the same charset).
int Charset.DecodeError.err_pos
The failing position in err_str.
string Charset.DecodeError.err_str
The string that failed to be decoded.
Virtual base class for charset decoders.
Decoders take a stream of bytes and convert them to a (possibly wide) string of Unicode code points.
string win1252_to_string(string(8bit) data)
{
  return Charset.decoder("windows-1252")->feed(data)->drain();
}
See also: decoder(), Encoder
string Charset.Decoder.charset
Canonical name of the charset - giving this name to decoder returns an
instance of the same class as this object.
This is not necessarily the same name that was actually given to
decoder to produce this object.
this_program clear()
Clear buffers, and reset all state.
Returns the current object to allow for chaining of calls.
string drain()
Get the decoded data, and reset buffers.
Returns the decoded string.
this_program feed(string(8bit) s)
variant this_program feed(Stdio.Buffer buf)
Feeds a string to the decoder.
s: String to be decoded.
buf: Stdio.Buffer containing data to be decoded.
Returns the current object, to allow for chaining of calls.
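Since feed() can be called repeatedly before drain(), input that arrives in pieces can be decoded incrementally. The sketch below assumes each chunk ends on a character boundary; how a multi-byte sequence split across chunks is handled is not stated in this section. The helper name decode_chunks is hypothetical.

```pike
// Hypothetical helper: feed a sequence of byte chunks to one decoder
// and return everything decoded so far.
string decode_chunks(array(string(8bit)) chunks, string charset_name)
{
  Charset.Decoder dec = Charset.decoder(charset_name);
  foreach (chunks, string(8bit) chunk)
    dec->feed(chunk);
  // drain() returns the decoded data and resets the buffers.
  return dec->drain();
}
```

For instance, decode_chunks(({ "a", "\xc3\xa5" }), "utf-8") yields a two-character string: "a" followed by U+00E5.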
Error thrown when encode fails (and no replacement char or replacement callback has been registered).
This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.
inherit CharsetGenericError : CharsetGenericError
string Charset.EncodeError.charset
The encoding charset, typically as known to
Charset.encoder.
Other code may produce errors of this type. In that case this
name is something that Charset.encoder does not accept
(unless it implements exactly the same charset), and it should
be reasonably certain that Charset.encoder never accepts that
name in the future (unless it is extended to implement exactly
the same charset).
int Charset.EncodeError.err_pos
The failing position in err_str.
string Charset.EncodeError.err_str
The string that failed to be encoded.
Virtual base class for charset encoders.
Encoders take a stream of Unicode code points and convert them to a string of 8-bit bytes.
See also: encoder(), Decoder
inherit Decoder : Decoder
An encoder differs from a decoder only in that it has an extra function,
and in that feed() accepts wide strings and drain() returns only
8-bit strings.
string Charset.Encoder.charset
Canonical name of the charset - giving this name to encoder returns
an instance of the same class as this one.
This is not necessarily the same name that was actually given to
encoder to produce this object.
string(8bit) drain()
Similar to ::drain(), but always returns 8-bit strings.
this_program feed(string|String.Buffer s)
Similar to ::feed(), but accepts wide strings.
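A short sketch of the encoder side of feed() and drain(): a wide string goes in and an 8-bit string comes out. The helper name to_utf8_bytes is hypothetical.

```pike
// Hypothetical helper: encode a (possibly wide) string as UTF-8 bytes.
string(8bit) to_utf8_bytes(string text)
{
  return Charset.encoder("utf-8")->feed(text)->drain();
}
```

Feeding the result back through Charset.decoder("utf-8") should reproduce the original string, which makes a simple round-trip check.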
this_program set_replacement_callback(function(string:string) rc)
Change the replacement callback function.
rc: Function that is called to encode characters outside the current character encoding.
Returns the current object to allow for chaining of calls.
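As an illustration, a callback can be installed on an existing encoder to emit numeric character references for characters the charset cannot represent. The helper name to_latin1_with_ncr and the choice of the "&#N;" notation are assumptions made for this sketch.

```pike
// Hypothetical helper: encode text as ISO-8859-1, writing characters
// outside the charset as decimal numeric character references.
string(8bit) to_latin1_with_ncr(string text)
{
  Charset.Encoder enc = Charset.encoder("iso-8859-1");
  enc->set_replacement_callback(lambda(string ch) {
    // ch holds the single character that could not be encoded.
    return sprintf("&#%d;", ch[0]);
  });
  return enc->feed(text)->drain();
}
```

With this callback a euro sign (U+20AC) is written as "&#8364;".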
Codec for the ISO-8859-1 character encoding.