Pike Reference Manual

23. Charset

Module Charset

Description

The Charset module supports a wide variety of character sets and is flexible about the character set names it accepts. Character case is ignored, as are the most common non-alphanumeric characters that appear in character set names. For example, "iso-8859-1" works just as well as "ISO_8859_1". All encodings specified in RFC 1345 are supported.
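
As a minimal sketch of the name handling described above, the following decodes the same bytes under two spellings of one charset name. The helper decode_with() is a hypothetical name introduced for illustration only; it is not part of the module.

```pike
// Hypothetical helper: decode bytes under the given charset name.
string decode_with(string name, string(8bit) data)
{
  return Charset.decoder(name)->feed(data)->drain();
}

int main()
{
  string(8bit) data = "caf\351";   // byte 0xE9 is e-acute in ISO-8859-1

  // Both spellings are expected to select the same codec, so the
  // decoded strings should compare equal.
  write("%O\n", decode_with("iso-8859-1", data) ==
                decode_with("ISO_8859_1", data));
  return 0;
}
```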

First of all, the Charset module handles the following encodings of Unicode:

ucs2, ucs2be, ucs2le, ucs4, ucs4be, ucs4le
    Universal Coded Character Set encodings.
utf7, utf8, utf16, utf16be, utf16le, utf32, utf32be, utf32le, utf75, utf7½
    Unicode Transformation Format (aka UTF) encodings.
shiftjis, euc-kr, euc-cn, euc-jp

Most, if not all, of the relevant code pages are represented, as the following list shows. Prefix the numbers as noted in the list to get the wanted codec:

037, 038, 273, 274, 275, 277, 278, 280, 281, 284, 285, 290, 297, 367, 420, 423,
424, 437, 500, 819, 850, 851, 852, 855, 857, 860, 861, 862, 863, 864, 865, 866,
868, 869, 870, 871, 880, 891, 903, 904, 905, 918, 932, 936, 950, 1026
    These may be prefixed with "cp", "ibm" or "ms".
1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258
    These may be prefixed with "cp", "ibm", "ms" or "windows".
mysql-latin1
    The default charset in MySQL, similar to cp1252.

+359 more.

Note

In Pike 7.8 and earlier this module was named Locale.Charset.

Method decode_error: void decode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
Description: Throws a DecodeError exception. See DecodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).

Method decoder: Decoder decoder(string|zero name)
Description: Returns a charset decoder object.
Parameter name: The name of the character set to decode from. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
Throws: If the requested name is not supported, an error is thrown.
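Example: A minimal sketch of typical use, assuming UTF-8 input. The helper decode_utf8() and the charset name "no-such-charset" are made up for illustration. The second half shows one way to trap the error thrown for an unsupported name.

```pike
// Hypothetical helper: decode a UTF-8 byte string to a Pike string.
string decode_utf8(string(8bit) data)
{
  return Charset.decoder("utf-8")->feed(data)->drain();
}

int main()
{
  // "\303\245" is the UTF-8 byte sequence for U+00E5.
  string text = decode_utf8("\303\245");
  write("%d code point(s), first is U+%04X\n", sizeof(text), text[0]);

  // An unsupported name makes decoder() throw; catch returns the
  // thrown value so that it can be inspected.
  mixed err = catch { Charset.decoder("no-such-charset"); };
  if (err)
    write("decoder() failed: %s", describe_error(err));
  return 0;
}
```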

Method decoder_from_mib: Decoder decoder_from_mib(int mib)
Description: Returns a decoder for the encoding scheme denoted by the IANA MIBenum value mib.
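Example: A sketch only. It assumes that the module's MIB table contains 106, the MIBenum value that IANA registers for UTF-8. The helper decode_by_mib() is a hypothetical name.

```pike
// Hypothetical helper: decode bytes using a charset selected by MIBenum.
string decode_by_mib(int mib, string(8bit) data)
{
  return Charset.decoder_from_mib(mib)->feed(data)->drain();
}

int main()
{
  // 106 is the IANA MIBenum for UTF-8 (assumed to be known to the module).
  string text = decode_by_mib(106, "\303\245");
  write("U+%04X\n", text[0]);
  return 0;
}
```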

Method encode_error: void encode_error(string err_str, int err_pos, string charset, void|string reason, mixed ... args)
Description: Throws an EncodeError exception. See EncodeError.create for details about the arguments. If args is given then the error reason is formatted using sprintf(reason, @args).

Method encoder: Encoder encoder(string|zero name, string|void replacement, function(string:string)|void repcb)
Description: Returns a charset encoder object.
Parameter name: The name of the character set to encode to. Supported charsets include (not all supported charsets are enumerable): "iso_8859-1:1987", "iso_8859-1:1998", "iso-8859-1", "iso-ir-100", "latin1", "l1", "ansi_x3.4-1968", "iso_646.irv:1991", "iso646-us", "iso-ir-6", "us", "us-ascii", "ascii", "cp367", "ibm367", "cp819", "ibm819", "iso-2022" (of various kinds), "utf-7", "utf-8" and various encodings as described by RFC 1345.
Parameter replacement: The string to use for characters that cannot be represented in the charset. It's used when repcb is not given or when it returns zero. If no replacement string is given then an error is thrown instead.
Parameter repcb: A function to call for every character that cannot be represented in the charset. If specified, it is called with one argument: a string containing the character in question. If it returns a string, that string replaces the character in the output. If it returns anything else, the replacement argument decides what happens.
Throws: If the requested name is not supported, an error is thrown.
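Example: A minimal sketch of the two fallback mechanisms, using a euro sign (U+20AC), which ISO-8859-1 cannot represent. The helper names and the choice of "?" and numeric character references as substitutes are illustrative assumptions.

```pike
// Hypothetical helper: substitute "?" for unrepresentable characters.
string to_latin1_lossy(string text)
{
  return Charset.encoder("iso-8859-1", "?")->feed(text)->drain();
}

// Hypothetical helper: substitute a numeric character reference instead.
// No replacement string is given, so only the callback is consulted.
string to_latin1_entities(string text)
{
  return Charset.encoder("iso-8859-1", UNDEFINED,
                         lambda(string c) {
                           // c holds the single offending character.
                           return sprintf("&#%d;", c[0]);
                         })->feed(text)->drain();
}

int main()
{
  string text = "10 " + sprintf("%c", 0x20ac);   // "10 " plus a euro sign
  write("%O\n", to_latin1_lossy(text));
  write("%O\n", to_latin1_entities(text));
  return 0;
}
```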

Method encoder_from_mib: Encoder encoder_from_mib(int mib, string|void replacement, function(string:string)|void repcb)
Description: Returns an encoder for the encoding scheme denoted by the IANA MIBenum value mib.

Method normalize: string|zero normalize(string|zero in)
Description: All character set names are normalized through this function before being compared.
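Example: A small sketch of comparing charset names. The exact normalized form is not specified here, so the code only compares results and does not rely on a particular spelling. The helper same_charset_name() is hypothetical.

```pike
// Hypothetical helper: true if two names normalize to the same string.
int(0..1) same_charset_name(string a, string b)
{
  return Charset.normalize(a) == Charset.normalize(b);
}

int main()
{
  write("%O\n", same_charset_name("iso-8859-1", "ISO_8859_1"));
  write("%O\n", same_charset_name("iso-8859-1", "utf-8"));
  return 0;
}
```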

Method set_decoder: void set_decoder(string name, program decoder)
Description: Adds a custom defined character set decoder. The name is normalized through the use of normalize.

Method set_encoder: void set_encoder(string name, program encoder)
Description: Adds a custom defined character set encoder. The name is normalized through the use of normalize.
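Example: A rough sketch of registering a custom decoder. It assumes that a registered program is instantiated without arguments and only needs to provide feed(), drain() and clear(). The class UpperDecoder, the helper and the name "x-demo-upper" are invented for illustration and do not correspond to any real charset.

```pike
// Hypothetical decoder that upper-cases its input; purely illustrative.
class UpperDecoder
{
  inherit Charset.Decoder;

  protected string pending = "";

  this_program feed(string(8bit) s)
  {
    pending += upper_case(s);
    return this;
  }

  string drain()
  {
    string res = pending;
    pending = "";
    return res;
  }

  this_program clear()
  {
    pending = "";
    return this;
  }
}

// Hypothetical helper: register the decoder, then look it up again.
string decode_demo(string(8bit) data)
{
  Charset.set_decoder("x-demo-upper", UpperDecoder);
  // The lookup name is normalized too, so a different spelling is
  // expected to find the same decoder.
  return Charset.decoder("X_Demo_Upper")->feed(data)->drain();
}

int main()
{
  write("%O\n", decode_demo("abc"));
  return 0;
}
```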

Class Charset.CharsetGenericError

Description: Base class for errors thrown by the Charset module.

Inherit Generic: inherit Error.Generic : Generic

Class Charset.DecodeError

Description: Error thrown when decoding fails (and no replacement character or replacement callback has been registered).
FIXME: This error class is not actually used by this module yet - decode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.

Inherit CharsetGenericError: inherit CharsetGenericError : CharsetGenericError

Variable charset: string Charset.DecodeError.charset
Description: The decoding charset, typically as known to Charset.decoder.
Note: Other code may produce errors of this type. In that case this name is something that Charset.decoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.decoder never accepts that name in the future (unless it is extended to implement exactly the same charset).

Variable err_pos: int Charset.DecodeError.err_pos
Description: The failing position in err_str.

Variable err_str: string Charset.DecodeError.err_str
Description: The string that failed to be decoded.

Class Charset.Decoder

Description

Virtual base class for charset decoders.

Decoders take a stream of bytes and convert them to a (possibly wide) string of Unicode code points.

Example

string win1252_to_string( string(8bit) data )
{
  return Charset.decoder("windows-1252")->feed( data )->drain();
}

See also

decoder(), Encoder

Variable charset: string Charset.Decoder.charset
Description: Canonical name of the charset - giving this name to decoder returns an instance of the same class as this object.
Note: This is not necessarily the same name that was actually given to decoder to produce this object.

Method clear: this_program clear()
Description: Clear buffers, and reset all state.
Returns: Returns the current object to allow for chaining of calls.

Method drain: string drain()
Description: Get the decoded data, and reset buffers.
Returns: Returns the decoded string.

Method feed: this_program feed(string(8bit) s)
variant this_program feed(Stdio.Buffer buf)
Description: Feeds a string to the decoder.
Parameter s: String to be decoded.
Parameter buf: Stdio.Buffer containing data to be decoded.
Returns: Returns the current object, to allow for chaining of calls.
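Example: A sketch of feeding data in several chunks before draining. It assumes that the UTF-8 decoder buffers an incomplete multi-byte sequence until the remaining bytes arrive, as the stream-oriented description of decoders suggests. The helper decode_chunks() is a hypothetical name.

```pike
// Hypothetical helper: feed chunks one by one, then drain the result.
string decode_chunks(string name, array(string(8bit)) chunks)
{
  Charset.Decoder dec = Charset.decoder(name);
  foreach (chunks, string(8bit) chunk)
    dec->feed(chunk);
  return dec->drain();
}

int main()
{
  // The two-byte sequence "\303\266" (U+00F6) is split across chunks.
  string text = decode_chunks("utf-8", ({ "sm\303", "\266r" }));
  write("%d code points\n", sizeof(text));
  return 0;
}
```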

Class Charset.EncodeError

Description: Error thrown when encoding fails (and no replacement character or replacement callback has been registered).
FIXME: This error class is not actually used by this module yet - encode errors are still thrown as untyped error arrays. At this point it exists only for use by other modules.

Inherit CharsetGenericError: inherit CharsetGenericError : CharsetGenericError

Variable charset: string Charset.EncodeError.charset
Description: The encoding charset, typically as known to Charset.encoder.
Note: Other code may produce errors of this type. In that case this name is something that Charset.encoder does not accept (unless it implements exactly the same charset), and it should be reasonably certain that Charset.encoder never accepts that name in the future (unless it is extended to implement exactly the same charset).

Variable err_pos: int Charset.EncodeError.err_pos
Description: The failing position in err_str.

Variable err_str: string Charset.EncodeError.err_str
Description: The string that failed to be encoded.

Class Charset.Encoder

Description

Virtual base class for charset encoders.

Encoders take a stream of Unicode code points and convert them to a string of 8-bit bytes.
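
Example

A minimal round-trip sketch, mirroring the decoder example: a wide string is encoded to UTF-8 bytes and decoded back. The helper names encode_as() and decode_from() are hypothetical.

```pike
// Hypothetical helper: encode a (possibly wide) string to bytes.
string encode_as(string name, string text)
{
  return Charset.encoder(name)->feed(text)->drain();
}

// Hypothetical helper: decode bytes back to a string.
string decode_from(string name, string(8bit) data)
{
  return Charset.decoder(name)->feed(data)->drain();
}

int main()
{
  string text = "euro: " + sprintf("%c", 0x20ac);   // contains U+20AC
  string bytes = encode_as("utf-8", text);
  write("%d characters became %d bytes\n", sizeof(text), sizeof(bytes));
  write("%O\n", decode_from("utf-8", bytes) == text);
  return 0;
}
```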

See also

encoder(), Decoder

Inherit Decoder: inherit Decoder : Decoder
Description: An encoder differs from a decoder only in that it has an extra function, set_replacement_callback(), and in that feed() accepts wide strings and drain() returns only 8-bit strings.

Variable charset: string Charset.Encoder.charset
Description: Canonical name of the charset - giving this name to encoder returns an instance of the same class as this one.
Note: This is not necessarily the same name that was actually given to encoder to produce this object.

Method drain: string(8bit) drain()
Description: Similar to ::drain(), but always returns 8-bit strings.

Method feed: this_program feed(string|String.Buffer s)
Description: Similar to ::feed(), but accepts wide strings.

Method set_replacement_callback: this_program set_replacement_callback(function(string:string) rc)
Description: Change the replacement callback function.
Parameter rc: Function that is called to encode characters outside the current character encoding.
Returns: Returns the current object to allow for chaining of calls.
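Example: A sketch of installing a callback on an existing encoder. The "<U+XXXX>" notation produced by the callback and the helper name to_ascii_escaped() are arbitrary choices made for illustration.

```pike
// Hypothetical helper: encode to US-ASCII, spelling out anything else.
string to_ascii_escaped(string text)
{
  Charset.Encoder enc = Charset.encoder("us-ascii");
  enc->set_replacement_callback(lambda(string c) {
    // c holds the single character that could not be encoded.
    return sprintf("<U+%04X>", c[0]);
  });
  return enc->feed(text)->drain();
}

int main()
{
  // 0xEF is i-diaeresis, which US-ASCII cannot represent.
  write("%O\n", to_ascii_escaped("na" + sprintf("%c", 0xef) + "ve"));
  return 0;
}
```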

Module Charset.Tables

Module Charset.Tables.iso88591

Description: Codec for the ISO-8859-1 character encoding.
