The CharacterSet Intrinsic Class

TADS 3 uses the Unicode character set to represent all strings internally. Unicode is an international standard that was designed to be capable of representing, in a single character set, characters from every natural language in use throughout the world. Since most computers use other character sets for the display, keyboard, and file system, though, it is often necessary to translate strings between the Unicode characters that TADS uses internally and the coding systems. In almost all cases, TADS performs this translation automatically; when you display a string, for example, TADS translates the string to the display character set, and when you read a string from the keyboard, TADS translates the local character encoding to Unicode in the returned string.

In some cases, though, it's useful to be able to translate characters to and from Unicode, or from one local character set to another, under explicit program control. For these situations, TADS provides the CharacterSet intrinsic class. This class encapsulates a "character mapping," which defines the correspondences between local character codes and Unicode character codes.

To create a CharacterSet object, you use the new operator, specifying the name of the character set you want to translate to or from:

  local cs = new CharacterSet('us-ascii');

The CharacterSet object can then be used to specify the encoding to use for explicit character translations. You can use a CharacterSet in these situations:

You can specify the encoding of a text file you are reading or writing, by passing the CharacterSet to the fileOpen() function.
You can specify the interpretation of raw bytes in a ByteArray by passing the CharacterSet to the mapToString method.
You can specify how to encode a string into raw bytes by passing the CharacterSet to the mapToByteArray method of a string.

In addition, CharacterSet provides a few methods that let you get information about the character mapping it describes.

Note: when using the CharacterSet class, you should #include the system header file "charset.h".

Built-In And External Character Mappings

TADS 3 has several pre-defined character mappings built in to the system:

'US-ASCII' – the 7-bit ASCII character set. This "least common denominator" character set is available on practically every modern computer. Most computers extend this set by adding an additional set of accented letters and punctuation, but the extended sets vary by operating system and localization.
'ISO-8859-1' – the ISO 8859-1 character set, also known as ISO Latin-1. This is an 8-bit character set that contains the ASCII characters plus a set of punctuation and accented letters for Western European languages. This character set is not supported on all computers, but it has become widely supported because of its status as the default character set for HTTP.
'UTF-8' – the Unicode UTF-8 encoding. This encoding represents each 16-bit Unicode character as one, two, or three bytes; it is designed to be especially compact when coding strings that consist mostly of the ASCII subset of Unicode.
'UTF-16BE' – the 16-bit Unicode character set, in "big endian" representation: this means that each 16-bit character is encoded in a pair of 8-bit bytes, with the more significant byte first.
'UTF-16LE' – the 16-bit Unicode character set, in "little endian" representation: : this means that each 16-bit character is encoded in a pair of 8-bit bytes, with the more significant byte first.

The character sets above are available on every TADS 3 interpreter. In addition, TADS has a mechanism that allows new character set definitions to be added with external mapping files – see the Character Mapping documentation for details. You can use any character set for which an external mapping file exists on the local system, simply by using the mapping name in the CharacterSet constructor. (However, don't use the ".tcm" or other filename suffix – just use the plain mapping name.)

The standard TADS 3 distribution includes a full suite of external character mapping files, including all of the 8-bit Windows, MS-DOS, and Macintosh code pages, and the ISO Latin-1 through Latin-10 character sets. This standard suite is normally included with all distributions on all TADS 3 platforms, but individual platforms may add or delete some of these standard encodings, so it's best not to assume that a mapping is present just because it's in the standard suite. You can check at run-time to see if a given mapping is available using the isMappingKnown() method.

Unknown Character Mappings

You can create a CharacterSet object that refers to a character mapping that doesn't exist on the local system. This is legal and will not cause any errors at the time you create the object; however, if you try to use the object to perform any character mapping, an exception – UnknownCharSetException – will be thrown.

You can check to see if a character mapping is known by calling the isMappingKnown() method after creating the CharacterSet object. If this method returns true, the character set is known and you can use it to perform character mapping.

It is legal to create a CharacterSet referring to an unknown mapping because it would otherwise be impossible to save the state of a program that contains a CharacterSet object and then restore the state on another computer without the same character mappings.

CharacterSet Methods

getName() – returns a string giving the name of the character set. This is the same as the name that was used to create the character set object.

isMappable(val) – returns true if the character or characters val, which can be given as an integer (giving a Unicode character value) or a string of characters, can be mapped to characters in the character set, nil if not. If val is a string, the method returns true only if all of the characters in the string can be mapped.

Note that it is legal to map a string even if it contains unmappable characters, because the mapping process will simply map any unmappable characters to the "default" character defined in the character mapping. The default character varies by character set – it's part of the Unicode-to-local mapping definition – but it usually indicates visually that a character is missing; in some character sets it looks like an empty rectangle, and in others it's simply a question mark.

isMappingKnown() – returns true if the character set has a known mapping, nil if not. If this returns nil, any attempts to map characters using the object will throw a CharacterSetUnknownException.

IsRoundTripMappable(val) – returns true if the character or characters val, which can be given as an integer (giving a Unicode character value) or a string of characters, can be mapped to the local character set and then back to Unicode again with no loss of information. In other words, if converting val to the local character set and then converting it back to Unicode yields the original set of characters in val, then val has a round-trip mapping. The existence of a round-trip mapping generally means that the characters in val have an exact representation in the local character set, as opposed to an approximation. Approximations require either multiple local characters being used to represent a single local character, or a visually similar glyph being used as a graphical approximation. In the case of a mapping to multiple local characters, a round-trip mapping is inherently impossible because the string of multiple local characters will always map back to multiple Unicode characters, hence mapping to local and back will not yield the original string. Graphical approximations are usually achieved by mapping an accented Unicode character to an unaccented local character (such as a mapping from an "a" with an acute accent to a plain, unaccented "a"); these usually don't have round-trip mappings because the unaccented local character usually maps back to the unaccented Unicode character.

Examples

Example 1: Using a CharacterSet to determine if the local machine is capable of displaying Cyrillic characters.

If you're writing a game in Russian, you would probably want to make sure the player's computer is capable of displaying Cyrillic characters – if it weren't, the player probably wouldn't be able to read most of the text in your game.  You can do this by creating a CharacterSet object for the local system's display character set, and then testing a string of characters for mappability with the isMappable() method.

  #include <tads.h>

  #include <charset.h>

  testCyrillic(args)

    /* get the local display character set */

    local cs = new CharacterSet(getCharacterSet(CharsetDisplay));

/*

     *  Check a few representative Cyrillic alphabetic characters

     *  (see http://charts.unicode.org/Web/U0400.html)

*/

    if (cs.isMappable('\u0410\u0411\u041a\u042f\0430\0431\u044f'))

      "Warning: This game uses Cyrillic characters.  Your system

      does not appear to be localized for Russian, so the text

      in this game might not display properly.  You might need

      to adjust your system localization settings to display

      Cyrillic characters before you can play this game.  If

      you change your localization settings, please close and

      then re-start the game to ensure the new settings are used.";

Example 2: Translating a file from one character set to another.

This isn't a very typical situation for most games, but suppose you wanted to write a program that reads a text file that was saved in one character set and save it in a different character set – say, translate the file from the Macintosh Roman character set to ISO Latin-1. To do this, you would need a Mac Roman mapping definition on your computer, because this isn't one of the built-in character sets; assuming we had this mapping file (let's say it's called "MacRoman.tcm"), we could perform the translation quite easily using the text file functions.

  #include <tads.h>

  translate(inFileName, outFileName)

    local inFile, outFile;

    local csMac, csISO;

    /* create the character set objects */

    csMac = new CharacterSet('MacRoman');

    csISO = new CharacterSet('iso-8859-1');

    /* open the files */

    inFile = fileOpen(inFileName, 'rt', csMac);

    outFile = fileOpen(outFileName, 'wt', csISO);

    if (inFile == nil || outFile == nil)

      "Error: cannot open files.\n";

      return;

    /* read text and write it back out */

    for (;;)

      local txt;

      /* read a line of input; stop if at end of file */

      txt = fileRead(inFile);

      if (txt == nil)

        break;

      /* write it out */

      fileWrite(outFile, txt);

    /* close the files */

    fileClose(inFile);

    fileClose(outFile);

Note that creating CharacterSet objects isn't really necessary in this example, since we could have more simply passed the name of the character set directly to fileOpen().