Double-Byte Character Set Support

VSC2 MF

Many of the world's languages use sets of characters that run into the thousands. Most computers use 8-bit bytes, and assign a different 8-bit code to represent each character; this scheme can represent no more than 256 different characters.

Ideally a COBOL programmer should not need to be aware of the internal code used to represent characters. However, in practice some features of the internal code can affect the source programmer, and this limitation to 256 different characters is one of the most restricting of these.

For this reason the Double-Byte Character Set (DBCS) is provided. In this scheme each character is represented by a 16-bit code, each character occupying a pair of adjacent bytes. This scheme can represent thousands of different characters.

The assignment of DBCS character codes to characters varies from country to country.

The 8-bit code used by your COBOL system is the American Standard Code for Information Interchange (ASCII). In this chapter this will be referred to as the Single-Byte Character Set (SBCS).

MF: Double-Byte Character Set support is sensitive to the DBCS Compiler directive.

See also the chapter Micro Focus Extensions for Double-Byte Character Support, primarily for Japanese language support.

DBCS Data

The DBCS Compiler directive makes your COBOL compiler recognize two data categories in which data is stored in DBCS. It does not prevent the use of other data categories; thus you can still use those data categories in which data is stored in SBCS.

Provided you have the necessary hardware support, DBCS data items used in input and output will be recognized, and their data displayed and accepted correctly, on devices such as screens, keyboards and printers.

Roman Script in DBCS

The character set that can be represented by SBCS is based on the Roman alphabet plus some other characters. In some countries the DBCS character codes also include codes for many of these characters.

On some hardware the character displayed is visibly different according to whether it is stored in SBCS or DBCS; for example, on some screens the DBCS code for a letter causes it to be displayed larger than its SBCS code does.

Multivendor Integration Architecture Support

Programs written to the NTT Multivendor Integration Architecture (MIA) specification are accepted by the COBOL compiler when the DBCS and CURRENCY-SIGN"92" directives are used.
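As a minimal sketch, the two directives could be set at the top of a source file with a $SET statement. This assumes your compiler accepts directives through $SET with the "$" in column 7; they can usually also be given on the compiler command line or in a directives file. The program name MIADEMO is hypothetical.

```cobol
      * Assumption: directives supplied through $SET ("$" in column 7).
      $SET DBCS CURRENCY-SIGN"92"
       IDENTIFICATION DIVISION.
       PROGRAM-ID. MIADEMO.
       PROCEDURE DIVISION.
           STOP RUN.
```

The value 92 is the decimal code of the "\" character in ASCII; on Japanese code pages this code is typically displayed as the yen sign.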

Source Programs

DBCS characters can be used in literals (since literals are data), in comments and comment-entries, and in user-defined words. Otherwise the DBCS directive does not change the range of characters that can be used in source programs - the program is still written using the COBOL character set (see Concepts of the COBOL Language).

Language Extensions

There are extensions to the PICTURE and USAGE clauses to define items that are to contain DBCS data. A new format of literal is required for DBCS data.

There are additional rules for various options, clauses and statements to define the behavior of DBCS data.

Except where otherwise stated, all the rules and features of COBOL remain applicable when DBCS is in use. The following sections give only the additional rules and formats pertaining to DBCS.

Comments and Comment-entries

SBCS and DBCS characters can be mixed freely in comments and comment-entries.

User-defined Words

Either SBCS or DBCS characters can be used in user-defined words for: Alphabet-name, Class-name, Condition-name, Data-name/Identifier, Record-name, File-name, Index-name, Mnemonic-name, Paragraph-name, Section-name, and Symbolic-character.

MF: SBCS and DBCS characters can be freely mixed in user-defined words. Where a character exists in both the DBCS and SBCS character sets, its DBCS and SBCS representations will not be regarded as equivalent. See the section Roman Script in DBCS.
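The following Data Division fragment is an illustrative sketch only. It assumes the DBCS directive is in effect and that the source file is saved in a DBCS encoding such as Shift-JIS; all data-names are hypothetical.

```cobol
      * A data-name written entirely in DBCS characters.
       01  顧客名            PIC X(20).
      * A data-name ending in the SBCS letter A.
       01  TOTAL-A           PIC 9(5).
      * A different data-name: the final letter is the DBCS
      * (full-width) form, which is not equivalent to SBCS "A".
       01  TOTAL-Ａ          PIC 9(5).
```

Because the SBCS and DBCS forms of a letter are not treated as equivalent, the last two entries define two distinct items.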

Spaces

Spaces in data of class DBCS will be represented by the DBCS code for space. A space character represented by a 2-byte code is referred to as a DBCS space.

MF: The values assigned to a DBCS space are sensitive to the DBCS and DBSPACE Compiler directives.

Data Items

DBCS Data Items

There is a class of data additional to the classes described in the section Class and Category of Data: it is called DBCS. It includes two data categories: DBCS and DBCS edited.

A data item of class DBCS is described by using the USAGE DISPLAY-1 clause. An item with this clause can have only the characters "G" and "B" in its PICTURE character-string. A "G" represents a DBCS character position; "B" is an editing character, and indicates a position that will always have a DBCS space inserted in editing. An item whose PICTURE character-string is all "G"s is of category DBCS; an item whose PICTURE character-string contains both "G"s and "B"s is of category DBCS edited.

Note that each "G" or "B" represents one 2-byte character position. Except where otherwise stated, the length of the data item for all purposes is the number of "G"s and "B"s in its PICTURE character-string.

For reference modification, the leftmost-character-position and length specify the number of DBCS characters, not bytes.

Data items of class DBCS can be used wherever data items of class alphanumeric can be used, subject to rules and exceptions given in the appropriate places in this chapter.
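The rules above might be sketched as follows. This is a fragment, not a complete program; it assumes the DBCS directive is in effect and that a repetition count in parentheses, as in G(10), is accepted in the usual PICTURE notation. The data-names are hypothetical.

```cobol
      * Data Division entries
      * Category DBCS: 10 character positions, 20 bytes of storage.
       01  CUST-NAME        PIC G(10)  USAGE DISPLAY-1.
      * Category DBCS edited: 5 character positions (10 bytes);
      * a DBCS space is inserted at the "B" position in editing.
       01  CUST-NAME-ED     PIC GGBGG  USAGE DISPLAY-1.

      * Procedure Division statement
      * Reference modification counts DBCS characters, not bytes:
      * this selects characters 3 and 4, that is, bytes 5 through 8.
           DISPLAY CUST-NAME (3:2).
```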

Mixed Data Items

DBCS characters can be included in data stored in data items of category alphanumeric. In such data, SBCS characters are represented by SBCS codes and DBCS characters by DBCS codes. Each space character is represented by the SBCS code for space.

On input and output both the SBCS and the DBCS codes will be recognized. The first byte of a DBCS code is never a valid SBCS code; hence the two can be used together without confusion. But in operations within the program the data will be treated as ordinary alphanumeric data. It is the programmer's responsibility to ensure that the two halves of a DBCS code do not get separated.

The length of the data item for all purposes is its length in bytes.
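As an illustration of the byte-based length, consider the hypothetical item below, assuming an encoding in which every DBCS character occupies exactly two bytes.

```cobol
      * Category alphanumeric: length is 12 bytes.
      * It could hold, for example, 4 SBCS characters followed by
      * 4 DBCS characters: 4 + (4 * 2) = 12 bytes.
       01  MIXED-FIELD      PIC X(12).

      * Procedure Division statement
      * Reference modification here counts bytes. With the content
      * described above, bytes 5 and 6 form the first DBCS character;
      * selecting (5:1) would split the two halves of a DBCS code.
           DISPLAY MIXED-FIELD (5:2).
```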

Literals

DBCS Literals

In addition to the nonnumeric, numeric and national literals described in the section Literals, there is a fourth type of literal: the DBCS literal.

A DBCS literal is a character-string delimited at both ends by quotation marks or apostrophes, with the beginning delimiter preceded by a "G". It can consist of any characters in the computer's DBCS character set. It can be up to 28 DBCS characters in length. It cannot be continued across lines.

Whether quotation marks or apostrophes are used, the presence of that delimiter within a DBCS literal can be represented by two contiguous occurrences. The presence of the character that is not serving as the delimiter is represented by a single occurrence. The value of a DBCS literal in the object program is the string of characters itself, except:

  1. The initial G and the delimiters are excluded, and
  2. Each embedded pair of contiguous delimiter characters represents a single character.
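A short sketch of DBCS literals used as initial values follows. It assumes the DBCS directive is in effect and that the source file is saved in a DBCS encoding such as Shift-JIS; the data-names and the Japanese text are illustrative only.

```cobol
      * Quotation marks as delimiters; the value is 5 DBCS characters.
       01  GREETING-1       PIC G(5)  USAGE DISPLAY-1
                            VALUE G"こんにちは".
      * Apostrophes as delimiters; the value is the same.
       01  GREETING-2       PIC G(5)  USAGE DISPLAY-1
                            VALUE G'こんにちは'.
```

In both cases the initial G and the delimiters are not part of the value stored in the item.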

Category of DBCS Literals

All DBCS literals can be used wherever nonnumeric literals can be used, subject to rules and exceptions given in the appropriate places in this chapter.

Mixed Literals

DBCS characters can be included in nonnumeric literals. A nonnumeric literal that includes DBCS characters is called a mixed literal. In such a literal, SBCS characters are represented by SBCS codes and DBCS characters by DBCS codes. Each space character is represented by the SBCS code for space.

On output both the SBCS and the DBCS codes will be recognized. The first byte of a DBCS code is never a valid SBCS code; hence the two can be used together without confusion. But in operations within the program the literal will be treated as an ordinary nonnumeric literal. It is the programmer's responsibility to ensure that the two halves of a DBCS code do not get separated.

A nonnumeric literal is of category alphanumeric, not DBCS, regardless of whether it includes DBCS characters.

A mixed literal cannot be continued across lines.

MF: This restriction has been removed.

Figurative Constants

If a figurative constant is used where only a DBCS literal is allowed (according to the rules concerning classes and categories given in the appropriate places in this chapter), it is a DBCS literal. Each space in this literal is a DBCS space.

Only the figurative constant SPACE(S) can be a DBCS literal.
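The fragment below sketches one likely use. It assumes that a MOVE to an item of class DBCS is a context in which only a DBCS literal is allowed, which is governed by the rules for the MOVE statement rather than by this section. The data-name is hypothetical.

```cobol
      * Data Division entry
       01  KANJI-NAME       PIC G(10)  USAGE DISPLAY-1.

      * Procedure Division statement
      * Under the assumption above, SPACES is treated as a DBCS
      * literal, so all 10 positions receive a DBCS space (20 bytes).
           MOVE SPACES TO KANJI-NAME.
```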

MF COB370

Another format of literal, equivalent to the DBCS literal, is used in COBOL/370 and the MIA COBOL specification.

General Format
Syntax Rules
  1. An N-literal can contain no more than 18 DBCS characters, and cannot be split over two lines.
  2. An N-literal can contain only double-byte characters from your computer's Double-Byte Character Set.
  3. Any double-byte quotation marks used in the literal should be written twice. For example, in order to express a double-byte quotation mark in the literal, you should write:

    N"ABC""DEF"

  4. N-literal specification and behavior can be modified in exactly the same way as for G-literals, using the APOST Compiler directive to replace the quotation mark (double-line) delimiter by an apostrophe (single-line).
General Rules
  1. The N-literal can be used in conjunction with ALL to make a figurative constant (see the section Figurative Constants).
  2. All characters must be double-byte characters.
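The entries below are an illustrative sketch of N-literals. They assume that the N prefix is interpreted as introducing a DBCS literal (as in COBOL/370 and the MIA specification) rather than a national literal, and that the source file uses a DBCS encoding such as Shift-JIS. The data-names are hypothetical.

```cobol
      * An N-literal of 3 DBCS characters as an initial value.
       01  TITLE-TEXT       PIC G(3)  USAGE DISPLAY-1
                            VALUE N"日本語".
      * ALL with an N-literal forms a figurative constant; here
      * the item would be filled with the DBCS asterisk.
       01  RULE-LINE        PIC G(8)  USAGE DISPLAY-1
                            VALUE ALL N"＊".
```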