Clause 7 (revised/draft)

The following text is intended as a `preview' of the possible form of the revised clause 7. Some alterations to layout have been necessitated by its `HTML'-isation. Be aware that no alteration to ISO/IEC 8211 exists until it has been accepted by ISO, and that anyway this is only a draft - I do not guarantee that the final form of the clause will match this text.


7 Use of coded character sets

This International Standard provides for the use of coded character sets in external file titles, field names, subfield labels, data fields and subfields of the A-type, and appropriate delimiters, by the methods of ISO/IEC 2022 and ISO/IEC 10646.

This International Standard requires the use of the C0 and G0 set of the Basic Character Set throughout a DDF except for the following:

  1. External file titles and field names.
  2. Subfield labels, which shall be encoded in BCS or, when it has been invoked, UCS-2, the two octet form, or UCS-4, the four octet form of ISO/IEC 10646. When using ISO/IEC 10646, the subfield labels shall be restricted to characters taken from the Basic Multilingual Plane at Level 1.
    NOTE 26 - The exclusion of some multiple byte character sets from Cartesian labels is a departure from ISO/IEC 8211-1985 but is necessary since some multiple byte sets do not include the necessary delimiter characters and others may be ambiguous. Moreover, processing escape sequences within system identifiers imposes significant overhead for direct access applications.

    For similar reasons, subfield labels have been restricted to ISO/IEC 10646 BMP level 1 in order to facilitate their use as data identifiers in processing programs.

  3. Data fields and subfields of character data type (A-type).
  4. User delimiters, which shall be encoded in the same character set as the data field or subfield that they terminate.

The foregoing text items, 1 to 4, which may include extended code sets, are termed `extensible text' in the rest of this clause. Their extent and the scope of their character sets shall include their delimiters, if any.

NOTE 27 - When UT and FT are used to terminate extensible text, they shall be encoded in the C0 set of BCS or in the C0 set of ISO/IEC 10646 UCS-2 or UCS-4, according to the character set in use within the item being terminated.
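
As an illustration of NOTE 27, the following minimal Python sketch encodes a terminator in the character set of the item it ends. The code positions for UT and FT (1/15 and 1/14), the big-endian octet order, and the name `encode_terminator` are assumptions made for the example and are not taken from this clause.

```python
# Assumptions (not stated in this clause): UT is 1/15 (0x1F), FT is 1/14 (0x1E),
# and UCS-2/UCS-4 octets are written most significant octet first.
UT = 0x1F
FT = 0x1E

# Octets per character for each character set named in NOTE 27.
OCTETS_PER_CHARACTER = {"BCS": 1, "UCS-2": 2, "UCS-4": 4}


def encode_terminator(terminator: int, charset: str) -> bytes:
    """Return the terminator encoded in the character set of the item it ends."""
    if charset not in OCTETS_PER_CHARACTER:
        raise ValueError(f"unknown character set: {charset!r}")
    return terminator.to_bytes(OCTETS_PER_CHARACTER[charset], "big")
```

Under these assumptions a UT ending a UCS-2 subfield occupies two octets and one ending a UCS-4 subfield occupies four.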

Default coded character set extensions may be announced for a file or separately for each field as described in this clause. Subclause 7.1 specifies the announcement of the method of choice and requirements which are common to both methods. Subclause 7.2 specifies the use of ISO/IEC 2022 and subclause 7.3 specifies the use of ISO/IEC 10646. Both code sets may be specified as filewise and fieldwise defaults or be used as in-line escape sequences. The use of appropriate escape sequences will allow the intermixture of the code sets in the same file or field.

7.1 Announcement of coded character set extension

This field, DDR Leader RP 7, shall specify if coded character sets other than the G0 set of BCS are used in the file. The values of this field shall have the following meanings:
SPACE Only the G0 set of BCS is used and there are no in-line escape sequences.
"E" The BCS is the default character set and in-line ISO/IEC 2022 escape sequences may be used.
"h" Collections 1 and 2 of the ISO/IEC 10646 coded character set form the default character set and in-line Identification of Feature control functions, CSI sequences and ESC sequences shall not be used.
"H" Collections 1 and 2 of the ISO/IEC 10646 coded character set form the default character set and in-line Identification of Feature control functions, CSI sequences and ESC sequences may be used.

NOTE 28 - The meaning of the above controls is further modified by the contents of DDR Leader RP 17 - 19 and the DDR Field Controls RP 0-2 (level 1) or RP 6-8 (levels 2 and 3) (see 7.2 and 7.3) which specify filewise and fieldwise default character sets.
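
The four values of DDR Leader RP 7 can be read as a small lookup table. The sketch below is illustrative only: it assumes that relative positions count from zero and that the leader is available as a byte string, and the names `ExtensionMode` and `extension_mode` are hypothetical.

```python
from typing import NamedTuple


class ExtensionMode(NamedTuple):
    default_set: str        # default character set implied by RP 7
    inline_allowed: bool    # whether in-line escape/control sequences may occur


# One entry per value listed in 7.1.
RP7_MEANINGS = {
    " ": ExtensionMode("G0 set of BCS", False),
    "E": ExtensionMode("BCS", True),
    "h": ExtensionMode("ISO/IEC 10646 collections 1 and 2", False),
    "H": ExtensionMode("ISO/IEC 10646 collections 1 and 2", True),
}


def extension_mode(leader: bytes) -> ExtensionMode:
    """Interpret DDR Leader RP 7 (assumed to be the octet at zero-based index 7)."""
    if len(leader) < 8:
        raise ValueError("leader is too short to contain RP 7")
    indicator = chr(leader[7])
    if indicator not in RP7_MEANINGS:
        raise ValueError(f"unrecognised RP 7 value: {indicator!r}")
    return RP7_MEANINGS[indicator]
```

The filewise and fieldwise defaults in RP 17-19 and the Field Controls still have to be consulted separately, as NOTE 28 points out.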

7.1.1 Scope of active character sets

The scope of an active character set, i.e., one which has been designated and invoked implicitly or explicitly, shall start at the beginning of and terminate at the end of an instance of extensible text. The scope of a user delimiter in a format control shall not include the enclosing parentheses.

An Extended Code Set designated and invoked by an in-line escape sequence within an A-type subfield shall apply to all subsequent A-type subfields until the end of the field or until replaced by another invocation, and its scope and invocation shall terminate at the end of the field.

7.1.2 Length of fields and subfields

The length of a field or subfield shall be the octet count including any control characters, escape sequences and multiple octet or octet character encodings.

When the use of extended character sets requires the presence of escape sequences, shift-in, shift-out or other control characters, the subfield width shall be indicated by delimiters and fixed width formats shall not be used.
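
To make the octet-count rule concrete, the following Python sketch computes the length of a delimited subfield. It is a sketch under stated assumptions: BCS is approximated by the `ascii` codec, UCS-2 and UCS-4 by the big-endian `utf-16-be` and `utf-32-be` codecs (which match only for characters of the Basic Multilingual Plane in the UCS-2 case), and the delimiter is counted as one character of the set in use. The function name is hypothetical.

```python
# Assumed codec stand-ins for the character sets of this clause.
CODECS = {"BCS": "ascii", "UCS-2": "utf-16-be", "UCS-4": "utf-32-be"}

# Octets occupied by one delimiter in each character set.
DELIMITER_OCTETS = {"BCS": 1, "UCS-2": 2, "UCS-4": 4}


def subfield_octet_length(text: str, charset: str,
                          escape_sequences: bytes = b"",
                          delimited: bool = True) -> int:
    """Count the octets of a subfield, including escape sequences and delimiter."""
    encoded = text.encode(CODECS[charset])
    length = len(escape_sequences) + len(encoded)
    if delimited:
        length += DELIMITER_OCTETS[charset]
    return length
```

For example, the two characters "AB" with a delimiter occupy 3 octets in BCS and 6 octets in UCS-2; a three-octet in-line escape sequence adds 3 more.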

7.1.3 Use of multiple octet character sets

This use is subject to the following conditions:

  1. Field names shall be written in the designated default set. Inline escape sequences may occur in the field names and the scope of the invocation shall terminate at the end of the field name excluding the terminator.
  2. The delimiters, "!" (EXCLAMATION MARK), "*" (ASTERISK) and "\\" (double REVERSE SOLIDUS) of the vector and Cartesian labels shall be written in the BCS or, when invoked, ISO/IEC 10646 BMP at level 1, UCS-2 or UCS-4.
  3. If multiple octet delimiters are used as the user delimiter, they shall be invoked and their scope shall terminate within the parentheses of the applicable format control.
  4. The octet counts of a variable-length bit subfield shall be coded as single octets in the BCS.

7.2 ISO/IEC 2022 coded character set extension

This International Standard permits the use of ISO/IEC 2022 for extended coded character sets in A-type data subfields, user delimiters, field names and external file titles. The use of coded character set extensions as specified in ISO/IEC 2022 as default character sets in A-type data subfields, field names and for user delimiters shall be limited to sets having three or four octet escape sequences associated with the announcing sequences as specified in ISO/IEC 2022. Within A-type data subfields, national variant sets and any C0, C1, G0, G1, G2 and G3 character sets may be used in the manner described by ISO/IEC 2022 without restriction to length of escape sequences.

NOTE 29 - National variant character sets are designated and invoked by use of their registered escape sequences in the same manner as other extended character sets. ISO/IEC 10646 sets may be designated from within an ISO/IEC 2022 character set.

7.2.1 Designation of ISO/IEC 2022 coded character sets

An extended character set shall be designated by the presence of its truncated escape sequence in the DDR and shall be invoked by default upon entry into the associated data field or subfield at which time the G0 set shall be invoked into columns 02-07 (in the 7-bit environment; 2-7) and the G1 set shall be invoked into columns 10-15. Other invocations shall require the use of an appropriate shift control character. If a G0 set has not been designated, the BCS is the G0 set by default. The scope of any invocation shall terminate at the end of each data field or subfield.

7.2.1.1 Use in the 7-bit environment

When an extended G1, G2 or G3 set has been specified as a default character set in a 7-bit environment, the extensible text shall begin with the G0 set unless an SO control or other shift control character is present to invoke the G1, G2 or G3 set.

7.2.2 Designation of default code set for file

A default character set shall be designated for a file by placing a SPACE or an "E" in DDR Leader RP 7 and the truncated escape sequence in DDR Leader RP 17-19.

The truncated escape sequence is the last (n-1) characters of the escape sequence used to specify the extended character set, where n is less than or equal to 4. The sequence shall be left justified and, if necessary, filled on the right with SPACEs. If there is no extension specified for a file, these three characters shall be SPACEs.
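
The truncation rule can be sketched in Python as follows. The function name `truncate_escape_sequence` is hypothetical, and the restriction to three or four octet sequences follows 7.2; the sample sequences in the usage note are illustrative inputs only.

```python
ESC = 0x1B

# Value of RP 17-19 when no extension is specified for the file.
NO_EXTENSION = b"   "


def truncate_escape_sequence(escape_sequence: bytes) -> bytes:
    """Drop the leading ESC and left-justify the rest in three octets."""
    n = len(escape_sequence)
    if n == 0 or escape_sequence[0] != ESC:
        raise ValueError("an escape sequence must start with ESC")
    if not 3 <= n <= 4:
        raise ValueError("only three or four octet escape sequences are allowed")
    # The last (n-1) characters, filled on the right with SPACEs.
    return escape_sequence[1:].ljust(3, b" ")
```

A three-octet sequence such as ESC (2/13) (4/1) becomes "-A" followed by one SPACE, while a four-octet sequence fills all three positions.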

7.2.3 Designation of default code sets for fields

A default character set having an n-character escape sequence for each field shall be designated by:

  1. placing a SPACE or an "E" in DDR Leader RP 7,
  2. placing (2/0)(2/1)(2/0) in DDR Leader RP 17-19,
  3. setting the value in the Field Control Length field, DDR Leader RP 10-11, to "03" for a level 1 DDF and "09" for a level 2 or 3 DDF, and
  4. placing the truncated escape sequence in the Field Controls RP 0-2 (level 1) or RP 6-8 (levels 2 and 3) of the appropriate DDR field.

Three SPACEs in the Field Controls shall mean that the corresponding Data Field is encoded in BCS.
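
The four steps of 7.2.3 can be sketched as two small Python helpers. This is an illustration under assumptions: the leader is taken to be 24 octets with zero-based relative positions, the RP 17-19 value is read as SPACE, "!", SPACE, and both function names are hypothetical.

```python
# (2/0)(2/1)(2/0): assumed reading of the fieldwise marker in RP 17-19.
FIELDWISE_MARKER = b" ! "


def announce_fieldwise_2022(leader: bytes, level: int,
                            inline_escapes: bool) -> bytes:
    """Apply steps 1 to 3 of 7.2.3 to a copy of the DDR leader."""
    if len(leader) != 24:
        raise ValueError("assumed leader length is 24 octets")
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    patched = bytearray(leader)
    patched[7:8] = b"E" if inline_escapes else b" "      # RP 7
    patched[10:12] = b"03" if level == 1 else b"09"      # RP 10-11
    patched[17:20] = FIELDWISE_MARKER                    # RP 17-19
    return bytes(patched)


def set_field_default(field_controls: bytes, level: int,
                      truncated: bytes) -> bytes:
    """Apply step 4: store the truncated sequence at RP 0-2 or RP 6-8."""
    start = 0 if level == 1 else 6
    if len(field_controls) != start + 3:
        raise ValueError("field controls have the wrong length for this level")
    if len(truncated) != 3:
        raise ValueError("truncated escape sequence must be three octets")
    return field_controls[:start] + truncated
```

Passing three SPACEs as `truncated` leaves the corresponding data field in BCS, in line with the rule above.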

7.2.4 ISO/IEC 2022 announcer sequence field (tag 0...3)

The ISO/IEC 2022 announcer sequences (see ISO/IEC 2022 clause 15) for the code extension services used may be supplied by placing the complete list of announcers in the DDR as the contents of the data descriptive field having the tag 0...3. The list of announcers shall be in one of the following formats:

  1. Announcers for the entire file:

    The list of announcers shall be the complete three octet sequences concatenated together without further demarcation. The field shall be terminated by a field terminator.

  2. Announcer for each field:

    The list of announcers shall comprise a list of field tags, each field tag followed immediately by the applicable list of announcers terminated by a unit separator. The last announcer shall be terminated by a field terminator.

    In the absence of this field or in the absence of a field tag from the contents of this field, the announcer sequence is ESC (2/0) (4/4) by default.
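
The two formats of the announcer field can be told apart by whether the contents begin with ESC. The following Python sketch parses both. It makes several assumptions that are not stated in this clause: UT is 0x1F and FT is 0x1E, the field tag length is known from elsewhere in the leader and passed in as `tag_length`, tags are BCS characters, and the field holds only three-octet announcers (the additional escape sequences of 7.2.4.1 are not handled). All names are hypothetical.

```python
from typing import Dict, List, Optional

ESC, UT, FT = b"\x1b", b"\x1f", b"\x1e"

# Default announcer ESC (2/0) (4/4) used when no announcer is given.
DEFAULT_ANNOUNCER = b"\x1b\x20\x44"


def split_announcers(blob: bytes) -> List[bytes]:
    """Split concatenated three-octet announcer sequences."""
    if len(blob) % 3 != 0:
        raise ValueError("announcers must be complete three-octet sequences")
    sequences = [blob[i:i + 3] for i in range(0, len(blob), 3)]
    if any(not seq.startswith(ESC) for seq in sequences):
        raise ValueError("every announcer must start with ESC")
    return sequences


def parse_announcer_field(data: bytes,
                          tag_length: int) -> Dict[Optional[str], List[bytes]]:
    """Return announcers keyed by field tag; the key None means the whole file."""
    if not data.endswith(FT):
        raise ValueError("field must end with a field terminator")
    body = data[:-1]
    if body.startswith(ESC):                 # format 1: entire file
        return {None: split_announcers(body)}
    result: Dict[Optional[str], List[bytes]] = {}
    for entry in body.split(UT):             # format 2: one entry per tag
        if entry:
            tag = entry[:tag_length].decode("ascii")
            result[tag] = split_announcers(entry[tag_length:])
    return result


def announcers_for(parsed: Dict[Optional[str], List[bytes]],
                   tag: str) -> List[bytes]:
    """Look up a tag, falling back to the filewise list or the default."""
    return parsed.get(tag) or parsed.get(None) or [DEFAULT_ANNOUNCER]
```

A tag missing from a fieldwise list, or a DDR with no such field at all (an empty mapping), resolves to the default announcer.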

NOTE 30 - These two cases can be resolved by the presence of the ESC at the start of case 1.

7.2.4.1 Additional escape sequences

Any additional G-sets allowed under the announcer sequence specified may be defined by adding their escape sequences immediately after the announcer sequences, if any, for either a file or for each field tag.

7.3 ISO/IEC 10646 coded character sets

This International Standard permits the use of ISO/IEC 10646 for extended coded character sets in extensible text. The default state for ISO/IEC 10646 coded character sets shall be:

  1. UCS-2, the two octet form,
  2. collections 1 and 2,
  3. the C0 and C1 sets from ISO/IEC 6429, and
  4. level 1, no combining characters.

Other levels and collections may be designated for the entire file or for each field as specified in this clause. The collections of ISO/IEC 10646 Annex A shall be specified by their collection numbers using BCS digits which shall be right-justified with left-zerofill.

NOTE 31 - The user is referred to ISO/IEC 10646 for a complete description of ISO/IEC 10646 character encoding.

The use of the ISO/IEC 10646 multi-octet encoding shall be subject to the specifications of 7.1.3.

7.3.1 Announcement of filewise default character set

The use of a filewise default UCS collection shall be announced by placing the "h" or "H" character in DDR Leader RP 7 and its right-adjusted, zero-filled, ISO/IEC 10646 collection number in DDR Leader RP 17-19.
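
A minimal Python sketch of the filewise announcement follows. It assumes a 24-octet leader with zero-based relative positions; the function names are hypothetical and the collection number passed in is whatever ISO/IEC 10646 Annex A assigns to the collection concerned.

```python
def collection_field(collection_number: int) -> bytes:
    """Format a collection number as three BCS digits, zero-filled on the left."""
    if not 0 <= collection_number <= 999:
        raise ValueError("collection number must fit in three digits")
    return f"{collection_number:03d}".encode("ascii")


def announce_filewise_ucs(leader: bytes, collection_number: int,
                          inline_controls: bool) -> bytes:
    """Set RP 7 and RP 17-19 of a copy of the DDR leader."""
    if len(leader) != 24:
        raise ValueError("assumed leader length is 24 octets")
    patched = bytearray(leader)
    patched[7:8] = b"H" if inline_controls else b"h"          # RP 7
    patched[17:20] = collection_field(collection_number)      # RP 17-19
    return bytes(patched)
```

With `inline_controls` false the leader carries "h", so in-line control functions and escape sequences are excluded as described in 7.1.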

7.3.2 Announcement of fieldwise default character set

The use of a fieldwise default UCS collection shall be announced by:

  1. placing an "h" or "H" character in DDR Leader RP 7,
  2. placing three "H" characters in DDR Leader RP 17-19,
  3. setting the value in the Field Control Length field, DDR Leader RP 10-11, to "03" for a level 1 DDF and "09" for a level 2 or 3 DDF, and
  4. placing its right-adjusted, zero-filled, ISO/IEC 10646 collection number in the Field Controls RP 0-2 (level 1) or 6-8 (levels 2 and 3) of the appropriate DDR field.
    NOTE 32 - See ISO/IEC 10646 Annex A for collection numbers.

Three SPACEs in the Field Controls shall mean that the corresponding Data Field is encoded in BCS.

The C1 control functions may be used within the data strings (see 7.1).

7.3.3 ISO/IEC 10646 feature identifier field (tag 0...3)

Any additional ISO/IEC 10646 features used in a file (see ISO/IEC 10646 clause 17) may be identified by placing the complete list of the additional feature identifiers in the DDR as the contents of the data descriptive field having the tag 0...3. These additional features shall be the default state of the file or a field and any collections specified shall be in addition to the specification of the data field description. The list of identifiers shall be in one of the following formats:

  1. Feature Identifiers for the entire file:

    The list of feature identifier sequences shall be the complete ISO/IEC 10646 sequences concatenated together without further demarcation. The field shall be terminated by a field terminator.

  2. Feature Identifiers for each field:

    The list of feature identifier sequences shall comprise a list of field tags, each field tag followed immediately by the applicable concatenated list of feature identifier sequences terminated by a unit separator. The last identifier shall be terminated by a field terminator.

NOTE 33 - These two cases can be resolved by the presence of the ESC or CSI character at the start of case 1.

Author: tony@lsl.co.uk

Last modified: Tue Aug 26 15:00:17 BST 1997