Note 788449 - Byte-order Marks in UTF-8 Files

Header
Version / Date: 2 / 2005-02-28
Priority: Recommendations/additional info
Category: Program error
Primary Component: BC-ABA-LA Syntax, Compiler, Runtime
Secondary Components:

Summary
Symptom
    1. The system does not detect that a file is encoded in UTF-8.
    2. If a UTF-8 file has a byte-order mark at the beginning, the first READ DATASET statement reads the byte-order mark into memory or, on a non-Unicode system, reads a '#' into memory.
Other terms

CL_ABAP_FILE_UTILITIES, CHECK_FOR_BOM

Reason and Prerequisites

A byte-order mark can be used to indicate that a file is encoded in UTF-8 or UTF-16. For UTF-8, the byte-order mark is the byte sequence EF BB BF. The byte-order mark only identifies the encoding and is not part of the file content, so it should not be read into memory.

Solution

The method CHECK_FOR_BOM of the class CL_ABAP_FILE_UTILITIES checks whether a file starts with a byte-order mark. If the method returns BOM_UTF8, the statement OPEN DATASET should be called with the addition "IN TEXT MODE ENCODING UTF-8" and the byte-order mark has to be skipped. Because the UTF-8 byte-order mark occupies the first three bytes of the file (byte positions 0 to 2), reading has to start at position 3. This can be done as follows:

OPEN DATASET ... IN TEXT MODE ENCODING UTF-8 FOR INPUT AT POSITION 3.
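
The check and the skip can be combined roughly as follows. This is a minimal sketch, not a definitive implementation: the file path and all variable names are hypothetical, the parameter name FILE_NAME is an assumption that should be verified in the Class Builder, and the sketch assumes that a file without a byte-order mark is also encoded in UTF-8. Handling of file access exceptions is omitted.

```abap
* Sketch only: the path and the variable names are hypothetical.
DATA: lv_file TYPE string VALUE '/tmp/input_utf8.txt',
      lv_pos  TYPE i VALUE 0,
      lv_line TYPE string.

* Assumption: CHECK_FOR_BOM takes the file name in the parameter FILE_NAME.
IF cl_abap_file_utilities=>check_for_bom( file_name = lv_file )
   = cl_abap_file_utilities=>bom_utf8.
  lv_pos = 3.   " skip the three BOM bytes EF BB BF
ENDIF.

OPEN DATASET lv_file IN TEXT MODE ENCODING UTF-8 FOR INPUT AT POSITION lv_pos.
IF sy-subrc = 0.
  DO.
    READ DATASET lv_file INTO lv_line.
    IF sy-subrc <> 0.
      EXIT.     " end of file reached
    ENDIF.
    WRITE / lv_line.
  ENDDO.
  CLOSE DATASET lv_file.
ENDIF.
```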

The method CHECK_FOR_BOM became available with 6.20 SAP_BASIS support package 47 and 6.40 SAP_BASIS support package 10.

If a file does not contain a byte-order mark, it may nevertheless be encoded in UTF-8. To check this, the class CL_ABAP_FILE_UTILITIES has a method CHECK_UTF8 which usually reads the first 8 kilobytes of the file. For details, see the online documentation.
The method CHECK_UTF8 became available with 6.20 SAP_BASIS support package 50 and 6.40 SAP_BASIS support package 12.
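
A possible use of CHECK_UTF8 is sketched below. The parameter names FILE_NAME and ENCODING and the constant ENCODING_UTF8 are assumptions that must be checked against the online documentation of your release; the file path and the variable names are hypothetical. The fallback to the default code page is only one possible reaction to a file that is not recognized as UTF-8.

```abap
* Sketch only: parameter and constant names are assumptions,
* the path and the variable names are hypothetical.
DATA: lv_file     TYPE string VALUE '/tmp/input.txt',
      lv_encoding LIKE cl_abap_file_utilities=>encoding_utf8.

CALL METHOD cl_abap_file_utilities=>check_utf8
  EXPORTING
    file_name = lv_file
  IMPORTING
    encoding  = lv_encoding.

IF lv_encoding = cl_abap_file_utilities=>encoding_utf8.
  OPEN DATASET lv_file IN TEXT MODE ENCODING UTF-8 FOR INPUT.
ELSE.
  " Not recognized as UTF-8: fall back to the default code page.
  OPEN DATASET lv_file IN TEXT MODE ENCODING DEFAULT FOR INPUT.
ENDIF.
```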

The OPEN DATASET statement does not currently support UTF-16. As a workaround, read the file in BINARY MODE and convert the content with CL_ABAP_CONV_IN_CE, or use the report RSCP_CONVERT_FILE (see Note 747615).
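
The binary-mode workaround might look like the following sketch. It assumes a little-endian UTF-16 file that starts with the two-byte byte-order mark FF FE and uses SAP code page 4103 (UTF-16LE); for a big-endian file, code page 4102 would be used instead. The file path and the variable names are hypothetical. The converted text is returned as one string that still contains the line-end characters of the file.

```abap
* Sketch only: assumes UTF-16LE with a leading BOM (FF FE);
* the path and the variable names are hypothetical.
DATA: lv_file TYPE string VALUE '/tmp/input_utf16le.txt',
      lv_xbuf TYPE xstring,
      lv_text TYPE string,
      lo_conv TYPE REF TO cl_abap_conv_in_ce.

* Start at byte position 2 to skip the two BOM bytes.
OPEN DATASET lv_file FOR INPUT IN BINARY MODE AT POSITION 2.
IF sy-subrc = 0.
  READ DATASET lv_file INTO lv_xbuf.   " reads the rest of the file
  CLOSE DATASET lv_file.

  " Code page 4103 = UTF-16LE
  lo_conv = cl_abap_conv_in_ce=>create( encoding = '4103'
                                        input    = lv_xbuf ).
  CALL METHOD lo_conv->read
    IMPORTING
      data = lv_text.
ENDIF.
```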

The reverse situation is that an ABAP program creates a UTF-8 file, and the communication partner expects a byte-order mark at the beginning of the file. In this case the method CREATE_UTF8_FILE_WITH_BOM of the class CL_ABAP_FILE_UTILITIES can be used. It creates a UTF-8 file which contains a UTF-8 byte-order mark. Subsequently, the file should be opened with

OPEN DATASET ... FOR APPENDING IN TEXT MODE ENCODING UTF-8.

Then TRANSFER can be used to write data into the file.
The method CREATE_UTF8_FILE_WITH_BOM became available with 6.20 SAP_BASIS support package 48 and 6.40 SAP_BASIS support package 11.
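
The write sequence described above can be sketched as follows. The file path, the variable names and the sample text are hypothetical, and the parameter name FILE_NAME is an assumption that should be verified in the Class Builder. Exception handling is omitted.

```abap
* Sketch only: the path, the variable names and the sample text
* are hypothetical.
DATA: lv_file TYPE string VALUE '/tmp/output_utf8.txt',
      lv_line TYPE string VALUE 'First line of data'.

* Creates the file containing only the UTF-8 byte-order mark.
CALL METHOD cl_abap_file_utilities=>create_utf8_file_with_bom
  EXPORTING
    file_name = lv_file.

* Append the data after the byte-order mark.
OPEN DATASET lv_file FOR APPENDING IN TEXT MODE ENCODING UTF-8.
IF sy-subrc = 0.
  TRANSFER lv_line TO lv_file.
  CLOSE DATASET lv_file.
ENDIF.
```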

Starting with SAP_BASIS 7.00, the OPEN DATASET statement has the additions "SKIPPING BYTE-ORDER MARK" (for reading files) and "WITH BYTE-ORDER MARK" (for writing files).
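
With these additions, the helper methods are no longer needed for UTF-8 files. The following sketch copies a file line by line, skipping the byte-order mark on input and writing one on output; both file paths and all variable names are hypothetical.

```abap
* Sketch only (SAP_BASIS 7.00 or higher): the paths and the
* variable names are hypothetical.
DATA: lv_in   TYPE string VALUE '/tmp/input_utf8.txt',
      lv_out  TYPE string VALUE '/tmp/output_utf8.txt',
      lv_line TYPE string.

OPEN DATASET lv_in FOR INPUT IN TEXT MODE ENCODING UTF-8
     SKIPPING BYTE-ORDER MARK.
IF sy-subrc = 0.
  OPEN DATASET lv_out FOR OUTPUT IN TEXT MODE ENCODING UTF-8
       WITH BYTE-ORDER MARK.
  IF sy-subrc = 0.
    DO.
      READ DATASET lv_in INTO lv_line.
      IF sy-subrc <> 0.
        EXIT.   " end of file reached
      ENDIF.
      TRANSFER lv_line TO lv_out.
    ENDDO.
    CLOSE DATASET lv_out.
  ENDIF.
  CLOSE DATASET lv_in.
ENDIF.
```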

Affected Releases
Software Component   Release   From Release   To Release   And subsequent
SAP_BASIS            60        620            640

Correction delivered in Support Package
Support Packages   Release   Package Name
SAP_BASIS          620       SAPKB62047
SAP_BASIS          620       SAPKB62048
SAP_BASIS          620       SAPKB62050
SAP_BASIS          640       SAPKB64010
SAP_BASIS          640       SAPKB64011
SAP_BASIS          640       SAPKB64012

Related Notes
1375438   Globalization Collection Note
1319517   Unicode Collection Note
1038151   Legacy files in LSMW: BOM for UTF-8 not possible
910857    Incorrect characters (#) in batch input session
863392    RFBIBL00: The first record is not a session record
855495    RFBIDE00 / RFBIKR00: First record is not a session record
752835    Usage of the file interfaces in Unicode systems
747615    Tool for converting files from one code page to another
27        Recommendations for the ABAP file interface