| Oracle®
Database Globalization Support Guide 10g Release 1 (10.1) Part Number B10749-01 |
|
|
View PDF |
This chapter describes how to use Oracle's database access products with Unicode. It contains the following topics:
Oracle offers several database access products for inserting and retrieving Unicode data. Oracle offers database access p
roducts for commonly used programming environments such as Java and C/C++. Data is transparently converted between the database and c
lient programs, which ensures that client programs are independent of the database character set and national character set. In addit
ion, client programs are sometimes even independent of the character datatype, such as NCHAR or CHAR, used
in the database.
To avoid overloading the database server with data conversion operations, Oracle always tries to move them to the client side database access products. In a few cases, data must be converted in the database, which affects performance. This chapter discusses details of the data conversion paths.
Oracle Corporation offers a comprehensive set of database access products that allow programs from different development environments to access Unicode data stored in the database. These products are listed in Table 7-1.
| Programming Environment | Oracle Database Access P roducts |
|---|---|
|
C/C++ |
Oracle Call Interface (OCI) |
|
Java |
a>
Oracle JDBC OCI or thin driver |
tr>
|
PL/SQL |
Oracle PL/SQL and SQL |
|
Visual Basic/C# |
Oracle ODBC driver |
Figure 7-1 shows how the database access products can access the database.
Text description o f the illustration nlspg024.gif
The Oracle Call Interface (OCI) is the lowest level API
that the rest of the client-side database access products use. It provides a flexible way for C/C++ programs to access Unicode data
stored in SQL CHAR and NCHAR datatypes. Using OCI, you can programmatically specify the character set (UTF-
8, UTF-16, and others) for the data to be inserted or retrieved. It accesses the database through Oracle Net.
Oracle Pro*C/C++ enables you to embed SQL and PL/SQL in your programs. It uses OCI's Unicode capabilities to provi
de UTF-16 and UTF-8 data access for SQL CHAR and NCHAR datatypes.
The Oracle ODBC driver enables C/C++, Visual Basic, and VBScript programs running on Windows platforms to access Unicode data stored
in SQL CHAR and NCHAR datatypes of the database. It provides UTF-16 data access by implementing the
SQLWCHAR interface specified in the ODBC standard specification.
The Oracle Provider
for OLE DB enables C/C++, Visual Basic, and VBScript programs running on Windows platforms to access Unicode data stored in SQL NCHAR datatypes. It provides UTF-16 data access through wide string OLE DB datatypes.
The Oracle Data Provider for .NET enables programs running in any .NET programming environment on Windows p
latforms to access Unicode data stored in SQL CHAR and NCHAR datatypes. It provides UTF-16 data access thro
ugh Unicode datatypes.
Oracle JDBC drivers are the primary Java programmatic interface for accessing an Oracle database. Oracle provides the following JDBC drivers:
All drivers support Unicode data access to SQL CHAR and NCHAR datatypes in the database.
The PL/SQL and SQL engines process PL/SQL programs and SQL statements on behalf of client-side
programs such as OCI and server-side PL/SQL stored procedures. They allow PL/SQL programs to declare CHAR, VARCHA
R2, NCHAR, and NVARCHAR2 variables and to access SQL CHAR and NCHAR dataty
pes in the database.
The following sections describe how each of the database access produc ts supports Unicode data access to an Oracle database and offer examples for using those products:
SQL is the fundamental language with which all programs and users access data in an Oracl e database either directly or indirectly. PL/SQL is a procedural language that combines the data manipulating power of SQL with the d ata processing power of procedural languages. Both SQL and PL/SQL can be embedded in other programming languages. This section descri bes Unicode-related features in SQL and PL/SQL that you can deploy for multilingual applications.
This section contains the following topics:
See A
lso:
|
There are three SQL NCHAR datatypes:
NCHAR datatype, the length is always specified a
s the number of characters. For example, the following statement creates a column with a maximum length of 30 characters:
CREATE TABLE table1 (column1 NCHAR(30));
The maximum number of bytes for the column is determined as follows:
maximum number of bytes = (maximum number of characters) x (maximum number of bytes for each character)
For example, if the national character set is UTF8, then the maximum byte length is 30 characters times 3 b ytes for each character, or 90 bytes.
The national character set, which is used for all
The maximum column size allowed is 2000 characters when the national c
haracter set is UTF8 and 1000 when it is AL16UTF16. The actual data is subject to the maximum byte limit of 2000. The two size constr
aints must be satisfied at the same time. In PL/SQL, the maximum length of NCHAR data is 32767 bytes. You can define an
NCHAR variable of up to 32767 characters, but the actual data cannot exceed 32767 bytes. If you insert a value that is s
horter than the column length, then Oracle pads the value with blanks to whichever length is smaller: maximum character length or max
imum byte length.
The NVARCHAR2 datatype specifies a variable length character string that uses the national char
acter set. When you create a table with an NVARCHAR2 column, you specify the maximum number of characters for the column
. Lengths for NVARCHAR2 are always in units of characters, just as for NCHAR. Oracle subsequently stores ea
ch value in the column exactly as you specify it, if the value does not exceed the column's maximum length. Oracle does not pad the s
tring value to the maximum length.
The maximum column size allowed is 4000 characters when
the national character set is UTF8 and 2000 when it is AL16UTF16. The maximum length of an NVARCHAR2 column in bytes is
4000. Both the byte limit and the character limit must be met, so the maximum number of characters that is actually allowed in an
I
n PL/SQL, the maximum length for an NVARCHAR2 variable is 32767 bytes. You can define NVARCHAR2 variables u
p to 32767 characters, but the actual data cannot exceed 32767 bytes.
The following stateme
nt creates a table with one NVARCHAR2 column whose maximum length in characters is 2000 and maximum length in bytes is 4
000.
CREATE TABLE table2 (column2 NVARCHAR2(2000));
Oracle supports implicit conversions between SQL NCHAR datatypes
and other Oracle datatypes, such as CHAR, VARCHAR2, NUMBER, DATE, ROWID, and CLOB. Any implicit conversions for CHAR and VARCHAR2 datatypes are also supported fo
r SQL NCHAR datatypes. You can use SQL NCHAR datatypes the same way as SQL CHAR datatypes.
Type conversions between SQL CHAR datatypes and SQL NCHAR datatypes
may involve character set conversion when the database and national character sets are different. Padding with blanks may occur if th
e target data is either CHAR or NCHAR.
Data loss can occur during datatype convers
ion when character set conversion is necessary. If a character in the source character set is not defined in the target character set
, then a replacement character is used in its place. For example, if you try to insert NCHAR data into a regular C
HAR column and the character data in NCHAR (Unicode) form cannot be converted to the database character set, then
the character is replaced by a replacement character defined by the database character set. The NLS_NCHAR_CONV_EXCP ini
tialization parameter controls the behavior of data loss during character type conversion. When this parameter is set to TRUE
code>, any SQL statements that result in data loss return an ORA-12713 error and the corresponding operation is stopped.
When this parameter is set to FALSE, data loss is not reported and the unconvertible characters are replaced with repla
cement characters. The default value is TRUE. This parameter works for both implicit and explicit conversion.
In PL/SQL, when data loss occurs during conversion of SQL CHAR and NCHAR da
tatypes, the LOSSY_CHARSET_CONVERSION exception is raised for both implicit and explicit conversion.
In some cases, conversion between datatypes is possible i n only one direction. In other cases, conversion in both directions is possible. Oracle defines a set of rules for conversion between datatypes. Table 7-2 contains the rules for conversion between datatypes.
| Statement | Rule |
|---|---|
|
|
Values are converted to the datatype of the target database column. |
|
|
Data from the database is converted to the datatype of the target variable. |
|
Variable assignments |
Values on the right of the equal sign are converted to the datatype of the t arget variable on the left of the equal sign. |
|
Parameters in SQL and PL/SQL functions |
|
|
Concatenation || operation or |
<
a name="1006189">
If one operand is a SQL |
|
SQL |
Character value is converted to |
|
SQL |
Character value is
converted to |
|
SQL |
Character datatypes are converted to |
|
SQL |
Character values are converted to |
|
SQ
L |
a>
Character values are converted to |
|
SQL |
Character values are converted to |
|
SQL <
code>CHAR or |
Character values are converted to |
|
SQL |
Comparisons between SQL
When When When there is compari
son between SQL |
SQL CHAR datatypes and other datatypes using explicit conversion f
unctions. The examples in this section use the table created by the following statement:
CR EATE TABLE customers (id NUMBER, name NVARCHAR2(50), address NVARCHAR2(200), birthdate DATE);
The TO_NCHAR function converts the data at run tim
e, while the N function converts the data at compilation time.
INSERT INTO cus tomers VALUES (1000, TO_NCHAR('John Smith'),N'500 Oracle Parkway',sysdate);< h4 class="ET">Example 7-2 Selecting from the Customer Table Using the TO_CHAR Function< /em>
The following statement converts the values of name from characte
rs in the national character set to characters in the database character set before selecting them according to the LIKE
clause:
SELECT name FROM customers WHERE TO_CHAR(name) LIKE '%Sm%';
You should see the following output:
NAME -------------------------------------- John Smith
Using the N function shows that either NCHAR or CHAR data can be passed as parameters for the TO_DATE function. The datatypes can mixed because they are converted at run
time.
DECLARE ndatestring NVARCHAR2(20) := N'12-SEP-1975'; BEGIN SELECT name into ndstr FROM customers WHERE (birthdate)> TO_DATE(nd atestring, 'DD-MON-YYYY', N'NLS_DATE_LANGUAGE = AMERICAN'); END;
As demonstrated in Example 7-3, SQL NCHAR data can be pas
sed to explicit conversion functions. SQL CHAR and NCHAR data can be mixed together when using multiple str
ing parameters.
| See Also:
Oracle Database SQL Reference for more information about explicit conversion functions for SQL |
Most SQL
functions can take arguments of SQL NCHAR datatypes as well as mixed character datatypes. The return datatype is based
on the type of the first argument. If a non-string datatype like NUMBER or DATE is passed to these function
s, then it is converted to VARCHAR2. The following examples use the customer table created in "SQL Functions for Unicode Datatypes".
SELECT INSTR(name, N'Sm', 1, 1) FROM customers;
SELECT CONCAT(name,id) FROM customers;
id is converted to NVARCHAR2 and then conc
atenated with name.
SELECT RPAD(name,100,' ') FROM customers;
The following output results:
RPAD(NAME,100, '') ------------------------------------------ John Smith
Space character ' ' is converted to the corresponding character in the NCHAR charac
ter set and then padded to the right of name until the total display length reaches 100.
| See Also: |
You can inpu t Unicode string literals in SQL and PL/SQL as follows:
N before a string literal that is enclosed with single quote marks. This explicitly indicates that the following
string literal is an NCHAR string literal. For example, N'12-SEP-1975' is an NCHAR string lite
ral.NCHAR datatypes, a string literal is converted to a SQL NCHAR datatype wherever n
ecessary.
NCHR(n<
code>) SQL function, which returns a unit of character code in the national character set, which is AL16UTF16 or UTF8. The res
ult of concatenating several NCHR(n) functions is NVARCHAR2 data. In thi
s way, you can bypass the client and server character set conversions and create an NVARCHAR2 string directly. For examp
le, NCHR(32) represents a blank character.
Because NCHR(<
code>n) is associated with the national character set, portability of the resulting value is limited to appl
ications that run in the national character set. If this is a concern, then use the UNISTR function to remove portabilit
y limitations.
UNISTR('string')
SQL function. UNISTR('string') converts a string to the national character set. To ensure portabi
lity and to preserve data, include only ASCII characters and Unicode encoding in the following form: \xxxx, where
xxxx is the hexadecimal value of a character code value in UTF-16 encoding format. For example, UNISTR('G\0061ry') represents 'Gary'. The ASCII characters are converted to the database character set and then to the national character
set. The Unicode encoding is converted directly to the national character set.The l ast two methods can be used to encode any Unicode string literals.
The UTL_FILE package was enhanced in Oracle9i to handle Unicode
national character set data. The following functions and procedures were added:
FOPEN_NCHAR
This function opens a file in Unicode for i nput or output, with the maximum line size specified. With this function, you can read or write a text file in Unicode instead of in the database character set.
GET_LINE_NCHAR
This procedure reads text from the open file identified by the file handle and places the text in the out put buffer parameter. With this procedure, you can read a text file in Unicode instead of in the database character set.
PUT_NCHAR
This procedure wr ites the text string stored in the buffer parameter to the open file identified by the file handle. With this procedure, you can writ e a text file in Unicode instead of in the database character set.
P
UT_LINE_NCHAR
This procedure writes the text string stored in the buffer paramet er to the open file identified by the file handle. With this procedure, you can write a text file in Unicode instead of in the databa se character set.
PUTF_NCHAR
<
p class="BP1">This procedure is a formatted PUT_NCHAR procedure. With this procedure, you can write a text file in Unico
de instead of in the database character set.
| See Also:
PL/SQL Packages and Types Reference for more information about the |
OCI is the lowest -level API for accessing a database, so it offers the best possible performance. When using Unicode with OCI, consider these topics:< /p>
The OCIEnvNlsCreate() function is used to specify a SQL CHAR
character set and a SQL NCHAR character set when the OCI environment is created. It is an enhanced version of the
OCIEnvCreate() function and has extended arguments for two character set IDs. The OCI_UTF16ID UTF-16 character set ID replaces
the Unicode mode introduced in Oracle9i release 1 (9.0.1). For example:
OCIEnv *envhp; status = OCIEnvNlsCreate((OCIEnv **)&envhp, (ub4)0, (void *)0, (void *(*) ()) 0, (void *(*) ()) 0, (void(*) ( )) 0, (size_t) 0, (void **)0, (ub2)OCI_UTF16ID, /* Metadata and SQL CHAR character set */ (ub2)OCI_UTF16ID /* SQL NCHAR character set */);
The Unicode mode, in which the OCI_UTF16 flag is used with the OCIEnvCreate() function, is
deprecated.
When OCI_UTF16ID is specified for both SQL CHAR and SQL NCHA
R character sets, all metadata and bound and defined data are encoded in UTF-16. Metadata includes SQL statements, user names,
error messages, and column names. Thus, all inherited operations are independent of the NLS_LANG setting, and all metat
ext data parameters (text*) are assumed to be Unicode text datatypes (utext*) in UTF-16 encoding.
To prepare the SQL statement when the OCIEnv() function is initialized with the OCI_UTF
16ID character set ID, call the OCIStmtPrepare() function with a (utext*) string. The following example run
s on the Windows platform only. You may need to change wchar_t datatypes for other platforms.
const wchar_t sqlstr[] = L"SELECT * FROM ENAME=:ename";
...
OCIStmt* stmt
hp;
sts = OCIHandleAlloc(envh, (void **)&stmthp, OCI_HTYPE_STMT, 0,
NULL);
status = OCIStmtPrepare(stmthp, errhp,(const text*)sqlstr,
wcslen(sqlstr),
OCI_NTV_SYNTAX, OCI_DEFAULT);
To bind and d
efine data, you do not have to set the OCI_ATTR_CHARSET_ID attribute because the OCIEnv() function has alre
ady been initialized with UTF-16 character set IDs. The bind variable names must be also UTF-16 strings.
/* Inserting Unicode data */
OCIBindByName(stmthp1, &bnd1p, errhp, (const text*)L":ename",
(sb4)wcslen(L":ename"),
(void *) ename, sizeof(ename), SQLT_STR, (void
*)&insname_ind,
(ub2 *) 0, (ub2 *) 0, (ub4) 0, (ub4 *)0,
a>OCI_DEFAULT);
OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *)
&ename_col_
len,
(ub4) 0, (ub4)OCI_ATTR_MAXDATA_SIZE, errhp);
...
/*
Retrieving Unicode data */
OCIDefineByPos (stmthp2, &dfn1p, errhp, (ub4)1, (void *)ename,
(sb4)sizeof(ename), SQLT_STR, (void *)0, (ub2 *)0,
(ub2*)0,
(ub4)OCI_DEFAULT);
The OCIExecute() function per
forms the operation.
Unicod e character set conversions take place between an OCI client and the database server if the client and server character sets are diff erent. The conversion occurs on either the client or the server depending on the circumstances, but usually on the client side.
< a name="1006456">You can lose data during conversion if you call an OCI API inapprop riately. If the server and client character sets are different, then you can lose data when the destination character set is a smalle r set than the source character set. You can avoid this potential problem if both character sets are Unicode character sets (for exam ple, UTF8 and AL16UTF16).
When you bind or define SQL NCHAR datatypes, you sho
uld set the OCI_ATTR_CHARSET_FORM attribute to SQLCS_NCHAR. Otherwise, you can lose data because the data i
s converted to the database character set before converting to or from the national character set. This occurs only if the database c
haracter set is not Unicode.
Redundant data conversions can cause performance degradation in your OCI applications. These conversions occur in two cases:
CHAR datatypes and set the SQLCS_NCHAR, data conversions take place from client character set to the
national database character set, and from the national character set to the database character set. No data loss is expected, but two
conversions happen, even though it requires only one.NCHAR datatypes and do not set OCI_ATTR_CHARSET_FORM, data conversions take place from client characte
r set to the database character set, and from the database character set to the national database character set. In the worst case, d
ata loss can occur if the database character set is smaller than the client's.To avo
id performance problems, you should always set OCI_ATTR_CHARSET_FORM correctly, based on the datatype of the target colu
mns. If you do not know the target datatype, then you should set the OCI_ATTR_CHARSET_FORM attribute to SQLCS_NCHA
R when binding and defining.
Table 7-3 conta ins information about OCI character set conversions.
Data conversion can result in data expansion, which can cause a buffer to overflow. For binding
operations, you need to set the OCI_ATTR_MAXDATA_SIZE attribute to a large enough size to hold the expanded data on the
server. If this is difficult to do, then you need to consider changing the table schema. For defining operations, client application
s need to allocate enough buffer space for the expanded data. The size of the buffer should be the maximum length of the expanded dat
a. You can estimate the maximum buffer length with the following calculation:
Th is method is the simplest and quickest way, but it may not be accurate and can waste memory. It is applicable to any character set co mbination. For example, for UTF-16 data binding and defining, the following example calculates the client buffer:
ub2 csid = OCI_UTF16ID;
oratext *selstmt = "SELECT ename FROM emp";
a>counter = 1;
...
OCIStmtPrepare(stmthp, errhp, selstmt, (ub4)strlen((char*)selstmt),
<
a name="1006596"> OCI_NTV_SYNTAX, OCI_DEFAULT);
OCIStmtExecute ( svchp, stmthp, errhp, (ub4)0
, (ub4)0,
(CONST OCISnapshot*)0, (OCISnapshot*)0,
OCI_D
ESCRIBE_ONLY);
OCIParamGet(stmthp, OCI_HTYPE_STMT, errhp, &myparam, (ub4)counter);
OC
IAttrGet((void*)myparam, (ub4)OCI_DTYPE_PARAM, (void*)&col_width,
(ub4*)0, (ub4)OCI_ATTR_DATA_SI
ZE, errhp);
...
maxenamelen = (col_width + 1) * sizeof(utext);
cbu
f = (utext*)malloc(maxenamelen);
...
OCIDefineByPos(stmthp, &dfnp, errhp, (ub4)1, (vo
id *)cbuf,
(sb4)maxenamelen, SQLT_STR, (void *)0, (ub2 *)0,
(ub2*)0, (ub4)OCI_DEFAULT);
OCIAttrSet((void *) dfnp, (ub4) OCI_HTYPE_DEFINE, (void *) &csid,
(ub4) 0, (ub4)OCI_ATTR_CHARSET_ID, errhp);
OCIStmtFetch(stmthp, errhp, 1, OCI_FETCH_NEXT
, OCI_DEFAULT);
...
You can use UTF8 and AL32UTF8 by setting NLS_LANG for OCI client applications. If you do not
need supplementary characters, then it does not matter whether you choose UTF8 or AL32UTF8. However, if your OCI applications might h
andle supplementary characters, then you need to make a decision. Because UTF8 only supports characters of up to three bytes, no supp
lementary character can be represented in UTF8. In AL32UTF8, one supplementary character is represented in one code point, totalling
four bytes.
Do not set NLS_LANG to AL16UTF16, because AL16UTF16 is the nationa
l character set for the server. If you need to use UTF-16, then you should specify the client character set to OCI_UTF16ID, using the OCIAttrSet() function when binding or defining data.
To specify a Unicode character set for binding and defining data with SQL CHA
R datatypes, you may need to call the OCIAttrSet() function to set the appropriate character set ID after O
CIBind() or OCIDefine() APIs. There are two typical cases:
OCIBind() or OCIDefine() followed by OCIAttrSet() to specify UTF-16 Unic
ode character set encoding. For example:
... ub2 csid = OCI_UTF16ID; utext ename[100]; /* enough buffer for ENAME */ ... /* Inserting Uni code data */ OCIBindByName(stmthp1, &bnd1p, errhp, (oratext*)":ENAME", (sb4)strlen((char *)":ENAME"), (void *) ename, sizeof(ename), SQLT_STR, (void *)&insname_ind, (ub2 *) 0, (ub2 *) 0, (ub4) 0, (ub4 *)0, OCI_DEFAULT); OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *) &csid, (ub4) 0, (ub4)OCI_ATTR_CHARSET_ID, errhp); OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *) &ename_col_len, (ub4) 0 , (ub4)OCI_ATTR_MAXDATA_SIZE, errhp); ... /* Retrieving Unicode data */ OCIDefineByPos (stmthp2, &dfn1p, errhp, (ub4)1, (void *)ename, (sb4)sizeof(ename), SQLT_STR, (void *)0, (ub2 *)0, (ub2*)0, (ub4)OCI_DEFAULT); OCIAttrSet((v oid *) dfn1p, (ub4) OCI_HTYPE_DEFINE, (void *) &csid, (ub4) 0, (ub4)OCI_ATTR_CHARSET_ID, errhp); ...
If bound buffers are of the ut
ext datatype, then you should add a cast (text*) when OCIBind() or OCIDefine() is calle
d. The value of the OCI_ATTR_MAXDATA_SIZE attribute is usually determined by the column size of the server character set
because this size is only used to allocate temporary buffer space for conversion on the server when you perform binding operations.<
/p>
OCIBind() or OCIDefine() with the NLS
_LANG character set specified as UTF8 or AL32UTF8.
UTF8 or AL32UTF8 can be set i
n the NLS_LANG environment variable. You call OCIBind() and OCIDefine() in exactly the same ma
nner as when you are not using Unicode. Set the NLS_LANG environment variable to UTF8 or AL32UTF8 and run the following
OCI program:
... oratext ename[100]; /* enough buffer size for ENAME */ ... /* Inserting Unicode data */ OCIBindByName(stmthp1, & bnd1p, errhp, (oratext*)":ENAME", (sb4)strlen((char *)":ENAME"), (void *) ename, sizeof(ename), < a name="1006657"> SQLT_STR, (void *)&insname_ind, (ub2 *) 0, (ub2 *) 0, (ub4 ) 0, (ub4 *)0, OCI_DEFAULT); OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *) &ename_col_len, (ub4) 0, (ub4)OCI_ATTR_MAXDATA_SIZE, errhp); ... /* Retri eving Unicode data */ OCIDefineByPos (stmthp2, &dfn1p, errhp, (ub4)1, (void *)ename, (sb4)sizeof(ename), SQLT_STR, (void *)0, (ub2 *)0, (ub2*)0, (ub4)OCI_DEFAULT); ...
Oracle Corporation recommends that you access SQL NCHAR datatypes using UTF-16 binding or defining when using
OCI. Beginning with Oracle9i, SQL NCHAR datatypes are Unicode datatypes with an encoding of eit
her UTF8 or AL16UTF16. To access data in SQL NCHAR datatypes, set the OCI_ATTR_CHARSET_FORM attribute to NCHAR datatypes is always in the number of Unicode code units.
NCHAR data column:
...
ub2 csid = OCI_UTF16ID;
ub1 cform = SQLCS_NCHA
R;
utext ename[100]; /* enough buffer for ENAME */
...
/* Insertin
g Unicode data */
OCIBindByName(stmthp1, &bnd1p, errhp, (oratext*)":ENAME",
(sb4)strlen((char *)":ENAME"), (void *) ename,
sizeof(ename), SQLT_STR, (void *)&insnam
e_ind, (ub2 *) 0,
(ub2 *) 0, (ub4) 0, (ub4 *)0, OCI_DEFAULT);
OCIAttrSet((
void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *) &cform, (ub4) 0,
(ub4)OCI_ATTR_CHARSET_FORM, errhp
);
OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *) &csid, (ub4) 0,
(ub4)OCI_ATTR_CHARSET_ID, errhp);
OCIAttrSet((void *) bnd1p, (ub4) OCI_HTYPE_BIND, (void *) &ename_col_
len,
(ub4) 0, (ub4)OCI_ATTR_MAXDATA_SIZE, errhp);
...
/* Retrieving Unicode data */
OCIDefineByPos (stmthp2, &dfn1p, errhp, (ub4)1, (void *)ename,
(sb4)sizeof(ename), SQLT_STR, (void *)0, (ub2 *)0, (ub2*)0,
(ub4)OCI_DE
FAULT);
OCIAttrSet((void *) dfn1p, (ub4) OCI_HTYPE_DEFINE, (void *) &csid, (ub4) 0,
(ub4)OCI_ATTR_CHARSET_ID, errhp);
OCIAttrSet((void *) dfn1p, (ub4) OCI_HTYPE_DEFINE, (void *) &
cform, (ub4) 0,
(ub4)OCI_ATTR_CHARSET_FORM, errhp);
...
In order to write (bind) and read (defi
ne) UTF-16 data for CLOB or NCLOB columns, the UTF-16 character set ID must be specified as OCILobWri
te() and OCILobRead(). When you write UTF-16 data into a CLOB column, call OCILobWrite() as follows:
...
ub2 csid = OCI_UTF16ID;
err =
OCILobWrite (ctx->svchp, ctx->errhp, lobp, &amtp, offset, (void *) buf,
(ub4) BUF
SIZE, OCI_ONE_PIECE, (void *)0,
(sb4 (*)()) 0, (ub2) csid, (ub1) SQLCS_IMPLICIT);
The amtp parameter is the data length in number of Unicode co
de units. The offset parameter indicates the offset of data from the beginning of the data column. The csid
parameter must be set for UTF-16 data.
To read UTF-16 data from CLOB columns,
call OCILobRead() as follows:
...
ub2 csid = OCI_UTF16I
D;
err = OCILobRead(ctx->svchp, ctx->errhp, lobp, &amtp, offset, (void *) buf,
(ub4)BUFSIZE , (void *) 0, (sb4 (*)()) 0, (ub2)csid,
(ub1) SQLCS_IMPLICIT);
<
a name="1006716">
The data length is always represented in the number of Unicode cod
e units. Note one Unicode supplementary character is counted as two code units, because the encoding is UTF-16. After binding or defi
ning a LOB column, you can measure the data length stored in the LOB column using OCILobGetLength()
code>. The returning value is the data length in the number of code units if you bind or define as UTF-16.
err = OCILobGetLength(ctx->svchp, ctx->errhp, lobp, &lenp);
If you are using an NCLOB, then you must set OCI_ATTR_CHARSET_FORM to SQLCS_NCHA
R.
Pro*C/C++ provides the following ways to insert or retrieve Unicode data into or from the database:
VARCHAR Pro*C/C++ datatype or the native C/C++ text datatype, a program can access Unicode
data stored in SQL CHAR datatypes of a UTF8 or AL32UTF8 database. Alternatively, a program could use the C/C++ native UVARCHAR Pro*C/C++ datatype or th
e native C/C++ utext datatype, a program can access Unicode data stored in NCHAR datatypes of a database.
li>
NVARCHAR Pro*C/C++ datatype, a program can access Unicod
e data stored in NCHAR datatypes. The difference between UVARCHAR and NVARCHAR in a Pro*C/C++
program is that the data for the UVARCHAR datatype is stored in a utext buffer while the data for the NVARCHAR datatype is stored in a text datatype.Pro*C/C++ does n
ot use the Unicode OCI API for SQL text. As a result, embedded SQL text must be encoded in the character set specified in the N
LS_LANG environment variable.
This section contains the following topics:
Data conversion occurs in the OCI layer, but it is the Pro*C/C++ preprocessor that instructs OCI which conversion path should be taken based on the datatypes used in a Pro*C/ C++ program. Table 7-4 illustrates the conversion paths:
The Pro*C/C++ VARCHAR datatype is preprocessed to a struct wi
th a length field and text buffer field. The following example uses the C/C++ text native data
type and the VARCHAR Pro*C/C++ datatypes to bind and define table columns.
#in
clude <sqlca.h>
main()
{
...
/* Change to STRING datatype: */
EXEC ORACLE OPTION (CHAR_MAP=STRING) ;
text e
name[20] ; /* unsigned short type */
varchar address[50] ; /* Pro*C/C++ varcha
r type */
EXEC SQL SELECT ename, address INTO :ename, :address FROM emp;
/* ename is NULL-terminated */
printf(L"ENAME = %s, ADDRESS = %.*s\n", ename, address.len, a
ddress.arr);
...
}
When you use the VARCHAR datatype or native text datatype in a Pro*C/C++ program, the preprocessor assu
mes that the program intends to access columns of SQL CHAR datatypes instead of SQL NCHAR datatypes in the
database. The preprocessor generates C/C++ code to reflect this fact by doing a bind or define using the SQLCS_IMPLICIT
value for the OCI_ATTR_CHARSET_FORM attribute. As a result, if a bind or define variable is bound to a column of SQL
The Pro*C/C++ NVARCHAR datatype is similar to the Pro*C/C++ VARCHAR datatype. It s
hould be used to access SQL NCHAR datatypes in the database. It tells Pro*C/C++ preprocessor to bind or define a text bu
ffer to the column of SQL NCHAR datatypes. The preprocessor specifies the SQLCS_NCHAR value for the O
CI_ATTR_CHARSET_FORM attribute of the bind or define variable. As a result, no implicit conversion occurs in the database.
If the NVARCHAR buffer is bound against columns of SQL CHAR datatype
s, then the data in the NVARCHAR buffer (encoded in the NLS_LANG character set) is converted to or from the
national character set in OCI, and the data is then converted to the database character set in the database server. Data can be lost
when the NLS_LANG character set is a larger set than the database character set.
The UVARCHAR datatype is preprocessed to a struct with a utext buffer field. The following example code contains two host variables, ename
and address. The ename host variable is declared as a utext buffer containing 20 Unicode char
acters. The address host variable is declared as a uvarchar buffer containing 50 Unicode characters. The arr fields are accessible as fields of a struct.
#include &
lt;sqlca.h>
#include <sqlucs2.h>
main()
{
...
/* Change to STRING datatype: */
EXEC
ORACLE OPTION (CHAR_MAP=STRING) ;
utext ename[20] ; /* unsigned short type */
uvarchar address[50] ; /* Pro*C/C++ uvarchar type */
EXEC
SQL SELECT ename, address INTO :ename, :address FROM emp;
/* ename is NULL-terminated */
wprintf(L"ENAME = %s, ADDRESS = %.*s\n", ename, address.len,
address.arr);
...
}
When you use the UVARCHAR dat
atype or native utext datatype in Pro*C/C++ programs, the preprocessor assumes that the program intends to access SQL SQLCS_NCHAR value for
OCI_ATTR_CHARSET_FORM attribute. As a result, if a bind or define variable is bound to a column of a SQL NCHAR
Oracle provides the following JDBC drivers for Java prog rams to access character data in an Oracle database:
Java programs can insert or retrieve character data to and from columns of SQL CHAR and NCHAR datatypes. Specifically, JDBC enables Java programs to bind or define Java strings to SQL C
HAR and NCHAR datatypes. Because Java's string datatype is UTF-16 encoded, data retrieved from or in
serted into the database must be converted from UTF-16 to the database character set or the national character set and vice versa. JD
BC also enables you to specify the PL/SQL and SQL statements in Java strings so that any non-ASCII schema object names and string lit
erals can be used.
At database connection time, JDBC sets the server NLS_LANGUAGE and NLS_TERRITORY parameters to correspond to the locale of the Java VM that runs the JDBC driver. This operation en
sures that the server and the Java client communicate in the same language. As a result, Oracle error messages returned from the serv
er are in the same language as the client locale.
This section contains the following topic s:
Oracle JDBC drivers allow you to access SQL CHAR datatypes in the database using Java
string bind or define variables. The following code illustrates how to bind a Java string to a CHAR column.
int employee_id = 12345;
String last_name = "Joe";
Prepared
Statement pstmt = conn.prepareStatement("INSERT INTO" +
"employees (last_name, employee_id) VALUES (?, ?)")
;
pstmt.setString(1, last_name);
pstmt.setInt(2, employee_id);
pstm
t.execute(); /* execute to insert into first row */
employee_id += 1; /* n
ext employee number */
last_name = "\uFF2A\uFF4F\uFF45"; /* Unicode characters in name */
pstmt.setString(1, last_name);
pstmt.setInt(2, employee_id);
pstmt.execute();
/* execute to insert into second row */
You can define t
he target SQL columns by specifying their datatypes and lengths. When you define a SQL CHAR column with the datatype and
the length, JDBC uses this information to optimize the performance of fetching SQL CHAR data from the column. The follo
wing is an example of defining a SQL CHAR column.
OraclePreparedStatement pstm
t = (OraclePreparedStatement)
conn.prepareStatement("SELECT ename, empno from emp");
pstmt.defineColumnType(1,Types.VARCHAR, 3);
pstmt.defineColumnType(2,Types.INTEGER);
Res
ultSet rest = pstmt.executeQuery();
String name = rset.getString(1);
int id = reset.getIn
t(2);
You need to cast PreparedStatement to Or
aclePreparedStatement to call defineColumnType(). The second parameter of defineColumnType() is the
datatype of the target SQL column. The third parameter is the length in number of characters.
For binding or defining Java string variables to SQL
NCHAR datatypes, Oracle provides an extended PreparedStatement which has the setFormOfUse() method t
hrough which you can explicitly specify the target column of a bind variable to be a SQL NCHAR datatype. The following c
ode illustrates how to bind a Java string to an NCHAR column.
int employee_id
= 12345;
String last_name = "Joe"
oracle.jdbc.OraclePreparedStatement pstmt =
(oracle.jdbc.OraclePreparedStatement)
conn.prepareStatement("INSERT INTO employees (last_nam
e, employee_id)
VALUES (?, ?)");
pstmt.setFormOfUse(1, oracle.jdbc.OraclePrepared
Statement.FORM_NCHAR);
pstmt.setString(1, last_name);
pstmt.setInt(2, employee_id);
pstmt.execute(); /* execute to insert into first row */
employee_id += 1;
/* next employee number */
last_name = "\uFF2A\uFF4F\uFF45"; /* Unicode characters in name
*/
pstmt.setString(1, last_name);
pstmt.setInt(2, employee_id);
pst
mt.execute(); /* execute to insert into second row */
You can define the target SQL NCHAR columns by specifying their datatypes, forms of use, and lengths. JDBC uses thi
s information to optimize the performance of fetching SQL NCHAR data from these columns. The following is an example of
defining a SQL NCHAR column.
OraclePreparedStatement pstmt = (OraclePreparedSt
atement)
conn.prepareStatement("SELECT ename, empno from emp");
pstmt.defineColum
nType(1,Types.VARCHAR, 3,
OraclePreparedStatement.FORM_NCHAR);
pstmt.defineColumnType
(2,Types.INTEGER);
ResultSet rest = pstmt.executeQuery();
String name = rset.getStr
ing(1);
int id = reset.getInt(2);
To def
ine a SQL NCHAR column, you need to specify the datatype that is equivalent to a SQL CHAR column in the fir
st argument, the length in number of characters in the second argument, and the form of use in the fourth argument of defineCol
umnType().
You can bind or define a Java string against an NCHAR column
without explicitly specifying the form of use argument. This implies the following:
setString() method, then JDBC assumes that the bind or
define variable is for the SQL CHAR column. As a result, it tries to convert them to the database character set. When th
e data gets to the database, the database implicitly converts the data in the database character set to the national character set. D
uring this conversion, data can be lost when the database character set is a subset of the national character set. Because the nation
al character set is either UTF8 or AL16UTF16, data loss would happen if the database character set is not UTF8 or AL32UTF8.CHAR to SQL NCHAR dataty
pes happens in the database, database performance is degraded.In addition, if you bi
nd or define a Java string for a column of SQL CHAR datatypes but specify the form of use argument, then performance of
the database is degraded. However, data should not be lost because the national character set is always a larger set than the databas
e character set.
A gl obal flag has been introduced in the Oracle JDBC drivers for customers to tell whether the form of use argument should be specified b y default in a Java application. This flag has the following purposes:
CHAR datatypes can be migrated to support the SQL NCHAR da
tatypes for worldwide deployment without changing a line of code.setFormOfUse() method when binding and defining a SQL NCHAR column. The applicatio
n code can be made neutral and independent of the datatypes being used in the backend database. With this flag, applications can be e
asily switched from using SQL CHAR or SQL NCHAR.The global flag is specified in the command line that invokes the Java application. The syntax of specifying this flag is as follows:
java -Doracle.jdbc.defaultNChar=true <application class>
With this flag specified, the Oracle JDBC drivers assume the presence of the form of use argument for all bind and define operations in the application.
If you have a database schema that consi
sts of both the SQL CHAR and SQL NCHAR columns, then using this flag may have some performance impact when
accessing the SQL CHAR columns because of implicit conversion done in the database server.
| See Also:
< a href="ch7progr.htm#1006926">"Data Conversion in JDBC" for more information about the perform ance impact of implicit conversion |
Because Java strings are always encoded in UTF-16, JDBC drivers transparently convert data from the database character set to U TF-16 or the national character set. The conversion paths taken are different for the JDBC drivers:
For the OCI driver, the SQL statements are always converted to the database character set by the driver before it is sent to the database for processing. When the database character set is neithe r US7ASCII nor WE8ISO8859P1, the driver converts the SQL statements to UTF-8 first in Java and then to the database character set in C. Otherwise, it converts the SQL statements directly to the database character set. For Java string bind or define variables, Table 7-5 summarizes the conversion paths taken for different scenarios.
| Form of Use | SQL Data type | Conversion Path | |
|---|---|---|---|
|
|
|
Java string to and from data base character set happens in the JDBC driver | |
|
|
|
Java string to and from database character set happ ens in the JDBC driver. Data in the database character set to and from national character se t happens in the database server | |
|
|
|
Java string to and from national character set happens in the JDBC driver | |
|
|
|
Java string to and from national character set happens in the JDBC driver Data in national character set to and from database character set happens in the database server |
| Datatype | a> Maximum Bind Size Allowed by Database | Formula for Determini ng the Maximum Bind Size, Measured in UTF-8 Bytes |
|---|---|---|
|
|
< p class="TB">2000 bytes |
|
|
|
4000 bytes |
|
|
|
231 - 1 bytes |
|
The formulas guar antee that after the data is converted from UTF-8 to the database character set, the size of the data does not exceed the maximum siz e allowed in the database.
The number of UTF-16 characters that can be supported is determi ned by the number of bytes for each character in the data. All ASCII characters are one byte long in UTF-8 encoding. Other character types can be two or three bytes long.
Table 7-8 lis
ts the expansion factors of some common server character sets. It also shows the JDBC thin driver maximum bind sizes for CHAR
code> and VARCHAR2 data for each character set.
You should use the Oracle ODBC driver or Oracle Provider for OLE DB to access the Oracle server when using a Windows plat form. This section describes how these drivers support Unicode. It includes the following topics:
O racle's ODBC driver and Oracle Provider for OLE DB can handle Unicode data properly without data loss. For example, you can run a Uni code ODBC application containing Japanese data on English Windows if you install Japanese fonts and an input method editor for enteri ng Japanese characters.
Oracle provides ODBC and OLE DB products for Windows platforms only . For Unix platforms, contact your vendor.
OCI Unic
ode binding and defining features are used by the ODBC and OLE DB drivers to handle Unicode data. OCI Unicode data binding and defini
ng features are independent from NLS_LANG. This means Unicode data is handled properly, irrespective of the NLS_LA
NG setting on the platform.
In general, no redundant data conversion occurs unless you specify a different client datatype from that of the
server. If you bind Unicode buffer SQL_C_WCHAR with a Unicode data column like NCHAR, for example, then ODB
C and OLE DB drivers bypass it between the application and OCI layer.
If you do not specify
datatypes before fetching, but call SQLGetData with the client datatypes instead, then the conversions in Table 7-9 occur.
You must specify the datatype for inserting and updating operatio ns.
The datatype of the ODBC client buffer is given when you call SQLGetData b
ut not immediately. Hence, SQLFetch does not have the information.
Because the ODBC driver guarantees data integrity, if you perform implicit bindings, then redundant conversion may result in performance degrada tion. Your choice is the trade-off between performance with explicit binding or usability with implicit binding.
Unlike ODBC, OLE DB only enables you to perform implicit bindings for ins erting, updating, and fetching data. The conversion algorithm for determining the intermediate character set is the same as the impli cit binding cases of ODBC.
| |
Datatypes of the Target Column in the Database | In-Bin ding and Out-Binding Conversions | < /a> Comments |
|---|---|---|---|
|
|
|
If database character set is a subset of the D
atabase character set to and from If database character set is NOT a subset of Database character set to and from UTF-16 in OCI |
|
|
|
|
If database character set is a subset of the Datab
ase character set to and from If database character set is not a subset of Da
tabase character set to and from UTF-16 in OCI. UTF-16 to |
<
a name="1007225">
May degrade performance if data
base character set is not a subset of |
In ODBC Unicode applications, use SQLWCHAR to store Unicode data. All stan
dard Windows Unicode functions can be used for SQLWCHAR data manipulations. For example, wcslen counts the
number of characters of SQLWCHAR data:
SQLWCHAR sqlStmt[] = L"select ename fro m emp"; len = wcslen(sqlStmt);
Microsoft's
ODBC 3.5 specification defines three Unicode datatype identifiers for the SQL_C_WCHAR, SQL_C_WVARCHAR, and
SQL_WLONGVARCHAR clients; and three Unicode datatype identifiers for servers SQL_WCHAR, SQL_WVARCHAR<
/code>, and SQL_WLONGVARCHAR.
For binding operations, specify datatypes for bo
th client and server using SQLBindParameter. The following is an example of Unicode binding, where the client buffer SQL_C_WCHAR) is bound to the first bind variable associated with the Unicode
column (SQL_WCHAR):
SQLBindParameter(StatementHandle, 1, SQL_PARAM_INPUT, SQL _C_WCHAR, SQL_WCHAR, NameLen, 0, (SQLPOINTER)Name, 0, &Name);
Table 7-11 represents the datatype mappings of the ODBC Unicode dataty
pes for the server against SQL NCHAR datatypes.
| ODBC Datatype | Oracle Datatype |
|---|---|
|
|
|
|
|
|
|
|
| <
/tr>
According to ODBC specifications, SQL_WCHAR, SQL_WVARCH
AR, and SQL_WLONGVARCHAR are treated as Unicode data, and are therefore measured in the number of characters inst
ead of the number of bytes.
OLE DB offers the
wchar_t, BSTR, and OLESTR datatypes for a Unicode C client. In practice, wchar_t is the
most common datatype and the others are for specific purposes. The following example assigns a static SQL statement:
wchar_t *sqlStmt = OLESTR("SELECT ename FROM emp");
The OLESTR macro works exactly like an "L" modifier to indicate the Unicode string. If you need to allocat
e Unicode data buffer dynamically using OLESTR, then use the IMalloc allocator (for example, CoTaskMe
mAlloc). However, using OLESTR is not the normal method for variable length data; use wchar_t* inste
ad for generic string types. BSTR is similar. It is a string with a length prefix in the memory location preceding the s
tring. Some functions and methods can accept only BSTR Unicode datatypes. Therefore, BSTR Unicode string mu
st be manipulated with special functions like SysAllocString for allocation and SysFreeString for freeing m
emory.
Unlike ODBC, OLE DB does not allow you to specify the server datatype explicitly. Wh en you set the client datatype, the OLE DB driver automatically performs data conversion if necessary.
Table 7-12 illustrates OLE DB datatype mapping.
| |
Oracle Datatype |
|---|---|
|
|
|
If DBTYPE_BSTR is specified, then it is assumed to be DBTYPE_WCHAR be
cause both are Unicode strings.
ADO is a high-level API to access database with the OLE DB and ODBC drivers. Most database application developers use the ADO interface on Windows because it is easily accessible from Visual Basic, the primary scripting language for Active Server Pages (ASP) for the Internet Information Server (IIS) . To OLE DB and ODBC drivers, ADO is simply an OLE DB consumer or ODBC application. ADO assumes that OLE DB and ODBC drivers are Unic ode-aware components; hence, it always attempts to manipulate Unicode data.
XML support of Unicode is essential for software development for global markets so that text information can be exchanged in any language. Unicode uniformly supports almost every character and language, which makes it much easier to suppo rt multiple languages within XML. To enable Unicode for XML within an Oracle database, the character set of the database must be UTF- 8. By enabling Unicode text handling in your application, you acquire a basis for supporting any language. Every XML document is Unic ode text and potentially multilingual, unless it is guaranteed that only a known subset of Unicode characters will appear on your doc uments. Thus Oracle recommends that you enable Unicode for XML. Unicode support comes with Java and many other modern programming env ironments.
This section includes the following topics:
A common mistake in reading and writing XML files is u
sing the Reader and Writer classes for character input and output. Using Reader and Writ
er for XML files should be avoided because it requires character set conversion based on the default character encoding of the
runtime environment.
For example, using FileWriter class is not safe because
it converts the document to the default character encoding. The output file can suffer from a parsing error or data loss if the docum
ent contains characters that are not available in the default character encoding.
UTF-8 is popular for XML documents, but UTF-8 is not usually the default file encoding for Java. Thus using a Java class that assumes the defa ult file encoding can cause problems.
The following example shows how to avoid these proble ms:
import java.io.*; import oracle.xml.parser.v2.*; public class I18nSafeXMLFileWritingSample { public static void main(String[] args) throws Exception { // create a test document XMLDocument doc = new XMLDocument(); doc.setVersion( "1.0" ); doc.appendChild(doc.createComment( "This is a test empty document." )); doc.appendChild(doc.createEleme nt( "root" )); // create a file File fi le = new File( "myfile.xml" ); // create a binary output stream to write to the file just created FileOutputStream fos = new FileOutputStream( file ); // create a Writer that converts Java character stream to UTF-8 stream OutputStreamWriter osw = new OutputStreamWriter( fos, "UTF8" ); // buffering for efficiency Writer w = new BufferedWriter( osw ); // create a Print Writer to adapt to the printing method PrintWriter out = new PrintWriter( w ); < /a> // print the document to the file through the connected objects doc.print( ou t ); } }
Do not read XML files as text input. When reading an XML document stored in a file system, use the parser to a
utomatically detect the character encoding of the document. Avoid using a Reader class or specifying a character encodin
g on the input stream. Given a binary input stream with no external encoding information, the parser automatically figures out the ch
aracter encoding based on the byte order mark and encoding declaration of the XML document. Any well-formed document in any supported
encoding can be successfully parsed using the following sample code:
import java.io.*; import oracle.xml.parser.v2.*; public class I18nSafeXMLFileReadingSam ple { public static void main(String[] args) throws Exception { // create an instance of the xml file File file = new File( "my file.xml" ); // create a binary input stream FileInputStre am fis = new FileInputStream( file ); // buffering for efficiency BufferedInputStream in = new BufferedInputStream( fis ); // get an in stance of the parser DOMParser parser = new DOMParser(); // parse the xml file parser.parse( in ); } }
When the source of an XML document is not a fi le system, the encoding information is usually available before reading the document. For example, if the input document is provided in the form of a Java character stream or Reader, its encoding is evident and no detection should take place. The parser can begin pa rsing a Reader in Unicode without regard to the character encoding.
The following is an exa mple of parsing a document with external encoding information:
import java.io.*; import java.net.*; import org.xml.sax.*; import oracle.xml.parser.v2.*; public class I18nSafeXMLStreamReadingSample { public static void main(String[] args) throws Exception { // create an instance of the xml file URL url = new URL( "http://myhost/mydocument.xml" ); // create a connection to the xml document URLConnection conn = url.openConnection(); // get an input stream InputStream is = conn.getInputStream(); < a name="1007697"> // buffering for efficiency BufferedInputStream bis = new BufferedInputStream( is ); /* figure out the character encoding here */ /* a typical source of encoding information is the content-type header */ /* we assume it is found to be utf-8 in this example */ String charset = "utf- 8"; // create an InputSource for UTF-8 stream InputSource in = new InputSource( bis ); in.setEncoding( charset ); // get an instance of the parser DOMParser parser = new DOMParser(); // parse the xml stream parser.parse( in ); } a>}