Login Register Actian.com  

Actian Community Wiki

Navigation
Learn About
Developing With
Ingres Talk
Information
Toolbox

UTF-8 Transliteration

From Ingres Community Wiki

Jump to: navigation, search

Contents

References

Design Document is available here.

DBMS Changes Required - see Supporting GCC UTF8 Transliteration

UTF8 Transliteration

Allow mapping from UTF8 character set to non-UTF8 character set. Redesign the way we currently do this. Possibly change the network neutral protocol that currently exists. UTF-16 mapping should handle this with an algorithmic UTF-16 <-> UTF-8 mapping at the appropriate point.

CL Additions for UTF8 Transliteration Project

New CL routines were needed to convert from UTF8 to UTF16. More information can be found here.

Ingres Enhancement Number

SIR 121676 - Add routines for UTF8 to UTF16 conversion to support UTF-8 transliteration in Ingres/Net

Notes

DDS Review held on 8 Feb 2008 8am Pacific


(Phone) meeting May 5th 2009 (5pm Pacific)

Aim of call: Get agreement (or disagreement) documented on what is needed for this project to be successful. Solutions for problems to be handled seperately.

Discussion Group attendees: Gordy, Durwin, Teresa, Steve, Shubhagam, Chris

Result: Unilateral agreement that any silent truncation was unacceptable. Didn't talk about replacement character behavior as truncation was considered a show stopper. Agreed that current "GCC does transliteration" project could not meet these requirements due to GCA protocol of tuple descriptor and tuples being separated. Consensus was that DBMS was the only one who could make decisions and raise errors.


Slightly longer summary

Gordy has come up with schemes that can deal with many (GCA) metadata expansions that can/will happen when crossing different encodings. However when (for example) a "SELECT * FROM ...." is issued a descriptor tuple is returned to the client and then later (could be millisecs later, could be hours if there are pauses in the client FETCH) tuples are sent back. The tuples are sized according to the descriptor tuple passed earlier.

Briefly spit-balled ideas, one idea was to impose a "copy table not supported across different character sets (with UTF8) but allow other operations" but this was likely to be a show-stopper limitation as copydb/unloaddb was hoped to allow database charset migrations with the new feature (would also prevent Ingres Star from working).

Out lined a couple of usage scenarios (and expected behaviors).


Delayed select

Peers; client win1252 (aka cp1252, see http://msdn.microsoft.com/en-us/goglobal/cc305145.aspx), server utf8 Data; table with varchar(10)

varchar 10, 10 Euros

Client inserts 10 Euro symbols (single byte as this is a varchar column) using bind parameters, U+20AC (in 1256 0x80 and will take 10 bytes). When translated to utf8 this needs 30 bytes of space, 0xe2 0x82 0xac. This means there is space for 3 characters and the 4th will be truncated in the middle. Best case scenario GCC could truncate to 3 characters with one "byte" left for space, you could fill the single byte with _some_ character to indicate truncation, errors in GCC would be fatal and not warnings (with no table/column name information)

varchar 9, 10 Euros

Client inserts 10 Euro symbols (single byte as this is a varchar column) using bind parameters, U+20AC (in 1256 0x80 and will take 10 bytes). When translated to utf8 this needs 30 bytes of space, 0xe2 0x82 0xac. This means there is space for 3 characters. Best case scenario GCC could truncate to 3 characters with one "byte" left for space and no indication of truncation.


In both cases, a select later on, say 3 years later after the insert (or copy table...) would have no idea.

Assuming database session has string truncation enabled; Ideal behavior for above examples is to raise a (normal) truncation error/warning E_AD1082. This was agreed by everyone on the call.

Example below of normal string truncation error:

C:\users\clach04\sql>sql -string_truncation=fail nclach04 < truncate_string_test.sql
INGRES TERMINAL MONITOR Copyright 2007 Ingres Corporation
Ingres 2006 Release 2 Microsoft Windows Version II 9.1.1 (int.w32/103) login
Wed May 06 13:43:40 2009

continue
* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * /* SQL Startup File
*/
/*
** See:
**      http://docs.ingres.com/sqlref/ConnectionFlags and
**      http://docs.ingres.com/sqlref/SQLOperations
**
**      -string_truncation FlagùSpecify Error Handling for String Truncation
**
**      To specify error handling for string truncation, use the
**      -string_truncation=option flag on the connect statement,
**      specified when a session connects to a database.
**
**      The option can be one of the following:
**
**          * ignore
**              (Default)   Truncates and inserts the string.
**                          No error or warning is issued.
**          * fail
**
**      Does not insert the string, issues an error, and aborts the statement.
**
**      This flag can also be specified on the command line for Ingres
**      operating system commands that accept SQL option flags.
**      For details about SQL option flags, see the sql command description
**      in the Command Reference Guide .
**
** sql -string_truncation=ignore SOMEDB < truncate_string_test.sql ## default
** sql -string_truncation=fail SOMEDB < truncate_string_test.sql
*/
drop table trunctest;
Executing . . .

continue
* * *
create table trunctest(col1 varchar(10));
Executing . . .

continue
* * *
insert into trunctest(col1) values ('1234567890');
Executing . . .

(1 row)
continue
* * *
insert into trunctest(col1) values ('1234567890X');
Executing . . .

E_AD1082 String '1234567890X' is too long to fit into the resulting string
    type.
    (Wed May 06 13:43:40 2009)

continue
* * *
select * from trunctest;
Executing . . .


+----------+
|col1      |
+----------+
|1234567890|
+----------+
(1 row)
continue
*
Your SQL statement(s) have been committed.

Ingres 2006 Release 2 Version II 9.1.1 (int.w32/103) logout
Wed May 06 13:43:40 2009

NOTE during discussion discussed a case with OpenROAD3.5 driver update where data was loaded and ideal error message contains:

  1. table name
  2. column name
  3. length of target (column)
  4. length of source

We already have the string content (at least for literals in this example, not tested with bind params).

We didn't say this was needed but example; for example above something like:

E_NEWMESSAGE String of length X is too long to fit into the destination string of length Y

Where X and Y would be bytes for varchar/char and codepoints for nchar/nvarchar

Ideas not discussed at meeting

  1. similar to string truncation connection flag, have a new flag for how to deal with charset transliteration errors. E.g. need "raise error", "ignore (delete)", "ignore (replace with an unmappable character)"
  2. how could we implement transliteration in the DBMS - "Lying GCC" (server GCC always tells the client that it has the same charset as the client) which would allow existing clients to connect without a new GCA protocol. E.g. Ingres 6.4 could connect to Ingres 10.0 with utf8

Phone Meeting - May 19th 2009(5pm Pacific)

We held another meeting to discuss the implications of allows non-UTF8 Ingres clients to connect to a UTF8 Ingres Server (and conversely). This discussion was prompted by a question raised by Gordy regarding what he should do when transliterating a string from one character set to another and the target string is too small:

  1. Generate a GCF error;
  2. Insert a Substitute character at the end of the buffer
  3. Generate some type of warning to the client.

A discussion group was pulled together to try to come up with ideas and/or solutions.

Discussion Group attendees: Gordy, Durwin, Teresa, Bill, Steve, Shubhagam, Chris

In the meeting on 19-May-2009 the following discussions took place:

1. Legacy clients need to be able to participate

Bill pointed out that you can never identify all of the clients at a site. If you require that all clients be updated to the latest release then the Ingres Server will never be updated. Thus the non-UTF8 legacy clients must be able to talk to a new Ingres Server (Server being defined as the Ingres RDMS+ Ingres GCC). This means that the Ingres Server needs to be smart enough to be able to transliterate on behalf of the legacy clients. This is not the ideal solution since the resources on the server are in general much more expensive than the client.

2. New clients should be smarter.

Conversely, a new client should be able to perform transliteration solely on the client and present the same character set to the server as the Server implements. It was thus always appear to the Ingres Server as being in the same character set as the Ingres Server. This could allow new UTF8 clients to talk to Legacy Ingres Servers. It was agreed that the resources on the client are generally less expensive than the resources on the server when it comes to this type of processing.

3. Existing transliteration should remain in place.

The existing transliteration can stay in place for Single byte character sets. This recognizes that there are a number of deficiencies in the existing Single byte transliteration but it exists in all prior Ingres client and servers.

4. Doublebyte and Multibyte transliteration in new clients should use UTF16 as the pivot character set

Gordy has pointed out that he would like to preserve the work that he has done in the redesign of the Multibyte transliteration. I believe that the idea is to transliterate everything to UTF16 before sending it to the server. If item 2 (New clients should be smarter) is adopted, there should be no reason why the transliteration cannot take place solely on the client.

5. Implementation of support for Unicode Query Text (UTF-16 or UTF-8) should be considered

This was brought up as a result of the discussion regarding the sending of query text with UTF-16 string literals. The existing query text only supports a single encoding based on the II_CHARSETxx setting. Bill pointed out that technically, the query text (SQL_TEXT) could be implemented as a separate encoding from the default encoding of the DBMS. While the encoding of SQL_TEXT is implementation dependent the suggested preference should be Unicode with the implication that this will be UTF-16.

6. The prior topic prompted a discussion about changing our catalogs to UTF-16 encoding

This was an interesting discussion. The advantage of changing the catalogs to support UTF-16 at the same time as the length the SQL Identifiers was changed was discussed. One implication that was not discussed was the impact on legacy clients that need to form metadata queries against catalogs that they do not expect to be encoded in UTF-16. This implies that someone will perform this transliteration automatically. (This is the technique that was chosen when support for MS SQL 2003 by Enterprise Access was implemented. The MS SQL 2000 catalogs were UTF-16 encoded. The MS SQL ODBC Driver has the capability of coercing the UTF-16 strings to ANSI strings transparently.)

7. An extensive discussion was held on the topic String lengths

In Ingres 9.2 there is some discrepancy with regards to the length of a string. In DDL, the length is expressed as the number of OCTETs. In most DML operations the length of a string is expressed as the number of characters. String lengths in GCA are always expressed as byte lengths. Some discussion was given to adopting the use of CHARACTER LENGTH in all circumstances. The problem that this raises is two-fold: (1) How do you allocate a buffer big enough to deal with string that are composed of variable length characters and (2) GCF always assumes that string lengths are expressed in the number of bytes. (One exception was the use of UTF-16 as a network transport character set. In that case Gordy mentioned when the string is transliterated into a UTF-16 string, the original length of the string is retained and is not changed in anticipation of transliterating it back into the target character set encoding.) Chris was opposed to changing the DDL definition of a string length from BYTE LENGTH to CHARACTER LENGTH, if character length semantics are required we could either provide a tool to take "old" DDL and convert to using NVARCHAR, or a session/dbms setting could be implemented that takes DDL of VARCHAR in a session and autochanges it to nvarchar, then default coercion would kick in. If CHARACTER LENGTH were to be adopted then The DBMS would need to allocate four bytes for every UTF-8 character and also then implement string truncation on the character length not the byte length, libq would also need to do this too.

8. Truncation remained the primary reason why the meeting was held

The issue of truncation during transliteration remained the primary topic. We generally agreed that the capability of providing a SQL warning back to the client was a requirement if string truncation occurred. Currently, this SQL warning is injected by the DBMS. GCF does not have this ability. Gordy did state that SQL warnings are sent as GCA_ERROR response messages by the DBMS. When I asked him if it was possible for GCF to inject these messages he said that this would be a new requirement. (One of the things that Gordy mentioned was the need to make changes to the Finite State Machine if this approach was adopted.)

9. String Truncation Error Options

Some discussion was made about the various string truncation behaviors. The options that need to be supported are: (1) Silent; (2) Fail and (3) Warn. Ingres supports the first two but not the last. Support is implemented via a connection flag

10. String Transliteration Error Options

In transliteration there were three options that were discussed: (1) Fail; (2) Substitute and (3) Ignore. Ingres supports the first two but not the last. Support is implemented by the "sql" statements, "SET NOUNICODE_SUBSTITUTION" and "SET UNICODE_SUBSTITUTION .."

11. The existing Client GCC and Server GCC processes are assumed to remain in place in the foreseeable future

These discussions make the assumption that the existing Ingres Net processes will continue to be used in the foreseeable future. The main goal of Ingres Net has always been to hide the platform and physical characteristics of the Ingres clients and Ingres servers from each other. If we are to have smarter clients and smarter servers then the core logic where transliteration occurs needs to be encapsulated and exposed to *both* the Ingres Net stack and the client and server runtime. This would allow the clients or servers to take a more active role in directing transliteration. This would also allow the exposure of transliteration APIs to the Ingres client and Ingres server runtime for other uses.

12. The existing Ingres Character Sets do not match existing standards and the UCHAMAP (ICU-based tables)

Item 4 discussed the use of UTF-16 as the network character set when transliterating between the Asian character sets. Serious consideration should be made to the introduction of new character sets based on standards and using the UTF-16 pivot character set. This would potentially allow the introduction of character sets for Vietnamese and other requested languages.

13. A discussion was raised on the feasibility of increasing the maximum VARCHAR size from 32000/16000 to 64000/32000

In a side discussion around the possibility of preallocating 4 bytes for every UTF8 character processed, it was pointed out that the current limit of 32000 for a non-UTF8 VARCHAR (16000 for a UTF8 VARCHAR) is rather limiting. Steve said that while tuples can span pages, the largest column width is limited to the page size.

14. increasing the maximum page size to 128K

The discussion then moved onto increasing the maximum page size to 128K in order to accommodate larger VARCHAR sizes. Steve indicated that this was a possibility and matched the default block size used by many SANs. Going past 128Kb pages may require more effort.


The meeting was adjourned after two hours. We agreed to go away and try to digest what was said. We will probably have a few informal adhoc meetings and then resume our discussions after the UKIUA.

Phone Meeting - Jun 23rd 2009(5pm Pacific)

We held another follow-on discussion after the UKIUA

Discussion Group attendees: Gordy, Durwin, Teresa, Bill, Steve, Shubhagam

Proposals for where UTF8 translation would occur and identification of translations errors

The following proposals for where UTF8 translation would occur and the identification of translation errors were discussed:

  • GCC with message level flags

It was determined that this option did not provide sufficient information on items which had translation errors.

  • GCC with tuple level flags

Steve asked if this was a possibility. While possible, it still does not provide identification of the columns with errors nor does it handle query parameter values.

  • GCC with data value flags (extending NULL byte on nullable values to status byte on all values)

There was some concern that this would produce significant additional overhead, but was considered the best option to provide the level of detail needed to handle translation errors.

It was also noted that just as the NULL byte increases the GCA data length of data values by 1 in meta-data descriptors, the status byte will also require meta-data lengths to be increased by 1 when present.

Initially, it was thought that the NULL byte on nullable values could be overloaded with the status information. This would mean the status byte would only be added for non-nullable types. A minor concern was expressed that overloading the NULL indicator may result in confusion because of the many instances where the NULL byte is essentially treated as a boolean (zero or non-zero, false or true). A new proposal has been made to move the NULL byte to the start of the data so that the data portion can be omitted entirely if the value is NULL. Because the status bits are only fully formed once the data value has been fully processed, overloading these bits in a leading status byte would cause problems in stream processing when dealing with long character strings and LOBs. This new proposal provides an argument for adding a separate status byte which would be present for both nullable and non-nullable types.

Two additional issues were not discussed:

  • Adding the status byte only to character types. As it stands, the status byte will also be added to non-nullable numeric and date/time values. While there may be a use for a status byte for these types in the future, there currently isn't any forseen reason at this time and there would be less additional overhead if these types didn't have the additional status byte.
  • Translation errors in non-data value strings. The current proposal only applies to GCA_DATA_VALUES and copy/data tuples (GCA_TUPLES, GCA_CDATA), basically DB_DATA_VALUE's. There are additional strings in GCA messages which do not fall into this category. Details relating to this issue are given below.
  • Pushing translation and character set negotiation out to the GCA client level (Client interfaces and DBMS)

This option was not discussed in detail other than it was noted that it would be the most costly in terms of development and complexity. How this would be implemented in regards to backward compatibility and character set negotiations in both client/DBMS and GCC was not immediatly obvious. Backward compatibility issues in performing translation in GCC using the data value status byte were discussed:

  • Adding a status byte to data values will require a new GCA protocol level and thus will not be able to provide per item indication on connections involving either old clients or servers.
  • For data flow towards a new component (either old client to new server or old server to new client) the per message translation flag option discussed above would at least provide an indication that an error occured even though the item involved isn't identified. It was noted that this is better than what is currently done.
  • For data flow towards an old component (either new client to old server or new server to old client), translation errors will not be flagged. It was noted that this is the same as currently occurs.

Options for performing translation within GCC

Options for performing translation within GCC were discussed:

  • Implementing new GCC translation tables for UTF8 and the other supported charsets.
  • Using ADF to perform the translation.

Preference was expressed to maintain a single translation capability within ADF. There was concern regarding backward compatibility and the oddities that currently exist with GCC translation tables. This concern was addressed by limiting the use of ADF only for conversions involving UTF8. No changes will occur in currently supported translations (at least for the time being).

It was noted that things have come full circle: this whole discussion began with the question of how translations errors should be handled within GCC. Since GCC will flag errors for handling by the receiving entities, GCC translations must now handle errors to reasonably identify the error data while continuing to process the message stream. Truncation errors can occur in fixed length strings. There are a few fixed length strings in GCA which are not data values and thus would not have a status byte:

  • GCA_COPY gca_domname_cp[ 64 ] - Column name in copy map?
  • GCA_COPY gca_delim_cp[ 2 ] - Copy field delimiters. A single or double byte character.
  • GCA_RESPONSE gca_logkey[ 16 ] - Table/Logical keys
  • GCA_ID gca_name[ 64 ] - Distributed transaction ID, Repeat Query ID, Database Procedure ID, Cursor ID.

In addition to the preceding fixed length strings, translation errors can occur in variable length strings:

  • GCA_TRACE - Trace messages
  • GCA_TDESCR gca_attname - Column names
  • GCA_COPY gca_nuldata_cp - NULL value indicator
  • GCA_COPY gca_fname_cp - File name
  • GCA_COPY gca_logname_cp - Log name
  • GCA_DELETE gca_table_name - Table name
  • GCA_DELETE gca_owner_name - Schema name
  • GCA_EVENT gca_evname - Event name
  • GCA_EVENT gca_evowner - Event owner
  • GCA_EVENT gca_evdb - Event database
  • GCA_INVPROC gca_proc_own - Schema name
  • GCA_INVPROC gca_parname - Database procedure parameter names.
  • GCN - All Name Server strings.
  • GCM - IMA object classes, instances, values

Comments on June 23rd Notes

Steve - June 25, 2009 7:36 PM

I think it might be reasonable to only add the status byte to character types (or at least to a specific list of types rather than every type); as you mention, I don’t see why we would need it for other types at this point, and we can always bump the GCA level to add it at a later date if we need to.
I’m not sure why we do transliteration for most of those fixed and variable strings at the bottom of your mail, and in fact, I think if we performed any kind of mapping on some of them it would cause problems (e.g. I think the table logical key is mean to be a binary value). Can we just set it up so that we don’t transliterate the ones that don’t make sense? How does that affect backwards compatibility?

Gordy - June 29, 2009 11:30 AM

Looks like table/object keys are going to be a real mess:
  • Libq treats table/object keys as character types.
  • SQL manual declares table/object keys to be character values.
  • The DBMS returnes table/object keys as DB_CHA_TYPE when doing SELECT
  • System maintained values for table/object keys appear to be binary values.
The handling of table/object keys contained in GCA_RESPONSE messages has to be the same as how the front-end will return these values to the DBMS so that the round trip processing is the same.
UTF8 translation of binary table/object key values will likely result in errors – either truncation if translating into UTF8 or invalid source characters if translating out of UTF8.
Note that this problem will occur for any storage of binary values in character types, not just table/logical keys.

Steve - June 29, 2009 4:13 PM

I feared this might happen. Now that we actually do some validation checking on character strings that flow in and out of the database all kinds of interesting problems will start to crawl out of the woodwork. We saw the start of this in the 9.2/9.3 UTF8 implementation where we had some catalog fields declared as chars that should have been bytes; we worked around the problem at that time by writing special hacks to defeat the validity check on some catalog fields.
In the past the DBMS engineers have been in the habit of coercing any type that didn’t have a specific GCA equivalent into a DB_CHA_TYPE before sending it out of the DBMS. In reality many of those types should have been coerced to DB_BTYE_TYPE, but back then it didn’t matter so much.
Because we have both a major release and an GCA level bump, I think we may have a rare opportunity to put this right, but I’m worried about how it might affect existing customer applications. I know transliteration has already ballooned into a massive project, but I think it needs to be done right; what I think I’m proposing is that we consider the option of starting a project that will fix all of the things that are DB_CHA_TYPE right now that really should be DB_BYTE_TYPE, this would likely mean changes in the DBMS, GCA, and all of the frontend interfaces.
  • Would this actually fix a bunch of these fixed and variable length string issues (Gordy)?
  • Does anyone have any insight into how this would affect existing customer applications?

Gordy - June 29, 2009 4:56 PM

Table/Object keys and object IDs (GCA_ID, procedure names are an exception) can probably be fixed this way. It will require ESQL applications to use varbyte structures to hold the keys or else they will be sent back to the DBMS as character types. Don’t know if there is any hope for backward compatibility.
The other things are either object names or true text based quanitities. Off hand I don’t think these can be treated as binary because they appear in textual contexts, such as query text, in addition to the individual places I listed. I think that anyone who uses extended characters in identifiers in a mixed charset environment deserves the pain they create for themselves.

Steve - June 29, 2009 5:29 PM

So it looks like the table key may be the only issue, in which case we should just fix it.
For the other strings, would it really be a fatal error if they were truncated? Many of them seem to contain values used only for display or tracing purposes, which are never stored permanently in the database. Are there any values that would be stored permanently? Could we get away with silently truncating most of these strings?

Gordy - July 02, 2009 4:06 PM

The variable length strings won't be truncated because GCC can extend their lengths.
The likely errors for these will be translation errors. Most of these are meta-data

strings and translation errors will result in access errors which would be an expected result of using extended characters in a mixed charset environment. The GCN and GCM strings are primarily exposed only on local connections.

>So it looks like the table key may be the only issue, in which case we should just fix it.
The issues as I see them:
  • GCA_COPY gca_domname_cp[ 64 ] - I think this may only be used in libq for error messages. This could be turned into a variable length string which would require a new copy map message. This value also has issues with the long names project. Someone familiar with copy should review this value to determine if truncation and/or long names warrents a new message.
  • GCA_COPY gca_delim_cp[ 2 ]- This appears to allow for double-byte delimiter characters. This can be increased or made variable which would require a new copy map message. Someone familiar with copy should review this and determine if multi-byte UTF8 delimiter characters should be supported.
  • Logical keys - These are handled in three distinct ways:
    • Out of the DBMS in the GCA_RESPONSE message - GCC treats this as a fixed-length character string.
    • Out of the DBMS astuple values - DBMS sends these as DB_CHA_TYPE.
    • Into the DBMS as query parameters - ESQL documents these as character types, but since there are embedded nulls there are likely special handling requirements (varchar structure?).
To treat these binary values as binary, the following would be needed:
    • GCC treats GCA_RESPONSE values as binary arrays.
    • DBMS begins sending these as DB_BYTE_TYPE - don't know how this would affect applications.
    • To use these properly, ESQL applications would need to used the varbyte structure for query parameter values.
  • GCA_ID - These are passed in two ways:
    • As three distinct query parameters (DB_INT_TYPE 4, DB_INT_TYPE 4, DB_CHA_TYPE 64) associated with the ~Q query marker.
    • As a GCA_ID message object (packed as two 4-byte integers and a 64-byte character array).
The client creates an initial ID (a FE-ID) which is sent to the DBMS on the initial requests. The DBMS returns a modified ID (a BE-ID) which is used for subsequent requests related to the object. There are four distinct types of object ID's:
    • Distributed transaction ID - Only the integer components are significant (FE sends string "?" while BE space fills entire string).
    • Repeat Query ID - String component is generated by ESQL using source filename and instance count. String is only used for comparison so I would recommend that string generation algorithm ensure that extended characters are not used.
    • Database Procedure ID - The initial FE-ID is really only used to pass the procedure name. This also has implications for the long names project. Procedure owner is passed as variable length string. My suggestion here is add a new GCA_INVPROC message which treats procedure name as a variable length string just like procedure owner. DBMS must then generate full GCA_ID for GCA_RETPROC message. FE would use new message for all initial invocations. FE would only use the current GCA_INVPROC messages when sending a GCA_ID from a prior invocation.
    • Cursor ID - String holds the cursor name. This is treated as an identifier by the DBMS.
Treating a GCA_ID as a binary value will have implications with cursor names and backward compatibility for GCA_INVPROC. Depending on how the DBMS handles cursor names, the FE could ensure that extended characters are not used in GCA_ID.

Use cases

Example use case scenarios. The idea here is to define the required behavior whether this is already handled or not and irrespective of how feasible.

Real life OpenROAD 3.5 driver migration

OpenROAD example of latin1 client to utf8 server; insert into a dest varchar column of X bytes with X latin1 bytes where at least one byte is non-ASCII - I think we need specific errors that help find out when/why this occurred.

  varchar(1750) destination column
  Insert from latin1 client to urf8 server with bind parameter that is
  full of 1750 bytes, where a few of those bytes are non-ascii
  Oracle DBMS correctly raised the error, something along the lines
  of, "insert into table TABLENAME column name COLNAME
  truncated/overflows" 

insert string literals

insert 'some non-ascii string, say many euro symbols €€€€€€€€€€€€€€'

EXECUTE IMMEDIATE with select in a bind param

select column name/string literal in where clause at end of query string, e.g.

......
 mystr = char(51);
......
mystr='SELECT 1 from sometable where 1=sometable.booöööööö'; /* NOTE at max length for string */
EXEC SQL EXECUTE IMMEDIATE :mystr;

NOTE end of bind param has a number of non-ASCII characters

un-mappable characters

insert/select of un-mappable characters, ideally implement; error, replacement character, and "ignore/skip" options.

  1. E.g. select a CJK glyph from a utf8 server into a latin1 client - we need a transliteration error to be raised somewhere, who raises that error if errors are on (which I think should be the default)
  2. For insert, there really aren't any invalid utf8 destination characters BUT there are (supposed to be) unused characters in latin1, e.g. 0x91 - what do we do when we see that? Or 0xff for ISO_8859-7
  3. new utf8 client inserting CJK glyph into old latin1 server

Client does not have enough space allocated in destination buffer

select of X+y bytes from DBMS when client only has X bytes in destination

NOTE if implemented probably only addressable for newer client (run times).

Requirements

  1. DBMS resources are expensive, try and off load processing to clients;
  2. new clients talking to new servers - new clients need to be smarter and do more work
  3. new clients talking to old servers - new clients need to be smarter and do more work
  4. Old clients need to talk to new servers - new servers need to be smarter (where server means DBMS and server side gcc, client means client exe and client gcc) so that they can handle older clients... at the expense of server processing
  5. is string truncation allowed to occur in the comms layer at-all/with-warnings. If with warnings what should the warning message include and how is that understood/handled by the client, e.g. old client
  6. EXAMPLE from clach04: silent truncation as a default is not ok - old clients talking to new servers never had to worry about truncation due to comms differences, the old default of silent truncation is not appropriate here
  7. EXAMPLE from clach04: truncation errors/warnings MUST have indicators that help in tracking down the problem. The bare minimum is table and column name, information. Ideally there will be some lengths in the message. See real life OpenROAD use case example
  8. EXAMPLE from clach04: transcoding errors need to be fatal, with the option of relaxing this on a session basis - the old behavior was sort of understandable as you could get back what you put in, even if local users saw garbage.
Personal tools
© 2011 Actian Corporation. All Rights Reserved