The printable and some non-printable characters are given in the ASCII table. The latter can vary by the character set in use (particularly the ones over 127d). You can find a number of these on the internet:
http://www.asciitable.com/
One way to detect non-printable characters is to strip them out of the string and compare that to the untranslated string; if they’re different, then I’ve found one.
I assembled a string of unprintable characters following a space. This is the source format string. The target format string only contains a space.
I also wanted to display the offset in the string of the offending character, and its decimal equivalent value.
DECLARE
v_strip VARCHAR2(32767) := ' ';
v_ascii INTEGER;
CURSOR cur
IS
SELECT c.cust_num
,c.full_name
,TRANSLATE(c.full_name, v_strip, ' ') t_full_name
FROM cust c
WHERE TRANSLATE(c.full_name, v_strip, ' ') != c.full_name
ORDER BY c.cust_num;
BEGIN
-- list of non-printable characters
FOR i IN 0 .. 31 LOOP
v_strip := v_strip || CHR(i);
END LOOP;
FOR i IN 127 .. 255 LOOP
v_strip := v_strip || CHR(i);
END LOOP;
-- display full names with embedded non-printable chars
FOR rec IN cur LOOP
DBMS_OUTPUT.put_line( rec.cust_num
|| ' ['
|| rec.full_name
|| ']['
|| rec.t_full_name
|| ']');
FOR i IN 1 .. LENGTH(rec.full_name) LOOP
v_ascii := ASCII(SUBSTR(rec.full_name, i, 1));
IF (v_ascii <= 31 OR v_ascii >:= 127) THEN
DBMS_OUTPUT.put_line('offset=' || i || ' [' || v_ascii || ']');
END IF;
END LOOP;
DBMS_OUTPUT.put_line(' ');
END LOOP;
END;
/
1009999380 [MARY ELLEN BROTH][MARY ELLEN BROTH]
offset=12 [127]
1009999363 [ANAZZASIO¿¿¿ JOSEPH ][ANAZZASIO JOSEPH ]
offset=10 [225]
offset=11 [225]
offset=12 [225]
1009999901 [SEWELL PATRICK R ][SEWELL PATRICK R ]
offset=7 [9]
1009999385 [WEINMAN JACKIE _ ][WEINMAN JACKIE ]
offset=36 [0]
Various programming bugs in the source system probably accounts for these non-printable characters. For example, a NUL (null) character might have been embedded by an errant C program, that inadvertently included the string termination character in a memcpy call. The Backspace character might have been included when the data entry clerk tried to erase a misspelling, but the keyboard interpretation settings allowed the Backspace to get captured along with the rest of the string. 0x255 is a common maximum code used as a guard value. Most UI’s are programmed to jump to the next field on a Tab command, but that’s not always the case, as seen here.
Once the offending rows have been identified, they can be cleaned up via a batch program by simply stripping them out. Or perhaps human intervention is required to call up the customer and ask how to correctly spell their name!
No comments:
Post a Comment