Quick discussions on any Oracle topic I feel like covering, and anything else going on in my life I care to share.

Tuesday, February 28, 2012

Non-Printable Characters

One of our vendors rejected a file we sent them because it had an non-printable ASCII character in it (one record out of tens of thousands). Once I tracked down the offending customer row from the file offset they provided, it seemed like a good idea to see what other similar data might also have non-printable characters embedded in them.

The printable and some non-printable characters are given in the ASCII table. The latter can vary by the character set in use (particularly the ones over 127d). You can find a number of these on the internet:

http://www.asciitable.com/

One way to detect non-printable characters is to strip them out of the string and compare that to the untranslated string; if they’re different, then I’ve found one.

I assembled a string of unprintable characters following a space. This is the source format string. The target format string only contains a space.

I also wanted to display the offset in the string of the offending character, and its decimal equivalent value.

DECLARE
  v_strip VARCHAR2(32767) := ' ';
  v_ascii INTEGER;

  CURSOR cur
  IS
    SELECT c.cust_num
          ,c.full_name
          ,TRANSLATE(c.full_name, v_strip, ' ') t_full_name
    FROM cust c
    WHERE TRANSLATE(c.full_name, v_strip, ' ') != c.full_name
    ORDER BY c.cust_num;
BEGIN
  -- list of non-printable characters
  FOR i IN 0 .. 31 LOOP
    v_strip := v_strip || CHR(i);
  END LOOP;

  FOR i IN 127 .. 255 LOOP
    v_strip := v_strip || CHR(i);
  END LOOP;

  -- display full names with embedded non-printable chars
  FOR rec IN cur LOOP
    DBMS_OUTPUT.put_line(   rec.cust_num
                         || ' ['
                         || rec.full_name
                         || ']['
                         || rec.t_full_name
                         || ']');

    FOR i IN 1 .. LENGTH(rec.full_name) LOOP
      v_ascii := ASCII(SUBSTR(rec.full_name, i, 1));

      IF (v_ascii <= 31 OR v_ascii >:= 127) THEN
        DBMS_OUTPUT.put_line('offset=' || i || ' [' || v_ascii || ']');
      END IF;
    END LOOP;

    DBMS_OUTPUT.put_line(' ');
  END LOOP;
END;
/

1009999380 [MARY ELLEN BROTH][MARY ELLEN  BROTH]
offset=12 [127]

1009999363 [ANAZZASIO¿¿¿        JOSEPH              ][ANAZZASIO        JOSEPH              ]
offset=10 [225]
offset=11 [225]
offset=12 [225]

1009999901 [SEWELL                PATRICK        R              ][SEWELL             PATRICK        R              ]
offset=7 [9]

1009999385 [WEINMAN             JACKIE         _              ][WEINMAN             JACKIE                       ]
offset=36 [0]

Various programming bugs in the source system probably accounts for these non-printable characters. For example, a NUL (null) character might have been embedded by an errant C program, that inadvertently included the string termination character in a memcpy call. The Backspace character might have been included when the data entry clerk tried to erase a misspelling, but the keyboard interpretation settings allowed the Backspace to get captured along with the rest of the string. 0x255 is a common maximum code used as a guard value. Most UI’s are programmed to jump to the next field on a Tab command, but that’s not always the case, as seen here.

Once the offending rows have been identified, they can be cleaned up via a batch program by simply stripping them out. Or perhaps human intervention is required to call up the customer and ask how to correctly spell their name!