Quick discussions on any Oracle topic I feel like covering, and anything else going on in my life I care to share.

Friday, June 17, 2011

Check for the Existence of a Particular Set of Column Names

How do I check for the existence of a particular set of column names amongst all tables?
It's straight-forward how to find one column name:
SELECT utc.table_name
FROM user_tab_columns utc
WHERE utc.column_name = 'FUND_ID'
ORDER BY utc.table_name;

TABLE_NAME
------------------------------
ANNUITY_ALLOCATION
ANNUITY_FEE
ASSIGNMENT
BENCHMARK_PROCESSING
BENCHMARK_WEIGHT
CATEGORY_LINK
COLLECTION_LINK
COMPOSITE_LINK
DAILY_ASSET
DAILY_PROFILE
DAILY_PROFILE_EXTRA
...
If I need to find two such column names, I suppose I could do a self-join:
SELECT t1.table_name, t1.column_name, t2.column_name
FROM   user_tab_columns t1
     INNER JOIN
       user_tab_columns t2
     ON t1.table_name = t2.table_name
WHERE t1.column_name = 'FUND_ID'
AND   t2.column_name = 'SECURITY_ID'
ORDER BY t1.table_name;

TABLE_NAME                     COLUMN_NAME          COLUMN_NAME
------------------------------ -------------------- --------------------
EXCEPTION_STORE                FUND_ID              SECURITY_ID
SECURITY                       FUND_ID              SECURITY_ID
V_DATA_EXCEPTION               FUND_ID              SECURITY_ID
V_HOLDING_RATE                 FUND_ID              SECURITY_ID
V_HOLDING_REPORT               FUND_ID              SECURITY_ID
V_TRANS_REPORT                 FUND_ID              SECURITY_ID
But how I do extend this to 3, 4 or 5 columns? Simply adding self-joins doesn't seem to scale particularly well.

What I really need is to obtain the set of all column names as a single column per table name, then somehow cross-reference those against a set of desired column names.

Fortunately, Oracle supports the concept of a multiset. This operator allows me to quickly perform set operations between two sets.

To do this, I first need a schema-level collection type:
CREATE OR REPLACE TYPE varchar2_nt IS TABLE OF VARCHAR2(4000);
/
4,000 bytes of course being the maximum allowable string size in SQL.

Now, how do I get the set of all column names for each table? Like this:
SELECT ut.table_name
      ,(SELECT CAST(
                 MULTISET(SELECT utc.column_name
                          FROM user_tab_columns utc
                          WHERE utc.table_name = ut.table_name) AS varchar2_nt
               )
        FROM DUAL)
         column_list
FROM user_tables ut
ORDER BY ut.table_name;

TABLE_NAME                     COLUMN_LIST
------------------------------ ----------------------------------------
ACCOUNT                        VARCHAR2_NT('IS_NONAGS', 'ACCOUNT_ID', '
                               EXTERNAL_ID', 'NAME', 'DESCRIPTION', 'AC
                               COUNT_CATEGORY', 'RECONCILIATION_RULE',
                               'PERIOD_BASED', 'FORCE_SHARE_CLASS')

ACCOUNT_GROUP                  VARCHAR2_NT('ACCOUNT_GROUP_ID', 'GROUP_N
                               AME')

ACCOUNT_GROUP_LINK             VARCHAR2_NT('ACCOUNT_GROUP_ID', 'ACCOUNT
                               _ID', 'SORT_ORDER')
The input to MULTISET in this situation is a query. If the query returns no rows, you get an empty collection. The subquery-in-a-column is called a scalar subquery. It must return precisely one value (or NULL). The collection of column names is that one value!

Now that I have a collection of column names per table, how can I scan it for one, two or more column names all at once?

I can perform the intersection of two such collections. The constructor for my schema-level type can be used to provide a hard-coded list, for example:
varchar2_nt('FUND_ID', 'SECURITY_ID', 'END_DATE')
Now I can do a MULTISET INTERSECT on this set and each column name set:
SELECT ut.table_name
      ,(SELECT CAST(
                 MULTISET(SELECT utc.column_name
                          FROM user_tab_columns utc
                          WHERE utc.table_name = ut.table_name) AS varchar2_nt
               )
        FROM DUAL)
         MULTISET INTERSECT varchar2_nt(
                              'FUND_ID'
                             ,'SECURITY_ID'
                             ,'END_DATE'
                            )
         column_list
FROM user_tables ut
ORDER BY ut.table_name;

TABLE_NAME                     COLUMN_LIST
------------------------------ ----------------------------------------
ACCOUNT                        VARCHAR2_NT()
ACCOUNT_GROUP                  VARCHAR2_NT()
ACCOUNT_GROUP_LINK             VARCHAR2_NT()
ACCOUNT_MAP                    VARCHAR2_NT()
ACCOUNT_SUMMARY                VARCHAR2_NT()
ACCOUNT_VECTORS                VARCHAR2_NT()
ACCOUNT_VIEW                   VARCHAR2_NT()
ALERT                          VARCHAR2_NT()
ALERT_USER                     VARCHAR2_NT()
ANNUITY_ALLOCATION             VARCHAR2_NT('FUND_ID')
ANNUITY_FEE                    VARCHAR2_NT('FUND_ID')
This unfortunately gives me an empty collection for those tables without any of these columns, and partial results for 1, 2, or 3 columns. I need some way to discern the fully populated collections from the rest.

For this, I can use the CARDINALITY function. I want only those tables that have all three columns:
SELECT *
FROM (SELECT ut.table_name
            ,(SELECT CARDINALITY(
                       CAST(
                         MULTISET(SELECT utc.column_name
                                  FROM user_tab_columns utc
                                  WHERE utc.table_name = ut.table_name) AS varchar2_nt
                       )
                         MULTISET INTERSECT varchar2_nt(
                                              'PORTFOLIO_ID'
                                             ,'SECURITY_ID'
                                             ,'END_DATE'
                                            )
                     )
              FROM DUAL)
               num_columns
      FROM user_tables ut) t
WHERE t.num_columns = 3
ORDER BY t.table_name;

TABLE_NAME                     NUM_COLUMNS
------------------------------ -----------
SECURITY_ENHANCEMENT                     3
If I were doing this in a PL/SQL block, I'd avoid the hard-coded constant by defining the collection once:
DECLARE
  arr_collist varchar2_nt
    := varchar2_nt(
         'PORTFOLIO_ID'
        ,'SECURITY_ID'
        ,'END_DATE'
       );
BEGIN
  FOR rec
    IN (SELECT *
        FROM (SELECT ut.table_name
                    ,(SELECT CARDINALITY(
                               CAST(
                                 MULTISET(
                                   SELECT utc.column_name
                                   FROM user_tab_columns utc
                                   WHERE utc.table_name = ut.table_name
                                 ) AS varchar2_nt
                               )
                                 MULTISET INTERSECT arr_collist
                             )
                      FROM DUAL)
                       num_columns
              FROM user_tables ut) t
        WHERE t.num_columns = CARDINALITY(arr_collist)
        ORDER BY t.table_name) LOOP
    DBMS_OUTPUT.put_line(rec.table_name);
  END LOOP;
END;
/
SECURITY_ENHANCEMENT
Now suppose I want to see which tables share the most number of common column names. I can take the intersection of the self-join of all column lists and compare each table's column list to one another; then sort by the number of common column names.
SELECT table_name1
      ,table_name2
      ,(column_list1 MULTISET INTERSECT column_list2) column_list3
      ,CARDINALITY(column_list1 MULTISET INTERSECT column_list2) card
FROM (SELECT ut1.table_name table_name1
            ,(SELECT CAST(
                       MULTISET(SELECT utc.column_name
                                FROM user_tab_columns utc
                                WHERE utc.table_name = ut1.table_name) AS varchar2_nt
                     )
              FROM DUAL)
               column_list1
            ,t2.table_name2
            ,t2.column_list2
      FROM   user_tables ut1
           INNER JOIN
             (SELECT ut2.table_name table_name2
                    ,(SELECT CAST(
                               MULTISET(
                                 SELECT utc.column_name
                                 FROM user_tab_columns utc
                                 WHERE utc.table_name = ut2.table_name
                               ) AS varchar2_nt
                             )
                      FROM DUAL)
                       column_list2
              FROM user_tables ut2) t2
           ON ut1.table_name != t2.table_name2)
WHERE CARDINALITY((column_list1 MULTISET INTERSECT column_list2)) > 0
ORDER BY card DESC;

TABLE_NAME1                    TABLE_NAME2
------------------------------ ------------------------------
COLUMN_LIST3
--------------------------------------------------------------------------------
      CARD
----------
MONTHLY_RETURN                 MONTHLY_RETURN_MATRIX
VARCHAR2_NT('FUND_ID', 'PROFILE_DATE', 'ROR_1M_C_NOLOAD', 'ROR_1M_G_NOLOAD', 'RO
R_2M_C_NOLOAD', 'ROR_3M_C_NOLOAD', 'ROR_4M_C_NOLOAD', 'ROR_5M_C_NOLOAD', 'ROR_6M
_C_NOLOAD', 'ROR_7M_C_NOLOAD', 'ROR_8M_C_NOLOAD', 'ROR_9M_C_NOLOAD', 'ROR_10M_C_
NOLOAD', 'ROR_11M_C_NOLOAD', 'ROR_YTD_C_NOLOAD', 'ROR_YTD_G_NOLOAD', 'ROR_1Y_C_N
OLOAD', 'ROR_2Y_A_NOLOAD', 'ROR_2Y_C_NOLOAD', 'ROR_3Y_A_NOLOAD', 'ROR_3Y_C_NOLOA
D', 'ROR_4Y_A_NOLOAD', 'ROR_4Y_C_NOLOAD', 'ROR_5Y_A_NOLOAD', 'ROR_5Y_C_NOLOAD',
'ROR_6Y_A_NOLOAD', 'ROR_7Y_A_NOLOAD', 'ROR_8Y_A_NOLOAD', 'ROR_9Y_A_NOLOAD', 'ROR
_10Y_A_NOLOAD', 'ROR_10Y_C_NOLOAD', 'ROR_15Y_C_NOLOAD', 'ROR_15Y_A_NOLOAD', 'ROR
_20Y_C_NOLOAD', 'ROR_20Y_A_NOLOAD', 'ROR_25Y_C_NOLOAD', 'ROR_25Y_A_NOLOAD', 'ROR
_INCP_C_NOLOAD', 'ROR_INCP_A_NOLOAD', 'ROR_ADV_INCP_C_NOLOAD', 'ROR_ADV_INCP_A_N
OLOAD', 'ROR_USER_INCP_C_NOLOAD', 'ROR_USER_INCP_A_NOLOAD', 'ROR_PERF_INCP_C_NOL
OAD', 'ROR_PERF_INCP_A_NOLOAD', 'ROR_1M_C_LOAD', 'ROR_2M_C_LOAD', 'ROR_3M_C_LOAD
', 'ROR_4M_C_LOAD', 'ROR_5M_C_LOAD', 'ROR_6M_C_LOAD', 'ROR_7M_C_LOAD', 'ROR_8M_C
_LOAD', 'ROR_9M_C_LOAD', 'ROR_10M_C_LOAD', 'ROR_11M_C_LOAD', 'ROR_YTD_C_LOAD', '
ROR_1Y_C_LOAD', 'ROR_2Y_A_LOAD', 'ROR_2Y_C_LOAD', 'ROR_3Y_A_LOAD', 'ROR_3Y_C_LOA
D', 'ROR_4Y_A_LOAD', 'ROR_4Y_C_LOAD', 'ROR_5Y_A_LOAD', 'ROR_5Y_C_LOAD', 'ROR_6Y_
A_LOAD', 'ROR_7Y_A_LOAD', 'ROR_8Y_A_LOAD', 'ROR_9Y_A_LOAD', 'ROR_10Y_A_LOAD', 'R
OR_10Y_C_LOAD', 'ROR_15Y_C_LOAD', 'ROR_15Y_A_LOAD', 'ROR_20Y_C_LOAD', 'ROR_20Y_A
_LOAD', 'ROR_25Y_C_LOAD', 'ROR_25Y_A_LOAD', 'ROR_INCP_C_LOAD', 'ROR_INCP_A_LOAD'
, 'ROR_ADV_INCP_C_LOAD', 'ROR_ADV_INCP_A_LOAD', 'ROR_USER_INCP_C_LOAD', 'ROR_USE
R_INCP_A_LOAD', 'ROR_PERF_INCP_C_LOAD', 'ROR_PERF_INCP_A_LOAD', 'RECORD_GUID', '
CREATORDB_ID', 'REVISION_NBR', 'GENERATION_ID', 'REFERENCE_COUNT', 'SYNC_SESSION
_ID')
        92
I do get the same pair of tables listed twice, first as table_name1-table_name2 and then as table_name2-table_name1.

I can also use the MULTISET and related operators and functions directly in assignment statements in PL/SQL. For example:
DECLARE
  x varchar2_nt := varchar2_nt();
BEGIN
  x :=
    varchar2_nt(
      'a'
     ,'b'
     ,'b'
     ,'c'
    )
      MULTISET INTERSECT DISTINCT varchar2_nt(
                                    'b'
                                   ,'c'
                                   ,'c'
                                   ,'d'
                                  );

  IF (x.COUNT > 0) THEN
    DBMS_OUTPUT.put_line('There are ' || x.COUNT || ' common elements:');

    FOR i IN x.FIRST .. x.LAST LOOP
      DBMS_OUTPUT.put_line(x(i));
    END LOOP;
  ELSE
    DBMS_OUTPUT.put_line('There are no common elements');
  END IF;
END;
/
There are 2 common elements:
b
c

If you play around with this, you'll see the intersection of a non-empty set and an empty set is an empty set.

In addition to INTERSECT [DISTINCT], there's UNION [DISTINCT] and EXCEPT [DISTINCT].

No comments:

Post a Comment