Quick discussions on any Oracle topic I feel like covering, and anything else going on in my life I care to share.
Showing posts with label Cardinality. Show all posts
Showing posts with label Cardinality. Show all posts

Saturday, July 16, 2011

Addendum to "Check for Existence" Article

Here is another use for the technique shown in Check for Existence.

A coworker recently asserted all of our tables have both a surrogate key (an otherwise meaningless numerical value obtained from a sequence object, used to uniquely identify a row) and a business key (or keys, typically one or more unique key constraints, each consisting of one or more columns, that uniquely identify a row based on meaningful values). To test his assertion, I looked for tables that have a primary key AND at least one unique key constraint.

Find Tables that have a Primary Key AND at least One Unique Key as well

SELECT iv.table_name, iv.constraint_list
FROM (SELECT ut.table_name
            ,(SELECT CAST(
                       MULTISET(SELECT uc.constraint_name
                                FROM user_constraints uc
                                WHERE uc.table_name = ut.table_name
                                AND   uc.constraint_type IN ('P', 'U')) AS varchar2_nt
                     )
              FROM DUAL)
               constraint_list
      FROM   user_tables ut
           INNER JOIN
             user_constraints uc2
           ON uc2.table_name = ut.table_name
      WHERE uc2.constraint_type = 'P') iv -- can be exactly 0 or 1 pkey per table
WHERE CARDINALITY(iv.constraint_list) > 1
ORDER BY iv.table_name;
Out of 540 tables in our schema, I found just 175 tables having both a surrogate and business key.

I also looked for tables that don't have either a unique or primary key defined. A unique key can fill in for a primary key, and even be used as the target for a foreign key constraint. I used this query to find such tables:

Tables that Don't have a Pkey or Ukey

SELECT table_name
FROM user_tables ut
WHERE NOT EXISTS
        (SELECT 1
         FROM user_constraints uc
         WHERE uc.table_name = ut.table_name
         AND   uc.constraint_type IN ('P', 'U'));
Turns out there are 13 tables that don't have a primary or unique key. These are our Global Temporary Tables, which by definition are small and contain extremely volatile data (goes away when the session closes).

Friday, June 17, 2011

Check for the Existence of a Particular Set of Column Names

How do I check for the existence of a particular set of column names amongst all tables?
It's straight-forward how to find one column name:
SELECT utc.table_name
FROM user_tab_columns utc
WHERE utc.column_name = 'FUND_ID'
ORDER BY utc.table_name;

TABLE_NAME
------------------------------
ANNUITY_ALLOCATION
ANNUITY_FEE
ASSIGNMENT
BENCHMARK_PROCESSING
BENCHMARK_WEIGHT
CATEGORY_LINK
COLLECTION_LINK
COMPOSITE_LINK
DAILY_ASSET
DAILY_PROFILE
DAILY_PROFILE_EXTRA
...
If I need to find two such column names, I suppose I could do a self-join:
SELECT t1.table_name, t1.column_name, t2.column_name
FROM   user_tab_columns t1
     INNER JOIN
       user_tab_columns t2
     ON t1.table_name = t2.table_name
WHERE t1.column_name = 'FUND_ID'
AND   t2.column_name = 'SECURITY_ID'
ORDER BY t1.table_name;

TABLE_NAME                     COLUMN_NAME          COLUMN_NAME
------------------------------ -------------------- --------------------
EXCEPTION_STORE                FUND_ID              SECURITY_ID
SECURITY                       FUND_ID              SECURITY_ID
V_DATA_EXCEPTION               FUND_ID              SECURITY_ID
V_HOLDING_RATE                 FUND_ID              SECURITY_ID
V_HOLDING_REPORT               FUND_ID              SECURITY_ID
V_TRANS_REPORT                 FUND_ID              SECURITY_ID
But how I do extend this to 3, 4 or 5 columns? Simply adding self-joins doesn't seem to scale particularly well.

What I really need is to obtain the set of all column names as a single column per table name, then somehow cross-reference those against a set of desired column names.

Fortunately, Oracle supports the concept of a multiset. This operator allows me to quickly perform set operations between two sets.

To do this, I first need a schema-level collection type:
CREATE OR REPLACE TYPE varchar2_nt IS TABLE OF VARCHAR2(4000);
/
4,000 bytes of course being the maximum allowable string size in SQL.

Now, how do I get the set of all column names for each table? Like this:
SELECT ut.table_name
      ,(SELECT CAST(
                 MULTISET(SELECT utc.column_name
                          FROM user_tab_columns utc
                          WHERE utc.table_name = ut.table_name) AS varchar2_nt
               )
        FROM DUAL)
         column_list
FROM user_tables ut
ORDER BY ut.table_name;

TABLE_NAME                     COLUMN_LIST
------------------------------ ----------------------------------------
ACCOUNT                        VARCHAR2_NT('IS_NONAGS', 'ACCOUNT_ID', '
                               EXTERNAL_ID', 'NAME', 'DESCRIPTION', 'AC
                               COUNT_CATEGORY', 'RECONCILIATION_RULE',
                               'PERIOD_BASED', 'FORCE_SHARE_CLASS')

ACCOUNT_GROUP                  VARCHAR2_NT('ACCOUNT_GROUP_ID', 'GROUP_N
                               AME')

ACCOUNT_GROUP_LINK             VARCHAR2_NT('ACCOUNT_GROUP_ID', 'ACCOUNT
                               _ID', 'SORT_ORDER')
The input to MULTISET in this situation is a query. If the query returns no rows, you get an empty collection. The subquery-in-a-column is called a scalar subquery. It must return precisely one value (or NULL). The collection of column names is that one value!

Now that I have a collection of column names per table, how can I scan it for one, two or more column names all at once?

I can perform the intersection of two such collections. The constructor for my schema-level type can be used to provide a hard-coded list, for example:
varchar2_nt('FUND_ID', 'SECURITY_ID', 'END_DATE')
Now I can do a MULTISET INTERSECT on this set and each column name set:
SELECT ut.table_name
      ,(SELECT CAST(
                 MULTISET(SELECT utc.column_name
                          FROM user_tab_columns utc
                          WHERE utc.table_name = ut.table_name) AS varchar2_nt
               )
        FROM DUAL)
         MULTISET INTERSECT varchar2_nt(
                              'FUND_ID'
                             ,'SECURITY_ID'
                             ,'END_DATE'
                            )
         column_list
FROM user_tables ut
ORDER BY ut.table_name;

TABLE_NAME                     COLUMN_LIST
------------------------------ ----------------------------------------
ACCOUNT                        VARCHAR2_NT()
ACCOUNT_GROUP                  VARCHAR2_NT()
ACCOUNT_GROUP_LINK             VARCHAR2_NT()
ACCOUNT_MAP                    VARCHAR2_NT()
ACCOUNT_SUMMARY                VARCHAR2_NT()
ACCOUNT_VECTORS                VARCHAR2_NT()
ACCOUNT_VIEW                   VARCHAR2_NT()
ALERT                          VARCHAR2_NT()
ALERT_USER                     VARCHAR2_NT()
ANNUITY_ALLOCATION             VARCHAR2_NT('FUND_ID')
ANNUITY_FEE                    VARCHAR2_NT('FUND_ID')
This unfortunately gives me an empty collection for those tables without any of these columns, and partial results for 1, 2, or 3 columns. I need some way to discern the fully populated collections from the rest.

For this, I can use the CARDINALITY function. I want only those tables that have all three columns:
SELECT *
FROM (SELECT ut.table_name
            ,(SELECT CARDINALITY(
                       CAST(
                         MULTISET(SELECT utc.column_name
                                  FROM user_tab_columns utc
                                  WHERE utc.table_name = ut.table_name) AS varchar2_nt
                       )
                         MULTISET INTERSECT varchar2_nt(
                                              'PORTFOLIO_ID'
                                             ,'SECURITY_ID'
                                             ,'END_DATE'
                                            )
                     )
              FROM DUAL)
               num_columns
      FROM user_tables ut) t
WHERE t.num_columns = 3
ORDER BY t.table_name;

TABLE_NAME                     NUM_COLUMNS
------------------------------ -----------
SECURITY_ENHANCEMENT                     3
If I were doing this in a PL/SQL block, I'd avoid the hard-coded constant by defining the collection once:
DECLARE
  arr_collist varchar2_nt
    := varchar2_nt(
         'PORTFOLIO_ID'
        ,'SECURITY_ID'
        ,'END_DATE'
       );
BEGIN
  FOR rec
    IN (SELECT *
        FROM (SELECT ut.table_name
                    ,(SELECT CARDINALITY(
                               CAST(
                                 MULTISET(
                                   SELECT utc.column_name
                                   FROM user_tab_columns utc
                                   WHERE utc.table_name = ut.table_name
                                 ) AS varchar2_nt
                               )
                                 MULTISET INTERSECT arr_collist
                             )
                      FROM DUAL)
                       num_columns
              FROM user_tables ut) t
        WHERE t.num_columns = CARDINALITY(arr_collist)
        ORDER BY t.table_name) LOOP
    DBMS_OUTPUT.put_line(rec.table_name);
  END LOOP;
END;
/
SECURITY_ENHANCEMENT
Now suppose I want to see which tables share the most number of common column names. I can take the intersection of the self-join of all column lists and compare each table's column list to one another; then sort by the number of common column names.
SELECT table_name1
      ,table_name2
      ,(column_list1 MULTISET INTERSECT column_list2) column_list3
      ,CARDINALITY(column_list1 MULTISET INTERSECT column_list2) card
FROM (SELECT ut1.table_name table_name1
            ,(SELECT CAST(
                       MULTISET(SELECT utc.column_name
                                FROM user_tab_columns utc
                                WHERE utc.table_name = ut1.table_name) AS varchar2_nt
                     )
              FROM DUAL)
               column_list1
            ,t2.table_name2
            ,t2.column_list2
      FROM   user_tables ut1
           INNER JOIN
             (SELECT ut2.table_name table_name2
                    ,(SELECT CAST(
                               MULTISET(
                                 SELECT utc.column_name
                                 FROM user_tab_columns utc
                                 WHERE utc.table_name = ut2.table_name
                               ) AS varchar2_nt
                             )
                      FROM DUAL)
                       column_list2
              FROM user_tables ut2) t2
           ON ut1.table_name != t2.table_name2)
WHERE CARDINALITY((column_list1 MULTISET INTERSECT column_list2)) > 0
ORDER BY card DESC;

TABLE_NAME1                    TABLE_NAME2
------------------------------ ------------------------------
COLUMN_LIST3
--------------------------------------------------------------------------------
      CARD
----------
MONTHLY_RETURN                 MONTHLY_RETURN_MATRIX
VARCHAR2_NT('FUND_ID', 'PROFILE_DATE', 'ROR_1M_C_NOLOAD', 'ROR_1M_G_NOLOAD', 'RO
R_2M_C_NOLOAD', 'ROR_3M_C_NOLOAD', 'ROR_4M_C_NOLOAD', 'ROR_5M_C_NOLOAD', 'ROR_6M
_C_NOLOAD', 'ROR_7M_C_NOLOAD', 'ROR_8M_C_NOLOAD', 'ROR_9M_C_NOLOAD', 'ROR_10M_C_
NOLOAD', 'ROR_11M_C_NOLOAD', 'ROR_YTD_C_NOLOAD', 'ROR_YTD_G_NOLOAD', 'ROR_1Y_C_N
OLOAD', 'ROR_2Y_A_NOLOAD', 'ROR_2Y_C_NOLOAD', 'ROR_3Y_A_NOLOAD', 'ROR_3Y_C_NOLOA
D', 'ROR_4Y_A_NOLOAD', 'ROR_4Y_C_NOLOAD', 'ROR_5Y_A_NOLOAD', 'ROR_5Y_C_NOLOAD',
'ROR_6Y_A_NOLOAD', 'ROR_7Y_A_NOLOAD', 'ROR_8Y_A_NOLOAD', 'ROR_9Y_A_NOLOAD', 'ROR
_10Y_A_NOLOAD', 'ROR_10Y_C_NOLOAD', 'ROR_15Y_C_NOLOAD', 'ROR_15Y_A_NOLOAD', 'ROR
_20Y_C_NOLOAD', 'ROR_20Y_A_NOLOAD', 'ROR_25Y_C_NOLOAD', 'ROR_25Y_A_NOLOAD', 'ROR
_INCP_C_NOLOAD', 'ROR_INCP_A_NOLOAD', 'ROR_ADV_INCP_C_NOLOAD', 'ROR_ADV_INCP_A_N
OLOAD', 'ROR_USER_INCP_C_NOLOAD', 'ROR_USER_INCP_A_NOLOAD', 'ROR_PERF_INCP_C_NOL
OAD', 'ROR_PERF_INCP_A_NOLOAD', 'ROR_1M_C_LOAD', 'ROR_2M_C_LOAD', 'ROR_3M_C_LOAD
', 'ROR_4M_C_LOAD', 'ROR_5M_C_LOAD', 'ROR_6M_C_LOAD', 'ROR_7M_C_LOAD', 'ROR_8M_C
_LOAD', 'ROR_9M_C_LOAD', 'ROR_10M_C_LOAD', 'ROR_11M_C_LOAD', 'ROR_YTD_C_LOAD', '
ROR_1Y_C_LOAD', 'ROR_2Y_A_LOAD', 'ROR_2Y_C_LOAD', 'ROR_3Y_A_LOAD', 'ROR_3Y_C_LOA
D', 'ROR_4Y_A_LOAD', 'ROR_4Y_C_LOAD', 'ROR_5Y_A_LOAD', 'ROR_5Y_C_LOAD', 'ROR_6Y_
A_LOAD', 'ROR_7Y_A_LOAD', 'ROR_8Y_A_LOAD', 'ROR_9Y_A_LOAD', 'ROR_10Y_A_LOAD', 'R
OR_10Y_C_LOAD', 'ROR_15Y_C_LOAD', 'ROR_15Y_A_LOAD', 'ROR_20Y_C_LOAD', 'ROR_20Y_A
_LOAD', 'ROR_25Y_C_LOAD', 'ROR_25Y_A_LOAD', 'ROR_INCP_C_LOAD', 'ROR_INCP_A_LOAD'
, 'ROR_ADV_INCP_C_LOAD', 'ROR_ADV_INCP_A_LOAD', 'ROR_USER_INCP_C_LOAD', 'ROR_USE
R_INCP_A_LOAD', 'ROR_PERF_INCP_C_LOAD', 'ROR_PERF_INCP_A_LOAD', 'RECORD_GUID', '
CREATORDB_ID', 'REVISION_NBR', 'GENERATION_ID', 'REFERENCE_COUNT', 'SYNC_SESSION
_ID')
        92
I do get the same pair of tables listed twice, first as table_name1-table_name2 and then as table_name2-table_name1.

I can also use the MULTISET and related operators and functions directly in assignment statements in PL/SQL. For example:
DECLARE
  x varchar2_nt := varchar2_nt();
BEGIN
  x :=
    varchar2_nt(
      'a'
     ,'b'
     ,'b'
     ,'c'
    )
      MULTISET INTERSECT DISTINCT varchar2_nt(
                                    'b'
                                   ,'c'
                                   ,'c'
                                   ,'d'
                                  );

  IF (x.COUNT > 0) THEN
    DBMS_OUTPUT.put_line('There are ' || x.COUNT || ' common elements:');

    FOR i IN x.FIRST .. x.LAST LOOP
      DBMS_OUTPUT.put_line(x(i));
    END LOOP;
  ELSE
    DBMS_OUTPUT.put_line('There are no common elements');
  END IF;
END;
/
There are 2 common elements:
b
c

If you play around with this, you'll see the intersection of a non-empty set and an empty set is an empty set.

In addition to INTERSECT [DISTINCT], there's UNION [DISTINCT] and EXCEPT [DISTINCT].