Database Technology: how to check statistic value of table in NPS

use nz_get to get the value of statistic status.

Usage: nz_get [ database [ table ]]

Purpose: Get (and display) the statistics for a user table or a system table.

Statistics are used by the NPS optimizer in order to plan the best
way to run each SQL query (for queries that are run against user data,
and also for queries that are run against the system catalogs).

Some of these statistics are automatically maintained by the system
(i.e., 'always-there' statistics). Other values are dynamically
calculated+saved when a "GENERATE [EXPRESS] STATISTICS" command is
issued.

This script will dump out all of the known statistics for a table.

Inputs: The database and table names are optional. If not specified, then
all tables in all databases will be reported upon.

In lieu of a table name, you may instead specify a synonym name or
materialized view name -- in which case the statistics for the
underlying table will be displayed.

The table name may also be any one of the "SYSTEM TABLE" or
"MANAGEMENT TABLE" names.

o For a list of these names, see 'nz_get_sysmgmt_table_names'

o By default, normal (non-ADMIN) users do not have access
to any of these SYSTEM/MANAGEMENT tables. They are only
allowed to access them indirectly (thru SYSTEM/MANAGEMENT
views).

o Some of these tables are global in nature, so it does not
matter which database name you specify on the command line.
The results will always be the same. e.g. _VT_SPU_ZMAP_INFO

o Some of these tables are local in nature (database specific)
so the database you specify is relevant. e.g. _T_ATTRIBUTE

o If you wanted to look at the statistics for all of the
SYSTEM/MANAGEMENT tables, you could issue a command such as this:

for TABLE in `nz_get_sysmgmt_table_names system`; do nz_get system $TABLE; done

Outputs: The output of this script will include the following information

Database: name, objid

Table: name, objid, distribution clause, row count(statistic)

Per-Column: attnum the logical column number (from 1-1600)
Column Name
Statistics Status the validity/freshness of the statistics
Minimum Value
Maximum Value
# of Unique Values also known as the dispersion value
# of NULLs
MaxLen The MaxLen+AvgLen are only collected for
AvgLen columns of type VARCHAR, NCHAR, NVARCHAR

Regarding the "Statistic Status" ... the following indicates that statistics
have been explicitly generated against this table/column (via a GENSTATs
command) and that they are 100% up-to-date. The ONLY difference between the
three is in how the "# of Unique Values" (the dispersion) is generated.
The disperion value is the most complex statistic to gather, in terms of time +
memory + cpu usage.

Full Similar to doing a "COUNT(DISTINCT columname)". It generates the
most accurate dispersion value, but has the most overhead associated
with it.
Express It uses math to estimate the dispersion value (a hash is generated
for each column value, and then the number of unique hash keys is
added up). Much faster, but less precise.
Basic The dispersion calculation is skipped entirely. All of the other
"basic" statistics are still collected.

So, statistics are either 100% correct and up-to-date, or they're not. Any change
to a table ... even if it involves just 1 row being inserted/updated/deleted ...
results in the statistics no longer being 100% up-to-date. In other words, they
would be outdated. This doesn't mean that they would be bad or wrong. Just outdated.

When rows are nzload'ed/INSERT'ed into a table, NPS automatically compares
each column value against the MIN/MAX statistic for the column, and updates the
table's statistics accordingly. This is also referred to as "always there
statistics". This applies to all columns/all datatypes -- EXCEPT for text
columns (CHAR, NCHAR, VARCHAR, NVARCHAR). So that means that the MIN/MAX values
for the column are still OK -- they are still up-to-date (at the same time, the
table's rowcount statistic is also updated). But the other statistics values
(the dispersion and # of nulls) are not up-to-date as they can only be recomputed
via an explicit GENSTATS. In this case, the "Statistics Status" will show

Full Min/Max OK
Express Min/Max OK
Basic Min/Max OK

As mentioned, this doesn't apply to text columns. So all of the statistics that
are maintained for text columns would be outdated ... and can only be refreshed
via an explicit GENSTATS. In this case, the "Statistics Status" will show

Full Outdated
Express Outdated
Basic Outdated

If you never ever bothered to do a GENSTATS on a particular table, then the only
per-column statistics that would be available are the "always there statistics"
that NPS automatically collects. So this means you will have MIN/MAX statistics
(and only MIN/MAX statistics) for those columns. Which will be displayed as

Min/Max Only

As mentioned, "always there statistics" aren't collected for text columns. So
if you've never done an explicit GENSTATS ... and if NPS has never automatically
collected any statistics ... then there will be no statistics at all available
to the optimizer. In this case the "Statistics Status" will show

Unavailable

Other situations in which "Unavailable" will be used
o if the table is empty
o if you've loaded a compressed external file into an empty table

We don't attempt to collect statistics on very wide text columns, which are
CHAR/VARCHAR columns with a defined length >= 24565
NCHAR/NVARCHAR columns with a defined length >= 6142
The "Statistics Status" for those columns will be displayed as

not maintained

Regarding "always there statistics" ...

when rows are nzload'ed/INSERT'ed into a table we can easily compare the column
values against the MIN+MAX statistics, and update them accordingly (if needed)

when rows are DELETE'd from a table, even though you might be deleting a row
that matches a particular MIN value or a particular MAX value, there is no
way to answer the following
a) is this the only row in the table with that particular value ?
b) and if so, what is the next MIN value or the next MAX value ?
The only way to determine those answers is via a GENSTATS (which processes
every row in the table). Thus, a DELETE operation never shrinks the
MIN/MAX values. But we still know that we can use+trust those statistics ...
that there is no value in the table column that is outside the current
MIN/MAX range.

when rows are UPDATE'd, we actually do an INSERT+DELETE. See above.

Regarding GENERATE [EXPRESS] STATISTICS ...

as of 4.6, there is really only one GENERATE STATISTICS command now. Both
GENERATE STATISTICS and GENERATE EXPRESS STATISTICS do the same thing.
GENERATE EXPRESS STATISTICS will someday be retired.

the only difference between the two has been in how they calculate the
"# of Unique Values" ... the dispersion value ... for a column. That
still occurs, but now NPS decides for itself which method to use.

If the table's rowcount is <= 10 * SPU_COUNT
Then FULL stats will be collected
where we actually try to do a COUNT(DISTINCT on_each_column)
# Of Columns Processed Per Scan: set STATS_COL_GRP_LIMIT = 10;

If the table's rowcount is <= JIT_DISP_MIN_ROWS (the default is 500,000,000)
Then EXPRESS stats will be collected
where we hash the column values to approximate the dispersion value
# Of Columns Processed Per Scan: set SAMPLED_STATS_COL_GRP_LIMIT = 50;

Else we skip the dispersion calculation alltogether (all other statistics
for the column will still be gathered)
# Of Columns Processed Per Scan: currently unbounded

Database Technology

Friday, October 24, 2014

how to check statistic value of table in NPS

No comments:

Post a Comment