- Perhaps paranoid, but created after constantly burning myself while
creating and evaluating datasets.
- Run through the checklist and document all the answers in the evaluation.
- Save a dated version of the run-through on Quip or the internal wiki.
- Consider automating any tests that can be.
- Consider automatically generating these queries given the table structure; a sketch of this is below.
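
If the checks are generated rather than hand-written, a small helper can emit one query per column from the table's structure. The sketch below is illustrative only: the `events` table, the column list, and the type names are assumptions, and the SQL is generic rather than tied to any particular warehouse.

```python
# Illustrative sketch: emit per-column sanity-check queries from a table's
# structure. Table name, column list, and type names are assumptions.
NUMERIC_TYPES = {"int", "bigint", "float", "double", "decimal"}

def check_queries(table, columns):
    """columns: iterable of (name, type) pairs, e.g. pulled from the catalog."""
    queries = [f"SELECT COUNT(*) AS row_count FROM {table}"]
    for name, col_type in columns:
        queries.append(f"SELECT COUNT(*) AS null_count FROM {table} WHERE {name} IS NULL")
        queries.append(f"SELECT COUNT(DISTINCT {name}) AS distinct_count FROM {table}")
        if col_type in NUMERIC_TYPES:
            queries.append(f"SELECT MIN({name}) AS min_val, MAX({name}) AS max_val FROM {table}")
    return queries

if __name__ == "__main__":
    for query in check_queries("events", [("user_id", "bigint"), ("duration_ms", "bigint")]):
        print(query)
```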
- Table Characteristics (example checks are sketched after this section)
  - [ ] Does the table cover the expected number of <units>?
  - [ ] Is the total number of rows as expected?
  - [ ] Is this direct instrumentation or a derived table?
    - Derived Table
      - [ ] Manually transform a small sample of rows and check that the results match.
    - Direct Instrumentation
      - [ ] Can any numbers be determined from alternative sources? If yes, run and document sanity checks.
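
A minimal sketch of the table-level checks above, assuming the table fits in memory as a pandas DataFrame and has a `unit_id` column (both assumptions). The derived-table spot check also assumes the derivation is a per-unit row count; substitute the real transformation.

```python
import pandas as pd

def table_characteristics(df: pd.DataFrame, expected_units: int, expected_rows: int) -> None:
    # Compare coverage and volume against expectations; note both in the evaluation.
    print(f"distinct units: {df['unit_id'].nunique()} (expected ~{expected_units})")
    print(f"total rows:     {len(df)} (expected ~{expected_rows})")

def spot_check_derived(source: pd.DataFrame, derived: pd.DataFrame, n: int = 20) -> None:
    # Re-derive a small sample by hand and compare against the derived table.
    # The derivation here is assumed to be a per-unit row count.
    sample_ids = derived["unit_id"].drop_duplicates().sample(n, random_state=0)
    expected = source[source["unit_id"].isin(sample_ids)].groupby("unit_id").size()
    actual = derived.set_index("unit_id").loc[expected.index, "row_count"]
    print((expected == actual).value_counts())
```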
- Column Relationships
  - [ ] Are assertions possible on individual columns? If so, apply them (see the sketch below).
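
Where invariants are known, they can be written down as executable assertions rather than ad-hoc queries. A small sketch, assuming a pandas DataFrame with hypothetical `user_id`, `duration_ms`, and `event_type` columns:

```python
import pandas as pd

def assert_column_invariants(df: pd.DataFrame) -> None:
    # Illustrative assertions only; replace with the real invariants of the table.
    assert df["user_id"].notna().all(), "user_id must never be NULL"
    assert (df["duration_ms"].dropna() >= 0).all(), "durations must be non-negative"
    allowed = {"view", "click", "purchase"}  # hypothetical enum values
    assert df["event_type"].isin(allowed).all(), "unexpected event_type values"
```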
- Column Characteristics
  - General (example checks below)
    - [ ] Are there instrumentation-related artifacts in the dataset? Explain them.
    - [ ] How many values are NULL? Are they expected?
    - [ ] Is this column derived from other columns in this table? Sanity-check the calculations.
    - [ ] If calculated, run a manual query comparing the values and the distribution of the result.
    - [ ] If lookup-related, check that the numbers on both sides of the mapping make sense.
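
A sketch of the general column checks: a NULL report per column, and a recomputation of a calculated column so both the values and the distribution can be compared. The `duration_ms = end_ts - start_ts` derivation is purely an assumption for illustration.

```python
import pandas as pd

def null_report(df: pd.DataFrame) -> None:
    # Count NULLs per column; document whether each non-zero count is expected.
    nulls = df.isna().sum().sort_values(ascending=False)
    print(nulls[nulls > 0])

def recheck_calculated(df: pd.DataFrame) -> None:
    # Recompute an assumed derived column and compare values and distribution.
    recomputed = (df["end_ts"] - df["start_ts"]).dt.total_seconds() * 1000
    mismatches = (recomputed - df["duration_ms"]).abs() > 1
    print(f"rows where stored and recomputed values disagree: {mismatches.sum()}")
    print(recomputed.describe())  # compare against the stored column's describe()
```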
  - User Ids
    - [ ] Do the user ids satisfy the user id requirements?
    - [ ] Is the number of distinct users within reason?
  - Normal Columns (example below)
    - [ ] Does the column have the expected number of distinct values?
    - [ ] For enum-like columns, are there corrupted values?
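
A sketch for the distinct-value and enum checks; the same `nunique` call also answers the distinct-user question above. The allowed-value set is an assumption.

```python
import pandas as pd

def distinct_value_report(df: pd.DataFrame, column: str, expected_distinct: int) -> None:
    print(f"{column}: {df[column].nunique()} distinct values (expected ~{expected_distinct})")

def enum_corruption_report(df: pd.DataFrame, column: str, allowed: set) -> None:
    # List values outside the allowed set, e.g. truncated or mis-encoded strings.
    bad = df.loc[df[column].notna() & ~df[column].isin(allowed), column]
    print(f"{column}: {len(bad)} rows with unexpected values")
    print(bad.value_counts().head(10))
```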
  - Numerical Columns (profiling sketch below)
    - [ ] What are the min/max values? Do they make sense?
    - [ ] What does the distribution of the data look like? Normal, bimodal, etc.? Does it match the expected distribution?
    - [ ] Note the min/p1/p25/p50/mean/mode/p75/p90/p99/max values.
    - [ ] Are there physical constraints on the column (e.g. only positive values)? Are they satisfied?
    - [ ] Should NULL values be coerced to zero or excluded? Are potential queries updated accordingly?
    - [ ] If the column represents a quantity, like time or energy, are the units documented in the column? Are the values consistent with those units?
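
A sketch of the numerical profile: extremes, the listed percentiles, a coarse view of the distribution's shape, and a count of physical-constraint violations. Whether NULLs are dropped or coerced to zero is made explicit so the choice is visible in the write-up.

```python
import pandas as pd

def numeric_profile(df: pd.DataFrame, column: str, must_be_positive: bool = False) -> None:
    s = df[column].dropna()          # or df[column].fillna(0) if NULL means zero
    print(f"{column}: n={len(s)}, mean={s.mean():.3f}, mode={s.mode().iloc[0]}")
    print(s.quantile([0, 0.01, 0.25, 0.5, 0.75, 0.9, 0.99, 1.0]))
    # Coarse histogram: eyeball the shape (normal, bimodal, heavy-tailed, ...).
    print(s.value_counts(bins=20).sort_index())
    if must_be_positive:
        print(f"rows violating the positivity constraint: {(s <= 0).sum()}")
```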
- Outliers (sampling sketch below)
  - [ ] Explicitly collect samples with outlier values across different columns, preferably those with the most outliers.
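
A sketch for collecting outlier rows for one column, keeping full rows from both tails so they can be inspected and explained. The 1st/99th percentile cut-offs are an arbitrary assumption; tighten them as needed.

```python
import pandas as pd

def outlier_samples(df: pd.DataFrame, column: str, n: int = 20) -> pd.DataFrame:
    lo, hi = df[column].quantile([0.01, 0.99])
    tails = df[(df[column] < lo) | (df[column] > hi)].sort_values(column)
    # Keep a few rows from each extreme so both tails get looked at.
    return pd.concat([tails.head(n // 2), tails.tail(n // 2)])
```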
- Time Series Data (volume sketch below)
  - [ ] Is the volume consistent across multiple dates?
  - [ ] Is the volume consistent within a day?
  - [ ] Do troughs and peaks correspond to user behaviour? Compare the troughs and peaks.
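
A sketch of the volume checks, assuming an event-level table with a timestamp column named `ts` (an assumption; substitute the real column):

```python
import pandas as pd

def volume_by_day(df: pd.DataFrame) -> pd.Series:
    # Daily row counts: compare consecutive dates for unexplained jumps or drops.
    return df.groupby(df["ts"].dt.date).size()

def volume_by_hour(df: pd.DataFrame) -> pd.Series:
    # Average hourly volume across days, to compare troughs and peaks against
    # expected user behaviour (e.g. overnight dips).
    per_hour = df.groupby([df["ts"].dt.date, df["ts"].dt.hour]).size()
    return per_hour.groupby(level=1).mean()
```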
- References / Possible Additional Checks