- Perhaps paranoid, but put together after repeatedly burning myself while
  creating and evaluating datasets.
- Run through the checklist and document all the answers in the evaluation.
- Save a dated version of the run-through on Quip or the internal wiki.
- Consider automating any tests that can be automated.
- Consider automatically generating these queries from the table structure (see the sketch below).
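As a rough illustration of the last point, here is a minimal sketch of generating profiling queries from a hand-written table description. The table name, column names, and the `kind` labels are hypothetical placeholders, not a real schema API:

```python
# Sketch: generate basic profiling queries from a hand-written table description.
# Everything here (table name, columns, kinds) is a hypothetical placeholder.
def profiling_queries(table, columns):
    """columns: list of (name, kind) tuples, kind in {'numeric', 'categorical', 'id'}."""
    queries = [f"SELECT COUNT(*) AS row_count FROM {table};"]
    for name, kind in columns:
        queries.append(f"SELECT COUNT(*) - COUNT({name}) AS null_count FROM {table};")
        if kind == "numeric":
            queries.append(f"SELECT MIN({name}), MAX({name}), AVG({name}) FROM {table};")
        else:
            queries.append(
                f"SELECT {name}, COUNT(*) FROM {table} "
                f"GROUP BY {name} ORDER BY 2 DESC LIMIT 20;"
            )
    return queries


for q in profiling_queries("events", [("user_id", "id"), ("duration_ms", "numeric")]):
    print(q)
```

Generated queries only cover the mechanical checks; domain-specific assertions still have to be written by hand.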
- Table Characteristics
[ ] Does the table cover the expected number of <units>?
[ ] Is the total number of rows in line with expectations? (example below)
[ ] Is this direct instrumentation or a derived table?
- Derived Table
[ ] Manually transform a small sample of rows and check that
the results match.
- Direct Instrumentation
[ ] Can any of the numbers be cross-checked against alternative sources? If
    yes, run and document the sanity checks.
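A minimal pandas sketch of the row-count and unit-coverage checks above, assuming the table has been loaded into a DataFrame and that `<units>` corresponds to a hypothetical `unit_id` column; the expected values would come from an independent source:

```python
import pandas as pd

# Hypothetical stand-in for the real table; replace with the actual load.
df = pd.DataFrame({"unit_id": [1, 1, 2, 3], "value": [10, 12, 7, 9]})

expected_units = 3        # e.g. number of units known from another system
expected_rows = (2, 10)   # plausible lower/upper bound on the row count

n_units = df["unit_id"].nunique()
n_rows = len(df)
print(f"distinct units: {n_units} (expected {expected_units})")
print(f"rows: {n_rows} (expected within {expected_rows})")

assert n_units == expected_units
assert expected_rows[0] <= n_rows <= expected_rows[1]
```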
- Column Relationships
[ ] Are assertions possible on individual columns or across related
    columns? Apply them (example below).
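A sketch of what applying such assertions could look like in pandas, using hypothetical `start_time`, `end_time`, and `duration_ms` columns:

```python
import pandas as pd

# Hypothetical rows; in practice, run this against the loaded table.
df = pd.DataFrame({
    "start_time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]),
    "end_time": pd.to_datetime(["2024-01-01 00:30", "2024-01-01 01:45"]),
    "duration_ms": [1_800_000, 2_700_000],
})

# Per-column assertion: durations must be non-negative.
assert (df["duration_ms"] >= 0).all()

# Cross-column assertions: end must not precede start, and the stored
# duration should match the timestamp difference.
assert (df["end_time"] >= df["start_time"]).all()
derived_ms = (df["end_time"] - df["start_time"]).dt.total_seconds() * 1000
assert (derived_ms == df["duration_ms"]).all()
```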
- Column Characteristics
- General
[ ] Are there instrumentation-related artifacts in the dataset?
    Explain them.
[ ] How many values are NULL? Are they expected?
[ ] Is this column derived from other columns in this table? Sanity-check
    the calculation (example below).
[ ] If calculated, run a manual query comparing the values and the
    distribution of the result.
[ ] If lookup-related, check that the numbers on both sides of the
    mapping make sense.
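A sketch of the NULL-count and derived-column checks, assuming pandas and a hypothetical `total = price * quantity` derivation:

```python
import numpy as np
import pandas as pd

# Hypothetical table where `total` is supposed to equal price * quantity.
df = pd.DataFrame({
    "price": [2.0, 3.5, np.nan, 1.0],
    "quantity": [3, 2, 1, 4],
    "total": [6.0, 7.0, np.nan, 4.0],
})

# NULL counts per column: are they in line with expectations?
print(df.isna().sum())

# Recompute the derived column and compare value by value.
recomputed = df["price"] * df["quantity"]
mismatches = df[~np.isclose(recomputed, df["total"], equal_nan=True)]
print(f"{len(mismatches)} rows where total != price * quantity")
```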
- User IDs
[ ] Do the user IDs satisfy the user ID requirements?
[ ] Is the number of distinct users within reason?
- Normal Columns
[ ] Does the column have the expected number of distinct values?
[ ] For enum-like columns, are there corrupted or unexpected values? (example below)
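A sketch of the distinct-count and corrupted-value checks for ID and enum-like columns, with a hypothetical `user_id` format and `status` value set:

```python
import pandas as pd

# Hypothetical data; the id format and allowed statuses are assumptions.
df = pd.DataFrame({
    "user_id": ["u001", "u002", "u002", "bad-id", "u003"],
    "status": ["ok", "ok", "error", "OK ", "unknown"],
})

# Distinct users: is the count within reason, and do the ids match the
# expected format (here an assumed 'u' + three digits pattern)?
print("distinct users:", df["user_id"].nunique())
bad_ids = df[~df["user_id"].str.fullmatch(r"u\d{3}")]
print("ids violating the format:", bad_ids["user_id"].tolist())

# Enum-like column: anything outside the documented value set is suspect
# (corrupted casing, stray whitespace, unexpected values).
allowed = {"ok", "error"}
corrupted = df[~df["status"].isin(allowed)]
print("unexpected status values:", corrupted["status"].unique().tolist())
```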
- Numerical columns
[ ] What are the min/max values? Do they make sense?
[ ] What does the distribution of the data look like? Normal /
Bimodal / etc.? Does it match the expected distribution?
[ ] Note the min/p1/p25/p50/mean/mode/p75/p90/p99/max values (example below).
[ ] Are there physical constraints on the columns (e.g. only
positive values)? Are they satisfied?
[ ] Should NULL values be coerced to zero or excluded? Are
potential queries updated accordingly?
[ ] If the column represents a quantity such as time or energy, are the
    units documented in the column? Are the values consistent with the units?
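A sketch of the summary and constraint checks for a numeric column, assuming pandas and a hypothetical `duration_ms` column (the synthetic data stands in for the real table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric column; replace with the real one.
s = pd.Series(rng.lognormal(mean=6, sigma=1, size=10_000), name="duration_ms")

# Note min/p1/p25/p50/mean/p75/p90/p99/max in one pass.
print(s.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.9, 0.99]))

# Physical constraint for a duration: values must be non-negative.
assert (s.dropna() >= 0).all()

# Rough shape check: a coarse histogram makes heavy tails or bimodality visible.
print(pd.cut(s, bins=10).value_counts().sort_index())
```

For the distribution check, a proper histogram or density plot is usually worth the extra step.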
- Outliers
[ ] Explicitly collect sample rows with outlier values across different
    columns, preferably the columns with the most outliers (example below).
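One way to pull outlier samples for manual review, using a simple IQR fence as a stand-in for whatever outlier definition actually fits each column:

```python
import pandas as pd

def outlier_rows(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return rows whose value in `column` lies outside the k * IQR fences."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lo) | (df[column] > hi)]

# Hypothetical table: collect a handful of outlier rows per numeric column.
df = pd.DataFrame({"duration_ms": [10, 12, 11, 9, 5000], "bytes": [1, 2, 1, 900, 2]})
for col in ["duration_ms", "bytes"]:
    print(f"--- outliers in {col} ---")
    print(outlier_rows(df, col).head(20))
```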
- Time Series Data
[ ] Is the volume consistent across multiple dates?
[ ] Is the volume consistent within a day?
[ ] Do the troughs and peaks correspond to user behaviour? Compare the
    troughs and peaks across days (example below).
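A sketch of the volume checks, assuming pandas and a hypothetical event-level table with an `event_time` timestamp column:

```python
import pandas as pd

# Hypothetical events; replace with the real table.
df = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2024-01-01 09:15", "2024-01-01 14:05", "2024-01-02 09:40",
        "2024-01-02 21:10", "2024-01-03 10:00",
    ])
})

# Volume per day: large jumps or missing days need an explanation.
print(df.set_index("event_time").resample("D").size())

# Volume per hour of day: do troughs and peaks line up with when users
# are actually active (time zones, nights, weekends)?
print(df["event_time"].dt.hour.value_counts().sort_index())
```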
- References: Possible Additional Checks