- Perhaps paranoid, but created after constantly burning myself while creating and evaluating datasets.
- Run through the checklist and document all the answers in the evaluation.
- Save a dated version of the run-through on Quip or the internal wiki.
- Consider automating any tests that can be automated.
- Consider automatically generating these queries given the table structure.
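As a rough sketch of what generating these queries from the table structure could look like, the snippet below builds one basic profiling query per column (row count, NULL count, distinct count) and runs it against a toy SQLite table. The helper `profiling_queries`, the `events` table, and its columns are all made up for illustration; a real warehouse would need its own way of listing columns and possibly different identifier quoting.

```python
import sqlite3


def profiling_queries(table, columns):
    """Build one basic profiling query per column (hypothetical helper).

    Table and column names are interpolated directly, so they are assumed
    to come from trusted schema metadata, not from user input.
    """
    queries = {}
    for col in columns:
        queries[col] = (
            f'SELECT COUNT(*) AS total_rows, '
            f'COUNT(*) - COUNT("{col}") AS null_rows, '
            f'COUNT(DISTINCT "{col}") AS distinct_values '
            f'FROM "{table}"'
        )
    return queries


# Toy in-memory table standing in for the dataset under evaluation.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, duration_ms INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("u1", 120), ("u2", None), ("u1", 300)],
)

# Column names are read from the table itself rather than typed by hand.
columns = [row[1] for row in conn.execute("PRAGMA table_info(events)")]

# Each value is a (total_rows, null_rows, distinct_values) tuple.
profile = {
    col: conn.execute(sql).fetchone()
    for col, sql in profiling_queries("events", columns).items()
}
```

The resulting `profile` dictionary can be pasted into the dated write-up, so the answers to the row-count, NULL, and distinct-value questions are recorded next to the queries that produced them.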
- Table Characteristics
[ ]Does the table cover the expected number of <units>?
[ ]Does the total number of rows match expectations?
[ ]Is this direct instrumentation or a derived table?
- Derived Table
[ ]Manually transform a small sample of rows and check that the results match.
- Direct Instrumentation
[ ]Can any numbers be determined from any alternative sources? If yes, run and document sanity checks.
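For the derived-table check above, one possible approach is to recompute a handful of rows by hand and compare them with what the table actually contains. The sketch below assumes the source rows and the derived rows have already been pulled into Python dictionaries; `check_derived_sample`, the `id` key, and the duration columns are hypothetical names used only for illustration.

```python
def check_derived_sample(source_rows, derived_by_key, transform, key, sample_size=5):
    """Recompute a few derived rows and return any that do not match.

    Takes the first `sample_size` source rows for simplicity; a random
    sample is usually a better choice on real data.
    """
    mismatches = []
    for row in source_rows[:sample_size]:
        expected = transform(row)
        actual = derived_by_key.get(row[key])
        if actual != expected:
            mismatches.append((row[key], expected, actual))
    return mismatches


# Made-up source rows and the derived table they are supposed to produce.
source = [
    {"id": 1, "start_ms": 100, "end_ms": 250},
    {"id": 2, "start_ms": 400, "end_ms": 450},
]
derived = {
    1: {"id": 1, "duration_ms": 150},
    2: {"id": 2, "duration_ms": 5},  # deliberately wrong value
}


def to_duration(row):
    # The manual version of the transformation being verified.
    return {"id": row["id"], "duration_ms": row["end_ms"] - row["start_ms"]}


mismatches = check_derived_sample(source, derived, to_duration, key="id")
```

An empty `mismatches` list is the result worth documenting; anything else gives concrete row ids to investigate.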
- Column Relationships
[ ]Are assertions possible on individual columns? If so, apply them.
- Column Characteristics
[ ]Are there instrumentation-related artifacts in the dataset? Explain them.
[ ]How many values are NULL? Are they expected?
[ ]Is this column derived from other columns in this table? Sanity-check the calculations.
[ ]If calculated, run a manual query comparing the values and the distribution of the result.
[ ]If lookup-related, check that the numbers on both sides of the mapping make sense.
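For lookup-style columns, a simple way to look at both sides of the mapping is to compare the set of keys used in the table with the set of keys in the lookup. This is only a sketch: `mapping_gaps` and the country codes are invented for the example.

```python
def mapping_gaps(fact_keys, lookup_keys):
    """Compare the keys used in a table with the keys in its lookup."""
    fact, lookup = set(fact_keys), set(lookup_keys)
    return {
        # Keys that appear in the data but have no lookup entry.
        "unmapped": sorted(fact - lookup),
        # Lookup entries that never appear in the data.
        "unused": sorted(lookup - fact),
    }


# Made-up values: codes seen in the table vs. codes in the lookup.
codes_in_table = ["DE", "FR", "FR", "XX"]
codes_in_lookup = ["DE", "FR", "IT"]

gaps = mapping_gaps(codes_in_table, codes_in_lookup)
```

Unmapped keys usually point to a problem; unused keys may be perfectly fine, but both counts are worth noting in the evaluation.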
- User Ids
[ ]Do the user ids satisfy the user id requirements?
[ ]Is the number of distinct users within reason?
- Normal Columns
[ ]Does the column have the expected number of distinct values?
[ ]For enum-like columns, are there corrupted values?
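For enum-like columns, corrupted values tend to show up once the observed values are counted against the set that is supposed to exist. A minimal sketch, assuming the values fit in memory; `unexpected_values` and the platform names are hypothetical.

```python
from collections import Counter


def unexpected_values(values, allowed):
    """Count every observed value that is not in the allowed set."""
    counts = Counter(values)
    return {value: n for value, n in counts.items() if value not in allowed}


# Made-up column contents with a casing/whitespace variant and a NULL.
platforms = ["ios", "android", "iOS ", "web", None, "android", "iOS "]
allowed_platforms = {"ios", "android", "web"}

bad = unexpected_values(platforms, allowed_platforms)
```

NULLs are reported here as well because `None` is not in the allowed set; add it to `allowed` if NULL is an expected value for the column.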
- Numerical Columns
[ ]What are the min/max values? Do they make sense?
[ ]What does the distribution of the data look like? Normal / Bimodal / etc.? Does it match the expected distribution?
[ ]Note the min/p1/p25/p50/mean/mode/p75/p90/p99/max values.
[ ]Are there physical constraints on the columns (e.g. only positive values)? Are they satisfied?
[ ]Should NULL values be coerced to zero or excluded? Are potential queries updated accordingly?
[ ]If the column represents a quantity, such as time or energy, are the units documented in the column? Are the values consistent with the units?
[ ]Explicitly collect samples with outlier values across different columns, preferably those with the most outliers.
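The summary values listed above can be collected in one pass with Python's standard `statistics` module. The sketch below excludes NULLs (represented as `None`) instead of coercing them to zero, which is an assumption that should be revisited per column; `numeric_summary` is a hypothetical helper, and it needs at least two non-NULL values.

```python
import statistics


def numeric_summary(values):
    """Summarise a numeric column; NULLs (None) are excluded, not zeroed."""
    data = sorted(v for v in values if v is not None)
    # 99 cut points: index 0 is p1, index 49 is p50, index 98 is p99.
    pct = statistics.quantiles(data, n=100, method="inclusive")
    return {
        "min": data[0],
        "p1": pct[0],
        "p25": pct[24],
        "p50": pct[49],
        "mean": statistics.fmean(data),
        "mode": statistics.mode(data),
        "p75": pct[74],
        "p90": pct[89],
        "p99": pct[98],
        "max": data[-1],
        "nulls_excluded": len(values) - len(data),
        # Example physical constraint: count values below zero.
        "negatives": sum(1 for v in data if v < 0),
    }


# Made-up column: the integers 0..100 plus one NULL.
summary = numeric_summary(list(range(101)) + [None])
```

The `negatives` entry stands in for whatever physical constraint applies to the column; swap in the relevant condition and keep the count in the write-up.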
- Time Series Data
[ ]Is the volume consistent across multiple dates?
[ ]Is the volume consistent within a day?
[ ]Do troughs and peaks correspond to user behaviour? Compare the troughs and peaks.
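One rough way to check volume consistency across dates is to count rows per day and flag days that stray too far from the median day. The `unusual_days` helper, the 50% tolerance, and the dates are all assumptions made for this sketch; the same idea applies to hours within a day.

```python
from collections import Counter
from statistics import median


def unusual_days(event_dates, tolerance=0.5):
    """Return days whose row count differs from the median day by more
    than `tolerance` (as a fraction of the median).

    Days with zero rows never appear in the input, so they have to be
    checked separately against the expected calendar.
    """
    daily = Counter(event_dates)
    typical = median(daily.values())
    return {
        day: n
        for day, n in sorted(daily.items())
        if abs(n - typical) > tolerance * typical
    }


# Made-up data: one date string per row in the table.
dates = (
    ["2024-01-01"] * 100
    + ["2024-01-02"] * 110
    + ["2024-01-03"] * 20
    + ["2024-01-04"] * 95
)

flagged = unusual_days(dates)
```

A flagged day is not necessarily a bug (weekends and holidays are common causes), which is why the checklist asks whether troughs and peaks line up with user behaviour.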
- References Possible Additional Checks