- Perhaps paranoid, but put together after repeatedly burning myself while
  creating and evaluating datasets.
- Run through the checklist and document all the answers in the evaluation.
- Save a dated version of the run-through on Quip or the internal wiki.
- Consider automating any tests that can be automated.
- Consider automatically generating these queries from the table structure (see the sketch below).
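As a rough illustration of the last point, here is a minimal sketch of generating profiling queries from a hand-written table description. The table name, column names, and the `kind` labels are hypothetical placeholders, not a real schema API:

```python
# Sketch: generate basic profiling queries from a hand-written table description.
# Everything here (table name, columns, kinds) is a hypothetical placeholder.
def profiling_queries(table, columns):
    """columns: list of (name, kind) tuples, kind in {'numeric', 'categorical', 'id'}."""
    queries = [f"SELECT COUNT(*) AS row_count FROM {table};"]
    for name, kind in columns:
        queries.append(f"SELECT COUNT(*) - COUNT({name}) AS null_count FROM {table};")
        if kind == "numeric":
            queries.append(f"SELECT MIN({name}), MAX({name}), AVG({name}) FROM {table};")
        else:
            queries.append(
                f"SELECT {name}, COUNT(*) FROM {table} "
                f"GROUP BY {name} ORDER BY 2 DESC LIMIT 20;"
            )
    return queries


for q in profiling_queries("events", [("user_id", "id"), ("duration_ms", "numeric")]):
    print(q)
```

Generated queries only cover the mechanical checks; domain-specific assertions still have to be written by hand.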
- Table Characteristics
[ ] Does the table cover the expected number of <units>?
[ ] Is the total number of rows in line with expectations? (example below)
[ ] Is this direct instrumentation or a derived table?
- Derived Table
[ ] Manually transform a small sample of rows and check that
the results match.
- Direct Instrumentation
[ ] Can any of the numbers be cross-checked against alternative sources? If
    yes, run and document the sanity checks.
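A minimal pandas sketch of the row-count and unit-coverage checks above, assuming the table has been loaded into a DataFrame and that `<units>` corresponds to a hypothetical `unit_id` column; the expected values would come from an independent source:

```python
import pandas as pd

# Hypothetical stand-in for the real table; replace with the actual load.
df = pd.DataFrame({"unit_id": [1, 1, 2, 3], "value": [10, 12, 7, 9]})

expected_units = 3        # e.g. number of units known from another system
expected_rows = (2, 10)   # plausible lower/upper bound on the row count

n_units = df["unit_id"].nunique()
n_rows = len(df)
print(f"distinct units: {n_units} (expected {expected_units})")
print(f"rows: {n_rows} (expected within {expected_rows})")

assert n_units == expected_units
assert expected_rows[0] <= n_rows <= expected_rows[1]
```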
- Column Relationships
[ ] Are assertions possible on individual columns or across related
    columns? Apply them (example below).
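A sketch of what applying such assertions could look like in pandas, using hypothetical `start_time`, `end_time`, and `duration_ms` columns:

```python
import pandas as pd

# Hypothetical rows; in practice, run this against the loaded table.
df = pd.DataFrame({
    "start_time": pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00"]),
    "end_time": pd.to_datetime(["2024-01-01 00:30", "2024-01-01 01:45"]),
    "duration_ms": [1_800_000, 2_700_000],
})

# Per-column assertion: durations must be non-negative.
assert (df["duration_ms"] >= 0).all()

# Cross-column assertions: end must not precede start, and the stored
# duration should match the timestamp difference.
assert (df["end_time"] >= df["start_time"]).all()
derived_ms = (df["end_time"] - df["start_time"]).dt.total_seconds() * 1000
assert (derived_ms == df["duration_ms"]).all()
```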
- Column Characteristics
- General
[ ] Are there instrumentation-related artifacts in the dataset?
    Explain them.
[ ] How many values are NULL? Are they expected?
[ ] Is this column derived from other columns in this table? Sanity-check
    the calculation (example below).
[ ] If calculated, run a manual query comparing the values and the
    distribution of the result.
[ ] If lookup-related, check that the numbers on both sides of the
    mapping make sense.
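A sketch of the NULL-count and derived-column checks, assuming pandas and a hypothetical `total = price * quantity` derivation:

```python
import numpy as np
import pandas as pd

# Hypothetical table where `total` is supposed to equal price * quantity.
df = pd.DataFrame({
    "price": [2.0, 3.5, np.nan, 1.0],
    "quantity": [3, 2, 1, 4],
    "total": [6.0, 7.0, np.nan, 4.0],
})

# NULL counts per column: are they in line with expectations?
print(df.isna().sum())

# Recompute the derived column and compare value by value.
recomputed = df["price"] * df["quantity"]
mismatches = df[~np.isclose(recomputed, df["total"], equal_nan=True)]
print(f"{len(mismatches)} rows where total != price * quantity")
```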
- User IDs
[ ] Do the user IDs satisfy the user ID requirements?
[ ] Is the number of distinct users within reason?
- Normal Columns
[ ] Does the column have the expected number of distinct values?
[ ] For enum-like columns, are there corrupted or unexpected values? (example below)
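A sketch of the distinct-count and corrupted-value checks for ID and enum-like columns, with a hypothetical `user_id` format and `status` value set:

```python
import pandas as pd

# Hypothetical data; the id format and allowed statuses are assumptions.
df = pd.DataFrame({
    "user_id": ["u001", "u002", "u002", "bad-id", "u003"],
    "status": ["ok", "ok", "error", "OK ", "unknown"],
})

# Distinct users: is the count within reason, and do the ids match the
# expected format (here an assumed 'u' + three digits pattern)?
print("distinct users:", df["user_id"].nunique())
bad_ids = df[~df["user_id"].str.fullmatch(r"u\d{3}")]
print("ids violating the format:", bad_ids["user_id"].tolist())

# Enum-like column: anything outside the documented value set is suspect
# (corrupted casing, stray whitespace, unexpected values).
allowed = {"ok", "error"}
corrupted = df[~df["status"].isin(allowed)]
print("unexpected status values:", corrupted["status"].unique().tolist())
```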
- Numerical columns
[ ] What are the min/max values? Do they make sense?
[ ] What does the distribution of the data look like? Normal /
Bimodal / etc.? Does it match the expected distribution?
[ ] Note the min/p1/p25/p50/mean/mode/p75/p90/p99/max values (example below).
[ ] Are there physical constraints on the columns (e.g. only
positive values)? Are they satisfied?
[ ] Should NULL values be coerced to zero or excluded? Are
potential queries updated accordingly?
[ ] If the column represents a quantity such as time or energy, are the
    units documented in the column? Are the values consistent with the units?
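A sketch of the summary and constraint checks for a numeric column, assuming pandas and a hypothetical `duration_ms` column (the synthetic data stands in for the real table):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical numeric column; replace with the real one.
s = pd.Series(rng.lognormal(mean=6, sigma=1, size=10_000), name="duration_ms")

# Note min/p1/p25/p50/mean/p75/p90/p99/max in one pass.
print(s.describe(percentiles=[0.01, 0.25, 0.5, 0.75, 0.9, 0.99]))

# Physical constraint for a duration: values must be non-negative.
assert (s.dropna() >= 0).all()

# Rough shape check: a coarse histogram makes heavy tails or bimodality visible.
print(pd.cut(s, bins=10).value_counts().sort_index())
```

For the distribution check, a proper histogram or density plot is usually worth the extra step.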
- Outliers
[ ] Explicitly collect sample rows with outlier values across different
    columns, preferably the columns with the most outliers (example below).
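One way to pull outlier samples for manual review, using a simple IQR fence as a stand-in for whatever outlier definition actually fits each column:

```python
import pandas as pd

def outlier_rows(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return rows whose value in `column` lies outside the k * IQR fences."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return df[(df[column] < lo) | (df[column] > hi)]

# Hypothetical table: collect a handful of outlier rows per numeric column.
df = pd.DataFrame({"duration_ms": [10, 12, 11, 9, 5000], "bytes": [1, 2, 1, 900, 2]})
for col in ["duration_ms", "bytes"]:
    print(f"--- outliers in {col} ---")
    print(outlier_rows(df, col).head(20))
```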
- Time Series Data
[ ] Is the volume consistent across multiple dates?
[ ] Is the volume consistent within a day?
[ ] Do the troughs and peaks correspond to user behaviour? Compare the
    troughs and peaks across days (example below).
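A sketch of the volume checks, assuming pandas and a hypothetical event-level table with an `event_time` timestamp column:

```python
import pandas as pd

# Hypothetical events; replace with the real table.
df = pd.DataFrame({
    "event_time": pd.to_datetime([
        "2024-01-01 09:15", "2024-01-01 14:05", "2024-01-02 09:40",
        "2024-01-02 21:10", "2024-01-03 10:00",
    ])
})

# Volume per day: large jumps or missing days need an explanation.
print(df.set_index("event_time").resample("D").size())

# Volume per hour of day: do troughs and peaks line up with when users
# are actually active (time zones, nights, weekends)?
print(df["event_time"].dt.hour.value_counts().sort_index())
```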
- References: Possible Additional Checks