review data analysis

Think through the framework.. add a figure/diagram?
Data collection -
Cleansing

Probably not all of these are needed?

The machine learning projects book

https://medium.com/@shay.palachy/peer-reviewing-data-science-projects-7bfbc2919724

Data Properties
Approach Assumptions
Past Experience
Objective Alignment
Implementation
Scaling
Compose-/Break-ability
Information Requirements
Domain Adaptation
Noise/Bias/Missing Data Resilience

Data Properties
Regarding the initial dataset:

How was it generated? How was it sampled? Was it updated?
E.g. 10% of last month's data was sampled uniformly from each of the existing five clients.
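A reviewer might ask for this sampling scheme to be reproducible in code. A minimal sketch of a per-client uniform sample, assuming a pandas DataFrame with a hypothetical client_id column (the 10% fraction mirrors the example above):

```python
import pandas as pd

def sample_per_client(df: pd.DataFrame, frac: float = 0.10,
                      client_col: str = "client_id",
                      seed: int = 42) -> pd.DataFrame:
    """Draw the same uniform fraction of rows from each client's data,
    so no client is over- or under-represented by the sampling itself."""
    return (df.groupby(client_col, group_keys=False)
              .apply(lambda g: g.sample(frac=frac, random_state=seed)))
```

Fixing the random seed makes the sample auditable: a reviewer can regenerate it and check the per-client counts.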

What noise, sampling bias and missing data did this introduce?
E.g. one of the clients only integrated with our service two weeks ago, introducing a down-sampling bias for its data in the dataset.

Can you modify sampling/generation to reduce or eliminate noise? Sampling bias? Missing data?
E.g. either upsample the under-sampled client by a factor of two, or use data from only the last two weeks for all clients.
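The upsampling option can be made concrete as follows (a sketch; the client_id column and the factor of two come from the example above, not from a real pipeline):

```python
import pandas as pd

def upsample_client(df: pd.DataFrame, client: str, factor: int = 2,
                    client_col: str = "client_id",
                    seed: int = 42) -> pd.DataFrame:
    """Resample one client's rows with replacement so that the client
    appears `factor` times as often, compensating for its shorter
    data-collection window."""
    rows = df[df[client_col] == client]
    extra = rows.sample(n=len(rows) * (factor - 1), replace=True,
                        random_state=seed)
    return pd.concat([df, extra], ignore_index=True)
```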

Can you explicitly model the noise, independently of any specific modeling approach?

If the dataset is labeled, how was it labeled?

What label bias did this introduce? Can it be measured?
E.g. the labels might come from semi-attentive users. Labelling a very small (but representative) set by hand using experts/analysts might be tractable, and would enable us to measure label bias/error on this set, as sketched below.
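Once such an expert-labeled subset exists, the measurement itself is simple. A sketch, assuming hypothetical user_label and expert_label columns where expert_label is filled only for the hand-labeled rows:

```python
import pandas as pd

def label_error_rate(df: pd.DataFrame,
                     user_label: str = "user_label",
                     expert_label: str = "expert_label") -> float:
    """Estimate label bias/error as the disagreement rate between the
    original (user-provided) labels and the expert-labeled subset."""
    subset = df.dropna(subset=[expert_label])
    return float((subset[user_label] != subset[expert_label]).mean())
```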

Can you modify or augment the labelling process to compensate for existing bias?

How similar is the initial dataset to the input data expected in production, in terms of structure and schema?
E.g. the content of some items changes dynamically in production. Or perhaps different fields are missing, depending on time of creation, or on the source domain, and are later completed or extrapolated.
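A quick structural diff between the training set and a sample of production inputs can surface these gaps early. A sketch, assuming both are available as pandas DataFrames:

```python
import pandas as pd

def compare_schemas(train: pd.DataFrame, prod: pd.DataFrame) -> None:
    """Report fields that exist on only one side, and per-field
    missing-value rates on the fields both sides share."""
    print("only in training data:", sorted(set(train.columns) - set(prod.columns)))
    print("only in production data:", sorted(set(prod.columns) - set(train.columns)))
    for col in sorted(set(train.columns) & set(prod.columns)):
        print(f"{col}: train missing {train[col].isna().mean():.1%}, "
              f"prod missing {prod[col].isna().mean():.1%}")
```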

How representative is the initial dataset of production data?
E.g. the distribution of data among clients changes constantly. Or perhaps it was sampled over two months of spring, but the model will go live when winter starts. Or it might have been collected before a major client/service/data source was integrated.
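Representativeness can be spot-checked feature by feature, for instance with a two-sample Kolmogorov-Smirnov test from scipy (a sketch for a single numeric feature; the 0.01 threshold is an arbitrary choice for illustration):

```python
from scipy.stats import ks_2samp

def drift_check(train_values, prod_values, alpha: float = 0.01) -> bool:
    """Compare one numeric feature's distribution in the training set
    against production data; a small p-value suggests the training
    sample is no longer representative."""
    stat, p_value = ks_2samp(train_values, prod_values)
    print(f"KS statistic={stat:.3f}, p-value={p_value:.4f}")
    return p_value < alpha  # True -> significant drift detected
```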

What is the best training dataset you could hope for?
- Define it very explicitly.
- Estimate: By how much will it improve performance?
- How possible and costly is it to generate?
E.g. tag the sentiment of 20,000 posts using three annotators and take the mode/average score as the label of each. This will require 400 man-hours, is expected to cost two to three thousand dollars, and is expected to increase accuracy by at least 5% (this last number is usually extremely hard to provide).
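Both the label aggregation and the effort estimate in this example are easy to make concrete. A sketch, assuming binary sentiment labels and roughly 24 seconds per post per annotator (the rate at which 20,000 posts times three annotators comes out at 400 man-hours):

```python
import pandas as pd

# Hypothetical annotation matrix: one row per post, one column per annotator.
annotations = pd.DataFrame({
    "annotator_1": [1, 0, 1],
    "annotator_2": [1, 1, 1],
    "annotator_3": [0, 1, 1],
})

# Majority vote (mode) across annotators gives the final label per post.
labels = annotations.mode(axis=1)[0]

# Back-of-envelope effort: posts x annotators x seconds-per-post.
n_posts, n_annotators, secs_per_post = 20_000, 3, 24
man_hours = n_posts * n_annotators * secs_per_post / 3600
print(f"estimated effort: {man_hours:.0f} man-hours")  # -> 400
```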
