Reviewing data analysis projects
Think through a framework.. a figure/diagram?
Data collection -
Cleansing
Probably not all of these steps are needed?
Machine learning project book
https://onezero.medium.com/how-big-data-fails-4381a5ddeda8
Steve Null, Null Island
https://medium.com/@shay.palachy/peer-reviewing-data-science-projects-7bfbc2919724
- Data Properties
- Approach Assumptions
- Past Experience
- Objective Alignment
- Implementation
- Scaling
- Compose-/Break-ability
- Information Requirements
- Domain Adaptation
- Noise/Bias/Missing Data Resilience
Data Properties
Regarding the initial dataset:
How was the data generated?
- Was it sampled?
- Is it old data, or has it been updated since?
E.g. 10% of last month's data was sampled uniformly from each of the five existing clients.
Is there noise, sampling bias, or missing data?
E.g. one of the clients only integrated with our service two weeks ago, introducing a down-sampling bias of its data in the dataset.
- What could be tried to reduce or remove it?
E.g. either upsample the under-sampled client by a factor of two, or use data from only the last two weeks for all clients (both sketched below).
Can you explicitly model noise, independently of the approach?
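A minimal pandas sketch of the two fixes above. The file name events.csv and the client/timestamp column names are assumptions for illustration, not part of the original notes:

```python
import pandas as pd

# Hypothetical dataset: one row per event, with a client column and an
# event timestamp.
df = pd.read_csv("events.csv", parse_dates=["timestamp"])

# Fix 1: upsample the under-represented client by a factor of two
# (append one full sample-with-replacement copy of its rows).
late_client = df[df["client"] == "new_client"]
df_upsampled = pd.concat(
    [df, late_client.sample(frac=1.0, replace=True, random_state=0)],
    ignore_index=True,
)

# Fix 2: restrict all clients to the window in which every client was
# integrated, so everyone is sampled from the same period.
cutoff = df["timestamp"].max() - pd.Timedelta(weeks=2)
df_recent = df[df["timestamp"] >= cutoff]
```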
If the data comes labelled, how was it labelled?
- Is there bias in the labelling? Can the bias be measured?
E.g. the label might come from semi-attentive users. Labelling a very small (but representative) set by hand using experts/analysts might be tractable, and will enable us to measure label bias/error on this set.
- What could be tried to reduce the bias? (Measuring it first is sketched below.)
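A sketch of measuring label bias against the expert-labelled subset described above. The file and the user_label/expert_label columns are hypothetical:

```python
import pandas as pd

# Small representative subset that was re-labelled by experts/analysts.
gold = pd.read_csv("expert_subset.csv")

# Overall label error rate of the noisy user labels on this subset.
error_rate = (gold["user_label"] != gold["expert_label"]).mean()

# Per-class disagreement separates systematic bias (e.g. users
# over-reporting one class) from roughly uniform noise.
confusion = pd.crosstab(
    gold["expert_label"], gold["user_label"], normalize="index"
)
print(f"label error rate: {error_rate:.2%}")
print(confusion)
```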
Does the data differ, in structure or content, from what the model will see in production?
E.g. the content of some items changes dynamically in production. Or perhaps different fields are missing, depending on time of creation, or on the source domain, and are later completed or extrapolated.
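One cheap structural check, as a sketch: compare per-field missingness between the training data and a fresh production sample. File names are hypothetical:

```python
import pandas as pd

train = pd.read_csv("train.csv")
prod = pd.read_csv("prod_sample.csv")

# Null rate per column in each environment; large gaps flag fields that
# are missing (or filled in later) in only one of the two.
null_rates = pd.DataFrame({
    "train": train.isna().mean(),
    "prod": prod.isna().mean(),
})
null_rates["gap"] = (null_rates["train"] - null_rates["prod"]).abs()
print(null_rates.sort_values("gap", ascending=False))
```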
Is the data representative of what the model will actually see?
E.g. the distribution of data among clients constantly changes. Or perhaps it was sampled over two months of spring but the model will go live when winter starts. Or it might have been collected before a major client/service/data source was integrated.
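A rough representativeness check along the client dimension, again with hypothetical file and column names:

```python
import pandas as pd

train = pd.read_csv("train.csv")
prod = pd.read_csv("prod_sample.csv")

# Share of rows per client in each dataset, aligned on client name.
train_share = train["client"].value_counts(normalize=True)
prod_share = prod["client"].value_counts(normalize=True)
shares = pd.concat(
    [train_share, prod_share], axis=1, keys=["train", "prod"]
).fillna(0.0)

# Total variation distance: 0 = identical client mix, 1 = disjoint.
tvd = 0.5 * (shares["train"] - shares["prod"]).abs().sum()
print(f"client-mix total variation distance: {tvd:.3f}")
```

The same comparison can be repeated per season or per data source to catch the spring-vs-winter and pre-integration cases above.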
What is the best training dataset you could hope for?
- Define it very explicitly.
- Estimate: By how much will it improve performance?
- How feasible, and how costly, is it to generate?
E.g. tag the sentiment of 20,000 posts using three annotators and take the mode/average score as the label of each. This will require 400 man-hours, is expected to cost two or three thousand dollars, and is expected to increase accuracy by at least 5% (this last number is usually extremely hard to provide).
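A sketch of the aggregation step, assuming a hypothetical annotations.csv with one sentiment column per annotator (a1, a2, a3):

```python
import pandas as pd

ann = pd.read_csv("annotations.csv")
scores = ann[["a1", "a2", "a3"]]

# Majority vote per post; when all three annotators disagree, mode()
# returns every value and [0] falls back to the smallest one.
labels = scores.mode(axis=1)[0]

# Fraction of posts where all three annotators agree: a cheap proxy
# for how noisy the aggregated labels will be.
full_agreement = scores.nunique(axis=1).eq(1).mean()
print(f"unanimous agreement: {full_agreement:.2%}")
```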
Discussion