Differences

blog:draft:review_data_analysis [2020/03/23 13:49] prgram
blog:draft:review_data_analysis [2025/07/07 14:12] (current) – external edit 127.0.0.1
Line 9: Line 9:
 Machine learning project book
  
 +https://onezero.medium.com/how-big-data-fails-4381a5ddeda8 
 +Steve Null, Null Island
  
 https://medium.com/@shay.palachy/peer-reviewing-data-science-projects-7bfbc2919724
Line 29: Line 30:
 Regarding the initial dataset:
  
-How was it generated? How was it sampled? Was it updated?
 +How was the data generated?
 + - Was it sampled?
 + - Is it old data, or has it been updated?
 E.g. 10% of last month's data was sampled uniformly from each of the existing five clients.
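The sampling scheme in this example can be sketched roughly as follows. This is only an illustration: the record layout `(client_id, payload)`, the function name `sample_per_client`, and the fixed seed are assumptions, not something taken from the original article.

```python
import random
from collections import defaultdict

def sample_per_client(records, fraction, seed=0):
    """Draw the same fraction of records, uniformly at random, from each client."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_client = defaultdict(list)
    for client_id, payload in records:
        by_client[client_id].append((client_id, payload))
    sample = []
    for client_id in sorted(by_client):
        rows = by_client[client_id]
        k = round(len(rows) * fraction)  # number of rows to keep for this client
        sample.extend(rng.sample(rows, k))
    return sample

# Hypothetical data: five clients with 100 records each.
records = [(c, i) for c in "abcde" for i in range(100)]
sample = sample_per_client(records, fraction=0.10)
```

Recording the fraction and the seed alongside the dataset makes it easier for a reviewer to answer "how was it sampled?" later.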
  
-What noise, sampling bias and missing data did this introduce?
 +Is there any noise, sampling bias, or missing data?
 E.g. one of the clients only integrated with our service two weeks ago, introducing a down-sampling bias of its data in the dataset.
 + - What can be tried to reduce or eliminate them?
-Can you modify sampling/generation to reduce or eliminate noise? Sampling bias? Missing data?
 E.g. either upsample the under-sampled client by a factor of two, or use data from only the last two weeks for all clients.
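Both remedies from this example can be sketched in a few lines. The records below are hypothetical `(client_id, event_date)` pairs, with client "e" standing in for the one that integrated two weeks ago; the helper names are assumptions made for illustration.

```python
from collections import Counter
from datetime import date, timedelta

def upsample(records, client, factor):
    """Return records with every row of `client` appearing `factor` times."""
    extra = [r for r in records if r[0] == client] * (factor - 1)
    return records + extra

def restrict_window(records, today, days):
    """Keep only rows from the last `days` days, including today."""
    cutoff = today - timedelta(days=days)
    return [r for r in records if r[1] > cutoff]

def count_by_client(records):
    return Counter(client for client, _ in records)

# Hypothetical data: four clients with 28 days of data, one with only 14.
today = date(2020, 3, 23)
records = [(c, today - timedelta(days=d)) for c in "abcd" for d in range(28)]
records += [("e", today - timedelta(days=d)) for d in range(14)]

upsampled = count_by_client(upsample(records, "e", 2))
windowed = count_by_client(restrict_window(records, today, 14))
```

Upsampling keeps all of the older data but duplicates rows for the late client, while the two-week window treats every client alike at the cost of discarding half of the data from the others.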
  
 Can you explicitly model noise, independently from an approach?
  
-If the dataset is labeled, how was it labeled?
-What label bias did this introduce? Can it be measured?
 +If the data was labeled separately, how was it labeled?
 + - Is there any bias in the labeling? Can the bias be measured?
 E.g. the label might come from semi-attentive users. Labelling a very small (but representative) set by hand using experts/analysts might be tractable, and will enable us to measure label bias/error on this set.
 + - What can be tried to reduce the bias?
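One rough way to measure label error on a small expert-labeled subset, as the example above suggests, is to compare the two sets of labels on the items they share. The dictionaries and the function name `label_error_rate` are hypothetical; a real review would also look at which direction the disagreements go.

```python
def label_error_rate(noisy_labels, expert_labels):
    """Fraction of expert-reviewed items whose original label disagrees with the expert."""
    shared_ids = noisy_labels.keys() & expert_labels.keys()
    if not shared_ids:
        raise ValueError("no items were labeled by both sources")
    disagreements = sum(noisy_labels[i] != expert_labels[i] for i in shared_ids)
    return disagreements / len(shared_ids)

# Hypothetical labels: item id -> label. Experts reviewed items 1 to 4 only.
noisy = {1: 1, 2: 0, 3: 1, 4: 1, 5: 0}
expert = {1: 1, 2: 0, 3: 0, 4: 1}
rate = label_error_rate(noisy, expert)  # one disagreement out of four items
```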
  
-Can you modify or augment the labelling process to compensate for existing bias?
-How similar is the initial dataset to input data expected in production, structure and schema-wise?
 +Does the data used differ in structure or content from what is seen in actual use?
 E.g. the content of some items changes dynamically in production. Or perhaps different fields are missing, depending on time of creation, or on the source domain, and are later completed or extrapolated.
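A simple check for the "different fields are missing" case is to compare how often each field is filled in the initial dataset and in a sample of production input. The sketch below assumes records are plain dicts; the field names and the `tolerance` threshold are made up for illustration.

```python
def field_coverage(records):
    """Fraction of records in which each field is present and not None."""
    fields = {name for record in records for name in record}
    return {
        name: sum(record.get(name) is not None for record in records) / len(records)
        for name in fields
    }

def coverage_gaps(dataset, production, tolerance=0.1):
    """Fields whose coverage differs by more than `tolerance` between the two sources."""
    a, b = field_coverage(dataset), field_coverage(production)
    return {
        name: (a.get(name, 0.0), b.get(name, 0.0))
        for name in set(a) | set(b)
        if abs(a.get(name, 0.0) - b.get(name, 0.0)) > tolerance
    }

# Hypothetical records: "price" is always filled in the dataset but only half the time in production.
dataset = [{"title": "x", "price": 1.0}, {"title": "y", "price": 2.0}]
production = [{"title": "x", "price": None}, {"title": "y", "price": 2.0}]
gaps = coverage_gaps(dataset, production)
```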
  
-How representative is the initial dataset of production data?
 +Is the data used representative of real-world data?
 E.g. The distribution of data among clients constantly changes. Or perhaps it was sampled over two months of spring but the model will go up when winter starts. Or it might have been collected before a major client/service/data source was integrated.
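As one possible way to quantify a shift in the distribution of data among clients, the share of each client in the initial dataset can be compared with its share in recent production traffic. Total variation distance is used here purely as an illustrative choice; the client lists are hypothetical.

```python
from collections import Counter

def shares(items):
    """Relative frequency of each distinct item."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical client ids: client "c" was integrated after the dataset was collected.
dataset_clients = ["a"] * 50 + ["b"] * 50
production_clients = ["a"] * 25 + ["b"] * 25 + ["c"] * 50
distance = total_variation(shares(dataset_clients), shares(production_clients))
```

A distance near zero suggests the client mix is similar; a large value, as in this made-up case, is a signal to ask when and how the dataset was collected.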
  
  • blog/draft/review_data_analysis.1584971390.txt.gz
  • Last modified: 2025/07/07 14:12
  • (external edit)