Differences

This shows you the differences between two versions of the page.

--- blog:draft:review_data_analysis [2020/03/23 13:49] – prgram
+++ blog:draft:review_data_analysis [2025/07/07 14:12] (current) – external edit 127.0.0.1
@@ Line 9: / Line 9: @@
 Machine learning project book
+https://onezero.medium.com/how-big-data-fails-4381a5ddeda8
+Steve Null, Null Island
 https://medium.com/@shay.palachy/peer-reviewing-data-science-projects-7bfbc2919724
@@ Line 29: / Line 30: @@
 Regarding the initial dataset:
-How was it generated? How was it sampled? Was it updated?
+How was the data generated?
+ - Was it sampled?
+ - Is it old data, or has it been updated?
 E.g. 10% of last month data was sampled uniformly from each of the existing five clients.
-What noise, sampling bias and missing data did this introduce?
+Is there any noise, sampling bias, or missing data?
 E.g. one of the clients only integrated with our service two weeks ago, introducing a down-sampling bias of its data in the dataset.
+ - What can be tried to reduce or eliminate it?
-Can you modify sampling/generation to reduce or eliminate noise? Sampling bias? Missing data?
 E.g. either upsample the under-sampled client by a factor of two, or use data from only the last two weeks for all clients.
 Can you explicitly model noise, independently from an approach?
-If the dataset is labeled, how was it labeled?
+If the data was labeled separately, how was it labeled?
+ - Is there bias in the labeling? Can the bias be measured?
-What label bias did this introduce? Can it be measured?
 E.g. the label might come from semi-attentive users. Labelling a very small (but representative) set by hand using experts/analysts might be tractable, and will enable us to measure label bias/error on this set.
+ - What can be tried to reduce the bias?
-Can you modify or augment the labelling process to compensate for existing bias?
+Does the data used differ, in structure or content, from the data seen in actual use?
-How similar is the initial dataset to input data expected in production, structure and schema-wise?
 E.g. the content of some items changes dynamically in production. Or perhaps different fields are missing, depending on time of creation, or on the source domain, and are later completed or extrapolated.
-How representative is the initial dataset of production data?
+Is the data used representative of real-world data?
 E.g. The distribution of data among clients constantly changes. Or perhaps it was sampled over two months of spring but the model will go up when winter starts. Or it might have been collected before a major client/service/data source was integrated.
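
A few of the checks in this hunk can be sketched in code. The following is a minimal illustration, not part of the wiki page or the linked article: it shows the two sampling fixes from the example (upsampling the late-joining client by a factor of two, or restricting every client to the last two weeks) and a label error estimate from a small expert-labeled subset. The row layout (`id`, `client`, `date`, `label`) and all function names are hypothetical and chosen only for this sketch.

```python
import random
from collections import Counter
from datetime import date, timedelta


def upsample_client(rows, client, factor, seed=0):
    """Add resampled copies of one client's rows so it has `factor` times as many."""
    rng = random.Random(seed)
    target = [r for r in rows if r["client"] == client]
    # Draw (factor - 1) * len(target) extra rows with replacement.
    extra = [rng.choice(target) for _ in range(len(target) * (factor - 1))]
    return rows + extra


def restrict_to_window(rows, end, days):
    """Keep only rows dated within the last `days` days, up to and including `end`."""
    start = end - timedelta(days=days)
    return [r for r in rows if start < r["date"] <= end]


def label_error_rate(rows, expert_labels):
    """Share of expert-reviewed rows whose original label disagrees with the expert."""
    reviewed = [r for r in rows if r["id"] in expert_labels]
    if not reviewed:
        raise ValueError("no overlap between rows and expert labels")
    wrong = sum(r["label"] != expert_labels[r["id"]] for r in reviewed)
    return wrong / len(reviewed)


# Toy data (assumed): client A has 28 days of rows, client B joined 14 days ago.
end = date(2020, 3, 23)
rows = []
for i in range(28):
    rows.append({"id": len(rows), "client": "A",
                 "date": end - timedelta(days=i), "label": 1})
for i in range(14):
    rows.append({"id": len(rows), "client": "B",
                 "date": end - timedelta(days=i), "label": 0})

upsampled = Counter(r["client"] for r in upsample_client(rows, "B", 2))
windowed = Counter(r["client"] for r in restrict_to_window(rows, end, 14))

# Hypothetical expert review of four rows, keyed by row id.
expert_labels = {0: 1, 1: 0, 28: 0, 29: 0}
error_rate = label_error_rate(rows, expert_labels)
```

Both fixes equalize the per-client row counts in this toy data, but they trade off differently: upsampling keeps all of the older clients' history and duplicates the new client's rows, while the two-week window discards older data so every client covers the same period.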