Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Last revision Both sides next revision
software_development:python_pandas [2022/08/04 14:42]
prgram
software_development:python_pandas [2023/05/16 15:09]
prgram [encoding_errors - 'ignore']
Line 1: Line 1:
 ====== python pandas ====== ====== python pandas ======
 {{INLINETOC}} {{INLINETOC}}
 +
 +=== etc : list ===
 +<code python>
 +set( [list] ) # unique value
 +[list].sort() #​자동적용?​
 +[list1] + [list2] ​ #​list합치기
 +</​code>​
 +
 +===== shape of df =====
 +=== Pivot_table ===
 +<code python>
 +df.pivot_table(index=[인덱스컬럼],​
 +               ​columns=[컬럼1,​컬럼2],​
 +               ​values=[값],​
 +               ​aggfunc='​sum'​).reset_index()
 +</​code>​
 +  * string일 때는 aggfunc='​max'​
 +  * index에 NULL 있으면 안됨
 +== fillna ==
 +<code python>
 +df[['​a','​b','​c'​]] = df[['​a','​b','​c'​]].fillna(''​)
 +</​code>​
 +
 +=== group by ===
 +<code python>
 +df.groupby([컬럼들]).agg({'​컬럼':​sum}).reset_index()
 +
 +df.groupby([COLUMNS])['​COLUMN'​].max().reset_index()
 +
 +df = df.assign(date=pd.to_numeric(df['​date'​],​ errors='​coerce'​)).groupby(['​코드',​ '​종목명'​]).agg({'​date':​np.min}).reset_index().drop_duplicates()
 +
 +df = df[['​코드',​ '​date'​]].groupby(['​코드'​]).agg({'​date':​ [np.min, np.max]}).reset_index(level='​종목코드'​)
 +df.columns = df.columns.droplevel()
 +</​code>​
 +
 +=== rank ===
 +<code python>
 +df['​rank'​] = df.groupby('​code'​)['​value'​].rank(ascending=False)
 +</​code>​
 +
 +=== merge ===
 +<code python>
 +df_out = df_out.merge(df,​ on=['​no',​ '​name'​],​ how='​outer'​) #​left_on right_on
 +</​code>​
 +
 +
 +===== modify =====
 +=== Series to DF ===
 +<code python>
 +df = df.append(pd.DataFrame(pd.Series(dict_row)).transpose(),​ ignore_index=True)
 +</​code>​
 +
 +
 +=== rename ===
 +<code python>
 +df.rename(columns = {'​컬럼':'​new name'​},​ index={0:'​new index'​})
 +df.rename({1:​ 2, 2: 4}, axis='​index'​)
 +
 +df.columns = [컬럼들..]
 +df.columns = ['​1'​] + df.columns[1:​].tolist()
 +</​code>​
 +
 +=== order of columns ===
 +<code python>
 +#1
 +df = df.sort_index(axis='​columns',​ level = '​MULTILEVEL INDEX NAME/​no'​)
 +#2
 +df.columns
 +col_order = ['​a','​b','​c'​]
 +df = df.reindex(col_order,​ axis='​columns'​)
 +</​code>​
 +
 +
 +=== map ===
 +<code python>
 +df['​코드'​] = '​A'​ + df['​코드'​].map(lambda x: f'​{x:​0>​6}'​) ​ #6글자로
 +</​code>​
 +
  
 ===== get info ===== ===== get info =====
Line 29: Line 107:
 iloc: Select by position iloc: Select by position
 loc: Select by label loc: Select by label
 +  ​
 +df.loc[:,​~df.columns.isin(['​a','​b'​])]  ​
 +
 +df[~( df['​a'​].isin(['​1','​2','​3'​]) & df['​b'​]=='​3'​ )] #​row-wise
 +df.loc[~( df['​a'​].isin(['​1','​2','​3'​]) & df['​b'​]=='​3'​ ), 8] #​row-wise & column
 </​code>​ </​code>​
  
Line 36: Line 119:
     print(idx, row)     print(idx, row)
 </​code>​ </​code>​
- 
  
   ​   ​
   ​   ​
 =====I/O file===== =====I/O file=====
 +
 +=== encoding_errors - '​ignore'​===
 +Encoding 제대로 했는데도 안되면..
 +공공데이터가 이런 경우가 많음.
 +
 +<code python>
 +import chardet
 +with open(file, '​rb'​) as rawdata:
 +    result = chardet.detect(rawdata.read(100000))
 +result
 +
 +
 +data = pd.read_csv( file, encoding='​cp949',​ encoding_errors='​ignore'​)
 +# on_bad_lines='​skip'​
 +# error_bad_lines=False
 +</​code>​
 +
 +=== to_numberic ===
 +<code python>
 +#1
 +df = pd.read_csv('​file.csv',​ encoding='​utf-8',​ index_col=0,​ converters={'​col':​int,​ '​col2':​str})
 +#2
 +df['​col'​] = pd.to_numeric(df[col].str.replace(',',''​),​ errors='​coerce'​)
 +</​code>​
  
 === Excel === === Excel ===
Line 80: Line 186:
  
  
-=== Series to DF === 
-<code python> 
-df = df.append(pd.DataFrame(pd.Series(dict_row)).transpose(),​ ignore_index=True) 
-</​code>​ 
- 
-=== rename === 
-<code python> 
-df.rename(columns = {'​컬럼':'​new name'​},​ index={0:'​new index'​}) 
-df.rename({1:​ 2, 2: 4}, axis='​index'​) 
- 
-df.columns = [컬럼들..] 
-df.columns = ['​1'​] + df.columns[1:​].tolist() 
-</​code>​ 
- 
-=== map === 
-<code python> 
-df['​코드'​] = '​A'​ + df['​코드'​].map(lambda x: f'​{x:​0>​6}'​) ​ #6글자로 
-</​code>​ 
- 
-=== Pivot_table === 
-<code python> 
-df.pivot_table(index=[인덱스컬럼],​ 
-               ​columns=[컬럼1,​컬럼2],​ 
-               ​values=[값],​ 
-               ​aggfunc='​sum'​).reset_index() 
-</​code>​ 
- 
-=== group by === 
-<code python> 
-df.groupby([컬럼들]).agg({'​컬럼':​sum}).reset_index() 
- 
-df = df.assign(date=pd.to_numeric(df['​date'​],​ errors='​coerce'​)).groupby(['​코드',​ '​종목명'​]).agg({'​date':​np.min}).reset_index().drop_duplicates() 
- 
-df = df[['​코드',​ '​date'​]].groupby(['​코드'​]).agg({'​date':​ [np.min, np.max]}).reset_index(level='​종목코드'​) 
-df.columns = df.columns.droplevel() 
-</​code>​ 
- 
-=== merge === 
-<code python> 
-df_out = df_out.merge(df,​ on=['​no',​ '​name'​],​ how='​outer'​) 
-</​code>​ 
- 
-=== rank === 
-<code python> 
-df['​rank'​] = df.groupby('​code'​)['​value'​].rank(ascending=False) 
-</​code>​ 
-=== to_numberic === 
-<code python> 
-#1 
-df = pd.read_csv('​file.csv',​ encoding='​utf-8',​ index_col=0,​ converters={'​col':​int,​ '​col2':​str}) 
-#2 
-df['​col'​] = pd.to_numeric(df[col].str.replace(',',''​),​ errors='​coerce'​) 
-</​code>​