Differences

This shows you the differences between two versions of the page.

software_development:python_pandas [2022/08/04 14:44]
prgram
software_development:python_pandas [2023/05/16 15:17] (current)
prgram [encoding_errors - 'ignore']
Line 1: Line 1:
 ====== python pandas ====== ====== python pandas ======
 {{INLINETOC}} {{INLINETOC}}
 +
 +=== etc : list ===
 +<code python>
 +set( [list] ) # unique values (order is not preserved)
 +[list].sort() # sorts in place and returns None
 +[list1] + [list2]  # concatenate two lists
 +</​code>​
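A minimal sketch of the three list operations above, using made-up sample values that are not part of the original notes. It also answers the question about `sort()`: the method changes the list in place, whereas `sorted()` returns a new list.

```python
values = [3, 1, 2, 3, 1]  # sample data, assumed for illustration

unique_values = set(values)          # duplicates removed; a set has no guaranteed order
values.sort()                        # sorts the list in place and returns None
merged = values + [10, 20]           # + builds a new, concatenated list
sorted_unique = sorted(set(values))  # sorted() returns a new sorted list
```

If the original order matters, `sorted()` is usually the safer choice because it leaves its input untouched.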
  
 ===== shape of df ===== ===== shape of df =====
Line 9: Line 16:
                values=[VALUES],                values=[VALUES],
                aggfunc='sum').reset_index()                aggfunc='sum').reset_index()
 +</​code>​
 +  * For string values, use aggfunc='max'
 +  * The index columns must not contain NULL values
 +== fillna ==
 +<code python>
 +df[['a','b','c']] = df[['a','b','c']].fillna('')
 </​code>​ </​code>​
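The two notes above can be sketched together as follows. The column names (`key`, `col`, `label`) and the data are hypothetical, and the snippet assumes that a row whose index key is NULL would otherwise be left out of the pivot, which is why the key is filled first.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one row has a NULL in the column used as the pivot index
df = pd.DataFrame({
    "key": ["x", "x", "y", np.nan],
    "col": ["p", "q", "p", "p"],
    "label": ["a", "b", "c", "d"],
})

# Fill NULLs in the index column before pivoting so that the row is kept
df[["key"]] = df[["key"]].fillna("")

# The values are strings, so aggregate with 'max' instead of 'sum'
wide = df.pivot_table(index=["key"],
                      columns=["col"],
                      values="label",
                      aggfunc="max").reset_index()
```

Filling with an empty string keeps the row under its own key; any other placeholder value would work the same way.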
  
Line 14: Line 27:
 <code python> <code python>
 df.groupby([COLUMNS]).agg({'COLUMN': 'sum'}).reset_index() df.groupby([COLUMNS]).agg({'COLUMN': 'sum'}).reset_index()
 +
 +df.groupby([COLUMNS])['​COLUMN'​].max().reset_index()
  
 df = df.assign(date=pd.to_numeric(df['​date'​],​ errors='​coerce'​)).groupby(['​코드',​ '​종목명'​]).agg({'​date':​np.min}).reset_index().drop_duplicates() df = df.assign(date=pd.to_numeric(df['​date'​],​ errors='​coerce'​)).groupby(['​코드',​ '​종목명'​]).agg({'​date':​np.min}).reset_index().drop_duplicates()
Line 47: Line 62:
 df.columns = ['​1'​] + df.columns[1:​].tolist() df.columns = ['​1'​] + df.columns[1:​].tolist()
 </​code>​ </​code>​
 +
 +=== order of columns ===
 +<code python>
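 +# Option 1: sort columns by label; for MultiIndex columns, pass the level name or number
 +df = df.sort_index(axis='columns', level='MULTILEVEL INDEX NAME/no')
 +# Option 2: reorder columns explicitly
 +df.columns  # inspect the current column labels
 +col_order = ['a','b','c']
 +df = df.reindex(col_order, axis='columns')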
 +</​code>​
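A small sketch of both approaches on a flat (non-MultiIndex) frame; the column names and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"c": [1], "a": [2], "b": [3]})  # sample frame, assumed

# Option 1: sort the columns alphabetically by label
alphabetical = df.sort_index(axis="columns")

# Option 2: impose an explicit order
col_order = ["b", "a", "c"]
explicit = df.reindex(col_order, axis="columns")
```

Note that `reindex` creates an all-NaN column for any label in `col_order` that does not exist in the frame, so a typo in the list does not raise an error.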
 +
  
 === map === === map ===
Line 81: Line 107:
 iloc: Select by position iloc: Select by position
 loc: Select by label loc: Select by label
 +  ​
 +df.loc[:,​~df.columns.isin(['​a','​b'​])]  ​
 +
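 +df[~( df['a'].isin(['1','2','3']) & (df['b']=='3') )] # row-wise; the comparison needs its own parentheses because & binds tighter than ==
 +df.loc[~( df['a'].isin(['1','2','3']) & (df['b']=='3') ), 8] # row-wise filter plus column selection (8 is a column label)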
 </​code>​ </​code>​
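A runnable sketch of the negated filters, with hypothetical data. Each comparison is wrapped in its own parentheses, because `&` has higher precedence than `==` in Python and an unparenthesized `df['b'] == '3'` would be grouped incorrectly.

```python
import pandas as pd

# Sample data, assumed for illustration
df = pd.DataFrame({
    "a": ["1", "2", "9", "1"],
    "b": ["3", "0", "3", "3"],
    "c": [10, 20, 30, 40],
})

# Rows where a is one of 1/2/3 AND b equals '3'
mask = df["a"].isin(["1", "2", "3"]) & (df["b"] == "3")

kept_rows = df[~mask]                                  # drop the matching rows
kept_c = df.loc[~mask, "c"]                            # same rows, one column by label
without_ab = df.loc[:, ~df.columns.isin(["a", "b"])]   # drop columns a and b
```

Building the mask once and reusing it keeps the row filter and the column selection consistent.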
  
Line 92: Line 123:
   ​   ​
 =====I/O file===== =====I/O file=====
 +
 +=== encoding_errors - '​ignore'​===
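 +If reading still fails even though the encoding is specified correctly, ignore the encoding errors.
 +This happens often with public (government) datasets.
 +
 +Error tokenizing data. C error: EOF inside string starting at row 0 | pandas error
 +https://con2joa.tistory.com/m/60
 +Fix: pass the quoting=csv.QUOTE_NONE parameter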
 +
 +<code python>
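 +import chardet
 +import pandas as pd
 +
 +# Guess the encoding from the first 100,000 bytes of the file
 +with open(file, 'rb') as rawdata:
 +    result = chardet.detect(rawdata.read(100000))
 +result
 +
 +# Undecodable bytes are dropped instead of raising an error (pandas >= 1.3)
 +data = pd.read_csv(file, encoding='cp949', encoding_errors='ignore')
 +# To skip malformed rows: on_bad_lines='skip' (pandas >= 1.3)
 +# Older pandas: error_bad_lines=False (deprecated in 1.3, removed in 2.0)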
 +</​code>​
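A self-contained sketch that reproduces both problems on a tiny, made-up file: one stray byte that is not valid cp949, and one unbalanced double quote. The file contents and the temporary path are assumptions for illustration; `encoding_errors` requires pandas 1.3 or later.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical file contents, built as raw bytes
raw = (
    b"name,value\n"
    b"\xb0\xa1,1\n"   # a Hangul syllable encoded as cp949
    b"b\xffad,2\n"    # 0xFF is not a valid cp949 byte
    b'x"y,3\n'        # unbalanced quote, breaks the default quoting rules
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")
    with open(path, "wb") as handle:
        handle.write(raw)

    data = pd.read_csv(
        path,
        encoding="cp949",
        encoding_errors="ignore",   # drop bytes that cannot be decoded
        quoting=csv.QUOTE_NONE,     # treat quote characters as ordinary text
    )
```

With `quoting=csv.QUOTE_NONE`, quoted fields are no longer unwrapped, so this setting suits files where quotes are plain data rather than field delimiters.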
  
 === to_numeric === === to_numeric ===