====== Fetching Facebook Posts via the API #2: Processing with Python ======
{{tag>blogs 페이스북API Facebook API 자동화 Python}}
{{section>blog:facebook_api_get_posting_1_api#intro &noindent&nolink&nouser&noeditbtn&nocomments&notags&nodate}}

===== Why Python? =====
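DokuWiki, the platform this site runs on, is PHP-based, so the posts could be scraped with PHP. However:
  - The site is on low-cost web hosting, so traffic could become a problem.
  - Fetching the data on every page view would increase loading time.
  - I also want to keep an offline copy of the data.
  - Data cleaning and library use are easier in Python, and because I am familiar with it, development time is shorter.
For these reasons I decided to use Python, even though the script has to be run periodically from a desktop,
and to write only a simple PHP script that fetches the latest status.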

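The overall workflow is:

  - Call the API and arrange the data into a table
  - Convert to wiki syntax
  - Download the images attached to each post
  - Upload the images via FTP
  - Edit the wiki page

This part covers steps 1 and 2.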

<WRAP center round important 60%>
Except for the function definitions, the explanation follows the order of the code.
</WRAP>


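===== Fetching the Data =====

==== Calling the API ====

The data is fetched using the cURL request and the access token obtained in [[blog:facebook_api_get_posting_1_api|Part 1 (Using the API)]].
The get function of the requests library retrieves the response, and the JSON is converted to a dict.

(The utils module on Line 1 is the file containing nested_json, get_images, UTCtoKST, and the other functions explained below.)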
<code python; highlight: [1,11,22] >
from utils import *
import pandas as pd
import requests
from urllib.parse import unquote
import re

output = pd.DataFrame()
token = "[ACCESS_TOKEN]"
url = f"https://graph.facebook.com/v6.0/100300361542753/posts?fields=created_time%2Cfull_picture%2Cicon%2Cid%2Cmessage%2Cmessage_tags%2Cpicture%2Cattachments.limit(10)%7Burl%2Cmedia%2Cunshimmed_url%2Cmedia_type%2Ctitle%2Cdescription%2Cdescription_tags%2Csubattachments%2Ctype%2Ctarget%7D%2Cshares&limit=100&pretty=0&access_token={token}"

resp = requests.get(url=url)
json_out = resp.json()

output = nested_json(json_out)
</code>
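The hand-encoded query string above is hard to read and to edit. As a minimal sketch, the same kind of URL can be assembled from a plain field list with the standard library's ''urlencode''. The helper name ''build_posts_url'' and the shortened field list are illustrative assumptions, not part of the original script.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_posts_url(page_id, token, fields, limit=100, version="v6.0"):
    """Assemble a Graph API /posts URL (hypothetical helper, for illustration)."""
    query = urlencode({
        "fields": ",".join(fields),   # urlencode percent-encodes commas, braces, etc.
        "limit": limit,
        "pretty": 0,
        "access_token": token,
    })
    return f"https://graph.facebook.com/{version}/{page_id}/posts?{query}"

# Shortened field list; the real request above asks for more fields.
fields = ["created_time", "id", "message", "attachments.limit(10){url,media,title}"]
url = build_posts_url("100300361542753", "[ACCESS_TOKEN]", fields)

# Round trip: decoding the query gives back the readable field list.
decoded_fields = parse_qs(urlsplit(url).query)["fields"][0]
```

The resulting ''url'' can be passed to ''requests.get'' exactly like the hand-written one. Alternatively, ''requests.get'' accepts the same dictionary through its ''params'' argument and does the encoding itself.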


==== Fetching the Next Page ====

As described in the next section, the JSON is split at the first level into data and paging, and paging > next holds the URL of the next page, so the loop keeps fetching for as long as that key exists.

<code python; highlight:[5]>
while True:
    if 'next' in json_out.get('paging', {}):   # the last page has no 'next' key
        print('===============Next Page==================')
        
        url2 = json_out['paging']['next']
        resp = requests.get(url=url2)

        json_out = resp.json()
        output2 = nested_json(json_out)

        # DataFrame.append was removed in pandas 2.0; ignore_index keeps row labels unique
        output = pd.concat([output, output2], ignore_index=True)
    else:
        break
output.to_excel('facebook.xlsx')
</code>
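The paging logic can also be separated from the HTTP call, which makes it easy to try out without touching the API. The following is a sketch under that assumption: ''iter_pages'' and the ''fetch_json'' parameter are hypothetical names, and the two fake pages stand in for real responses.

```python
def iter_pages(first_page, fetch_json):
    """Yield the first page and every following page of a paginated response.

    fetch_json is any callable that takes a URL and returns the decoded JSON,
    for example: lambda u: requests.get(u).json()
    """
    page = first_page
    while True:
        yield page
        next_url = page.get("paging", {}).get("next")
        if not next_url:          # no 'next' key means this was the last page
            break
        page = fetch_json(next_url)

# Offline demonstration: a dict lookup plays the role of the HTTP request.
fake_pages = {"page2": {"data": [{"id": "3"}], "paging": {}}}
first = {"data": [{"id": "1"}, {"id": "2"}], "paging": {"next": "page2"}}

ids = [post["id"]
       for page in iter_pages(first, fake_pages.__getitem__)
       for post in page["data"]]
```

With the real API, each yielded page would be passed to nested_json and the resulting frames combined once at the end.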


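==== Converting JSON to a DataFrame ====
For archiving, and for later analysis of things like how well each Facebook post performed, I wanted to convert the data into a DataFrame that can be inspected by eye.
However, in the returned JSON the attachments field is itself a dict, as shown below, so converting it to a DataFrame by the usual means looked difficult.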

<code python; highlight: [5-13]; title: Facebook API JSON structure>
json_out = dict({
 └ data : 
   [ first_post({
     	title, created_time, message, etc.
        attachments : dict( { 
           data: [ first_attachment{
             └ url
             └ media
             └ image : height, width, src
             └ source
             └ ...
             }, second_attachment{} ]
         })
   }), second_post, ..]
 └ paging : ...
       └ next : URL of the next page
})
</code>


So I wrote the nested_json function, called in the first code block, as follows; it uses recursion to turn the nested JSON into a DataFrame.
<code python>
def nested_json(json_out):
    # Extract the list of records: either the input itself, or the value
    # under the first key ('data' in the Graph API response).
    if isinstance(json_out, list):
        data = json_out
    else:
        data = json_out[next(iter(json_out))]
    
    # Flatten each record into a Series; each Series becomes one row.
    rows = [nested_json_row(item) for item in data]
    return pd.DataFrame(rows).reset_index(drop=True)
</code>

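For each element of the data list (the part that becomes one row of the DataFrame), nested_json runs nested_json_row and assembles the results into a DataFrame.

nested_json_row recurses, as shown below, until it reaches a plain value that is neither a dict nor a list, collecting column-name/value pairs along the way.
For a dict, the key is added to the column name; for a list, the index is added, so that every column stays distinguishable.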


<code python; highlight: [5-6,10-11]>
def nested_json_row(dict_data):
    out = pd.Series(dtype=object)
    for k,v in dict_data.items():
        if isinstance(v, dict):
            out_child = nested_json_row(v)
            out_child.index = k + '_' + out_child.index
            out = pd.concat([out, out_child])   # Series.append was removed in pandas 2.0
        elif isinstance(v, list):
            for idx,it in enumerate(v):
                out_child = nested_json_row(it)
                out_child.index = f"{k}_{idx}_" + out_child.index
                out = pd.concat([out, out_child])
        else:
            out[k] = v
    return out
</code>


Saved to Excel, the column names come out correctly, as shown below.

{{blog:pasted:20200218-210029.png?200}}
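To make the naming scheme concrete, here is a pandas-free sketch of the same flattening idea applied to a made-up record. The function ''flatten'' and the sample data are assumptions for illustration; unlike nested_json_row, this variant also accepts lists of plain values.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts/lists into one flat dict of column name -> value."""
    flat = {}
    if isinstance(record, dict):
        children = record.items()
    elif isinstance(record, list):
        children = enumerate(record)      # list index becomes part of the name
    else:
        return {prefix[:-1]: record}      # leaf value: strip the trailing "_"
    for key, value in children:
        flat.update(flatten(value, f"{prefix}{key}_"))
    return flat

# Made-up record shaped like one element of json_out['data'].
sample_post = {
    "id": "100_200",
    "message": "hello",
    "attachments": {
        "data": [{"title": "T", "media": {"image": {"src": "http://x/y.png"}}}]
    },
}
columns = flatten(sample_post)
```

The keys of ''columns'' are ''id'', ''message'', ''attachments_data_0_title'' and ''attachments_data_0_media_image_src'', which is the same pattern as the column names used in the rest of this post. A list of such dicts can be handed to ''pd.DataFrame'' directly to build the table.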


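===== Converting to Wiki Syntax =====
This part cleans the data into a usable form and reshapes it so that it can be uploaded.
This site is built with [[https://www.dokuwiki.org/dokuwiki|DokuWiki]], so the table had to be converted to wiki syntax.

If the wiki syntax in the code below is changed to HTML, Markdown, or similar, the code should be usable elsewhere as well.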


==== Basic Fields ====
<code python>
from tqdm import tqdm
output = output.fillna('!!None!!')
output['img_base64']=''
contents = ''
# start of the for loop
for idx,row in tqdm(output.iterrows()):
    con = {
            'title': row['attachments_data_0_title'],
            'message': row['message'].replace('!!None!!', ''),
            'desc': row['attachments_data_0_description'].replace('!!None!!',''),
            'url': "[[https://www.facebook.com/data.triviaz/posts/"+row['id'].split('_')[1]+"|페이스북에서 보기]]",
            'picture': row['attachments_data_0_media_image_src'],
            'type': row['attachments_data_0_media_type'],
           }
    
    # Title: if missing, fall back to the message, or to 'No title'
    if con['title'] == '!!None!!':
        if con['desc'] == '' and con['message'] == '':
            con['title'] = 'No title'
        else:
            con['title'] = con['message'][:20] + '...'
</code>

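  * title : I mostly post shared links, so the title found in the attachment is used
    * The title-handling part at Line 15 takes the first 20 characters of the message when the post is not a shared link
  * message : the message column
  * Description : the link summary that Facebook scrapes and displays automatically
  * url : the id has the form ''pageid_postid'', so the post id part is used to build the Facebook link
  * picture : URL of the image inside the link, or of an image I uploaded
  * type : kind of post (link, picture, video, etc.)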
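One gap in the title logic above: a post that has a description but no message ends up with the title ''...''. The following is a sketch of a variant that falls back to the description in that case. The function name ''pick_title'' is hypothetical, and it takes ''None'' or an empty string for a missing value rather than the ''!!None!!'' sentinel used in the script.

```python
def pick_title(title, message, desc, width=20):
    """Choose a heading: attachment title, else the start of message or description."""
    if title:
        return title
    text = message or desc            # prefer the message, fall back to the description
    if not text:
        return "No title"
    # Add the ellipsis only when something was actually cut off.
    return text[:width] + ("..." if len(text) > width else "")

examples = [
    pick_title("Shared article", "my comment", "scraped summary"),
    pick_title(None, "", "scraped summary"),
    pick_title(None, "", ""),
]
```

Whether the description is a good stand-in for a title depends on the page's content, so this is a design choice rather than a fix that applies everywhere.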


==== Image Handling ====

The API returns image URLs that go through Facebook's CDN servers rather than the original URLs.
These image URLs come in two forms:
  - Facebook-hosted image (line3) : URLs of the form ''https://scontent.xx.fbcdn.net'' are directly accessible, so the URL is used as is.
  - External image (line6~) : in ''https://external.xx.fbcdn.net/safe_image.php'', the ''&url='' part is the external image URL.
With a few exception cases added, the image URL is extracted as follows.

<code python>
    # Image handling ################
    img_url = ''
    if 'scontent' in con['picture']:    # hosted by Facebook: use the URL as is
        img_url = con['picture']
    else:                               # external image
        if '&url=' in con['picture']:   # take the span from &url= up to &cfs
            img_url = con['picture'][con['picture'].index('&url')+5:con['picture'].index('&cfs')]
            img_url = unquote(img_url)
            # Keep the query string for .jpg/.png endings and daum URLs (no extension);
            # otherwise drop it, since some URLs carry a trailing ?query.
            if not (img_url[-3:] in ['jpg','png'] or 'daum' in img_url):
                img_url = img_url.split('?')[0]
            img_url = img_url.replace('%3A',':').replace('%2F',"/")
</code>
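Slicing between ''&url='' and ''&cfs'' raises a ValueError when ''&cfs'' is missing and depends on the parameter order. As an alternative sketch, the standard library can parse the query string instead. The function name and the sample link are assumptions; the link is made up in the shape described above and is not a real CDN URL.

```python
from urllib.parse import urlsplit, parse_qs

def original_image_url(cdn_url):
    """Return the external image URL wrapped in a safe_image.php link, or '' if absent."""
    params = parse_qs(urlsplit(cdn_url).query)
    # parse_qs already percent-decodes the value, so no extra unquote is needed.
    return params.get("url", [""])[0]

# Made-up example (not a real CDN link).
sample = ("https://external.xx.fbcdn.net/safe_image.php?d=AQ&w=130&h=130"
          "&url=https%3A%2F%2Fexample.com%2Fimg%2Fa.jpg%3Fv%3D2&cfs=1")
extracted = original_image_url(sample)
```

Here ''extracted'' is ''https://example.com/img/a.jpg?v=2''. The rule about stripping a trailing ''?query'' from some URLs would still have to be applied afterwards, as in the code above.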


Then the ''get_images'' function (explained in [[blog:facebook_api_get_posting_3_image_ftp_write|the next part (Images and FTP)]]) is called as shown below.
For archiving, it
  - downloads the image
  - encodes the image as base64 so that it can be used directly in HTML, and adds it to the DataFrame
<code python>
    if img_url != '':
        img_base64 = get_images(img_url, row['id'])
        output.loc[idx,'img_base64'] = img_base64
        con['picture'] = "{{blogs_facebook_upload:" + row['id']+ ".png?100}}"
</code>


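==== Other Processing ====
  * The post's main link shows only the domain name, as Facebook itself does.
  * Long links inside the message are shortened with an ellipsis.
  * Note also that DokuWiki's forced line break is written with backslashes (''%%\\%%''), and a backslash cannot appear inside an f-string expression, so the newlines are replaced in a separate step.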
<code python>
    con['link'] = "[[" + row['attachments_data_0_unshimmed_url'] + "|"+ row['attachments_data_0_unshimmed_url'].replace("https://","").replace("http://","").split("/")[0] + "]]"
  
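    # Shorten links that appear inside the message with an ellipsis
    # (raw string avoids invalid-escape warnings; no space in the character class, so a match stops at whitespace)
    urlfound = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', con['message'])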
    for urlf in urlfound:
        if len(urlf) > 30:
            con['message'] = con['message'].replace(urlf, f"[[{urlf}|{urlf[:30]}...]]")

    con['message'] = con['message'].replace('\n', '  ')   # SyntaxError: f-string expression part cannot include a backslash
</code>
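Two details of the step above are worth a closer look. When the same long URL occurs twice in a message, ''str.replace'' runs once per match and rewrites the link that was already generated. The following sketch uses ''re.sub'' with a callback so that each occurrence is handled exactly once, and ''urlsplit'' for the domain. The function names and the simplified URL pattern are assumptions, not the original code.

```python
import re
from urllib.parse import urlsplit

# Simplified pattern: a URL runs until whitespace, a bracket, or a pipe.
URL_RE = re.compile(r"https?://[^\s\]\[|]+")

def domain_link(url):
    """DokuWiki link whose label is only the host name."""
    return f"[[{url}|{urlsplit(url).netloc}]]"

def shorten_links(text, limit=30):
    """Wrap URLs longer than `limit` in a DokuWiki link with a truncated label."""
    def repl(match):
        url = match.group(0)
        if len(url) <= limit:
            return url                        # short URLs stay untouched
        return f"[[{url}|{url[:limit]}...]]"
    return URL_RE.sub(repl, text)

long_url = "https://example.com/" + "a" * 30
shortened = shorten_links(f"first {long_url} second {long_url}")
label = domain_link("https://www.dokuwiki.org/dokuwiki")
```

Here ''label'' is ''%%[[https://www.dokuwiki.org/dokuwiki|www.dokuwiki.org]]%%'', and ''shortened'' contains exactly two well-formed links.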


==== Date Handling and Finishing the Template ====

All date values returned by the API are in [[https://ko.wikipedia.org/wiki/%ED%98%91%EC%A0%95_%EC%84%B8%EA%B3%84%EC%8B%9C|UTC]].
Korea is at UTC+9, so the times need to be converted, here with timezone from the pytz library, as shown below.

<code python>
from pytz import timezone
import datetime

def UTCtoKST(timestr):
    KST = timezone('Asia/Seoul')    #https://blog.kimkevin.net/python-utc-to-kst/
    return datetime.datetime.strptime(timestr, '%Y-%m-%dT%H:%M:%S%z').astimezone(KST).strftime(
        '%Y-%m-%d %H:%M:%S')
</code>
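If installing pytz is not desirable, the same conversion can be sketched with the standard library alone. This assumes a fixed UTC+9 offset, which is sufficient because South Korea currently observes no daylight saving time. The name ''utc_to_kst'' and the sample timestamp are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset timezone for Korea Standard Time (UTC+9).
KST = timezone(timedelta(hours=9), "KST")

def utc_to_kst(timestr):
    """Convert a timestamp such as '2020-02-18T12:00:29+0000' to a KST string."""
    parsed = datetime.strptime(timestr, "%Y-%m-%dT%H:%M:%S%z")
    return parsed.astimezone(KST).strftime("%Y-%m-%d %H:%M:%S")

converted = utc_to_kst("2020-02-18T12:00:29+0000")   # -> '2020-02-18 21:00:29'
```

The date rolls over correctly as well: ''2020-02-18T20:00:00+0000'' becomes ''2020-02-19 05:00:00''.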


Building on the work so far, I made a template that applies wiki syntax to each post, as shown below.

<code python>
    content = f"""
=== {con['title']} ===
 | {con['type'].upper()} | {UTCtoKST(row['created_time'])} | {con['url']} | 

{con['message']}
> <wrap group>
<wrap column>
{con['picture'].replace('!!None!!', '(No image)')}
</wrap>
<wrap column>
{con['desc']} 
{con['link']}
</wrap>
</wrap>

----
"""
    contents += content
# END of for ##################################
</code>


Up to this point the for loop has built the individual posts. The code below adds a header and footer and saves the result to a txt file, which completes this part.

<code python; highlight: [1]>
iframe = "{{url>fb_newest.php?date="+max(output['created_time']).replace('+','%2B')+"&format=m/d%20H&front=[최신:%20&mid=%EC%8B%9C%EA%B9%8C%EC%A7%80%20&end=%20%ED%8F%AC%EC%8A%A4%ED%8C%85%EC%9D%B4%20%EB%8D%94%20%EC%9E%88%EC%8A%B5%EB%8B%88%EB%8B%A4]&style=font-size:11pt;font-color:%23333333;font-family:Helvetica,Arial,sans-serif; 100%,30 noscroll noborder left|no iframe error}}"
contents = """====== Facebook Posting Archive ======
{{tag> blog Facebook 페이스북 페이지}}
""" + f"""
> {UTCtoKST(max(output['created_time']))} 까지 총 {len(output)} 개 포스팅 Archived
> {iframe}
> 최신 포스팅과 더 많은 소식은 [[https://facebook.com/data.triviaz|Data.triviaz]] 좋아요, 팔로잉 해주세요

----
[[weblog:facebook_api_포스팅_가져오기_1_api사용|API사용,Python데이터정리,PHP최신현황 방법]]
----

""" + contents + """
~~~DISCUSSION~~~
"""
# end of HEADER ###################################


# The with block closes the file even if the write fails
with open('facebook_posting.txt', 'w', encoding="utf-8") as f:
    f.write(contents)
</code>


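===== Future Work / Next Part =====

[[blog:facebook_api_get_posting_3_image_ftp_write|The next part (Images and FTP)]] will cover
  - the get_images function called in the image handling section : downloading images and encoding them as base64
  - uploading images via FTP
  - creating/editing the page on the web from the text saved in the txt file


In addition, [[blog:facebook_api_get_posting_4_php|Part 4 (Latest status with PHP)]] will explain the PHP page that is loaded through the iframe on Line 1 of the code just above.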


~~DISCUSSION~~