[Python] A Collection of Korean Text Preprocessing Snippets

2019. 9. 7. 00:21

A collection of helper functions for preprocessing Korean text in Python.

import re
from collections import Counter

# Characters that are replaced with a space during cleanup
special_chars = ['\n', '?', '.', '+', '~', '-', '_', ',', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '{', '}', '[', ']', '/', '=', '`', '|']

def string_cleanup(x, notwanted):
    # Replace every unwanted character in x with a single space
    for item in notwanted:
        x = x.replace(item, ' ')
    return x

def multiple_spaces_to_one(sentence):
    import re
    return re.sub(' +', ' ', sentence)

def remove_duplicated_words(sentence):
    # Drop repeated words, keeping the first occurrence of each in order
    return ' '.join(dict.fromkeys(sentence.split(' ')))

def preprocessing(sentence):
    sentence = string_cleanup(sentence, special_chars)
    # Remove digits and standalone jamo such as 'ㅋㅋㅋ' or 'ㅏㅏ'
    sentence = re.compile('[0-9ㄱ-ㅎㅏ-ㅣ]+').sub('', sentence)
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = multiple_spaces_to_one(sentence)
    # Drop repeated words; Counter keeps its keys in first-seen order
    sentence = ' '.join(Counter(sentence.split(' ')).keys())
    return sentence

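def preprocessing_udf(x):
    # x is one DataFrame row; clean the text in its 'context' column
    text = preprocessing(x['context'])
    return text

# result_df is assumed to be an existing pandas DataFrame with a 'context' column
result_df.head(2).apply(preprocessing_udf, axis=1)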
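
To see what the steps above do to a single sentence, here is a minimal standalone sketch using only the standard library. The function name `demo_preprocess`, the shortened list of unwanted characters, and the sample sentence are all hypothetical and not part of the original post. The sketch follows the same order as `preprocessing`: replace special characters, strip digits and stray jamo, trim and lowercase, collapse spaces, then drop repeated words.

```python
import re


def demo_preprocess(sentence, unwanted=('!', '~', '.', ',', '?')):
    # Hypothetical helper: a condensed version of the steps in the post.
    # 1) Replace each unwanted character with a space (shortened list, for illustration).
    for ch in unwanted:
        sentence = sentence.replace(ch, ' ')
    # 2) Remove digits and standalone jamo (consonants ㄱ-ㅎ, vowels ㅏ-ㅣ).
    #    Complete syllables such as '안' are outside these ranges and are kept.
    sentence = re.sub('[0-9ㄱ-ㅎㅏ-ㅣ]+', '', sentence)
    # 3) Trim, lowercase, and collapse runs of spaces into one.
    sentence = re.sub(' +', ' ', sentence.strip().lower())
    # 4) Drop repeated words while keeping first-seen order.
    return ' '.join(dict.fromkeys(sentence.split(' ')))


# Made-up sample sentence, not taken from the post.
sample = "안녕하세요!! ㅋㅋㅋ 오늘 날씨 좋다~ 오늘 2019 Python"
print(demo_preprocess(sample))  # 안녕하세요 오늘 날씨 좋다 python
```

Note that step 2 runs before the spaces are collapsed, so the gaps left behind by removed tokens such as 'ㅋㅋㅋ' or '2019' are cleaned up in step 3.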

License: Attribution, NonCommercial, NoDerivatives


더블리의 12층
