Posts in the 'We Are Developers/Data Science' category

[Jupyter] Useful extensions to use with notebooks

2019. 12. 8. 21:19

https://ipywidgets.readthedocs.io/en/latest/

ipywidgets is useful for changing variables interactively through widgets in a notebook (probably handy when drawing graphs).

If you are an ML user working with TensorFlow, ml-tooling's ml-workspace is worth considering.

(tensorflow, tensorboard, docker, anaconda, pytorch, and other related tools all come preinstalled.)

https://github.com/ml-tooling/ml-workspace

If you use git, there is a Git extension for JupyterLab:

https://github.com/jupyterlab/jupyterlab-git

If you write a lot of markdown notes, the table of contents extension for JupyterLab helps:

https://github.com/jupyterlab/jupyterlab-toc

If you use TensorBoard alongside JupyterLab:

https://github.com/chaoleili/jupyterlab_tensorboard

If you work with maps (geospatial visualization and analysis):

https://github.com/OpenGeoscience/geonotebook

License: Attribution, NonCommercial, NoDerivatives

[Python] Beautiful visualization with seaborn! Statistical Data Visualization

2019. 12. 8. 21:06

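I often use data visualization in Python. Still, it is better to draw graphs that look a bit more polished, right? We usually draw graphs in order to share information. If I am the only one looking, appearance matters less, but since this is visualization, making it look good and easy to read still matters.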

https://seaborn.pydata.org/

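seaborn is a library built on top of matplotlib, and it works very well together with pandas. When visualizing, what matters is how easily you can express the relationships among several variables, and seaborn provides a dataset-oriented API that makes such plots easy to write.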

https://seaborn.pydata.org/introduction.html

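The introduction page above shows which kinds of charts are used in which situations, so it is worth a look. Presenting data matters, but how to present it well is itself a subject of study, so browsing charts that have already been drawn is a good way to get insight into how to present your own.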
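
As a rough illustration of the dataset-oriented API, the sketch below passes a pandas DataFrame to seaborn and refers to columns by name. The column names (hours, score, group) and the values are made-up toy data, not taken from the seaborn documentation.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; in a notebook this line can be dropped

import pandas as pd
import seaborn as sns

# Toy data, invented for this example
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 1, 2, 3, 4],
    "score": [52, 60, 71, 80, 48, 55, 63, 70],
    "group": ["a"] * 4 + ["b"] * 4,
})

# Columns are referenced by name; hue colors the points by group
ax = sns.scatterplot(data=df, x="hours", y="score", hue="group")
ax.set_title("score vs. hours by group")
```

In a notebook the figure is shown under the cell; the axis labels come from the column names.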

Fixing broken Korean text in Jupyter by setting a Baedal Minjok (Baemin) font

2019. 12. 2. 23:34

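If Korean text is displayed as empty boxes in Jupyter, it means no Korean font has been configured. Let's set one up! Since I am looking at charts anyway, I applied a Baedal Minjok font to make them look nicer.

Download the font from the link below.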

http://font.woowahan.com/jua/

Run the code below, then draw a chart, and the font is applied.

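from matplotlib import font_manager, rc
# Path to the downloaded font file; change it to the location on your machine
font_name = font_manager.FontProperties(fname="/Users/direcision/Library/Fonts/BMDOHYEON_otf.otf").get_name()
# Use this font family as the default for all matplotlib text
rc('font', family=font_name)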
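
A slightly more defensive variant of the same idea is sketched below. The helper name use_font_file is hypothetical. It assumes matplotlib 3.2 or newer (for fontManager.addfont), and it also turns off the Unicode minus sign, which some Korean fonts lack a glyph for.

```python
import os
from matplotlib import font_manager, rc

def use_font_file(font_path):
    """Register the font file at font_path and make it matplotlib's default.

    Returns the font family name, or None if the file does not exist.
    """
    if not os.path.isfile(font_path):
        return None
    # Make the font known to matplotlib even if it is not installed system-wide
    font_manager.fontManager.addfont(font_path)
    font_name = font_manager.FontProperties(fname=font_path).get_name()
    rc('font', family=font_name)
    # Draw negative numbers with an ASCII hyphen instead of the Unicode minus
    rc('axes', unicode_minus=False)
    return font_name
```

Call it with the path of the font file you downloaded; a None result means the path is wrong.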

[Python] How to combine embedding vectors into one

2019. 9. 7. 00:27

To combine embedding vectors, initialize an accumulator with np.zeros(),
add each vector to it,
and finally divide by the number of vectors combined (that is, take their average).

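import numpy as np

def agg_embed(terms):
  # Accumulator for 128-dimensional embeddings
  embed = np.zeros(128)
  for term in terms:
      embed += np.array(term['embedding'])
  # Divide the sum by the number of vectors to get the average
  embed /= len(terms)
  return embed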
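
The same averaging can also be written without the explicit loop. This is a minimal sketch, assuming each term is a dict with an 'embedding' list as in the post. The function name mean_embedding, the dim parameter, and the zero-vector result for an empty list are choices made for this example.

```python
import numpy as np

def mean_embedding(terms, dim=128):
    """Average the 'embedding' vectors of terms; return zeros if terms is empty."""
    if not terms:
        return np.zeros(dim)
    # Stack the vectors into a (len(terms), dim) array and average over rows
    vectors = np.array([term["embedding"] for term in terms], dtype=float)
    return vectors.mean(axis=0)

terms = [{"embedding": [1.0, 2.0, 3.0]}, {"embedding": [3.0, 4.0, 5.0]}]
print(mean_embedding(terms, dim=3))  # [2. 3. 4.]
```

Handling the empty list explicitly avoids the division by zero that the loop version would hit.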

[Python] How to compute the cosine similarity of two vectors

2019. 9. 7. 00:25

Once you have sentence embeddings, you can measure how similar two of them are with cosine similarity.

To compute it in pandas with a row-wise function passed to apply (a UDF), do the following.
The keyword and context columns hold strings.

e.g. keyword: 안녕, context: 잘가요. 멀리 안가요


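import numpy as np
from scipy import spatial

# get_embed is assumed to be defined elsewhere and to return the embedding vector of a string
def sim(x, y):
  embed1 = get_embed(x)
  embed2 = get_embed(y)
  # scipy returns the cosine distance, so similarity = 1 - distance
  return 1 - spatial.distance.cosine(embed1, embed2)

def sim_udf(x):
  # x is one row of the DataFrame
  sim_value = sim(x['keyword'], x['context'])
  return sim_value

# Apply row by row and store the result in a new column
df['cosim'] = df.apply(sim_udf, axis=1)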
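
For reference, cosine similarity is the dot product of the two vectors divided by the product of their norms. The sketch below computes it with NumPy alone on small hand-made vectors. Returning 0.0 when either vector is all zeros is an assumption made for this example, not something the post specifies.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # assumption: a zero vector is treated as not similar to anything
    return float(np.dot(a, b) / denom)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0  (same direction)
print(cosine_similarity([1, 0], [0, 1]))   # 0.0  (orthogonal)
print(cosine_similarity([1, 0], [-1, 0]))  # -1.0 (opposite direction)
```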

[Python] Counting the values in a list with collections.Counter

2019. 9. 7. 00:23

In Python, Counter counts how many times each value occurs in a list and returns the result as shown below.

from collections import Counter
Counter(['apple','red','apple','red','red','pear'])
# output: Counter({'red': 3, 'apple': 2, 'pear': 1})
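
A few follow-up operations are often useful once the counts exist. The sketch below uses the same list; the variable name counts is just for illustration.

```python
from collections import Counter

counts = Counter(['apple', 'red', 'apple', 'red', 'red', 'pear'])

# The two most frequent values, most frequent first
print(counts.most_common(2))  # [('red', 3), ('apple', 2)]

# Look up a single value; missing keys count as 0 instead of raising KeyError
print(counts['red'])     # 3
print(counts['banana'])  # 0
```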

[Python] How to extract only Hangul with regular expressions (Regex), distinguishing vowels and consonants

2019. 9. 7. 00:22

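This post explains how to extract only Hangul from a string that mixes Hangul, Latin letters, and digits in Python.
When processing Korean text in Python, you often run into standalone consonants or vowels such as ㅋㅋㅋ or ㅎㅎㅎ, which usually carry no meaning. They might matter for emotion-related tasks, though.
Regular expressions make it easy to extract or remove Hangul that consists only of standalone consonants or vowels.

Storing the parts that match the regex as a list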

import re

text = "ㅋㅋㅋ 안녕하세요"
# Store the matching parts (standalone consonants) as a list
re.compile('[ㄱ-ㅎ]+').findall(text) # output: ['ㅋㅋㅋ']

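import re

text = "ㅋㅋㅋ 안녕하ㅏ세요"
# Store the matching parts (standalone consonants and vowels) as a list.
# No '|' is needed inside [...]: there it would match a literal pipe character.
re.compile('[ㄱ-ㅎㅏ-ㅣ]+').findall(text) # output: ['ㅋㅋㅋ', 'ㅏ']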

import re

text = "ㅋㅋㅋ 안녕하세요"
# Store the matching parts (complete Hangul syllables) as a list
re.compile('[가-힣]+').findall(text) # output: ['안녕하세요']
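
The three patterns can be combined into one small helper that reports each category separately. This is only a sketch; the function name split_hangul and the dictionary keys are made up for this example.

```python
import re

CONSONANTS = re.compile('[ㄱ-ㅎ]+')  # standalone consonants
VOWELS = re.compile('[ㅏ-ㅣ]+')      # standalone vowels
SYLLABLES = re.compile('[가-힣]+')   # complete Hangul syllables

def split_hangul(text):
    """Return the runs of consonants, vowels, and syllables found in text."""
    return {
        'consonants': CONSONANTS.findall(text),
        'vowels': VOWELS.findall(text),
        'syllables': SYLLABLES.findall(text),
    }

print(split_hangul("ㅋㅋㅋ 안녕하ㅏ세요 abc 123"))
# {'consonants': ['ㅋㅋㅋ'], 'vowels': ['ㅏ'], 'syllables': ['안녕하', '세요']}
```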

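Removing the parts that match the regex and keeping the rest

import re

text = "ㅋㅋㅋ 안녕하세요"
# Remove Hangul syllables and spaces; only the standalone consonants remain
re.compile('[ 가-힣]+').sub('', text) # output: 'ㅋㅋㅋ'

text = "하이 ㅋㅋㅋ 안녕하ㅏ세요"
# Remove spaces and standalone consonants/vowels; the syllables remain
re.compile('[ ㄱ-ㅎㅏ-ㅣ]+').sub('', text) # output: '하이안녕하세요'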

Things to watch out for

Note that the two approaches return different types: findall returns a list, while sub returns a str.
Use the example below to decide which one fits your situation.

import re

text = "ㅋㅋㅋ 안녕하ㅏ세요"
# findall returns a list; the stray 'ㅏ' splits the word into two matches
re.compile('[가-힣]+').findall(text) # output: ['안녕하', '세요']
text = "하이 ㅋㅋㅋ 안녕하ㅏ세요"
# sub returns a single string with the matching parts removed
re.compile('[ ㄱ-ㅎㅏ-ㅣ]+').sub('', text) # output: '하이안녕하세요'
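
If word boundaries should survive, one option is to remove only the standalone consonants and vowels and then collapse the leftover spaces. The sketch below contrasts that with joining the findall result; the variable names are illustrative only.

```python
import re

JAMO_ONLY = re.compile('[ㄱ-ㅎㅏ-ㅣ]+')  # standalone consonants and vowels
SYLLABLES = re.compile('[가-힣]+')      # complete Hangul syllables

text = "하이 ㅋㅋㅋ 안녕하ㅏ세요"

# findall gives a list, so it must be joined to get a string back
words = SYLLABLES.findall(text)
joined = ' '.join(words)

# sub gives a string directly; spaces are kept, then collapsed to one
cleaned = JAMO_ONLY.sub('', text)
normalized = re.sub(' +', ' ', cleaned).strip()

print(words)       # ['하이', '안녕하', '세요']
print(joined)      # 하이 안녕하 세요
print(normalized)  # 하이 안녕하세요
```

The join keeps the split caused by the stray 'ㅏ', while the sub version restores '안녕하세요' as one word.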

[Python] A collection of Korean text preprocessing helpers

2019. 9. 7. 00:21

A collection of helpers for preprocessing Korean text in Python.

import re
from collections import Counter

special_chars = ['\n', '?', '.', '+', '~', '-', '_', ',', '!', '@', '#', '$', '%', '^', '&', '*', '(', ')', '{', '}', '[', ']' ,'/', '=', '`', '|']

def string_cleanup(x, notwanted):
    # Replace every unwanted character with a space
    for item in notwanted:
        x = x.replace(item, ' ')
    return x

def multiple_spaces_to_one(sentence):
    return re.sub(' +', ' ', sentence)

def remove_duplicated_words(sentence):
    # Note: set() does not preserve the original word order
    return ' '.join(set(sentence.split(' ')))

def preprocessing(sentence):
    sentence = string_cleanup(sentence, special_chars)
    sentence = re.compile('[0-9ㄱ-ㅎㅏ-ㅣ]+').sub('', sentence) # remove digits and standalone jamo such as 'ㅋㅋㅋ', 'ㅏㅏ'
    sentence = sentence.strip()
    sentence = sentence.lower()
    sentence = multiple_spaces_to_one(sentence)
    # Drop duplicate words while keeping first-seen order
    sentence = ' '.join(Counter(sentence.split(' ')).keys())
    return sentence

def preprocessing_udf(x):
    # x is one row of the DataFrame
    text = preprocessing(x['context'])
    return text

# result_df is assumed to be a DataFrame with a 'context' column
result_df.head(2).apply(preprocessing_udf, axis=1)
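
The same pipeline can be tried on a plain string without pandas. This is a compact sketch, assuming the same special characters as in the post. The function name clean_text is hypothetical, and str.translate and dict.fromkeys are swapped in for the replace loop and Counter.

```python
import re

# Characters replaced by a space (same set as special_chars in the post)
SPECIAL_CHARS = "\n?.+~-_,!@#$%^&*(){}[]/=`|"
TO_SPACE = str.maketrans({ch: ' ' for ch in SPECIAL_CHARS})
DIGITS_AND_JAMO = re.compile('[0-9ㄱ-ㅎㅏ-ㅣ]+')

def clean_text(sentence):
    """Strip special characters, digits, and standalone jamo; deduplicate words."""
    sentence = sentence.translate(TO_SPACE)
    sentence = DIGITS_AND_JAMO.sub('', sentence)
    sentence = sentence.lower()
    # split() with no argument also collapses repeated whitespace
    words = sentence.split()
    # dict.fromkeys drops duplicates while keeping first-seen order
    return ' '.join(dict.fromkeys(words))

print(clean_text("Hello!! 안녕하세요 ㅋㅋㅋ 123 hello 안녕하세요?"))  # hello 안녕하세요
```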

더블리의 12층