Posts in the '우리는 개발자' (We Are Developers) category (Page 8)


Maven project builds: goal commands summarized.

2019. 9. 6. 00:18

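These days many Java projects use Gradle, but I am still on Maven.
I run the same Maven command lines every day, so I decided to work out exactly what they mean.
~~In my local IntelliJ environment the package command succeeds but deploy fails, and I started looking into this because I wanted to know why.~~

Official Apache documentation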

https://maven.apache.org/guides/introduction/introduction-to-the-lifecycle.html

For the person building a project, this means that it is only necessary to learn a small set of commands to build any Maven project, and the POM will ensure they get the results they desired.

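What is Maven?

Maven builds a project from the information described in pom.xml.
A goal is a single task, provided by a plugin, that Maven can execute (for example dependency:copy-dependencies). The names listed below are strictly lifecycle phases: running a phase runs every phase before it in the same lifecycle, together with the goals bound to them.

Which goals (phases) are there?

clean

Deletes the target directory, which holds the build output.

compile

compile the source code of the project
Compiles all source code into target/classes; resource files are copied to the same directory.

package

take the compiled code and package it in its distributable format, such as a JAR.
Runs after compile and packages the result according to the information in the POM.

Example POM configuration (excerpts from two separate plugin definitions).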

<executions>
               <execution>
                   <id>package</id>
                   <phase>package</phase>
                   <goals>
                       <goal>run</goal>
                   </goals>
                   <configuration>
                       <tasks>
                           <copy file="${project.build.directory}/${project.build.finalName}.jar.original"
                                 tofile="./deploy/my-project/${project.build.finalName}.jar"/>
                           <copy todir="./deploy/my-project/bin">
                               <fileset dir="bin"/>
                           </copy>
                           <copy todir="./deploy/my-project/conf">
                               <fileset dir="conf"/>
                           </copy>
                           <copy todir="./deploy/my-project/lib">
                               <fileset dir="${project.build.directory}/dependency/"/>
                           </copy>
                       </tasks>
                   </configuration>
               </execution>
               ....
             <executions>
               <execution>
                   <phase>package</phase>
                   <goals>
                       <goal>copy-dependencies</goal>
                   </goals>
                   <configuration>
                       <outputDirectory>${project.build.directory}/dependency/</outputDirectory>
                   </configuration>
               </execution>
           </executions>

install

install the package into the local repository, for use as a dependency in other projects locally
Runs after package and installs the package into the local repository, so other projects on the same machine can use it as a dependency.

validate

validate the project is correct and all necessary information is available
Checks that the project is correct and that all the information needed for the build is available.

test

test the compiled source code using a suitable unit testing framework. These tests should not require the code be packaged or deployed
Runs the unit tests.

deploy

done in the build environment, copies the final package to the remote repository for sharing with other developers and projects.
Run in the build environment; copies the final package to the remote repository. This is the deployment step for an actual release.
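
The ordering of these phases can be modeled in a few lines of Python. This is only a simplified sketch of the default lifecycle, not Maven's implementation: the list holds the abbreviated phase order from the official lifecycle guide (including verify, which this post does not cover), and the function name phases_run_by is made up for illustration. clean is left out because it belongs to a separate lifecycle.

```python
# Simplified model: abbreviated phase order of Maven's default lifecycle.
DEFAULT_LIFECYCLE = [
    "validate", "compile", "test", "package", "verify", "install", "deploy",
]


def phases_run_by(phase):
    """Return every phase that runs, in order, when `phase` is requested."""
    end = DEFAULT_LIFECYCLE.index(phase)
    return DEFAULT_LIFECYCLE[: end + 1]


print(phases_run_by("package"))
```

Under this model, asking for package also runs validate, compile and test first, which is why a single phase name on the command line is usually enough.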

The command I actually use when deploying (-U is the short form of --update-snapshots, so it is given only once).

mvn -U clean dependency:copy-dependencies package -Dmaven.test.skip=true

License: CC BY-NC-ND (Attribution, NonCommercial, NoDerivatives)


[Pandas] Filtering a DataFrame while reading the data (chunksize, iterator=True)

2019. 9. 6. 00:18

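When reading data with pandas you often need to filter on some condition. Rather than loading everything and filtering afterwards, read the file in chunks and keep only the rows you need. Filtering while reading requires chunksize.

import pandas as pd

# chunksize already makes read_csv return an iterator of DataFrames
iter_csv = pd.read_csv('file.csv', iterator=True, chunksize=1000)
# 'field' and constant are placeholders for your own column name and threshold
df = pd.concat([chunk[chunk['field'] > constant] for chunk in iter_csv])

def read_result():
    # Read a tab-separated query result, skipping rows whose field count differs from the header
    with open('/tmp/query_result.tsv', 'r') as f:
        lines = f.read().splitlines()
    cols = lines[0].split("\t")
    data = []
    for line in lines[1:]:
        vals = line.split("\t")
        if len(vals) != len(cols):
            continue
        data.append(vals)
    return pd.DataFrame(data, columns=cols)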
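
The chunked filtering idea can be tried without any file on disk. The sketch below is only an illustration: the CSV text, the column names and the threshold value are made up, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Made-up sample data standing in for a large CSV file.
csv_text = "field,name\n1,a\n5,b\n10,c\n7,d\n"
threshold = 4  # hypothetical cutoff

# With chunksize set, read_csv yields DataFrames of at most 2 rows each.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

# Filter each chunk as it arrives, then stitch the survivors together.
filtered = pd.concat([chunk[chunk["field"] > threshold] for chunk in reader])
print(filtered)
```

Only the rows that pass the filter are kept from each chunk, so peak memory depends on the chunk size and the filtered result rather than on the whole file.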


The meaning of data lake / data warehouse / data mart

2019. 9. 6. 00:11

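data lake / data warehouse / data mart

A data lake stores all data in its raw, unstructured form, down to the level of raw logs.
A data warehouse stores data that has been modeled and structured.
A data mart takes data with one clearly defined purpose out of the data warehouse; it can be a part of the data warehouse.

These are basic definitions that can come up in an interview for a data engineering position at the 4 to 6 years of experience level.
Beyond interviews, having basic terms like these clearly defined makes communication easier.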


[IPython/Jupyter Notebook] How to add or set Linux environment variables (PYTHONPATH, LD_LIBRARY_PATH)

2019. 9. 4. 23:49

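Running jupyter notebook --generate-config creates a config file under the default path ~/.jupyter.

If the file already exists, jupyter --config-dir shows where it lives. Check the path, then add the environment variables with the code below.

import os

c = get_config()  # provided by Jupyter when it loads the config file

# Kernels started by the notebook server inherit these variables
os.environ['LD_LIBRARY_PATH'] = '/home1/jslee/library/lib'
# Python does not expand '${PYTHONPATH}', so append to the existing value explicitly
extra_path = '/home1/jslee/library/binding/python'
current = os.environ.get('PYTHONPATH')
os.environ['PYTHONPATH'] = f'{current}:{extra_path}' if current else extra_path

# Assumption: only needed under JupyterHub, to pass both variables on to spawned servers
c.Spawner.env_keep.extend(['LD_LIBRARY_PATH', 'PYTHONPATH'])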
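
One detail that is easy to miss when building a PATH-style value in Python: a string literal such as '${PYTHONPATH}' is not expanded the way a shell would expand it. The short sketch below shows the difference using os.path.expandvars from the standard library; the starting value /opt/existing is a made-up example.

```python
import os

os.environ["PYTHONPATH"] = "/opt/existing"  # hypothetical starting value

literal = "${PYTHONPATH}:/home1/jslee/library/binding/python"

# Assigning the literal as-is would store the text '${PYTHONPATH}' unchanged.
# expandvars substitutes $NAME and ${NAME} from the current environment.
expanded = os.path.expandvars(literal)

print(literal)
print(expanded)
```

Variables that are not set are left untouched by expandvars, so it is worth checking the result before writing it back to os.environ.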

Related issue
- https://github.com/jupyter/notebook/issues/1290


[IPython/Jupyter Notebook] How to configure the way a pandas DataFrame is displayed: pd.options.display, pd.set_option

2019. 9. 4. 23:46

When working in IPython or Jupyter Notebook you will run into cases like these:
printing df.head(100) shows only about 10 rows;
printing df.head(1) shows some columns as ...;
printing df.head(1) renders the DataFrame too narrow.

In situations like these you need to configure how a DataFrame is displayed.
This can be changed through pd.options.display or pd.set_option, as below.

import pandas as pd

pd.options.display.max_columns = 30
pd.options.display.max_rows = 20

# 'display.height' is no longer a pandas option and raises an error on current versions, so it is dropped here
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
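
If the wider display is only needed for one cell, pd.option_context applies the settings temporarily and restores the previous values afterwards. A minimal sketch, with a made-up 40-column DataFrame and arbitrary option values:

```python
import pandas as pd

# Made-up wide DataFrame: 40 columns, 3 rows.
df = pd.DataFrame({f"c{i}": range(3) for i in range(40)})

before = pd.get_option("display.max_columns")

# The options apply only inside the with block.
with pd.option_context("display.max_columns", 50, "display.width", 1000):
    inside = pd.get_option("display.max_columns")
    print(df.head(1))

after = pd.get_option("display.max_columns")
```

This avoids changing the display settings for the rest of the notebook session.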


[Pandas] Splitting one DataFrame column into two with str.split

2019. 9. 4. 23:42

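Hive returns partition values in the form ymd=201807/hh24=03.
Because the whole value arrives in a single column, each row has to be parsed.
What I want is to split it into two columns, ymd=201807 and hh24=03,
and then process the result once more so that the ymd column holds 201807 and the hh24 column holds 03.

str.split(delimiter, expand=True) splits one column into two.

df[['First','Last']] = df.Name.str.split("_",expand=True)

def parse_partition(df):
    df = df.copy()  # avoid adding helper columns to the caller's DataFrame
    # 'ymd=201807/hh24=03' -> 'ymd=201807', 'hh24=03'
    df[['ymd', 'hh24']] = df['partition'].str.split("/", expand=True)
    # 'ymd=201807' -> key 'ymd', value '201807'
    df[['ymd', 'ymd_v']] = df['ymd'].str.split("=", expand=True)
    df[['hh24', 'hh24_v']] = df['hh24'].str.split("=", expand=True)
    # keep only the values and give them the final column names
    df = df[['ymd_v', 'hh24_v']]
    df.columns = ['ymd', 'hh24']
    return df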
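
As an alternative to chaining several splits, the same values can be pulled out in one step with str.extract and named regex groups. This is a sketch of a different technique, not the approach used above; the sample rows are made up and the pattern assumes every value has exactly the ymd=.../hh24=... shape.

```python
import pandas as pd

# Made-up sample of Hive partition strings.
df = pd.DataFrame({"partition": ["ymd=201807/hh24=03", "ymd=201808/hh24=11"]})

# Each named group becomes a column of the result.
parsed = df["partition"].str.extract(r"ymd=(?P<ymd>[^/]+)/hh24=(?P<hh24>[^/]+)")
print(parsed)
```

Rows that do not match the pattern come back as NaN in both columns, which makes malformed partitions easy to spot.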


[Pandas] How to insert a DataFrame into an Elasticsearch index, DataFrame2EsIndex

2019. 9. 4. 23:36

I needed to load the contents of a DataFrame into an Elasticsearch index.
Python, of course, has an elasticsearch package for this.

When defining es_client as below, pass the ES_HOST you want to write to as a parameter.

Example: Elasticsearch('localhost:9200')

use_these_keys is the list of DataFrame columns that should be sent to ES as fields.
helpers.bulk then indexes each document in the _index, _type, _id, _source form defined in doc_generator.

from elasticsearch import Elasticsearch
from elasticsearch import helpers

es_client = Elasticsearch(http_compress=True)

use_these_keys = ['id', 'value', 'value1']

def filterKeys(document):
    # keep only the columns that should be indexed
    return {key: document[key] for key in use_these_keys}

def doc_generator(df):
    # one bulk action per DataFrame row
    for index, document in df.iterrows():
        yield {
            "_index": 'your_index',
            "_type": "_doc",  # mapping types are deprecated since Elasticsearch 7; drop this key on newer clusters
            "_id": f"{document['id']}",
            "_source": filterKeys(document),
        }
    # no explicit StopIteration: raising it inside a generator is a RuntimeError since Python 3.7

helpers.bulk(es_client, doc_generator(your_dataframe))
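
Before sending anything to a cluster, it can help to inspect the actions the generator produces. The sketch below needs no Elasticsearch connection; the function name build_actions, the sample DataFrame and the index name are all made up for illustration, and the _type key is left out on the assumption of a recent cluster.

```python
import pandas as pd

use_these_keys = ["id", "value"]  # columns to index (illustrative)


def build_actions(df, index_name):
    """Yield one bulk action dict per row, keeping only the selected columns."""
    for record in df[use_these_keys].to_dict(orient="records"):
        yield {
            "_index": index_name,
            "_id": str(record["id"]),
            "_source": record,
        }


sample = pd.DataFrame({"id": [1, 2], "value": ["a", "b"], "unused": [0.1, 0.2]})
actions = list(build_actions(sample, "your_index"))
print(actions[0])
```

A generator of dicts in this shape is what helpers.bulk consumes, so checking a few of them first catches missing ids or unwanted fields early.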

Reference
- https://towardsdatascience.com/exporting-pandas-data-to-elasticsearch-724aa4dd8f62


[Pandas] mean() on a DataFrame returns inf? Finding inf values and replacing them

2019. 9. 4. 23:31

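While computing mean() over all columns, I kept getting inf.
I had definitely replaced NaN with fillna(0.0), but the problem persisted.
Narrowing it down crudely with head(100).tail(50).head(25) and so on, I found that the data contained inf values.
With the approach below, match them via np.inf, replace them with NaN, and then apply fillna(0.0).

import numpy as np

# Replace +/-inf with NaN, then fill with 0.0 as described above
df = df.replace([np.inf, -np.inf], np.nan).fillna(0.0)
# Alternative: instead of filling, drop rows where both col1 and col2 are NaN
# df = df.replace([np.inf, -np.inf], np.nan).dropna(subset=["col1", "col2"], how="all")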
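
Rather than bisecting with head() and tail(), the columns that contain inf can be located directly with np.isinf. The sketch below uses a small made-up DataFrame; the column names are placeholders.

```python
import numpy as np
import pandas as pd

# Made-up data with one +inf and one -inf.
df = pd.DataFrame({"col1": [1.0, np.inf, 3.0], "col2": [2.0, 4.0, -np.inf]})

# fillna alone leaves inf in place, so the mean is still infinite.
mean_before = df.fillna(0.0)["col1"].mean()

# Per-column flag: does the column contain any +/-inf?
has_inf = np.isinf(df).any()
inf_columns = [name for name, flag in has_inf.items() if flag]

# Turn inf into NaN first, then fill.
cleaned = df.replace([np.inf, -np.inf], np.nan).fillna(0.0)
mean_after = cleaned["col1"].mean()

print(inf_columns, mean_before, mean_after)
```

np.isinf only works on numeric columns, so select those first (for example with select_dtypes) when the DataFrame also holds strings.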


더블리의 12층