Optimizing Machine Learning Models with GridSearch and RandomSearch Cross Validation

There is a lot to consider when implementing machine learning in Python. In this post, we will look at how to optimize a machine learning model using GridSearch and RandomSearch Cross Validation. Like the other cross-validation tools, both are provided by scikit-learn in the sklearn.model_selection module.

Optimization

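When building a machine learning model, choosing the model type and the input data is not enough. You also need to set the hyperparameters, and the search for good values is called (hyperparameter) optimization. A hyperparameter is a parameter that is not learned from the training data but is set by the user before training. Because the model behaves differently depending on these values, a fine-tuning step called hyperparameter tuning is needed. When it is unclear how to change these values, methods such as GridSearch and RandomSearch Cross Validation come into consideration.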
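To make the distinction concrete, here is a minimal sketch that contrasts hyperparameters (passed to the constructor) with what the model learns in fit(). The specific values max_depth=3 and random_state=0 are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us and passed to the constructor before training.
# (max_depth=3 and random_state=0 are illustrative values, not recommendations.)
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=2, random_state=0)
hyperparams = clf.get_params()
print('max_depth:', hyperparams['max_depth'])
print('min_samples_split:', hyperparams['min_samples_split'])

# Learned structure: produced by fit() from the data, bounded by the hyperparameters.
clf.fit(X, y)
print('depth of fitted tree:', clf.get_depth())
print('number of leaves:', clf.get_n_leaves())
```

The fitted tree can never be deeper than max_depth, which is why changing a hyperparameter changes the character of the model.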

GridSearch Cross Validation

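GridSearch Cross Validation evaluates every possible combination of the specified hyperparameter values and selects the best one. Because every combination is tested, it always finds the combination with the best cross-validation score among the candidates in the grid (though not necessarily the best values overall, since values outside the grid are never tried). The major drawback is that it consumes a lot of time and computing power, so in practice optimization is often done by checking only a subset of the combinations.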
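Conceptually, a grid search is just a loop over every combination with a cross-validated score for each. The following is a rough hand-written sketch of that idea using itertools.product and cross_val_score. It is not how GridSearchCV is implemented internally, and random_state=0 is an assumption added so the run is repeatable.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

max_depths = range(1, 5)   # candidate values 1..4
min_splits = range(2, 5)   # candidate values 2..4

results = []
for max_depth, min_samples_split in product(max_depths, min_splits):
    clf = DecisionTreeClassifier(max_depth=max_depth,
                                 min_samples_split=min_samples_split,
                                 random_state=0)
    # 3-fold cross-validated accuracy for this single combination
    scores = cross_val_score(clf, X, y, cv=3)
    results.append(((max_depth, min_samples_split), scores.mean()))

# Keep the combination with the highest mean score
best_combo, best_score = max(results, key=lambda item: item[1])
print(f'{len(results)} combinations tried')
print('best combination (max_depth, min_samples_split):', best_combo)
print(f'best mean CV score: {best_score:.2%}')
```

The cost grows with the product of the candidate counts, which is where the time and computing power go.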

In scikit-learn, this is implemented by sklearn.model_selection.GridSearchCV. Using the code below, let's optimize a Decision Tree Classifier model.

# import package
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

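# input data: hold out 20% of the iris data as a test set
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=.2, random_state=12345)
# note: no random_state is set on the classifier, so results may vary slightly between runs
dt_clf = DecisionTreeClassifier()

# hyperparameter grid: max_depth in 1..4, min_samples_split in 2..4
parameters = {'max_depth': range(1, 5), 'min_samples_split': range(2, 5)}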

For the decision tree model used here, the Grid Search is set up so that max_depth ranges from 1 to 4 and min_samples_split from 2 to 4. The hyperparameters for Grid Search are given as a dictionary: the hyperparameter name is the key, and the candidate values are the value.
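To see what this dictionary expands to, scikit-learn's ParameterGrid can enumerate the combinations. This is a small sketch; n_folds = 3 is simply the number of folds used in this post.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'max_depth': range(1, 5), 'min_samples_split': range(2, 5)}

# Expand the dictionary into the full list of combinations
grid = list(ParameterGrid(parameters))
n_folds = 3

print('combinations:', len(grid))                    # 4 values * 3 values = 12
print('fits during the search:', len(grid) * n_folds)  # 12 combinations * 3 folds = 36
print('first combination:', grid[0])
```

With refit=True there is one additional fit at the end, on the whole training set.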

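Next, we run the Grid Search with 3-fold CV and display the results. The GridSearch Cross Validation results are stored in the cv_results_ dictionary, which can be inspected to find the best hyperparameter combination. To have the best-scoring combination fixed automatically, pass refit=True when creating GridSearchCV (this is also the default): the estimator is then retrained on the whole training set with the best combination and is available as best_estimator_.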

# GridSearch Cross Validation
grid_dt = GridSearchCV(dt_clf, param_grid=parameters, cv=3, refit=True, return_train_score=True)
grid_dt.fit(X_train, y_train)

# GridSearchCV result
scores_df = pd.DataFrame(grid_dt.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'max_depth': 1, 'min_samples_split': 2}	0.683333	10	0.7	0.675	0.675
1	{'max_depth': 1, 'min_samples_split': 3}	0.683333	10	0.7	0.675	0.675
2	{'max_depth': 1, 'min_samples_split': 4}	0.683333	10	0.7	0.675	0.675
3	{'max_depth': 2, 'min_samples_split': 2}	0.958333	7	1.0	0.950	0.925
4	{'max_depth': 2, 'min_samples_split': 3}	0.958333	7	1.0	0.950	0.925
5	{'max_depth': 2, 'min_samples_split': 4}	0.958333	7	1.0	0.950	0.925
6	{'max_depth': 3, 'min_samples_split': 2}	0.966667	1	1.0	0.950	0.950
7	{'max_depth': 3, 'min_samples_split': 3}	0.966667	1	1.0	0.950	0.950
8	{'max_depth': 3, 'min_samples_split': 4}	0.966667	1	1.0	0.950	0.950
9	{'max_depth': 4, 'min_samples_split': 2}	0.966667	1	1.0	0.950	0.950
10	{'max_depth': 4, 'min_samples_split': 3}	0.966667	1	1.0	0.950	0.950
11	{'max_depth': 4, 'min_samples_split': 4}	0.966667	1	1.0	0.950	0.950

# best params
print('best params:', grid_dt.best_params_)
print(f'best score {grid_dt.best_score_:.2%}')

best params: {'max_depth': 3, 'min_samples_split': 2}
best score 96.67%
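The score above is a cross-validation score on the training split. Because refit=True retrains the best combination on the whole training set, the search object can also be used directly to check the held-out test set with accuracy_score. The following is a self-contained sketch of that step. random_state=0 on the classifier is an assumption added for repeatability, and no particular accuracy value is claimed here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.2, random_state=12345)

parameters = {'max_depth': range(1, 5), 'min_samples_split': range(2, 5)}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=parameters, cv=3, refit=True)
grid_dt.fit(X_train, y_train)

# predict() delegates to best_estimator_, the model refit with the best combination
pred = grid_dt.predict(X_test)
test_accuracy = accuracy_score(y_test, pred)
print('best params:', grid_dt.best_params_)
print(f'test accuracy {test_accuracy:.2%}')
```

Keeping the test set out of the search means this number is not influenced by the tuning itself.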

RandomSearch Cross Validation

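Although GridSearch Cross Validation can, in principle, find the best hyperparameter combination within the grid, checking every combination costs a great deal of time and computing power. For this reason, the Random Search approach, which tries randomly chosen hyperparameter combinations, is sometimes used instead. Because it relies on randomness, it may miss the best combination, but in some cases it gives satisfactory results with far less time and computing power.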
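The sampling idea can be illustrated with scikit-learn's ParameterSampler, which draws a fixed number of combinations from the same kind of dictionary. In this sketch, n_iter=5 and the seed are arbitrary choices for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {'max_depth': range(1, 5), 'min_samples_split': range(2, 5)}

# Full grid, for comparison
all_combos = list(ParameterGrid(parameters))

# Draw 5 combinations at random. When every entry is a list-like of candidates,
# sampling is done without replacement, so no combination repeats.
sampled = list(ParameterSampler(parameters, n_iter=5, random_state=12345))

print(f'grid search would try {len(all_combos)} combinations')
print(f'random search tries {len(sampled)}:')
for combo in sampled:
    print(combo)
```

Only the sampled combinations are ever scored, which is both the saving and the risk of this approach.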

In scikit-learn, this is implemented by sklearn.model_selection.RandomizedSearchCV. Using the code below, let's repeat the optimization of the Decision Tree Classifier model that we did with GridSearchCV, this time with RandomizedSearchCV.

# import package
from sklearn.model_selection import RandomizedSearchCV

# try n_iter=10 randomly chosen combinations out of the 12 in the grid
random_dt = RandomizedSearchCV(estimator=dt_clf, param_distributions=parameters, n_iter=10,
                               cv=3, random_state=12345, n_jobs=-1)
random_dt.fit(X_train, y_train)

# RandomizedSearchCV result
scores_df = pd.DataFrame(random_dt.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'min_samples_split': 4, 'max_depth': 3}	0.966667	1	1.0	0.950	0.950
1	{'min_samples_split': 2, 'max_depth': 1}	0.683333	9	0.7	0.675	0.675
2	{'min_samples_split': 2, 'max_depth': 2}	0.958333	7	1.0	0.950	0.925
3	{'min_samples_split': 3, 'max_depth': 3}	0.966667	1	1.0	0.950	0.950
4	{'min_samples_split': 4, 'max_depth': 4}	0.966667	1	1.0	0.950	0.950
5	{'min_samples_split': 3, 'max_depth': 4}	0.966667	1	1.0	0.950	0.950
6	{'min_samples_split': 2, 'max_depth': 3}	0.966667	1	1.0	0.950	0.950
7	{'min_samples_split': 2, 'max_depth': 4}	0.966667	1	1.0	0.950	0.950
8	{'min_samples_split': 3, 'max_depth': 2}	0.958333	7	1.0	0.950	0.925
9	{'min_samples_split': 3, 'max_depth': 1}	0.683333	9	0.7	0.675	0.675

# best params
print('best params:', random_dt.best_params_)
print(f'best score {random_dt.best_score_:.2%}')

best params: {'min_samples_split': 4, 'max_depth': 3}
best score 96.67%
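With only 12 combinations, random search saves little. It tends to pay off when the search space is large. As a rough sketch, param_distributions also accepts distribution objects such as scipy.stats.randint, so values can be sampled from a wide range without listing each one. The ranges 1 to 10 and 2 to 20 below, and random_state=0 on the classifier, are assumptions chosen only for illustration.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.2, random_state=12345)

# randint(low, high) samples integers from low up to high - 1
param_distributions = {
    'max_depth': randint(1, 11),          # integers 1..10 (illustrative range)
    'min_samples_split': randint(2, 21),  # integers 2..20 (illustrative range)
}

random_dt = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                               param_distributions=param_distributions,
                               n_iter=10, cv=3, random_state=12345)
random_dt.fit(X_train, y_train)

print('best params:', random_dt.best_params_)
print(f'best score {random_dt.best_score_:.2%}')
```

Here the full space has 190 combinations, but only 10 are evaluated. Since values are drawn from distributions, the same combination can occasionally be sampled more than once.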
