Exploring GridSearch and RandomSearch Cross Validation for Optimizing Machine Learning Models

There is a lot to consider when implementing machine learning in Python. In this post, we'll look at how to optimize a machine learning model using GridSearch and RandomSearch Cross Validation. Both are also supported by scikit-learn, in the sklearn.model_selection module.

Optimization

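When building a machine learning model, choosing the model type and the input data is not enough: you also need to set the hyperparameters, and this step is what we call optimization here. A hyperparameter is a parameter that is not learned from the training data but is set by the user before training. Because changing these values changes the model's behavior, a fine-tuning step called hyperparameter tuning is needed. When you don't know how to adjust these values, techniques such as GridSearch and RandomSearch Cross Validation are worth considering.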
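
To make the distinction concrete, here is a minimal sketch that contrasts hyperparameters (set before training) with what the model learns during training. The specific values max_depth=3 and random_state=0 are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Hyperparameters are chosen by the user and passed to the constructor
# before any training happens.
clf = DecisionTreeClassifier(max_depth=3, min_samples_split=2, random_state=0)
hyperparams = clf.get_params()
print(hyperparams['max_depth'], hyperparams['min_samples_split'])

# The tree structure itself is learned from the data by fit().
clf.fit(iris.data, iris.target)
fitted_depth = clf.get_depth()
print(fitted_depth, clf.tree_.node_count)
```

The learned tree can never be deeper than max_depth, which is one way a hyperparameter shapes the resulting model.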

GridSearch Cross Validation

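GridSearch Cross Validation evaluates every combination of the candidate hyperparameter values and selects the best one. Because it tests every combination, it always finds the best-scoring combination within the grid you specify (values outside the grid are never tried). The major downside is that it consumes a lot of time and computing power, so in practice optimization is often done by checking only a subset of the combinations.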
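
To see where the cost comes from, the idea can be sketched by hand with a plain loop. This is not how GridSearchCV is implemented internally, just a rough equivalent using cross_val_score; the candidate values and random_state=0 are assumptions for illustration.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for each hyperparameter (illustrative)
max_depths = [1, 2, 3, 4]
min_samples_splits = [2, 3, 4]

results = []
for max_depth, min_samples_split in product(max_depths, min_samples_splits):
    clf = DecisionTreeClassifier(max_depth=max_depth,
                                 min_samples_split=min_samples_split,
                                 random_state=0)
    # 3-fold CV means three model fits per combination
    mean_score = cross_val_score(clf, X, y, cv=3).mean()
    results.append((mean_score, max_depth, min_samples_split))

best_score, best_depth, best_split = max(results, key=lambda r: r[0])
n_fits = len(results) * 3
print(f'{len(results)} combinations, {n_fits} model fits')
print(f'best: max_depth={best_depth}, min_samples_split={best_split}, score={best_score:.2%}')
```

Even this small grid needs 12 x 3 = 36 fits, and the count multiplies with every hyperparameter you add.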

In scikit-learn, this is available as sklearn.model_selection.GridSearchCV. Using the code below, let's optimize a Decision Tree Classifier model.

# import package
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# input data
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=.2, random_state=12345)
dt_clf = DecisionTreeClassifier()

# hyperparameter
parameters = {'max_depth': range(1,5), 'min_samples_split': range(2,5)}

For the decision tree model used here, I set up the Grid Search so that max_depth ranges from 1 to 4 and min_samples_split ranges from 2 to 4. The hyperparameters for Grid Search are specified as a dictionary: the hyperparameter name is the key, and the candidate values are the value.
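
If you want to check which combinations such a dictionary expands to, scikit-learn's ParameterGrid can enumerate them. The short sketch below uses lists instead of range objects, which describes the same candidate values.

```python
from sklearn.model_selection import ParameterGrid

# Same candidates as range(1, 5) and range(2, 5), written out as lists
parameters = {'max_depth': [1, 2, 3, 4], 'min_samples_split': [2, 3, 4]}

grid = list(ParameterGrid(parameters))
print(len(grid))  # 4 x 3 = 12 combinations
for params in grid[:3]:
    print(params)
```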

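Next, we run the Grid Search with 3-fold CV and display the results. The GridSearch Cross Validation results are stored in the cv_results_ dictionary, which you can inspect to find the best hyperparameter combination. If you want the highest-scoring combination to be applied automatically, pass refit=True when creating GridSearchCV (this is the default); the model is then retrained on the full training data with that combination.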

# GridSearch Cross Validation
grid_dt = GridSearchCV(dt_clf, param_grid=parameters, cv=3, refit=True, return_train_score=True)
grid_dt.fit(X_train, y_train)

# GridSearchCV result
scores_df = pd.DataFrame(grid_dt.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'max_depth': 1, 'min_samples_split': 2}	0.683333	10	0.7	0.675	0.675
1	{'max_depth': 1, 'min_samples_split': 3}	0.683333	10	0.7	0.675	0.675
2	{'max_depth': 1, 'min_samples_split': 4}	0.683333	10	0.7	0.675	0.675
3	{'max_depth': 2, 'min_samples_split': 2}	0.958333	7	1.0	0.950	0.925
4	{'max_depth': 2, 'min_samples_split': 3}	0.958333	7	1.0	0.950	0.925
5	{'max_depth': 2, 'min_samples_split': 4}	0.958333	7	1.0	0.950	0.925
6	{'max_depth': 3, 'min_samples_split': 2}	0.966667	1	1.0	0.950	0.950
7	{'max_depth': 3, 'min_samples_split': 3}	0.966667	1	1.0	0.950	0.950
8	{'max_depth': 3, 'min_samples_split': 4}	0.966667	1	1.0	0.950	0.950
9	{'max_depth': 4, 'min_samples_split': 2}	0.966667	1	1.0	0.950	0.950
10	{'max_depth': 4, 'min_samples_split': 3}	0.966667	1	1.0	0.950	0.950
11	{'max_depth': 4, 'min_samples_split': 4}	0.966667	1	1.0	0.950	0.950

# best params
print('best params:', grid_dt.best_params_)
print(f'best score {grid_dt.best_score_:.2%}')

best params: {'max_depth': 3, 'min_samples_split': 2}
best score 96.67%
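
Because refit=True retrains the model with the best combination, the fitted search object can be used for prediction right away. The sketch below repeats the setup so that it runs on its own and scores the held-out test set with accuracy_score; random_state=0 on the classifier is an assumption added here for reproducibility and is not part of the code above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.2, random_state=12345)

parameters = {'max_depth': range(1, 5), 'min_samples_split': range(2, 5)}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=0),
                       param_grid=parameters, cv=3, refit=True)
grid_dt.fit(X_train, y_train)

# best_estimator_ is the model refit on all of X_train with the best params
best_model = grid_dt.best_estimator_
pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, pred)
print(f'test accuracy {test_accuracy:.2%}')
```

Calling grid_dt.predict(X_test) directly gives the same predictions, since it delegates to best_estimator_.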

RandomSearch Cross Validation

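GridSearch Cross Validation can, in theory, find the best hyperparameter combination among the candidates, but because it examines every combination it is expensive in both time and computing power. For this reason, Random Search, which looks for the best combination by trying randomly chosen hyperparameter combinations, is sometimes used instead. Since it relies on randomness, it may miss the best combination, but in some cases it delivers satisfactory results with far less time and computing power.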
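
The sampling step can be looked at in isolation with scikit-learn's ParameterSampler. This is a small sketch; n_iter=5 and the seed are arbitrary values picked for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

parameters = {'max_depth': [1, 2, 3, 4], 'min_samples_split': [2, 3, 4]}

n_total = len(ParameterGrid(parameters))
# When every value is given as a list, combinations are drawn without replacement
sampled = list(ParameterSampler(parameters, n_iter=5, random_state=12345))

for params in sampled:
    print(params)
print(f'{len(sampled)} of {n_total} combinations tried')
```

Only the sampled combinations are evaluated, so the cost is controlled by n_iter rather than by the size of the grid.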

In scikit-learn, this is available as sklearn.model_selection.RandomizedSearchCV. Using the code below, let's repeat the Decision Tree Classifier optimization we did with GridSearchCV, this time with RandomizedSearchCV.

# import package
from sklearn.model_selection import RandomizedSearchCV

random_dt = RandomizedSearchCV(estimator=dt_clf, param_distributions=parameters, n_iter=10,
                               cv=3, random_state=12345, n_jobs=-1)
random_dt.fit(X_train,y_train)

# RandomizedSearchCV result
scores_df = pd.DataFrame(random_dt.cv_results_)
scores_df[['params', 'mean_test_score', 'rank_test_score',
           'split0_test_score', 'split1_test_score', 'split2_test_score']]

	params	mean_test_score	rank_test_score	split0_test_score	split1_test_score	split2_test_score
0	{'min_samples_split': 4, 'max_depth': 3}	0.966667	1	1.0	0.950	0.950
1	{'min_samples_split': 2, 'max_depth': 1}	0.683333	9	0.7	0.675	0.675
2	{'min_samples_split': 2, 'max_depth': 2}	0.958333	7	1.0	0.950	0.925
3	{'min_samples_split': 3, 'max_depth': 3}	0.966667	1	1.0	0.950	0.950
4	{'min_samples_split': 4, 'max_depth': 4}	0.966667	1	1.0	0.950	0.950
5	{'min_samples_split': 3, 'max_depth': 4}	0.966667	1	1.0	0.950	0.950
6	{'min_samples_split': 2, 'max_depth': 3}	0.966667	1	1.0	0.950	0.950
7	{'min_samples_split': 2, 'max_depth': 4}	0.966667	1	1.0	0.950	0.950
8	{'min_samples_split': 3, 'max_depth': 2}	0.958333	7	1.0	0.950	0.925
9	{'min_samples_split': 3, 'max_depth': 1}	0.683333	9	0.7	0.675	0.675

# best params
print('best params:', random_dt.best_params_)
print(f'best score {random_dt.best_score_:.2%}')

best params: {'min_samples_split': 4, 'max_depth': 3}
best score 96.67%
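
RandomizedSearchCV also accepts distributions instead of fixed lists, which is handy when the search space is too large to enumerate. The following is a sketch under assumptions: the wider ranges, scipy.stats.randint, and random_state=0 on the classifier are illustrative choices that do not appear in the example above.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=.2, random_state=12345)

# randint(low, high) draws integers from low up to high - 1
param_distributions = {
    'max_depth': randint(1, 11),          # 1 to 10 (assumed range)
    'min_samples_split': randint(2, 21),  # 2 to 20 (assumed range)
}

random_dt = RandomizedSearchCV(DecisionTreeClassifier(random_state=0),
                               param_distributions=param_distributions,
                               n_iter=10, cv=3, random_state=12345)
random_dt.fit(X_train, y_train)

print('best params:', random_dt.best_params_)
print(f'best score {random_dt.best_score_:.2%}')
```

Here the full space has 10 x 19 = 190 combinations, but only the 10 sampled ones are fitted.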
