Machine Learning End-to-End Practice in jupyter - TIL230717

sunhokimDev 2023. 7. 20. 00:56

๐Ÿ“š KDT WEEK 16 DAY 1 TIL

  • ๋จธ์‹ ๋Ÿฌ๋‹ E2E ์‹ค์Šต
    • 0. ์‹ค์Šต ํ™˜๊ฒฝ ์„ค์ •
    • 1. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ๋งŒ๋“ค๊ธฐ
    • 2. ๋ฐ์ดํ„ฐ ์ •์ œ
    • 3. ๋ชจ๋ธ ํ›ˆ๋ จ
    • 4. ๋ชจ๋ธ ์„ธ๋ถ€ ํŠœ๋‹
    • 5. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ํ‰๊ฐ€
    • 6. ์ƒ์šฉํ™”๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ ์ €์žฅ

๐ŸŸฅ ๋จธ์‹  ๋Ÿฌ๋‹ End to End ์‹ค์Šต

Machine Learning(๊ธฐ๊ณ„ํ•™์Šต) : ๊ฒฝํ—˜์„ ํ†ตํ•ด ์ž๋™์œผ๋กœ ๊ฐœ์„ ํ•˜๋Š” ์ปดํ“จํ„ฐ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์—ฐ๊ตฌ

 

์บ˜๋ฆฌํฌ๋‹ˆ์•„ ์ธ๊ตฌ์กฐ์‚ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด ์บ˜๋ฆฌํฌ๋‹ˆ์•„ ์ฃผํƒ ๊ฐ€๊ฒฉ ์˜ˆ์ธก๋ชจ๋ธ ๋งŒ๋“ค๊ธฐ

 

โš’๏ธ 0. ์‹ค์Šต ํ™˜๊ฒฝ ์„ค์ •

 

virtualenv๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ€์ƒํ™˜๊ฒฝ์•ˆ์—์„œ ์ž‘์—…ํ•˜๋„๋ก ํ•œ๋‹ค.

 

# ์›ํ•˜๋Š” ํด๋”์•ˆ์—์„œ ์ž‘์—…
pip install virtualenv # ๊ฐ€์ƒํ™˜๊ฒฝ
virtualenv my_env # ์›ํ•˜๋Š” ์ด๋ฆ„ ์„ค์ •

# ๊ฐ€์ƒํ™˜๊ฒฝ ์‹คํ–‰
source my_env/bin/activate # on Linux or macOS
.\my_env\Scripts\activate # on Windows

 

๋‚˜์˜ ๊ฒฝ์šฐ์—๋Š” ์œˆ๋„์šฐ Powershell์„ ํ†ตํ•ด ์ž‘์—…ํ–ˆ๋Š”๋ฐ, activate๋ฅผ ์‚ฌ์šฉํ•ด๋„ ๊ฐ€์ƒํ™˜๊ฒฝ์ด ์‹คํ–‰๋˜์ง€ ์•Š์•˜๋‹ค.

์ด์œ ๋Š” Powershell์ด ๊ธฐ๋ณธ์ ์œผ๋กœ ๊ฐ€์ƒํ™˜๊ฒฝ ์‹คํ–‰์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.

๋‹ค์Œ์˜ ๋ช…๋ น์–ด๋ฅผ Powershell์—์„œ ์‚ฌ์šฉํ•˜๋ฉด ์ด ์„ค์ •์„ ์ œํ•œ์—†์Œ์œผ๋กœ ๋ฐ”๊ฟ”์ค„ ์ˆ˜ ์žˆ๋‹ค.

Set-ExecutionPolicy -ExecutionPolicy Unrestricted -Scope CurrentUser

 

์ •์ƒ์ ์œผ๋กœ ๊ฐ€์ƒํ™˜๊ฒฝ์ด ์‹คํ–‰๋œ๋‹ค๋ฉด, (my_env)์ฒ˜๋Ÿผ ์„ค์ •ํ•œ ๊ฐ€์ƒํ™˜๊ฒฝ์ด๋ฆ„์œผ๋กœ ๊ด„ํ˜ธ์•ˆ์— ๋“ค์–ด๊ฐ€ ์ ‘์†๋œ๋‹ค.

 

Powershell ์„ค์ • ํ›„ ๊ฐ€์ƒํ™˜๊ฒฝ ์‹คํ–‰๋ชจ์Šต

 

๊ทธ๋ฆฌ๊ณ  ๊ฐ€์ƒํ™˜๊ฒฝ์—์„œ ๋‹ค์Œ์˜ ํŒจํ‚ค์ง€๋“ค์„ ์„ค์น˜ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์‹ค์Šต์€ jupyter์—์„œ ์ง„ํ–‰ํ•œ๋‹ค.

 

python -m pip install -U jupyter matplotlib numpy pandas scipy scikit-learn

# launch jupyter notebook once installation finishes
jupyter notebook

 

๋ฐ์ดํ„ฐ์…‹์€ ์บ˜๋ฆฌํฌ๋‹ˆ์•„ ์ธ๊ตฌ์กฐ์‚ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. (๊ต‰์žฅํžˆ ์˜ค๋ž˜๋œ ๋ฐ์ดํ„ฐ๋กœ, ํ•™์Šต์šฉ์ด๋‹ค)

 

import os
import pandas as pd

HOUSING_PATH = os.path.join("datasets", "housing")  # folder that holds housing.csv

# load the housing.csv file with pandas
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

 

datasets/housing/housing.csv

 

์œ„๋„์™€ ๊ฒฝ๋„๋ณ„ ๊ฐ€๊ตฌ์˜ ํ‰๊ท ๋‚˜์ด, ๋ฐฉ์˜ ๊ฐœ์ˆ˜๋ถ€ํ„ฐ ํ‰๊ท  ์ˆ˜์ž…, ํ‰๊ท  ์ฃผํƒ ๊ฐ€์น˜๋“ฑ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

ocean_proximity๋Š” ํ•ด์•ˆ๊ฐ€์™€ ๊ฐ€๊นŒ์šด์ง€ ๋‚˜ํƒ€๋‚ด๋Š” ์ปฌ๋Ÿผ์ด๋‹ค.

 

๐Ÿ—๏ธ 1. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ๋งŒ๋“ค๊ธฐ

 

์ข‹์€ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ›ˆ๋ จ์— ์‚ฌ์šฉ๋˜์ง€ ์•Š๊ณ  ๋ชจ๋ธ ํ‰๊ฐ€๋งŒ์„ ์œ„ํ•œ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์„ ๊ตฌ๋ถ„ํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค.

 

import numpy as np
np.random.seed(42)  # fix the random seed for reproducible shuffling

# take the data and a test ratio, and produce a test dataset
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))  # shuffle the row indices randomly
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
    
# split off 20 percent of the housing data as the test dataset
# housing is the DataFrame loaded from the csv file with pandas.read_csv
train_set, test_set = split_train_test(housing, 0.2)

 

or

 

sklearn์—์„œ ์ œ๊ณตํ•˜๋Š” ๋ถ„ํ• ํ•จ์ˆ˜ ์‚ฌ์šฉ

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

 

์—ฌ๊ธฐ์„œ ๊ณ„์ธต์  ์ƒ˜ํ”Œ๋ง์„ ํ•ด์ฃผ๋Š” ๊ฒƒ์ด ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๊ฐ€ ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ๋Œ€ํ‘œํ•˜๊ธฐ์— ์ข‹๋‹ค.

์ด ๊ฒฝ์šฐ์—๋„ sklearn์—์„œ ์ œ๊ณตํ•˜๋Š” StratifiedShuffleSplit์„ ์‚ฌ์šฉํ•œ๋‹ค.

 

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
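Note that the income_cat column used by split.split() is not created in the snippets above; in the book it is derived by binning median_income into five categories with pd.cut. A sketch of that step, with a toy median_income column standing in for the real data:

```python
import numpy as np
import pandas as pd

# Toy stand-in; only median_income matters for this step
housing = pd.DataFrame({"median_income": [0.5, 2.0, 3.5, 5.0, 7.0, 20.0]})

# bin median_income into 5 income categories (bin edges follow the book)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
print(housing["income_cat"].tolist())  # → [1, 2, 3, 4, 5, 5]
```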

 

 

์ด์ œ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ์…‹์€ strat_train_set์„ ์‚ฌ์šฉํ•˜๊ณ , ํ…Œ์ŠคํŠธ์…‹์€ strat_test_set์„ ์‚ฌ์šฉํ•œ๋‹ค.

 

๐Ÿช„ 2. ๋ฐ์ดํ„ฐ ์ •์ œ

 

1) ๋ˆ„๋ฝ๋œ ํŠน์„ฑ(missing features) ๋‹ค๋ฃจ๊ธฐ

 

  1. ํ•ด๋‹น ๊ตฌ์—ญ์„ ์ œ๊ฑฐ(ํ–‰์„ ์ œ๊ฑฐ)
  2. ํ•ด๋‹น ํŠน์„ฑ์„ ์ œ๊ฑฐ(์—ด์„ ์ œ๊ฑฐ)
  3. ์–ด๋–ค ๊ฐ’์œผ๋กœ ์ฑ„์šฐ๊ธฐ(0, ํ‰๊ท , ์ค‘๊ฐ„๊ฐ’ ๋“ฑ)

 

ํ›ˆ๋ จ์—๋Š” ๊ฐ’์ด ์ฑ„์›Œ์ง€๋Š” ๊ฒƒ์ด ์ข‹์œผ๋‚˜ ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ์— ๋”ฐ๋ผ ์ ์ ˆํ•˜๊ฒŒ ์ฑ„์›Œ์•ผ๋งŒ ํ•œ๋‹ค.

์™œ ๊ทธ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ˆ„๋ฝ๋ฌ๋Š”์ง€๋ฅผ ํŒŒ์•…ํ•ด์•ผ ์ ์ ˆํ•œ ๊ฐ’์„ ์ฑ„์šธ ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค.

 

๋จผ์ €, ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์˜ ๋ผ๋ฒจ์„ ๋”ฐ๋กœ ๋ณต์‚ฌํ•ด๋‘”๋‹ค.

 

housing = strat_train_set.drop("median_house_value", axis=1) # drop labels for training set
housing_labels = strat_train_set["median_house_value"].copy()

 

๊ฐ’์˜ ๋ˆ„๋ฝ ํ™•์ธ

 

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head() # True if there is a null feature
sample_incomplete_rows

 

NaN์œผ๋กœ ๋‚˜ํƒ€๋‚œ ๊ฐ’์ด ๋ˆ„๋ฝ๋œ ๊ฐ’์ด๋‹ค.

 

total_bedrooms ํ–‰์˜ ๊ฐ’์ด ๋ˆ„๋ฝ๋œ ๋ฐ์ดํ„ฐ๊ฐ€ ๋ณด์ธ๋‹ค.

 

์ด ๋ฐ์ดํ„ฐ์˜ ์ •์ œ๋ฅผ ์œ„ํ•ด ์•„๊นŒ์˜ ์„ค๋ช…ํ•œ ์„ธ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•ด๋ณด์ž.

 

sample_incomplete_rows.dropna(subset=["total_bedrooms"])    # drop rows where total_bedrooms is missing
sample_incomplete_rows.drop("total_bedrooms", axis=1)       # drop the total_bedrooms column entirely

# compute the median over the non-missing total_bedrooms values and fill it in
# assigning the result of .fillna avoids pandas' chained-assignment pitfall with inplace=True
median = housing["total_bedrooms"].median()
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)

 

sklearn์ด ์ œ๊ณตํ•˜๋Š” SimpleImputer ์‚ฌ์šฉํ•˜๊ธฐ

 

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")

# the median can only be computed on numeric features, so make a copy without the text feature
housing_num = housing.drop("ocean_proximity", axis=1)

imputer.fit(housing_num)  # .fit stores the chosen statistic (the median) of the given data inside the imputer
X = imputer.transform(housing_num)  # fill in the missing values; X is a NumPy array
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing.index)  # back to a pandas DataFrame

 

๊ฒฐ๊ณผ

๋ฐ์ดํ„ฐ๋ฅผ ์ฑ„์šด housing_tr์—๋Š” NaN๊ฐ’์ด ์—†๋Š” ๊ฒƒ์„ ํ™•์ธํ–ˆ๋‹ค.

 

์ด๋Ÿฐ ์ˆซ์žํ˜• ๋ฐ์ดํ„ฐ๋Š” ์ด๋ ‡๊ฒŒ ๋‹ค๋ฃจ๋ฉด ๋˜์ง€๋งŒ, ํ…์ŠคํŠธ์™€ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ๋Š” ์–ด๋–ป๊ฒŒ ๋‹ค๋ฃจ๊ณ  ํ•™์Šต์‹œ์ผœ์•ผํ• ๊นŒ?

 

2) ํ…์ŠคํŠธ์™€ ๋ฒ”์ฃผํ˜• ๋ฐ์ดํ„ฐ ๋‹ค๋ฃจ๊ธฐ

 

๋ชจ๋“ˆ์„ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ„๋‹จํ•˜๊ฒŒ ํ…์ŠคํŠธ๋ฅผ ์ˆซ์žํ˜•ํƒœ๋กœ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ๋‹ค.

 

from sklearn.preprocessing import OrdinalEncoder

housing_cat = housing[["ocean_proximity"]]  # the categorical column, kept as a DataFrame

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)  # fit_transform maps each category to an integer and converts the data
housing_cat_encoded[:10]

 

๊ฒฐ๊ณผ

 

ํ•˜์ง€๋งŒ ๋‹จ์ˆœํžˆ ๋ฐฐ์—ด๋กœ ๋‚˜ํƒ€๋‚ด์„œ ํ•ด๋‹นํ•˜๋Š” ์ˆœ์„œ๋ฅผ ๋„ฃ์—ˆ์„ ๋ฟ์ด๋‹ค.

๋ชจ๋ธ์ด ๋” ๋‚˜์€ ํ•™์Šต์„ ํ•˜๋ ค๋ฉด ๋ฐ”๋‹ค์— ๊ฐ€๊นŒ์šด ์œ„์น˜๋ถ€ํ„ฐ ๋‚˜์—ดํ•ด์„œ ์ˆซ์ž๊ฐ€ ๋‚ฎ์„์ˆ˜๋ก ๋ฐ”๋‹ค์™€ ๊ฐ€๊น๋‹ค๋Š” ์˜๋ฏธ๋ฅผ ๋„ฃ์œผ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.

 

๊ฐ ์นดํ…Œ๊ณ ๋ฆฌ๋ณ„๋กœ ์ปฌ๋Ÿผ์œผ๋กœ ๋‚˜๋ˆ ๋ฒ„๋ฆฌ๋Š” ๋ฐฉ๋ฒ•๋„ ์žˆ๋‹ค. (One-hot encoding)

 

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot.toarray()

 

ํ•œ ์—ด์˜ ๊ฐ’๋งŒ 1๋กœ ๋‚˜ํƒ€๋‚œ๋‹ค.

 

๋ชจ๋ธ์—๊ฒŒ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ๊ฐœ ์ฃผ๋Š” ๊ฒƒ์ด ํ•™์Šต์— ์ข‹๋‹ค๊ณ  ํ•˜๋‹ˆ, ์ด๋Ÿฌํ•œ ๋ฐฉ๋ฒ•๋„ ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค.

 

3) ๋‚˜๋งŒ์˜ ๋ณ€ํ™˜๊ธฐ(Custom Transformer)๋กœ ๋งŒ๋“ค๊ธฐ

 

sklearn์ด ์ œ๊ณตํ•˜๋Š” BaseEstimator, TransformerMixin์„ ์‚ฌ์šฉํ•ด์„œ ์ปค์Šคํ…€์œผ๋กœ ์ œ์ž‘ํ•ด๋ณด์ž.

๋ณ€ํ™˜๊ธฐ์—๋Š” fit()๊ณผ transform()์„ ๊ผญ ๊ตฌํ˜„ํ•ด์•ผํ•œ๋‹ค.

 

The transformer below adds two new features, rooms_per_household and population_per_household, to the dataset, and when add_bedrooms_per_room=True is given it also adds a bedrooms_per_room feature.

from sklearn.base import BaseEstimator, TransformerMixin

# column index
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]  # divide the two columns element-wise
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

 

์ž…๋ ฅ์ด Numpy ํ˜•ํƒœ(housing.values)๋กœ ๋“ค์–ด๊ฐ€์„œ Numpyํ˜•ํƒœ๋กœ ๋ฆฌํ„ด๋˜์—ˆ๊ธฐ ๋•Œ๋ฌธ์— DataFrame์œผ๋กœ ๋ณ€ํ™˜ํ•ด์•ผ ํ•œ๋‹ค.

 

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

 

๋‘ ๊ฐœ์˜ ์—ด์ด ๋” ์ถ”๊ฐ€๋˜์—ˆ๋‹ค.

 

์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ณ€ํ™˜์ด ์ˆœ์ฐจ์ ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์•ผ ํ•  ๊ฒฝ์šฐ Pipeline class๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ํŽธํ•˜๋‹ค.

 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# StandardScaler๋Š” ๊ฐ’์˜ ํ‘œ์ค€ํ™”(ํ‰๊ท ์ด 0, ๋ถ„์‚ฐ์ด 1)
num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()), 
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

 

Pipeline ํŠน์„ฑ์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • ์ด๋ฆ„, ์ถ”์ •๊ธฐ ์Œ์˜ ๋ชฉ๋กํ˜•ํƒœ๋กœ Pipeline์„ ๊ตฌํ˜„ํ•œ๋‹ค.
  • ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„๋ฅผ ์ œ์™ธํ•˜๊ณ  ๋ชจ๋‘ ๋ณ€ํ™˜๊ธฐ์—ฌ์•ผ ํ•œ๋‹ค.(fit_transform() method๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์–ด์•ผ ํ•จ)
  • ํŒŒ์ดํ”„๋ผ์ธ์˜ fit() method๋ฅผ ํ˜ธ์ถœํ•˜๋ฉด ๋ชจ๋“  ๋ณ€ํ™˜๊ธฐ์˜ fit_transform() method๋ฅผ ์ˆœ์„œ๋Œ€๋กœ ํ˜ธ์ถœํ•˜๋ฉด์„œ ํ•œ ๋‹จ๊ณ„์˜ ์ถœ๋ ฅ์„ ๋‹ค์Œ ๋‹จ๊ณ„์˜ ์ž…๋ ฅ์œผ๋กœ ์ „๋‹ฌํ•˜๊ณ , ๋งˆ์ง€๋ง‰ ๋‹จ๊ณ„์—์„œ๋Š” fit() method๋งŒ ํ˜ธ์ถœํ•œ๋‹ค.

 

 

Pipeline์„ ๊ฑฐ์นœ ๋ฐ์ดํ„ฐ์˜ ๋ชจ์Šต

 

๋ชจ๋“  ์—ด์ด ์•„๋‹Œ ๊ฐ ์—ด๋งˆ๋‹ค ๋‹ค๋ฅธ ํŒŒ์ดํ”„๋ผ์ธ์„ ์ ์šฉํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

sklearn์˜ ColumnTransformer๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ๋œ๋‹ค.

 

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num) # the names of all numeric features in housing_num
cat_attribs = ["ocean_proximity"] # just the ocean_proximity feature, as a list

# passing a list of column names as the third element lets each group of columns get its own transformer
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

 

๐Ÿƒ‍โ™‚๏ธ 3. ๋ชจ๋ธ ํ›ˆ๋ จ

 

1) ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ

 

 

sklearn์˜ LinearRegression์„ ์‚ฌ์šฉํ•˜๋ฉด ์‰ฝ๊ฒŒ ๊ตฌํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

 

ํ•™์Šต๋œ ๋ชจ๋ธ์˜ w๊ฐ’, ์ฆ‰ ์˜ˆ์ธก์— ํ•„์š”ํ•œ ์ˆ˜์น˜๋ฅผ ์ง์ ‘ ํ™•์ธํ•ด ๋ณผ ์ˆ˜๋„ ์žˆ๋‹ค.

 

 

ํ•™์Šต๋œ ๋ชจ๋ธ์„ ํ† ๋Œ€๋กœ ๋ช‡ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์˜ˆ์ธก์‹œ์ผœ ์–ผ๋งˆ๋‚˜ ์ž˜ ์˜ˆ์ธก๋˜๋Š”์ง€ ํ™•์ธํ•ด๋ณด์ž.

 

# ๋ช‡ ๊ฐœ์˜ ์ƒ˜ํ”Œ์— ๋Œ€ํ•ด ๋ฐ์ดํ„ฐ๋ณ€ํ™˜ ๋ฐ ์˜ˆ์ธก์„ ํ•ด๋ณด์ž
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)


# ์„ ํ˜• ํšŒ๊ท€ ๋ชจ๋ธ์˜ Predict ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด์„œ ์˜ˆ์ธก
print("Predictions:", lin_reg.predict(some_data_prepared).round(decimals=1))
print("Labels:", list(some_labels)) # ์ •๋‹ต

 

์˜ˆ์ธก๊ฐ’๊ณผ ์ •๋‹ต(๋ผ๋ฒจ)๊ฐ’์ด ์ฐจ์ด๊ฐ€ ์žˆ๋‹ค.

 

์ „์ฒด ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์˜ˆ์ธก๊ฐ’์ด ์–ผ๋งˆ๋‚˜ ๋ฒ—์–ด๋‚ฌ๋Š”์ง€ ํ™•์ธํ•  ์—๋Ÿฌ๊ฐ’(RMSE)๋„ ์ธก์ •ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

# ํ‰๊ท  ์—๋Ÿฌ๊ฐ’ ๊ณ„์‚ฐ (ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•ด)
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)

 

๊ฝค ํฐ ์ˆซ์ž๊ฐ€ ๋‚˜์™”๋‹ค.

 

RMSE๊ฐ’์ด ํฌ๊ฒŒ ๋‚˜์˜ค๋Š” ๊ฒฝ์šฐ, ์ด๋ฅผ ๊ณผ์†Œ์ ํ•ฉ์ด๋ผ๊ณ  ํ•œ๋‹ค.

ํŠน์„ฑ๋“ค์ด ์ถฉ๋ถ„ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•˜์ง€ ๋ชปํ–ˆ๊ฑฐ๋‚˜, ๋ชจ๋ธ์ด ๊ฐ•๋ ฅํ•˜์ง€ ๋ชปํ•˜๋ฉด ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜จ๋‹ค.

 

2) DecisionTreeRegressor

์ด๋ฒˆ์—๋Š” ๋น„์„ ํ˜•๋ชจ๋ธ์ธ DecisionTreeRegressor ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•ด์„œ ํ•™์Šต์‹œ์ผœ๋ณด์ž.

์ด ๋ชจ๋ธ๋„ sklearn์„ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ„๋‹จํ•œ ์ฝ”๋“œ๋กœ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜๊ฐ€ ์žˆ๋‹ค.

 

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

 

์ด๋ฒˆ์—๋„ ํ•™์Šต๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์—๋Ÿฌ๊ฐ’์„ ์ธก์ •ํ•ด๋ณด์ž.

 

# ์˜ˆ์ธก
housing_predictions = tree_reg.predict(housing_prepared)

# ์—๋Ÿฌ, ํ•™์Šต๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋Š” ์—๋Ÿฌ x
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)

 

0์ด ๋‚˜์™”๋‹ค.

 

0์ด ๋‚˜์™”์ง€๋งŒ ์ด๋Š” ํ•™์Šต๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์—๋Ÿฌ๊ฐ’์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ทธ๋ƒฅ ๋ชจ๋ธ์ด ํ•™์Šต๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด์„œ๋งŒ ์™„๋ฒฝํžˆ ์ ์‘ํ•œ ๊ฒƒ์ด๋‹ค.

์ข€ ๋” ์ •ํ™•ํ•œ ํ…Œ์ŠคํŠธ๋ฅผ ์œ„ํ•ด์„œ๋Š” ํ•™์Šต์— ์‚ฌ์šฉํ•˜์ง€ ์•Š์€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ๊ฒ€์ฆ์ด๋‚˜, ๊ต์ฐจ ๊ฒ€์ฆ์„ ์‚ฌ์šฉํ•˜๋ฉด ๋œ๋‹ค.

 

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ๊ฒ€์ฆ์€ ์œ„์™€ ๋™์ผํ•˜๊ฒŒ ์ˆ˜ํ–‰ํ•˜๋ฉด ๋˜๋ฏ€๋กœ ํŒจ์Šคํ•˜๊ณ , ๊ต์ฐจ ๊ฒ€์ฆ์„ ํ•ด๋ณด์ž.

 

๊ต์ฐจ ๊ฒ€์ฆ์€ ํ•™์Šต๋ฐ์ดํ„ฐ์…‹์˜ ์ผ๋ถ€๋ฅผ ํ•™์Šต์—์„œ ์ œ์™ธ์‹œํ‚จ ๋’ค ์—๋Ÿฌ ๊ฒ€์ฆ์— ์‚ฌ์šฉํ•˜๋Š” ์ž‘์—…์„ ๋ฐ˜๋ณตํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ๋Š” ๋งค Iteration๋งˆ๋‹ค ๊ฐ๊ธฐ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ 5๋ถ„์˜ 1์„ ๊ฒ€์ฆ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ์ด๋“ค์˜ ํ‰๊ท  ์—๋Ÿฌ๊ฐ’์„ ๊ณ„์‚ฐํ•œ๋‹ค.

 

๊ต์ฐจ ๊ฒ€์ฆ(Cross Validation)

 

sklearn์—์„œ ์ œ๊ณตํ•˜๋Š” cross_val_score์„ ์‚ฌ์šฉํ•˜๋ฉด ๊ฐ„๋‹จํ•˜๊ฒŒ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๋‹ค.

cv๊ฐ’์„ ์กฐ์ •ํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์กฐ๊ฐ ๊ฐœ์ˆ˜(๋ฐ˜๋ณต ํšŸ์ˆ˜)๋ฅผ ์ง์ ‘ ์ •ํ•˜๋ฉด ๋œ๋‹ค.

 

# ๊ต์ฐจ๊ฒ€์ฆ
# sklearn์˜ cross_val_score ์‚ฌ์šฉ, cv๋กœ ๋ฐ์ดํ„ฐ ์กฐ๊ฐ ๊ฐœ์ˆ˜ ์ •ํ•˜๊ธฐ

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
    
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels, scoring="neg_mean_squared_error", cv=10)

tree_rmse_scores = np.sqrt(-scores)
display_scores(tree_rmse_scores)

 

Mean๊ฐ’์ด ๋†’๊ฒŒ ๋‚˜์™”๋‹ค.

 

๊ต์ฐจ ๊ฒ€์ฆ์„ ํ–ˆ๋”๋‹ˆ ์—๋Ÿฌ๊ฐ’์ด ๊ฝค ํฐ ๊ฐ’์œผ๋กœ ๋‚˜์™”๋‹ค.

 

3) The RandomForestRegressor model

This model uses several tree-structured predictors and averages their predictions.

 

 

RandomForestRegressor

 

์ผ๋ฐ˜์ ์œผ๋กœ ํ•˜๋‚˜์˜ ํŠธ๋ฆฌ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ๋ณด๋‹ค ์—ฌ๋Ÿฌ ๊ฐœ๋ฅผ ์‚ฌ์šฉํ•œ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋†’๋‹ค๊ณ  ํ•œ๋‹ค.

์ด ๋ชจ๋ธ ๋˜ํ•œ sklearn์„ ์‚ฌ์šฉํ•˜๋ฉด ๋‹จ์ˆœํ•˜๊ฒŒ ๊ตฌํ˜„์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

 

# Tree๊ฐ€ ์—ฌ๋Ÿฌ ๊ฐœ์ธ ํ˜•ํƒœ
from sklearn.ensemble import RandomForestRegressor

# 100๊ฐœ์˜ ํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉ
forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

 

๊ทธ๋ฆฌ๊ณ  ์„ฑ๋Šฅ ๊ฒ€์ฆ์„ ์œ„ํ•ด ๊ต์ฐจ ๊ฒ€์ฆ์„ ์ˆ˜ํ–‰ํ•ด๋ณด์•˜๋‹ค.

 

Cross-validation scores for the RandomForestRegressor model

 

์œ„์—์„œ DecisionTreeRegressor ๋ชจ๋ธ์„ ๊ต์ฐจ ๊ฒ€์ฆํ–ˆ๋˜ ์ˆ˜์น˜(71629)๋ณด๋‹ค ๋” ์ ๊ฒŒ(50435) ๋‚˜์™”๋‹ค.

๊ทธ๋ ‡๋‹ค๋Š”๊ฑด, ๋” ์ข‹์€ ๋ชจ๋ธ์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด๋‹ค.

 

๐Ÿช› 4. ๋ชจ๋ธ ์„ธ๋ถ€ ํŠœ๋‹ 

 

1) GridSearchCV

Rather than the model adjusting its own values, a person can tune values directly to raise the model's performance.

Values that a person sets directly like this are called hyperparameters.

 

ํ•˜์ง€๋งŒ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์ˆ˜์น˜๋ฅผ ๋งค๋ฒˆ ์กฐ์ ˆํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ค๊ธฐ์—๋Š” ๊ต‰์žฅํ•œ ๋…ธ๋ ฅ์ด ํ•„์š”ํ•  ๊ฒƒ์ด๋‹ค.

์ด๋Ÿฌํ•œ ์ˆ˜๊ณ ๋ฅผ ๋œ์–ด์ฃผ๋Š” ๊ธฐ๋Šฅ๋„ sklearn์˜ GridSearchCV์„ ์‚ฌ์šฉํ•˜๋ฉด ์‰ฝ๋‹ค.

 

# ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹
# sklearn์˜ GridSearchCV์„ ์‚ฌ์šฉํ•˜๋ฉด ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์กฐํ•ฉ์„ ๋ชจ๋ธ์— ์ „๋‹ฌํ•ด์ฃผ์–ด ์•Œ์•„์„œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ํ•ด์คŒ
from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

# ์•„๊นŒ ์‚ฌ์šฉํ•œ RandomForestRegressor ๋ชจ๋ธ๊ณผ ๋น„๊ต๋ฅผ ์œ„ํ•ด ๋™์ผํ•œ ์ˆ˜์น˜
forest_reg = RandomForestRegressor(random_state=42)

# ์„ค์ •ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ์— ๋Œ€ํ•œ ์—ฌ๋Ÿฌ ์‹œ๋„์˜ ํ•™์Šต๊ณผ, cv=5 ๋งŒํผ์˜ ๊ต์ฐจ๊ฒ€์ฆ๊นŒ์ง€ ์ˆ˜ํ–‰
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)

 

params_grid ๋ฆฌ์ŠคํŠธ์•ˆ์˜ dictionary ๋ณ„๋กœ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์กฐํ•ฉํ•˜์—ฌ ๋ชจ๋ธ์„ ๋งŒ๋“ค์–ด๋ณธ๋‹ค.๋”ฐ๋ผ์„œ ์ฒซ ๋ฒˆ์งธ dictionary์—์„œ 3x4 = 12(n_estimators ์ˆ˜ x max_features ์ˆ˜)๊ฐœ์˜ ์กฐํ•ฉ์„ ๋งŒ๋“ค์–ด๋‚ด๊ณ ,๋‘ ๋ฒˆ์งธ dictionary์—์„œ bootstrap = False์ธ ์ƒํƒœ์—์„œ 2x3 = 6๊ฐœ์˜ ์กฐํ•ฉ์„ ๋งŒ๋“ค์–ด๋‚ด์–ด,์ด 12+6 = 18๊ฐœ์˜ ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ฒŒ๋œ๋‹ค.

 

๋˜ํ•œ ๊ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๋ณ„ ๋ชจ๋ธ์˜ ๊ต์ฐจ ๊ฒ€์ฆ๊นŒ์ง€ ์ˆ˜ํ–‰ํ•˜์—ฌ ์–ด๋–ค ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๊ฐ€์ง„ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ์ข‹์€์ง€ ํŒ๋‹จํ•˜๊ฒŒ ๋œ๋‹ค.์ด๋Š” GridSearchCV ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜๋ฉด์„œ ์ธ์ž๋กœ cv ๊ฐ’์„ ์ž…๋ ฅํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์กฐ๊ฐ์˜ ๊ฐœ์ˆ˜๋ฅผ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ๋‹ค.

 

ํ•™์Šต์ด ์™„๋ฃŒ๋˜์—ˆ๋‹ค๋ฉด ๋ฒ ์ŠคํŠธ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ๋ฝ‘์•„๋ณด์ž.

 

 

๋˜ํ•œ ์–ด๋–ค ์กฐํ•ฉ์—์„œ ์–ด๋–ค ์Šค์ฝ”์–ด๊ฐ€ ๋‚˜ํƒ€๋‚ฌ๋Š”์ง€ ํ™•์ธํ•  ์ˆ˜๋„ ์žˆ๋‹ค.

 

 

2) Randomized Search

When the number of hyperparameter combinations is large, this method samples parameters within a given range and evaluates only a specified number of draws.

 

# ๋žœ๋คํ•˜๊ฒŒ ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•ด์„œ ํ•™์Šตํ•˜๋„๋ก ํ•จ
# ๋ช‡ ๋ฒˆ์˜ ์‹œ๋„ํ›„์— ๊ฐ€์žฅ ์ข‹์€ ํŒŒ๋ผ๋ฏธํ„ฐ ์ฐพ๊ธฐ

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

# n_iter๋Š” ์กฐํ•ฉ์˜ ์ˆ˜, 10๊ฐœ์˜ ๋žœ๋คํ•œ ์กฐํ•ฉ์„ ์‹œ๋„ํ•ด๋ด„
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)

 

๊ทธ๋ฆฌ๊ณ  ์ด๋ฒˆ์—๋„ ์–ด๋–ค ์กฐํ•ฉ์—์„œ ์–ด๋–ค ์Šค์ฝ”์–ด๊ฐ€ ๋‚˜ํƒ€๋‚ฌ๋Š”์ง€ ํ™•์ธํ•ด๋ณด์ž.

 

7, 180์˜ ์กฐํ•ฉ์ผ ๋•Œ ๊ฐ€์žฅ ํšจ๊ณผ์ ์ธ ๋ชจ๋ธ์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.

 

๐Ÿ•ถ๏ธ 5. ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹ ํ‰๊ฐ€

๋งˆ์ง€๋ง‰์œผ๋กœ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์ตœ์ข… ํ‰๊ฐ€๋ฅผ ํ•ด๋ณด์ž.

 

# ๊ฐ€์žฅ ์ข‹์€ ๋ชจ๋ธ์„ final_model๋กœ ์ง€์ •
final_model = grid_search.best_estimator_

# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์˜ ๋ผ๋ฒจ ์ œ๊ฑฐ : X_test
# ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์˜ ์ •๋‹ต : y_test
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

# ๋ชจ๋ธ์— ๋„ฃ๊ธฐ์œ„ํ•ด ํ•™์Šต๋ฐ์ดํ„ฐ์™€ ๊ฐ™์ด ํŒŒ์ดํ”„๋ผ์ธ์œผ๋กœ transform ์ž‘์—…
# transformํ›„ final_model๋กœ ์˜ˆ์ธก์ž‘์—…
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

# ์—๋Ÿฌ ๊ณ„์‚ฐ
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

 

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์— ๋Œ€ํ•œ ์—๋Ÿฌ๊ฐ’(RMSE)

 

๐Ÿ“ฆ 6. ์ƒ์šฉํ™”๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ ์ €์žฅ

์ „์ฒ˜๋ฆฌ์™€ ์˜ˆ์ธก์ด ํ•œ๋ฒˆ์— ๋™์ž‘ํ•  ์ˆ˜ ์žˆ๋Š” ํ•˜๋‚˜์˜ ํŒŒ์ดํ”„๋ผ์ธ์„ ๋งŒ๋“ค์–ด, pkl ํŒŒ์ผํ˜•ํƒœ๋กœ ์ €์žฅํ•œ๋‹ค.

 

# ์ „์ฒ˜๋ฆฌ+์˜ˆ์ธก ๋‘˜ ๋‹ค ์žˆ๋Š” ํ•˜๋‚˜์˜ ํŒŒ์ดํ”„๋ผ์ธ ๋งŒ๋“ค๊ธฐ

full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("linear", LinearRegression())
    ])

# ์‚ฌ์šฉ๋ฐฉ๋ฒ•
# full_pipeline_with_predictor.fit(housing, housing_labels)
# full_pipeline_with_predictor.predict(some_data)

my_model = full_pipeline_with_predictor

# ๋ชจ๋ธ ํŒŒ์ผ ์ €์žฅ
import joblib
joblib.dump(my_model, "my_model.pkl")

# ๋ชจ๋ธ ํŒŒ์ผ ์‚ฌ์šฉ
my_model_loaded = joblib.load("my_model.pkl")
my_model_loaded.predict(some_data)

 

 

+) ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋„ฃ์—ˆ์„ ๋•Œ, ์–ด๋–ค ํŠน์„ฑ์ด ์ค‘์š”ํ•œ์ง€ ํŒŒ์•…ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

# ์ƒˆ๋กœ์šด ํŠน์„ฑ์„ ๋„ฃ์—ˆ์„ ๋•Œ ์–ด๋–ค ํŠน์„ฑ์ด ์ค‘์š”ํ•œ์ง€ ํŒŒ์•…ํ•˜๊ธฐ
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances

extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)

 

๊ฐ ํŠน์„ฑ๋ณ„ ์˜ํ–ฅ๋„๊ฐ€ ๋ˆˆ์— ๋“ค์–ด์˜จ๋‹ค.