Python Machine Learning (3)

minung14 2017. 1. 25. 01:45

Let's use gradient descent to minimize the value of the cost function.

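With gradient descent, we can update the weights w by taking a step in the direction opposite to the gradient of the cost function J(w) = ½ Σ_i (y^(i) − φ(z^(i)))².
w := w + ∆w
Here the weight change ∆w is defined as the negative gradient multiplied by the learning rate η.
∆w = −η∇J(w)
To compute the gradient of the cost function, we need the partial derivative of the cost function with respect to each weight w_j.
∂J/∂w_j = −Σ_i (y^(i) − φ(z^(i))) x_j^(i)
The update for weight w_j can therefore be written as follows, where y^(i) is the target of the i-th training sample, φ(z^(i)) is its linear activation output, and x_j^(i) is its j-th feature value.
∆w_j = −η ∂J/∂w_j = η Σ_i (y^(i) − φ(z^(i))) x_j^(i)

Because all weights are updated simultaneously, the Adaline learning rule is as follows.
w := w + ∆w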
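To make the update rule concrete, here is a minimal sketch of a single batch gradient descent step. The toy data and the helper names sse_cost and gradient are made up for illustration and are not part of the original post; the bias is stored as w[0], as in the class below.

```python
import numpy as np


def sse_cost(w, X, y):
    """J(w) = 1/2 * sum_i (y_i - z_i)^2 with z = w[0] + X @ w[1:]."""
    errors = y - (X @ w[1:] + w[0])
    return 0.5 * np.sum(errors ** 2)


def gradient(w, X, y):
    """Analytic gradient of J with respect to every weight (bias first)."""
    errors = y - (X @ w[1:] + w[0])
    return -np.concatenate(([errors.sum()], X.T @ errors))


# Made-up toy data: 4 samples, 2 features, labels in {-1, 1}
X_toy = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 0.3], [-0.7, -1.2]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])

w = np.zeros(3)
eta = 0.01

cost_before = sse_cost(w, X_toy, y_toy)
# w := w + delta_w, with delta_w = -eta * grad J(w)
w_new = w - eta * gradient(w, X_toy, y_toy)
cost_after = sse_cost(w_new, X_toy, y_toy)

print(cost_before, cost_after)
```

With a small enough learning rate, the cost after the step should be lower than before it.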

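Implementing an Adaptive Linear Neuron in Python
Because the perceptron and Adaline are very similar, we take the perceptron implementation defined earlier and change the fit method so that the weights are updated by minimizing the cost function via gradient descent.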

import numpy as np


class AdalineGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost function value in each epoch.

    """
    def __init__(self, eta=0.01, n_iter=50):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self.w_ = np.zeros(1 + X.shape[1])
        self.cost_ = []

        for i in range(self.n_iter):
            net_input = self.net_input(X)
            # Please note that the "activation" method has no effect
            # in the code since it is simply an identity function. We
            # could write `output = self.net_input(X)` directly instead.
            # The purpose of the activation is more conceptual, i.e.,  
            # in the case of logistic regression, we could change it to
            # a sigmoid function to implement a logistic regression classifier.
            output = self.activation(X)
            errors = (y - output)
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            cost = (errors**2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)

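Instead of updating the weights after evaluating each training sample, as in the perceptron, we compute the gradient from the whole training set: self.eta * errors.sum() for the zero-weight (the bias) and self.eta * X.T.dot(errors) for weights 1 through m.
(X.T.dot(errors) is a matrix-vector multiplication between the feature matrix and the error vector.)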
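As a quick sanity check of that matrix-vector product, the sketch below compares X.T.dot(errors) with an explicit loop that sums error * x over all samples. The numbers are made up for illustration.

```python
import numpy as np

# Made-up values: 3 samples, 2 features
X_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
errors = np.array([0.5, -1.0, 0.25])

# Vectorized form used in AdalineGD.fit
vectorized = X_toy.T.dot(errors)

# Same quantity accumulated sample by sample
looped = np.zeros(X_toy.shape[1])
for xi, e in zip(X_toy, errors):
    looped += e * xi

print(vectorized, looped)
```

Both variants give one accumulated value per feature, which is exactly the sum in the weight update formula.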

As in the earlier perceptron implementation, we collect the cost values in the list self.cost_ so that we can check after training whether the algorithm converged.

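import matplotlib.pyplot as plt

# X and y are assumed to be the training data prepared earlier in this series
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 4))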

ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
ax[0].plot(range(1, len(ada1.cost_) + 1), np.log10(ada1.cost_), marker='o')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('log(Sum-squared-error)')
ax[0].set_title('Adaline - Learning rate 0.01')

ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
ax[1].plot(range(1, len(ada2.cost_) + 1), ada2.cost_, marker='o')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Sum-squared-error')
ax[1].set_title('Adaline - Learning rate 0.0001')

plt.tight_layout()
# plt.savefig('./adaline_1.png', dpi=300)
plt.show()

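Plotting the cost for two different learning rates, the left graph (η = 0.01) shows the error growing with every epoch instead of the cost being minimized.
When the learning rate is too large like this, each update overshoots the minimum and the cost diverges.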
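The overshooting effect is easy to reproduce on a deliberately simple problem. The sketch below is not the Iris experiment from this post; it minimizes the toy function J(w) = w², and the function name run_gd and both learning rates are chosen only for illustration.

```python
def run_gd(eta, n_iter=10):
    """Minimize J(w) = w**2 (gradient 2*w) from w = 1.0; return the cost per step."""
    w = 1.0
    costs = []
    for _ in range(n_iter):
        w = w - eta * 2.0 * w  # gradient descent step
        costs.append(w ** 2)
    return costs


small = run_gd(eta=0.1)  # each step shrinks w by a factor of 0.8
large = run_gd(eta=1.1)  # each step multiplies w by -1.2, so |w| grows

print(small[-1], large[-1])
```

With the small learning rate the cost shrinks at every step, while with the large one it grows at every step.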
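Many machine learning algorithms require feature scaling for optimal performance; this is covered again in Chapter 3.
Gradient descent is one of the algorithms that benefit from feature scaling. Here we use a feature scaling method called standardization, which gives each feature a mean of 0 and a standard deviation of 1.
For example, to standardize the j-th feature, we simply subtract the sample mean µ_j from every training sample and divide by the standard deviation σ_j.
x'_j = (x_j − µ_j) / σ_j
Here x_j is a vector consisting of the j-th feature values of all n training samples.
Standardization can be implemented easily with the NumPy methods mean and std.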
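As a small illustration of the formula, the sketch below standardizes all columns of a made-up array at once by passing axis=0 to mean and std. The array values are assumptions for the example, not the data used in this post.

```python
import numpy as np

# Made-up feature matrix: 4 samples, 2 features
X_toy = np.array([[5.1, 1.4], [4.9, 1.4], [7.0, 4.7], [6.4, 4.5]])

# axis=0 computes the mean and standard deviation per column (per feature)
X_toy_std = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_toy_std.mean(axis=0), X_toy_std.std(axis=0))
```

After the transformation, every column should have a mean of (numerically) 0 and a standard deviation of 1.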

# standardize features
X_std = np.copy(X)
X_std[:, 0] = (X[:, 0] - X[:, 0].mean()) / X[:, 0].std()
X_std[:, 1] = (X[:, 1] - X[:, 1].mean()) / X[:, 1].std()

ada = AdalineGD(n_iter=15, eta=0.01)
ada.fit(X_std, y)

plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.tight_layout()
# plt.savefig('./adaline_2.png', dpi=300)
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')

plt.tight_layout()
# plt.savefig('./adaline_3.png', dpi=300)
plt.show()

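After running the code above, the resulting graph shows the cost decreasing steadily.
With a learning rate of η = 0.01, Adaline now converges after training on the standardized features. However, the SSE (sum of squared errors) is not zero even though all samples are classified correctly.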
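Why the SSE stays above zero can be seen on a tiny example: the cost is computed from the continuous linear output, while the class label only depends on its sign. The data below is made up, and the weights are set by hand (they happen to be the least-squares solution for this toy data) rather than fitted with the class above.

```python
import numpy as np

# Made-up 1-D data with labels in {-1, 1}
X_toy = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_toy = np.array([-1, -1, 1, 1])

# Hand-set weights [bias, w1]; an assumption for illustration
w = np.array([0.0, 0.6])

output = X_toy.dot(w[1:]) + w[0]        # continuous linear activation
pred = np.where(output >= 0.0, 1, -1)   # unit step on top of it
sse = ((y_toy - output) ** 2).sum() / 2.0

print(pred, sse)
```

Every sample ends up on the correct side of the threshold, yet the outputs are not exactly −1 or 1, so the squared errors do not vanish.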

Large scale machine learning and stochastic gradient descent

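In the fit method, we now update the weights after each training sample. In addition, we implement a partial_fit method that does not reinitialize the weights.
To check whether the algorithm converges after training, we compute the average cost of the training samples in each epoch. We also add an option to shuffle the training data before each epoch to avoid cycles when optimizing the cost function.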

from numpy.random import seed

class AdalineSGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ------------
    eta : float
        Learning rate (between 0.0 and 1.0)
    n_iter : int
        Passes over the training dataset.
    shuffle : bool (default: True)
        Shuffles training data every epoch if True to prevent cycles.
    random_state : int (default: None)
        Seed for the random number generator used for shuffling.

    Attributes
    -----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost function value averaged over all
        training samples in each epoch.

    """
    def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False
        self.shuffle = shuffle
        if random_state is not None:  # a seed of 0 is valid and must not be skipped
            seed(random_state)

    def fit(self, X, y):
        """ Fit training data.

        Parameters
        ----------
        X : {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples and
            n_features is the number of features.
        y : array-like, shape = [n_samples]
            Target values.

        Returns
        -------
        self : object

        """
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights"""
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = np.random.permutation(len(y))
        return X[r], y[r]

    def _initialize_weights(self, m):
        """Initialize weights to zeros"""
        self.w_ = np.zeros(1 + m)
        self.w_initialized = True

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error**2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)

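The _shuffle method used in the AdalineSGD classifier works as follows.
The permutation function in numpy.random generates a random sequence of unique numbers, namely the sample indices 0 to len(y) − 1 (0 to 99 when there are 100 training samples).
Those numbers are then used as indices to shuffle the feature matrix and the class label vector.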
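The sketch below shows the same indexing trick on made-up arrays. It uses a local np.random.RandomState instead of the global seed from the class, which is a choice made for this example only.

```python
import numpy as np

rng = np.random.RandomState(1)  # local generator, an assumption for this sketch

# Made-up data where row i of X_toy belongs to label i
X_toy = np.array([[0, 0], [1, 10], [2, 20], [3, 30]])
y_toy = np.array([0, 1, 2, 3])

r = rng.permutation(len(y_toy))      # each index 0..3 exactly once, in random order
X_shuf, y_shuf = X_toy[r], y_toy[r]  # the same index array keeps rows and labels paired

print(r)
print(X_shuf, y_shuf)
```

Because both arrays are indexed with the same permutation, every sample keeps its own label after shuffling.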
Now we use the fit method to train the AdalineSGD classifier.

ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
ada.fit(X_std, y)

plot_decision_regions(X_std, y, classifier=ada)
plt.title('Adaline - Stochastic Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')

plt.tight_layout()
#plt.savefig('./adaline_4.png', dpi=300)
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')

plt.tight_layout()
# plt.savefig('./adaline_5.png', dpi=300)
plt.show()

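The resulting graphs show that the average cost drops quite quickly and that the final decision boundary after 15 epochs looks similar to the one from batch gradient descent with Adaline.
If we want to update the model, we can simply call the partial_fit method on individual samples.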
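To show what such per-sample updates look like, here is a self-contained sketch. TinyAdalineSGD is a hypothetical, stripped-down stand-in for the AdalineSGD class above (only partial_fit and predict), and the sample stream is made up.

```python
import numpy as np


class TinyAdalineSGD:
    """Hypothetical minimal stand-in for AdalineSGD (illustration only)."""

    def __init__(self, eta=0.01):
        self.eta = eta
        self.w_ = None

    def partial_fit(self, X, y):
        """Update the weights with the given sample(s) without reinitializing."""
        X = np.atleast_2d(X)
        y = np.atleast_1d(y)
        if self.w_ is None:  # initialize only on the very first call
            self.w_ = np.zeros(1 + X.shape[1])
        for xi, target in zip(X, y):
            error = target - (xi.dot(self.w_[1:]) + self.w_[0])
            self.w_[1:] += self.eta * error * xi
            self.w_[0] += self.eta * error
        return self

    def predict(self, X):
        """Return class label after unit step."""
        net = np.atleast_2d(X).dot(self.w_[1:]) + self.w_[0]
        return np.where(net >= 0.0, 1, -1)


# Made-up stream of samples arriving one at a time
stream_X = np.array([[1.0, 1.2], [-1.1, -0.9], [0.8, 1.0], [-0.9, -1.3]])
stream_y = np.array([1, -1, 1, -1])

model = TinyAdalineSGD(eta=0.1)
for _ in range(20):
    for xi, target in zip(stream_X, stream_y):
        model.partial_fit(xi, target)  # weights carry over between calls

print(model.predict(stream_X))
```

With the full class from this post, the equivalent call would look like ada.partial_fit(X_std[0, :], y[0]).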

License: Attribution-NonCommercial-NoDerivatives