MLOps-Feature Engineering

Apr 18, 2021

Feature Engineering

Squeezing the most out of data

Making data useful before training a model
Representing data in forms that help models learn
Increasing predictive quality
Reducing dimensionality with feature engineering

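How data is represented strongly affects how a model learns. For example, a model can converge faster when its input data is normalized.

Selecting or transforming the right input data is key to improving prediction quality, and reducing the input dimension is a good way to save computing resources.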

The art of feature engineering

Feature engineering creates a new version of the data from the raw data by transforming, projecting, eliminating, and combining features, in order to improve model performance (or to reduce the computing resources required).

The objective function continually steers the model toward the right goal.

When needed, new features can also be created and fed into the model.

As with the ML pipeline as a whole, this work is carried out incrementally and iteratively to improve the model's performance.

Typical ML pipeline

During training, the entire dataset is available, so global properties of the data can be used. For example, the standard deviation of a feature over the whole dataset can be computed and used for normalization.

Note that the same feature engineering applied during training must also be applied at serving time. For example, if a categorical feature was converted into one-hot vectors during training, it must be converted in the same way when serving.

However, training and serving usually run separately, so the transformations are often not applied identically. Errors of this kind are also hard to find.
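One way to reduce this risk is to compute the global properties once on the training set and reuse them, unchanged, at serving time. The following is a minimal sketch of that idea in plain Python, not a definitive implementation. The field names `area` and `street` and the helper names `fit_preprocessor` and `transform` are hypothetical and chosen only for illustration.

```python
import statistics

def fit_preprocessor(rows):
    """Compute global properties from the full training set (training time only)."""
    areas = [r["area"] for r in rows]
    return {
        "area_mean": statistics.fmean(areas),
        "area_std": statistics.pstdev(areas),
        # Fixed, ordered vocabulary for the one-hot encoding.
        "street_vocab": sorted({r["street"] for r in rows}),
    }

def transform(row, params):
    """Apply identical feature engineering at training and at serving time."""
    z = (row["area"] - params["area_mean"]) / params["area_std"]
    # A street not seen during training becomes an all-zero vector.
    one_hot = [1.0 if row["street"] == s else 0.0 for s in params["street_vocab"]]
    return [z] + one_hot

# Hypothetical training data.
train_rows = [
    {"area": 50.0, "street": "Main St"},
    {"area": 70.0, "street": "Oak Ave"},
    {"area": 90.0, "street": "Main St"},
]

params = fit_preprocessor(train_rows)
train_features = [transform(r, params) for r in train_rows]

# Serving: reuse the saved params instead of recomputing them per request.
serving_features = transform({"area": 70.0, "street": "Oak Ave"}, params)
```

In a real system, `params` would be stored alongside the model so that the separately running serving code loads exactly the values that training used.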

Key points

Feature engineering is difficult and time-consuming, but it is very important to get right.
Squeezing the most out of data through feature engineering helps the model learn better.
The fewer features a model is trained on, the more efficiently computing resources are used.
The feature engineering applied during training must be applied in the same way during serving.

Preprocessing Operations

One of the most important preprocessing operations is data cleansing: removing unneeded data and correcting erroneous data.

Feature tuning applies transformations such as scaling and normalizing to the data.

Feature extraction reduces dimensionality by extracting only the features that are needed, which saves computing resources.

Feature construction adds new features using a variety of techniques.

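The figure above shows an example in which raw data describing the house on the left is represented, through feature engineering, as the feature vector on the right.

Several operations are involved: integers are represented as floats, values are normalized, and a field such as street_name is represented as a one-hot vector.

Beyond this kind of data, we can use domain knowledge to apply a variety of preprocessing operations to text, images, and other kinds of data.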
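As a rough illustration of these operations, the sketch below maps one raw house record to a feature vector. Apart from `street_name`, the fields (`num_rooms`, `area`), the street vocabulary, and the area range are assumptions made up for this example and are not taken from the figure.

```python
# Hypothetical vocabulary and value range, assumed to come from the training data.
STREET_VOCAB = ["Main St", "Oak Ave", "Pine Rd"]
AREA_MIN, AREA_MAX = 40.0, 240.0

def house_to_vector(house):
    """Map a raw house record to a flat list of floats."""
    num_rooms = float(house["num_rooms"])                      # integer -> float
    area = (house["area"] - AREA_MIN) / (AREA_MAX - AREA_MIN)  # rescale to [0, 1]
    street = [1.0 if house["street_name"] == s else 0.0 for s in STREET_VOCAB]
    return [num_rooms, area] + street

raw_house = {"num_rooms": 3, "area": 140.0, "street_name": "Oak Ave"}
feature_vector = house_to_vector(raw_house)
print(feature_vector)  # [3.0, 0.5, 0.0, 1.0, 0.0]
```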

Key points

Data preprocessing transforms raw data into a clean, training-ready dataset
Feature engineering maps:
- Raw data into feature vectors
- Integer values to floating-point values
- Numerical values to normalized values
- Strings and categorical values to vectors of numeric values
- Data from one space into a different space

Feature Engineering Techniques

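This section covers basic feature engineering techniques that fall under numerical range and grouping.

Scaling

Scaling converts values from their original range to a predefined range. For example, rescaling grayscale image pixel values from 0~255 to -1~1 is a form of scaling.

Rescaling in this way helps neural networks converge faster and, in some cases, prevents NaN values from occurring.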
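The pixel example can be sketched as a linear map between two ranges. The helper name `rescale` is hypothetical, and the sample pixel values are made up.

```python
def rescale(value, old_min, old_max, new_min, new_max):
    """Linearly map value from [old_min, old_max] to [new_min, new_max]."""
    ratio = (value - old_min) / (old_max - old_min)
    return new_min + ratio * (new_max - new_min)

# Grayscale pixel values in 0~255, rescaled to -1~1.
pixels = [0, 51, 255]
scaled = [rescale(p, 0, 255, -1.0, 1.0) for p in pixels]
# The endpoints 0 and 255 map to -1.0 and 1.0; 51 maps to roughly -0.6.
```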

Normalization

Normalization rescales values by computing (X - X_min) / (X_max - X_min). After this step, values are rescaled to between 0 and 1, regardless of their original values.

Normalization usually works well when applied to data that does not follow a Gaussian distribution.
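A minimal sketch of the formula above, with X_min and X_max taken from the data itself. The function name and sample values are illustrative, and mapping a constant feature to 0.0 is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values):
    """Rescale values to [0, 1] using (X - X_min) / (X_max - X_min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

prices = [100.0, 150.0, 300.0]
normalized = min_max_normalize(prices)
print(normalized)  # [0.0, 0.25, 1.0]
```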

Standardization

Standardization centers all values around 0 by computing the z-score, (X - mean) / standard deviation. It tends to work well when the data follows a normal distribution.

Applying both normalization and standardization to the data and comparing the results can be a good starting point.
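The z-score subtracts the mean and divides by the standard deviation. The sketch below uses Python's `statistics` module and the population standard deviation. That choice, the function name, and the sample values are assumptions for illustration.

```python
import statistics

def standardize(values):
    """Return z-scores: (x - mean) / standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z_scores = standardize(values)
# After standardization the values have mean 0 and standard deviation 1.
```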

Bucketing

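Bucketing is a form of grouping. For example, when vectorizing the year a house was built, 1960 and 1961 are not meaningfully different, yet encoding each year as its own one-hot entry makes the vector's dimension large.

Grouping the years into buckets of 10 years each therefore gives a more efficient encoding.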
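A sketch of decade bucketing, under the assumption that the years of interest start at 1950 and span eight decades. Both parameters and the function name are hypothetical, and years outside the range are clamped to the first or last bucket.

```python
def decade_bucket(year, first_decade=1950, num_buckets=8):
    """One-hot encode a construction year by its 10-year bucket."""
    index = (year - first_decade) // 10
    index = max(0, min(num_buckets - 1, index))  # clamp out-of-range years
    return [1.0 if i == index else 0.0 for i in range(num_buckets)]

# 1960 and 1961 land in the same bucket, so they share one encoding.
same_bucket = decade_bucket(1960) == decade_bucket(1961)
print(same_bucket)  # True
```

With these assumptions, eight dimensions cover 80 years, whereas a one-hot entry per year would need 80.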

Key Points

Feature engineering prepares, tunes, transforms, extracts, and constructs features.
Feature engineering is key for model refinement
Feature engineering helps with ML analysis

Feature crosses

A feature cross combines multiple features into a new feature.

For data with non-linear structure, visualizing the structure of the data with various tools can give insight into how to encode it.
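One common form of feature cross, sketched below, combines two categorical features into a single one-hot vector over every pair of values. The features (a day type and an hour bucket) and their vocabularies are hypothetical and chosen only for illustration.

```python
from itertools import product

def cross_one_hot(value_a, value_b, vocab_a, vocab_b):
    """One-hot encode the pair (value_a, value_b) over all combinations."""
    crossed_vocab = list(product(vocab_a, vocab_b))
    return [1.0 if (value_a, value_b) == pair else 0.0 for pair in crossed_vocab]

DAYS = ["weekday", "weekend"]
HOURS = ["morning", "evening"]

crossed = cross_one_hot("weekend", "evening", DAYS, HOURS)
print(crossed)  # [0.0, 0.0, 0.0, 1.0]
```

The crossed feature lets even a linear model assign a separate weight to each combination, such as "weekend evening", which neither feature can express alone.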

PyojunCode
