Auto Regressive Integrated Moving Average (ARIMA) models and a related concept known as Auto Regressive Conditional Heteroskedasticity (ARCH) models will be our focus here. The distinct purposes of the “MA” and the “CH,” respectively, offer an interesting map for when to choose between the two.

ARIMA models can capture relationships in our Time Series data that involve both long term trends (AR) and sudden disruptions (MA). An ARIMA model is essentially two different models added together. The Auto Regressive aspect models the predicted value on previous values of the series earlier in time. The Moving Average aspect does…

Recommender Systems rely on the concept of similarity or proximity in your data. Think of 2D coordinates on a grid and the distances between them, which can be measured with Geometry and Trigonometry. We can use unsupervised learning methods to cluster data points in different ways based on these measures, often called “distance metrics”. We can efficiently organize our data visually to clarify what we already know, or even learn new things altogether.
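A small sketch of those two ideas, using toy points invented for illustration: two common distance metrics, and a k-means clustering of a handful of 2D points:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock
from sklearn.cluster import KMeans

# Two hypothetical points on a 2D grid.
a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))   # straight-line distance: 5.0 (a 3-4-5 triangle)
print(cityblock(a, b))   # Manhattan distance: 7.0 (3 across + 4 up)

# Cluster a tiny toy dataset: two clumps of points, far apart.
X = np.array([[0, 0], [0.5, 0.2], [5, 5], [5.2, 4.8]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```

The choice of metric changes what "close" means, which in turn changes which points end up grouped together.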

Data & Background: Cuisines from Around the World
Our data comes from a Kaggle competition, “What’s Cooking?”, that tasks participants with classifying the type of cuisine, with…

Sometimes for a classification problem we will have a clear and important distinction between our target classes. Even here, though, we can face an unfortunate situation where there are simply not many observations of one class in proportion to another. This imbalance can make it challenging for classification models to distinguish between the classes. If the models can’t distinguish well, the predictions they make will not be useful. When a model is trained on significantly imbalanced classes it may even struggle to make predictions that improve upon baseline accuracy. Fortunately, there are methods that can alleviate this situation. …
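Two common remedies are reweighting the loss and resampling the minority class. A minimal sketch of both, on a synthetic dataset invented for illustration (950 negatives vs. 50 positives):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

rng = np.random.default_rng(42)
# Hypothetical imbalanced data: 950 majority-class rows, 50 minority-class rows.
X = np.vstack([rng.normal(0, 1, (950, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 950 + [1] * 50)

# Option 1: reweight the loss so minority-class errors count more.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: oversample the minority class (with replacement) to match the majority.
X_min, y_min = resample(X[y == 1], y[y == 1], replace=True,
                        n_samples=950, random_state=42)
X_bal = np.vstack([X[y == 0], X_min])
y_bal = np.concatenate([y[y == 0], y_min])
balanced = LogisticRegression().fit(X_bal, y_bal)
```

Either approach pushes the model away from the trivial strategy of always predicting the majority class.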

In Natural Language Processing (NLP), word vectorization offers several options for assigning numerical values to text data. It is this process that allows us to build models that can reliably identify characteristics of a given document. We will explore how Term Frequency-Inverse Document Frequency (TF-IDF) vectorization can be applied to distinguish patterns in a document and help us classify where a text may have originated based on its content.

Background for Experiment:
We will be using TF-IDF to help us classify content from Reddit posts to see if a model can identify which subreddit a post came from. For our purposes here…
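A sketch of the overall shape of such a classifier, with made-up posts and subreddit labels standing in for real scraped Reddit data (the labels and example texts here are invented for illustration, not from the actual experiment):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical posts and subreddit labels standing in for real scraped data.
posts = [
    "how do I fix this stack trace",
    "my sourdough starter doubled overnight",
    "segfault when freeing the pointer",
    "best hydration level for baguettes",
]
labels = ["programming", "breadit", "programming", "breadit"]

# Pipeline: vectorize the text, then fit a linear classifier on the weights.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(posts, labels)
print(clf.predict(["null pointer exception in my code"]))
```

The pipeline keeps the vectorizer's vocabulary tied to the training data, so new posts are transformed consistently at prediction time.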

In Part 1 we went over how to encode categorical features that are either nominal or ordinal. Completing this process gave us a DataFrame with more than double our original 82 features. For our purposes here, we want to keep as many of these new features as possible, but this will present some challenges.

Model Assumptions & Feature Challenges:
We will later be using a basic Linear Regression model, but first we must check whether our features satisfy certain properties for this model to be useful. An important concept is Multicollinearity between features. Multicollinearity is present when the features used to predict…

The Ames, Iowa housing prices dataset (Ames data) offers a great opportunity to explore the tools available to us as Data Scientists. We’re going to explore the encoding of categorical data and, later, how we can work with all of those categories by using dimension reduction and Principal Component Analysis. Our target feature is the sale price of a home, and the other features describe the characteristics, location, and condition of the home. Before modeling our data and making predictions, we need to process our categorical features or variables.

Categorical variables can be of two types:
- Nominal: Simple distinguishing…
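A minimal sketch of how these two types might be encoded in pandas, using a toy frame whose column names and category values are invented for illustration:

```python
import pandas as pd

# Hypothetical frame with one nominal and one ordinal column.
df = pd.DataFrame({
    "roof_style": ["gable", "hip", "gable"],        # nominal: no natural order
    "quality": ["poor", "good", "excellent"],        # ordinal: ranked categories
})

# Nominal: one-hot encode, since the categories have no ranking.
nominal = pd.get_dummies(df["roof_style"], prefix="roof_style")

# Ordinal: map to integers that preserve the ranking.
order = {"poor": 0, "good": 1, "excellent": 2}
df["quality_encoded"] = df["quality"].map(order)
print(nominal.columns.tolist(), df["quality_encoded"].tolist())
```

One-hot encoding is what multiplies the feature count so quickly: every nominal column expands into one new column per category.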

Aaron Hume

Data professional with a passion for numbers.
