Developing an ML Model to Predict the Energy Efficiency of Buildings

Predicting the Energy Star Score of Buildings Using a Random Forest Regressor

Machine learning (ML) is a branch of artificial intelligence built on the foundation of systems learning from data, identifying patterns and making decisions with minimal human intervention, and it can be used to produce models relevant to enormous fields like the energy sector. Based on the code and three-article series by Will Koehrsen, I worked on a project at the intersection of ML and sustainable energy which I’ll be sharing in this article!

Intro to Project

The objective of this project is to use an energy dataset from New York City and build a model that can predict the Energy Star Score of buildings. An Energy Star Score is expressed on a scale of 0–100 (100 being the best) and allows you to understand how your building’s energy consumption measures up against similar buildings nationwide. It is a great method to see how well your building performs with regard to energy consumption, taking into account a building’s physical attributes, operations, and primary uses.

This is a regression ML task which is a type of supervised learning. The task of a regression algorithm is to map an input value with a continuous target variable. Where classification is used for predicting a label, regression is used for predicting a quantity. Seeing as we are trying to predict a score, it wouldn’t make sense to use classification for this purpose.

Data Cleaning

We can start off by inputting the dataset and libraries, and then can move into the cleaning and wrangling process. We first need to address the problem that the missing values are encoded as “Not Available”, which we can replace with ‘not a number’ using np.nan. This converts all values that are not available to values that can be interpreted as numbers. We can use astype() to convert columns with strings to the float datatype, a float being a data type composed of a number that is not an integer (eg it includes a fraction in decimal format). The astype()function also enables any suitable existing column to be converted to a categorical type.

Another very common problem when dealing with datasets is that there can be missing values that must be filled in or removed. They’re fine for our exploratory data analysis, but need to be handled before training on a model. We can use the following function to calculate missing values by column,

def missing_values_table(df):
# Total missing values
mis_val = df.isnull().sum()

Then print some summary information using the following code,

   print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
"There are " + str(mis_val_table_ren_columns.shape[0]) +
" columns that have missing values.")

And the result is:

Your selected dataframe has 60 columns.
There are 46 columns that have missing values.

We can then calculate the number of missing values and the percentage of the total values that are missing for each column, and remove any columns with more than 50% missing values. The result is that we will remove 11 columns.

Exploring our Data

One of the most fun steps of an ML workflow is exploratory data analysis, or EDA. This is a visualization method that analyzes datasets to summarize the main characteristics, trends, and patterns in our data. We can start off with a single variable plot, also called univariate.


A univariate plot shows the distribution of a single variable, like a histogram. We can examine the distribution of the Energy Star Score variable, which is shown below:

This is interesting because the expectation would be a flat distribution, however this is not the case! 1 and 100 are disproportionately larger than the values in between. The Energy Star Score is based on self-reported usage, so we can guess why quite a few buildings are at the 100 mark. We can take note that there may be some bias in the reporting of scores.

Density Plots

Next we can create a few density plots, where we look for relationships between our features and target. We can look at the effect of categorical variables on our target and can do this by investigating the distribution of a single variable coloured by a class.

The first density plot I created was a plot of the Energy Star Score coloured by the type of building, shown below:

Seeing as these lines vastly differ, we can see that the building type is very correlated with the Energy Star Score. Office buildings have a higher score (thus are more energy efficient) whereas hotels have a much lower score.

The second density plot I looked at was a plot of the Energy Star Score coloured by the borough, which you can see below:

The borough has a much less significant impact on the score in comparison to the building type, however because there are definitely some differences between the boroughs we can include it in our model.

We will include both the building type and borough in our ML model because they both have an effect on the target (the building type notably more so than the borough).

Examining Correlations

Finding correlations in data is another part of EDA. Pearson’s correlation coefficient measures the statistical relationship between two continuous variables, providing information about the magnitude of the correlation and the direction of the relationship. This is set between the values -1 and 1, -1 meaning two variables are perfectly negatively linearly correlated and 1 means the two variables are perfectly positively linearly correlated.

We can input the following to get the most negative and positive correlations:

# Find all correlations and sort
correlations_data = data.corr()['score'].sort_values()

# Print the most negative correlations
print(correlations_data.head(15), '\n')

# Print the most positive correlations

The strongest negative correlations are the site EUI and weather normalized site EUI, followed by the weather normalized source EUI and source EUI. The EUI stands for energy unit intensity and expresses a building’s energy use by dividing the amount of energy used by the square footage of the building. The lower score, the better. Thus it makes sense that when the Energy Star Score increases, the EUI decreases.

Two-Variable Plots

Next we can use a scatterplot to plot two variables against one another and use colour to represent a categorical variable like the building type which we created a density plot with earlier. Now that we know site EUI is the most negatively correlated variable with the Energy Star Score, creating a plot to compare them is a good idea!

There is a very clear and linear negative relationship between the Site EUI and the score. EUI is definitely something to use in our model.

We can also run a pairs plot, which is a grid that shows bivariate relationships between every variable in a multivariate dataset:

Again we are shown the high correlation between the EUI and energy star score, but nothing else to really take into account. The site EUI and weather normalized EUI are highly correlated which of course makes sense but isn’t very relevant. In fact, we likely do not need both of these variables because they are so similar — and we’ll get to this in feature engineering and selection.

Feature Engineering and Selection

Feature engineering and selection are an important part of an ML workflow. Firstly looking at feature engineering, this is where we transform the given data into a more interpretable form, AKA taking raw data and creating new features that are more digestible. In feature engineering, we one-hot encode categorical variables and add in a natural log transformation of the numerical variables.

One-hot encoding is a very common method of preprocessing categorical features because a model cannot understand “borough”, but it can understand a numerical translation like “0”. Therefore we will record the borough and property use type variables as 0 and 1, like this:

# Select the categorical columns
categorical_subset = data[['Borough', 'Largest Property Use Type']]
# one hot encode
categorical_subset = pd.get_dummies(categorical_subset)

We can also add in the natural log transformation of the numerical variables because this makes highly skewed distributions less skewed, subsequently making the data more interpretable and helping our model to understand non-linear relationships in the data. We can replace each variable xwith log(x), and then join our two dataframes (numeric and categorical) using concat(), a function that combines (or concatenates) the text from multiple strings. This translates to the code below:

# Create columns with log of numeric columns
for col in numeric_subset.columns:
# Skip the Energy Star Score column
if col == 'score':
numeric_subset['log_' + col] = np.log(numeric_subset[col])
# Join the two dataframes using concat
features = pd.concat([numeric_subset, categorical_subset], axis = 1)

We are left with 11,319 observations (buildings) and 110 columns (features). However, not all of these will be useful for our model. This is where feature selection comes in!

Feature Selection

It is very key to consider what features to use as there are thousands of possible features. If we did plug in all our features, the algorithm wouldn’t be very efficient and would be heavily overfitted (which occurs when a model fits exactly against its training data). Less misleading data results in increased accuracy and also of course reduced training time which is something that must be considered in ML!

We can remove unnecessary features by looking at collinear features. These are features that are strongly correlated with one another, and often times it’s useful to remove one from a pair. We can calculate the collinearity between features by using the correlation coefficient, and then we can drop one from a pair if the coefficient is >0.6 like this:

# Remove the collinear features above a specified correlation coefficient
features = remove_collinear_features(features, 0.6);

# Remove any columns with all na values
features = features.dropna(axis=1, how = 'all')
(11319, 65)

After our feature selection, we have 64 features, one being the target.

Naive Baseline

The final step before making our ML models is determining a naive baseline that our models will have to beat. Baselines help put a complex model into context with regards to accuracy by which all other models evaluated on a dataset can be compared. This is important because we need to make sure our models can outperform a naive guess — if we can’t get to this point, our model may not be very good.

Before we do this, we can divide our features into training and testing sets. Most of the data will be used for training, and a smaller amount will go to the testing set where we evaluate the mapping learned by the model in the training process. We can split 70% of the buildings with a score into the testing set and 30% into the training set using train_test_split()like this:

X, X_test, y, y_test = train_test_split(features, targets, test_size = 0.3, random_state = 42)

We are left with 1858 buildings with no score, 6338 buildings with a score in training set, and 3123 buildings with a score in the testing set.

Now for the baseline, there are different metrics that can be used for evaluating models, but we will use the mean absolute error (MAE) which is easy to work with and quite popular. It’s also intuitive because the units of the error score match the target value units. The MAE measures the average error between paired observations, expressing the absolute value of the difference between an observed value and the true value of a quantity. We can make our model guess the median value of the target on the training set, and then using the MAE metric we’ll be able to see how far off it is from the real median value.

We can use this function:

def mae(y_true, y_pred):
return np.mean(abs(y_true - y_pred))
baseline_guess = np.median(y)

And we get that,

The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.4845

The average estimate is off by about 25 points, meaning a 25% error in this case. This should not be too hard for our model to beat.

Evaluating and Comparing Models

Although we dropped data with more than 50% missing values, there are still some missing observations. We can use median imputation, which replaces all the missing values in a column with the median value of the column. We can also use normalization to scale the features by putting each in a range between 0 and 1. When comparing multiple algorithms, this is a good step to take.

We can finally evaluate five different models to see which is best — Linear Regression, K-Nearest Neighbors Regression, Random Forest Regression, Gradient Boosted Regression, and Support Vector Machine Regression. I used the defaults for the model hyperparameters and saw which had the lowest MAE using the following code:

lr = LinearRegression()
lr_mae = fit_and_evaluate(lr)

'Linear Regression Performance on the test set: MAE = %0.4f' % lr_mae

svm = SVR(C = 1.0, gamma = 0.1)
svm_mae = fit_and_evaluate(svm)

'Support Vector Machine Regression Performance on the test set: MAE = %0.4f' % svm_mae

random_forest = RandomForestRegressor(random_state=50)
random_forest_mae = fit_and_evaluate(random_forest)

'Random Forest Regression Performance on the test set: MAE = %0.4f' % random_forest_mae

gradient_boosted = GradientBoostingRegressor(random_state=50)
gradient_boosted_mae = fit_and_evaluate(gradient_boosted)

'Gradient Boosted Regression Performance on the test set: MAE = %0.4f' % gradient_boosted_mae

knn = KNeighborsRegressor(n_neighbors=10)
knn_mae = fit_and_evaluate(knn)

'K-Nearest Neighbors Regression Performance on the test set: MAE = %0.4f' % knn_mae

Which yields,

Linear Regression Performance on the test set: MAE = 13.4651
Support Vector Machine Regression Performance on the test set: MAE = 16.2058
Random Forest Regression Performance on the test set: MAE = 9.5231
Gradient Boosted Regression Performance on the test set: MAE = 10.0116
K-Nearest Neighbors Regression Performance on the test set: MAE = 13.0131

We can see that the random forest regressor performs the best. Random forest is a strong modelling technique that works by aggregating many decision trees to limit overfitting and error.


Hyperparameters are set by a data scientist before training and can be adjusted to optimize the performance of a model. We can choose the best hyperparameters for a model through both random search and cross-validation. Random search is where we define a range of options, and then randomly select combinations to try, and cross-validation assesses the performance of the hyperparameters.

I optimized n_estimators,max_depth, min_samples_leaf, min_samples_split, and max_features.

In my code I built a hyperparameter grid and performed a hyperparameter search using 4-fold cross-validation over different combinations of hyperparameters, which looked like this:

# hyperparameters
# Number of trees used in the boosting process
n_estimators = [100, 200, 400, 500, 800]
# Maximum depth of each tree
max_depth = [2, 4, 5, 8, 12]
# Minimum number of samples per leaf
min_samples_leaf = [1, 2, 4, 6, 8]
# Minimum number of samples to split a node
min_samples_split = [2, 4, 6, 8, 10]
# Maximum number of features to consider for making splits
max_features = ['auto', 'sqrt', 'log2', None]
# Define the grid of hyperparameters to search
hyperparameter_grid = {'n_estimators': n_estimators,
'max_depth': max_depth,
'min_samples_leaf': min_samples_leaf,
'min_samples_split': min_samples_split,
'max_features': max_features}
# Create the model to use for hyperparameter tuning
model = RandomForestRegressor(random_state = 42)
# Set up the random search with 4-fold cross validation
random_cv = RandomizedSearchCV(estimator=model,
cv=4, n_iter=25,
scoring = 'neg_mean_absolute_error',
n_jobs = -1, verbose = 1,
return_train_score = True,
# Fit on the training data, y)

The best model gives the following hyperparameters:

  • n_estimators = 200
  • max_depth = 12
  • min_samples_leaf = 1
  • min_samples_split = 8
  • max_features = None

We can then use grid-search to find the optimal amount of decision trees which translates to the following graph:

As the number of trees used by the model increases, both the training and the testing error decrease. 800 trees yield the lowest error in cross-validation. Because the test error is higher than the training error, we are overfitting but not by a large margin and this is quite common. Our new and final hyperparameters are:

  • n_estimators = 800
  • max_depth = 12
  • min_samples_leaf = 1
  • min_samples_split = 8
  • max_features = None

Testing Out The Model

We can look at the performance of the default model vs the tuned model on our test set to see if our hyperparameter tuning did anything.

We can use the following code:

# Make predictions on the test set using default and final model
default_pred = default_model.predict(X_test)
final_pred = final_model.predict(X_test)

Which yields,

Default model performance on the test set: MAE = 9.5367.
Final model performance on the test set: MAE = 9.4970.

There is only an improvement of 0.42% but this could be a significant benefit depending on the use case.

Locally Interpretable Model-agnostic Explanations

Now we have our model developed and tested, but we don’t exactly know how it works. Sometimes machine learning is thought of as a ‘black box’ because it can be difficult to understand the inner workings of a model after they have been trained. We can use LIME, a technique that approximates ML models with interpretable models to explain each individual prediction.

What’s most important is finding where our observation goes wrong, and why. We input this code to see a wrong prediction:

# Display the predicted and true value for the wrong instance
print('Prediction: %0.4f' % model.predict(wrong.reshape(1, -1)))
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])

# Explanation for wrong prediction
wrong_exp = explainer.explain_instance(data_row = wrong,
predict_fn = model.predict)
Prediction: 21.6300
Actual Value: 94.0000

In this example, our model predicted a score of 21.63 and the actual value was 94.

The plot explaining this prediction is below:

I minimized the y-axis font sizes so they would fit on the screen but they’re a bit small to read! The feature at the top is the Site EUI which had the largest effect on the prediction. Because the Site EUI was high, the score was predicted to be low. However, in this case, the Energy Star Score was high despite the high EUI. We can see that the model’s prediction logically does make sense, and it would be more interesting to look into why a building has both a high EUI and Energy Star Score because this is quite contradictory!


I learned so much from this project and got to go through every phase of building an ML model: data cleaning, EDA, feature engineering and selection, comparing models based on MAE, performing hyperparameter tuning and evaluating the best model, and interpreting the results using LIME. Having an effective model to predict the Energy Star Score of a building could be very impactful. Energy-efficient buildings are becoming more and more significant with the rise of the climate crisis, and not only can reduce greenhouse gas emissions but also save money for businesses.

You can check out all the code for this project here!

You can also watch the video I made here:

Thank you so much for reading this! I’m a 15-year-old passionate about sustainability, and am the author of “Chronicles of Illusions: The Blue Wild”. If you want to see more of my work, connect with me on LinkedIn, Twitter, or subscribe to my monthly newsletter!



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store