Forecasting the Power Output of PV Systems Using an ML Algorithm

Naila Moloo
7 min read · Feb 20, 2022


Machine learning has enormous potential for solving supply and demand problems in the energy sector. One energy source that stands to benefit is solar, which is rapidly becoming one of the most promising sources of power but whose output is hard to match to demand.

Because solar output depends on fluctuating weather conditions, supply and demand can fall out of balance, and the resulting losses can be a large cost driver for companies. Much of this comes down to what is referred to as the duck curve. The duck curve plots net load (electricity demand minus solar generation) over the course of a day and shows the timing imbalance between peak demand and peak solar production.

The majority of solar energy is produced off-peak during the day, but demand peaks in the evening, when the sun isn’t shining. As more solar energy is exported to the grid, the curve deepens. This can be problematic! Thus, being able to predict the output of photovoltaic systems can be very useful, especially with our increasing generation of solar power.
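To make the duck curve concrete, here is a minimal sketch that computes net load hour by hour. The hourly numbers are made up purely for illustration and do not come from any real grid.

```python
# Hypothetical hourly values in MW for a few hours of one day (illustrative only)
hours = [6, 9, 12, 15, 18, 21]
demand = [60, 75, 80, 80, 100, 85]
solar = [0, 25, 50, 35, 5, 0]

# Net load is what the rest of the grid must supply after solar is used
net_load = [d - s for d, s in zip(demand, solar)]

for hour, load in zip(hours, net_load):
    print(f"{hour:02d}:00  net load = {load} MW")

# The steepest rise between consecutive samples is the evening "ramp"
ramps = [later - earlier for earlier, later in zip(net_load, net_load[1:])]
steepest = max(ramps)
print("Steepest ramp:", steepest, "MW")
```

The midday dip is the "belly" of the duck, and the steep climb towards the evening is the part that grid operators have to cover with other sources.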

Introduction to Project

There are numerous models for predicting the output of photovoltaic systems; however, these are mainly built for large-scale solar farms, which operate quite differently from smaller power plants. I created a machine learning model to predict the power output of PV systems in the short term, which matters for managing power grid production on a daily and hourly time frame. This is also helpful for resource planning, energy storage, and delivery!

I used a dataset from Kaggle containing the power generated by a solar plant, with 21 features. This was a regression ML task, since the goal was to map input features to a continuous target variable.

Data Exploration

After cleaning and wrangling my data, I started my exploratory data analysis. I created some histograms and then used the Pearson correlation coefficient to examine correlations between my features and the target variable, the power output in kW. I did this using the following code:

# Find all correlations and sort
correlations_data = data.corr()['power output'].sort_values()

# Print the most negative correlations
print(correlations_data.head(15), '\n')

# Print the most positive correlations
print(correlations_data.tail(15))

Importantly, one of the most negative correlations was between angle_of_incidence and the target variable, at -0.646537. Keep in mind that the coefficient ranges from -1 (a perfect negative linear relationship) to 1 (a perfect positive linear relationship).
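For intuition about what the Pearson coefficient measures, here is a minimal sketch that computes it by hand. The function name pearson_r and the toy readings are assumptions for illustration only; they are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (the 1/n factors cancel in the ratio)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy readings: output falls as the angle of incidence grows
angle = [10, 20, 30, 40, 50, 60]
power = [2900, 2700, 2600, 2100, 1800, 1300]

r = pearson_r(angle, power)
print(round(r, 3))
```

A value close to -1 means the two columns move in opposite directions almost linearly, which is the pattern the correlation table reports (more weakly) for the angle of incidence.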

The angle of incidence is the angle a ray of sunlight makes with a line perpendicular to a surface. A surface directly facing the sun has an angle of incidence of zero, and a surface parallel to the sun’s rays has an angle of incidence of 90°. As the angle of incidence increases, the intensity of direct light on the surface decreases (in proportion to the cosine of the angle). Therefore, if the angle of incidence is higher, the output will be lower because the surface receives less light.
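The link between angle of incidence and received light can be sketched with the cosine law for direct-beam light. This is a simplified illustration that ignores diffuse and reflected light; the function name relative_irradiance is an assumption, not something from the project code.

```python
import math

def relative_irradiance(angle_of_incidence_deg):
    """Fraction of direct-beam intensity reaching a flat surface (cosine law)."""
    # Beyond 90 degrees the sun is behind the surface, so no direct light arrives
    return max(0.0, math.cos(math.radians(angle_of_incidence_deg)))

for angle in (0, 30, 60, 90):
    print(angle, round(relative_irradiance(angle), 3))
```

At 60° the surface already receives only half of the direct-beam intensity it would get when facing the sun head-on.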

Another large correlation was between the zenith and power output, zenith being the angle between the sun’s rays and the vertical direction.

I then created a pairs plot to visualize the relationships between some of the most notable variables, including the power output, zenith, angle of incidence, and temperature above ground. I used the following correlation function:

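import numpy as np
import matplotlib.pyplot as plt

# Annotate the current axes with the Pearson correlation of x and y
def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1]
    ax = plt.gca()
    ax.annotate("r = {:.2f}".format(r),
                xy=(.2, .8), xycoords=ax.transAxes,
                size=20)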

And got this:

Feature Engineering and Selection

Perhaps the most time-consuming step in the machine learning workflow is feature engineering and selection. The feature engineering was fairly simple here, as there were no categorical variables to one-hot encode. All I did was add the natural log transformation of the numerical variables, which can help models learn non-linear relationships within the data and makes right-skewed data closer to normally distributed. The code for the natural log transformation looked like this:

# Create columns with log of numeric columns
for col in numeric_subset.columns:
    # Skip the power output column (the target)
    if col == 'power output':
        continue
    numeric_subset['log_' + col] = np.log(numeric_subset[col])
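To illustrate why a log transform helps with skewed data, here is a small sketch that measures skewness before and after taking the log. The data is synthetic (drawn from a log-normal distribution), and the skewness helper is an assumption written for this example rather than part of the project.

```python
import math
import random

def skewness(values):
    """Sample skewness: third standardized moment (about 0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

rng = random.Random(0)
# Synthetic right-skewed, strictly positive data (not from the solar dataset)
raw = [rng.lognormvariate(0, 1) for _ in range(2000)]
logged = [math.log(v) for v in raw]

print("skew before:", round(skewness(raw), 2))
print("skew after: ", round(skewness(logged), 2))
```

One caveat: the log is only defined for positive values, so columns that contain zeros or negatives need extra care (for example, adding a small offset first).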

For the feature selection, I removed collinear features because this improves the interpretability of the model, and unnecessary features can reduce the accuracy of the final model. I removed any collinear features with a correlation coefficient above the chosen threshold of 0.6, and ended up taking out total_cloud_cover_sfc and wind_speed_10_m_above_gnd. Before feature selection there were 4213 observations and 21 features, and afterwards there were 4213 observations and 10 features!
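One possible way to remove collinear features is sketched below. It is a simple greedy variant written for illustration and is not necessarily the exact procedure used in the project: a feature is kept only if its absolute correlation with every feature kept so far stays at or below the threshold. The column values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def drop_collinear(features, threshold=0.6):
    """Keep a feature only if |r| with every already-kept feature is <= threshold."""
    kept = []
    for name in features:
        if all(abs(pearson_r(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy columns; "zenith_copy" nearly duplicates "zenith"
features = {
    "zenith": [10, 20, 30, 40, 50, 60],
    "zenith_copy": [11, 19, 31, 41, 49, 61],
    "humidity": [55, 40, 70, 45, 65, 50],
}
print(drop_collinear(features))
```

Because the check is greedy, the order of the columns decides which member of a collinear pair survives. Note that a constant column would cause a division by zero in this sketch.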

For the naive baseline prediction, I set my code to predict the median value of the target on the training set, using the mean absolute error (MAE) as the metric, which is a common choice for regression. The baseline guess was 996.77, and the baseline performance on the test set was an MAE of 836.8352, a value quite easy to beat! Lastly, I normalized my data by scaling each feature to a range between 0 and 1, like this:

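from sklearn.preprocessing import MinMaxScaler
# Fit on the training set only, then apply the same scaling to the test set
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)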
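The naive median baseline described above can be sketched in a few lines. The target values here are hypothetical toy numbers in kW, not the article's data, and the mae helper is written just for this example.

```python
from statistics import median

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical toy targets in kW (not the article's data)
y_train = [0.0, 250.0, 900.0, 1500.0, 2800.0]
y_test = [100.0, 1000.0, 2400.0]

# The guess is learned from the training set only and reused for every test row
baseline_guess = median(y_train)
baseline_pred = [baseline_guess] * len(y_test)

print("Baseline guess:", baseline_guess)
print("Baseline MAE:", round(mae(y_test, baseline_pred), 4))
```

Any real model should beat this number; if it does not, the features are probably not carrying useful signal.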

Choosing The Best Model

I compared the MAE of five ML models on the data: the Linear Regressor, Support Vector Machine Regressor, Random Forest Regressor, Gradient Boosted Regressor, and K-Nearest Neighbors Regressor. The output was the following:

Linear Regression Performance on the test set: MAE = 403.3947
Support Vector Machine Regression Performance on the test set: MAE = 796.1451
Random Forest Regression Performance on the test set: MAE = 272.4012
Gradient Boosted Regression Performance on the test set: MAE = 299.9439
K-Nearest Neighbors Regression Performance on the test set: MAE = 347.4569

Evidently, the best model is the Random Forest Regressor with an MAE of 272.4012. We can also compare this to the baseline error of 836.8352 and see that this is an enormous improvement!
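To put the comparison in relative terms, this short sketch takes the MAE values reported above and computes how much each model reduces the error compared with the baseline. Only the arithmetic is new; the numbers are copied from the results listed in this article.

```python
# Test-set MAE values as reported in the article
baseline_mae = 836.8352
model_mae = {
    "Linear Regression": 403.3947,
    "Support Vector Machine": 796.1451,
    "Random Forest": 272.4012,
    "Gradient Boosted": 299.9439,
    "K-Nearest Neighbors": 347.4569,
}

# Relative reduction in error compared with the naive median baseline
improvement = {
    name: 100 * (baseline_mae - value) / baseline_mae
    for name, value in model_mae.items()
}

best = min(model_mae, key=model_mae.get)
for name, pct in sorted(improvement.items(), key=lambda kv: -kv[1]):
    print(f"{name:24s} {pct:5.1f}% lower MAE than baseline")
print("Best model:", best)
```

By this measure the Random Forest cuts the baseline error by roughly two thirds.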

Hyperparameter Tuning

The next step was tuning the hyperparameters of the Random Forest Regressor. I used both random search and cross-validation to do this. Random search allows us to define a range of options and then randomly selects combinations to try, and 4-fold cross-validation assesses the performance of the hyperparameters so that we can see which combination of hyperparameters yields the lowest MAE! The best combination was:






I then played around with the hyperparameter values individually a little bit which got me to the following code:

from sklearn.ensemble import RandomForestRegressor

final_model = RandomForestRegressor(
    n_estimators=900,
    max_depth=14,
    min_samples_leaf=1,
    min_samples_split=8,
    max_features='log2',
    random_state=42,
)
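The mechanics of random search with 4-fold cross-validation can be sketched without any ML library. In scikit-learn this is usually done with RandomizedSearchCV; the version below is a standard-library illustration only. It uses a toy 1-D nearest-neighbours regressor as a stand-in for the Random Forest, synthetic data, and a hypothetical search space, so none of the names or values come from the project.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k nearly equal, contiguous validation folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def knn_predict(train_x, train_y, x, n_neighbors, weighted):
    """Toy 1-D k-nearest-neighbours regressor (stand-in for the real model)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = order[:n_neighbors]
    if weighted:
        weights = [1.0 / (abs(train_x[i] - x) + 1e-9) for i in nearest]
    else:
        weights = [1.0] * len(nearest)
    return sum(w * train_y[i] for w, i in zip(weights, nearest)) / sum(weights)

def cv_mae(xs, ys, params, k=4):
    """Mean absolute error averaged over k validation folds."""
    fold_maes = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        errors = [
            abs(knn_predict(train_x, train_y, xs[i], **params) - ys[i])
            for i in val_idx
        ]
        fold_maes.append(mean(errors))
    return mean(fold_maes)

rng = random.Random(42)
# Synthetic data: y = x^2 plus noise (x is drawn at random, so folds are mixed)
xs = [rng.uniform(0, 10) for _ in range(80)]
ys = [x ** 2 + rng.gauss(0, 2) for x in xs]

# Hypothetical search space: each name maps to the options to sample from
search_space = {"n_neighbors": [1, 3, 5, 9, 15], "weighted": [True, False]}

results = []
for _ in range(6):  # try 6 randomly chosen combinations instead of the full grid
    params = {name: rng.choice(options) for name, options in search_space.items()}
    results.append((cv_mae(xs, ys, params), params))

best_mae, best_params = min(results, key=lambda item: item[0])
print(best_params, round(best_mae, 3))
```

Random search trades completeness for speed: it evaluates only a sample of the combinations, and cross-validation keeps the comparison between them fair because each one is scored on data it was not fitted on.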

Evaluating on the Test Set

The test set performance can be a great indicator of how well a model will perform if deployed in the real world. When I compared the performance of the default Random Forest Regressor to the tuned model, I got the following (with the error translated into an accuracy percentage):

Default model performance on the test set: Accuracy = 79.46%
Final model performance on the test set: Accuracy = 79.71%

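As a rough rule of thumb, a model with an accuracy between 70% and 90% is pretty solid. A score above 90% is not a problem in itself, but it is worth checking for overfitting by comparing training and test performance.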

When I used timeit to see how long it took to train both models, I got the following:

The time taken for the default model is 0.031437240000002475
The time taken for the final model is 0.03166563099999564

The final model only took 0.0002 more seconds to train than the default model.
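For reference, here is a minimal sketch of how two pieces of code can be timed with timeit. The two functions are hypothetical stand-ins for fitting the default and tuned models; the real measurements would wrap the actual fit calls instead.

```python
import timeit

def train_small():
    """Stand-in for fitting the default model (hypothetical workload)."""
    return sum(i * i for i in range(10_000))

def train_large():
    """Stand-in for fitting the tuned model (hypothetical workload)."""
    return sum(i * i for i in range(20_000))

# repeat() measures several times; the minimum is the least noisy estimate
small_time = min(timeit.repeat(train_small, number=1, repeat=5))
large_time = min(timeit.repeat(train_large, number=1, repeat=5))

print(f"The time taken for the small workload is {small_time:.6f} s")
print(f"The time taken for the large workload is {large_time:.6f} s")
print(f"Difference: {large_time - small_time:.6f} s")
```

Single timings this short are sensitive to whatever else the machine is doing, so repeating the measurement makes small differences more trustworthy.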

Local Interpretable Model-Agnostic Explanations

Sometimes machine learning can be thought of as a ‘black box’ because we don’t understand the inner workings of a model. This is where local interpretable model-agnostic explanations, or LIME, come in! LIME approximates any ML model with a local, interpretable model to explain an individual prediction. I wanted to look at a badly wrong prediction and understand why it occurred, so I displayed the predicted and true values for the instance with the largest residual using the following code:

# Predicted and true value for the instance with the largest residual
print('Prediction: %0.4f' % model.predict(wrong.reshape(1, -1))[0])
print('Actual Value: %0.4f' % y_test[np.argmax(residuals)])

# Explanation for wrong prediction
wrong_exp = explainer.explain_instance(data_row=wrong,
                                       predict_fn=model.predict)

And then plotted this,

Here, the prediction for the power output was 2555.5416 and the actual value was 428.0965. This is very off! In the above graph, the top feature, angle_of_incidence, contributed the most to the increase in the prediction. We can interpret this as saying that our model thought the power output would be much higher than it actually was because the angle of incidence was low.

Recall that a higher angle of incidence means lower output, because a surface directly facing the sun has an angle of incidence of zero (the best-case scenario). It therefore makes sense for the model to expect a high power output when the angle of incidence is low. We might want to ask why the power plant had such a low output despite the low angle of incidence. It could potentially have something to do with an unusual weather condition!

When I looked at a right prediction I got this graph,

Here, the prediction was 451.3608 and the actual value was 459.2989. We see that the angle of incidence is high, contributing largely to the lower power output.
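The "wrong" and "right" instances discussed above are the test rows with the largest and smallest residuals. A minimal sketch of that selection step, using hypothetical predictions and true values rather than the project's data, could look like this:

```python
# Hypothetical predictions and true values in kW (illustrative only)
y_pred = [2400.0, 460.0, 1210.0, 90.0]
y_true = [500.0, 455.0, 1000.0, 300.0]

# Absolute residual for each test instance
residuals = [abs(p - t) for p, t in zip(y_pred, y_true)]

# Index of the largest residual (worst prediction) and smallest (best prediction)
wrong_idx = max(range(len(residuals)), key=residuals.__getitem__)
right_idx = min(range(len(residuals)), key=residuals.__getitem__)

print("Worst:", y_pred[wrong_idx], "vs", y_true[wrong_idx])
print("Best: ", y_pred[right_idx], "vs", y_true[right_idx])
```

The rows at these two indices are the ones that would then be passed to the explainer.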


This article outlined my photovoltaic output project and was so much fun to work on! All my code can be found here, and if you want to learn about it in a more visual way you can check out my video :)

Thank you so much for reading this! I’m a 15-year-old passionate about sustainability, and am the author of “Chronicles of Illusions: The Blue Wild”. If you want to see more of my work, connect with me on LinkedIn, Twitter, or subscribe to my monthly newsletter!