You can draw a trendline on any scatter chart. But should you?

We are going to look at plotting trendlines over scatter chart data and how we can validate them. We’ll also see that, even if they turn out to be not that great, they still might be useful.

A trendline should show some relationship between the variables being plotted. If we plot ice cream sales against summer temperatures we would likely see some correlation between the two variables. And if we were to plot a regression line it might well discover a linear relationship - the hotter it gets the more ice cream is sold (though at the extremes you might find that relationship breaks down, e.g. if it gets too hot, people may not go out and so will stop buying).

But how do we know if there really is a meaningful relationship? 

The validity of a trendline depends on how well it represents the underlying relationship in the data. Here are two extremes:

  • Plotting a trendline over random data is clearly not sensible - we can do it, but it won’t be meaningful.

  • Plotting a trendline over a set of points that have a strict linear relationship (where all the dots already form a straight line) will simply overlay a solid line on the dotted one - this may be valid but it won’t lead to any further insight or information.

Here are a couple of examples - the first one is made with pseudo-random data.

And this second one represents data with a strict linear relationship.

These are extremes and you will find most plots will be somewhere in between. For example, the following plot is linear data with noise added (to make it more realistic) so there ought to be a genuine relationship to be discovered.

You can easily see that the trendline demonstrates the underlying linearity of the relationship.

Here is the Python code that produced the chart above. We begin by generating random x values. The y values are calculated as 3 times x with some random noise added. We convert that into a Pandas dataframe and then plot it as a Plotly scatter chart.

import numpy as np
import pandas as pd
import plotly.express as px

# Generate random data
x = np.random.rand(100) * 100  # Random x values between 0 and 100
y = 3 * x + np.random.randn(100) * 30  # y = 3x with noise

# Create a DataFrame
data = pd.DataFrame({'x': x, 'y': y})

# Plot scatter chart with trendline
fig = px.scatter(data, x='x', y='y', title='Scatter Chart with Trendline',
                 labels={'x': 'X-axis', 'y': 'Y-axis'},
                 # Ordinary Least Squares regression for trendline
                 trendline='ols', trendline_color_override='red')  

# Show the plot
fig.show()

Plotly allows us to add a trendline very easily. Here we specify the Ordinary Least Squares (OLS) algorithm for generating a linear regression line.

OLS calculates a straight line through the data by minimizing the sum of squared residuals. A residual is the difference between an observed value and the corresponding predicted value, i.e. the value on the trendline at the same x position.
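To make the idea of minimizing squared residuals concrete, here is a minimal sketch that fits the line by hand with NumPy, using the standard closed-form OLS formulas for a single predictor. It is not how Plotly does it internally (Plotly delegates to statsmodels); the fixed seed and the helper name sum_squared_residuals are assumptions made for illustration.

```python
import numpy as np

# Synthetic data in the spirit of the example above: y = 3x plus noise.
rng = np.random.default_rng(42)  # fixed seed: an assumption, for repeatability
x = rng.random(100) * 100
y = 3 * x + rng.standard_normal(100) * 30

# Closed-form OLS estimates for the line y = slope * x + intercept
x_mean, y_mean = x.mean(), y.mean()
slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
intercept = y_mean - slope * x_mean


def sum_squared_residuals(m, c):
    """Sum of squared differences between observed y and the line m*x + c."""
    residuals = y - (m * x + c)
    return np.sum(residuals ** 2)


best = sum_squared_residuals(slope, intercept)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, SSR={best:.1f}")

# Nudging the line in any direction increases the sum of squared residuals
print(sum_squared_residuals(slope + 0.1, intercept) > best)
print(sum_squared_residuals(slope, intercept + 5) > best)
```

The fitted slope should land close to 3, and any other line you try will give a larger sum of squared residuals than the OLS line.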

But how do I know if it is valid?

Not all graphs are so easily interpreted, so how can we be more confident that the trendline is valid? Here are some ideas.

Visual checking

A trendline should exhibit a recognizable trend or pattern. (The pattern might not be linear, but we will limit ourselves to linear relationships here.) So, you can visually check that it looks right. If the plotted points are closely grouped around the regression line and there are not too many outliers then you probably have a valid chart.

But there are statistical methods we can use, too.

Correlation and p-values

You can check how well the trendline fits the data by finding the coefficient of determination (R²). For a linear regression like this one, R² is a number between 0 and 1 that gives the proportion of the variability in y that the line accounts for. R² values close to 1 indicate a strong fit, while values near 0 suggest the trendline does not explain much of the variability in the data.
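The R² calculation itself is short enough to sketch with NumPy alone. The function below is an illustrative helper (r_squared is a name made up for this example, not part of Plotly or statsmodels), and the three datasets are freshly generated stand-ins for the strictly linear, noisy and random cases, so the exact values will differ from the charts in this article.

```python
import numpy as np


def r_squared(x, y):
    """R² of the least-squares straight line through the points (x, y)."""
    slope, intercept = np.polyfit(x, y, 1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # variation the line leaves unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
    return 1 - ss_res / ss_tot


rng = np.random.default_rng(0)  # fixed seed: an assumption, for repeatability
x = rng.random(100) * 100

r2_linear = r_squared(x, 3 * x)                                  # strict line
r2_noisy = r_squared(x, 3 * x + rng.standard_normal(100) * 30)   # line plus noise
r2_random = r_squared(x, rng.random(100) * 100)                  # unrelated values

print(f"linear: {r2_linear:.3f}, noisy: {r2_noisy:.3f}, random: {r2_random:.3f}")
```

You should see a value of essentially 1 for the strict line, a high value for the noisy line and a value close to 0 for the random data.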

Looking at the p-value helps us decide whether the results of an experiment or study are meaningful or could simply have happened by chance. In regression analysis, the p-value is used to assess whether the relationship between the variables is statistically significant. It is the probability of seeing a relationship at least as strong as the observed one if the null hypothesis (no relationship) is true.

A low p-value (typically < 0.05) suggests that the relationship is statistically significant, meaning there is likely to be a real association between the independent variable(s) and the dependent variable. A high p-value (≥ 0.05) suggests that the relationship might not be statistically significant.
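One way to build an intuition for what a p-value means is a permutation test: shuffle the y values many times, which destroys any real relationship, and count how often the shuffled data shows a correlation at least as strong as the real one. This is a sketch of a different technique from the t-test that statsmodels reports, so the numbers will not match exactly, but it tests the same null hypothesis of no linear relationship. The function name permutation_p_value and its parameters are assumptions made for this example.

```python
import numpy as np


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate how often shuffled data correlates as strongly as the real data."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)  # break any link between x and y
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed:
            count += 1
    # Add 1 to both counts so the estimate is never exactly zero
    return (count + 1) / (n_permutations + 1)


rng = np.random.default_rng(1)  # fixed seed: an assumption, for repeatability
x = rng.random(100) * 100
y_noisy = 3 * x + rng.standard_normal(100) * 30   # genuine relationship
y_random = rng.random(100) * 100                  # no relationship

p_noisy = permutation_p_value(x, y_noisy)
p_random = permutation_p_value(x, y_random)
print("noisy linear data:", p_noisy)
print("random data:", p_random)
```

For the noisy linear data, practically no shuffle reproduces the observed correlation, so the estimated p-value is tiny. For the random data, the estimate will vary with the seed.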

Plotly uses the statsmodels library to draw trendlines and provides functions to retrieve the regression results including R² and the p-value.

Some of these results can be seen if you hover over the trendline in the Plotly chart, as you can see in the screenshot below.

But we can also access the values programmatically. Append the following to the code above and you will see the resulting values.

# Extract trendline results
results = px.get_trendline_results(fig)
ols_model = results.iloc[0]["px_fit_results"]  # OLS regression model

# Print p-value and R²
print("P-Value:", ols_model.pvalues[1])  # p-value for the slope
print("R²:", ols_model.rsquared)

The result can be seen below: R² is close to one and the p-value is very low, telling us that the trendline fits the data well and that the result is statistically significant.

   P-Value: 1.5509131226498723e-51
   R²: 0.9034004881292087

If we were to do the same thing with the other graphs we’d get these results:

Random data graph

P-Value: 0.4897676270987973
R²: 0.00488066061729886

Linear data graph

P-Value: 1.4099324978279833e-124
R²: 1.0

With the linear data the p-value is very low and R² is as high as it can be: 1. With the random data, the p-value is much higher than 0.05, telling us that the result could be random (and of course it is), while R² is very low, meaning that the trendline doesn’t fit the data well.

Not valid but useful, anyway

So that’s the formal way of looking at trendlines and their validity. But even when a trendline is not exactly valid, it can still be useful.

I recently plotted some historical weather data for London to see if I could detect the effect of global warming on summer temperatures.

I made a scatter graph like the one below. Can you see a trend?

It’s not striking but there does seem to be an upwards trend. I added a trendline thus:

Aha! There we see a definite trend upwards. Proof that London temperatures are rising. But many of the data points are not very close to the line.

So, what if I were to try and validate the trendline? Unfortunately, although the p-value is quite low, R² is only about 0.39 - nowhere near 1, so it doesn’t pass that validity test.

Does that mean my trendline is useless?

In the figure below I’ve added a line to the scatter chart and that makes the picture a little clearer.

There certainly is an overall rise in temperature but perhaps there is not a simple linear relationship between time and temperature. The temperatures wobble up and down around the trendline.

But the trendline is not useless.

Perhaps there are cyclical weather patterns that sometimes raise the overall temperature and at other times lower it. Also, weather, at least at a local level, is notoriously chaotic. The point is that while there is not a simple pattern to be discovered here, at least not without more data, there is an overall trend that the OLS line has highlighted.
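The same situation is easy to reproduce with made-up numbers. The sketch below uses entirely synthetic data, not the London weather data, and every value in it (the baseline, the size of the trend, the 11-year cycle and the noise level) is an assumption chosen only to mimic a slow rise hidden under wobble and noise.

```python
import numpy as np

# Synthetic "temperatures": NOT real weather data, all values are assumptions
rng = np.random.default_rng(1)
years = np.arange(1900, 2021)
t = years - years[0]
temps = (
    17                                   # baseline temperature
    + 0.015 * t                          # slow underlying upward trend
    + 0.8 * np.sin(2 * np.pi * t / 11)   # a cyclical wobble
    + rng.normal(0, 0.8, t.size)         # year-to-year noise
)

# Fit the OLS line and measure how much of the variation it explains
slope, intercept = np.polyfit(t, temps, 1)
residuals = temps - (slope * t + intercept)
r2 = 1 - np.sum(residuals ** 2) / np.sum((temps - temps.mean()) ** 2)

print(f"slope: {slope:.4f} degrees per year, R²: {r2:.2f}")
```

With data like this, R² comes out low because the cycle and the noise dominate from year to year, yet the fitted slope is still positive and close to the trend that was built in. The line explains little of the variation but still picks out the direction of travel.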

Conclusion

Using the measures we discussed above can give you confidence in the validity of your trendline. But, as we see above, a low R² doesn’t necessarily mean that the trendline shouldn’t be drawn.

There is one other thing you might want to consider as well: Do you have enough data? The weather data that I plotted is for the 20th and 21st centuries. What if I had been able to retrieve data from the beginning of the Industrial Revolution when carbon emissions began to increase? Or perhaps, beyond that. Maybe a different trend would emerge.


Thanks for reading and I hope that this has been useful. If you would like to see more articles like this then click on the link at the top of this page. You might also subscribe to my email list to know when new articles are published.

Code for this article is here