Background Statistical Interpretation

Figure 1 presents a histogram of the observed daily maximum temperatures in degrees Celsius for the month of September from 1951 to 2025. The observations have been grouped into 1°C intervals. The graph includes the mean and the 5% and 95% probability thresholds. The 5% threshold indicates that 5% of the observations in the dataset fall below this value, while the 95% threshold indicates that 95% of the observations fall below this value. Notably, 90% of the observations fall between these two thresholds, which corresponds to 328.5 days in a typical year of 365 days (365 x 0.90). Observations that fall outside these two thresholds are usually considered rare or extreme. In a year of 365 days, we would expect 18.25 days below the 5% threshold and 18.25 days above the 95% threshold (365 x 0.05 each).
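
As a rough sketch of how such thresholds can be found from a set of observations, the Python example below computes the mean and the 5% and 95% thresholds and checks what share of values falls between them. The temperatures are synthetic values generated only for illustration (the mean of 21.0 and spread of 4.5 are assumptions), not the observations plotted in Figure 1.

```python
import random
import statistics

# Synthetic stand-in for September daily maximum temperatures (deg C).
# These values are generated for illustration only; they are not real observations.
random.seed(42)
temps = [random.gauss(21.0, 4.5) for _ in range(30 * 75)]

# With n=20, statistics.quantiles returns 19 cut points at 5%, 10%, ..., 95%.
cuts = statistics.quantiles(temps, n=20)
p05, p95 = cuts[0], cuts[-1]
mean = statistics.fmean(temps)

# Share of observations that fall between the two thresholds (close to 0.90).
inside = sum(p05 <= t <= p95 for t in temps) / len(temps)

# Expected number of days in a 365-day year inside and outside the thresholds.
days_inside = 365 * 0.90
days_each_tail = 365 * 0.05

print(f"mean={mean:.1f}  5%={p05:.1f}  95%={p95:.1f}  share inside={inside:.3f}")
print(f"days inside={days_inside:.2f}  days in each tail={days_each_tail:.2f}")
```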

Variation in Climate Data

Climate data fluctuates daily and annually due to the inherent chaotic dynamism of the climate system. This system is influenced by various cause-and-effect factors that randomly affect the weather we experience. When we graph the measurement of a single weather variable over numerous observations, we often observe a pattern known as the normal distribution. The normal distribution holds importance in statistics because many real-world measurements, such as human height, test scores, and blood pressure, exhibit this symmetric bell-shaped curve when graphed (Figure 1).

Some important characteristics of the normal distribution include:

The mean, median, and mode for the observations are all found at the center of the normal distribution.
The left and right sides of the normal distribution mirror each other.
Most values occur near the mean, with probabilities tapering off smoothly toward both ends of the bell curve.
The total area under the bell curve is 1.00 or 100%. We can use this fact to determine the probability of occurrence of an observation.
Statisticians use two parameters to describe the normal distribution of a dataset: μ (mu), the mean, which describes the center, and σ (sigma), the standard deviation, which describes the spread of the data.
About 68% of values in a dataset lie within 1 standard deviation of the mean, 95% lie within 2 standard deviations, and 99.7% lie within 3 standard deviations.

From these two parameters, I have calculated the 5% and 95% probability thresholds for the daily maximum and minimum data from 1951 to 2000 described here so we can identify extremely warm and cold events. We can also use these thresholds to see if human-caused climate change is warming temperatures in the four Okanagan cities.
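
To illustrate how the mean and standard deviation determine these percentages and thresholds, here is a minimal Python sketch using the standard library's NormalDist. The values of mu and sigma are hypothetical and are not taken from the Okanagan data, and the sketch is not necessarily the procedure used to produce the thresholds on this website.

```python
from statistics import NormalDist

# Hypothetical parameters for one calendar day (deg C); not taken from the site's data.
mu, sigma = 21.0, 4.5
dist = NormalDist(mu, sigma)

# Values below which 5% and 95% of a normal distribution fall.
p05 = dist.inv_cdf(0.05)  # roughly mu - 1.645 * sigma
p95 = dist.inv_cdf(0.95)  # roughly mu + 1.645 * sigma

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
print(f"5% threshold: {p05:.2f}   95% threshold: {p95:.2f}")
```

The printed shares are approximately 0.68, 0.95, and 0.997, matching the rule of thumb above.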

On this website, you can find graphs that display the daily maximum and minimum temperatures for each of the four Okanagan cities from 1951 to 2026. These graphs include lines that represent the mean, the 5% probability, and the 95% probability values for each day based on data from 1951 to 2000. If temperatures are indeed gradually warming over time, we should see the distribution of daily measurements in the 21st century (2001 to 2026) shift above the 1951 to 2000 mean. For instance, Figure 2 illustrates Kelowna’s daily maximum temperatures during the year of the infamous 2021 Western North America heat wave. During this heat wave, numerous daily maximum temperature records were broken in southern British Columbia in late June. In Kelowna, daily maximum temperatures exceeded the 1951 to 2000 daily average for 35 consecutive days, starting on June 17 and ending on July 20. In all of 2021, 237 days were above the 1951 to 2000 daily average in Kelowna. If the climate of 2021 resembled the 1951 to 2000 period, we would expect only 182.5 days to be above the average (365 divided by 2).
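
Counts like these can be obtained by comparing each day's observation with the long-term average for the same calendar day. The Python sketch below shows one way to do that; the function names and the eight made-up values are assumptions for illustration, not Kelowna data.

```python
def days_above(observed, baseline):
    """Count days whose observed value exceeds the baseline value for the same day."""
    return sum(o > b for o, b in zip(observed, baseline))

def longest_run_above(observed, baseline):
    """Length of the longest streak of consecutive days above the baseline."""
    best = current = 0
    for o, b in zip(observed, baseline):
        current = current + 1 if o > b else 0
        best = max(best, current)
    return best

# Tiny made-up example (deg C), for illustration only.
observed = [24.1, 26.3, 27.0, 22.5, 29.8, 30.2, 31.0, 25.4]
baseline = [25.0, 25.2, 25.4, 25.6, 25.8, 26.0, 26.2, 26.4]

print(days_above(observed, baseline))         # 5 of the 8 days are above the baseline
print(longest_run_above(observed, baseline))  # longest streak is 3 days
print(365 / 2)  # days expected above the mean in a year if the distribution is symmetric
```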

Finding Data Trends - Best-Fit Lines

A best-fit line (or trendline) is a straight or curved line on a scatterplot (graph) that best represents the underlying trend of a set of data points. Best-fit lines are used to mathematically model the relationship of two variables suspected of being influenced by cause-and-effect and to make predictions.

A best-fit line can be created from a set of data with a statistical technique known as linear regression. In a simple two-variable linear regression analysis, one dependent variable is mathematically regressed on a single independent variable. One outcome of this analysis is an equation for a best-fit line that models the relationship between the dependent (called Y) and independent (called X) variables. For a linear, two-variable situation, this equation has the mathematical form:

Y = a + b X

Where,

Y is the value of the dependent variable,

X is the value of the independent variable,

a is the intercept of the regression line on the Y-axis when X = 0,

and b is the slope of the regression line, also called the regression coefficient. The sign (+ or -) of this coefficient describes the direction of the relationship between the X and Y variables. A positive slope indicates that Y increases as X increases. A negative slope indicates that Y decreases as X increases. The regression coefficient also describes the rate of change in the dependent variable Y relative to the independent variable X.
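
The intercept a and slope b are commonly estimated with the ordinary least-squares method. The Python sketch below shows that calculation on a small made-up dataset; the function name fit_line and the data values are assumptions for illustration, and this is not necessarily the software used for the analyses on this website.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates of the intercept a and slope b in Y = a + b X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope (regression coefficient)
    a = mean_y - b * mean_x    # intercept: value of Y when X = 0
    return a, b

# Made-up data in which Y falls as X rises, so the slope comes out negative.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [9.1, 6.9, 5.2, 2.8, 1.0]

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + ({b:.2f}) X")
```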

Figure 3 shows a scatterplot of surface minimum atmospheric pressure (mb - millibars) and surface maximum wind speed (kph - kilometers per hour) for 312 measurements of tropical storms and hurricanes that formed from 1996 to 2000 in the Atlantic, East Pacific, and West Pacific oceans. The regression equation for the best-fit line was determined to be:

Y = 2296.49 - 2.21488 X

From this equation, we can generate predictions of Y by knowing the value of X. For example, if we know X to be 950 mb, then it follows that we can calculate the corresponding wind speed as:

Y = 2296.49 - 2.21488 (950) = 2296.49 - 2104.14 = 192.35 kph
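
The same prediction can be written as a short Python function, which makes it easy to try other pressure values. The coefficients are the ones quoted for Figure 3; the function name predict_wind_speed and the list of pressures are assumptions for illustration.

```python
# Coefficients of the Figure 3 best-fit line, as quoted in the text.
A = 2296.49    # intercept (kph)
B = -2.21488   # slope (kph per mb)

def predict_wind_speed(pressure_mb):
    """Predicted surface maximum wind speed (kph) for a given minimum pressure (mb)."""
    return A + B * pressure_mb

# Lower central pressure corresponds to a higher predicted wind speed.
for pressure in (920, 950, 980, 1000):
    print(pressure, "mb ->", round(predict_wind_speed(pressure), 2), "kph")
```

Predictions are most trustworthy for pressures inside the range of the storms that were measured.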

Correlation Coefficient

The correlation coefficient is a statistic that measures the strength of the linear relationship between the independent and dependent variables in linear regression analysis. The symbol "r" often represents this statistic (see Figure 3). The value of the correlation coefficient ranges from -1.00 to 1.00 (see Figures 4 and 5). A value of 0.0 indicates that there is no linear relationship between the X and Y variables. The strength of the relationship between the X and Y variables increases as the value of r approaches 1.00 or -1.00. Perfect correlation occurs if r equals either 1.00 (perfect positive) or -1.00 (perfect negative). Positive correlation coefficients indicate that an increase in the value of the X variable is accompanied by an increase in the value of the Y variable. Negative correlation coefficients indicate that an increase in the value of the X variable is accompanied by a decrease in the value of the Y variable.
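
For readers who want to see how r is obtained, the Python sketch below computes the standard Pearson correlation coefficient. The function name pearson_r and the small datasets are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]       # Y rises in exact step with X: perfect positive
falling = [10, 8, 6, 4, 2]      # Y falls in exact step with X: perfect negative
scattered = [3, 9, 4, 10, 6]    # Y tends to rise with X, but with a lot of scatter

for label, ys in (("rising", rising), ("falling", falling), ("scattered", scattered)):
    print(label, round(pearson_r(xs, ys), 3))
```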

Associated with the calculation of the correlation coefficient is a probability value. This value gives the probability of finding a correlation at least as strong as the one found in an analysis purely by random chance, assuming no real relationship exists. In the analysis of surface minimum atmospheric pressure and surface maximum wind speed measurements for 312 tropical storms and hurricanes, the probability value was found to be less than (<) 0.0001 or 1 in 10,000 (P < 0.0001 in Figure 3). Probability values are provided for the various linear regression analyses found on this website.
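
One way to build intuition for the probability value is a permutation test: shuffle the Y values many times so that any real relationship with X is destroyed, and count how often chance alone produces a correlation at least as strong as the observed one. The Python sketch below uses this approach with made-up data. It is a different technique from the formula-based probability values that statistical software usually reports, and it is not necessarily how the values on this website were calculated.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(xs, ys, trials=5000, seed=0):
    """Share of random shuffles of ys whose |r| is at least as large as the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between X and Y
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / trials

# Made-up data with a strong negative relationship, for illustration only.
xs = list(range(1, 13))
ys = [30.1, 28.4, 27.9, 25.2, 24.8, 22.1, 21.5, 19.0, 18.2, 16.4, 14.9, 13.1]

r = pearson_r(xs, ys)
p = permutation_p_value(xs, ys)
print(f"r = {r:.4f}, estimated probability value = {p}")
```

A result of 0.0 here only means that none of the 5000 shuffles matched the observed strength, so the probability value is smaller than about 1 in 5000.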

Figure 2 Plot of the daily maximum temperatures recorded in Kelowna from January 1 to December 31, 2021. The graph also includes the average daily temperature over the 50-year period from 1951 to 2000, represented by a dark grey thick line. The 5% and 95% probability intervals for 1951 to 2000 are depicted by dotted light grey lines.

Figure 3 Regression best-fit line plotted on a scatterplot of surface minimum atmospheric pressure (X-variable) and surface maximum wind speed (Y-variable) measurements for 312 tropical storms and hurricanes. Also shown is the correlation coefficient (r = -0.9649) and its associated statistical significance or probability value (P < 0.0001).

Figure 4 Four examples of data point distributions with varying positive correlation coefficient values. A best-fit regression line through the center of the data points is also shown.

Figure 5 Four examples of data point distributions with varying negative correlation coefficient values. A best-fit regression line through the center of the data points is also shown.