Analysis of alignments in R-Environment - The "best-fit" definition


Definition #3

Given a Cartesian plane and a set of points with coordinates (xn,yn), the points can be considered "aligned" within E (a goodness-of-fit parameter) if the absolute value of the correlation coefficient rxy is greater than E.

Note: rxy is the covariance of (x,y) divided by the product of the standard deviations of x and y. R provides the built-in function cor(x,y), which gives the same result as the expression cov(x,y)/(sd(x)*sd(y)).
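
The equivalence between the two expressions can be checked with a short sketch. The coordinates below are hypothetical sample values chosen only for illustration; they are not the points analysed in this article.

```r
# Hypothetical sample coordinates, for illustration only
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Correlation coefficient computed from its definition
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Correlation coefficient computed by the built-in function
r_builtin <- cor(x, y)

# TRUE when the two values agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```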


In other words, given a set of n points, the correlation coefficient tells you whether they are in some way "aligned": a value of E closer to 1 demands a more precise alignment.

The only problem arises when all the points lie on a horizontal or vertical line. In this case the correlation coefficient cannot be calculated, because the variance of x or of y is 0; for this reason we add the condition that if the variance of x or y is equal to 0, the points are aligned.
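
The degenerate case can be reproduced with a minimal sketch. The points are hypothetical and lie on a horizontal line; the call to suppressWarnings() is only there to keep the output readable.

```r
# Hypothetical points lying on a horizontal line (all y values equal)
x <- c(1, 2, 3, 4)
y <- c(5, 5, 5, 5)

# The variance of y is 0, so the denominator of the coefficient vanishes
var(y)

# cor() returns NA in this case (R also emits a warning, suppressed here)
r <- suppressWarnings(cor(x, y))
is.na(r)
```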

It is easy to define a boolean function (in the R environment) which returns TRUE or FALSE when called with the arrays of coordinates and the tolerance as input:

aligned <- function(x,y,e) { if(var(x)==0 || var(y)==0) TRUE else abs(cor(x,y))>e }

The function takes as input two arrays x and y containing the coordinates of the set of points, and the goodness-of-fit parameter e.
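
As a quick usage sketch, the function can be tried on three small sets of points. All coordinates here are hypothetical and serve only to show the three possible situations: nearly aligned points, a horizontal line, and scattered points.

```r
# Definition #3 as a function (same logic as described in the text)
aligned <- function(x, y, e) {
  if (var(x) == 0 || var(y) == 0) TRUE else abs(cor(x, y)) > e
}

# Hypothetical x coordinates shared by the three examples
x <- c(0, 1, 2, 3)

# Points very close to the line y = x: correlation is above 0.999
near_line <- aligned(x, c(0, 1.01, 1.99, 3), 0.999)

# Points on a horizontal line: handled by the zero-variance condition
horizontal <- aligned(x, c(5, 5, 5, 5), 0.999)

# Scattered points: correlation is 0.4, far below the threshold
scattered <- aligned(x, c(0, 3, 1, 2), 0.999)

c(near_line, horizontal, scattered)
```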

Verify an alignment within a specific area (2)

Let us consider again the alignment which led to a paradoxical result with Definition #1.

In order to test the function aligned we can define the two arrays of coordinates px and py (from the points 16, 85, 89, 94):

px <- c(x[16],x[85],x[89],x[94])
py <- c(y[16],y[85],y[89],y[94])

By calling the function with e = 0.999, that is aligned(px,py,0.999),


we get TRUE: the four points are aligned within a parameter of goodness-of-fit equal to 0.999.

It doesn't raise any paradox, because the same result is obtained on subsets of the four points:


All three calls correctly return TRUE.
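
One way to run this kind of check systematically is to test every three-point subset with combn(). This is only a sketch: the coordinates are hypothetical stand-ins for px and py, and the choice of testing all subsets of size three is an assumption, not necessarily the subsets used in the original test.

```r
# Definition #3 as a function (same logic as described in the text)
aligned <- function(x, y, e) {
  if (var(x) == 0 || var(y) == 0) TRUE else abs(cor(x, y)) > e
}

# Hypothetical, nearly collinear coordinates standing in for px and py
px <- c(0, 10, 20, 30)
py <- c(0, 10.01, 19.99, 30)

# Each column of 'subsets' holds the indices of one three-point subset
subsets <- combn(seq_along(px), 3)

# Apply the alignment test to every subset
results <- apply(subsets, 2, function(idx) aligned(px[idx], py[idx], 0.999))

# TRUE only if the full set and all its three-point subsets pass the test
aligned(px, py, 0.999) && all(results)
```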

It is easy to draw the line "fitting" the set of points: just calculate the regression line (i.e. the line chosen so that it comes as close to the points as possible) and ask R to draw it:

rl <- lm(py~px)
abline(rl)   # adds the fitted line to the current plot

You can download the complete R file, bestfit.txt, and test it in the R environment.

Definition #4

With the regression-line method it is easy to quantify the "goodness of fit": for each point it is the perpendicular distance to the line of best fit. This suggests a new definition of "alignment" based on the line of best fit and on the maximum tolerated distance E:

Given a Cartesian plane and a set of points with coordinates (xn,yn), the points can be considered "aligned" with a tolerance E if, once the regression line which best fits the points has been calculated, each point lies at a distance from that line smaller than E.

In other words, given a set of n points you can calculate the slope (m) and the intercept (q) of the regression line best fitting the set: the points are "aligned" if the perpendicular distance of each (x,y) from the line y=mx+q is smaller than E.

The function is easily defined. First we define a function calculating the distance between a point and a line:

distance<-function(x,y,m,q) { abs((y-m*x-q)/sqrt(m^2+1)) }
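
The formula can be checked by hand on a couple of simple cases. The points and lines below are hypothetical values chosen so that the expected distances are easy to verify.

```r
# Perpendicular distance between the point (x, y) and the line y = m*x + q
distance <- function(x, y, m, q) abs((y - m * x - q) / sqrt(m^2 + 1))

# Horizontal line y = 0: the distance of (3, 4) is simply |y| = 4
d1 <- distance(3, 4, 0, 0)

# Line y = x: the distance of (0, 5) is 5 / sqrt(2)
d2 <- distance(0, 5, 1, 0)

# The function is vectorised: one distance for each point
d3 <- distance(c(0, 1, 2), c(1, 1, 1), 0, 0)
```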

Then we define the boolean function aligned in this way:

aligned <- function(x,y,e) {
  rl <- lm(y~x)                       # regression line best fitting the points
  m <- as.numeric(coef(rl)[2])        # slope
  q <- as.numeric(coef(rl)[1])        # intercept
  all(distance(x,y,m,q) < e)          # TRUE only if every point is closer than e
}

This function is more "user friendly": the input is the same as for the aligned function of Definition #3, but it is easier to choose the value of E, because it is expressed in the same unit of measurement as the coordinates. So if your coordinates are in meters, E is in meters too.
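
A small usage sketch shows how the tolerance behaves. The coordinates are hypothetical values in meters; for these points the largest perpendicular distance from the regression line is about 0.9 meters, so the test passes with a tolerance of 1 meter and fails with 0.5 meters.

```r
# Perpendicular distance between the point (x, y) and the line y = m*x + q
distance <- function(x, y, m, q) abs((y - m * x - q) / sqrt(m^2 + 1))

# Definition #4: every point must be closer than e to the regression line
aligned <- function(x, y, e) {
  rl <- lm(y ~ x)
  m <- as.numeric(coef(rl)[2])   # slope
  q <- as.numeric(coef(rl)[1])   # intercept
  all(distance(x, y, m, q) < e)
}

# Hypothetical coordinates, in meters
x <- c(0, 100, 200, 300)
y <- c(0, 1, -1, 0)

loose  <- aligned(x, y, 1)     # tolerance of 1 meter
strict <- aligned(x, y, 0.5)   # tolerance of 0.5 meters
```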

By using the latter definition of aligned, you can call aligned(px,py,25).

It returns TRUE, so we can say that these four points are at a distance from the line smaller than 25 meters. In order to get the exact distances from the best-fit line you can define a new function:

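distances <- function(x,y) {
  rl <- lm(y~x)                       # regression line best fitting the points
  m <- as.numeric(coef(rl)[2])        # slope
  q <- as.numeric(coef(rl)[1])        # intercept
  distance(x,y,m,q)                   # one distance per point
}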

Called with the arrays of coordinates px and py, it returns an array of distances from the regression line:

13.758366 5.419433 23.150012 3.972212

Point 16 is 13.76 units from the line, point 85 is 5.42 units from the line, and so on.

On its own this output gives no information about the alignment; you have to test whether the distances are all smaller than E: if so, the points are aligned with a tolerance E.
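
The test on the distances can be sketched directly on the four values reported above. The variable names are chosen here for illustration; the tolerance of 20 is a hypothetical value added for comparison with the tolerance of 25.

```r
# Distances reported above, labelled with the point numbers
d <- c(13.758366, 5.419433, 23.150012, 3.972212)
names(d) <- c("16", "85", "89", "94")

# Aligned with a tolerance of 25: every distance is below the threshold
within_25 <- all(d < 25)

# Not aligned with a hypothetical tolerance of 20: one point is too far
within_20 <- all(d < 20)

# The farthest point, and the smallest tolerance that must be exceeded
farthest <- names(d)[which.max(d)]
max(d)
```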

This leads to an interesting point: there is a best fit line for each set of points. By adding the point 12 to the list:

px <- c(px,x[12])
py <- c(py,y[12])

we can draw the new regression line:

rl <- lm(py~px)
abline(rl)   # adds the new fitted line to the current plot

and the distances are clearly greater than before:

3365.7148 1036.0479 2164.4213 629.8708 5936.3132

The result should be read keeping in mind the order in which the points were defined: 16, 85, 89, 94 and 12. So point 12 is at a distance of 5936.31 units from the regression line.

With E=6000, that is by calling aligned(px,py,6000), the five points can still formally be considered aligned (!).
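
How strongly the verdict depends on the chosen tolerance can be seen on the five distances listed above. This is a minimal sketch using those reported values; the variable names are illustrative.

```r
# Distances reported above for the five points, labelled with point numbers
d5 <- c(3365.7148, 1036.0479, 2164.4213, 629.8708, 5936.3132)
names(d5) <- c("16", "85", "89", "94", "12")

# With a tolerance of 6000 every point passes the test
ok_6000 <- all(d5 < 6000)

# With the earlier tolerance of 25 none of the points would pass
ok_25 <- all(d5 < 25)

# The newly added point is the one farthest from the regression line
farthest <- names(d5)[which.max(d5)]
```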


You can download the complete R file, bestfit2.txt, and test it in the R environment.

© 2019 Mariano Tomatis Antoniono