3.0

3.0 | 3 ratings Rate this file 52 downloads (last 30 days) File Size: 1.31 KB File ID: #24599

Pairwise Euclidean distances

by Michael Chen

 

30 Jun 2009 (Updated 03 Jul 2009)

Code covered by BSD License  

Fully vectorized function to compute square Euclidean or Mahalanobis distances between vectors.

Download Now | Watch this File

File Information
Description

function D = sqdistance(A, B, M)
% Square Euclidean or Mahalanobis distances between all sample pairs
% A: d x n1 data matrix
% B: d x n2 data matrix
% M: d x d Mahalanobis matrix
% D: n1 x n2 pairwise square distance matrix
% d and n are the dimension and the number of vectors.

The output is a matrix of square of Euclidean or Mahalanobis distances between vectors in the inputs. Column vectors are assumed in this code.

When two matrices A and B are inputed, this function computes the square Euclidean distances between them. If an extra positive definite matrix M is inputed, it computes Mahalanobis distances.

If only one matrix A is inputed, it computes square Euclidean distances between vectors in A. In this case, it is equivalent to the square of pdist function in matlab statistics toolbox but faster.

Sample code:
d=1000;n1=5000;n2=6000;
A=rand(d,n1);B=rand(d,n2);
M=rand(d,d);M=M*M'+eye(d);
D1=sqdistance(A,B);
D2=sqdistance(A);
D3=sqdistance(A,B,M);

MATLAB release MATLAB 7.8 (R2009a)
Zip File Content  
Other Files license.txt,
sqdistance.m
Tags for This File  
Everyone's Tags
Tags I've Applied
Add New Tags Please login to tag files.
Comments and Ratings (8)
02 Jul 2009 John D'Errico

The problem with this code is it is potentially inaccurate. It uses an identity that people are oh so impressed with, but squares the numbers unnecessarily. I'll talk about this problem eventually, but first, let me discuss another major flaw with this code.

It does not actually compute the Euclidean distance. It computes the square of that distance. As such, it has the wrong units. Don't forget that if your data has units miles, then all "distances" produced by this function will have units of miles squared.

Note that this code claims to produce the same numbers as does the pdist function. But it does not. pdist produces TRUE distances, not the square of the distance. There is a difference.

A = randn(3,5);
B = randn(3,4);

distance(A,B)
ans =
       2.6333 7.7299 4.5349 4.6107
       3.1854 4.636 9.1202 1.0953
        2.277 9.2126 3.9594 9.9298
       5.9439 1.416 6.8336 2.748
       3.1344 0.96642 7.4918 1.6271

The true interpoint distances, as one will normally see by any correct tool available to compute distance is closer to this:

sqrt(distance(A,B))
ans =
       1.6228 2.7803 2.1295 2.1472
       1.7848 2.1531 3.02 1.0466
        1.509 3.0352 1.9898 3.1512
        2.438 1.19 2.6141 1.6577
       1.7704 0.98306 2.7371 1.2756

I imagine the author will claim that by not computing the square root, it saves time. I suppose so, but to then claim that the result is a distance as most expect to see would be misleading. And surely to claim that it produces the same as other codes is very misleading.

A nice property for a distance measure to have is linearity. We would like to see that

distance(k*A,k*B) = k*distance(A,B)

It is something that pdist will give you. But this is not a property one will gain from distance.

On to the other problem, the one that I see as most damaging. Recall the results of the above experiment for distance(A,B). Now, try adding any fixed offset to both A and B. I'll add a large one to the matrices to show that complete trash is generated.

distance(A+1e8,B+1e8)
ans =
     8 8 8 8
     4 -4 4 4
     4 4 4 12
     8 0 8 0
     0 0 8 0

See that NO significant digits remain in the result.

A higher quality tool (like pdist for example) will survive even this extreme test quite nicely. As it turns out, this is not an extreme test for pdist. (Since pdist does only interpoint distances for a single matrix, these numbers will not be comparable. See only that adding 1e8 did not change any digits printed out.)

pdist(A'+1e8)
ans =
  Columns 1 through 8
       1.9891 1.8594 3.139 2.8911 2.7152 2.6316 1.5634 3.9224
  Columns 9 through 10
       3.1472 1.7458

pdist(A')
ans =
  Columns 1 through 8
       1.9891 1.8594 3.139 2.8911 2.7152 2.6316 1.5634 3.9224
  Columns 9 through 10
       3.1472 1.7458

I will point out that all of the above mentioned flaws could have been repaired with a few simple changes to the code. Had the code been written as follows, it would take only about 20% longer to execute in my quick test, but it would have worked properly and robustly.

if nargin == 1
    B = A;
    mu = mean(A,2);
else
    mu = (mean(A,2) + mean(B,2))/2;
end
A = bsxfun(@minus,A,mu);
B = bsxfun(@minus,B,mu);
D = full(-2*A'*B);
D = bsxfun(@plus,D,full(sum(B.^2)));
D = sqrt(bsxfun(@plus,D,full(sum(A.^2)')));

Finally, the help for this code is itself wrong. Here is what it says when you do "help distance".

Compute distances between all sample pairs
  X: d x n data matrix
  D: n x n pairwise distance matrix
  Written by Mo Chen (mochen@ie.cuhk.edu.hk). March 2009.

See that it tells you that X is a d by n matrix of data. Does it tell you that the function actually accepts TWO arguments? That if there are two arguments, then it computes distances between all pairs of columns of the two matrices?

No error checks to verify the data actually conforms for distance computation. I'll be generous and give this a 2 star rating.

02 Jul 2009 Michael Chen

The speed gain is not that this code does not compute sqrt but that it has no for loops, which is the main purpose of this function: demostrating how to vectorized the code in such scenarios.
The suggestion of robustness is valuable, lthough I've never suffer the extreme case in my practice,
The code is updated to include the centerization step and fix the outdated comments.

02 Jul 2009 John D'Errico

Somewhat better now in ONE respect. However, just because you have never suffered from the extreme case I showed does not mean that you have never had the problem. You've just never noticed it, or even known to look.

There is still a problem. Perhaps my long comments before were not extensive enough. This is NOT actually a valid distance metric. NOT. NOT. Read the definition of a distance metric. Here are a few sites:

http://www.mathreference.com/top-ms,dm.html
http://en.wikipedia.org/wiki/Metric_space

See that one requirement for a valid distance metric is the triangle inequality. The triangle inequality states that

   D(x,y) <= D(x,z) + D(y,z)

Does distance satisfy that? TRY IT!

distance(1,5)
ans =
    16

distance(1,3) + distance(5,3)
ans =
     8

(distance(1,3) + distance(5,3)) > distance(1,5)
ans =
     0

So this function fails to satisfy one of the basic requirements for a distance metric. This does NOT compute a distance. Just because you call the function distance does not make it a distance. The failure to compute the sqrt makes this not a valid distance.

Next, this function fails to work properly in one dimension!

distance([1,2])
ans =
                    0.5 1.5
                    1.5 0.5

I would have expected to see the matrix [0 1;1 0] returned as the result. (Surely you will concede this fact.) Don't complain that it was my suggested change that caused it to fail, because the original code fails too here, and by a larger margin.

originaldistance([1,2])
ans =
     8 6
     6 2

Finally, the changes to the help made it more accurate, but still fail to describe the behavior of this code. The help now states that when when called with two arguments, it returns the square of the interpoint Euclidean distances between columns of the matrices. It does not state at all what happens when called with only one argument.

Looking slightly deeper at the code pointed out serious flaws still. My rating is verging closer to one star at this point.

02 Jul 2009 Michael Chen

If you want the Euclidean distance itself, nobody prevents you from taking a simple sqrt on top of this function, it wont cost you a second. On the other hand, there are a lot of situations that the square distance is required (or sufficient) not the distance, such as KNN, Kmeans, Spherical Gaussian density, etc.

This is just code for academic purpose, if you feel helpful, just use it where it is suitable. I'm not making some industry product, so give me a break.

03 Jul 2009 John D'Errico

Improving. It now runs as claimed for the 1-d case, so the bug is removed. I also like that this is called sqdistance. The problem with calling it distance is that it was not a distance. Better is to make that clear, that this returns the square of the Euclidean distance. So this is also good.

There is now even an attempt at error checking, in that the author uses the assert function to flag problems. However, assert allows you to provide TWO arguments. See what happens when I call sqdistance with improperly sized arrays:

sqdistance(rand(3,4),rand(2))
??? Error using ==> sqdistance at 16
Assertion failed.

All it tells me is "Assertion failed". For gods sake, what assertion? Use the second argument! Allow the code to exit gracefully and descriptively. Tell the user that the arguments are incompatible in size for this operation. To just tell them "Assertion failed" is silly. Why bother to do so?

Obviously from the authors last comment, this is just a code for his own academic purposes. So apparently it needs not be any good, or do what he claims it does. I'll argue that he is wrong.

When you post something on a website like this, hundreds or even many thousands of MATLAB users may use your code, or try to do so. They may look at it, hoping to learn something from what you post. So I'm sorry, but it would be a disservice on my part to any MATLAB user or student for me NOT to tell them that I see a problem, and what is wrong, and if possible, how the problem should best have been resolved (in my opinion.) And if this author has improved his code or coding style because of what I show to be wrong, then I've helped him too.

03 Jul 2009 Michael Chen

If you have read the code STL of C++, you will find there is little if statements, and almost no runtime check for input. That is because it reduces the efficiency.
It is more severe in matlab, when you call a function a lot of times, such as kmeans (try to add some 'if' in the inner loop you will see). There are different programming styles: 1) make a lot redundant check in the beginning of every function which sacrify efficiency for robustness; 2) just make sure it works right when input is right. Assuring input correctness is the caller's resonsibility.
These different styles are just trading off between robustness and efficiency. You prefer the robust one, that's fine, just do it your way. I like my way better.
I'm trying provide some functionality and idea here. When I read other's code, I always feel those redundant inputs verification are misleading, which makes it hard to see what the function is acualling doing. If some one (like you) wants to get some safty, I'm sure it is a easy jod for him to add the 'if's himself, the code is there after all. But I like my code to be clean.

03 Jul 2009 Michael Chen

By the way, reading you review reminds me some review comments of some of my papers. Some reviewers just like to focus on whether the formate is right, whether the citation is right even whether the spell is right but not the idea of the paper itself. That is realy a pity.

03 Jul 2009 Michael Chen

One more word for input verification, you can not check every aspects of the inputs. For example, checking whether the input matirx is positive definite in this code is just crazy which will cost more time than the function itself. One must end up at some point between checking everything and checking nothing, which is a design desicion the coder should make.
In such a simple code, i dont want nasty guarding code, which be even longer than the main funcional code, distracting the reader's attention.

Please login to add a comment or rating.
Updates
01 Jul 2009

update the description

02 Jul 2009

Add a centerization step for robustness purpose. Split the code for different number of input for efficiency purpose. Update comments.

02 Jul 2009

update to support Mahalanobis distance. fix a bug for one dimensional case.

03 Jul 2009

remove any redundant error check

Tag Activity for this File
Tag Applied By Date/Time
euclidean Michael Chen 01 Jul 2009 10:23:12
distance Michael Chen 01 Jul 2009 10:23:13
pairwise Michael Chen 01 Jul 2009 13:46:18
mahalanobis Michael Chen 03 Jul 2009 06:56:00
distance Daniele 21 Jul 2009 05:15:41
pairwise salih marioud 20 Sep 2009 08:39:19
 

MATLAB Central Terms of Use

NOTICE: Any content you submit to MATLAB Central, including personal information, is not subject to the protections which may be afforded information collected under other sections of The MathWorks, Inc. Web site. You are entirely responsible for all content that you upload, post, e-mail, transmit or otherwise make available via MATLAB Central. The MathWorks does not control the content posted by visitors to MATLAB Central and, does not guarantee the accuracy, integrity, or quality of such content. Under no circumstances will The MathWorks be liable in any way for any content not authored by The MathWorks, or any loss or damage of any kind incurred as a result of the use of any content posted, e-mailed, transmitted or otherwise made available via MATLAB Central. Read the complete Terms prior to use.

Contact us at files@mathworks.com