ncd

Getting started

Zgli.Folder

Zgli.encoder

Zgli.tree

1.Kolmogorov complexity

The Kolmogorov complexity of an object is the length of the shortest program in a predetermined syntax that when run, produces that same object as output. As an example that illustrates this concept, consider the following strings:

String	Size
S1: "ggggggggggggggggg"	size = 17
S2: "LATQvgkCQaNwEadqO"	size = 17

Notice that although they have the same size in terms of algorithmic information needed to describe the two strings, one can use the following programs P1 and P2 describing S1 and S2, respectively:

Program	Program length
P1: "g" * 17	𝐾(S1) ≈ 4
P2: "LATQvgkCQaNwEadqO"	𝐾(S2) ≈ 17

Notice that the information in S1 is basically described by the number of times "g" is repeated and therefore, the Kolmogorov Complexity of S1 is much shorter than writing the entire string. On the other hand, there are no patterns that one can use to describe in a shorter manner the string S2, therefore its Kolmogorov Complexity is nearly maximally, i.e., approximately equal to the size of S2.

2.Information Distance

The Normalized Compression Distance measures the similarity between two objects, but to understand it we first need to take a look at the concept of information distance and its normalized version. The Information Distance represents the distance between two objects 𝑥 and 𝑦 by the shortest program (the one corresponding to Kolmogorov Complexity) which transforms 𝑥 into 𝑦 and vice-versa.

3.Normalized Information Distance

The information distance will then give us the absolute distance between two objects independently of the size of the objects, but if we want to measure similarity between objects, we must first normalize this distance to get the relative distances between objects. For example, the ID can tell us that two strings differ by 17 bits, but does not take into account if this difference is between two strings of size 50 or 500. For this reason, we divide the information distance by the size of the biggest description of the two strings.

4.Normalized Compression Distance

With the NID we can then, theoretically, measure the similarity between any computer file. However, since this distance is calculated based on the representation of each object with the least Kolmogorov Complexity, and this notion, in turn, is not computable, we have to use an approximation of this measure: the Normalized Compression Distance (NCD). The Normalized Compression Distance approximates of K by using real-world compressors Z (i.e GZIP, BZLIB, ZLIB), to represent the size of compressed objects. Having this notion in mind, the NID was rewritten to be applicable in the real world

The distance is calculated by first appending both files together and compressing them. Then both files are compressed individually taking the difference between the appended compressed file size and the smallest individual compressed file size. Finally, this difference is normalized by dividing it by the size of the largest individual compressed file. Since we are using compressors Z to approximate K of each object it is obvious that the better the Z is the better the NCD results.

zgli.Folder

zgli.Encoder

zgli.Tree

Changelog