Folder.distance_matrix

Computes the normalized compression distance between all documents inside the folder.
Parameters:
compressor : {'zlib','gzip','bzlib','lzma','raw'}
  Which compressor to use. Different compressors may yield different results when clustering.
  • 'zlib' produces asymmetrical matrices and smaller sizes for small strings.
  • 'gzip' behaves similarly to zlib (normally with larger compressed sizes).
  • 'bzlib' produces symmetrical matrices and is recommended when using data that was encoded using the zgli.encode_df function.
  • 'lzma' usually produces the smallest compressed sizes.
  • 'raw' uses the raw file sizes, i.e. files do not get compressed.
output_path : string
  The path where the distance matrix should be written.

compress_by_col : bool default = False
  Defines whether the data should be compressed by column or normally. Default = False, i.e. the files are compressed normally.

delimiter : string default = ,
  The delimiter that separates columns in the files; presumably relevant only when compress_by_col = True. Default = ','.

weights : list default = None
  Optional list of weights; presumably one weight per column, applied when compress_by_col = True. Default = None, i.e. no weighting.

verbose : int default = False
  Defines whether the distance matrix is printed to the screen. With verbose = 1 the matrix is printed in the same format as the output file.
dm_name : string default = distmatrix
  Defines the name for the distance matrix text file.

Returns:
distance_matrix : list(list)
  The normalized compression distances (NCDs) between every pair of files inside the folder.
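
The normalized compression distance is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed size in bytes. The sketch below illustrates that definition with Python's standard-library compressors. It is not zgli's implementation: the mapping from the option names to the zlib, gzip, bz2 and lzma modules, the helper names ncd and ncd_matrix, and the choice to force the diagonal to 0.0 (as seen in the example output) are all assumptions.

```python
import bz2
import gzip
import lzma
import zlib

# Assumed mapping from the documented option names to standard-library
# compressors; zgli's internal choices may differ.
COMPRESSORS = {
    "zlib": zlib.compress,
    "gzip": gzip.compress,
    "bzlib": bz2.compress,
    "lzma": lzma.compress,
    "raw": lambda data: data,  # no compression: size is the plain byte count
}


def ncd(x: bytes, y: bytes, compressor: str = "bzlib") -> float:
    """Normalized compression distance between two byte strings."""
    compress = COMPRESSORS[compressor]
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # size of the compressed concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(docs, compressor="bzlib"):
    """Pairwise NCDs as a list of lists, with 0.0 forced on the diagonal."""
    return [
        [0.0 if i == j else ncd(a, b, compressor) for j, b in enumerate(docs)]
        for i, a in enumerate(docs)
    ]
```

Because C(xy) and C(yx) can differ for a real compressor, ncd(x, y) and ncd(y, x) are not guaranteed to be equal, which is one way a matrix can end up asymmetrical.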
Outputs:
distance_matrix : .txt
  A .txt file containing the matrix in the same format as the one printed to the screen when verbose = 1.
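
If the text output needs to be loaded back, a parser along these lines may be enough. It assumes the layout shown in the example output (one row per file: the file name followed by whitespace-separated distances) and that file names contain no whitespace. The function name parse_distance_matrix is hypothetical and not part of zgli.

```python
def parse_distance_matrix(text):
    """Split matrix text into (names, rows), one row per non-empty line."""
    names, rows = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        names.append(parts[0])  # first field is the file name
        rows.append([float(value) for value in parts[1:]])
    return names, rows


# Two rows in the printed format, truncated to two columns for brevity
sample = "0_mouse.txt 0.0 0.941648\n0_rat.txt 0.941648 0.0\n"
names, matrix = parse_distance_matrix(sample)
```

To parse a saved file, read its contents with open(...).read() and pass the resulting string to the function.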
See also:
folder.get_file_lengths
  Compute a distance matrix of all files using the normalized compression distance.

folder.get_file_sizes
  Return the names of all the files inside the folder.

Example:

# Imports
>>> from zgli.folder import Folder

# Define Parameters
>>> data_path = '../../data/examples/10-mammals'

# Initialize Folder class
>>> folder = Folder(data_path)

# Compute matrix
>>> dm = folder.distance_matrix(compressor='bzlib', output_path='dm_folder', dm_name='distmatrix')

0_mouse.txt 0.0 0.941648 0.964551 0.967002 0.957282 0.960252 0.960088 0.967124 0.960965 0.965251
0_rat.txt 0.941648 0.0 0.966302 0.96132 0.958167 0.958732 0.966924 0.960157 0.955044 0.958172
1_graySeal.txt 0.964551 0.966302 0.0 0.77382 0.96105 0.959818 0.954923 0.964729 0.949891 0.942299
1_harborSeal.txt 0.967002 0.96132 0.77382 0.0 0.960009 0.960252 0.953671 0.962769 0.947334 0.94423
2_blueWhale.txt 0.957282 0.958167 0.96105 0.960009 0.0 0.955691 0.854465 0.960157 0.949342 0.95903
2_chimpanzee.txt 0.960252 0.958732 0.959818 0.960252 0.955691 0.0 0.953301 0.8649 0.949175 0.961604
2_finWhale.txt 0.960088 0.966924 0.954923 0.953671 0.854465 0.953301 0.0 0.956238 0.948465 0.958172
2_human.txt 0.967124 0.960157 0.964729 0.962769 0.960157 0.8649 0.956238 0.0 0.95123 0.959459
4_horse.txt 0.960965 0.955044 0.949891 0.947334 0.949342 0.949175 0.948465 0.95123 0.0 0.952166
5_cat.txt 0.965251 0.958172 0.942299 0.94423 0.95903 0.961604 0.958172 0.959459 0.952166 0.0
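
As a rough illustration of how such a matrix might be inspected, the sketch below checks symmetry and finds the most similar pair. It assumes the returned value is a plain list of numeric rows in the same order as the printed file names; the helpers is_symmetric and closest_pair are hypothetical, and the data is the first four rows and columns of the output above.

```python
def is_symmetric(dm, tol=1e-9):
    """True if dm[i][j] and dm[j][i] agree within tol for every pair."""
    n = len(dm)
    return all(abs(dm[i][j] - dm[j][i]) <= tol
               for i in range(n) for j in range(i + 1, n))


def closest_pair(names, dm):
    """Return (name_a, name_b, distance) for the smallest off-diagonal entry.

    Only the upper triangle is scanned, so this assumes a symmetric matrix.
    """
    best = None
    for i in range(len(dm)):
        for j in range(i + 1, len(dm)):
            if best is None or dm[i][j] < best[2]:
                best = (names[i], names[j], dm[i][j])
    return best


names = ["0_mouse.txt", "0_rat.txt", "1_graySeal.txt", "1_harborSeal.txt"]
dm = [
    [0.0, 0.941648, 0.964551, 0.967002],
    [0.941648, 0.0, 0.966302, 0.96132],
    [0.964551, 0.966302, 0.0, 0.77382],
    [0.967002, 0.96132, 0.77382, 0.0],
]

print(is_symmetric(dm))         # True
print(closest_pair(names, dm))  # ('1_graySeal.txt', '1_harborSeal.txt', 0.77382)
```

For a matrix produced with a compressor that yields asymmetrical results, is_symmetric would return False, and scanning only the upper triangle could miss the smallest entry.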