LZHW (Lempel-Ziv Huffman Welch)

LZHW is an open-source project under the MIT License that consists of a Python library and a command line tool. The algorithms are written in Cython, a superset of Python that compiles to C, so they are quite fast.

GitHub link: https://github.com/MNoorFawi/lzhw

Python Library

The Python library is a compression suite for compressing big lists and/or pandas dataframes using a combined algorithm (lzhw) developed from the Lempel-Ziv (LZ77), Huffman coding and Lempel-Ziv-Welch (LZW) techniques.

For a quick start on how to use the library, please have a look at the GitHub repo linked above, while the full documentation can be found here: https://mnoorfawi.github.io/lzhw/
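As a quick taste, here is a minimal sketch of the library API based on the repo's README; please verify the exact calls against the documentation above:

import lzhw

# Compress a list of values, then restore it.
sample = ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain"]
compressed = lzhw.LZHW(sample)
print(compressed.decompress())  # should print the original list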

Command Line Tool

The tool can be downloaded from the project's SourceForge page and works from the command line with no prior Python installation.

It works on Windows, and a Mac version will be available soon.

Tool Quick Start

Here is the tool's help argument, which shows how it works and what arguments it accepts:

lzhw -h

Output

usage: lzhw [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]] [-r ROWS]
            [-nh] [-p] [-j JOBS]

LZHW is a tabular data compression tool. It is used to compress excel, csv and
any flat file. Version: 0.0.10

optional arguments:
  -h, --help            show this help message and exit
  -d, --decompress      decompress input into output
  -f INPUT, --input INPUT
                        input file to be (de)compressed
  -o OUTPUT, --output OUTPUT
                        output where to save result
  -c COLUMNS [COLUMNS ...], --columns COLUMNS [COLUMNS ...]
                        select specific columns by names or indices (1-based)
                        to compress or decompress
  -r ROWS, --rows ROWS  select specific rows to decompress (1-based)
  -nh, --no-header      skip header / data to be compressed has no header
  -p, --parallel        compress or decompress in parallel
  -j JOBS, --jobs JOBS  Number of CPUs to use if parallel (default all but 2)

As we can see, the tool takes an input file with “-f” and an output with “-o”, where it puts the result, whether that is compression or decompression, as selected by the optional “-d” argument.

The tool also takes a “-c” argument, the columns, in case we want to compress or decompress only specific columns from the input file instead of dealing with all of them unnecessarily. This argument accepts names and indices separated by commas.

The “-nh”, --no-header, argument specifies that the data to be compressed has no header.

The “-r”, --rows, argument specifies the number of rows to decompress, in case we don't need to decompress all of them.

The “-p”, --parallel, argument makes compression and decompression run in parallel to speed things up, while the “-j”, --jobs, argument determines the number of CPUs to use; by default it is all CPUs minus 2.
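As an aside, that default presumably amounts to something like the following snippet; this is an assumed reconstruction of the tool's behavior, not its actual source code:

import os

# All available CPUs minus 2, floored at 1 (an assumption about the
# tool's internals, not its actual code).
jobs = max(1, (os.cpu_count() or 1) - 2)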

Compress

How to compress:

The tool is used through the command line. For those who are new to the command line, the easiest way to start is to put the lzhw.exe tool in the same folder as the sheet you want to compress. Then open that folder, click once in the address bar at the top where the directory path is shown, type cmd and press Enter, and a black command window will open where you can type the examples below.

Using the german_credit data from the UCI Machine Learning Repository [1]:

lzhw -f "german_credit.xlsx" -o "gc_comp.txt"

Reading files, Can take 1 minute or something ...
Running CScript.exe to convert xls file to csv for better performance

Microsoft (R) Windows Script Host Version 5.812
Copyright (C) Microsoft Corporation. All rights reserved.

100%|██████████████████████████████████████████████████████████████████| 62/62 [00:02<00:00, 21.92it/s]
Finalizing Compression ...
Creating gc_comp.txt file ...
time taken:  0.06792410214742024  minutes
Compressed Successfully

In parallel:

lzhw -f "german_credit.xlsx" -o "gc_comp.txt" -p

Reading files, Can take 1 minute or something ...
Running CScript.exe to convert xls file to csv for better performance

Microsoft (R) Windows Script Host Version 5.812
Copyright (C) Microsoft Corporation. All rights reserved.

100%|███████████████████████████████████████████████████████████████████| 62/62 [00:00<00:00, 74.28it/s]
Finalizing Compression ...
Creating gc_comp.txt file ...
time taken:  0.030775876839955647  minutes
Compressed Successfully

Now, let's say we are interested in compressing only the Age, Duration and Amount columns:

lzhw -f "german_credit.xlsx" -o "gc_subset.txt" -c Age,Duration,Amount

Reading files, Can take 1 minute or something ...
Running CScript.exe to convert xls file to csv for better performance

Microsoft (R) Windows Script Host Version 5.812
Copyright (C) Microsoft Corporation. All rights reserved.

100%|███████████████████████████████████████████████████| 3/3 [00:00<00:00, 249.99it/s]
Finalizing Compression ...
Creating gc_subset.txt file ...
time taken:  0.01437713384628296  minutes
Compressed Successfully

Decompress

Now it’s time to decompress:

If your original Excel file was big, with many rows and columns, it is better and faster to decompress it into a CSV file rather than directly into Excel, and then save the file as Excel if that format is necessary. This is because Python is not that fast at writing data to Excel, and the tool sometimes has “Corrupted Files” issues with Excel.
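For example, if the Excel format is needed at the end, a small pandas snippet (assuming pandas and an Excel writer such as openpyxl are installed) can do the conversion after decompressing to CSV:

import pandas as pd

# Read the decompressed CSV and save it as Excel only at the very end.
df = pd.read_csv("gc_decompressed.csv")
df.to_excel("gc_decompressed.xlsx", index=False)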

Decompressing in parallel using 2 CPUs:

lzhw -d -f "gc_comp.txt" -o "gc_decompressed.csv" -p -j 2

100%|████████████████████████████████████████████████████| 62/62 [00:00<00:00, 99.00it/s]
Finalizing Decompression ...
Creating gc_decompressed.csv file ...
time taken:  0.014344350496927897  minutes
Decompressed Successfully

Note how the -d argument is used.

Let's now check that it really was decompressed successfully:

head -n 4 gc_decompressed.csv

Duration,Amount,InstallmentRatePercentage,ResidenceDuration,Age,NumberExistingCredits,NumberPeopleMaintenance,Telephone,ForeignWorker,Class,CheckingAccountStatus.lt.0,CheckingAccountStatus.0.to.200,CheckingAccountStatus.gt.200,CheckingAccountStatus.none,CreditHistory.NoCredit.AllPaid,CreditHistory.ThisBank.AllPaid,CreditHistory.PaidDuly,CreditHistory.Delay,CreditHistory.Critical,Purpose.NewCar,Purpose.UsedCar,Purpose.Furniture.Equipment,Purpose.Radio.Television,Purpose.DomesticAppliance,Purpose.Repairs,Purpose.Education,Purpose.Vacation,Purpose.Retraining,Purpose.Business,Purpose.Other,SavingsAccountBonds.lt.100,SavingsAccountBonds.100.to.500,SavingsAccountBonds.500.to.1000,SavingsAccountBonds.gt.1000,SavingsAccountBonds.Unknown,EmploymentDuration.lt.1,EmploymentDuration.1.to.4,EmploymentDuration.4.to.7,EmploymentDuration.gt.7,EmploymentDuration.Unemployed,Personal.Male.Divorced.Seperated,Personal.Female.NotSingle,Personal.Male.Single,Personal.Male.Married.Widowed,Personal.Female.Single,OtherDebtorsGuarantors.None,OtherDebtorsGuarantors.CoApplicant,OtherDebtorsGuarantors.Guarantor,Property.RealEstate,Property.Insurance,Property.CarOther,Property.Unknown,OtherInstallmentPlans.Bank,OtherInstallmentPlans.Stores,OtherInstallmentPlans.None,Housing.Rent,Housing.Own,Housing.ForFree,Job.UnemployedUnskilled,Job.UnskilledResident,Job.SkilledEmployee,Job.Management.SelfEmp.HighlyQualified
6,1169,4,4,67,2,1,0,1,Good,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0
48,5951,2,2,22,1,1,1,1,Bad,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0
12,2096,2,3,49,1,2,1,1,Good,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0

It looks awful in the command line 😀 but it’s decompressed.
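Beyond eyeballing the head of the file, a quick pandas check could confirm the round trip; this is hypothetical helper code, assuming pandas can read the original file, and dtype differences may require some alignment before an exact comparison:

import pandas as pd

original = pd.read_excel("german_credit.xlsx")
decompressed = pd.read_csv("gc_decompressed.csv")
print(original.shape == decompressed.shape)  # same dimensions?
print(original.equals(decompressed))         # exact match, dtypes permitting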

Now let's say that we are only interested in decompressing the first two columns, but we don't remember how their names were spelled.

lzhw -d -f "gc_comp.txt" -o "gc_subset_de.csv" -c 1,2 

100%|███████████████████████████████████████████████████| 2/2 [00:00<00:00,  8.05it/s]
Finalizing Decompression ...
Creating gc_subset_de.csv file ...
time taken:  0.00028291543324788414  minutes
Decompressed Successfully

Now let’s have a look at the decompressed file:

head gc_subset_de.csv

Duration,Amount
6,1169
48,5951
12,2096
42,7882
24,4870
36,9055
24,2835
36,6948
12,3059

We can also use the -r argument to decompress specific rows from the data frame.

lzhw -d -f "gc_comp.txt" -o "gc_subset_de.csv" -r 4

100%|████████████████████████████████████████████████████| 62/62 [00:00<00:00, 369.69it/s]
Finalizing Decompression ...
Creating gc_subset_de.csv file ...
time taken:  0.007962568600972494  minutes
Decompressed Successfully

Here we decompressed only the first 4 rows, 1-based, including the header.

Let's look at how the data looks now:

cat "gc_subset_de.csv"

Duration,Amount,InstallmentRatePercentage,ResidenceDuration,Age,NumberExistingCredits,NumberPeopleMaintenance,Telephone,ForeignWorker,Class,CheckingAccountStatus.lt.0,CheckingAccountStatus.0.to.200,CheckingAccountStatus.gt.200,CheckingAccountStatus.none,CreditHistory.NoCredit.AllPaid,CreditHistory.ThisBank.AllPaid,CreditHistory.PaidDuly,CreditHistory.Delay,CreditHistory.Critical,Purpose.NewCar,Purpose.UsedCar,Purpose.Furniture.Equipment,Purpose.Radio.Television,Purpose.DomesticAppliance,Purpose.Repairs,Purpose.Education,Purpose.Vacation,Purpose.Retraining,Purpose.Business,Purpose.Other,SavingsAccountBonds.lt.100,SavingsAccountBonds.100.to.500,SavingsAccountBonds.500.to.1000,SavingsAccountBonds.gt.1000,SavingsAccountBonds.Unknown,EmploymentDuration.lt.1,EmploymentDuration.1.to.4,EmploymentDuration.4.to.7,EmploymentDuration.gt.7,EmploymentDuration.Unemployed,Personal.Male.Divorced.Seperated,Personal.Female.NotSingle,Personal.Male.Single,Personal.Male.Married.Widowed,Personal.Female.Single,OtherDebtorsGuarantors.None,OtherDebtorsGuarantors.CoApplicant,OtherDebtorsGuarantors.Guarantor,Property.RealEstate,Property.Insurance,Property.CarOther,Property.Unknown,OtherInstallmentPlans.Bank,OtherInstallmentPlans.Stores,OtherInstallmentPlans.None,Housing.Rent,Housing.Own,Housing.ForFree,Job.UnemployedUnskilled,Job.UnskilledResident,Job.SkilledEmployee,Job.Management.SelfEmp.HighlyQualified
6,1169,4,4,67,2,1,0,1,Good,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0
48,5951,2,2,22,1,1,1,1,Bad,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0
12,2096,2,3,49,1,2,1,1,Good,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,1,0,0,1,0,0
42,7882,2,4,45,1,2,1,1,Good,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,1,0,0,0,0,1,0,0,1,0,0,1,0

All the data is now only 5 rows, including the header.

Notes on the Tool

1- Compression is much faster than decompression, so it is good to compress sequentially and decompress in parallel.

2- This error message can appear while compressing or decompressing in parallel:

lzhw.exe [-h] [-d] -f INPUT -o OUTPUT [-c COLUMNS [COLUMNS ...]]
                [-r ROWS] [-nh]
lzhw.exe: error: the following arguments are required: -f/--input, -o/--output

It is totally fine; just press Enter and proceed, or leave it until it says “Compressed Successfully” or “Decompressed Successfully”.

The error is due to a bug in a parallelization library that has nothing to do with the tool, so it is okay.

3- When the progress bar of the column compression completes, it doesn't mean that the tool has finished, because it still needs to write the results. So you need to wait until the “Compressed Successfully” or “Decompressed Successfully” message appears.

4- The tool takes from 8 to 15 seconds to start working and compressing the first time, and then it runs faster and faster the more you use it.

DEFLATE Note

The techniques may seem similar to the DEFLATE algorithm, which uses both LZSS, a variant of LZ77, and Huffman coding, but I am not sure how its Huffman coding further compresses the triplets. I believe it compresses the triplets altogether, not as 3 separate lists as lzhw does, and it also doesn't use Lempel-Ziv-Welch for further compression.
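To make the difference concrete, here is a toy illustration (not lzhw's actual code) of storing LZ77-style triplets as three separate streams:

# Each triplet is (offset, length, next symbol).
triplets = [(0, 0, "a"), (0, 0, "b"), (2, 2, "a")]
offsets, lengths, symbols = map(list, zip(*triplets))
# lzhw-style: compress each of the three lists on its own,
# rather than entropy-coding the interleaved triplets as DEFLATE does.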

LZHW also uses a modified version of LZ77 that uses a dictionary, a key-value data structure, to store the already-seen patterns with their locations during the compression process, so that instead of blindly going back looking for matches, the algorithm knows exactly where to go. This speeds up the compression process.

For example, let's say the algorithm has just found “A” and needs to see where the longest match is among the previous sequences. It will do so using the dictionary {“A”: [1, 4, 5, 8]}: it will start looking for matches from these locations instead of blindly scanning for “A”'s indices.
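Here is a toy Python sketch of that idea (an illustration, not lzhw's actual implementation): the index maps a symbol to the positions where it has already been seen, and only those candidate positions are checked for the longest match:

from collections import defaultdict

def find_longest_match(data, pos, index):
    # Check only the candidate positions stored in the index for data[pos],
    # instead of scanning the whole window backwards.
    best_start, best_len = -1, 0
    for start in index[data[pos]]:
        length = 0
        while (pos + length < len(data)
               and start + length < pos
               and data[start + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_start, best_len = start, length
    return best_start, best_len

data = "ABABABA"
index = defaultdict(list)
for i in range(4):              # pretend the first 4 symbols were already seen
    index[data[i]].append(i)
print(find_longest_match(data, 4, index))  # (0, 3): longest match starts at 0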

The DEFLATE algorithm may be more complicated than lzhw, but the latter is designed specifically for tabular data to help in data science and data analysis projects.

Reference

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
