k-means document clustering using Apache Mahout command line

Tutorial overview

Have you spent a lot of time searching through tutorials on k-means document clustering with the Apache Mahout command line without ever quite mastering it? Apache Mahout… what is behind this strange name? This article is for you: it will guide you step by step through the whole process.

First, a word about the software environment used throughout this article. We have installed Cloudera’s open-source Apache Hadoop distribution on a single-node cluster running CentOS 6.3. For this tutorial we also need Mahout (version 0.7 or later) installed alongside the rest of the Hadoop ecosystem. Cloudera’s Hadoop distribution can be downloaded from this link: http://www.cloudera.com/downloads/quickstart_vms/5-7.html

First, let us check whether Mahout is already installed, and if so which version, with the following command:

rpm -qa | grep mahout

If it is not, there is no need to worry: we can simply install Mahout with this command (root privileges are required):

yum install mahout
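
Once installed, a quick way to check that the command-line driver works is to invoke mahout with no arguments; it should print the list of available driver programs (seqdirectory, seq2sparse, canopy, kmeans, clusterdump, among others):

mahout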

Once our environment is ready, we can finally get to the heart of the matter. k-means text clustering with the Apache Mahout command line comprises the following steps.

1. Getting the data:

The first step is to get the dataset that will serve as the raw material on which we test our clustering algorithm. There are several ways to do this. The simplest is to copy and paste a collection of articles from the internet, save them as text files, and place all of the text files in the same folder. Alternatively, we can use a ready-to-use public text collection available online. In this tutorial we will work with the datasets available for download through this link:

https://de.dariah.eu/tatom/datasets.html#datasets

We saved them in a folder named "tragedy".

For this tutorial we will use texts by Jane Austen and Charlotte Brontë.
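
If you prefer to script the download, here is a minimal sketch; the archive URL is a placeholder, so replace it with the actual link to the Austen and Brontë corpus taken from the datasets page above:

mkdir -p $HOME/tragedy
wget -O /tmp/austen-bronte.zip "<archive-link-from-the-datasets-page>"
unzip /tmp/austen-bronte.zip -d $HOME/tragedy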

2. Copying the text files to the Hadoop Distributed File System (HDFS)

After downloading the text collection locally, we need to copy it to HDFS so that Mahout can work with it.

hadoop fs -copyFromLocal $HOME/tragedy tragedy

NB: $HOME is the path to your local home directory; the second argument is the destination folder in HDFS, relative to your HDFS home directory.

We can verify that the operation has succeeded using this command:

hadoop fs -ls

This lists the files and folders in our HDFS home directory.
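
We can also list the copied folder itself and check its total size on HDFS; a quick hedged example:

hadoop fs -ls tragedy
hadoop fs -du -s -h tragedy

The first command should show one entry per text file, the second the total size of the corpus.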


3. Converting the dataset into SequenceFiles

The choice of file format matters a great deal in distributed computation. SequenceFile is a Hadoop flat file format in which each document is stored as a key-value pair: the key is the document id and the value is its content.

For instance: key: "Austen_Emma-part01", value: "Emma Woodhouse, handsome, clever, and rich, with a comfortable home and happy disposition …"

At this stage, we use the Apache Mahout seqdirectory command to convert our input text files into SequenceFiles.

mahout seqdirectory -i tragedy -o tragedy-seqfiles -c UTF-8 -chunk 5

-i : specifying the input directory

-o : specifying the output directory

-c : specifying the character encoding of the input files (UTF-8 here)

-chunk : specifying the size of each data chunk in MB

As in the previous step, we can check that the tragedy-seqfiles directory has been created using this command:

hadoop fs -ls
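
To look inside the generated SequenceFiles and verify the key-value pairs described above, Mahout ships a seqdumper utility. A hedged example (if your Mahout version does not accept a directory as -i, point it at the chunk file inside tragedy-seqfiles instead):

mahout seqdumper -i tragedy-seqfiles | head -n 40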


4. Converting the SequenceFiles to the sparse vector format

In this step, we convert each document in the SequenceFiles into a feature vector. What is a feature vector?

Most text-mining algorithms require a numerical representation of the texts in order to run properly. We therefore turn the text collection from the previous steps into numerical feature vectors: every document is represented as a vector in which each element corresponds to a word and its weight. There are two ways of weighting a word:

  • Local weight: the frequency of the term within the document (term frequency, tf).
  • Global weight: the weight of each word is expressed using tf-idf, short for term frequency–inverse document frequency. tf-idf is a numerical measure that reflects the importance of a word in a collection of documents: it takes the local weight of the word (the term frequency above) and multiplies it by an inverse document frequency factor, so that rarer words get a higher weight (see the small worked example below).
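
Mahout's seq2sparse applies a tf-idf weighting of this kind; the exact normalization may differ, but here is a rough worked example using the common formula tf-idf(t, d) = tf(t, d) × log(N / df(t)), with illustrative numbers: if the word "Woodhouse" occurs 5 times in a document (tf = 5), the corpus holds N = 20 documents and the word appears in df = 2 of them, its weight is 5 × log(20/2) ≈ 5 × 2.3 ≈ 11.5 (natural logarithm), whereas a word occurring in all 20 documents gets 5 × log(20/20) = 0.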

For this step, we use the Apache Mahout seq2sparse command:

mahout seq2sparse -nv -i tragedy-seqfiles -o tragedy-vectors 

-i : specifying the input directory

-o : specifying the output directory

-nv : keep named vectors, i.e. preserve the file names so they can be displayed later in the clustering results


We can check that the tragedy-vectors directory has been created and that it already contains 7 items using the following command:

hadoop fs -ls tragedy-vectors
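
Among those 7 items you should find, in particular, the tf-vectors and tfidf-vectors directories used below, as well as a dictionary file mapping every term to an integer id. To peek at the dictionary, a hedged example (assuming it has its usual name, dictionary.file-0):

mahout seqdumper -i tragedy-vectors/dictionary.file-0 | head -n 20

Each printed key-value pair should be a term together with its integer id.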


5. Running the k-means text clustering algorithm

Before applying the k-means clustering algorithm to our textual data, one simple step remains: we first run canopy clustering on the data to obtain the initial centroids that k-means will use.

mahout canopy -i tragedy-vectors/tf-vectors -o tragedy-vectors/tragedy-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000

-i : specifying the input directory

-o : specifying the output directory

-dm: specifying the distance measure used by the canopy algorithm; for text clustering, cosine distance is a well-suited and commonly used choice

-t1, -t2: the distance thresholds used by the canopy algorithm to form the canopies.


Finally, we can check that tragedy-canopy-centroids has been created successfully using this command:

hadoop fs -ls tragedy-vectors/tragedy-canopy-centroids
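
Out of curiosity, you can also dump the generated canopies to see how many initial centroids k-means will start from. A hedged sketch, assuming the canopy driver wrote its output under a clusters-0-final subdirectory (the exact subdirectory name may differ between Mahout versions, so adjust the path to what the listing above shows):

mahout clusterdump -i tragedy-vectors/tragedy-canopy-centroids/clusters-0-final -o canopies.txt

Each cluster entry in canopies.txt then corresponds to one canopy, i.e. one initial centroid.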


Once the initial centroids have been generated, we can finally run the k-means algorithm on our documents.

mahout kmeans -i tragedy-vectors/tfidf-vectors -c tragedy-vectors/tragedy-canopy-centroids -o tragedy-kmeans-clusters -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cl -cd 0.1 -ow -x 20 -k 10

-i : specifying the input directory

-o : specifying the output directory

-c : specifying the centroids directory

-dm: specifying the distance measure used by the k-means algorithm.

-x : specifying the maximum number of iterations of the k-means algorithm

-k : specifying the number of clusters. Note that when -k is given, Mahout replaces the contents of the -c directory with k randomly selected input vectors, so the canopy centroids are effectively ignored; omit -k to start from the canopy-generated centroids.

-ow: overwrite the output directory if it already exists

-cd: optional convergence delta parameter

-cl: run the clustering of the input vectors after the k-means iterations have finished; this is what produces the clusteredPoints directory used in the next step


Since we are running only a single-node Hadoop cluster, the execution time depends on the size of the textual data being clustered and on the machine used. In our case, on a computer with an Intel Core i7 processor and 8 GB of RAM, we clustered 9.5 GB of text in 53 minutes.
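
Before interpreting the results, it is worth checking what the k-means run actually produced. A hedged example (the N in clusters-N-final depends on how many iterations were needed before convergence; in our run it was 3, as used in the next step):

hadoop fs -ls tragedy-kmeans-clusters

You should see one clusters-N directory per iteration, a final clusters-N-final directory, and, thanks to the -cl option, a clusteredPoints directory; the last two are the inputs of the clusterdump command below.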

6. Interpreting the final clustering result

We are almost there. At this stage we print the output of our clustering and try to interpret it, using the following Apache Mahout command:

mahout clusterdump -dt sequencefile -d tragedy-vectors/dictionary.file-* -i tragedy-kmeans-clusters/clusters-3-final -o clusters.txt -b 100 -p tragedy-kmeans-clusters/clusteredPoints -n 20

-p: specifying the path of the clusteredPoints directory, which contains the mapping between each input vector and its cluster; this option must be given in order to print the points associated with every cluster

-dt: specifying the dictionary type; it can take two values, sequencefile or text

-d: specifying the dictionary path

-n: specifying the number of top terms to print for each cluster

-i: specifying the path of the directory containing the final cluster SequenceFiles (here clusters-3-final)

-o: specifying the output file

-b: specifying the maximum number of characters of each cluster's description string to print


Once the above command has been executed, the output file should appear at the output path specified previously. It is a plain text file that can be opened with any text editor.

Finally, we interpret the output file. VL-21535 is the cluster identifier (the VL prefix indicates a converged cluster), n=1388 means that the cluster contains 1388 elements, c=[0:0.035…] gives the coordinates of the cluster centre, and Top Terms lists the most heavily weighted terms in the cluster.
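
Schematically, each cluster entry in clusters.txt looks roughly like the following (real terms and weights omitted; the exact layout may vary slightly between Mahout versions):

VL-21535{n=1388 c=[0:0.035, …] r=[…]}
    Top Terms:
        <term> => <weight>
        …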

Conclusion

Last but not least, it is worth stressing the large performance gain obtained by using Apache Mahout for textual data clustering compared with the traditional approach. Traditional methods that do not involve Big Data technologies have clear limits: as seen in this tutorial, we managed to cluster 9.5 GB of text in only 53 minutes, whereas on the same machine, using traditional text-mining tools, we did not succeed in clustering more than 300 KB of data.

Authors:

GUEDRIA Soulaimane, Grenoble Alpes University

Linkedin:  https://tn.linkedin.com/in/soulaimen

Twitter: https://twitter.com/SoulGuedria

GUERRAZ Agnés, Grenoble Alpes University

References:

https://ashokharnal.wordpress.com/2014/02/09/text-clustering-using-mahout-command-line-step-by-step/

https://datasciencehacks.wordpress.com/2013/09/08/performing-document-clustering-using-apache-mahout-k-means/