
An impatient start with Mahout - Machine learning

One of my friends asked me to develop a course on machine learning for students. The course should be very simple, just enough to familiarize students with machine learning. Its main purpose is to open up the machine learning world to the students and let them play with the topics with their own hands. Machine learning is a mature field and a lot of tools and frameworks are available to get your hands wet, however most of the articles or tutorials you can find on the internet start by installing a cluster or by writing a bunch of code (even the Mahout site itself uses Maven) before any learning happens. Moreover, not all students are familiar with Hadoop, and not everyone has a powerful enough notebook to install and run all of those components just to get a taste of machine learning. For these reasons I settled on the following approach:
  1. Standalone Hadoop
  2. Standalone Mahout 
  3. A few CSV data files to learn how to work with predictions

I assume you already have Java installed on your workstation. If not, please refer to the Oracle site to download and install Java. First we will install standalone Hadoop and check the installation; after that we will install Mahout and run a few examples to understand what is going on under the hood.
  • Install Hadoop and run a simple MapReduce job as a test
Hadoop version: 2.6.0 
Download hadoop-2.6.0.tar.gz from the Apache Hadoop download site. At the moment of writing this blog, version 2.6.0 is the stable release.
Unarchive the tar.gz file somewhere on your local disk and add the following entries to your .bash_profile:
export HADOOP_HOME=/PATH_TO_HADOOP_HOME_DIRECTORY/hadoop-2.6.0
export PATH=$HADOOP_HOME/bin:$PATH
Now let's check the Hadoop installation. Create a directory named inputwords in the current folder and copy all the XML files from Hadoop's etc installation folder into it as follows:
$ mkdir inputwords
$ cp $HADOOP_HOME/etc/hadoop/*.xml inputwords/
Now we can run the standalone Hadoop MapReduce example to count all the words found in the XML files:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount inputwords outputwords
You should see a bunch of logs in your console, and if everything went fine you will find a few files in the folder outputwords (this folder is created at runtime). Run the following command in your console:
$ cat ./outputwords/part-r-00000
It should show a long list of words followed by their counts, as follows:
use 9
used 22
used. 1
user 40
user. 2
user? 1
users 21
users,wheel". 18
uses 2
using 3
value 19
values 1
version 1
version="1.0" 5
via 1
when 4
where 1
which 5
while 1
who 2
will 7
If you are curious to count more words, download William Shakespeare's famous play "The Tragedy of Romeo and Juliet" from here and run the Hadoop wordcount on it.
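By the way, if you are curious what such a job looks like in code, below is a minimal WordCount sketch written against the Hadoop MapReduce API; the wordcount example shipped in hadoop-mapreduce-examples is essentially this mapper and reducer. You do not need it for this course, it is only here to show what happens under the hood.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // emit (word, 1) for every token in the line
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // sum all the 1s emitted for this word
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The mapper emits a (word, 1) pair for every token and the reducer sums the counts per word; Hadoop takes care of grouping all pairs with the same key between the two phases.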
  • Download and install Apache Mahout
Let's download the Apache Mahout distribution from here and unarchive it somewhere on your local machine. The Mahout distribution contains all the libraries and examples for running machine learning on top of Hadoop. Set MAHOUT_HOME in your .bash_profile to point to the unarchived directory, just as we did for HADOOP_HOME above; we will use it later to reference the Mahout job jar.
Now we need some data to learn with. We will use GroupLens data for this purpose; you can certainly generate data yourself, but I highly recommend the GroupLens datasets. The GroupLens organisation collects social data for research, and you can use these data for your own purposes. There are a few datasets available on the GroupLens site, such as MovieLens, BookCrossing, etc. For my course we are going to use the MovieLens datasets, because they are well formatted and grouped. Let's download the MovieLens dataset and unarchive the file somewhere on your filesystem.
First I would like to examine the datasets to get a closer look at the data, which will give us a good understanding of how to use it well.
After unarchiving ml-data.tar.gz, you should find the following datasets in your folder (a short Java sketch for reading u.data follows the listing).
u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
          user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.
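Since u.data is just a tab separated text file, it is easy to peek at it from plain Java before involving Mahout. Here is a tiny sketch (PATH_TO/ml-100k is a placeholder for wherever you unarchived the dataset) that prints the first few ratings:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class PeekUData {
    public static void main(String[] args) throws Exception {
        // u.data format: user id | item id | rating | timestamp (tab separated)
        try (Stream<String> lines = Files.lines(Paths.get("PATH_TO/ml-100k/u.data"))) {
            lines.limit(5)
                 .map(line -> line.split("\\t"))
                 .forEach(f -> System.out.println(
                         "user=" + f[0] + " item=" + f[1] + " rating=" + f[2] + " timestamp=" + f[3]));
        }
    }
}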
For the recommendations we will use u.data. Now it is time to study a little theory about recommendation: what is it, and what is it for?

  • Recommendation:
Mahout contains a recommender engine, several types of them in fact, beginning with conventional user-based and item-based recommenders. It includes implementations of several other algorithms as well, but for now we'll explore a simple user-based recommender. For detailed information about the recommender engine please see here.
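Before running the Hadoop based job, it is worth noting that the same user-based idea can be tried in a few lines with Mahout's Taste API, the non-distributed recommender library bundled with Mahout. The following is only a sketch; it assumes the Mahout libraries are on your classpath and that PATH_TO/ml-100k points to the unarchived dataset:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleUserBasedRecommender {
    public static void main(String[] args) throws Exception {
        // u.data is tab separated (user, item, rating, timestamp); FileDataModel can read it directly
        DataModel model = new FileDataModel(new File("PATH_TO/ml-100k/u.data"));
        // how similar two users are, based on the ratings they have in common
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // the 10 most similar users form the neighborhood of a user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // top 5 recommendations for user 3
        List<RecommendedItem> recommendations = recommender.recommend(3, 5);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
Here PearsonCorrelationSimilarity measures how close two users' ratings are, NearestNUserNeighborhood picks the 10 most similar users, and the recommender estimates preferences for items those neighbours liked.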

  • Examine the DataSet:
Let's take a closer look at the file u.data.
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
62 257 2 879372434
286 1014 5 879781125
200 222 5 876042340
210 40 3 891035994
224 29 3 888104457
303 785 3 879485318
122 387 5 879270459
194 274 2 879539794
291 1042 4 874834944
234 1184 2 892079237
119 392 4 886176814
167 486 4 892738452
299 144 4 877881320
291 118 2 874833878
308 1 4 887736532
95 546 2 879196566
38 95 5 892430094
102 768 2 883748450
63 277 4 875747401
For example, the user with id 196 rated film 242 with a preference of 3. I have imported the u.data file into Excel and sorted it by user id as follows:

User 1 rated films 1 and 2 (among others), and user 3 also rated film 2. If we want to recommend a film to user 3 based on user 1, it should be film 1 (Toy Story). The sketch below illustrates this naive intuition.
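Here is a naive sketch of that intuition: it only looks at which films user 1 has rated and user 3 has not. A real recommender of course also weighs user similarity and the rating values, as we will see in a moment (PATH_TO/ml-100k is again a placeholder):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class NaiveCandidates {
    public static void main(String[] args) throws Exception {
        List<String[]> rows = Files.lines(Paths.get("PATH_TO/ml-100k/u.data"))
                .map(line -> line.split("\\t"))
                .collect(Collectors.toList());
        // all films rated by user 1 and by user 3
        Set<String> ratedByUser1 = rows.stream()
                .filter(f -> f[0].equals("1")).map(f -> f[1]).collect(Collectors.toSet());
        Set<String> ratedByUser3 = rows.stream()
                .filter(f -> f[0].equals("3")).map(f -> f[1]).collect(Collectors.toSet());
        // naive candidates for user 3: films user 1 has seen but user 3 has not
        Set<String> candidates = new HashSet<>(ratedByUser1);
        candidates.removeAll(ratedByUser3);
        System.out.println("Naive candidates for user 3: " + candidates);
    }
}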
Let's run Mahout's recommendation engine and examine the result.
$ hadoop jar $MAHOUT_HOME/mahout-examples-0.10.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input PATH_TO/ml-100k/u.data --output outputm
You should find the result in the file outputm/part-r-00000. If you look attentively, you will find the recommendations for user 3 as follows:
3 [137:5.0,248:4.8714285,14:4.8153844,285:4.8153844,845:4.754717,124:4.7089553,319:4.7035174,508:4.7006173,150:4.68,311:4.6615386]
which differs from what we guessed earlier, because the recommendation engine also uses the preferences (ratings) of all the other users.
Let's write a bit of Java code to find out which films were actually recommended by Mahout.
package com.blu.mahout;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import static java.util.stream.Collectors.toList;

/**
 * Created by shamim on 17/04/15.
 * Source files are u.data, u.item and the Mahout generated part-r-00000 file.
 */
public class PrintRecomendation {
    private static final int INPUT_LENGTH = 6;

    private static List<String> userRecommendedFilms = new ArrayList<>();

    public static void main(String[] inputs) throws IOException {
        System.out.println("Print the recommendation and recommended films for given user:");

        if (inputs.length < INPUT_LENGTH) {
            System.out.println("USAGE: PrintRecomendation USERID UDATA_FILE_NAME UITEM_FILE_NAME MAHOUT_REC_FILE UDATA_FOLDER UMDATA_FOLDER"
                    + " Example: java -jar mahout-ml-1.0-SNAPSHOT.one-jar.jar 3 u.data u.item part-r-00000 /Users/shamim/Development/workshop/bigdata/hadoop/inputm/ml-100k/ /Users/shamim/Development/workshop/bigdata/hadoop/outputm/");
            System.exit(0);
        }
        String USER_ID = inputs[0];
        String UDATA_FILE_NAME = inputs[1];
        String UITEM_FILE_NAME = inputs[2];
        String MAHOUT_REC_FILE = inputs[3];
        String UDATA_FOLDER = inputs[4];
        String UMDATA_FOLDER = inputs[5];

        // Read the u.data file and collect the ratings made by the given user.
        Path pathToUfile = Paths.get(UDATA_FOLDER, UDATA_FILE_NAME);
        List<String> filteredLines = Files.lines(pathToUfile).filter(s -> s.contains(USER_ID)).collect(toList());
        for (String line : filteredLines) {
            String[] words = line.split("\\t");
            if (words.length > 0 && words[0].equalsIgnoreCase(USER_ID)) {
                userRecommendedFilms.add(line);
            }
        }

        // Read u.item; it contains a few non UTF-8 characters, so ignore malformed input.
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE);
        Path pathToUItem = Paths.get(UDATA_FOLDER, UITEM_FILE_NAME);

        List<String> nonFiltered;
        try (Reader r = Channels.newReader(FileChannel.open(pathToUItem), dec, -1);
             BufferedReader br = new BufferedReader(r)) {
            nonFiltered = br.lines().collect(Collectors.toList());
        }

        // Find the line of the Mahout output that belongs to the given user.
        Path pathToMahoutFile = Paths.get(UMDATA_FOLDER, MAHOUT_REC_FILE);
        List<String> filteredMLines = Files.lines(pathToMahoutFile).filter(s -> s.contains(USER_ID)).collect(toList());
        String recommendedFilms = "";
        for (String line : filteredMLines) {
            String[] splited = line.split("\\t");
            if (splited[0].equalsIgnoreCase(USER_ID)) {
                recommendedFilms = splited[1];
                break;
            }
        }

        // The recommendation list looks like [137:5.0,248:4.8714285,...];
        // strip the surrounding brackets and split it into itemId:rating pairs.
        recommendedFilms = recommendedFilms.replace("[", "").replace("]", "");
        String[] filmsId = recommendedFilms.split(",");

        for (String filmId : filmsId) {
            String[] idWithRating = filmId.split(":");
            String id = idWithRating[0];
            // Look up the film title in u.item (pipe separated: movie id | movie title | ...).
            for (String filmLine : nonFiltered) {
                String[] items = filmLine.split("\\|");
                if (id.equalsIgnoreCase(items[0])) {
                    System.out.println("Film name:" + items[1]);
                }
            }
        }
    }
}
You should see output similar to the following in your console:
Film name:Grosse Pointe Blank (1997)
Film name:Postino, Il (1994)
Film name:Secrets & Lies (1996)
Film name:That Thing You Do! (1996)
Film name:Lone Star (1996)
Film name:Everyone Says I Love You (1996)
Film name:People vs. Larry Flynt, The (1996)
Film name:Swingers (1996)
Film name:Wings of the Dove, The (1997)
