One of my friends asked me to develop a course on the subject of "Machine Learning" for students. The course should be very simple, to familiarize students with machine learning. The main purpose of this course is to open up the machine learning world to students and let them play with these topics with their own hands. Machine learning is very mature by now, and a lot of tools and frameworks are available to get your hands wet in this topic. However, most of the articles or tutorials you can find on the internet start by installing a cluster, or make you write a bunch of code before you learn anything (even the Mahout site uses Maven in its examples). Moreover, not all students are familiar with Hadoop, and not all of them have a powerful enough notebook to install and run all the components just to get a taste of machine learning. For these reasons I settled on the following approach:
- Standalone Hadoop
- Standalone Mahout
- And a few CSV data files to learn how to work with predictions
I assume you already have Java installed on your workstation. If not, please refer to the Oracle site to download and install Java yourself. First we will install standalone Hadoop and check the installation; after that we will install Mahout and run some examples to understand what's going on under the hood.
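To check the installation quickly, run:

$ java -version

It should print the version of the installed JDK.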
- Install Hadoop and run a simple MapReduce job as a test
Hadoop version: 2.6.0
Download hadoop-2.6.0.tar.gz from the Apache Hadoop download site. At the moment of writing this blog, version 2.6.0 is the stable one to use.
Unarchive the gz file somewhere on your local disk. Add the following paths to your .bash_profile as follows:
export HADOOP_HOME=/PATH_TO_HADOOP_HOME_DIRECTORY/hadoop-2.6.0
export PATH=$HADOOP_HOME/bin:$PATH
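Reload the profile and make sure the hadoop binary is picked up; the built-in version command is a quick sanity check:

$ source ~/.bash_profile
$ hadoop version

It should report version 2.6.0.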
Now let's check the Hadoop installation. Create a directory inputwords in the current folder and copy all the xml files from the Hadoop etc installation folder as follows:
$ mkdir inputwords
$ cp $HADOOP_HOME/etc/hadoop/*.xml inputwords/
Now we can run the Hadoop standalone MapReduce job to count all the words found in the xml files:
$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar wordcount inputwords outputwords
You should see a bunch of logs in your console, and if everything went fine you should find a few files in the folder outputwords (this folder is created at runtime). Run the following command in your console:
$ cat ./outputwords/part-r-00000
It should show a lot of words followed by their counts, as follows:
use	9
used	22
used.	1
user	40
user.	2
user?	1
users	21
users,wheel".	18
uses	2
using	3
value	19
values	1
version	1
version="1.0"	5
via	1
when	4
where	1
which	5
while	1
who	2
will	7
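If you are curious what the wordcount job actually computes, here is a minimal plain-Java sketch of the same computation (the input file name is just an example). Hadoop performs exactly this kind of counting, only split into map and reduce tasks that can be distributed over many machines:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LocalWordCount {
    public static void main(String[] args) throws IOException {
        Map<String, Long> counts;
        // Split every line on whitespace and count the occurrences of each token
        try (Stream<String> lines = Files.lines(Paths.get("inputwords/core-site.xml"))) {
            counts = lines
                    .flatMap(line -> Arrays.stream(line.split("\\s+")))
                    .filter(word -> !word.isEmpty())
                    .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
        }
        // Same "word<TAB>count" shape as the part-r-00000 file above
        counts.forEach((word, n) -> System.out.println(word + "\t" + n));
    }
}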
If you are curious to count more words, download William Shakespeare's famous play "The Tragedy of Romeo and Juliet" from here and run the Hadoop wordcount on it.
- Download and install Apache Mahout
Let's download the Apache Mahout distribution from here and unarchive the file somewhere on your local machine. The Mahout distribution contains all the libraries and examples for running machine learning on top of Hadoop.
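The RecommenderJob command we will run later refers to $MAHOUT_HOME, so it is convenient to export it in .bash_profile the same way we did for Hadoop (replace the placeholder with the directory you unarchived the distribution into):

export MAHOUT_HOME=/PATH_TO_MAHOUT_HOME_DIRECTORY
export PATH=$MAHOUT_HOME/bin:$PATH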
Now we need some data for learning purposes. We can use GroupLens data; certainly you can generate data yourself, but I highly recommend the data from GroupLens. The GroupLens organization collects social data for research, and you can use it for your own purposes. There are a few datasets available on the GroupLens site, such as MovieLens, BookCrossing, etc. For this course we are going to use the MovieLens dataset, because it is well formatted and grouped. Let's download the MovieLens dataset and unarchive the file somewhere in your filesystem.
First I would like to examine the dataset to get a closer look at the data, which will give us a good understanding of how to use it well.
After unarchiving ml-data.tar.gz, you should find a list of data files in your folder:
u.data -- The full u data set: 100000 ratings by 943 users on 1682 items. Each user has rated at least 20 movies. Users and items are numbered consecutively from 1. The data is randomly ordered. This is a tab separated list of user id | item id | rating | timestamp. The timestamps are unix seconds since 1/1/1970 UTC.

u.info -- The number of users, items, and ratings in the u data set.

u.item -- Information about the items (movies); this is a tab separated list of movie id | movie title | release date | video release date | IMDb URL | unknown | Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western. The last 19 fields are the genres; a 1 indicates the movie is of that genre, a 0 indicates it is not. Movies can be in several genres at once. The movie ids are the ones used in the u.data set.

u.genre -- A list of the genres.

u.user -- Demographic information about the users; this is a tab separated list of user id | age | gender | occupation | zip code. The user ids are the ones used in the u.data set.
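You can also peek at the first ratings directly from the console:

$ head -5 ml-100k/u.data

Adjust the ml-100k path to wherever you unarchived the dataset.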
For getting recommendations we will use u.data. Now it's time to study a bit of theory: what is a recommendation, and what is it for?
- Recommendation:
Mahout contains a recommender engine (several types of them, in fact), beginning with conventional user-based and item-based recommenders. It includes implementations of several other algorithms as well, but for now we'll explore a simple user-based recommender. For detailed information on the recommender engine, please see here.
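Before we run the Hadoop-based job, it may help to see a user-based recommender in miniature. Mahout also ships an in-memory, non-distributed recommender API, and the sketch below shows a classic user-based setup over u.data. Treat it as an illustrative sketch, assuming the Mahout 0.10.0 libraries are on the classpath and with the input path as a placeholder; the neighborhood size of 10 is an arbitrary choice:

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimpleUserBasedRecommender {
    public static void main(String[] args) throws Exception {
        // u.data already has the user<TAB>item<TAB>rating<TAB>timestamp shape FileDataModel understands
        DataModel model = new FileDataModel(new File("PATH_TO/ml-100k/u.data"));
        // How similar are two users, judged by the ratings they gave?
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Treat the 10 most similar users as the neighborhood of a user
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 5 recommendations for user 3
        List<RecommendedItem> recommendations = recommender.recommend(3, 5);
        for (RecommendedItem item : recommendations) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}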
Later we will also write a small Java program to see exactly which films Mahout recommended.
- Examine the dataset:
Let's take a closer look at the file u.data.
For example, the user with id 196 rated film 242 with preference 3. I imported the u.data file into Excel to examine it more comfortably; the first rows look like this:

196	242	3	881250949
186	302	3	891717742
22	377	1	878887116
244	51	2	880606923
166	346	1	886397596
298	474	4	884182806
115	265	2	881171488
253	465	5	891628467
305	451	3	886324817
6	86	3	883603013
62	257	2	879372434
286	1014	5	879781125
200	222	5	876042340
210	40	3	891035994
224	29	3	888104457
303	785	3	879485318
122	387	5	879270459
194	274	2	879539794
291	1042	4	874834944
234	1184	2	892079237
119	392	4	886176814
167	486	4	892738452
299	144	4	877881320
291	118	2	874833878
308	1	4	887736532
95	546	2	879196566
38	95	5	892430094
102	768	2	883748450
63	277	4	875747401
Suppose user 1 rated films 1 and 2, and user 3 also rated film 2. If we want to find a recommendation for user 3 based on user 1's ratings, it should be film 1 (Toy Story).
Let's run the recommendation engine of Mahout and examine the result.
$ hadoop jar $MAHOUT_HOME/mahout-examples-0.10.0-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input PATH_TO/ml-100k/u.data --output outputm

You should find the result in the file outputm/part-r-00000. If you check it attentively, you should find the recommendation for user 3 as follows:
3	[137:5.0,248:4.8714285,14:4.8153844,285:4.8153844,845:4.754717,124:4.7089553,319:4.7035174,508:4.7006173,150:4.68,311:4.6615386]

This differs from what we guessed earlier, because the recommendation engine also uses the preferences (ratings) of the other users.
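The ids in the Mahout output are not very readable. The following small Java program, which I mentioned earlier, joins the recommended film ids for a given user with the film titles from u.item and prints them: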
package com.blu.mahout;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import static java.util.stream.Collectors.toList;

/**
 * Created by shamim on 17/04/15.
 * Input files: u.data, u.item and the Mahout generated recommendation file.
 */
public class PrintRecomendation {
    private static final int INPUT_LENGTH = 6;
    private static List<String> userRecommendedFilms = new ArrayList<>();

    public static void main(String[] inputs) throws IOException {
        System.out.println("Print the recommendation and recommended films for given user:");
        if (inputs.length < INPUT_LENGTH) {
            System.out.println("USAGE: PrintRecomendation USERID UDATA_FILE_NAME UITEM_FILE_NAME MAHOUT_REC_FILE UDATA_FOLDER UMDATA_FOLDER\n" +
                    "Example: java -jar mahout-ml-1.0-SNAPSHOT.one-jar.jar 3 u.data u.item part-r-00000 " +
                    "/Users/shamim/Development/workshop/bigdata/hadoop/inputm/ml-100k/ " +
                    "/Users/shamim/Development/workshop/bigdata/hadoop/outputm/");
            System.exit(0);
        }
        String USER_ID = inputs[0];
        String UDATA_FILE_NAME = inputs[1];
        String UITEM_FILE_NAME = inputs[2];
        String MAHOUT_REC_FILE = inputs[3];
        String UDATA_FOLDER = inputs[4];
        String UMDATA_FOLDER = inputs[5];

        // Collect the given user's own ratings from u.data
        Path pathToUfile = Paths.get(UDATA_FOLDER, UDATA_FILE_NAME);
        List<String> filteredLines = Files.lines(pathToUfile)
                .filter(s -> s.contains(USER_ID))
                .collect(toList());
        for (String line : filteredLines) {
            String[] words = line.split("\\t");
            if (words[0].equalsIgnoreCase(USER_ID)) {
                userRecommendedFilms.add(line);
            }
        }

        // u.item is not valid UTF-8, so ignore malformed bytes while reading it
        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE);
        Path pathToUItem = Paths.get(UDATA_FOLDER, UITEM_FILE_NAME);
        List<String> nonFiltered;
        try (Reader r = Channels.newReader(FileChannel.open(pathToUItem), dec, -1);
             BufferedReader br = new BufferedReader(r)) {
            nonFiltered = br.lines().collect(Collectors.toList());
        }

        // Find the line Mahout produced for the given user, e.g. "3\t[137:5.0,248:4.87,...]"
        Path pathToMahoutFile = Paths.get(UMDATA_FOLDER, MAHOUT_REC_FILE);
        List<String> filteredMLines = Files.lines(pathToMahoutFile)
                .filter(s -> s.contains(USER_ID))
                .collect(toList());
        String recommendedFilms = "";
        for (String line : filteredMLines) {
            String[] splited = line.split("\\t");
            if (splited[0].equalsIgnoreCase(USER_ID)) {
                recommendedFilms = splited[1];
                break;
            }
        }

        // Strip the surrounding brackets so the first and last id:rating pairs parse correctly,
        // then resolve every recommended film id to its title from u.item
        String[] filmsId = recommendedFilms.replace("[", "").replace("]", "").split(",");
        for (String filmId : filmsId) {
            String[] idWithRating = filmId.split(":");
            String id = idWithRating[0];
            for (String filmLine : nonFiltered) {
                String[] items = filmLine.split("\\|");
                if (id.equalsIgnoreCase(items[0])) {
                    System.out.println("Film name:" + items[1]);
                }
            }
        }
    }
}
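Compile and package the class, then run it with the six arguments from the usage string. For example (the paths below are from my machine, substitute your own):

$ java -jar mahout-ml-1.0-SNAPSHOT.one-jar.jar 3 u.data u.item part-r-00000 /Users/shamim/Development/workshop/bigdata/hadoop/inputm/ml-100k/ /Users/shamim/Development/workshop/bigdata/hadoop/outputm/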
You should see output similar to the following in your console:
Film name:Grosse Pointe Blank (1997)
Film name:Postino, Il (1994)
Film name:Secrets & Lies (1996)
Film name:That Thing You Do! (1996)
Film name:Lone Star (1996)
Film name:Everyone Says I Love You (1996)
Film name:People vs. Larry Flynt, The (1996)
Film name:Swingers (1996)
Film name:Wings of the Dove, The (1997)