News

This week The Apache Ignite book became one of the top books on Leanpub.


A single node Hadoop + Cassandra + Pig setup

UP1: Our book High Performance in-memory computing with Apache Ignite has been released. The book briefly describes how to improve performance in an existing legacy Hadoop cluster with Apache Ignite.

In our current project, we decided to store all operational logs in a NoSQL DB, with a total volume of about 97 TB per year. Cassandra was our main candidate for the NoSQL DB. But we also have to analyze and monitor our data, which is where Hadoop and Pig come in. Within two days our team was able to develop a simple pilot project demonstrating the combined power of Hadoop, Cassandra, and Pig.
For the pilot project we used the DataStax Enterprise edition. This out-of-the-box product helped us quickly install the Hadoop and Cassandra stack and develop the pilot project. Here, however, we decided to set up Hadoop, Cassandra, and Pig ourselves. It was my first attempt to install Cassandra over Hadoop and Pig. Although all of these products have been around for a few years, I haven't found any step-by-step tutorial for setting up a single-node cluster with Hadoop + Cassandra + Pig.
First of all, we are going to install Hadoop and Cassandra; then we will run a pig_cassandra map-only job over a Cassandra column family, which will save its result to the Hadoop HDFS file system.
Setup Hadoop:
1) Download Hadoop from the following link - http://www.sai.msu.su/apache/hadoop/core/stable/ - then extract the archive:
tar -xvf hadoop-0.20.2.tar.gz
rm hadoop-0.20.2.tar.gz
cd hadoop-0.20.2
2) Edit conf/core-site.xml. I have used localhost in the value of fs.default.name:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
3) Edit conf/mapred-site.xml:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
4) Edit conf/hdfs-site.xml. Since this test cluster has a single node, the replication factor should be set to 1:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
5) Set your JAVA_HOME variable in conf/hadoop-env.sh. If you already have the JAVA_HOME variable set in your .bash_profile, this step is redundant.
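For example, on the Mac OS X machine used here the line in conf/hadoop-env.sh could look like the following (the JDK path is an assumption; point it at your own installation):
export JAVA_HOME=/Library/Java/Home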
6) Format the name node (one per install).
$ bin/hadoop namenode -format
It should print a message like the following:
12/07/15 15:54:20 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = Shamim-2.local/192.168.0.103
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
12/07/15 15:54:21 INFO namenode.FSNamesystem: fsOwner=samim,staff,com.apple.sharepoint.group.1,everyone,_appstore,localaccounts,_appserverusr,admin,_appserveradm,_lpadmin,_lpoperator,_developer,com.apple.access_screensharing
12/07/15 15:54:21 INFO namenode.FSNamesystem: supergroup=supergroup
12/07/15 15:54:21 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/07/15 15:54:21 INFO common.Storage: Image file of size 95 saved in 0 seconds.
12/07/15 15:54:21 INFO common.Storage: Storage directory /tmp/hadoop-samim/dfs/name has been successfully formatted.
12/07/15 15:54:21 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at Shamim-2.local/192.168.0.103
************************************************************/
6.1) Set up passphraseless SSH.
Check that you can log in to localhost without a passphrase:
ssh localhost
If you cannot, first enable your SSH server (on Mac OS X: System Preferences -> Sharing -> check the box for Remote Login; you can also allow access for all users), then execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
7) Start all Hadoop components:
$ bin/hadoop-daemon.sh start namenode
$ bin/hadoop-daemon.sh start jobtracker
$ bin/hadoop-daemon.sh start datanode
$ bin/hadoop-daemon.sh start tasktracker
$ bin/hadoop-daemon.sh start secondarynamenode
starting namenode, logging to /Users/samim/Development/NoSQL/hadoop/core/hadoop-0.20.2/bin/../logs/hadoop-samim-namenode-Shamim-2.local.out
starting jobtracker, logging to /Users/samim/Development/NoSQL/hadoop/core/hadoop-0.20.2/bin/../logs/hadoop-samim-jobtracker-Shamim-2.local.out
starting datanode, logging to /Users/samim/Development/NoSQL/hadoop/core/hadoop-0.20.2/bin/../logs/hadoop-samim-datanode-Shamim-2.local.out
Check the log files to make sure that everything went well.
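A quick cross-check is the jps tool that ships with the JDK; all five daemons should be listed (the PIDs will differ on your machine):
$ jps
19328 NameNode
19435 DataNode
19635 JobTracker
19736 TaskTracker
19560 SecondaryNameNode
19778 Jps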
8) Verify the NameNode and DataNode communication through the web interface at http://localhost:50070/dfshealth.jsp. Check the page and confirm that you have one live node.
9) Verify that the JobTracker and TaskTrackers are communicating by looking at the JobTracker web interface and confirming that one node is listed in the Nodes column: http://localhost:50030/jobtracker.jsp
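If you prefer the command line, the same pages can be fetched with curl (a rough check; it simply greps the HTML the web UIs serve, so the exact strings may vary between Hadoop versions):
# NameNode page - should report one live node
curl -s http://localhost:50070/dfshealth.jsp | grep -i "live"
# JobTracker page - should list one task tracker
curl -s http://localhost:50030/jobtracker.jsp | grep -i "nodes"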
10) Use the hadoop command-line tool to test the file system:
$ hadoop dfs -ls /
$ hadoop dfs -mkdir /test_dir
$ echo "A few words to test" > /tmp/myfile
$ hadoop dfs -copyFromLocal /tmp/myfile /test_dir
$ hadoop dfs -cat /test_dir/myfile
A few words to test
Setup Cassandra:
1) Download the source code for Cassandra version 1.1.2 from the following link: http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.1.2/apache-cassandra-1.1.2-src.tar.gz
I assume you know how to build Cassandra from the source code; otherwise you will find plenty of information on building Cassandra from source through Google.
2) Edit the CASSANDRA_HOME/conf/cassandra.yaml file to set the listen_address and rpc_address to localhost.
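The relevant lines in cassandra.yaml should end up looking like this:
listen_address: localhost
rpc_address: localhost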
3) Start Cassandra:
$ cassandra/bin/cassandra
4) Check the cluster through the nodetool utility:
$ bin/nodetool -h localhost ring
Note: Ownership information does not include topology, please specify a keyspace. 
Address         DC          Rack        Status State   Load            Owns                Token                                       
127.0.0.1       datacenter1 rack1       Up     Normal  55.17 KB        100.00%         96217188464178957452903952331500076192  
The Cassandra cluster is up; now we are going to configure Pig.
Setup Pig:
1) Download Pig from the Apache site: http://www.sai.msu.su/apache/pig/
tar -xvf pig-0.8.0.tar.gz
rm pig-0.8.0.tar.gz
Now we will try to run the pig_cassandra example that ships with the Cassandra source distribution. First of all, it's better to read the README at apache-cassandra-1.1.2-src/examples/pig/README.txt. Set all the environment variables described in the README as follows:
export PIG_HOME=%YOUR_PIG_INSTALLION_FOLDER%
export PIG_INITIAL_ADDRESS=localhost
export PIG_RPC_PORT=9160
export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
Also, if you would like to run using the Hadoop backend, you should set PIG_CONF_DIR to the location of your Hadoop config. In my case:
export PIG_CONF_DIR=hadoop/core/hadoop-0.20.2/conf
At this stage you can start the grunt shell to run MapReduce tasks:
examples/pig$ bin/pig_cassandra -x local
It should bring up the grunt shell, but I got the following ClassNotFoundException:
java.lang.ClassNotFoundException: org.apache.hadoop.mapred.RunningJob
As a quick fix, I decided to edit the pig_cassandra file as follows:
export HADOOP_CLASSPATH="/Users/xyz/hadoop/core/hadoop-0.20.2/hadoop-0.20.2-core.jar"
CLASSPATH=$CLASSPATH:$PIG_JAR:$HADOOP_CLASSPATH
Once I got the grunt shell, I created a keyspace and one column family in the Cassandra cluster and inserted some values through cassandra-cli:
[default@unknown] create keyspace Keyspace1;
[default@unknown] use Keyspace1;
[default@Keyspace1] create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
[default@Keyspace1] set Users[jsmith][first] = 'John';
[default@Keyspace1] set Users[jsmith][last] = 'Smith';
[default@Keyspace1] set Users[jsmith][age] = long(42);
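Before moving on to Pig, you can sanity-check the inserted row from the same cassandra-cli session:
[default@Keyspace1] list Users;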
Then I ran the following Pig queries in the grunt shell:
grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
grunt> cols = FOREACH rows GENERATE flatten(columns);
grunt> colnames = FOREACH cols GENERATE $0;
grunt> namegroups = GROUP colnames BY (chararray) $0;
grunt> namecounts = FOREACH namegroups GENERATE COUNT($1), group;
grunt> orderednames = ORDER namecounts BY $0;
grunt> topnames = LIMIT orderednames 50;
grunt> dump topnames;
Pig ran the script; here are the statistics:
2012-07-15 17:29:35,878 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Detected Local mode. Stats reported below may be incomplete
2012-07-15 17:29:35,881 [main] INFO  org.apache.pig.tools.pigstats.PigStats - Script Statistics: 

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.8.3 samim 2012-07-15 17:29:14 2012-07-15 17:29:35 GROUP_BY,ORDER_BY,LIMIT

Success!

Job Stats (time in seconds):
JobId Alias Feature Outputs
job_local_0001 colnames,cols,namecounts,namegroups,rows GROUP_BY,COMBINER 
job_local_0002 orderednames SAMPLER 
job_local_0003 orderednames ORDER_BY,COMBINER file:/tmp/temp-833597378/tmp-220576755,

Input(s):
Successfully read records from: "cassandra://Keyspace1/Users"

Output(s):
Successfully stored records in: "file:/tmp/temp-833597378/tmp-220576755"

Job DAG:
job_local_0001 -> job_local_0002,
job_local_0002 -> job_local_0003,
job_local_0003


2012-07-15 17:29:35,881 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-07-15 17:29:35,886 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-07-15 17:29:35,887 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-07-15 17:29:35,888 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2012-07-15 17:29:35,904 [main] INFO  org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-07-15 17:29:35,907 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-07-15 17:29:35,907 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(1,age)
(1,last)
(1,first)
You should find the output file in the file system under /tmp. In my case it is:
file:/tmp/temp-833597378/tmp-220576755
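If you would rather persist the result to an explicit path instead of the temporary file that dump writes to, a STORE statement at the end of the script does the job (a sketch; the output path /test_dir/topnames is arbitrary):
grunt> STORE topnames INTO '/test_dir/topnames' USING PigStorage(',');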
If you would like to run example-script.pig, you have to create a keyspace named MyKeySpace and a column family according to the Pig script. I just edited example-script.pig to use the newly created Keyspace1 and the column family Users. Then you can run it like this:
examples/pig$ bin/pig_cassandra example-script.pig
If you want to run Pig in local mode, add the flag -x local, for example: pig_cassandra -x local example-script.pig. Without -x local, Pig will run in Hadoop mode. See here for more information. Thanks to Nabanita for pointing this out.
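To summarize the two ways of launching the example:
# local mode - runs Pig on the local machine
examples/pig$ bin/pig_cassandra -x local example-script.pig
# Hadoop mode (the default) - submits the job to the Hadoop cluster
examples/pig$ bin/pig_cassandra example-script.pig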

You will see the statistics in the console. My next step is to set up a Cassandra cluster with 4 nodes over Hadoop and run MapReduce across all the cluster nodes.
Resources:
1) Cassandra High Performance Cookbook.
2) Cassandra: The Definitive Guide.
3) http://stackoverflow.com/questions/8846788/pig-integrated-with-cassandra-simple-distributed-query-takes-a-few-minutes-to-c

39 comments:

Diptiman...King Of Spike said...

Hi Shamim, while trying to connect Pig with Cassandra I found the following error.
Can you check this?
grunt> rows = LOAD 'cassandra://risqvu/sample1' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
2013-02-25 23:30:10,780 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve CassandraStorage using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /usr/local/hadoop/pig-0.10.1/pig_1361863488129.log

Shamim Bhuiyan said...

Hello,
you have to add the Cassandra libraries to the Hadoop classpath. You can add the Cassandra libs in hadoop-env.sh (on the Hadoop master) as follows:
export HADOOP_CLASSPATH=/hadoop/apache-cassandra-1.1.2/lib/*:$HADOOP_CLASSPATH

P.S. The Hadoop master should have the Cassandra libraries installed; there is no need to run Cassandra on the Hadoop master.

Diptiman...King Of Spike said...

Hi,
I did the same as you instructed, but I am still getting the same error.

I am giving you the file paths
for hadoop :/usr/local/hadoop
for pig: /usr/local/hadoop/pig-0.10.1
for cassandra :/usr/local/cassandra

Shamim Bhuiyan said...

Here are two key points:
1) Add the Cassandra libraries to the Hadoop classpath - you can check this after restarting the Hadoop cluster, for example:
ps -ef | grep -i cassandra
In the console you should see the Cassandra library paths on the Hadoop name node and job tracker.
2) You also have to confirm CASSANDRA_HOME in the pig_cassandra file as follows:
cassandra_home="/hadoop/apache-cassandra-1.2.1"
Copy the pig_cassandra file into the PIG_HOME/bin directory and run it from there.

Diptiman...King Of Spike said...

Thanks Shamim, that problem was solved. But now the question is: is there any difference in fetching a table into Pig from Cassandra depending on whether the table was created using CQL or the CLI?

Shamim Bhuiyan said...

Yes, there is one limitation. When you create a table through CQL3, its metadata is not visible to the CLI. If you want to read a table created by CQL, you have to add the COMPACT STORAGE keywords to the create table clause as follows:
create table abc(
a varchar primary key,
b bigint
) with COMPACT STORAGE;

Diptiman...King Of Spike said...

That's true; I knew that I can't see a table in the CLI that was created in CQL3, as I was unaware of COMPACT STORAGE. But my question was about integrating Cassandra (CQL) with Pig.
For example:
In your blog you integrated the Users table by writing the syntax "rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});"
Here the table was made in the CLI.
Is the syntax the same for CQL as well?

Diptiman...King Of Spike said...

Is there any book about the role of CQL in Pig?

Shamim Bhuiyan said...

rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage() AS (key, columns: bag {T: tuple(name, value)});
this syntax may not work in several versions of Cassandra. The following syntax will work for you in any version:
rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage();
Here you will get every row as one tuple - the tuple name with values.
There is no book about CQL in Pig. If you want to use CQL (I don't know how), you would have to rewrite the CassandraStorage class in the Cassandra distribution.

Diptiman...King Of Spike said...

Hi Shamim!
Till now we have been able to load data from Cassandra into Pig; we can filter the data in Pig and read it. The problem is with storing into Cassandra. We can read the data as the output of the command run by Pig, but when we want to store the same data into Cassandra it shows the error "Unable to produce data in cassandra://keyspace/table".

Shamim Bhuiyan said...

It depends on your data model and Pig script. It's hard to say what you are trying to do.

nabanita said...

Hi Shamim, why have you started Pig in local mode everywhere?

I was trying Pig in distributed mode and loaded data from HDFS, but when I tried to store into Cassandra, after some churning to bring it into the format (key, bag:{Tuple(name,value)}), I am facing an issue like 'could not initiate mapreducelauncher'.

I have Cassandra 1.2.1, Pig 0.10.0, and a Hadoop client running on my system, and I have also set the requisite Pig environment variables as mentioned by you.

Diptiman...King Of Spike said...

Nabanita... he isn't giving a proper reply... I hope he understood your question.

Shamim Bhuiyan said...

@nabanita,
It's a single-node Hadoop cluster, so it's logical to run Pig in local mode; in any case, the configuration doesn't change for running Pig in distributed mode.
From your comment I can assume that your cluster works fine in local mode.
From your issue "could not initiate mapreducelauncher" I can guess something is wrong with your Pig configuration. You can send the full stack trace.
In Cassandra version 1.2.* the default partitioner has been changed to Murmur3, so you have to change the PIG_PARTITIONER variable to the Murmur3 partitioner (see the export line at the end of this comment).
For troubleshooting Pig, dump the result to the console or a file and check your data model in Cassandra.
Also, for Cassandra Hadoop integration you can check the Cassandra wiki:
http://wiki.apache.org/cassandra/HadoopSupport
You can even ask questions here:
http://www.mail-archive.com/user@cassandra.apache.org/
You can also try the Cloudera Hadoop distribution; it has built-in support for Pig and much more.
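For reference, the partitioner change mentioned above would look like this (the class name matches the environment listings later in this thread):
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner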

Sri said...

We are new to Cassandra and trying to understand the benefits of having Hadoop on a Cassandra cluster. Is it for MapReduce and Pig? We have a big implementation of Hadoop, and our Cassandra cluster is a separate entity. Any docs or links that can help us would be appreciated.

Shamim Bhuiyan said...

Hello Sri,
>>We are new to Cassandra and trying to understand the benefits of having hadoop on cassandra cluster?
Cassandra will be your primary data storage, and you will be able to analyze your data, for example computing aggregates, through Hadoop MapReduce. Pig is just a high-level data-flow language for writing MapReduce jobs. It's better to have a Cassandra node and a Hadoop task tracker on the same machine; then the Hadoop task tracker can read local data from the Cassandra node and you will get better performance. Anyway, you can also configure the Hadoop cluster and the Cassandra cluster as separate entities.
See http://wiki.apache.org/cassandra/HadoopSupport

sahil deshpande said...

Hi,
Thanks for the great guide. I followed the steps and changed all the 9000 ports to my custom setup.
But while running the Hadoop word count example, it still tries to connect to Cassandra using port 9160. Since the port values are different, the connection fails with this error -
"13/05/07 02:37:00 ERROR hadoop.ConfigHelper: failed to connect to any initial addresses
13/05/07 02:37:00 ERROR hadoop.ConfigHelper:
java.io.IOException: Unable to connect to server localhost:9160
at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:589)
at org.apache.cassandra.hadoop.ConfigHelper.getClientFromAddressList(ConfigHelper.java:554)
at org.apache.cassandra.hadoop.ConfigHelper.getClientFromOutputAddressList(ConfigHelper.java:543)
at org.apache.cassandra.client.RingCache.refreshEndpointMap(RingCache.java:66)
at org.apache.cassandra.client.RingCache.<init>(RingCache.java:59)
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter.<init>(ColumnFamilyRecordWriter.java:104)
at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter.<init>(ColumnFamilyRecordWriter.java:91)
at org.apache.cassandra.hadoop.ColumnFamilyOutputFormat.getRecordWriter(ColumnFamilyOutputFormat.java:133)
at org.apache.cassandra.hadoop.ColumnFamilyOutputFormat.getRecordWriter(ColumnFamilyOutputFormat.java:61)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:553)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:408)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:216)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused
at org.apache.thrift.transport.TSocket.open(TSocket.java:183)
at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:37)
at org.apache.cassandra.hadoop.ConfigHelper.createConnection(ConfigHelper.java:580)
... 11 more
"
Can you please guide me on resolving this issue?
Thanks a lot!

Shamim Bhuiyan said...

Hello,
did you change this value: PIG_RPC_PORT=9160?

Kushal Chokhani said...

Hi Shamim

Thanks for sharing your experience and the resources. It has been a great help. I am pretty new to Cassandra and Hadoop. Here I am trying to set up an app as described in Chapter 11 of the Cassandra High Performance Cookbook. It should read the contents of a CF using a Hadoop job.

As of now, I am stuck at the following exception:
java.lang.RuntimeException: unable to load keyspace cql3_worldcount
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.<init>(ColumnFamilyRecordReader.java:245)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.<init>(ColumnFamilyRecordReader.java:209)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.<init>(ColumnFamilyRecordReader.java:300)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.<init>(ColumnFamilyRecordReader.java:300)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:165)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:522)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.get(ArrayList.java:324)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.<init>(ColumnFamilyRecordReader.java:230)
... 12 more


I can read the contents using cqlsh for the same keyspace.

Do you know what could be wrong here? I can send you the complete code if you want.

Shamim Bhuiyan said...

Good Morning, Kushal!
First of all, the Cassandra Cookbook is a legacy book and is not recommended any more. If you are trying CQL3, please try the examples from the Cassandra source code; there are a few word count examples.
If you are interested in using Pig, you should check the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-5234
The issue contains a lot of discussion and clues for working with CQL3 column families.
Regards
Shamim

Kushal Chokhani said...

Thanks Shamim for the response. Actually those WordCount examples are working perfectly fine for me.

But I couldn't figure out how to use the Hadoop cluster there. I mean, the WordCount code works even when there is no Hadoop server configured on the machine. I guess it just uses the Hadoop library and does the job. And even that might work for me, as long as I can actually do the MapReduce work using a Hadoop server running externally.

Can you please point me in the right direction here? How can I change the configuration in the WordCount code to actually use an external Hadoop setup with all the nodes and trackers up, so that everything happens at the cluster level?

Sherlock said...

Hi Shamim,

I am new to Hadoop, Pig and Cassandra. I was trying your tutorial, and after doing the configuration I realized Pig was giving an error while executing the Cassandra queries.

I am using Apache Cassandra version 2.0.1.

grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage();


It gives the following error: ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: InvalidRequestException(why:Keyspace 'Keyspace1' does not exist)

Not sure what the issue is. I created the keyspace via both cli and cql. The issue remains the same.

Shamim Bhuiyan said...

@Sherlock,
I haven't tested the above setup with Cassandra version 2.0.*. It's difficult to answer your question at the moment.

Sherlock said...

Hi Shamim,

When I was trying to run in mapreduce mode on Hadoop using: bin/pig_cassandra example-script.pig

I get the error below. I ensured the file is present. I cannot make out where I am going wrong.

ERROR 2997: Unable to recreate exception from backed error: Error: java.lang.ClassNotFoundException: org.apache.thrift.TException

HadoopVersion PigVersion UserId StartedAt FinishedAt Features
0.20.2 0.8.0 cleophusmichealpereira 2013-10-16 20:08:03 2013-10-16 20:08:52 GROUP_BY,ORDER_BY,LIMIT

Failed!

Failed Jobs:
JobId Alias Feature Message Outputs
job_201310162006_0001 colnames,cols,namecounts,namegroups,rows GROUP_BY,COMBINER Message: Job failed! hdfs://localhost:9000/tmp/temp1204958173/tmp-997794666,

Input(s):
Failed to read data from "cassandra://Keyspace1/Users"

Output(s):
Failed to produce result in "hdfs://localhost:9000/tmp/temp1204958173/tmp-997794666"



Shamim Bhuiyan said...

@Sherlock
you have to add libthrift-0.*.*.jar to your Pig classpath. You will find the lib in $CASSANDRA_HOME/lib.
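One way to do that is via the PIG_CLASSPATH variable, which the pig launcher script picks up (a sketch; the exact libthrift version in $CASSANDRA_HOME/lib may differ):
export PIG_CLASSPATH=$CASSANDRA_HOME/lib/libthrift-0.7.0.jar:$PIG_CLASSPATH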

sumit said...

Hi,

This is a really nice post.

Joy said...

Hi Shamim,

Thanks for the helpful post as I'm very new to cassandra.

Is there any way to integrate Hive/Pig with a running, pre-loaded Cassandra cluster without building any Java code/projects?

This idea came up because I easily installed the Cassandra tarball edition, so are there any similarly simple steps to integrate Hive/Pig with Cassandra?

Shamim Bhuiyan said...

Hello, Joy!
Unfortunately there is no one-step Hive/Pig installation. If you only have to analyze data, you can take a look at Presto. Presto is very easy to install and has an SQL flavor. Anyway, you can use DataStax Enterprise, which has Hive and Pig integrated with Cassandra, or Cloudera (without Cassandra).

BHUPENDRA KUMAR said...

hi shamim,

the problem of the live node is resolved now...

can you please help me? I am new to Cassandra and I don't know how to build Cassandra from source code.


thank you

sriramoju someshwar said...

Hi....

I am new to Cassandra and am trying to read data from Cassandra using Pig, but I am getting this error:

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/pig/StoreFuncInterface
Details at logfile: /home/accure/pig_1406250048465.log

Please help me get past this error.

BHUPENDRA CHANDEL said...

hi shamim...

can you please help me? While running Pig to load data from Cassandra I found this error:

grunt> rows = LOAD 'cassandra://Keyspace1/Users' USING CassandraStorage();
2014-07-30 01:44:08,183 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve CassandraStorage using imports: [org.apache.cassandra.hadoop.pig., , org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/eits/Project/cassandra-1.0/examples/pig/pig_1406709841894.log


Environment paths:

[root@dhcppc1 conf]# export JAVA_HOME=/usr/java/jdk1.7.0_55
[root@dhcppc1 conf]# export PATH=$PATH:/usr/java/jdk1.7.0_55/bin
[root@dhcppc1 conf]# export PIG_HOME=/home/eits/pig-0.11.1
[root@dhcppc1 conf]# export PIG_INITIAL_ADDRESS=localhost
[root@dhcppc1 conf]# export PIG_RPC_PORT=9160
[root@dhcppc1 conf]# export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
[root@dhcppc1 conf]# export PIG_CONF_DIR=/home/eits/hadoop-0.20.2/conf
[root@dhcppc1 conf]# export HADOOP_CLASSPATH=/home/eits/hadoop-0.20.2/hadoop-0.20.2-core.jar
[root@dhcppc1 conf]# export CLASSPATH=$HADOOP_CLASSPATH
[root@dhcppc1 conf]# export HADOOP_CLASSPATH=/home/eits/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/eits/Project/cassandra-1.0/lib
[root@dhcppc1 conf]# export CLASSPATH=$HADOOP_CLASSPATH:$PIG_JAR

Can you please guide me on where I am going wrong?

Shamim Bhuiyan said...

Hello, BHUPENDRA
you have to add the cassandra/lib/*.jar files to the Hadoop classpath.

BHUPENDRA CHANDEL said...

hi..

I have added all the jar files from cassandra/lib, hadoop/lib, and build/lib jars to the hadoop classpath and am still getting the same error:

error 1070: could not resolve CassandraStorage using imports

Hadoop version: hadoop-0.20.2
cassandra-1.0.6
pig-0.11.1
on RHEL5 x64

Are these packages compatible with each other?
Hadoop is running correctly:
$jps
9347 CliMain
19560 SecondaryNameNode
19635 JobTracker
19435 DataNode
19736 TaskTracker
19778 Jps
19328 NameNode

What can be the reason? Can you please guide me?
thanks

BHUPENDRA CHANDEL said...

hi shamim...

thanks a lot for this article... finally it's working.
:-)

BHUPENDRA CHANDEL said...

hi shamim,

I am trying to set up Hadoop on a Cassandra cluster but failing to do so. Can you please help me?

details-
Machine1-->hadoop namenode and jobtracker and pig
Machine2--> cassandra+datanode+tasktracker
machine3--> cassandra+datanode+tasktracker

I have created a keyspace on machine2 that is also available on machine3. But when I tried to read data using Pig, I got the error "alias unable to store the data". I think the problem is with the integration of Hadoop and Pig with the Cassandra cluster. Can you please guide me on the required steps and how to set HADOOP_CLASSPATH on each machine?

Cossan said...

Hi,
I followed the tutorial to set up a "single node Hadoop + Cassandra + Pig setup".

Cassandra, Hadoop and Pig start well.

When I run the following commands in the Pig shell:

grunt> rows = LOAD 'cassandra://testpregene/test' USING CassandraStorage();
grunt> describe rows;
rows: {key: int,columns: {(name: (),value: bytearray)}}
grunt> illustrate rows;
works fine and gives me the schema.
But the command below always gives me an error:
grunt> dump rows;
Input(s):
Failed to read data from "cassandra://testpregene/test"

Output(s):
Failed to produce result in "hdfs://node-c3-05.rqchp.qc.ca:9000/tmp/temp-1524759323/tmp-1167316699"

In the log trace it gives me this error:
Error: java.lang.ClassNotFoundException: org.apache.cassandra.thrift.AuthenticationException

In conf/cassandra.yaml, I have:
authenticator: org.apache.cassandra.auth.AllowAllAuthenticator
authorizer: org.apache.cassandra.auth.AllowAllAuthorizer

Here are the environment variables in my bash_profile:
export CASSANDRA_HOME=/path_to/apache-cassandra-2.0.6-src
export CASSANDRA_CONF=$CASSANDRA_HOME/conf
export PIG_HOME=/path_to/pig-0.12.0-src
export PIG_PATH=$PIG_HOME/pig.jar:$PIG_HOME/pig-withouthadoop.jar
export PIG_INITIAL_ADDRESS=localhost
export PIG_PARTITIONER=org.apache.cassandra.dht.Murmur3Partitioner
export PIG_RPC_PORT=9160
export HADOOP_HOME=/path_to/hadoop-0.20.2
export HADOOP_CONF=$HADOOP_HOME/conf
export PIG_CONF_DIR=$HADOOP_CONF
export HADOOP_CLASSPATH=$HADOOP_HOME/hadoop-0.20.2-core.jar:$PIG_PATH:$CASSANDRA_HOME/lib/*.jar

Can somebody help me find what's wrong in my configuration, please?

Thanks!

Ousmane
