Big Data PIG Project Work
- LOAD
- FILTER
- FOREACH ... GENERATE
- SPLIT
- GROUP
- JOIN
- DESCRIBE
- EXPLAIN
- ILLUSTRATE
- DUMP
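Of these operators, SPLIT, EXPLAIN, and ILLUSTRATE are not exercised in the case studies below. A minimal sketch of their use, assuming the mov relation loaded in Case Study 1 (the output relation names are only illustrative):
grunt> SPLIT mov INTO mov_old IF year < 2000, mov_new IF year >= 2000;
grunt> EXPLAIN mov_old;    -- prints the logical, physical, and MapReduce plans
grunt> ILLUSTRATE mov_old; -- steps sample tuples through the pipeline to sanity-check it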
> pig -x local                 # start the Grunt shell in local mode (local filesystem)
> pig -x local [script]        # run a script in local mode
> pig -x mapreduce [script]    # run a script on the Hadoop cluster (MapReduce mode)
Case Study 1:
Movies dataset with 50,000 observations. The dataset has 5 columns (Id, Name, Year, Rating, Duration).
1) grunt> mov = load 'Desktop/Basha/Basha2019/PIG_Practicals/movies_data.xls' using PigStorage(',') as (id:int, name:chararray, year:int, rating:float, duration:int);
grunt> describe mov;
mov: {id: int,name: chararray,year: int,rating: float,duration: int}
grunt> dump mov;
2) List the movies with a rating greater than 4.
grunt> mov_ratingfour = filter mov by (float)rating>4.0;
3) Count the movies in the file.
grunt> mov_group = group mov all;
grunt> mov_count = foreach mov_group generate COUNT(mov.id);
grunt> dump mov_count;
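COUNT(mov.id) skips tuples whose id is null; to count every record regardless of nulls, COUNT_STAR can be used instead (a minimal variant, relation name illustrative):
grunt> mov_count_all = foreach mov_group generate COUNT_STAR(mov);
grunt> dump mov_count_all;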
4) List title and duration from the file, and display the list sorted by duration in descending order.
grunt> mov_duration = foreach mov generate name,(double)duration/60;
grunt> mov_notnull = filter mov_duration by $1 is not null;
grunt> mov_duration_order = order mov_notnull by $1 DESC;
grunt> mov_long = LIMIT mov_duration_order 50;
grunt> dump mov_long;
5) Group the movies by year and find the highest rating in each year.
grunt> mov_group_year = group mov by year;
grunt> mov_group_rating = foreach mov_group_year generate group as year, MAX(mov.rating) as highest_rating;
grunt> dump mov_group_rating;
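For reference, grouping pairs each year with a bag of the full movie tuples, which is why MAX(mov.rating) can be computed per group; given the load schema above, describe should show something like:
grunt> describe mov_group_year;
mov_group_year: {group: int,mov: {(id: int,name: chararray,year: int,rating: float,duration: int)}}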
6) Use JOIN to match each year's highest rating back to the movies dataset and pick out the best movie per year.
grunt> mov_join = JOIN mov_group_rating by (year,highest_rating),mov by (year,rating);
grunt> describe mov_join;
mov_join: {mov_group_rating::year: int,mov_group_rating::highest_rating: float,mov::id: int,mov::name: chararray,mov::year: int,mov::rating: float,mov::duration: int}
grunt> mov_best = foreach mov_join generate $0 as year,$3 as title,$1 as rating;
grunt> describe mov_best;
mov_best: {year: int,title: chararray,rating: float}
grunt> dump mov_best;
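To persist the result instead of printing it to the console, STORE writes a relation out; a minimal sketch (the output directory below is an assumption):
grunt> STORE mov_best INTO 'Desktop/Basha/Basha2019/PIG_Practicals/mov_best_out' USING PigStorage(',');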
Case Study 2:
We have a demonetization dataset of tweets extracted from Twitter under the #Demonitisation hashtag. We want to do sentiment analysis on them, i.e. determine whether people express a positive (+ve) or negative (-ve) sentiment about demonetization.
1) Load the demonetization dataset.
grunt> tweet_load = load 'Desktop/Basha/Basha2019/PIG_Practicals/demonitization_tweets.csv' using PigStorage(',');
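Note that PigStorage(',') splits on every comma, including commas inside quoted tweet text, so columns can shift for some rows. If the piggybank jar is available, CSVExcelStorage respects quoted fields; a minimal sketch, assuming the jar path shown:
grunt> REGISTER /usr/lib/pig/piggybank.jar; -- jar location is an assumption
grunt> tweet_load = load 'Desktop/Basha/Basha2019/PIG_Practicals/demonitization_tweets.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',');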
2) Extract the id and text columns from the tweets.
grunt> tweet_extract = foreach tweet_load generate $0 as id, $1 as text;
grunt> describe tweet_extract;
tweet_extract: {id: bytearray,text: bytearray}
3) Tokenize the text column into individual words.
grunt> tweet_tokens = foreach tweet_extract generate id, text, FLATTEN(TOKENIZE(text)) as word;
grunt> describe tweet_tokens;
tweet_tokens: {id: bytearray,text: bytearray,word: chararray}
4) Load the AFINN dictionary, which assigns each word a sentiment rating (negative values for -ve words, positive values for +ve words).
grunt> tweet_dictionary = load '/home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/AFINN.txt' USING PigStorage('\t') as (word:chararray, rating:int);
5) Join tweet_tokens with tweet_dictionary on word (left outer, so words missing from the dictionary are kept with a null rating). The 'replicated' hint requests a map-side fragment-replicate join, which works here because the AFINN dictionary is small enough to fit in memory.
grunt> tweet_join = join tweet_tokens by word left outer, tweet_dictionary by word using 'replicated';
grunt> describe tweet_join;
tweet_join: {tweet_tokens::id: bytearray,tweet_tokens::text: bytearray,tweet_tokens::word: chararray,tweet_dictionary::word: chararray,tweet_dictionary::rating: int}
6) Keep each tweet's id and text along with the matched word's sentiment rating.
grunt> tweet_rating = foreach tweet_join generate tweet_tokens::id as id,tweet_tokens::text as text, tweet_dictionary::rating as rate;
7) Group the word ratings back together by tweet (id, text).
grunt> tweet_word_group = group tweet_rating by (id,text);
8) Average the rate values to get one sentiment score per tweet.
grunt> tweet_avg_rate = foreach tweet_word_group generate group,AVG(tweet_rating.rate) as tweet_finalrating;
9) Split into +ve and -ve tweets (an average rating of at least 0 is treated as +ve).
grunt> tweet_positive = filter tweet_avg_rate by tweet_finalrating>=0;
grunt> tweet_negative = filter tweet_avg_rate by tweet_finalrating<0;
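To summarize the split, each relation can be counted with a GROUP ALL, exactly as in step 3 of Case Study 1 (relation names illustrative):
grunt> pos_group = group tweet_positive all;
grunt> pos_count = foreach pos_group generate COUNT_STAR(tweet_positive);
grunt> neg_group = group tweet_negative all;
grunt> neg_count = foreach neg_group generate COUNT_STAR(tweet_negative);
grunt> dump pos_count;
grunt> dump neg_count;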