Big Data PIG Project Work
- LOAD
- FILTER
- FOREACH ... GENERATE
- SPLIT
- GROUP
- JOIN
- DESCRIBE
- EXPLAIN
- ILLUSTRATE
- DUMP
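Of these operators, SPLIT, EXPLAIN, and ILLUSTRATE are not exercised in the case studies below. A minimal sketch of their use, assuming the mov relation loaded in Case Study 1 (the output relation names are only illustrative):
grunt> SPLIT mov INTO mov_old IF year < 2000, mov_new IF year >= 2000;
grunt> EXPLAIN mov_old;    -- prints the logical, physical, and MapReduce plans
grunt> ILLUSTRATE mov_old; -- steps sample tuples through the pipeline to sanity-check it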
> pig -x local                 # start the Grunt shell in local mode (local filesystem)
> pig -x local [script]        # run a script in local mode
> pig -x mapreduce [script]    # run a script on the Hadoop cluster (MapReduce mode)
Case Study 1:
Movies dataset with 50,000 observations. The dataset has 5 columns (Id, Name, Year, Rating, Duration).
1) grunt> mov = load 'Desktop/Basha/Basha2019/PIG_Practicals/movies_data.xls' using PigStorage(',') as (id:int, name:chararray, year:int, rating:float, duration:int);
grunt> describe mov;
mov: {id: int,name: chararray,year: int,rating: float,duration: int}
grunt> dump mov;
2) List the movies with a rating greater than 4.
grunt> mov_ratingfour = filter mov by (float)rating>4.0;
3) Count the movies in the file.
grunt> mov_group = group mov all;
grunt> mov_count = foreach mov_group generate COUNT(mov.id);
grunt> dump mov_count;
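COUNT(mov.id) skips tuples whose id is null; to count every record regardless of nulls, COUNT_STAR can be used instead (a minimal variant, relation name illustrative):
grunt> mov_count_all = foreach mov_group generate COUNT_STAR(mov);
grunt> dump mov_count_all;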
4) List title and duration from the file, and display the list sorted by duration in descending order.
grunt> mov_duration = foreach mov generate name,(double)duration/60;
grunt> mov_notnull = filter mov_duration by $1 is not null;
grunt> mov_duration_order = order mov_notnull by $1 DESC;
grunt> mov_long = LIMIT mov_duration_order 50;
grunt> dump mov_long;
5) Group the movies by year and find the highest rating in each year.
grunt> mov_group_year = group mov by year;
grunt> mov_group_rating = foreach mov_group_year generate group as year, MAX(mov.rating) as highest_rating;
grunt> dump mov_group_rating;
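For reference, grouping pairs each year with a bag of the full movie tuples, which is why MAX(mov.rating) can be computed per group; given the load schema above, describe should show something like:
grunt> describe mov_group_year;
mov_group_year: {group: int,mov: {(id: int,name: chararray,year: int,rating: float,duration: int)}}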
6) Use JOIN to match each year's highest rating back to the movies dataset and pick out the best movie per year.
grunt> mov_join = JOIN mov_group_rating by (year,highest_rating),mov by (year,rating);
grunt> describe mov_join;
mov_join: {mov_group_rating::year: int,mov_group_rating::highest_rating: float,mov::id: int,mov::name: chararray,mov::year: int,mov::rating: float,mov::duration: int}
grunt> mov_best = foreach mov_join generate $0 as year,$3 as title,$1 as rating;
grunt> describe mov_best;
mov_best: {year: int,title: chararray,rating: float}
grunt> dump mov_best;
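To persist the result instead of printing it to the console, STORE writes a relation out; a minimal sketch (the output directory below is an assumption):
grunt> STORE mov_best INTO 'Desktop/Basha/Basha2019/PIG_Practicals/mov_best_out' USING PigStorage(',');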
Case Study 2:
We have a demonetization dataset of tweets extracted from Twitter under the #Demonitisation hashtag. We want to do sentiment analysis on them, i.e. determine whether people express a positive (+ve) or negative (-ve) sentiment about demonetization.
1) Load the demonetization dataset.
grunt> tweet_load = load 'Desktop/Basha/Basha2019/PIG_Practicals/demonitization_tweets.csv' using PigStorage(',');
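Note that PigStorage(',') splits on every comma, including commas inside quoted tweet text, so columns can shift for some rows. If the piggybank jar is available, CSVExcelStorage respects quoted fields; a minimal sketch, assuming the jar path shown:
grunt> REGISTER /usr/lib/pig/piggybank.jar; -- jar location is an assumption
grunt> tweet_load = load 'Desktop/Basha/Basha2019/PIG_Practicals/demonitization_tweets.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',');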
2) Extract the id and text columns from the tweets.
grunt> tweet_extract = foreach tweet_load generate $0 as id, $1 as text;
grunt> describe tweet_extract;
tweet_extract: {id: bytearray,text: bytearray}
3) Tokenize the text column into individual words.
grunt> tweet_tokens = foreach tweet_extract generate id, text, FLATTEN(TOKENIZE(text)) as word;
grunt> describe tweet_tokens;
tweet_tokens: {id: bytearray,text: bytearray,word: chararray}
4) Load the AFINN dictionary, which assigns each word a sentiment rating (negative values for -ve words, positive values for +ve words).
grunt> tweet_dictionary = load '/home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/AFINN.txt' USING PigStorage('\t') as (word:chararray, rating:int);
5) Join tweet_tokens with tweet_dictionary on word (left outer, so words missing from the dictionary are kept with a null rating). The 'replicated' hint requests a map-side fragment-replicate join, which works here because the AFINN dictionary is small enough to fit in memory.
grunt> tweet_join = join tweet_tokens by word left outer, tweet_dictionary by word using 'replicated';
grunt> describe tweet_join;
tweet_join: {tweet_tokens::id: bytearray,tweet_tokens::text: bytearray,tweet_tokens::word: chararray,tweet_dictionary::word: chararray,tweet_dictionary::rating: int}
6) Keep each tweet's id and text along with the matched word's sentiment rating.
grunt> tweet_rating = foreach tweet_join generate tweet_tokens::id as id,tweet_tokens::text as text, tweet_dictionary::rating as rate;
7) Group the word ratings back together by tweet (id, text).
grunt> tweet_word_group = group tweet_rating by (id,text);
8) Average the rate values to get one sentiment score per tweet.
grunt> tweet_avg_rate = foreach tweet_word_group generate group,AVG(tweet_rating.rate) as tweet_finalrating;
9) Split into +ve and -ve tweets (an average rating of at least 0 is treated as +ve).
grunt> tweet_positive = filter tweet_avg_rate by tweet_finalrating>=0;
grunt> tweet_negative = filter tweet_avg_rate by tweet_finalrating<0;
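To summarize the split, each relation can be counted with a GROUP ALL, exactly as in step 3 of Case Study 1 (relation names illustrative):
grunt> pos_group = group tweet_positive all;
grunt> pos_count = foreach pos_group generate COUNT_STAR(tweet_positive);
grunt> neg_group = group tweet_negative all;
grunt> neg_count = foreach neg_group generate COUNT_STAR(tweet_negative);
grunt> dump pos_count;
grunt> dump neg_count;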