Tuesday, April 14, 2020

Hive Project work


BigData Hive Project work

1) create a database
hive> create database RETAIL;

2) show the databases
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.023 seconds, Fetched: 3 row(s)

3) use the Retail database
hive> use RETAIL;

4) show the tables in Retail database
hive> show tables;
OK

5) create a table
hive> create table TXNRECORDS(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING)
    > row format delimited fields terminated by ',' stored as textfile;

6) load data into a table
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/txns1.txt' into table TXNRECORDS;

hive> show tables;
OK
txnrecords
Time taken: 0.021 seconds, Fetched: 1 row(s)
hive>

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/retail.db/txnrecords
Found 1 items
-rwxrwxrwx   1 cloudera supergroup    4418139 2020-02-01 09:51 /user/hive/warehouse/retail.db/txnrecords/txns1.txt

hive> select * from txnrecords limit 10;
OK
0 06-26-2011 4007024 40.33 Exercise & Fitness Cardio Machine Accessories Clarksville Tennessee credit
1 05-26-2011 4006742 198.44 Exercise & Fitness Weightlifting Gloves Long Beach California credit
2 06-01-2011 4009775 5.58 Exercise & Fitness Weightlifting Machine Accessories Anaheim California credit
3 06-05-2011 4002199 198.19 Gymnastics Gymnastics Rings Milwaukee Wisconsin credit
4 12-17-2011 4002613 98.81 Team Sports Field Hockey Nashville  Tennessee credit
5 02-14-2011 4007591 193.63 Outdoor Recreation Camping & Backpacking & Hiking Chicago Illinois credit
6 10-28-2011 4002190 27.89 Puzzles Jigsaw Puzzles Charleston South Carolina credit
7 07-14-2011 4002964 96.01 Outdoor Play Equipment Sandboxes Columbus Ohio credit
8 01-17-2011 4007361 10.44 Winter Sports Snowmobiling Des Moines Iowa credit
9 05-17-2011 4004798 152.46 Jumping Bungee Jumping St. Petersburg Florida credit
Time taken: 0.076 seconds, Fetched: 10 row(s)

7) count the number of rows in the table.
- Hive will launch a MapReduce job and return the result at the end.

hive> select count(*) from txnrecords;
Query ID = cloudera_20200317115050_d45ae0eb-3662-4392-a4e2-f3e247368e52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1579493137658_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1579493137658_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1579493137658_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-03-17 11:51:00,725 Stage-1 map = 0%,  reduce = 0%
2020-03-17 11:51:15,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.99 sec
2020-03-17 11:51:28,713 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.69 sec
MapReduce Total cumulative CPU time: 3 seconds 690 msec
Ended Job = job_1579493137658_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.69 sec   HDFS Read: 4426362 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 690 msec
OK
50000
Time taken: 55.465 seconds, Fetched: 1 row(s)

8) Create a table for Customer Data
hive> create table customer(custno string, firstname string, lastname string, age int, profession string)
    > row format delimited fields terminated by ',';

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/custs.txt' into table customer;
Loading data to table retail.customer
Table retail.customer stats: [numFiles=1, totalSize=391355]
OK
Time taken: 0.308 seconds

hive> select * from customer limit 10;
OK
4000001 Kristina Chung 55 Pilot
4000002 Paige Chen 74 Teacher
4000003 Sherri Melton 34 Firefighter
4000004 Gretchen Hill 66 Computer hardware engineer
4000005 Karen Puckett 74 Lawyer
4000006 Patrick Song 42 Veterinarian
4000007 Elsie Hamilton 43 Pilot
4000008 Hazel Bender 63 Carpenter
4000009 Malcolm Wagner 39 Artist
4000010 Dolores McLaughlin 60 Writer
Time taken: 0.065 seconds, Fetched: 10 row(s)

hive> select count(*) from customer;
Query ID = cloudera_20200317120808_7ae1124b-a05f-43fa-bbbf-42cee5f14e71
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1579493137658_0002, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1579493137658_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1579493137658_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-03-17 12:08:12,143 Stage-1 map = 0%,  reduce = 0%
2020-03-17 12:08:22,200 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.66 sec
2020-03-17 12:08:32,024 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.3 sec
MapReduce Total cumulative CPU time: 3 seconds 300 msec
Ended Job = job_1579493137658_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.3 sec   HDFS Read: 398974 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 300 msec
OK
9999
Time taken: 32.407 seconds, Fetched: 1 row(s)
9) create a join table
hive> create table out1(custno string, firstname string, age int, profession string, amount double, product string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.106 seconds

hive> insert overwrite table out1
    > select a.custno,a.firstname,a.age,a.profession,b.amount,b.product
    > from customer a JOIN txnrecords b ON a.custno=b.custno;
Query ID = cloudera_20200412100909_d6678e43-b571-4cdc-ad33-0ae36178bdfd
Total jobs = 1

hive> select * from out1 limit 10;
OK
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories
4002199 Keith 44 Police officer 198.19 Gymnastics Rings
4002613 Hugh 43 Engineering technician 98.81 Field Hockey
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes
4007361 Terri 52 Loan officer 10.44 Snowmobiling
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping
Time taken: 0.08 seconds, Fetched: 10 row(s)

Displaying the header of the records:
hive> set hive.cli.print.header=true;
hive> select * from out1 limit 10;
OK
out1.custno out1.firstname out1.age out1.profession out1.amount out1.product
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories
4002199 Keith 44 Police officer 198.19 Gymnastics Rings
4002613 Hugh 43 Engineering technician 98.81 Field Hockey
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes
4007361 Terri 52 Loan officer 10.44 Snowmobiling
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping
Time taken: 0.085 seconds, Fetched: 10 row(s)

10) create out2 table

hive> create table out2 (custno int, firstname string, age int, profession string, amount double, product string, level string)
    > row format delimited fields terminated by ',';

Categorize based on Age :

hive> insert overwrite table out2 
    > select *, case
    > when age < 30 then 'low'
    > when age >= 30 and age < 50 then 'middle'
    > when age >= 50 then 'old'
    > else 'others'
    > end
    > from out1;

hive> select * from out2 limit 10;
OK
out2.custno out2.firstname out2.age out2.profession out2.amount out2.product out2.level
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories old
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves middle
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories middle
4002199 Keith 44 Police officer 198.19 Gymnastics Rings middle
4002613 Hugh 43 Engineering technician 98.81 Field Hockey middle
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking old
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles old
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes old
4007361 Terri 52 Loan officer 10.44 Snowmobiling old
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping old
Time taken: 0.112 seconds, Fetched: 10 row(s)

11) create a final table for displaying the results.

hive> create table out3 (level string, amount double)
    > row format delimited fields terminated by ',';

hive> insert overwrite table out3
    > select level,sum(amount) from out2 group by level;

hive> select * from out3;
OK
out3.level out3.amount
low 725221.3399999988
middle 1855861.669999996
old 2529100.310000011

*********************************************************

BigData Hive Concepts

Hive Concepts

- Hive is a big data query language.
- SQL interface to Hadoop - Hive Query Language (HiveQL)
- Hive is a data warehouse
- Schema on write - RDBMS -> if you create a table with 5 columns, the data you insert must have exactly those 5 columns.
- Schema on read - data warehouse (Hive) -> if you create a table with 5 columns, you can still load data with 10 columns; the schema is applied only when the data is read.
- Hive does not support indexing or stored procedures, but it does support views.


[cloudera@quickstart ~]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.679 seconds, Fetched: 3 row(s)

hive> create database feb2020;

hive> use feb2020;

hive> show tables;
OK
Time taken: 0.177 seconds

hive> create database mar2020;
OK
Time taken: 0.564 seconds
hive> use mar2020;
OK
Time taken: 0.02 seconds
hive> show tables;
OK
Time taken: 0.139 seconds

Create a table
hive> create table students_mar(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.101 seconds

Insert or load data into the table
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, totalSize=106]
OK
Time taken: 0.508 seconds

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

Describe the table
hive> describe formatted students_mar;
OK
# col_name            data_type            comment

name                  string
location              string
phone                 int
course                string

# Detailed Table Information
Database:             mar2020
Owner:                cloudera
CreateTime:           Mon Mar 09 18:49:18 PDT 2020
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://quickstart.cloudera:8020/user/hive/warehouse/mar2020.db/students_mar
Table Type:           MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   true
        numFiles                1
        totalSize               106
        transient_lastDdlTime   1583805314

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:          org.apache.hadoop.mapred.TextInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
        field.delim             ,
        serialization.format    ,
Time taken: 0.344 seconds, Fetched: 33 row(s)

Hive folder structure in HDFS

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/
Found 4 items
drwxrwxrwx   - cloudera supergroup          0 2020-02-12 17:36 /user/hive/warehouse/feb2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-01-21 10:24 /user/hive/warehouse/jan2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:49 /user/hive/warehouse/mar2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-02-01 09:49 /user/hive/warehouse/retail.db
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

[cloudera@quickstart ~]$ 

Load the same file 2nd time

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=2, totalSize=212]

Data gets appended
hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.07 seconds, Fetched: 8 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 2 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:42 /user/hive/warehouse/mar2020.db/students_mar/students_hive_copy_1.txt
[cloudera@quickstart ~]$ 

Replace/Overwrite the data 
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=106, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.098 seconds, Fetched: 4 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:49 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
[cloudera@quickstart ~]$ 

Schema on read
Load a sample file (students_new.txt) that has more columns than the table.

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_new.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=162, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
NULL NULL NULL
Time taken: 0.067 seconds, Fetched: 5 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        162 2020-03-17 10:57 /user/hive/warehouse/mar2020.db/students_mar/students_new.txt

Managed Table
- Both the table's data and its metadata are managed by Hive. By default, all tables are managed tables.
- Drop table - the table definition is deleted and the data is removed along with it.

External Table
- Data is read from an external directory; you can point the table at files that already exist at any location.
- Drop table - only the table definition is deleted; the data remains in the external table location.

hive> create external table students_ext(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ','
    > location '/MARCH';

[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 9 items
drwxr-xr-x   - cloudera supergroup          0 2020-02-14 22:06 /FEBRUARY
drwxr-xr-x   - cloudera supergroup          0 2020-01-21 19:51 /JANUARY
drwxr-xr-x   - cloudera supergroup          0 2020-03-17 11:08 /MARCH
drwxrwxrwx   - hdfs     supergroup          0 2017-10-23 09:15 /benchmarks
drwxr-xr-x   - hbase    supergroup          0 2020-01-19 20:06 /hbase
drwxr-xr-x   - solr     solr                0 2017-10-23 09:18 /solr
drwxrwxrwt   - hdfs     supergroup          0 2019-05-01 17:48 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /user
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /var

hive> select * from students_ext;
OK
Time taken: 0.053 seconds

[cloudera@quickstart ~]$ hdfs dfs -put file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt /MARCH
[cloudera@quickstart ~]$ 

hive> select * from students_ext;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

Job History Server link
http://quickstart.cloudera:19888/jobhistory/job/job_1579493137658_0001

Hadoop Cluster Applications link
http://quickstart.cloudera:8088/cluster/apps/

Partitioning 
When a table has many files and the user is interested in only part of the data, a query has to scan all the files to produce the result, which is a very time-consuming process. With partitioning, the query reads only the relevant files (partitions) of the table, so it runs much faster.

Eg: retail/sales/month.txt
When the user queries "select * from sales where month='May';"

Create a partition based on the month name.
Eg: retail/sales/MAY/month.txt

- Partitioning is not mandatory.
- The choice of partition column depends entirely on the queries and the data (e.g., if the WHERE clause often filters on country='USA', we need to define a partition on country as well).
- Partitioning on a unique/primary-key column is not practical; we use bucketing for that.
Eg: select * from sales where CustomerNo='12345';

Partitioning Types
Dynamic - you tell Hive which column to partition on; when the data is loaded, Hive automatically creates the partitions from the values of that column.
Eg: retail/sales/MAY/month.txt

Static - you define each partition yourself and load the data into it explicitly.

Partitioning Modes
Strict mode - Hive only allows queries that filter on the partition column.
select * from sales where month='May'  -> succeeds
select * from sales where country='USA' -> fails

Non-strict mode - Hive allows all queries.
select * from sales where month='May'  -> succeeds
select * from sales where country='USA' -> succeeds
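To make this concrete, here is a minimal HiveQL sketch of a month-partitioned sales table with one static-partition load and one dynamic-partition insert. The table names (sales_part, sales) and the file path are hypothetical illustrations, not part of the project files above.

-- partitioned table: the partition column (month) is declared separately from the regular columns
create table sales_part(txnno int, custno int, amount double)
partitioned by (month string)
row format delimited fields terminated by ',';

-- static partitioning: you name the partition yourself while loading
load data local inpath 'file:///home/cloudera/sales_may.txt'
into table sales_part partition (month='MAY');

-- dynamic partitioning: Hive derives the partitions from the data (the partition column goes last in the select)
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table sales_part partition (month)
select txnno, custno, amount, month from sales;

Each partition becomes its own subdirectory under the table location, e.g. .../sales_part/month=MAY/, which is what allows Hive to skip files that cannot match the WHERE clause.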

Bucketing or Clustering
Partitioning on a unique/primary-key column is not practical; we use bucketing for that.

- You tell Hive you want a fixed number (x) of buckets based on a column.
Eg: I want 10 buckets based on CustomerNo.
- Buckets are stored as files: the rows are hashed on the bucket column into x files inside the table (or partition) directory.
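A minimal HiveQL sketch of bucketing, reusing the txnrecords table from the project above (the bucketed table name txn_bucketed is just an illustrative choice):

-- 10 buckets on custno: rows are hashed on custno into 10 bucket files
create table txn_bucketed(custno int, amount double, product string)
clustered by (custno) into 10 buckets
row format delimited fields terminated by ',';

set hive.enforce.bucketing=true;  -- needed on older Hive releases so inserts honour the declared bucket count
insert overwrite table txn_bucketed
select custno, amount, product from txnrecords;

Bucketing on a high-cardinality column like CustomerNo is what makes table sampling and bucketed map joins efficient.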




Saturday, January 11, 2020

BigData PIG Practicals

BigData PIG Practicals
  • LOAD
  • FILTER
  • FOREACH ... GENERATE
  • SPLIT
  • GROUP
  • JOIN
  • DESCRIBE
  • EXPLAIN
  • ILLUSTRATE
  • DUMP
> pig -x local
> pig -x local [script]
> pig -x hadoop [script]

[cloudera@quickstart ~]$ pwd
/home/cloudera
[cloudera@quickstart ~]$ pig -x local
grunt>

/home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/students.txt
grunt> A = load 'Desktop/Basha/Basha2019/PIG_Practicals/students.txt';
grunt> describe A;
Schema for A unknown.
grunt> dump A;
(John,21,2.89)
(Sally,19,2.56)
(Alice,22,3.76)
(Doug,19,1.98)
(Susan,26,3.25)
(John,35,5.00)
(Doug,40,3.50)
(Alice,22,5.25)

grunt> A = load 'Desktop/Basha/Basha2019/PIG_Practicals/students.txt' AS (name:chararray, age:int, gpa:float);
grunt> describe A;
A: {name: chararray,age: int,gpa: float}
grunt> dump A;
(John,21,2.89)
(Sally,19,2.56)
(Alice,22,3.76)
(Doug,19,1.98)
(Susan,26,3.25)
(John,35,5.0)
(Doug,40,3.5)
(Alice,22,5.25)

grunt> R = filter A by (age>=20);
grunt> dump R;
(John,21,2.89)
(Alice,22,3.76)
(Susan,26,3.25)
(John,35,5.0)
(Doug,40,3.5)
(Alice,22,5.25)

grunt> R = filter A by (age>=20) and (gpa>=3.5);
grunt> dump R;
(Alice,22,3.76)
(John,35,5.0)
(Doug,40,3.5)
(Alice,22,5.25)

grunt> illustrate R;
---------------------------------------------------------
| A     | name:chararray    | age:int    | gpa:float    |
---------------------------------------------------------
|       | John              | 35         | 5.0          |
|       | John              | 21         | 2.89         |
|       | Doug              | 19         | 1.98         |
---------------------------------------------------------
---------------------------------------------------------
| R     | name:chararray    | age:int    | gpa:float    |
---------------------------------------------------------
|       | John              | 35         | 5.0          |
---------------------------------------------------------

grunt> F = foreach A generate age,gpa;
grunt> dump F;
(21,2.89)
(19,2.56)
(22,3.76)
(19,1.98)
(26,3.25)
(35,5.0)
(40,3.5)
(22,5.25)

grunt> G = group A by age;
grunt> dump G;
(19,{(Doug,19,1.98),(Sally,19,2.56)})
(21,{(John,21,2.89)})
(22,{(Alice,22,5.25),(Alice,22,3.76)})
(26,{(Susan,26,3.25)})
(35,{(John,35,5.0)})
(40,{(Doug,40,3.5)})
grunt> describe G;
G: {group: int,A: {(name: chararray,age: int,gpa: float)}}

grunt> H = foreach G generate group,A.name;
grunt> dump H;
(19,{(Doug),(Sally)})
(21,{(John)})
(22,{(Alice),(Alice)})
(26,{(Susan)})
(35,{(John)})
(40,{(Doug)})


grunt> store A into 'Desktop/Basha/Basha2019/PIG_Practicals/outputdir';
Input(s):
Successfully read records from: "file:///home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/students.txt"
Output(s):
Successfully stored records in: "file:///home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/outputdir"

grunt> store H into 'Desktop/Basha/Basha2019/PIG_Practicals/outputdir2' using PigStorage('|');
Input(s):
Successfully read records from: "file:///home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/students.txt"
Output(s):
Successfully stored records in: "file:///home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/outputdir2"
19|{(Doug),(Sally)}
21|{(John)}
22|{(Alice),(Alice)}
26|{(Susan)}
35|{(John)}
40|{(Doug)}


[cloudera@quickstart ~]$ pig -x local Desktop/Basha/Basha2019/PIG_Practicals/students.pig
Input(s):
Successfully read records from: "file:///home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/students.txt"
Output(s):
Successfully stored records in: "file:///home/cloudera/Desktop/Basha/Basha2019/PIG_Practicals/outputdir3"
19|{(Doug),(Sally)}
21|{(John)}
22|{(Alice),(Alice)}
26|{(Susan)}
35|{(John)}
40|{(Doug)}

Wednesday, January 1, 2020

BigData Pig Concepts

BigData PIG Concepts

- It is a Data flow language.
- PIG allows us to describe how datasets are filtered, combined, split and delivered from source to final destination.
- PIG commands are internally translated into MapReduce jobs.
  • LOAD
  • FILTER
  • FOREACH ... GENERATE
  • SPLIT
  • GROUP
  • JOIN
- PIG works on any source of tuples (unlike SQL, where the schema must be specified before the data is loaded).
- PIG can use complex, nested data structures.
- PIG supports streaming of data and bulk reads + writes, and describes a series of operations.

- A PIG script defines a logical plan for executing a workflow. No actual data is read until execution time; see the sketch below.
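As a small illustration using the students.txt file from the practicals above, the first two lines only build the logical plan; data is read from disk only when DUMP (or STORE) runs:

A = load 'Desktop/Basha/Basha2019/PIG_Practicals/students.txt' AS (name:chararray, age:int, gpa:float);  -- logical plan only, nothing read yet
R = filter A by age >= 20;    -- still nothing read
dump R;                       -- execution happens here; Pig translates the plan into a MapReduce (or local) job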

PIG data structures:

Relation (DB) concepts:

Relation operations: Relational algebra contains a set of operations that transform one or more relations into other relations.
  • Selection
  • Projection
  • Cartesian Product
  • Extended Projection
  • Aggregation



Query Optimizer: A relational database query optimizer considers all the operations needed to produce the result and finds the most efficient plan to compute it.

BigData Python Concepts


BigData Python Concepts

- Simple to understand, Easy to learn
- Supports Interactive mode & Script mode
- Interpreted Language
- Case Sensitive, Dynamically Typed
- Object oriented structure
- Platform independent
- Open source
- It has a huge number of libraries/packages
- Very useful in Data Science

Important Packages for BigData programming :
  • NumPy
  • pandas
  • SciPy
  • Statsmodels
  • Scikits
  • matplotlib
  • BeautifulSoup

IPython notebook - IPython is a software package in Python that serves as a research notebook. By using IPython, we can write notes as well as perform data analytics in the same file. i.e., write the code and run it from within the notebook itself.

Indentation
  • Each statement ends with the line itself; there is no end-of-line terminator in Python.
  • Indentation = whitespace at the beginning of the line.
  • Statements that go together must have the same indentation. Each such set of statements is called a block.
Variables and Data Structures :
Built-in data types: Integer, String, Float, Boolean, Date and Time.
Additional data structures: Lists, Tuples, Dictionary

String Exercises :
course = 'Python for Beginners'
print (course)
print (course[0])
print (course[-1])
print (course[2:5])
print (course[2:])
print (course[-10:])
print (course * 2)
print (course + "TEST")

# Different String functions
print(course.replace("Python", "Jython"))
print(type(course))
print(len(course))
print(course.upper())
print(course.lower())
print(course.count('n'))
print(course.split(' '))
print(course.swapcase())
print(course.strip())
print(course.lstrip())
print(course.rstrip())
print(":".join(course))
print('B' in course)
print(course.find('g'))

empno = input("Enter EMP NO : ")
print(int(empno) + 10)

Lists - Collection of elements of different data types. 
         - List contains items separated by commas and enclosed within square brackets.

Lists Exercises :
emp_list = ['E100','Ravi',1000.00,'Hyderabad','D100']
dept_list = ['D100','Accounts','Delhi']
print(emp_list)
print(dept_list)
print(emp_list[0])
print(emp_list[2:5])
print(emp_list[2:])
print(dept_list * 2)
print(emp_list + dept_list)

print(len(emp_list))
emp_list.append(25)
emp_list.insert(0,'SNO1')
emp_list.extend(['boy1','boy2'])
print(emp_list.index('E100'))
print(emp_list)
emp_list.remove('boy1')
emp_list.pop(1)
print(emp_list)
employees_age_list = [25,20,35,32,21,28,38,45,23,33]
employees_age_list.sort()
print(employees_age_list)
print(len(employees_age_list))
print(max(employees_age_list))
print(min(employees_age_list))
employees_name_list = ['Ravi','Aditya','Giri','Mohanbabu','Madhu']
print(sorted(employees_name_list, key=len))

Dictionary - consists of key-value pairs, enclosed within curly braces.
                   - Keys can be almost any Python type, but are usually numbers or strings.
                   - Values can be arbitrary Python objects.

Dictionary Exercises :
# Dictionary exercise
emp_dictionary = {}
print(emp_dictionary)
emp_dictionary['Eno'] = 'E101'
emp_dictionary[2] = 'Ename'
dept_dictionary = {'dno':101,'dname':'Account','dloc':'Delhi'}
print(emp_dictionary['Eno'])
print(emp_dictionary[2])
print(dept_dictionary)
print(dept_dictionary.keys())
print(dept_dictionary.values())
# empno = input("Enter EMP NO : ")
# print(int(empno) + 10)

Conditions - Python supports if-else, for, and while constructs.

IF-Else Exercises:
# IF condition exercise
emp_age = int(input('Enter Your Age : '))
if(emp_age >= 25 and emp_age <= 30):
    print('You are eligible for Fresher post')
elif(emp_age < 25):
    print('You are not eligible for the Test')
elif(emp_age > 30 and emp_age < 60):
    print('Your are eligible for Experience post')
else:
    print('Invalid Entry')

emp_name_list = ['Ravi','Giri','John']
emp_name = input('Enter Employee Name : ')
if(emp_name in emp_name_list):
    print(f'{emp_name} is present in the emp_name_list')
else:
    print(f'{emp_name} is not present in the emp_name_list')

WHILE Exercises:
# While condition exercise
emp_count = 0
while(emp_count <= 10):
    print(emp_count)
    emp_count += 1
print('End of While loop')

FOR Exercises:
# For condition exercise
for dept_no in ('d100','d101','d102'):
    print(dept_no)

for emp_age in range (20,30,2):
    print(emp_age)

emp_age = [22, 24, 25, 32, 35, 41]
total_emp_age = 0
for age in emp_age:
    total_emp_age += age
print(total_emp_age)

emp_sno = [1, 3, 5, 7, 8]
sum_emp_sno = [i+2 for i in emp_sno]
print(sum_emp_sno)

get_emp_sno = [i+2 for i in emp_sno if i<5]
print(get_emp_sno)

for i in range(1,10):
    if(i==5):
        break
    print(i)
print('Done')

File System in Python - Python supports a number of formats for file reading and writing.

- To open a file, use the open() method, specifying the file name and the mode of opening (read, write, append, etc.).
- open() returns a file handle.
- handle = open(filename, mode)
- Finally, we need to close the file using the close() method (otherwise other programs might not be able to access the file).

File Exercises:
# Files - open & close exercise
emp_file = open('C:/Python_Testing/file1.txt','r')
#for line in emp_file:
#    print(line)
print('File name is ',emp_file.name)
test = emp_file.read()
print(test)
emp_file_out = open('C:/Python_Testing/file2.txt', 'w')
#emp_file_out.seek(0,0)
print(emp_file_out.write(test))
emp_file_out.close()
emp_file.close()
We can access any type of file in Python. For example, we can convert a SAS file to a text file and access the data; Python has specific packages for most file types.
import sas7bdat
from sas7bdat import *
# To convert a SAS file to a text file.
data = SAS7BDAT('C:/Python_Testing/e.sas7bdat')
data.convertFile('C:/Python_Testing/e_data.txt', '\t')


Functions - a reusable piece of software.
A block of statements
- that accepts some arguments (a function can accept any number of arguments),
- performs some functionality and provides the output,
- and is defined using the def keyword.
- The scope of variables defined inside a function is local; these variables can't be used outside the function.

Functions Exercises:
# Function exercise
def sayHello():
    print('Hello World !!!')
sayHello()

def printMax(a,b):
    if(a > b):
        print(a, 'is Max value')
    elif(a == b):
        print(a, 'is equal to', b)
    else:
        print(b, 'is Max value')
printMax(5,3)

def hello(message, times=1):
    print(message * times)
hello('Welcome', 5)
hello('Hi')

def varfunc(a, b=5, c=10):
    print('a is', a, 'and b is', b, 'and c is', c)
varfunc(3,4)
varfunc(1,2,3)
varfunc(10,c=12)
varfunc(c=15,a=10)

x = 20
def varscopefunc(x):
    print('x value is ', x)
    x = 2
    print('x local value changed to ', x)
varscopefunc(x)
print('x value is not changed', x)

Modules - Functions can be reused within the same program. If you want to use functions outside the program in which they are defined, you can do that with modules.

- A module is nothing but a package of functions.
- Modules can be imported into other programs, and the functions contained in those modules can be used there.
- To create a module, create a .py file with the functions defined in that file.
- Python has a huge list of pre-defined modules; we just need to reuse them via import.

Modules Exercises:
def sayHi():
    print('This is mymodule function')
    return

# Factorial Program
def factorial(number):
    product = 1
    for i in range(number):
        product = product * (i + 1)
    print(product)
    return product
# C:\Users\khasi\AppData\Local\Programs\Python\Python37-32\Lib\mymodule.py


from mymodule import *
print(sayHi())

numb = int(input('Enter a non-negative number : '))
num_factorial = factorial(numb)
print(num_factorial)

Main function:
import sys
# print('First Line')
def main():
    print('Hello World!!!', sys.argv[0])
if __name__ == '__main__':
    main()

Exception Handling - A program will terminate abruptly if you don't handle exceptions at run time. Exceptions are handled in Python using try-except and try-except-finally blocks.

finally is an optional block; it is executed whether or not an exception occurs in the program.

Exception Handling Exercises:
def avg(numlist):
    ''' raise TypeError or ZeroDivisionError Exceptions.'''
    sum = 0
    for num in numlist:
        sum = sum + num
    return float(sum)/len(numlist)

def avgReport(numlist):
        try:
            m = avg(numlist)
            print('Avg is =',m)
        except TypeError:
            print('Type Error')
        except ZeroDivisionError:
            print('Zero Division Error')

list1=[10,20,30,40]
list2=[]
list3=[10,20,30,'abc']

avgReport(list1)
print(avgReport(list2))
print(avgReport(list3))


def avg(numlist):
    ''' raise TypeError or ZeroDivisionError Exceptions.'''
    sum = 0
    for num in numlist:
        sum = sum + num
    return float(sum)/len(numlist)

def avgReport(numlist):
        try:
            m = avg(numlist)
            print('Avg is =',m)
        except TypeError:
            print('Type Error')
        except ZeroDivisionError:
            print('Zero Division Error')
        finally:
            print('Finished avg program')

list1=[10,20,30,40]
list2=[]
list3=[10,20,30,'abc']

avgReport(list2)

Tuesday, December 31, 2019

BigData MapReduce Concept

MapReduce

MapReduce is a programming model to process large datasets in parallel.
MapReduce divides the task into subtasks and handles them in parallel.
Input and output are always in key-value format.

  • Map
  • Reducer

Map - you write a program that produces local (key, value) pairs.
Eg: if your data (say the values x, y, z) is spread over 3 data nodes,
after the Map phase you get key-value pairs, i.e., the data item is the key and its count is the value:
DataNode1 o/p => (x,3),(y,3),(z,4)
DataNode2 o/p => (x,3),(y,3),(z,4)
DataNode3 o/p => (x,4),(y,4),(z,3)

Shuffle - Single key and all the values of that key are brought together. This will happen automatically.
Eg:- After the Shuffle phase, for each key, the value from all the data nodes will be accumulated.
(x,(3,3,4))
(y,(3,3,4))
(z,(4,4,3))

Reducer - you write a program that reads the keys from the Shuffle phase and sums the values of each key. The output is again a key-value pair.
Eg: after the Reducer phase, you get the following output:
(x,10)
(y,10)
(z,11)
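To make the three phases concrete, below is a small word-count sketch in Python, written in the Hadoop Streaming style (reading from stdin, writing key<TAB>value to stdout). The file name wordcount.py and the map/reduce command-line switch are illustrative choices, not part of the course material.

#!/usr/bin/env python
# wordcount.py - a Hadoop Streaming style sketch of the Map and Reduce steps above.
import sys

def mapper():
    # Map: emit a local (key, value) pair "word<TAB>1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%d' % (word, 1))

def reducer():
    # Reduce: input arrives sorted by key (the shuffle), so all values of a key are adjacent;
    # sum them and emit one final (key, value) pair per key.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.strip().split('\t')
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print('%s\t%d' % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print('%s\t%d' % (current_word, current_count))

if __name__ == '__main__':
    # run as "python wordcount.py map" or "python wordcount.py reduce"
    reducer() if len(sys.argv) > 1 and sys.argv[1] == 'reduce' else mapper()

Locally the whole pipeline can be simulated with "cat input.txt | python wordcount.py map | sort | python wordcount.py reduce"; the sort step plays the role of the shuffle, while on a cluster Hadoop Streaming would run the two phases as separate mapper and reducer scripts.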

BigData Concepts

BigData Concepts

- Big Data is a massive volume of both structured and unstructured data.
- Big Data is characterized by the 3 V's: Volume, Velocity and Variety
                 (Volume - how much data is being generated)
                 (Velocity - at what pace/speed the data is being generated)
                 (Variety - unstructured data, e.g. picture data, video data, log data, ...)
- The problems of Big Data are:
                 storage of large volumes of data,
                 processing of large volumes of data,
                 resource management.

Hadoop

- It is an infrastructure: software by means of which we can solve the Big Data problems.
  • HDFS - Distributed File System - solves the problem of Storage.
                                         Horizontal scaling
         Distributed DB = Partitioning (3GB = 3 * 1GB machines) + Replication (making multiple copies of the same data in different places - default replication factor is 3 - fault tolerance)
  • MapReduce - Distributed Computing Engine - solves the problem of Processing.
                                         Shared-nothing architecture
  • YARN - Cluster/Resource Manager - solves the problem of Resource Management.
                                                    Scheduling & Coordination

Data Ecosystems in Enterprise:
Hadoop is an additional layer that allows you to process Big Data in the ecosystem, which was not possible before because of many limitations. Hadoop co-exists with existing technologies/systems and works with them.

Hadoop:
- Inexpensive commodity hardware
- Free Open Source software
- Scalable
- Reliable
- Enable data archival and reporting
- Enable cutting edge analytics

Hadoop Features:
- Designed to store large files (huge data).
- Processes the data sequentially, so it is good for analyzing entire datasets quickly.
- Runs large batch processes that may take several hours and keeps running despite partial failure of the cluster.
- Handles scheduling & coordination very well.
- Can store and process unstructured data (log files, text, images, audio, video, ...).

Hadoop is not designed for:
- Processing small files.
- Random-access data retrieval; it should not be used to run transactional applications.
- Interactive querying; most jobs will take at least several minutes to process.
- Modifying data in place. It is a write-once, read-many-times system; new data can only be appended to a file.
- Meeting ACID standards, 3NF, or data-quality guarantees the way relational databases do.
- Hadoop is not a replacement for your existing database systems!

Enterprise = Hadoop (bulk storage for analytics) + RDBMS (for business operations) + NoSQL DB (for running a website)

Hadoop EcoSystem:

Hadoop = HDFS + YARN + MapReduce


MapReduce - A programming framework for parallel processing of data. Hadoop coordinates execution throughout the cluster.

Pig and Hive - These tools allow the user to write data processing programs. Internally those commands are translated into MapReduce jobs.

Cluster - A set of host machines; it's the hardware infrastructure.

YARN - Resource Manager + Node Manager

Resource Manager (like a project manager) - manages all the resources; one per cluster (installed on the master machine). It monitors the resources at regular intervals; if any resource stops working, it arranges a backup for that resource.
- The Resource Manager does not store any data.
Node Manager (like a project developer) - one per machine (installed on all slave machines). Using a heartbeat, it reports to the Resource Manager that its machine is working fine, and it provides the resource status (memory, storage space, ...) of the machine to the Resource Manager. A container is an allocation of resources (memory, CPU) on a machine in which a task runs.

HDFS - Hadoop Distributed File System on top of UNIX file system.

HDFS gets the resource information from YARN.
- When a user wants to store a file (say 1TB) in HDFS, HDFS uses that resource information to split the file into small pieces/blocks and store them on different machines.
- When the user wants the file (1TB) back, HDFS collects all the pieces, combines them, and gives the file back to the user.

- HDFS also replicates the received file (e.g., 1TB) and stores the copies on different machines for backup and fault tolerance. The default replication factor is 3.

HDFS = like YARN, it has a master/slave architecture.
NameNode = Master = installed only on the master machine = does not store any data. Its job is to manage where the different blocks of data are stored on the slave machines; it holds all the metadata (which data blocks are stored where, the folder structure, etc.).
- It keeps the metadata in memory for faster access.
- The NameNode has a backup called the Secondary NameNode, used in case the primary NameNode fails.
- The replication factor is stored by the NameNode, but the actual replication is done by the DataNodes; the NameNode re-replicates a DataNode's blocks in the event of failure.
DataNode = Slave = installed on all slave machines = stores all the data in different blocks.

HDFS data blocks = a large file is split into small chunks/blocks = each chunk/block is 128MB in size (by default).

- Hadoop can easily calculate how many blocks fit on a node, and these blocks are large enough to read quickly from disk.
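As a quick worked example of the numbers involved (assuming the default 128 MB block size and replication factor 3): a 1 GB file is split into 1024 MB / 128 MB = 8 blocks, and with 3 replicas of each block the cluster keeps 24 block copies, i.e. roughly 3 GB of raw disk space for 1 GB of user data.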