Tuesday, April 14, 2020

Hive Project work


BigData Hive Project work

1) create a database
hive> create database RETAIL;

2) show the databases
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.023 seconds, Fetched: 3 row(s)

3) use the Retail database
hive> use RETAIL;

4) show the tables in Retail database
hive> show tables;
OK

5) create a table
hive> create table TXNRECORDS(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING)
    > row format delimited fields terminated by ',' stored as textfile;

6) load data into a table
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/txns1.txt' into table TXNRECORDS;

hive> show tables;
OK
txnrecords
Time taken: 0.021 seconds, Fetched: 1 row(s)
hive>

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/retail.db/txnrecords
Found 1 items
-rwxrwxrwx   1 cloudera supergroup    4418139 2020-02-01 09:51 /user/hive/warehouse/retail.db/txnrecords/txns1.txt

hive> select * from txnrecords limit 10;
OK
0 06-26-2011 4007024 40.33 Exercise & Fitness Cardio Machine Accessories Clarksville Tennessee credit
1 05-26-2011 4006742 198.44 Exercise & Fitness Weightlifting Gloves Long Beach California credit
2 06-01-2011 4009775 5.58 Exercise & Fitness Weightlifting Machine Accessories Anaheim California credit
3 06-05-2011 4002199 198.19 Gymnastics Gymnastics Rings Milwaukee Wisconsin credit
4 12-17-2011 4002613 98.81 Team Sports Field Hockey Nashville  Tennessee credit
5 02-14-2011 4007591 193.63 Outdoor Recreation Camping & Backpacking & Hiking Chicago Illinois credit
6 10-28-2011 4002190 27.89 Puzzles Jigsaw Puzzles Charleston South Carolina credit
7 07-14-2011 4002964 96.01 Outdoor Play Equipment Sandboxes Columbus Ohio credit
8 01-17-2011 4007361 10.44 Winter Sports Snowmobiling Des Moines Iowa credit
9 05-17-2011 4004798 152.46 Jumping Bungee Jumping St. Petersburg Florida credit
Time taken: 0.076 seconds, Fetched: 10 row(s)

7) count the number of rows in the table.
- Hive will launch a MapReduce job and return the result at the end.

hive> select count(*) from txnrecords;
Query ID = cloudera_20200317115050_d45ae0eb-3662-4392-a4e2-f3e247368e52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1579493137658_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1579493137658_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1579493137658_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-03-17 11:51:00,725 Stage-1 map = 0%,  reduce = 0%
2020-03-17 11:51:15,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.99 sec
2020-03-17 11:51:28,713 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.69 sec
MapReduce Total cumulative CPU time: 3 seconds 690 msec
Ended Job = job_1579493137658_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.69 sec   HDFS Read: 4426362 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 690 msec
OK
50000
Time taken: 55.465 seconds, Fetched: 1 row(s)

8) Create a table for Customer Data
hive> create table customer(custno string, firstname string, lastname string, age int, profession string)
    > row format delimited fields terminated by ',';

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/custs.txt' into table customer;
Loading data to table retail.customer
Table retail.customer stats: [numFiles=1, totalSize=391355]
OK
Time taken: 0.308 seconds

hive> select * from customer limit 10;
OK
4000001 Kristina Chung 55 Pilot
4000002 Paige Chen 74 Teacher
4000003 Sherri Melton 34 Firefighter
4000004 Gretchen Hill 66 Computer hardware engineer
4000005 Karen Puckett 74 Lawyer
4000006 Patrick Song 42 Veterinarian
4000007 Elsie Hamilton 43 Pilot
4000008 Hazel Bender 63 Carpenter
4000009 Malcolm Wagner 39 Artist
4000010 Dolores McLaughlin 60 Writer
Time taken: 0.065 seconds, Fetched: 10 row(s)

hive> select count(*) from customer;
Query ID = cloudera_20200317120808_7ae1124b-a05f-43fa-bbbf-42cee5f14e71
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1579493137658_0002, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1579493137658_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1579493137658_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-03-17 12:08:12,143 Stage-1 map = 0%,  reduce = 0%
2020-03-17 12:08:22,200 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.66 sec
2020-03-17 12:08:32,024 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.3 sec
MapReduce Total cumulative CPU time: 3 seconds 300 msec
Ended Job = job_1579493137658_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.3 sec   HDFS Read: 398974 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 300 msec
OK
9999
Time taken: 32.407 seconds, Fetched: 1 row(s)
9) create a join table
hive> create table out1(custno string, firstname string, age int, profession string, amount double, product string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.106 seconds

hive> insert overwrite table out1
    > select a.custno,a.firstname,a.age,a.profession,b.amount,b.product
    > from customer a JOIN txnrecords b ON a.custno=b.custno;
Query ID = cloudera_20200412100909_d6678e43-b571-4cdc-ad33-0ae36178bdfd
Total jobs = 1

hive> select * from out1 limit 10;
OK
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories
4002199 Keith 44 Police officer 198.19 Gymnastics Rings
4002613 Hugh 43 Engineering technician 98.81 Field Hockey
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes
4007361 Terri 52 Loan officer 10.44 Snowmobiling
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping
Time taken: 0.08 seconds, Fetched: 10 row(s)

Displaying the header of the records:
hive> set hive.cli.print.header=true;
hive> select * from out1 limit 10;
OK
out1.custno out1.firstname out1.age out1.profession out1.amount out1.product
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories
4002199 Keith 44 Police officer 198.19 Gymnastics Rings
4002613 Hugh 43 Engineering technician 98.81 Field Hockey
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes
4007361 Terri 52 Loan officer 10.44 Snowmobiling
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping
Time taken: 0.085 seconds, Fetched: 10 row(s)

10) create out2 table

hive> create table out2 (custno int, firstname string, age int, profession string, amount double, product string, level string)
    > row format delimited fields terminated by ',';

Categorize based on Age :

hive> insert overwrite table out2 
    > select *, case
    > when age < 30 then 'low'
    > when age >= 30 and age < 50 then 'middle'
    > when age >= 50 then 'old'
    > else 'others'
    > end
    > from out1;

hive> select * from out2 limit 10;
OK
out2.custno out2.firstname out2.age out2.profession out2.amount out2.product out2.level
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories old
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves middle
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories middle
4002199 Keith 44 Police officer 198.19 Gymnastics Rings middle
4002613 Hugh 43 Engineering technician 98.81 Field Hockey middle
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking old
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles old
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes old
4007361 Terri 52 Loan officer 10.44 Snowmobiling old
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping old
Time taken: 0.112 seconds, Fetched: 10 row(s)

11) create a final table for displaying the results.

hive> create table out3 (level string, amount double)
    > row format delimited fields terminated by ',';

hive> insert overwrite table out3
    > select level,sum(amount) from out2 group by level;

hive> select * from out3;
OK
out3.level out3.amount
low 725221.3399999988
middle 1855861.669999996
old 2529100.310000011

*********************************************************

BigData Hive Concepts

Hive Concepts

- Hive is a big data query tool: an SQL-like interface to Hadoop (Hive Query Language, HiveQL).
- Hive is a data warehouse.
- Schema on write - RDBMS -> create a table with 5 columns and you must insert exactly 5 columns of data.
- Schema on read - Data warehouse -> create a table with 5 columns and you can still load data with 10 columns; the schema is applied only when the data is read.
- Hive does not support indexing & stored procedures the way an RDBMS does, but it does support views.
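As a minimal sketch of schema on read (the table name and file path here are hypothetical, not part of the project): Hive applies the schema only when the data is read, so a file whose lines carry extra fields still loads, and the extra fields are simply ignored.

```sql
-- Hypothetical illustration of schema on read.
-- The table declares 2 columns, but each line of the input file
-- may carry 3 comma-separated fields.
create table demo(name string, city string)
row format delimited fields terminated by ',';

-- The load succeeds even if a line is "John,Bangalore,BigData";
-- select returns only name and city, and the third field is dropped.
-- A line with fewer fields would come back with NULLs instead.
load data local inpath 'file:///tmp/demo.txt' into table demo;
select * from demo;
```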


[cloudera@quickstart ~]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.679 seconds, Fetched: 3 row(s)

hive> create database feb2020;

hive> use feb2020;

hive> show tables;
OK
Time taken: 0.177 seconds

hive> create database mar2020;
OK
Time taken: 0.564 seconds
hive> use mar2020;
OK
Time taken: 0.02 seconds
hive> show tables;
OK
Time taken: 0.139 seconds

create Table
hive> create table students_mar(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.101 seconds

insert data into Table or Load
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, totalSize=106]
OK
Time taken: 0.508 seconds

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

describe Table
hive> describe formatted students_mar;
OK
# col_name            data_type            comment

name                  string
location              string
phone                 int
course                string

# Detailed Table Information
Database:             mar2020
Owner:                cloudera
CreateTime:           Mon Mar 09 18:49:18 PDT 2020
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://quickstart.cloudera:8020/user/hive/warehouse/mar2020.db/students_mar
Table Type:           MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	true
	numFiles             	1
	totalSize            	106
	transient_lastDdlTime	1583805314

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:          org.apache.hadoop.mapred.TextInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
	field.delim          	,
	serialization.format 	,
Time taken: 0.344 seconds, Fetched: 33 row(s)

Hive folder structure in HDFS

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/
Found 4 items
drwxrwxrwx   - cloudera supergroup          0 2020-02-12 17:36 /user/hive/warehouse/feb2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-01-21 10:24 /user/hive/warehouse/jan2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:49 /user/hive/warehouse/mar2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-02-01 09:49 /user/hive/warehouse/retail.db
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

[cloudera@quickstart ~]$ 

Load the same file a second time

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=2, totalSize=212]

Data gets appended
hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.07 seconds, Fetched: 8 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 2 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:42 /user/hive/warehouse/mar2020.db/students_mar/students_hive_copy_1.txt
[cloudera@quickstart ~]$ 

Replace/Overwrite the data 
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=106, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.098 seconds, Fetched: 4 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:49 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
[cloudera@quickstart ~]$ 

Schema On-Read
Take a sample file (students_new.txt) that has more columns than the table.

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_new.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=162, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
NULL NULL NULL
Time taken: 0.067 seconds, Fetched: 5 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        162 2020-03-17 10:57 /user/hive/warehouse/mar2020.db/students_mar/students_new.txt

Managed Table
- Both the data and the metadata are managed by Hive. By default, all tables are Managed Tables.
- Drop table - The table definition is deleted & the data is gone as well.

External Table
- Data is read from an external directory; you can load data from files at any HDFS location.
- Drop table - The table definition is deleted, but the data remains at the external location.

hive> create external table students_ext(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ','
    > location '/MARCH';

[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 9 items
drwxr-xr-x   - cloudera supergroup          0 2020-02-14 22:06 /FEBRUARY
drwxr-xr-x   - cloudera supergroup          0 2020-01-21 19:51 /JANUARY
drwxr-xr-x   - cloudera supergroup          0 2020-03-17 11:08 /MARCH
drwxrwxrwx   - hdfs     supergroup          0 2017-10-23 09:15 /benchmarks
drwxr-xr-x   - hbase    supergroup          0 2020-01-19 20:06 /hbase
drwxr-xr-x   - solr     solr                0 2017-10-23 09:18 /solr
drwxrwxrwt   - hdfs     supergroup          0 2019-05-01 17:48 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /user
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /var

hive> select * from students_ext;
OK
Time taken: 0.053 seconds

[cloudera@quickstart ~]$ hdfs dfs -put file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt /MARCH
[cloudera@quickstart ~]$ 

hive> select * from students_ext;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

Job History Server link
http://quickstart.cloudera:19888/jobhistory/job/job_1579493137658_0001

Hadoop Cluster Applications link
http://quickstart.cloudera:8088/cluster/apps/

Partitioning 
When a table has many files and the user queries for a particular subset, Hive scans all the files to produce the result, which is a very time-consuming process. With partitioning, the query reads only the relevant files in the table, so it runs much faster.

Eg:- retail/sales/month.txt
when the user queries "select * from sales where month='May';"

Create a partition based on the month name.
Eg:- retail/sales/MAY/month.txt

- Partitioning is not mandatory.
- Partitioning is driven entirely by the query patterns & data. (Eg:- if the where clause is usually country='USA', we need to define a partition on country instead.)
- Partitioning on a unique/primary-key column is not practical; we use Bucketing for that.
Eg:- select * from sales where CustomerNo='12345';

Partitioning Types 
Dynamic - You declare which column to partition by; when the data is loaded, Hive creates the partitions automatically from that column's values.
Eg:- retail/sales/MAY/month.txt

Static - You define each partition yourself and load the data into it explicitly.
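The two styles can be sketched in HiveQL as follows (table and file names are illustrative, not from the project above):

```sql
-- Partitioned table: 'month' becomes a directory level under the
-- table, e.g. .../sales_part/month=May/
create table sales_part(txnno int, amount double, product string)
partitioned by (month string)
row format delimited fields terminated by ',';

-- Static: you name the partition yourself at load time.
load data local inpath 'file:///tmp/may_sales.txt'
into table sales_part partition (month='May');

-- Dynamic: Hive derives the partition from the last column of the
-- select (needs the dynamic-partition settings to be enabled).
insert overwrite table sales_part partition (month)
select txnno, amount, product, month from sales;
```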

Partitioning Modes 
Strict Mode - Hive only allows queries that filter on a partition column.
select * from sales where month='May';   -> succeeds
select * from sales where country='USA'; -> fails

Non-Strict Mode - Hive allows all queries, with or without a partition filter.
select * from sales where month='May';   -> succeeds
select * from sales where country='USA'; -> succeeds
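These modes are controlled by session settings. A typical sketch for enabling dynamic-partition inserts (these are standard Hive properties):

```sql
-- Allow dynamic partitioning at all.
set hive.exec.dynamic.partition=true;
-- Relax strict mode, so an insert does not need at least one
-- statically named partition column.
set hive.exec.dynamic.partition.mode=nonstrict;
```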

Bucketing or Clustering
Partitioning on a unique/primary-key column is not practical; we use Bucketing for that.

- You tell Hive you want x number of buckets based on a column; rows are assigned to buckets by hashing that column's value.
Eg:- I want 10 buckets based on CustomerNo.
- Partitions are folders, while buckets are files within the table (or partition) directory.
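A bucketed table can be sketched as follows (names are illustrative). Hive hashes the clustered column and routes each row to one of the bucket files:

```sql
-- 10 buckets on custno: each row lands in one of 10 files under
-- the table directory, chosen by hash(custno) % 10.
create table sales_bucketed(custno int, amount double, product string)
clustered by (custno) into 10 buckets
row format delimited fields terminated by ',';

-- On older Hive versions this setting makes inserts honor the
-- declared bucket count.
set hive.enforce.bucketing=true;
insert overwrite table sales_bucketed
select custno, amount, product from sales;
```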