Tuesday, April 14, 2020

Hive Project work


BigData Hive Project work

1) create a database
hive> create database RETAIL;

2) show the databases
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.023 seconds, Fetched: 3 row(s)

3) use the Retail database
hive> use RETAIL;

4) show the tables in Retail database
hive> show tables;
OK

5) create a table
hive> create table TXNRECORDS(txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING)
    > row format delimited fields terminated by ',' stored as textfile;

6) load data into a table
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/txns1.txt' into table TXNRECORDS;

hive> show tables;
OK
txnrecords
Time taken: 0.021 seconds, Fetched: 1 row(s)
hive>

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/retail.db/txnrecords
Found 1 items
-rwxrwxrwx   1 cloudera supergroup    4418139 2020-02-01 09:51 /user/hive/warehouse/retail.db/txnrecords/txns1.txt

hive> select * from txnrecords limit 10;
OK
0 06-26-2011 4007024 40.33 Exercise & Fitness Cardio Machine Accessories Clarksville Tennessee credit
1 05-26-2011 4006742 198.44 Exercise & Fitness Weightlifting Gloves Long Beach California credit
2 06-01-2011 4009775 5.58 Exercise & Fitness Weightlifting Machine Accessories Anaheim California credit
3 06-05-2011 4002199 198.19 Gymnastics Gymnastics Rings Milwaukee Wisconsin credit
4 12-17-2011 4002613 98.81 Team Sports Field Hockey Nashville  Tennessee credit
5 02-14-2011 4007591 193.63 Outdoor Recreation Camping & Backpacking & Hiking Chicago Illinois credit
6 10-28-2011 4002190 27.89 Puzzles Jigsaw Puzzles Charleston South Carolina credit
7 07-14-2011 4002964 96.01 Outdoor Play Equipment Sandboxes Columbus Ohio credit
8 01-17-2011 4007361 10.44 Winter Sports Snowmobiling Des Moines Iowa credit
9 05-17-2011 4004798 152.46 Jumping Bungee Jumping St. Petersburg Florida credit
Time taken: 0.076 seconds, Fetched: 10 row(s)

7) count the number of rows in the table.
- Hive will launch a MapReduce job and return the result at the end.

hive> select count(*) from txnrecords;
Query ID = cloudera_20200317115050_d45ae0eb-3662-4392-a4e2-f3e247368e52
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1579493137658_0001, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1579493137658_0001/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1579493137658_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-03-17 11:51:00,725 Stage-1 map = 0%,  reduce = 0%
2020-03-17 11:51:15,508 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.99 sec
2020-03-17 11:51:28,713 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.69 sec
MapReduce Total cumulative CPU time: 3 seconds 690 msec
Ended Job = job_1579493137658_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.69 sec   HDFS Read: 4426362 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 690 msec
OK
50000
Time taken: 55.465 seconds, Fetched: 1 row(s)

8) Create a table for Customer Data
hive> create table customer(custno string, firstname string, lastname string, age int, profession string)
    > row format delimited fields terminated by ',';

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/custs.txt' into table customer;
Loading data to table retail.customer
Table retail.customer stats: [numFiles=1, totalSize=391355]
OK
Time taken: 0.308 seconds

hive> select * from customer limit 10;
OK
4000001 Kristina Chung 55 Pilot
4000002 Paige Chen 74 Teacher
4000003 Sherri Melton 34 Firefighter
4000004 Gretchen Hill 66 Computer hardware engineer
4000005 Karen Puckett 74 Lawyer
4000006 Patrick Song 42 Veterinarian
4000007 Elsie Hamilton 43 Pilot
4000008 Hazel Bender 63 Carpenter
4000009 Malcolm Wagner 39 Artist
4000010 Dolores McLaughlin 60 Writer
Time taken: 0.065 seconds, Fetched: 10 row(s)

hive> select count(*) from customer;
Query ID = cloudera_20200317120808_7ae1124b-a05f-43fa-bbbf-42cee5f14e71
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1579493137658_0002, Tracking URL = http://quickstart.cloudera:8088/proxy/application_1579493137658_0002/
Kill Command = /usr/lib/hadoop/bin/hadoop job  -kill job_1579493137658_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-03-17 12:08:12,143 Stage-1 map = 0%,  reduce = 0%
2020-03-17 12:08:22,200 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.66 sec
2020-03-17 12:08:32,024 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.3 sec
MapReduce Total cumulative CPU time: 3 seconds 300 msec
Ended Job = job_1579493137658_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.3 sec   HDFS Read: 398974 HDFS Write: 5 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 300 msec
OK
9999
Time taken: 32.407 seconds, Fetched: 1 row(s)
9) create a join table
hive> create table out1(custno string, firstname string, age int, profession string, amount double, product string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.106 seconds

hive> insert overwrite table out1
    > select a.custno,a.firstname,a.age,a.profession,b.amount,b.product
    > from customer a JOIN txnrecords b ON a.custno=b.custno;
Query ID = cloudera_20200412100909_d6678e43-b571-4cdc-ad33-0ae36178bdfd
Total jobs = 1

hive> select * from out1 limit 10;
OK
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories
4002199 Keith 44 Police officer 198.19 Gymnastics Rings
4002613 Hugh 43 Engineering technician 98.81 Field Hockey
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes
4007361 Terri 52 Loan officer 10.44 Snowmobiling
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping
Time taken: 0.08 seconds, Fetched: 10 row(s)

Displaying the header of the records:
hive> set hive.cli.print.header=true;
hive> select * from out1 limit 10;
OK
out1.custno out1.firstname out1.age out1.profession out1.amount out1.product
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories
4002199 Keith 44 Police officer 198.19 Gymnastics Rings
4002613 Hugh 43 Engineering technician 98.81 Field Hockey
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes
4007361 Terri 52 Loan officer 10.44 Snowmobiling
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping
Time taken: 0.085 seconds, Fetched: 10 row(s)

10) create out2 table

hive> create table out2 (custno int, firstname string, age int, profession string, amount double, product string, level string)
    > row format delimited fields terminated by ',';

Categorize based on Age :

hive> insert overwrite table out2 
    > select *, case
    > when age < 30 then 'low'
    > when age >= 30 and age < 50 then 'middle'
    > when age >= 50 then 'old'
    > else 'others'
    > end
    > from out1;

hive> select * from out2 limit 10;
OK
out2.custno out2.firstname out2.age out2.profession out2.amount out2.product out2.level
4007024 Cameron 59 Actor 40.33 Cardio Machine Accessories old
4006742 Gregory 36 Accountant 198.44 Weightlifting Gloves middle
4009775 Ruby 44 Designer 5.58 Weightlifting Machine Accessories middle
4002199 Keith 44 Police officer 198.19 Gymnastics Rings middle
4002613 Hugh 43 Engineering technician 98.81 Field Hockey middle
4007591 Jennifer 54 Electrician 193.63 Camping & Backpacking & Hiking old
4002190 Sheryl 62 Designer 27.89 Jigsaw Puzzles old
4002964 Ken 67 Recreation and fitness worker 96.01 Sandboxes old
4007361 Terri 52 Loan officer 10.44 Snowmobiling old
4004798 Geoffrey 65 Chemist 152.46 Bungee Jumping old
Time taken: 0.112 seconds, Fetched: 10 row(s)

11) create a final table for displaying the results.

hive> create table out3 (level string, amount double)
    > row format delimited fields terminated by ',';

hive> insert overwrite table out3
    > select level,sum(amount) from out2 group by level;

hive> select * from out3;
OK
out3.level out3.amount
low 725221.3399999988
middle 1855861.669999996
old 2529100.310000011

*********************************************************

BigData Hive Concepts

Hive Concepts

- Hive is a big data query tool: an SQL-like interface to Hadoop (Hive Query Language, HiveQL).
- Hive is a data warehouse.
- Schema on write - RDBMS -> create a table with 5 columns and you must insert exactly 5 columns of data.
- Schema on read - Data warehouse -> create a table with 5 columns and you can still load data with 10 columns; the schema is applied only when the data is read.
- Hive does not support indexing & stored procedures the way an RDBMS does, but it does support views.
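As a minimal sketch of schema on read (the table name and file path here are hypothetical, not part of the project): Hive applies the schema only when the data is read, so a file whose lines carry extra fields still loads, and the extra fields are simply ignored.

```sql
-- Hypothetical illustration of schema on read.
-- The table declares 2 columns, but each line of the input file
-- may carry 3 comma-separated fields.
create table demo(name string, city string)
row format delimited fields terminated by ',';

-- The load succeeds even if a line is "John,Bangalore,BigData";
-- select returns only name and city, and the third field is dropped.
-- A line with fewer fields would come back with NULLs instead.
load data local inpath 'file:///tmp/demo.txt' into table demo;
select * from demo;
```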


[cloudera@quickstart ~]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.679 seconds, Fetched: 3 row(s)

hive> create database feb2020;

hive> use feb2020;

hive> show tables;
OK
Time taken: 0.177 seconds

hive> create database mar2020;
OK
Time taken: 0.564 seconds
hive> use mar2020;
OK
Time taken: 0.02 seconds
hive> show tables;
OK
Time taken: 0.139 seconds

create Table
hive> create table students_mar(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.101 seconds

insert data into Table or Load
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, totalSize=106]
OK
Time taken: 0.508 seconds

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

describe Table
hive> describe formatted students_mar;
OK
# col_name            data_type            comment

name                  string
location              string
phone                 int
course                string

# Detailed Table Information
Database:             mar2020
Owner:                cloudera
CreateTime:           Mon Mar 09 18:49:18 PDT 2020
LastAccessTime:       UNKNOWN
Protect Mode:         None
Retention:            0
Location:             hdfs://quickstart.cloudera:8020/user/hive/warehouse/mar2020.db/students_mar
Table Type:           MANAGED_TABLE
Table Parameters:
	COLUMN_STATS_ACCURATE	true
	numFiles             	1
	totalSize            	106
	transient_lastDdlTime	1583805314

# Storage Information
SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:          org.apache.hadoop.mapred.TextInputFormat
OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:           No
Num Buckets:          -1
Bucket Columns:       []
Sort Columns:         []
Storage Desc Params:
	field.delim          	,
	serialization.format 	,
Time taken: 0.344 seconds, Fetched: 33 row(s)

Hive folder structure in HDFS

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/
Found 4 items
drwxrwxrwx   - cloudera supergroup          0 2020-02-12 17:36 /user/hive/warehouse/feb2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-01-21 10:24 /user/hive/warehouse/jan2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:49 /user/hive/warehouse/mar2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-02-01 09:49 /user/hive/warehouse/retail.db
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

[cloudera@quickstart ~]$ 

Load the same file a second time

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=2, totalSize=212]

Data gets appended
hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.07 seconds, Fetched: 8 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 2 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:42 /user/hive/warehouse/mar2020.db/students_mar/students_hive_copy_1.txt
[cloudera@quickstart ~]$ 

Replace/Overwrite the data 
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=106, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.098 seconds, Fetched: 4 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:49 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
[cloudera@quickstart ~]$ 

Schema On-Read
Take a sample file (students_new.txt) that has more columns than the table.

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_new.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=162, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
NULL NULL NULL
Time taken: 0.067 seconds, Fetched: 5 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        162 2020-03-17 10:57 /user/hive/warehouse/mar2020.db/students_mar/students_new.txt

Managed Table
- Both the data and the metadata are managed by Hive. By default, all tables are Managed Tables.
- Drop table - The table definition is deleted & the data is gone as well.

External Table
- Data is read from an external directory; you can load data from files at any HDFS location.
- Drop table - The table definition is deleted, but the data remains at the external location.

hive> create external table students_ext(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ','
    > location '/MARCH';

[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 9 items
drwxr-xr-x   - cloudera supergroup          0 2020-02-14 22:06 /FEBRUARY
drwxr-xr-x   - cloudera supergroup          0 2020-01-21 19:51 /JANUARY
drwxr-xr-x   - cloudera supergroup          0 2020-03-17 11:08 /MARCH
drwxrwxrwx   - hdfs     supergroup          0 2017-10-23 09:15 /benchmarks
drwxr-xr-x   - hbase    supergroup          0 2020-01-19 20:06 /hbase
drwxr-xr-x   - solr     solr                0 2017-10-23 09:18 /solr
drwxrwxrwt   - hdfs     supergroup          0 2019-05-01 17:48 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /user
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /var

hive> select * from students_ext;
OK
Time taken: 0.053 seconds

[cloudera@quickstart ~]$ hdfs dfs -put file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt /MARCH
[cloudera@quickstart ~]$ 

hive> select * from students_ext;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

Job History Server link
http://quickstart.cloudera:19888/jobhistory/job/job_1579493137658_0001

Hadoop Cluster Applications link
http://quickstart.cloudera:8088/cluster/apps/

Partitioning 
When a table has many files and the user queries for a particular subset, Hive scans all the files to produce the result, which is a very time-consuming process. With partitioning, the query reads only the relevant files in the table, so it runs much faster.

Eg:- retail/sales/month.txt
when the user queries "select * from sales where month='May';"

Create a partition based on the month name.
Eg:- retail/sales/MAY/month.txt

- Partitioning is not mandatory.
- Partitioning is driven entirely by the query patterns & data. (Eg:- if the where clause is usually country='USA', we need to define a partition on country instead.)
- Partitioning on a unique/primary-key column is not practical; we use Bucketing for that.
Eg:- select * from sales where CustomerNo='12345';

Partitioning Types 
Dynamic - You declare which column to partition by; when the data is loaded, Hive creates the partitions automatically from that column's values.
Eg:- retail/sales/MAY/month.txt

Static - You define each partition yourself and load the data into it explicitly.
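The two styles can be sketched in HiveQL as follows (table and file names are illustrative, not from the project above):

```sql
-- Partitioned table: 'month' becomes a directory level under the
-- table, e.g. .../sales_part/month=May/
create table sales_part(txnno int, amount double, product string)
partitioned by (month string)
row format delimited fields terminated by ',';

-- Static: you name the partition yourself at load time.
load data local inpath 'file:///tmp/may_sales.txt'
into table sales_part partition (month='May');

-- Dynamic: Hive derives the partition from the last column of the
-- select (needs the dynamic-partition settings to be enabled).
insert overwrite table sales_part partition (month)
select txnno, amount, product, month from sales;
```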

Partitioning Modes 
Strict Mode - Hive only allows queries that filter on a partition column.
select * from sales where month='May';   -> succeeds
select * from sales where country='USA'; -> fails

Non-Strict Mode - Hive allows all queries, with or without a partition filter.
select * from sales where month='May';   -> succeeds
select * from sales where country='USA'; -> succeeds
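These modes are controlled by session settings. A typical sketch for enabling dynamic-partition inserts (these are standard Hive properties):

```sql
-- Allow dynamic partitioning at all.
set hive.exec.dynamic.partition=true;
-- Relax strict mode, so an insert does not need at least one
-- statically named partition column.
set hive.exec.dynamic.partition.mode=nonstrict;
```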

Bucketing or Clustering
Partitioning on a unique/primary-key column is not practical; we use Bucketing for that.

- You tell Hive you want x number of buckets based on a column; rows are assigned to buckets by hashing that column's value.
Eg:- I want 10 buckets based on CustomerNo.
- Partitions are folders, while buckets are files within the table (or partition) directory.
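A bucketed table can be sketched as follows (names are illustrative). Hive hashes the clustered column and routes each row to one of the bucket files:

```sql
-- 10 buckets on custno: each row lands in one of 10 files under
-- the table directory, chosen by hash(custno) % 10.
create table sales_bucketed(custno int, amount double, product string)
clustered by (custno) into 10 buckets
row format delimited fields terminated by ',';

-- On older Hive versions this setting makes inserts honor the
-- declared bucket count.
set hive.enforce.bucketing=true;
insert overwrite table sales_bucketed
select custno, amount, product from sales;
```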