Tuesday, April 14, 2020

BigData Hive Concepts

Hive Concepts

- Hive is a bigdata query language.
- SQL interface to Hadoop - Hive Query Language
- Hive is a data warehouse
- Schema on write - RDBMS -> create a table with 5 columns, you must insert 5 column data.
- Schema on read - Data warehouse -> create a table with 5 columns, you can insert 10 column data.
- The data warehouse does not support indexing & stored procedures but support views.


[cloudera@quickstart ~]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.679 seconds, Fetched: 3 row(s)

hive> create database feb2020;

hive> use feb2020;

hive> show tables;
OK
Time taken: 0.177 seconds

hive> create database mar2020;
OK
Time taken: 0.564 seconds
hive> use mar2020;
OK
Time taken: 0.02 seconds
hive> show tables;
OKTime taken: 0.139 seconds

create Table
hive> create table students_mar(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ',';
OK
Time taken: 0.101 seconds

insert data into Table or Load
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, totalSize=106]
OK
Time taken: 0.508 seconds

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

describe Table
hive> describe formatted students_mar;
OK
# col_name            data_type            comment             
 
name                string                                  
location            string                                                 
numFiles             1                   
totalSize            106                 
transient_lastDdlTime 15838
phone                int                                      
course              string                                  
 
# Detailed Table Information  
Database:            mar2020               
Owner:              cloudera             
CreateTime:          Mon Mar 09 18:49:18 PDT 2020  
LastAccessTime:      UNKNOWN               
Protect Mode:        None                 
Retention:          0                     
Location:            hdfs://quickstart.cloudera:8020/user/hive/warehouse/mar2020.db/students_mar  
Table Type:          MANAGED_TABLE         
Table Parameters:  
COLUMN_STATS_ACCURATE true 05314          
 
# Storage Information  
SerDe Library:      org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  
InputFormat:        org.apache.hadoop.mapred.TextInputFormat  
OutputFormat:        org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat  
Compressed:          No                   
Num Buckets:        -1                   
Bucket Columns:      []                   
Sort Columns:        []                   
Storage Desc Params:  
field.delim          ,                   
serialization.format ,                   
Time taken: 0.344 seconds, Fetched: 33 row(s)

Hive folder structure in HDFS

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/
Found 4 items
drwxrwxrwx   - cloudera supergroup          0 2020-02-12 17:36 /user/hive/warehouse/feb2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-01-21 10:24 /user/hive/warehouse/jan2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:49 /user/hive/warehouse/mar2020.db
drwxrwxrwx   - cloudera supergroup          0 2020-02-01 09:49 /user/hive/warehouse/retail.db
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

[cloudera@quickstart ~]$ 

Load the same file 2nd time

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=2, totalSize=212]

Data gets appended
hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.07 seconds, Fetched: 8 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 2 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:42 /user/hive/warehouse/mar2020.db/students_mar/students_hive_copy_1.txt
[cloudera@quickstart ~]$ 

Replace/Overwrite the data 
hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=106, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.098 seconds, Fetched: 4 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        106 2020-03-17 10:49 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt
[cloudera@quickstart ~]$ 

Schema On-Read
Take a sample file(students_new.txt) with more columns.

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_new.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=162, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
NULL NULL NULL
Time taken: 0.067 seconds, Fetched: 5 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx   1 cloudera supergroup        162 2020-03-17 10:57 /user/hive/warehouse/mar2020.db/students_mar/students_new.txt

Managed Table
-Metadata inside the table is managed by Hive. By default, all the tables are Managed Tables.
- Drop table - The table will be deleted & Data will be gone.

External Table
- Data will be read from an external directory. If you want to load the data you can load using the external files at any location.
- Drop table - The table will be deleted & But Data will be there. Data will be present in the External table location.

hive> create external table students_ext(
    > name string,
    > location string,
    > phone int,
    > course string)
    > row format delimited fields terminated by ','
    > location '/MARCH';

[cloudera@quickstart ~]$ hdfs dfs -ls /
Found 9 items
drwxr-xr-x   - cloudera supergroup          0 2020-02-14 22:06 /FEBRUARY
drwxr-xr-x   - cloudera supergroup          0 2020-01-21 19:51 /JANUARY
drwxr-xr-x   - cloudera supergroup          0 2020-03-17 11:08 /MARCH
drwxrwxrwx   - hdfs     supergroup          0 2017-10-23 09:15 /benchmarks
drwxr-xr-x   - hbase    supergroup          0 2020-01-19 20:06 /hbase
drwxr-xr-x   - solr     solr                0 2017-10-23 09:18 /solr
drwxrwxrwt   - hdfs     supergroup          0 2019-05-01 17:48 /tmp
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /user
drwxr-xr-x   - hdfs     supergroup          0 2017-10-23 09:17 /var

hive> select * from students_ext;
OK
Time taken: 0.053 seconds

[cloudera@quickstart ~]$ hdfs dfs -put file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt /MARCH
[cloudera@quickstart ~]$ 

hive> select * from students_ext;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.051 seconds, Fetched: 4 row(s)

Job History Server link
http://quickstart.cloudera:19888/jobhistory/job/job_1579493137658_0001

Hadoop Cluster Applications link
http://quickstart.cloudera:8088/cluster/apps/

Partitioning 
When the table has multiple files and the user wants to search for a particular file, the query will search all the files and get the results(which is very time-consuming process). By using partitioning, we will query the specific file in the Table. Using partition your query will be very fast.

Eg:- retail/sales/month.txt
when the user query for "select * from sales where month='May';

Create partition based on month name.
Eg:- retail/sales/MAY/month.txt

- Partitioning is not mandatory
- Partitioning is completely based on the Query & data. (Eg:- sometimes where clause may vary like where country='USA' in that case, we need to define another partition)
- Partitioning on unique/primary column is not possible, We use Bucketing for this.
Eg:- select * from sales where CustomerNo='12345';

Partitioning Types 
Dynamic - This is the column I want to do the partitioning, this info we need to provide. When loading the data the Hive will automatically be partitioning the data.
Eg:- retail/sales/MAY/month.txt

Static - You can define the partitions and upload the data.

Partitioning Modes 
Strict Mode - Hive will allow queries based on Partitions.
select * from sales where month='May' -> will success
select * from sales where country='USA' -> will fail

Non Strict Mode - Hive will allow all queries.
select * from sales where month='May' -> will success
select * from sales where country='USA' -> will success

Bucketing or Clustering
Partitioning on unique/primary column is not possible, We use Bucketing for this.

- You tell hive I want x no of Buckets based on the column.
Eg:- I want 10 buckets based on CustomerNo.
- Buckets are nothing but folders.




No comments:

Post a Comment