BigData: BigData Hive Concepts

Hive Concepts

- Hive is a bigdata query language.
- SQL interface to Hadoop - Hive Query Language
- Hive is a data warehouse
- Schema on write - RDBMS -> create a table with 5 columns, you must insert 5 column data.
- Schema on read - Data warehouse -> create a table with 5 columns, you can insert 10 column data.
- The data warehouse does not support indexing & stored procedures but support views.

[cloudera@quickstart ~]$ hive

Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
WARNING: Hive CLI is deprecated and migration to Beeline is recommended.
hive>
hive> show databases;
OK
default
jan2020
retail
Time taken: 0.679 seconds, Fetched: 3 row(s)

hive> create database feb2020;

hive> use feb2020;

hive> show tables;

Time taken: 0.177 seconds

hive> create database mar2020;
OK
Time taken: 0.564 seconds
hive> use mar2020;
OK
Time taken: 0.02 seconds
hive> show tables;
OKTime taken: 0.139 seconds

create Table

hive> create table students_mar(

> name string,

> location string,

> phone int,

> course string)

> row format delimited fields terminated by ',';

Time taken: 0.101 seconds

insert data into Table or Load

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;

Loading data to table mar2020.students_mar

Table mar2020.students_mar stats: [numFiles=1, totalSize=106]

Time taken: 0.508 seconds

hive> select * from students_mar;

John Banglore 12345 BiGData

Mike Hyderabad 112233 Java

Bob Delhi 11111 DataScience

Smith Mumbai 22222 DBA

Time taken: 0.051 seconds, Fetched: 4 row(s)

describe Table

hive> describe formatted students_mar;

# col_name data_type comment

name string

location string

numFiles 1

totalSize 106

transient_lastDdlTime 15838

phone int

course string

# Detailed Table Information

Database: mar2020

Owner: cloudera

CreateTime: Mon Mar 09 18:49:18 PDT 2020

LastAccessTime: UNKNOWN

Protect Mode: None

Retention: 0

Location: hdfs://quickstart.cloudera:8020/user/hive/warehouse/mar2020.db/students_mar

Table Type: MANAGED_TABLE

Table Parameters:

COLUMN_STATS_ACCURATE true 05314

# Storage Information

SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

InputFormat: org.apache.hadoop.mapred.TextInputFormat

OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Compressed: No

Num Buckets: -1

Bucket Columns: []

Sort Columns: []

Storage Desc Params:

field.delim ,

serialization.format ,

Time taken: 0.344 seconds, Fetched: 33 row(s)

Hive folder structure in HDFS

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/
Found 4 items
drwxrwxrwx - cloudera supergroup 0 2020-02-12 17:36 /user/hive/warehouse/feb2020.db
drwxrwxrwx - cloudera supergroup 0 2020-01-21 10:24 /user/hive/warehouse/jan2020.db
drwxrwxrwx - cloudera supergroup 0 2020-03-09 18:49 /user/hive/warehouse/mar2020.db
drwxrwxrwx - cloudera supergroup 0 2020-02-01 09:49 /user/hive/warehouse/retail.db
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db
Found 1 items
drwxrwxrwx - cloudera supergroup 0 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar
[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar
Found 1 items
-rwxrwxrwx 1 cloudera supergroup 106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

[cloudera@quickstart ~]$

Load the same file 2nd time

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' into table students_mar;

Loading data to table mar2020.students_mar

Table mar2020.students_mar stats: [numFiles=2, totalSize=212]

Data gets appended

hive> select * from students_mar;

John Banglore 12345 BiGData

Mike Hyderabad 112233 Java

Bob Delhi 11111 DataScience

Smith Mumbai 22222 DBA

John Banglore 12345 BiGData

Mike Hyderabad 112233 Java

Bob Delhi 11111 DataScience

Smith Mumbai 22222 DBA

Time taken: 0.07 seconds, Fetched: 8 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar

Found 2 items

-rwxrwxrwx 1 cloudera supergroup 106 2020-03-09 18:55 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

-rwxrwxrwx 1 cloudera supergroup 106 2020-03-17 10:42 /user/hive/warehouse/mar2020.db/students_mar/students_hive_copy_1.txt

[cloudera@quickstart ~]$

Replace/Overwrite the data

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=106, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
Time taken: 0.098 seconds, Fetched: 4 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar

Found 1 items

-rwxrwxrwx 1 cloudera supergroup 106 2020-03-17 10:49 /user/hive/warehouse/mar2020.db/students_mar/students_hive.txt

[cloudera@quickstart ~]$

Schema On-Read

Take a sample file(students_new.txt) with more columns.

hive> load data local inpath 'file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_new.txt' overwrite into table students_mar;
Loading data to table mar2020.students_mar
Table mar2020.students_mar stats: [numFiles=1, numRows=0, totalSize=162, rawDataSize=0]

hive> select * from students_mar;
OK
John Banglore 12345 BiGData
Mike Hyderabad 112233 Java
Bob Delhi 11111 DataScience
Smith Mumbai 22222 DBA
NULL NULL NULL
Time taken: 0.067 seconds, Fetched: 5 row(s)

[cloudera@quickstart ~]$ hdfs dfs -ls /user/hive/warehouse/mar2020.db/students_mar

Found 1 items

-rwxrwxrwx 1 cloudera supergroup 162 2020-03-17 10:57 /user/hive/warehouse/mar2020.db/students_mar/students_new.txt

Managed Table

-Metadata inside the table is managed by Hive. By default, all the tables are Managed Tables.

- Drop table - The table will be deleted & Data will be gone.

External Table

- Data will be read from an external directory. If you want to load the data you can load using the external files at any location.

- Drop table - The table will be deleted & But Data will be there. Data will be present in the External table location.

hive> create external table students_ext(

> name string,

> location string,

> phone int,

> course string)

> row format delimited fields terminated by ','

> location '/MARCH';

[cloudera@quickstart ~]$ hdfs dfs -ls /

Found 9 items

drwxr-xr-x - cloudera supergroup 0 2020-02-14 22:06 /FEBRUARY

drwxr-xr-x - cloudera supergroup 0 2020-01-21 19:51 /JANUARY

drwxr-xr-x - cloudera supergroup 0 2020-03-17 11:08 /MARCH

drwxrwxrwx - hdfs supergroup 0 2017-10-23 09:15 /benchmarks

drwxr-xr-x - hbase supergroup 0 2020-01-19 20:06 /hbase

drwxr-xr-x - solr solr 0 2017-10-23 09:18 /solr

drwxrwxrwt - hdfs supergroup 0 2019-05-01 17:48 /tmp

drwxr-xr-x - hdfs supergroup 0 2017-10-23 09:17 /user

drwxr-xr-x - hdfs supergroup 0 2017-10-23 09:17 /var

hive> select * from students_ext;

Time taken: 0.053 seconds

[cloudera@quickstart ~]$ hdfs dfs -put file:///home/cloudera/Desktop/Basha/Basha2019/HIVE_Practicals/students_hive.txt /MARCH

[cloudera@quickstart ~]$

hive> select * from students_ext;

John Banglore 12345 BiGData

Mike Hyderabad 112233 Java

Bob Delhi 11111 DataScience

Smith Mumbai 22222 DBA

Time taken: 0.051 seconds, Fetched: 4 row(s)

Job History Server link
http://quickstart.cloudera:19888/jobhistory/job/job_1579493137658_0001

Hadoop Cluster Applications link
http://quickstart.cloudera:8088/cluster/apps/

Partitioning
When the table has multiple files and the user wants to search for a particular file, the query will search all the files and get the results(which is very time-consuming process). By using partitioning, we will query the specific file in the Table. Using partition your query will be very fast.

Eg:- retail/sales/month.txt
when the user query for "select * from sales where month='May';

Create partition based on month name.
Eg:- retail/sales/MAY/month.txt

- Partitioning is not mandatory
- Partitioning is completely based on the Query & data. (Eg:- sometimes where clause may vary like where country='USA' in that case, we need to define another partition)
- Partitioning on unique/primary column is not possible, We use Bucketing for this.
Eg:- select * from sales where CustomerNo='12345';

Partitioning Types
Dynamic - This is the column I want to do the partitioning, this info we need to provide. When loading the data the Hive will automatically be partitioning the data.
Eg:- retail/sales/MAY/month.txt

Static - You can define the partitions and upload the data.

Partitioning Modes

Strict Mode - Hive will allow queries based on Partitions.
select * from sales where month='May' -> will success
select * from sales where country='USA' -> will fail

Non Strict Mode - Hive will allow all queries.
select * from sales where month='May' -> will success
select * from sales where country='USA' -> will success

Bucketing or Clustering
Partitioning on unique/primary column is not possible, We use Bucketing for this.

- You tell hive I want x no of Buckets based on the column.
Eg:- I want 10 buckets based on CustomerNo.
- Buckets are nothing but folders.

BigData

Tuesday, April 14, 2020

BigData Hive Concepts

No comments:

Post a Comment