Querying with Apache Hive and Impala Tutorial

Welcome to the second lesson of the Impala Training course. This lesson provides an overview of querying with Hive and Impala.

Let us explore the objectives of this lesson.

Objectives

After completing this lesson, you will be able to:

  • Discuss the SQL, DDL, and DML statements of Impala

  • Explain how to query data using Impala SQL

  • Recall how to use different SQL statements to perform CRUD operations in Impala

  • Explain how to load data into Impala tables

  • Differentiate between SQL statements in Hive and Impala

Let us begin with the SQL language statements in the next section.

SQL Language Statements

Impala uses Structured Query Language, or SQL, to query databases.

Standard and Compatibility

Impala follows the SQL-92 version of the language. It is also compatible with the Hive Query Language, or HiveQL, for querying data stored in the Hadoop Distributed File System (HDFS) or HBase.

Accessibility

To record information related to table structure and properties, both Impala and Hive use the same metadata store. Therefore, Impala can access tables that are either:

- defined using the native Impala CREATE TABLE command, or

- created in Hive using data definition language, or DDL.

Standard and Compatibility

Similar to HiveQL, Impala supports most data manipulation (DML) statements and clauses, such as JOIN, AGGREGATE, DISTINCT, UNION ALL, ORDER BY, LIMIT, and uncorrelated subqueries in the FROM clause. It also supports INSERT INTO and INSERT OVERWRITE.

Let us discuss DDL statements in the next section.

DDL Statements

Data Definition Language or DDL is a standard for the statements and commands that help create, modify, and remove objects such as tables, indexes, and users from a database.

Impala DDL statements are similar to the SQL statements used in relational database and data warehouse products.

Typically, these statements start with keywords such as CREATE, DROP, or ALTER.

The DDL statements supported by Impala are:

  • ALTER TABLE

  • ALTER VIEW

  • COMPUTE STATS

  • CREATE DATABASE

  • CREATE FUNCTION

  • CREATE ROLE

  • CREATE TABLE

  • CREATE VIEW

  • DROP TABLE

  • DROP FUNCTION

  • DROP ROLE

  • GRANT AND REVOKE

Let us learn about DML statements in the next section.

DML Statements

Data Manipulation Language or DML is the subset of SQL statements that modify the data stored in tables. Impala supports only a small set of DML statements, namely the INSERT statement and the LOAD DATA statement.

INSERT statement - It is optimized to insert a large amount of data in a single statement, thereby making effective use of the multi-megabyte HDFS blocks. To insert one or a few rows at a time, the INSERT ... VALUES form of the statement can be used.

LOAD DATA statement - It is used to move existing data files into an Impala table directory so that they are immediately available for Impala queries.
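As a minimal sketch of the two statements (the table name sample_logs, its three STRING columns, and the HDFS path are hypothetical, not part of this lesson), they might be used as follows:

-- Append a few rows; suitable only for small, ad hoc inserts
INSERT INTO sample_logs VALUES ('2013-07-28', 'host1', 'started'), ('2013-07-28', 'host2', 'stopped');

-- Move data files that already exist in HDFS into the table's directory
LOAD DATA INPATH '/data/staging/logs' INTO TABLE sample_logs;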

Let us understand the method of creating databases in the next section.

CREATE DATABASE

In Impala, a database has two aspects: it is both a logical construct and a physical construct.

Logical Construct  - It groups related tables, views, and functions within their own namespace.
Physical Construct - It is represented by a directory tree in HDFS, under which objects such as tables, partitions, and data files are stored. After creating a database, you can switch to it with the USE statement.

Note that a database created in Impala can also be used in Hive.
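For example, a database can be created and then selected as shown below; the database name sales_db is hypothetical:

CREATE DATABASE IF NOT EXISTS sales_db;
USE sales_db;    -- subsequent statements now run against sales_db
SHOW TABLES;     -- lists the tables in the current database (empty at this point)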

Let us understand the ‘CREATE Table’ command in the next section.

CREATE TABLE

In Impala, a table can be created by using the ‘CREATE Table’ command. Let us understand the general syntax of creating a table in Impala with the help of the commands shown below: 

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name data_type [COMMENT 'col_comment'], ...)]
  [COMMENT 'table_comment']
  [PARTITIONED BY (col_name data_type [COMMENT 'col_comment'], ...)]
  [WITH SERDEPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [ [ROW FORMAT row_format] [STORED AS file_format] ]
  [LOCATION 'hdfs_path']
  [TBLPROPERTIES ('key1'='value1', 'key2'='value2', ...)]
  [CACHED IN 'pool_name']

The ‘PARTITIONED BY’ clause divides the data files based on the values of one or more specified columns. The ‘STORED AS’ clause identifies the format of the underlying data files.

When no ‘STORED AS’ clause is used, the Impala table data files are saved as text files.

Typically, for an external table, you add a ‘LOCATION’ clause to specify the path to the HDFS directory where Impala reads and writes files for the table. External tables will be explained in detail later in the lesson.

Let us look at some examples of the ‘CREATE Table’ command in the next section.

CREATE TABLE - Examples

Some examples of the ‘CREATE TABLE’ command are as follows:

 

Ex 1.
create table logs (field1 string, field2 string, field3 string) row format delimited fields terminated by ',';    

 

Ex 2.
create table logsPartitioned (field1 string, field2 string, field3 string) partitioned by (year string, month string, day string, host string) row format delimited fields terminated by ',';

 

Ex 3.
create external table customerAddress

( ca_address_sk int, ca_address_id string, ca_street_number string, ca_street_name string, ca_street_type string, ca_suite_number string, ca_city string, ca_county string, ca_state string, ca_zip string, ca_country string, ca_gmt_offset float, ca_location_type string ) row format delimited fields terminated by '|' location '/user/hive/tpcds/customer_address';

  • The first command shown above creates a managed table in Impala. For a managed table, data has to be loaded explicitly.

  • The second command creates a partitioned managed table by specifying the ‘PARTITIONED BY’ clause in the CREATE statement.

  • The third command creates an external table whose data resides in the HDFS directory /user/hive/tpcds/customer_address. You need not load data explicitly for external tables; the table uses the data files already present at the path given in the LOCATION clause of the CREATE TABLE command.

Let us look at the two types of tables, that is, Internal and External in the next section.

Internal and External Tables

Two types of tables can be created in Impala: the Managed (or Internal) table and the External table.
Internal Table:

  • For Managed or Internal tables, Impala creates a directory in HDFS and stores the data files there.

  • Impala also manages the underlying data files and deletes them when the table is dropped.

External Table:

  • External tables point to existing data files that may reside in HDFS outside the directories managed by Impala.

  • You need to issue the ‘REFRESH’ command when data is added or replaced through HDFS operations, as shown in the example after this list.

  • In addition, the DROP TABLE statement does not physically remove the underlying data.
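As a sketch, suppose new data files are copied into the HDFS directory of the external customerAddress table created earlier (the file name is hypothetical); a REFRESH makes them visible to Impala:

-- Files added outside Impala, for example: hdfs dfs -put new_addresses.dat /user/hive/tpcds/customer_address
REFRESH customerAddress;
SELECT COUNT(*) FROM customerAddress;   -- the count now includes rows from the newly added file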

Let us understand the method of Loading Data to an Impala Table in the next section.

Loading Data into Impala Table

For a Managed or Internal table, data needs to be loaded by executing the ‘LOAD DATA INPATH’ command. The general syntax for loading data into a managed Impala table is given below.

LOAD DATA INPATH 'hdfs_file_or_directory_path' [OVERWRITE] INTO TABLE tablename

[PARTITION (partcol1=val1, partcol2=val2 ...)]


In addition, the LOAD DATA command can be used to load data into a partitioned table. Examples of loading data into a partitioned table and into a non-partitioned table are given below.
 

LOAD DATA INPATH '/log/2013/07/28/log.txt' INTO TABLE logsPartitioned PARTITION (year="2013", month="07", day="28", host="host1");

LOAD DATA INPATH '/log/log.txt' INTO TABLE logs;

Let us understand how to alter the properties of Impala table using the ‘ALTER TABLE Statement’ in the next section.

ALTER TABLE

An ALTER TABLE statement is used to change the structure or properties of an existing Impala table such as:

Rename table:

ALTER TABLE [old_db_name.]old_table_name RENAME TO [new_db_name.]new_table_name

Add new column, drop existing column:

ALTER TABLE name ADD COLUMNS (col_spec[, col_spec …])

ALTER TABLE name DROP [COLUMN] column_name

 Change column name:

ALTER TABLE name CHANGE column_name new_name new_type

Replace all columns (for example, to change their data types):

ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec …])

 Add or drop partition:

ALTER TABLE name { ADD | DROP } PARTITION (partition_spec)

ALTER TABLE name [PARTITION (partition_spec)] SET { CACHED IN 'pool_name' | UNCACHED }

ALTER TABLE in Impala is a logical operation that updates the table metadata in the common metastore shared with Hive.
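A few concrete examples, as a sketch (the table name t1 and the column names are hypothetical):

ALTER TABLE t1 RENAME TO t1_renamed;               -- rename the table
ALTER TABLE t1_renamed ADD COLUMNS (note STRING);  -- add a new column
ALTER TABLE t1_renamed CHANGE note remark STRING;  -- rename the column, restating its type
ALTER TABLE t1_renamed DROP COLUMN remark;         -- drop the column again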

Let us discuss about the Drop Table statement in the next section.

DROP TABLE

The DROP TABLE statement allows the deletion of an Impala table. By default, this statement also removes the associated data files for the table and its HDFS directory. However, if the table was created with the EXTERNAL clause, Impala does not delete the files and directories.

The general syntax of DROP TABLE command is given below:
 

DROP TABLE [IF EXISTS] [db_name.]table_name


Note that before issuing a DROP DATABASE statement, use DROP TABLE statements to remove all the tables and data files in that database.

Let us learn about the ‘DROP DATABASE’ command in the next section.

DROP DATABASE

The DROP DATABASE command is used to delete a database. However, before using it, use a combination of DROP TABLE, DROP VIEW, ALTER TABLE, and ALTER VIEW statements to delete all the tables and views in the database or move them to other databases.

The general syntax of DROP DATABASE command is given below:

DROP (DATABASE | SCHEMA) [IF EXISTS] database_name;
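As a sketch (the database and table names are hypothetical), the two statements are typically used in sequence:

DROP TABLE IF EXISTS sales_db.sample_logs;   -- remove each table first
DROP DATABASE IF EXISTS sales_db;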

Let us discuss the Describe Statement in the next section.

DESCRIBE Statement

Table metadata, such as column names and data types, can be viewed by executing the DESCRIBE statement. The ‘DESCRIBE FORMATTED’ variation shows further information in a format that Apache Hive users are familiar with.

This additional information includes details such as:

  • Whether the table is internal or external,

  • Date of creation,

  • File format, and

  • Data location in HDFS.

It also shows if the object is a table or a view. If the object is a view, it shows query text from the view definition.

The general syntax of Describe Statement is shown below:

DESCRIBE [FORMATTED] table
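For example, the logs table defined earlier could be inspected as follows:

DESCRIBE logs;             -- lists the column names and data types
DESCRIBE FORMATTED logs;   -- also shows the table type, creation date, file format, and HDFS location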

Let us understand the Explain Statement in the next section.


EXPLAIN Statement

The EXPLAIN statement returns the execution plan for a statement. You can review the query plan generated by the EXPLAIN statement to check whether the query will perform efficiently. You can also alter the query and/or the schema if they are not performing as expected. Some performance enhancement steps that can be taken are as follows:

  1. Change the tests in the WHERE clause

  2. Add hints to make join operations more efficient and introduce subqueries,

  3. Change the order of tables in a join operation

  4. Add or change partitioning for a table, and

  5. Collect column statistics and/or table statistics in Hive

The general syntax of the EXPLAIN statement is given below:

EXPLAIN { select_query | ctas_stmt | insert_stmt }

The EXPLAIN Statement notifies if table or column statistics are missing from any table involved in the query. These statistics are important for optimizing queries involving large tables or multi-table joins.
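As a sketch, the plan for a query on the partitioned logsPartitioned table created earlier can be inspected like this:

EXPLAIN SELECT host, COUNT(*) FROM logsPartitioned WHERE year = '2013' GROUP BY host;
-- Review the printed plan for partition pruning, join order, and warnings about missing statistics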

Let us learn about the SHOW TABLES statement in the next section.

SHOW TABLES Statement

You can see all the tables and views present inside a database by executing the SHOW TABLES command. Now the question arises:

  • ‘What happens when authorization is enabled for the database?’

- When authorization is enabled, the SHOW TABLES command displays only those objects for which you have privileges.
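For example (the database name sales_db and the wildcard pattern are hypothetical):

SHOW TABLES;                -- tables and views in the current database
SHOW TABLES IN sales_db;    -- tables and views in another database
SHOW TABLES LIKE 'logs*';   -- only the objects whose names match the pattern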

Let us learn about the Insert Statement in the next section.

INSERT Statement

INSERT statement is used to insert data into tables and partitions. The two types of INSERT statements are as follows:

  • INSERT INTO syntax - It appends data to a table. In this syntax, the existing data files are left untouched, whereas the inserted data is put into one or more new data files.

  • INSERT OVERWRITE syntax - It replaces the data in the table. The overwritten data files are deleted immediately, thus they do not go through the HDFS trash mechanism.

The general syntax of using the INSERT Statement is given below:

[with_clause] INSERT { INTO | OVERWRITE } [TABLE] table_name [(column_list)]
  [PARTITION (partition_clause)]
  { [hint_clause] select_statement | VALUES (value [, value ...]) [, (value [, value ...]) ...] }

partition_clause ::= col_name [= constant] [, col_name [= constant] ...]

hint_clause ::= [SHUFFLE] | [NOSHUFFLE]      (Note: The square brackets are part of the hint syntax.)

 Let us look at a few examples of the Insert Statement in the next section.

INSERT Statement - Examples

Some examples of ‘INSERT statement’ are as follows:

  • insert into sample1 values (1,true,100.0);

  • insert overwrite sample1 values (10,false,pow(2,5)), (50,true,10/3);

  • insert overwrite sample2 partition (year, month,day) select * from source;

The first statement inserts a row into a table called sample1.

The second INSERT statement inserts two rows into the table.

The third INSERT command inserts data into a partitioned table.

Let us learn about the Select Statement in the next section.

SELECT Statement

The SELECT statement performs queries, retrieves data from tables, and produces result sets consisting of rows and columns. In a SELECT statement, you can use the following clauses: JOIN, ORDER BY, GROUP BY, HAVING, LIMIT, OFFSET, UNION, and WITH.

In this statement, the DISTINCT operator, subqueries, and hints such as SHUFFLE or NOSHUFFLE can also be included.

Typically, an INSERT statement that copies data from one table to another in Impala ends with a SELECT statement.
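A minimal sketch combining several of these clauses, using the partitioned logsPartitioned table from the earlier examples:

SELECT host, COUNT(*) AS hits
FROM logsPartitioned
WHERE year = '2013' AND month = '07'
GROUP BY host
HAVING COUNT(*) > 100
ORDER BY hits DESC
LIMIT 10;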

Let us discuss the Impala data type in the next section.

Data Type

The correct data type for the columns must be chosen while defining the table schema. Some of the data types supported in Impala are:

  • BIGINT Data type: It is an 8-byte integer data type.

  • BOOLEAN Data type: It is a data type that represents a single true or false choice. Note that, Impala generally does not convert a data type to BOOLEAN values automatically. To convert to BOOLEAN, you must use an explicit call to the CAST function.

  • CHAR Data type (CDH 5.2 or higher only): It is a fixed-length character type that is padded with trailing spaces to reach the specified length. If a value is longer than the specified length, Impala truncates the trailing characters.

  • DECIMAL Data type (CDH 5.1 or higher only): It is a numeric data type with fixed scale and precision. This data type is suitable for financial and other arithmetic calculations.

  • DOUBLE Data type: It is a double precision floating-point data type that occupies 8 bytes.

  • FLOAT Data Type: It is a single precision floating-point data type that occupies 4 bytes on disk.

  • INT Data Type: It is an integer data type that occupies 4 bytes on disk.

  • REAL Data Type: It is an alias for the DOUBLE data type.

  • SMALLINT Data Type: It is a 2-byte integer data type.

  • STRING Data Type: It is a data type that represents variable-length character strings.
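As an illustration, a table schema can mix several of these types, and an explicit CAST is required to produce a BOOLEAN value; the table and column names below are hypothetical:

CREATE TABLE sensor_readings (
  id BIGINT,
  device_id SMALLINT,
  reading DOUBLE,
  label STRING,
  active BOOLEAN
);

-- Impala does not convert other types to BOOLEAN automatically; use an explicit CAST
SELECT CAST(1 AS BOOLEAN) AS flag_true, CAST(0 AS BOOLEAN) AS flag_false;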

Let us learn about the SQL Operators in the next section.

Operators

SQL operators are the comparison functions used within the WHERE clauses of SELECT statements. The following operators are supported in Impala:

  • The ‘BETWEEN Operator’ is used in a ‘WHERE clause’ to compare an expression to a lower and upper bound.

  • Comparison operators, such as =, !=, <, and >, are used in the WHERE clause to check equality and to compare column values.

  • The IN operator compares an argument value to a set of values. It returns TRUE if the argument matches any value in the set.

  • The IS NULL operator, and the IS NOT NULL operator in a WHERE clause, test whether a specified value is NULL or not.

  • The LIKE Operator is used for STRING data. It provides basic wildcard capability, using ‘_’ to match a single character and ‘%’ to match multiple characters.

  • The REGEXP (Regular Expression) Operator in the WHERE Clause tests if a value matches a regular expression.

  • The RLIKE Operator is a synonym for the REGEXP operator.
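A sketch of several of these operators combined in one WHERE clause, against the customerAddress table from the earlier example (the filter values are hypothetical):

SELECT ca_address_id, ca_city, ca_state
FROM customerAddress
WHERE ca_gmt_offset BETWEEN -8 AND -5
  AND ca_state IN ('CA', 'NY', 'TX')
  AND ca_zip IS NOT NULL
  AND ca_city LIKE 'San%'
  AND ca_street_name REGEXP '^[A-M]';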

Let us discuss the built-in Functions in the next section.

Functions

Impala supports several categories of built-in functions. They are:

  • Mathematical

  • Type conversion

  • Date and time

  • Conditional

  • String and

  • Aggregate

These functions let you perform mathematical calculations, string manipulation, date calculations, and other data transformation tasks directly in SELECT statements. These built-in functions let a SQL query return quick results with formatting, calculation, and type conversions applied.
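A sketch of a few functions from these categories in use, against the customerAddress table from the earlier example:

SELECT upper(ca_city) AS city,                          -- string function
       count(*) AS addresses,                           -- aggregate function
       round(avg(ca_gmt_offset), 1) AS avg_gmt_offset   -- mathematical function
FROM customerAddress
GROUP BY upper(ca_city);

SELECT now() AS query_time, cast('100' AS INT) AS converted;   -- date/time and type conversion functions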

Let us look at the Query-specific SQL statements in the next section.

CREATE VIEW in Impala

The CREATE VIEW statement is used to simplify complicated queries. A view is purely a logical construct that is not materialized on disk. The syntax to create a view in Impala is given below:

CREATE VIEW [IF NOT EXISTS] view_name [(column_list)] AS select_statement

After creating a view, you can query it the same way you query an Impala table. These queries can be executed from applications, scripts, or interactive queries in impala-shell.

CREATE VIEW in a SQL statement:

  • simplifies complicated queries that involve joins between multiple tables, complicated expressions, and SQL syntax that is difficult to understand and debug.

  • can also be used to hide the underlying table and column names.

Note that the ALTER VIEW statement is used to modify a view after it has been created.
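For instance, a view could hide the column names and partition filter of the logsPartitioned table from the earlier examples (the view and alias names are illustrative):

CREATE VIEW IF NOT EXISTS logs_2013 AS
SELECT field1 AS client, field2 AS request, field3 AS status
FROM logsPartitioned
WHERE year = '2013';

-- The view is then queried like any table
SELECT client, COUNT(*) FROM logs_2013 GROUP BY client;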

Let us discuss the differences between Hive and Impala Query Syntax in the next section.


SQL Difference Between Apache Hive and Impala

In some cases, Impala SQL and HiveQL use similar SQL statements and clause names. However, the semantics of Impala SQL statements are different from Hive SQL as shown below: 

Impala SQL:

  • Uses different syntax and names for query hints, such as [SHUFFLE] and [NOSHUFFLE].

  • Does not expose the MapReduce-specific features SORT BY, DISTRIBUTE BY, or CLUSTER BY.

  • Does not require queries to include a FROM clause.

  • Does not implicitly cast between string and numeric or Boolean values.

  • Performs implicit casts among the numeric types.

  • Performs implicit casts from string to timestamp, and has a restricted set of literal formats for the TIMESTAMP data type and the from_unixtime() format string.

Hive SQL:

  • Does not use the [SHUFFLE] and [NOSHUFFLE] hint syntax.

  • Exposes MapReduce-specific features such as SORT BY, DISTRIBUTE BY, and CLUSTER BY.

  • Requires queries to include a FROM clause.

  • Implicitly casts between string and numeric or Boolean values.

  • Does not perform implicit casts among the numeric types.

  • Does not perform implicit casts from string to timestamp.
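The casting difference can be sketched with the logs table from the earlier examples, assuming its string column field3 holds numeric text:

-- Impala: the string column must be cast explicitly before a numeric comparison
SELECT * FROM logs WHERE CAST(field3 AS INT) > 400;

-- Hive: WHERE field3 > 400 would be accepted, with an implicit cast from string to number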

Summary

Let us summarize the topics covered in this lesson:

  • Impala uses three types of language statements: SQL, DDL, and DML.

  • A database in Impala is both a logical construct and a physical construct (a directory tree in HDFS).

  • The CREATE Table and the ALTER Table commands are used to create a table and change the structure or properties of an existing table respectively.

  • The built-in functions help in performing tasks such as mathematical calculations, string manipulation, and date calculations.

  • The Create View statements help to simplify complicated queries.

  • The semantics used in Impala SQL queries are different from HiveQL.

Conclusion

This concludes the lesson ‘Querying with Hive and Impala’. The next lesson will focus on the ‘Data Storage and File Format’.
