Working with Apache Impala Tutorial

Welcome to the fourth lesson of the Impala Training Course. This lesson provides an introduction to working with Impala. Let us discuss the objectives of this lesson.

Objectives

After completing the lesson - Working with Impala, you will be able to:

  • Describe the Impala architecture

  • Explain the functions of the three main architecture components

  • Describe the complete flow of a SQL query execution in Impala

  • Provide an overview of using user-defined functions in Impala

  • List the factors that improve Impala performance.

Let us begin with understanding the Impala architecture in the next section.

Impala Architecture

The Impala server is a distributed SQL query engine for data stored in Hadoop. Some of the features of the Impala architecture are:

  • It is a massively parallel processing (MPP) engine designed for a distributed cluster environment.

  • It consists of various daemon processes that run on specific hosts within your Hadoop cluster.

  • The three main components of Impala are the Impala daemon, the Impala statestore, and the Impala catalog service, represented by the daemons impalad, statestored, and catalogd respectively.

The diagram shown below depicts the internal architecture of Impala.
(Diagram: the internal architecture of Impala)

In the next section, let us understand the impalad daemon process.


Impala Daemon

The core component of Impala is the daemon process that runs on each node of the Impala cluster, in the form of the physical impalad process.

The functions of impalad are as follows:

  • The impalad process reads and writes to data files.

  • It handles queries sent from client interfaces such as the impala-shell command, Hue (Hadoop User Experience), JDBC, or ODBC.

  • It logically divides a query into smaller parallel queries and distributes them to different nodes in the Impala cluster. When you submit a query to the Impala daemon running on any node, the node serves as the coordinator node for that query.

  • Each impalad transmits intermediate query results back to the coordinator node, which constructs the final query output. When you experiment with queries using the impala-shell command, you may connect to the same Impala daemon each time for convenience, as in the session sketched below.
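
For example, a brief impala-shell session might look like the following minimal sketch; the hostname and table name are hypothetical, and 21000 is the default impalad port:

$ impala-shell -i node1.example.com:21000
-- The impalad on node1 coordinates this query; its fragments are
-- distributed across the cluster and the results are merged on node1.
SELECT count(*) FROM customers;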

In the next section, let us understand the statestored daemon process.

Impala Statestore

The Impala daemons and the statestore are in continuous communication to identify the nodes that are healthy and are capable of accepting new work. The statestore component then relays this information to the daemons. The name of the Impala statestore daemon process is statestored.

In the next section, we will learn about the catalogd daemon process.

Impala Catalog Service

The Impala catalog service component broadcasts metadata changes made by Impala SQL statements to all the nodes in a cluster. It is physically represented by the catalogd process, which is required on only one node in the cluster. Because catalog requests pass through the statestored daemon, the statestored and catalogd services run on the same node.

Functions of Impala Catalog Service are as follows:

  • Catalogd sends messages to statestored when:

  • An Impala node in the cluster creates, alters, or drops any type of object, or

  • An INSERT or LOAD DATA statement is processed through Impala.

The catalog service is a relatively new component, introduced in Impala 1.2. In earlier versions, if you issued the CREATE DATABASE, DROP DATABASE, CREATE TABLE, ALTER TABLE, or DROP TABLE commands on one node, you had to issue the INVALIDATE METADATA command on the other nodes before making a query request; otherwise, the changes to schema objects would not have been picked up.

  • The catalog service component eliminates the need for the REFRESH and INVALIDATE METADATA statements when the metadata changes are made through Impala statements. However, when you perform tasks such as creating a table or loading data through Hive, you still need to issue these statements before executing a query, as sketched below.
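
A minimal sketch of when these statements are still needed; the table names are hypothetical:

-- A table created through Hive is unknown to Impala until you run:
INVALIDATE METADATA web_logs;
SELECT count(*) FROM web_logs;

-- New files loaded into an existing table through Hive need only:
REFRESH web_logs;

-- A table created through Impala itself needs neither statement;
-- catalogd broadcasts the change to every node automatically.
CREATE TABLE page_views (url STRING, hits INT);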

Let us take a look at the query execution flow in Impala in the next section.

Query Execution Flow in Impala

The query execution process works in the following manner:

  • The client sends a query to impalad using HiveQL via the Thrift Application Programming Interface (API).

  • The frontend planner generates a Query Plan Tree using the metadata information.

  • The backend coordinator then sends execution requests to the query execution engines on the relevant nodes.

  • Each backend query execution engine executes its assigned query fragment; HDFS scans, aggregations, and merges are examples of such fragments.

Note that the statestore notifies the Impala daemons when the cluster state changes.
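
You can inspect the plan tree that the frontend planner generates for any query with the EXPLAIN statement; this minimal sketch reuses the document's vendors_data table:

EXPLAIN SELECT year, count(*) FROM vendors_data GROUP BY year;
-- The output lists the plan fragments, such as the HDFS scan,
-- aggregation, and exchange/merge steps, that the coordinator
-- distributes to the execution engines.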

Let us next discuss user-defined functions in Impala.


User-Defined Functions

In an Impala query, a user-defined function or UDF lets you apply your own application logic to process column values. Different types of UDFs accept and produce different numbers of input and output values.
They are as follows:

  • The most general kind is typically referred to by the abbreviation UDF.

  • It takes values from a single row and produces a single output value.

  • Example:

select customerName, is_HNI_customerName(customerId) from customers;

User-defined aggregate function or UDAF

  • Takes a group of values and produces a single output value.

  • Used for summarizing the values in a group of rows.

  • Example:

select most_optimal_production(plant_id, operationCost, tax_rate, depreciation) from vendors_data group by year;

Currently, Impala supports user-defined functions and user-defined aggregate functions, but it does not support other categories of UDFs such as user-defined table functions. Both supported kinds are registered through CREATE FUNCTION statements, as sketched below.
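
A minimal sketch of registering a native C++ UDF and UDAF; the library paths, symbol names, and function names are hypothetical:

-- Scalar UDF: one output value per input row.
CREATE FUNCTION is_hni(BIGINT) RETURNS BOOLEAN
LOCATION '/user/impala/udfs/libcustomer_udfs.so'
SYMBOL='IsHni';

-- Aggregate UDF: one output value per group of rows. Impala infers the
-- init, merge, and finalize functions from the UPDATE_FN naming convention.
CREATE AGGREGATE FUNCTION most_optimal_production(INT, DOUBLE, DOUBLE, DOUBLE)
RETURNS DOUBLE
LOCATION '/user/impala/udfs/libproduction_udfs.so'
UPDATE_FN='ProductionUpdate';

Let us next understand how to run UDFs written for Hive in Impala.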

Hive UDFs with Impala

A few pointers on using Hive UDFs with Impala are:

  • In Impala 1.1, the Hive shell was required to use UDFs.

  • From Impala 1.2 onward, Impala can run both native UDFs written in C++ and Hive UDFs written in Java.

  • For better performance, Impala supports writing UDFs in native C++ rather than in Java. Impala can run Java-based UDFs written for Hive without any changes if they fulfill two conditions:

  • Condition 1 - The parameters and return value must use Impala-supported data types. Currently, Impala does not support Hive UDFs that accept or return the TIMESTAMP data type.

  • Condition 2 - The return type must be a "writable" type such as Text or IntWritable, not a Java primitive type such as String or int; otherwise, the UDF will return NULL. Impala does not support Hive UDAFs and UDTFs.

Typically, a Java UDF executes more slowly in Impala than an equivalent native UDF written in C++.
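
A minimal sketch of registering an existing Hive Java UDF with Impala; the JAR path and class name are hypothetical:

CREATE FUNCTION hive_normalize(STRING) RETURNS STRING
LOCATION '/user/impala/udfs/hive-udfs.jar'
SYMBOL='com.example.udf.Normalize';

-- Once registered, the Hive UDF is called like any built-in function:
SELECT hive_normalize(customerName) FROM customers;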

In the next section, we will look at the factors that improve Impala performance.

Improving Impala Performance

The four factors that help improve Impala performance are:

Partitioning Impala tables

- Partitioning is a technique that physically divides the table data based on the values of one or more frequently queried columns.

- Thus, it allows queries to skip reading a large percentage of the data in a table, as sketched below.
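
A minimal sketch; the table and column names are hypothetical:

-- The table's data files are physically divided by the partition key values.
CREATE TABLE sales (item_id BIGINT, amount DOUBLE)
PARTITIONED BY (year INT, month INT)
STORED AS PARQUET;

-- This query reads only the year=2024 partitions and skips the rest.
SELECT SUM(amount) FROM sales WHERE year = 2024;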

Performance considerations for Join queries

- Joins are the main class of queries that you can improve at the SQL level; you do not have to change physical factors such as the file format or the hardware configuration to improve the performance of joins. One such SQL-level control is sketched below.
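
For example, the STRAIGHT_JOIN keyword tells Impala to join the tables in the order you list them rather than reordering them itself; the table names in this sketch are hypothetical:

-- With STRAIGHT_JOIN, list the largest table first.
SELECT STRAIGHT_JOIN c.customerName, o.total
FROM big_orders o JOIN customers c ON o.customerId = c.customerId;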

Collecting table and column statistics

- You can gather table and column statistics using the COMPUTE STATS statement. This helps Impala automatically optimize the performance of join queries without any changes to the SQL statements, as sketched below.
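
For example (the table name is hypothetical):

-- Gather table and column statistics in a single pass.
COMPUTE STATS sales;

-- Inspect the collected statistics.
SHOW TABLE STATS sales;
SHOW COLUMN STATS sales;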

Controlling Impala resource usage

- Impala performance depends heavily on memory: the more memory Impala can use, the better queries tend to perform. In a cluster running other kinds of workloads, adjust the allocation so that all Hadoop components have enough memory to perform well, and let Impala use as much of the remainder as possible. One way to control per-query usage is sketched below.
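
For example, the MEM_LIMIT query option caps how much memory a query may use on each node; the value here is illustrative:

-- Cap per-node memory for subsequent queries in this session.
SET MEM_LIMIT=2gb;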


Summary

Let us summarize what we learned in this lesson:

  • The three important components of the Impala architecture are the Impala daemon, the Impala statestore, and the Impala catalog service.

  • User-defined functions and user-defined aggregate functions in Java and native C++ can extend the functionalities of Impala SQL statements.

  • The recommended performance optimization techniques for Impala are partitioning Impala tables, performance considerations for join queries, collecting table and column statistics, and controlling Impala resource usage.

Conclusion

This concludes the lesson Working with Impala.
