Introduction to Apache Impala Tutorial

Welcome to the first lesson of the Impala Training Course. This lesson provides an introduction to Impala.

Let us look at the objectives of this lesson.

Objectives

After completing the lesson ‘Introduction to Impala’, you will be able to:

  • Describe Impala    

  • Explain the main benefits of Impala    

  • Describe the steps to install Impala    

  • Demonstrate how to get started with Impala

  • Describe the functions of different Impala shell commands

Let us begin with understanding Impala in the next section.

What is Apache Impala?

Apache Impala is a massively parallel processing (MPP) SQL query execution engine that runs on the Hadoop platform.

Using Impala:

  • You can run a query, evaluate the results immediately, and fine-tune the query if necessary. The engine was introduced in October 2012 with a public beta distribution, and the final version was made available in May 2013.
  • Analysts and data scientists use Impala to analyze Hadoop data via SQL or other business intelligence tools.

Using Impala’s MPP-style execution alongside other Hadoop processing frameworks such as MapReduce, you can run interactive, ad-hoc, and batch queries together in the Hadoop system.
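The run-evaluate-fine-tune loop described above happens directly in an interactive shell session. A minimal sketch (the table name `sales` is hypothetical, and Impala must already be running):

```shell
# Start an interactive session; impala-shell connects to a local impalad by default.
impala-shell

# At the [hostname:21000] > prompt, issue a query, inspect the results,
# then refine it — no batch job submission or long waits in between:
#   SELECT region, COUNT(*) FROM sales GROUP BY region;
#   SELECT region, COUNT(*) FROM sales WHERE amount > 100 GROUP BY region;
```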
Let us discuss some benefits of Impala in the next section.


Benefits of Apache Impala

Following are some of the benefits of Impala:

  • Impala is a flexible engine that integrates well with the existing Hadoop components. This enables the use of files stored in HDFS, different data formats available in HDFS, security, metadata and storage management used by MapReduce, Apache Hive, and other Hadoop software.

  • Further, Impala adds capabilities that make SQL querying easier than before.

  • The Impala architecture also enhances SQL query speed on Hadoop data. The fast turnaround of Impala queries enables new categories of solutions.

  • Impala introduces high flexibility to the familiar database Extract, Transform, and Load process. You can access data with a combination of different Impala and Hadoop components without duplicating or converting the data.

  • When query speed is slow, use the Parquet columnar file format for a faster response. This format organizes data by column, which suits the maximum-performance needs of data warehouse-style queries.

  • For users and business intelligence tools that use SQL, Impala introduces an effective development model to handle new kinds of analysis.

  • The combination of Big Data and Impala makes SQL easy to use. Impala also provides flexibility for your Big Data workflow. SQL capabilities of Impala, such as filtering, calculating, sorting, and formatting, let you perform these operations in Impala. This helps organize the query results for presentation.
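The Parquet benefit above can be sketched in one statement (table names are hypothetical): Impala can create a Parquet-format copy of an existing table with a single CREATE TABLE ... AS SELECT.

```shell
# Assumption: a text-format table named 'sales_raw' already exists.
# Create a Parquet-format copy for faster warehouse-style queries:
impala-shell -q "CREATE TABLE sales_parquet STORED AS PARQUET
                 AS SELECT * FROM sales_raw;"
```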

In the next section, let us understand the role of Impala in exploratory business intelligence.

Exploratory Business Intelligence

Prior to Using Impala:

  • Business intelligence data was typically condensed into a manageable chunk of high-value information.

  • The information then passed through a complicated ETL cycle before it was uploaded to a database.

Using Impala:

  • The data arrives in Hadoop after fewer steps, and Impala queries it immediately.

  • The high-capacity, high-speed storage system of a Hadoop cluster lets you bring in all the data. As Impala can query raw data files, you can skip the time-consuming steps of loading and reorganizing data. This opens new possibilities for querying analytic data.

  • You can use exploratory data analysis and data discovery techniques to query this type of data.

In the next section, let us look at the requirements to install Impala.

Apache Impala Installation

To start using Impala, download Cloudera QuickStart Virtual Machine from the link shown:
https://www.cloudera.com/downloads/quickstart_vms/5-13.html

Since this is a 64-bit VM, it requires a 64-bit host operating system (OS) and a virtualization product that supports a 64-bit guest OS. To run this VM, you can use VMware, KVM, or Oracle VirtualBox. In addition, the VM requires a minimum of 4 GB of RAM, so your host machine should have around 8 GB of RAM.

You can also install Impala in a clustered environment with or without Cloudera Manager. For complete details of this process, refer to the Cloudera Installation manual.

In the next section, we will discuss starting and stopping Impala.

Starting and Stopping Impala

If you install Impala with Cloudera Manager, you can use Cloudera Manager to start and stop the services. The Cloudera Manager GUI lets you conveniently check that all services are running and set configuration options using form fields in a browser.

To start the Impala statestore and impalad from the command line or a script, you can use the service command. Alternatively, you can start the daemons directly through the impalad, statestored, and catalogd executables. Start the Impala statestore first, and then start the impalad instances.

You can modify the startup options used by the service initialization scripts for the statestore and Impala by editing /etc/default/impala.

Start the statestore service using a command such as: sudo service impala-state-store start

Start the catalog service using a command such as: sudo service impala-catalog start

Start the Impala service on each DataNode using a command such as: sudo service impala-server start
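The startup sequence above can be sketched as a short script. This is one possible ordering for a cluster installed without Cloudera Manager; hostnames and service placement vary by deployment.

```shell
# Run on the host carrying the statestore and catalog services first:
sudo service impala-state-store start
sudo service impala-catalog start

# Then start the Impala daemon on each DataNode:
sudo service impala-server start

# Startup options for these services can be adjusted in /etc/default/impala.
```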

In the next section, we will look at data storage in Impala.


Data Storage in Apache Impala

Apache Impala uses two media to store its data: Hadoop Distributed File System or HDFS and HBase.

  1. Hadoop Distributed File System or HDFS:

- Impala depends on the redundancy provided by HDFS to protect from hardware or network outages on individual nodes.

- In HDFS, Impala table data is represented as data files in familiar HDFS file formats and compression codecs. When a table is created, Impala reads whatever files are in the table's directory, regardless of their file names. New data is added in files whose names are controlled by Impala.

  2. HBase:

- It is a database storage system built on top of HDFS without built-in SQL support.

- It provides an alternative storage medium for Impala data. When you define a table in Impala and map it to its equivalent table in HBase, you can query the data of the HBase tables through Impala.

- In addition, you can perform join queries that include both Impala and HBase tables.
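The HBase mapping described above is typically defined through Hive, since Impala and Hive share the same metastore. A hedged sketch, in which the HBase table `users`, column family `cf`, and all column names are hypothetical:

```shell
# Assumption: an HBase table named 'users' with column family 'cf' already exists.
# Define the mapping table in Hive; Impala picks it up from the shared metastore:
hive -e "CREATE EXTERNAL TABLE hbase_users (id STRING, name STRING, email STRING)
         STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
         WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:name,cf:email')
         TBLPROPERTIES ('hbase.table.name' = 'users');"

# Impala can then query it, including joins with HDFS-backed tables:
impala-shell -q "SELECT u.name, o.total
                 FROM hbase_users u JOIN orders o ON u.id = o.user_id;"
```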

Let us discuss managing metadata in the next section.

Managing Metadata

For tracking metadata of schema objects such as tables and columns, Impala uses the same infrastructure as Hive.

Impala maintains table definition information in a central database called the metastore. You can use MySQL or PostgreSQL as a common metastore database for both Impala and Hive.

Each Impala node caches all the metadata to reuse in future queries against the same table. Therefore, an Impala node needs a metadata update whenever a metadata change is made through another impalad instance in your cluster or through Hive.

A metadata change also occurs if a change is made to a database to which clients such as the Impala shell or ODBC connect directly.

Database and table metadata is typically modified by:

- Hive—via ALTER, CREATE, DROP, or INSERT operations; and

- Impalad—via CREATE TABLE, ALTER TABLE, and INSERT operations.

INVALIDATE METADATA marks a table's metadata as stale; the metadata is reloaded the next time the table is referenced.
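The statement above is issued from impala-shell. A minimal sketch (the database `analytics` and table `web_logs` are hypothetical):

```shell
# After Hive (or another impalad) creates or alters the table, mark the
# cached metadata as stale so it is reloaded on next use:
impala-shell -q "INVALIDATE METADATA analytics.web_logs;"

# When only new data files were added to an existing table, the
# lighter-weight REFRESH statement is sufficient:
impala-shell -q "REFRESH analytics.web_logs;"
```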

Let us learn about controlling data access in the next section.

Controlling Access to Data

You can control data access in Impala through Authorization, Authentication, and Auditing.

Features of Authorization are:

  • You can use the Sentry open source project for user authorization. Sentry includes a detailed authorization framework for Hadoop.

  • When authorization is enabled, Impala uses the OS user ID of the user running the impala-shell or other client program.

  • It then associates various privileges with each user of the computer. You can control access to Impala data by using authorization techniques.

Let us now discuss Impala shell commands in the next section.

Impala Shell Commands and Interface

The Impala shell tool, impala-shell, helps you perform functions such as creating databases and tables, inserting data, and issuing queries. For ad-hoc queries and exploration, you can submit SQL statements in an interactive session.

For non-interactive use, the following options are available:

  • -q option: Lets you issue a single query from the command line, without the interactive interpreter. You can use the -q option to run impala-shell from a shell script, or in command invocations from scripts written in languages such as Python or Perl.

  • -o option: Lets you save the query output to a file.

  • -B option: Lets you print the output as plain text with tab-separated, comma-separated, or other delimited values.

  • stdout: In non-interactive mode, the query output is printed to stdout (standard output), or to any file that the -o option specifies.

  • stderr: Incidental output, on the other hand, is printed to stderr (standard error). This lets you process only the query output as part of a Unix pipeline.
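The options above can be combined in a pipeline. A hedged sketch (the table `sales` is hypothetical):

```shell
# Run a single query non-interactively and save its output to a file:
impala-shell -q "SELECT region, SUM(amount) FROM sales GROUP BY region" -o totals.txt

# Produce delimited plain-text output and pipe only the query results
# (stdout) into a Unix pipeline; incidental messages go to stderr:
impala-shell -B -q "SELECT region FROM sales" 2>/dev/null | sort -u
```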

In interactive mode, impala-shell uses the readline library, so you can recall and edit previous commands.

Let us look at more Impala Shell commands.

  • Use the impala-shell tool to run the following commands: connect, describe, explain, help, history, insert, quit, refresh, select, set, shell, show, use, and version.

  • Within impala-shell, commands such as ALTER, COMPUTE STATS, EXPLAIN, INSERT, and CONNECT pass requests to the impalad daemon that the shell is connected to.


Summary

Let us summarize the topics covered in this lesson.

  • Apache Impala is a massively parallel processing SQL query engine that runs on Apache Hadoop.

  • Impala is a flexible engine that integrates well with the existing Hadoop components.

  • The Impala architecture also enhances SQL query speed on Hadoop data.

  • With Impala, the business intelligence data arrives in Hadoop after fewer steps, and Impala queries it immediately.

  • The Cloudera Manager GUI lets you conveniently check if all services are running and set configuration options using form fields in a browser.

  • Impala uses two media to store its data: Hadoop Distributed File System or HDFS and HBase.

  • For tracking metadata of schema objects such as tables and columns, Impala uses the same infrastructure as Hive. You can control data access in Impala through Authorization, Authentication, and Auditing.

Conclusion

This concludes the lesson Introduction to Impala. The next lesson will focus on SQL query execution in Impala.
