Apache Hive vs Apache Impala: Key Differences

If you are new to big data analytics, the large number of Apache tools may already be on your radar; however, the sheer variety of tools can become confusing and, at times, overwhelming.
This post clears up that confusion and explains what Apache Hive and Apache Impala are and what makes them different from one another.
Apache Hive
Apache Hive is a SQL data access interface for the Apache Hadoop platform. Hive allows you to query, aggregate, and analyze data using SQL syntax.
Hive applies a schema-on-read approach to data in the HDFS file system, allowing you to treat the data as you would an ordinary table in a relational DBMS. HiveQL queries are translated into Java code for MapReduce jobs.
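To make the schema-on-read idea concrete, here is a minimal Python sketch. It is not Hive code: the sample rows, the `SCHEMA` list, and the `read_with_schema` and `total_amount` helpers are hypothetical names used only to show that the raw file stays untouched and that column names and types are applied when the data is read.

```python
import csv
import io

# Raw file contents as they might sit in HDFS: no types, no column names.
RAW_DATA = "1,alice,34.5\n2,bob,12.0\n"

# Hypothetical table schema, applied only at read time.
SCHEMA = [("id", int), ("name", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Parse raw delimited text and apply column names and types on read."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {name: cast(value) for (name, cast), value in zip(schema, fields)}
        rows.append(row)
    return rows


def total_amount(rows):
    """Equivalent in spirit to SELECT SUM(amount) over the table."""
    return sum(row["amount"] for row in rows)
```

Because the schema lives outside the data, the same raw file could be read with a different schema without rewriting it.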

Hive queries are written in the HiveQL query language, which is based on SQL but does not fully support the SQL-92 standard.
However, the language allows programmers to plug in their own custom code when it is inconvenient or inefficient to express the logic with built-in HiveQL features. HiveQL can be extended with user-defined scalar functions (UDFs), aggregate functions (UDAFs), and table functions (UDTFs).
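The three kinds of extension functions differ mainly in their input and output shape. The Python sketch below only mirrors that shape; in Hive itself such functions are typically implemented as Java classes, and the names `upper_udf`, `sum_udaf`, and `explode_udtf` are hypothetical.

```python
def upper_udf(value):
    """Scalar UDF shape: one input value in, one output value out."""
    return value.upper()


def sum_udaf(values):
    """Aggregate (UDAF) shape: many input rows in, one value out."""
    total = 0
    for v in values:
        total += v
    return total


def explode_udtf(items):
    """Table function (UDTF) shape: one input row in, zero or more rows out."""
    for item in items:
        yield (item,)
```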
How Does Apache Hive Work?
Apache Hive translates programs written in HiveQL (a language close to SQL) into one or more MapReduce, Apache Tez, or Apache Spark jobs. These are three execution engines that can run on Hadoop. Apache Hive then organizes the data into tables on the Hadoop Distributed File System (HDFS) and runs the jobs on a cluster to produce a result.
Apache Hive tables are similar to those in relational databases, and data units are organized from the largest unit to the most granular. Databases are made up of tables, which are composed of partitions, which can in turn be broken down into “buckets.”
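As a rough illustration of what such a translation means, the sketch below expresses a query like `SELECT country, COUNT(*) ... GROUP BY country` as map, shuffle, and reduce phases in plain Python. It is a simplified model running in one process, not what Hive actually generates; the column name `country` and all function names are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(rows):
    """Emit (key, 1) for every input row, like a mapper for COUNT(*)."""
    for row in rows:
        yield row["country"], 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the shuffle step would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values of each key, like a reducer for COUNT(*)."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_country(rows):
    """Chain the three phases into one GROUP BY style aggregation."""
    return reduce_phase(shuffle_phase(map_phase(rows)))
```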

The data is accessible via HiveQL. Within each database, the data is serialized, and each table corresponds to an HDFS directory.
Multiple interfaces are available within the Apache Hive architecture, such as a web interface, a CLI, or external clients.
Indeed, the “Apache Hive Thrift” server allows remote clients to submit commands and requests to Apache Hive using various programming languages. Apache Hive’s central repository is a “metastore” containing all the metadata.
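The mapping from tables, partitions, and buckets to HDFS directories can be sketched as a path-building function. Everything concrete here is an assumption for illustration: the warehouse root is a commonly seen default rather than a given, the `bucket_NNNNN` file name is hypothetical, and the modulo rule is a simplification of Hive's own hash function.

```python
import posixpath

# Assumed warehouse root; real clusters configure their own location.
WAREHOUSE_ROOT = "/user/hive/warehouse"


def bucket_for(key, num_buckets):
    """Simplified bucketing: integer key modulo the bucket count.

    Hive uses its own hash function; this is only an illustration.
    """
    return key % num_buckets


def data_path(database, table, partition_col, partition_value, key, num_buckets):
    """Build an illustrative HDFS path for the bucket that would hold a row."""
    bucket = bucket_for(key, num_buckets)
    return posixpath.join(
        WAREHOUSE_ROOT,
        f"{database}.db",                      # one directory per database
        table,                                 # one directory per table
        f"{partition_col}={partition_value}",  # one directory per partition
        f"bucket_{bucket:05d}",                # hypothetical bucket file name
    )
```

A query that filters on the partition column only needs to read the matching partition directory, which is the main reason for organizing data this way.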
The engine that makes Hive work is called “the driver.” It bundles a compiler and an optimizer to determine the optimal execution plan.
Finally, security is provided by Hadoop. It therefore relies on Kerberos for mutual authentication between the client and server. The permissions for newly created files in Apache Hive are dictated by HDFS, which allows authorization by user, group, or others.
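The user, group, and others model mentioned above follows the familiar POSIX-style permission bits. As a small sketch (the helper name `describe_mode` is hypothetical and the function does not talk to HDFS), an octal mode can be decoded like this:

```python
def describe_mode(mode):
    """Split an octal permission mode into user, group, and other rwx strings."""

    def rwx(bits):
        # Each class has three bits: read (4), write (2), execute (1).
        return (
            ("r" if bits & 4 else "-")
            + ("w" if bits & 2 else "-")
            + ("x" if bits & 1 else "-")
        )

    return {
        "user": rwx((mode >> 6) & 7),
        "group": rwx((mode >> 3) & 7),
        "other": rwx(mode & 7),
    }
```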
Hive Features
- Supports the Hadoop and Spark compute engines
- Uses HDFS and works as a data warehouse
- Uses MapReduce and supports ETL
- Thanks to HDFS, it has fault tolerance similar to Hadoop's
Apache Hive: Benefits
Apache Hive is an ideal solution for queries and data analysis. It makes it possible to obtain qualitative insights, providing a competitive advantage and facilitating responsiveness to market demand.
Among the main advantages of Apache Hive, we can mention the ease of use linked to its “SQL-friendly” language. In addition, it speeds up the initial loading of data, since the data does not need to be read and serialized to disk in the database's internal format.
Because the data is stored in HDFS, it is possible to store large datasets of up to hundreds of petabytes on Apache Hive. This solution is much more scalable than a traditional database. When it runs as a cloud service, Apache Hive also allows users to quickly launch virtual servers based on fluctuations in workloads (i.e., tasks).
Security is also an aspect where Hive performs better, with its ability to replicate recovery-critical workloads in the event of a problem. Finally, the work capacity is unparalleled since it can perform up to 100,000 requests per hour.
Apache Impala
Apache Impala is a massively parallel SQL query engine for the interactive execution of SQL queries on data stored in Apache Hadoop. It is written in C++ and distributed under the Apache 2.0 license.
Impala is also called an MPP (Massively Parallel Processing) engine, a distributed DBMS, and even a SQL-on-Hadoop stack database.

Impala operates in distributed mode, where process instances run on different cluster nodes, receiving, scheduling, and coordinating client requests. In this case, parallel execution of fragments of the SQL query is possible.
Clients are users and applications that submit SQL queries against data stored in Apache Hadoop (HBase and HDFS) or Amazon S3. Interaction with Impala takes place through the HUE (Hadoop User Experience) web interface, ODBC, JDBC, and the Impala Shell command-line shell.
Impala depends infrastructurally on another popular SQL-on-Hadoop tool, Apache Hive, using its metadata store. In particular, the Hive Metastore lets Impala know about the availability and structure of the databases.
When creating, modifying, and deleting schema objects or loading data into tables via SQL statements, the corresponding metadata changes are automatically propagated to all Impala nodes using a specialized catalog service.
The key components of Impala are the following executables:
- Impalad or Impala daemon is a system service that schedules and executes queries on HDFS, HBase, and Amazon S3 data. One impalad process runs on each cluster node.
- Statestore is a naming service that keeps track of the location and status of all impalad instances in the cluster. Only one instance of this system service is needed per cluster.
- Catalog is a metadata coordination service that propagates changes from Impala DDL and DML statements to all affected Impala nodes so that new tables or newly loaded data are immediately visible to any node in the cluster. It is recommended that one instance of Catalog be running on the same cluster host as the Statestored daemon.
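The division of labor between the Statestore and the Catalog can be pictured with a toy model. The classes below are hypothetical stand-ins, not the real `statestored` or `catalogd` processes: one tracks which daemons are alive, the other pushes a metadata change to every live daemon.

```python
class Statestore:
    """Toy registry of live impalad instances (not the real statestored)."""

    def __init__(self):
        self.live_nodes = set()

    def register(self, node_name):
        self.live_nodes.add(node_name)

    def mark_failed(self, node_name):
        self.live_nodes.discard(node_name)


class Catalog:
    """Toy metadata broadcaster (not the real catalogd)."""

    def __init__(self, statestore):
        self.statestore = statestore
        self.node_metadata = {}  # node name -> set of known table names

    def apply_ddl(self, table_name):
        """Propagate a newly created table to every live node."""
        for node in self.statestore.live_nodes:
            self.node_metadata.setdefault(node, set()).add(table_name)
```

In this model a node that has been marked as failed simply stops receiving metadata updates until it registers again.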
How Does Apache Impala Work?
Impala, like Apache Hive, uses a similar declarative query language, Hive Query Language (HiveQL), which is a subset of SQL-92 rather than full SQL.
The actual execution of a request in Impala proceeds as follows:
The client application sends a SQL query by connecting to any impalad through the standardized ODBC or JDBC driver interfaces. The impalad it connects to becomes the coordinator for the current request.
The SQL query is analyzed to determine the tasks for the impalad instances in the cluster; then the optimal query execution plan is built.
Impalad directly accesses HDFS and HBase using local instances of system services to provide data. Unlike Apache Hive, such direct interaction significantly saves query execution time, as intermediate results are not saved.
In response, each daemon returns data to the coordinating impalad, which sends the results to the client.
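The request flow above can be sketched as a small simulation. This is a deliberately simplified model under stated assumptions: the `Impalad` class and `run_query` function are hypothetical, the fragments run one after another instead of in parallel, and the only supported query is a filtered sum over an `amount` column.

```python
class Impalad:
    """Toy daemon that holds a local slice of a table (hypothetical)."""

    def __init__(self, name, local_rows):
        self.name = name
        self.local_rows = local_rows

    def execute_fragment(self, predicate):
        """Scan local data directly and return a partial sum."""
        return sum(row["amount"] for row in self.local_rows if predicate(row))

    def coordinate(self, cluster, predicate):
        """Act as coordinator: fan the fragment out, then merge partial sums."""
        partials = [daemon.execute_fragment(predicate) for daemon in cluster]
        return sum(partials)


def run_query(cluster, entry_index, predicate):
    """The client may connect to any daemon; that one becomes coordinator."""
    coordinator = cluster[entry_index]
    return coordinator.coordinate(cluster, predicate)
```

Whichever daemon the client picks as its entry point, the merged result is the same, because every daemon contributes the partial result for its own local data.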

Impala Features
- Support for real-time in-memory processing
- SQL-compatible
- Supports storage systems such as HDFS, Apache HBase, and Amazon S3
- Supports integration with BI tools such as Pentaho and Tableau
- Uses HiveQL syntax
Apache Impala: Benefits
Impala avoids possible startup overhead because all system daemon processes are started directly at boot time, which significantly reduces query execution time. Impala gains additional speed because, unlike Hive, this SQL tool for Hadoop does not store intermediate results and accesses HDFS or HBase directly.
In addition, Impala generates program code at runtime and not at compilation, as Hive does. However, a side effect of Impala’s high-speed performance is reduced reliability.
In particular, if a data node goes down during the execution of a SQL query, the Impala query has to be restarted, whereas Hive will continue to keep a connection to the data source, providing fault tolerance.
Other benefits of Impala include built-in support for the secure network authentication protocol Kerberos, prioritization and the ability to manage the request queue, and support for popular Big Data formats such as LZO, Avro, RCFile, Parquet, and Sequence.
Hive Vs Impala: Similarities
Hive and Impala are both freely distributed under the Apache Software Foundation license, and both are SQL tools for working with data stored in a Hadoop cluster. In addition, they both use the HDFS distributed file system.
Impala and Hive implement different tasks with a common focus on SQL processing of big data stored in an Apache Hadoop cluster. Impala provides a SQL-like interface, allowing you to read and write Hive tables, thus enabling easy data exchange.
At the same time, Impala makes SQL operations on Hadoop quite fast and efficient, which makes this DBMS suitable for Big Data analytics research projects. Whenever possible, Impala works with an existing Apache Hive infrastructure that is already used to execute long-running SQL batch queries.
Also, Impala stores its table definitions in a metastore, a traditional MySQL or PostgreSQL database, i.e., in the same place where Hive stores similar data. It allows Impala to access Hive tables as long as all columns use Impala’s supported data types, file formats, and compression codecs.
Hive Vs Impala: Differences

Programming language
Hive is written in Java, whereas Impala is written in C++. However, Impala can also use some Java-based Hive UDFs.
Use cases
Data engineers use Hive in ETL (Extract, Transform, Load) processes, for example, for long-running batch jobs on large data sets such as those in travel aggregators and airport information systems. Impala, in turn, is intended mainly for analysts and data scientists and is mostly used in tasks like business intelligence.
Performance
Impala executes SQL queries in real time, while Hive is characterized by low data processing speed. With simple SQL queries, Impala can run 6-69 times faster than Hive. However, Hive handles complex queries better.
Latency/throughput
The throughput of Hive is significantly higher than that of Impala. The LLAP (Live Long and Process) feature, which enables query caching in memory, gives Hive good low-latency performance.
LLAP includes long-term system services (daemons), which allow you to directly interact with HDFS data nodes and replace the tightly integrated DAG query structure (Directed acyclic graph) – a graph model actively used in Big Data computing.
Fault tolerance
Hive is a fault-tolerant system that preserves all intermediate results. It also positively affects scalability but leads to a decrease in data processing speed. In turn, Impala cannot be called a fault-tolerant platform because it’s more memory bound.
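The trade-off between preserving intermediate results and raw speed can be illustrated with a back-of-the-envelope model. The two functions below are hypothetical and make strong assumptions: a query is a sequence of equally expensive stages, at most one failure occurs, and a failure costs either one extra stage (intermediate results kept) or a full restart (nothing kept).

```python
def executions_with_checkpoints(num_stages, fail_at=None):
    """Hive-like model: intermediate results persist between stages.

    Returns how many stage executions are needed in total.
    """
    executions = num_stages
    if fail_at is not None:
        executions += 1  # only the failed stage is rerun
    return executions


def executions_with_restart(num_stages, fail_at=None):
    """Impala-like model: nothing is persisted, so a failure restarts the query."""
    if fail_at is None:
        return num_stages
    # Stages 0..fail_at ran in the failed attempt, then everything reruns.
    return (fail_at + 1) + num_stages
```

Without failures both models do the same amount of work in this sketch; the cost of writing intermediate results, which is what slows Hive down in practice, is not modeled here.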
Code conversion
Hive generates query expressions at compile time, while Impala generates them at runtime. Hive is characterized by a “cold start” problem: the first time the application is launched, queries are converted slowly due to the need to establish a connection to the data source.
Impala does not have this kind of startup overhead. The necessary system services (daemons) for processing SQL queries are started at boot time, which speeds up the work.
Storage support
Impala supports LZO, Avro, and Parquet formats, while Hive works with Plain Text and ORC. However, both support the RCFile and Sequence formats.
Feature | Apache Hive | Apache Impala |
Language | Java | C++ |
Use cases | Data engineering | Analysis and analytics |
Performance | Comparatively low for simple queries | High for simple queries |
Latency | Higher latency due to caching | Lower latency |
Fault tolerance | More tolerant due to MapReduce | Less tolerant due to MPP |
Conversion | Slow due to cold start | Faster conversion |
Storage support | Plain text and ORC | LZO, Avro, Parquet |
Final Words
Hive and Impala do not compete but rather effectively complement each other. Even though there are significant differences between the two, there is also quite a lot in common and choosing one over the other depends on the data and particular requirements of the project.
You can also explore head-to-head comparisons between Hadoop and Spark.