Abstract:
Due to the increasing amount of data in storing and processing, the traditional RDBMS encounters performance bottleneck. As a data warehouse built on Hadoop for providing data analysis and summarization as a right alternative to the traditional RDBMS, Hive becomes the first choice for many enterprises to deal with big data for its massive scale out and fault tolerance capabilities. In traditional enterprises particularly, a wide variety of legacy applications depend on the traditional RDBMS. Therefore, when migrating these applications to Hive necessary, a large number of queries need to be translated, which will consume hug cost of labor and time via manual way. This paper proposes a query-tree based approach for automatically translating SQL in RDBMS into proper HiveQL. The SQL parser is applied to parsing SQL sentence to query trees that will be supplied with correspondence between tables and columns during pretreatment. By taking into account of set operations, correlated subqueries, and other structures that HiveQL support weakly, this paper proposes eight different rewriting strategies to reconstruct query trees, and in turn to transform those queries in HiveQL sentences. A translation tool called DFMapper may provide a strategies loader to dynamically adjust the specific strategies according to actual requirements, e.g., the version of Hive, SQL dialect, etc., via the modification of externalized configuration. Besides, a validator is designed to verify the accuracy of translation by comparing the result sets of queries executed in RDBMS and Hive, respectively. It is demonstrated via the experiments on the TPC-DS benchmark composed of 99 different queries and covering a varity of ANSI SQL syntax that DFMapper can correctly translate the vast majority of universal queries with strong extensibility.