When Hadoop met Teradata
21 Aug 2014

In 2009, I got my first full-time job with Teradata, in an R&D position. Since then, my focus has shifted to parallel DBMSs and distributed computation.
Hadoop was rising as a superstar of the big data movement. Compared with proprietary software, Hadoop is open source (i.e., free) and has a very active community, which won it favor among small and mid-sized businesses.
Teradata, however, sells its products and services at a premium, and most of its clients are large companies. As one of the important players in the field, Teradata did not want to be left behind in the race, even though it actually led in many aspects. One big move to address the challenge came in 2011, when Teradata acquired Aster Data Systems.
The common idea shared by Hadoop and the Teradata EDW (Enterprise Data Warehouse) is that data are partitioned across multiple nodes, so that computation can proceed in parallel on each partition. Part of my work then was to explore collaboration opportunities between the two systems.
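As a concrete illustration of that shared idea (not Teradata's or Hadoop's actual placement scheme), here is a minimal sketch of hash-partitioning rows across nodes by a key column, so that each node can work on its own bucket in parallel:

```python
# Illustrative hash partitioning: every row is routed to exactly one
# "node" bucket by a stable hash of its key, so per-bucket work
# (scans, aggregates) can run independently in parallel.
import hashlib

def node_for_key(key, num_nodes):
    """Map a row key to one of num_nodes partitions via a stable hash."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

def partition_rows(rows, key_fn, num_nodes):
    """Group rows into per-node buckets keyed by the hashed partition key."""
    buckets = {n: [] for n in range(num_nodes)}
    for row in rows:
        buckets[node_for_key(key_fn(row), num_nodes)].append(row)
    return buckets

rows = [{"id": i, "value": i * i} for i in range(10)]
buckets = partition_rows(rows, lambda r: r["id"], num_nodes=4)
```

Because the hash is stable, lookups by key can later be routed to the single node that owns the row, without broadcasting the query to all nodes.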
One application was loading data from HDFS into the Teradata data warehouse1. It enables a Hadoop-Teradata EDW co-existing environment: the Teradata EDW is the first tier, providing data storage and the main data analytics, while Hadoop is the second tier, providing intermediate data storage and processing. Here Hadoop serves mainly as part of an ETL pipeline rather than for data analytics itself.
On the other hand, it is possible to run Hadoop MapReduce jobs on data stored in the Teradata EDW, through customized input format classes2. Hadoop thus offers a more flexible, though possibly less efficient, approach to complex data analytics than UDFs (User Defined Functions).
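The core trick behind such input format classes is to carve a table into non-overlapping row-range splits, one per map task. A toy sketch of that idea, with a hypothetical `sales` table and sqlite3 standing in for the warehouse (the split logic is the point, not the storage engine):

```python
# Sketch of database-backed input splits: divide the table into
# contiguous (offset, length) ranges, then let each "map task" scan
# only its own range. Partial results combine to the full answer.
import sqlite3

def make_splits(total_rows, num_splits):
    """Divide [0, total_rows) into num_splits contiguous (offset, length) ranges."""
    base, extra = divmod(total_rows, num_splits)
    splits, offset = [], 0
    for i in range(num_splits):
        length = base + (1 if i < extra else 0)
        splits.append((offset, length))
        offset += length
    return splits

def map_over_split(conn, split, mapper):
    """Run the mapper over only the rows belonging to one split."""
    offset, length = split
    cur = conn.execute(
        "SELECT id, amount FROM sales ORDER BY id LIMIT ? OFFSET ?",
        (length, offset))
    return [mapper(row) for row in cur]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(i, float(i)) for i in range(7)])

splits = make_splits(7, 3)  # → [(0, 3), (3, 2), (5, 2)]
partials = [map_over_split(conn, s, lambda row: row[1]) for s in splits]
total = sum(sum(p) for p in partials)  # same answer as one full scan
```

A real implementation would also have to pin each split to a consistent snapshot of the table; this sketch sidesteps that by reading from an in-memory database.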
One interesting technique we developed optimizes the assignment of data from HDFS (the Hadoop Distributed File System) to the Teradata EDW3 when both run on the same hardware cluster. It was exciting to see that the problem reduces to min-cost flow on a bipartite network.
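A minimal sketch of that reduction, with all names and numbers illustrative rather than taken from the paper: HDFS blocks on one side, nodes on the other, zero cost when a node already holds a replica of the block, unit cost when the block would need a network transfer, and a per-node capacity for load balance. The assignment is then a textbook successive-shortest-path min-cost max-flow:

```python
# Min-cost flow sketch for block-to-node assignment: source -> blocks
# (cap 1 each), blocks -> nodes (cost 0 if local replica, else 1),
# nodes -> sink (cap = per-node load limit).
from collections import deque

class MinCostFlow:
    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]  # edge: [to, cap, cost, rev_index]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        total_flow, total_cost = 0, 0
        while True:
            # Bellman-Ford (queue-based) shortest path by cost.
            dist = [float("inf")] * self.n
            in_q = [False] * self.n
            prev = [None] * self.n  # (node, edge_index) on the path
            dist[s] = 0
            q = deque([s])
            while q:
                u = q.popleft()
                in_q[u] = False
                for i, (v, cap, cost, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        if not in_q[v]:
                            in_q[v] = True
                            q.append(v)
            if dist[t] == float("inf"):
                return total_flow, total_cost
            # Augment one unit (each block is a single unit of flow).
            v = t
            while v != s:
                u, i = prev[v]
                self.graph[u][i][1] -= 1
                self.graph[v][self.graph[u][i][3]][1] += 1
                v = u
            total_flow += 1
            total_cost += dist[t]

# Toy instance: 3 blocks, all replicated only on node 0, which can
# hold at most 2 of them -- so exactly one block must be shipped.
replicas = [{0}, {0}, {0}]
num_blocks, num_nodes, cap_per_node = 3, 2, 2
S, T = num_blocks + num_nodes, num_blocks + num_nodes + 1
mcf = MinCostFlow(num_blocks + num_nodes + 2)
for b in range(num_blocks):
    mcf.add_edge(S, b, 1, 0)                 # each block assigned exactly once
    for n in range(num_nodes):
        mcf.add_edge(b, num_blocks + n, 1, 0 if n in replicas[b] else 1)
for n in range(num_nodes):
    mcf.add_edge(num_blocks + n, T, cap_per_node, 0)  # load-balance cap
flow, cost = mcf.solve(S, T)  # flow == 3, cost == 1: one block transferred
```

The same skeleton scales to the real setting by making the edge costs reflect actual transfer volumes and the sink capacities reflect each node's share of the load.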
These projects had a long-lasting influence on my career in the following years. (1) They led me into a field where thinking at scale is necessary. (2) Another important lesson I learned is that ideas, like partitioning data for parallel computation, may come and go, but people pay more attention to the problems they solve. A problem-driven strategy can therefore serve one better in this field.