Make MySQL, Hadoop and Hive Work Together
Companies are always looking for cheaper ways to store data. Recently, some of them have switched their RDBMS from Oracle to PostgreSQL, and their ETL platform from Oracle RAC to Greenplum.
In a lot of Web 2.0 companies, MySQL is the first choice of RDBMS, but it is not well suited to data warehousing. Not only is it much harder to scale out than RAC or Greenplum, but its SQL parser is also not as sophisticated as Oracle's. For similar reasons, we cannot use MongoDB or other NoSQL databases for a data warehouse either. How to glue the RDBMS and the ETL pipeline together cheaply is therefore a very interesting problem for a DBA.
There are several important parts in this architecture:
- We have a lot of MySQL servers running on EC2 instances. Most of them are Percona MySQL 5.5.
- Each master node currently has two or three slaves, which lets us scale reads beyond the capacity of a single database server for read-heavy workloads (a minimal slave setup is sketched after this list).
- XtraBackup takes care of MySQL backups as a crontab job. To keep the backup files safe, we store them in HDFS. Daily backups are kept for one week (a sketch of such a backup script also follows the list).
- The HDFS we use for storing backup files is part of our Hadoop cluster.
- The Hadoop cluster is composed of a primary namenode, a secondary namenode, and several datanodes.
- Each datanode has several EBS volumes, and each volume has a capacity of 500 GB.
- Backups only take up a small part of the cluster's capacity. To make full use of it, we plan to run MapReduce jobs as well.
- MapReduce on Hadoop suits analytic jobs very well, but it requires a lot of Java programming. To avoid that, we let Hive generate the MapReduce code for us (see the query sketch after this list).
- Hive is installed on the Hadoop nodes.
- Hive takes care of presenting the data stored in Hadoop, but we still need a convenient tool to load data from MySQL into Hadoop. That job is much more complicated than copying backup files into HDFS.
- Sqoop can copy a schema from MySQL to Hive easily, and importing the data into Hadoop takes little extra work (an example Sqoop import follows the list).
- Application servers generate far more log data than MySQL can reasonably store, so they save it on local disks.
- Before the application log files are rotated, parser scripts extract the useful information, format it into a plain-text file, and upload it to HDFS.
- After that, the parser scripts create a new table or partition in Hive and bind the uploaded text file to it (a sketch of this log-loading step is the last example below).
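
For the replication setup mentioned above, here is a minimal sketch of how a freshly restored slave might be pointed at its master. The host name, credentials, and binlog coordinates are placeholders, not our real values.

```bash
#!/bin/bash
# Hypothetical sketch: point a freshly restored MySQL 5.5 slave at its master.
# Host, credentials, and binlog coordinates below are placeholders.
mysql -uroot -p'secret' <<'SQL'
CHANGE MASTER TO
  MASTER_HOST='master01.db.internal',
  MASTER_USER='repl',
  MASTER_PASSWORD='repl_password',
  MASTER_LOG_FILE='mysql-bin.000123',
  MASTER_LOG_POS=107;
START SLAVE;
SQL

# Confirm both replication threads are running.
mysql -uroot -p'secret' -e 'SHOW SLAVE STATUS\G' | grep -E 'Slave_(IO|SQL)_Running'
```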
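
The backup job looks roughly like the following. This is only a sketch, assuming Percona XtraBackup's innobackupex wrapper and a Hadoop client installed on the database host; the paths, credentials, and HDFS layout are invented for illustration.

```bash
#!/bin/bash
# Hypothetical nightly backup: stream an XtraBackup copy into HDFS and
# expire anything older than seven days. Meant to run from cron, e.g.
#   30 2 * * * /usr/local/bin/mysql_backup_to_hdfs.sh
set -e

TODAY=$(date +%F)
EXPIRED=$(date -d '7 days ago' +%F)
HDFS_DIR=/backup/mysql                 # assumed layout: /backup/mysql/<date>/
LOCAL_TMP=/data/backup/$TODAY

mkdir -p "$LOCAL_TMP"

# Take a compressed streaming backup of the local MySQL instance.
innobackupex --user=backup --password=secret --stream=tar ./ \
  | gzip > "$LOCAL_TMP/full.tar.gz"

# Copy it into HDFS, then drop the local file.
hadoop fs -mkdir -p "$HDFS_DIR/$TODAY"
hadoop fs -put "$LOCAL_TMP/full.tar.gz" "$HDFS_DIR/$TODAY/"
rm -rf "$LOCAL_TMP"

# Keep one week of daily backups.
hadoop fs -rm -r -skipTrash "$HDFS_DIR/$EXPIRED" || true
```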
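
To illustrate why Hive saves us the Java work, here is the kind of daily aggregation that would otherwise be a hand-written MapReduce job; the table and column names are made up for the example.

```bash
# Hypothetical daily aggregation; Hive compiles the query into MapReduce jobs
# for us, so no Java is needed. Table and column names are invented.
hive -e "
  SELECT dt, page, COUNT(*) AS views, COUNT(DISTINCT user_id) AS uniques
  FROM page_view_log
  WHERE dt = '2012-06-01'
  GROUP BY dt, page
"
```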
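
The Sqoop step might look like this; the JDBC URL, credentials, and table names are placeholders.

```bash
# Hypothetical Sqoop commands: copy a table's schema into Hive, then
# import its rows. Connection details and table names are placeholders.
sqoop create-hive-table \
  --connect jdbc:mysql://mysql-master.internal/shop \
  --username sqoop --password secret \
  --table orders --hive-table orders

sqoop import \
  --connect jdbc:mysql://mysql-master.internal/shop \
  --username sqoop --password secret \
  --table orders --hive-import --hive-table orders \
  --num-mappers 4
```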
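
Finally, a sketch of the log-loading step: extract the useful fields, upload the result to HDFS, and bind it to a Hive partition. The log format, table definition, and paths are assumptions made for the example.

```bash
#!/bin/bash
# Hypothetical log-loading step, run before logrotate. The log format,
# table name, and HDFS paths are assumptions for this sketch.
set -e

DT=$(date +%F)
LOG=/var/log/app/access.log
OUT=/tmp/access-$DT.txt
HDFS_DIR=/warehouse/app_log/dt=$DT

# Keep only the fields we care about (here: timestamp, user id, action),
# tab-separated to match a table declared with FIELDS TERMINATED BY '\t'.
awk '{ print $1 "\t" $4 "\t" $7 }' "$LOG" > "$OUT"

# Upload the formatted file to HDFS.
hadoop fs -mkdir -p "$HDFS_DIR"
hadoop fs -put "$OUT" "$HDFS_DIR/"

# Bind the uploaded file to a new partition of the existing table.
hive -e "ALTER TABLE app_log ADD PARTITION (dt='$DT') LOCATION '$HDFS_DIR'"
```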