Companies are always looking for cheaper ways to store data. Recently, some of them have switched their RDBMS from Oracle to PostgreSQL, and then moved their ETL from Oracle RAC to Greenplum.
In many Web 2.0 companies, MySQL is the first choice of RDBMS. But MySQL is not good at data warehousing. Not only is it much more difficult to scale out than RAC or Greenplum, but its SQL parser is also not as sophisticated as Oracle’s. For the same reason, we cannot use MongoDB or other NoSQL databases for a data warehouse. How to glue the RDBMS and ETL together cheaply is a very interesting question for a DBA.
At my current company, I have found that MySQL and Hive work pretty well together. Here is a picture showing the whole architecture.
At my previous company, Taobao, the method used to back up MySQL depended heavily on NAS. I don’t know how much NAS costs, but it must be very expensive, because the guy on the storage team asked me a lot of questions after I applied for a NAS volume.
Now, Nubee builds everything on Amazon’s cloud. There are no more expensive devices, and there is no more one-stop service. Although I miss them a lot, I still have to face the existing conditions. Before I left Taobao, someone on the DBA team was researching the use of Hadoop for backups. I think that is a brilliant idea. Besides being convenient to manage, Hadoop also provides linear growth in capacity and performance, which makes it a perfect substitute for NAS. So I tried it, and I succeeded in backing up MySQL to Hadoop (HDFS, to be precise). It is rather fashionable to do all of this on EC2 instances, and here are the details for curious readers.
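To give a rough idea of what "backing up MySQL to HDFS" can look like, here is a minimal Python sketch. It is only an illustration under assumptions, not necessarily the exact method used in this setup: it assumes `mysqldump` and the `hadoop` command-line client are installed on the instance, that MySQL credentials come from a client option file such as `~/.my.cnf`, and that a logical dump is acceptable. The function names, the target directory, and the file naming scheme are all hypothetical.

```python
import subprocess
from datetime import date


def build_backup_commands(hdfs_dir, day=None):
    """Return the (dump, upload) command lists for one daily backup.

    hdfs_dir and the mysql-YYYY-MM-DD.sql naming are illustrative choices.
    """
    day = day or date.today()
    target = f"{hdfs_dir.rstrip('/')}/mysql-{day.isoformat()}.sql"
    # --single-transaction takes a consistent snapshot of InnoDB tables
    # without locking them for the duration of the dump.
    dump_cmd = ["mysqldump", "--single-transaction", "--all-databases"]
    # "hadoop fs -put - <path>" reads the data to upload from stdin.
    put_cmd = ["hadoop", "fs", "-put", "-", target]
    return dump_cmd, put_cmd


def run_backup(hdfs_dir):
    """Stream the dump straight into HDFS without a local temporary file."""
    dump_cmd, put_cmd = build_backup_commands(hdfs_dir)
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    put = subprocess.Popen(put_cmd, stdin=dump.stdout)
    # Close our copy of the pipe so mysqldump sees a broken pipe
    # if the upload process exits early.
    dump.stdout.close()
    put_rc = put.wait()
    dump_rc = dump.wait()
    if dump_rc != 0 or put_rc != 0:
        raise RuntimeError(
            f"backup failed (mysqldump={dump_rc}, hadoop={put_rc})"
        )
```

Calling `run_backup("/backup/mysql")` from a daily cron job would then write one dump file per day into that (hypothetical) HDFS directory. Streaming through a pipe avoids needing local disk space for the full dump, which matters on small EC2 instances.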