
Facebook's Approach to Big Data Storage Challenge

The Hadoop Distributed File System (HDFS) is one of the core storage solutions in use at Facebook. One of the most notable use cases of HDFS at Facebook is our Hive data warehouse, used to collect Facebook users' behavior from the front end. The warehouse cluster stores more than 100 PB of data, with 500+ terabytes of new data arriving every day. To meet the capacity requirements of future data growth, storing data cost-effectively has become a top priority, because a petabyte of disk space saved translates to hundreds of thousands of dollars in savings. This talk will present the various solutions we use to reduce our warehouse cluster's data footprint: (1) smart retention: automatically suggesting Hive table retention changes based on partition access history; (2) sorting Hive partitions on selected columns to increase the RCFile compression ratio; (3) HDFS file-level raiding to reduce the replication factor of warm and cold large files from 3 to a much lower ratio using XOR and Reed-Solomon codes; (4) raiding millions of warm small files at the directory level; and (5) compacting cold small files into large files in a raid-aware way to achieve the greatest replication-factor reduction from file-level raiding. We will discuss in detail how each technique works, the challenges we faced, the lessons learned during deployment, and the results we have achieved.
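
To make the replication-factor point concrete, the sketch below (not Facebook's actual code) shows the arithmetic behind file-level raiding: it compares plain 3x HDFS replication with XOR and Reed-Solomon erasure coding. The stripe length of 10 data blocks and the number of copies kept for raided data and parity blocks are illustrative assumptions.

    # Minimal sketch: effective replication factor (physical blocks stored per
    # logical data block) for one RAID stripe under different schemes.
    # Stripe sizes and parity replication here are illustrative assumptions,
    # not Facebook's production settings.

    def effective_replication(data_blocks, parity_blocks, data_rep, parity_rep):
        """Return physical-to-logical block ratio for a single stripe."""
        physical = data_blocks * data_rep + parity_blocks * parity_rep
        return physical / data_blocks

    # Baseline: every block stored three times.
    print("3x replication   :", effective_replication(10, 0, 3, 0))   # 3.0

    # XOR raiding (assumed: 10-block stripe, 1 XOR parity, 2 copies of each).
    print("XOR code         :", effective_replication(10, 1, 2, 2))   # 2.2

    # Reed-Solomon raiding (assumed: 10 data + 4 parity blocks, single copies).
    print("Reed-Solomon code:", effective_replication(10, 4, 1, 1))   # 1.4

Under these assumed parameters, moving a warm or cold file from 3x replication to XOR or Reed-Solomon raiding cuts its storage footprint by roughly 25% or 50%, respectively, which is why raiding large files and compacting small files into raidable large files both matter.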
