Friday, July 3, 2015

Notes about Google file system paper (SOSP 2003)


Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.

Paper link: http://research.google.com/archive/gfs.html

1. As in other papers from Google, one successful model at Google is to use many cheap commodity machines to achieve large scale and good performance at low cost. Since the number of machines is high (hundreds or thousands) and the hardware is cheap, hardware failure is the norm rather than the exception. The core logic for high availability and fault tolerance therefore lives in software, which must handle these hardware failures. This contrasts with using reliable but expensive hardware, which costs more in hardware but demands less from software.

GFS deals with hardware failure in the design phase. The basic ideas are replication and quick restart (via logging). The Google file system (GFS) uses one master server and many chunk servers. Each chunk of data is replicated on multiple chunk servers. To handle master failure, GFS uses a shadow master (which lags slightly behind the master) that can replace the master by replaying the master's operation log.
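The replay idea can be sketched in a few lines. This is a minimal illustration of rebuilding master metadata from an operation log; the record format and the `Metadata` class are assumptions for illustration, not GFS's actual on-disk format.

```python
class Metadata:
    """In-memory master state: file path -> list of chunk handles.
    (Illustrative only; real GFS metadata is richer than this.)"""
    def __init__(self):
        self.files = {}

    def apply(self, record):
        op, args = record
        if op == "create":
            self.files[args["path"]] = []
        elif op == "add_chunk":
            self.files[args["path"]].append(args["chunk_handle"])
        elif op == "delete":
            self.files.pop(args["path"], None)

def recover(log_records):
    """Rebuild master state by replaying the log from the beginning."""
    state = Metadata()
    for record in log_records:
        state.apply(record)
    return state

# A shadow master replays the same log, slightly behind the primary.
log = [
    ("create", {"path": "/data/a"}),
    ("add_chunk", {"path": "/data/a", "chunk_handle": "c1"}),
    ("add_chunk", {"path": "/data/a", "chunk_handle": "c2"}),
]
master = recover(log)
```

Because any replica of the log can be replayed to reach the same state, recovery is just "start a fresh master and replay" (in practice from the latest checkpoint plus the log tail, to bound replay time).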

2. To achieve high performance with a simple design, GFS does not stick to a standard file system interface (such as POSIX). The GFS designers analyzed their workloads and discarded some normal file system features and APIs. The benefit is that the new interface is cleaner and easier to scale; the drawback is that applications must be rewritten against the new APIs.

The first example of such a design is the 64MB chunk size, which is much bigger than the block size in a normal file system. The rationale is that most workloads use huge files. A bigger chunk size means less metadata on the master and easier scaling; the drawback is internal fragmentation. To detect data corruption at a finer granularity, GFS also keeps a checksum for each 64KB block inside a chunk.
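The per-64KB checksum scheme can be sketched as follows. This is a minimal illustration, not GFS's actual code: CRC32 stands in for whatever checksum function the chunkservers use, and the function names are mine.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64MB chunk (GFS chunk size)
BLOCK_SIZE = 64 * 1024          # checksum granularity: 64KB

def block_checksums(chunk_data):
    """Compute one checksum per 64KB block of a chunk."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

def verify(chunk_data, checksums):
    """Return the indices of corrupted 64KB blocks (empty if clean)."""
    return [i for i, c in enumerate(block_checksums(chunk_data))
            if c != checksums[i]]
```

The payoff of the finer granularity: a single flipped bit invalidates only one 64KB block, so a read that touches other blocks of the chunk can still be served, and only the bad block's range needs to be re-fetched from another replica.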

Another example is the modified append API. The new record append does not let the caller pass in an offset. Instead, the server decides the offset (in GFS, the primary chunkserver holding the chunk's lease, not the master) and returns it to the caller. This makes concurrent appends from many clients much easier to implement.
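The server-chooses-the-offset idea can be sketched with a single lock on the serving side; the class and method names here are illustrative, not the GFS API.

```python
import threading

class Chunk:
    """Sketch of record append where the server picks the offset."""
    def __init__(self, size=64 * 1024 * 1024):
        self.data = bytearray()
        self.size = size
        self.lock = threading.Lock()

    def record_append(self, record):
        """Append atomically at a server-chosen offset and return that
        offset, or None if the record no longer fits in this chunk
        (the client then retries on a new chunk)."""
        with self.lock:
            if len(self.data) + len(record) > self.size:
                return None
            offset = len(self.data)
            self.data.extend(record)
            return offset
```

Concurrent clients never race on a caller-supplied offset: each append is serialized at the server, and every caller learns where its record actually landed.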

Yet another example is that GFS discards the inode concept. Directory metadata does not list the files inside the directory; the namespace is instead a flat lookup table mapping full pathnames to metadata. The goal is to avoid a bottleneck when many files inside one directory need to be modified concurrently.
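A rough sketch of that flat namespace, assuming a per-pathname lock (GFS additionally takes read locks on ancestor paths, omitted here for brevity; all names are illustrative):

```python
import threading

class Namespace:
    """Flat namespace: full pathname -> metadata, no per-directory inode."""
    def __init__(self):
        self.table = {}                    # "/dir/a" -> metadata
        self.locks = {}                    # "/dir/a" -> lock
        self.registry_lock = threading.Lock()

    def _lock_for(self, path):
        with self.registry_lock:
            return self.locks.setdefault(path, threading.Lock())

    def create(self, path, meta):
        # There is no directory entry to rewrite, so creating /dir/a
        # and /dir/b concurrently only contends on their own path locks,
        # never on a shared "/dir" record.
        with self._lock_for(path):
            self.table[path] = meta
```

With an inode-style design, every create under `/dir` would serialize on the directory's file list; here two creates in the same directory proceed independently.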

3. Lessons learned:
 a) Understand the workload first; the system design can then be tailored (optimized) to that specific kind of workload.
 b) A standard (old) API can be modified to satisfy new requirements when, as in this case, there is no backward-compatibility constraint.
