The Hadoop Distributed Filesystem
1. Why HDFS ?
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
HDFS只是分布式文件管理系统中的一种
Moving Computation is Cheaper than Moving Data
2. The Design of HDFS
2.1 优点
-
HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
Very large files(适合大数据处理)
-
Streaming data access(流式数据访问)
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. (一写多读)
A dataset is typically generated or copied from source, and then various analyses are performed on that dataset over time.(不支持文件的修改)
-
Commodity hardware
- Hadoop doesn’t require expensive, highly reliable hardware. It’s designed to run on clusters of commodity hardware(可构建在廉价机器上,通过多副本机制,提高可用性)
2.2 缺点
不适合低延时数据访问,比如毫秒级的存储数据,是做不到的。
-
无法高效的对大量小文件进行存储
- 存储大量小文件的话,它会占用 NameNode 大量的内存来存储文件、目录和块信息。这样是不可取的,因为 NameNode 的内存总是有限的。
- 面试常问:如何优化HDFS对于小文件的存储
* 小文件存储的寻址时间会超过读取时间,它违反了 HDFS 的设计目标。
* 上传的文件过小,上传花费时间只有几秒,但是寻址时间过长也是不合适的(访问时间和传输时间达到某一比例,效率才最佳)
-
并发写入、文件随机修改
- 一个文件只能有一个写,不允许多个线程同时写。
*仅支持数据 append(追加),不支持文件的随机修改。
- 一个文件只能有一个写,不允许多个线程同时写。
Block(面试题)
128 MB by default.Like in a filesystem for a single disk, files in HDFS are broken into block-sized chunks,which are stored as independent units. (默认大小在hadoop2.x中是128M,老版本中是64M,但是如果在本地运行,块大小就是64M)
Unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block’s worth of under‐lying storage.
Block抽象的好处
he first benefit is the most obvious: a file can be larger than any single disk in the network(文件大小可以非常大)
Second, making the unit of abstraction a block rather than a file simplifies the storage subsystem. (简化存储子系统)
-
Furthermore, blocks fit well with replication for providing fault tolerance and availability. (数据冗余备份具有容错性)
- each block is replicated to a small number of physically separate machines (typically three). If a block becomes unavailable, a copy can be read from another location in a way that is transparent to the client. (每一个block备份三份在不同的机器上,一般备份三份,如果其中一个unavailable,系统会从其他位置读取副本)
HDFS Architecture
3. Namenodes and Datanodes
- An HDFS cluster has two types of nodes operating in a master−worker pattern: a namenode (the master) and a number of datanodes (workers). (集群中有两种节点:namenode和datanode)
Namenode
The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log. (namenode管理者文件系统的命名空间,存放元数据)
Without the namenode, the filesystem cannot be used. all the files on the filesystem would be lost since there would be no way of knowing how to reconstruct the files from the blocks on the datanodes(没有namenode,文件系统将不可能再被使用)
A client accesses the filesystem on behalf of the user by communicating with the namenode and datanodes. (客户端通过与名称节点和数据节点来访问文件系统)so the user code does not need to know about thenamenode and datanodes to function.
The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.(NameNode执行命名空间操作:open,close,rename,以及决定Datanode的mapping of blocks)
Datanode
- Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the namenode), and they report back to the namenode periodically with lists of blocks that they are storing.(Datanode是文件系统工作者,负责存储和提取块并且向namenode汇报block的存储信息)
4. The File System Namespace
- The NameNode maintains the file system namespace. Any change to the file system namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that should be maintained by HDFS. The number of copies of a file is called the replication factor of that file. This information is stored by the NameNode.
Hadoop提供的两种复原namenode的机制
The first way is to back up the files that make up the persistent state of the filesystem metadata.(第一种是备份文件系统元数据的持久化状态到本地磁盘以及远程NFS挂载)
-
It is also possible to run a secondary namenode, which despite its name does not act as a namenode. (设置一个secondNode)
- Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.(作用为:定期mergenamespace和edit log并保存,防止namenode信息量过大)
5. Data Replication
All blocks in a file except the last block are the same size
-
An application can specify the number of replicas of a file
- The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once (except for appends and truncates) and have strictly one writer at any time.
The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster.(Namenode定期接收DataNode的汇报)
A Blockreport contains a list of all blocks on a DataNode
-
Block Replication
5.1 Replica Placement: The First Baby Steps
For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode, otherwise on a random datanode, another replica on a node in a different (remote) rack, and the last on a different node in the same remote rack. (当备份数量为3时,local mechine 一个,剩下两个放在remote机架上的两个node上)
The chance of rack failure is far less than that of node failure;
the replicas of a file do not evenly distribute across the racks. One third of replicas are on one node, two thirds of replicas are on one rack, and the other third are evenly distributed across the remaining racks.(文件的副本不均匀分布在机架上。三分之一的副本位于一个节点上,三分之二个副本在一个与三分之一副本同一机架上,其余第三个均匀分布在不同机架上的随机节点。)
5.2 Replica Selection
- To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader.
5.3 Safemode
- On startup, the NameNode enters a special state called Safemode. Replication of data blocks does not occur when the NameNode is in the Safemode state.
5.3.1 The Persistence of File System Metadata
The HDFS namespace is stored by the NameNode.
The NameNode uses a transaction log called the EditLog to persistently record every change that occurs to file system metadata. For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. (EditLog存放的是文件元数据的改变记录以及每个文件的备份的改变记录)
The NameNode uses a file in its local host OS file system to store the EditLog.
The entire file system namespace, including the mapping of blocks to files and file system properties, is stored in a file called the FsImage. The FsImage is stored as a file in the NameNode’s local file system too.(FsImage中存放的是文件系统命名空间和block映射和文件系统属性)
The NameNode keeps an image of the entire file system namespace and file Blockmap in memory.
The purpose of a checkpoint is to make sure that HDFS has a consistent view of the file system metadata by taking a snapshot of the file system metadata and saving it to FsImage.(checkpoint的作用,每一次checkpoint会把文件系统元数据刷新到FsImage)
Even though it is efficient to read a FsImage, it is not efficient to make incremental edits directly to a FsImage. Instead of modifying FsImage for each edit, we persist the edits in the Editlog.(持续的将日志edits直接转为FsImage是非常不高效的,因此我们选择将日志记录在EditLog中)
When a DataNode starts up, it scans through its local file system, generates a list of all HDFS data blocks that correspond to each of these local files, and sends this report to the NameNode. The report is called the Blockreport.(BlokReport:向namenode汇报data blocks)
-
A checkpoint can be triggered at a given time interval (dfs.namenode.checkpoint.period) expressed in seconds, or after a given number of filesystem transactions have accumulated (dfs.namenode.checkpoint.txns). If both of these properties are set, the first threshold to be reached triggers a checkpoint.
1.dfs.namenode.checkpoint.period 默认为3600秒 ——The number of seconds between two periodic checkpoints. 2.dfs.namenode.checkpoint.txns 默认为1000000 ——The Secondary NameNode or CheckpointNode will create a checkpoint of the namespace every 'dfs.namenode.checkpoint.txns' transactions, regardless of whether 'dfs.namenode.checkpoint.period' has expired.
5.4 Data Disk Failure, Heartbeats and Re-Replication
- Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them.(每一个datanode定期向namenode发送一个heartbeat message)
5.5 Metadata Disk Failure
The FsImage and the EditLog are central data structures of HDFS.
Another option to increase resilience against failures is to enable High Availability using multiple NameNodes either with a shared storage on NFS or using a distributed edit log (called Journal). The latter is the recommended approach.
6. Block Caching
- Normally a datanode reads blocks from disk, but for frequently accessed files the blocks may be explicitly cached in the datanode’s memory, in an off-heap block cache. By default, a block is cached in only one datanode’s memory(对于使用较频繁的文件可放在DataNode的内存中)
7. HDFS Federation
Under federation, each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace. Namespace volumes are independent of each other, which means namenodes do not communicate with one another, and furthermore the failure of one
namenode does not affect the availability of the namespaces managed by other namenodes. (多个namenode,互不相关,管理着不同目录下的文件信息)The namenode keeps a reference to every file and block in the filesystem in memory,which means that on very large clusters with many files, memory becomes the limiting factor for scaling
8. HDFS High Availability
虽然有进行备份元数据的持久化状态和设置SecondNamenode,但是namenode仍旧是文件系统的关键,一旦他遭到损坏,除非有新的namenode,否则文件系统将不会提供服务
hadoop添加了HA支持
8.1 HA内容
In this implementation, there are a pair of namenodes in an active-standby configuration. In the event of the failure of the active namenode, the standby takes over its duties to continue servicing client requests without a significant interruption.(此时用到了一对namenodes)
The namenodes must use highly available shared storage to share the edit log. When a standby namenode comes up, it reads up to the end of the shared edit log to synchronize its state with the active namenode(active namenode和standbynamenode共享edit log的存储)
对于高实用性共享存储有两种选择:NFS文件,QJM(quorum journal manager)。QJM专注于HDFS的实现,其唯一目的就是提供一个高实用性的可编辑日志,也是大多是HDFS安装时所推荐的。
Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.(Datanode同时向两个namenodes汇报block的情况)
The secondary namenode’s role is subsumed by the standby, which takes periodic checkpoints of the active namenode’s namespace
If the active namenode fails, the standby can take over very quickly (in a few tens of seconds) because it has the latest state available in memory: both the latest edit log entries
and an up-to-date block mapping. (如果active namenode发生故障 ,standby namenode会迅速接管任务(在数秒内),因为在内存中备份节点有最新的可用状态,包括最新的可编辑日志记录和块映射信息。)
从活动主节点到备份节点的故障切换是由系统中一个新的实体——故障切换控制器来管理的。虽然有多种版本的故障切换控制器,但是hadoop默认的是ZooKeeper,它也可确保只有一个namenode是处于活动状态。每一个namenode节点上都运行一个轻量级的故障切换控制器进程,它的任务就是去监控namenode的故障,一旦namenode发生故障,它就会触发故障切换。
HA的实现会竭尽全力的去确保之前的活动主节点不会做出任何导致故障的有害举动,这个方法就是fencing。
9. The Java Interface
Reading Data from a Hadoop URL
- 查看FileSystem.md
10. Data Flow——Anatomy of a File Read
-
step 1
- The client opens the file it wishes to read by calling open() on the FileSystem object,which for HDFS is an instance of DistributedFileSystem(open方法)
-
step 2
DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file.(DistributeFileSystem调用namenode,返回了文件的block locations)
For each block, the namenode returns the addresses of the datanodes that have a copy of that block. Furthermore, the datanodes are sorted according to their proximity to the client. (对于每一块,namenode返回一个有block的最近的节点上的地址)
If the client is itself a datanode, the client will read from the local datanode if that datanode hosts a copy of the block。(还是就近原则,如果client本身为一个datanode,而且存在保存了文件的block,则客户端会直接从本地读取数据)
The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the datanode and namenode I/O.(DistributedFileSystem向client客户端返回了一个FSDataInputStream,FSDataInpuStream中还包裹着DFSInputtream,用来控制namenode和datanode之间的IO操作)
-
step 3
- The client then calls read() on the stream(客户端通过DistributedFileSystem返回的FSDataInputStream对象调用read()方法)
-
step 4
- DFSInputStream, which has stored the datanode addresses for the first few blocks in the file, then connects to the first (closest) datanode for the first block in the file. Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.(通过DFSInputStream中储存的datanode的地址信息不断地从block中读取文件内容)
-
step 5
When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block. This happens transparently to the client,which from its point of view is just reading a continuous stream.(当地一个block读完后,DFSdataInputStream将会关闭与此datanode的连接,而找下一个block继续进行读取,从client的角度来说,这些操作均为透明的,意思是从客户端的角度以为是在持续的读取一个流)
Blocks are read in order, with the DFSInputStream opening new connections to datanodes as the client reads through the stream. It will also call the namenode to retrieve the datanode locations for the next batch of blocks as needed.(当客户端通过DFSInputStream打开的与某一节点上的块的连接从而读取数据时,DFSInputStream同样通知namenode去获得下一批需要的block的地址)
-
step 6
- When the client has finished reading, it calls close() on the FSDataInputStream (当数据读取完毕后,DFSInputStream)
-
Notes
During reading, if the DFSInputStream encounters an error while communicating with a datanode, it will try the next closest one for that block. It will also remember datanodes that have failed so that it doesn’t needlessly retry them for later blocks(DFSiInputStream在于DataNode交互遇到error时,会继续选择最近的block。而且会记录failed datanodes)
The DFSInput Stream also verifies checksums for the data transferred to it from the datanode. If a corrupted block is found, the DFSInputStream attempts to read a replica of the block from another datanode; it also reports the corrupted block to the namenode.(DFSInputStream会向namenode报告corrupted block)
11. Network Topology and Hadoop
如何判断网络中两个节点是近邻的?
Hadoop takes a simple approach in which the network is represented as a tree and the distance between two nodes is the sum of their distances to their closest common ancestor(将网络看做了颗树,两个节点之间的距离是这两个节点到他们共同祖先的距离和)
-
For example, imagine a node n1 on rack r1 in data center d1. This can be represented as /d1/r1/n1. Using this notation, here are the distances for the four scenarios:
distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)——相同节点
distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)——同一机架上不同节点
distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)——同一数据中心不同机架上不同节点
distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)——不同数据中心节点
11. Data Flow——Anatomy of a File Write
- We’re going to consider the case of creating a new file, writing data to it, then closing the file.
-
step 1 && step 2
The client creates the file by calling create() on DistributedFileSystem .DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem’s namespace, with no blocks associated with it.(通过DistributedFileSystem通知namenode在namespace创建一个新文件,此时没有任何与之相关联的blocks)
-
The namenode performs various checks to make sure the file doesn’t already exist and that the client has the right permissions to create the file. (namenode会进行文件是否存在检查以及cilent是否有权限进行文件创建)
此时想到一个问题:若无权限,则运行程序时显示:Permission denied
-
解决方法是:
* 修改hdfs-site.xml为属性:dfs.permissions.enabled的值为false,意思是不进行是否允许检测
If these checks pass, the namenode makes a record of the new file;(两个检查均通过,则在namenode中创建一条记录)
The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the datanodes and namenode.(DistributedFileSystem向客户端返回了包裹着DFSOutputStream的FSDataOutputStream。)
-
step 3
As the client writes data , the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas. (当客户端写出数据时,DFSOutputStream将数据分为packets,并写入到data queue,data queue 被DataStreamer进行管理,DataStreamer是用来向namenode申请分配block的)
The list of datanodes forms a pipeline(datanodes 形成管道)
-
step 4
- The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.(DataStreamer将packets传递给第一个datanode,存储packet后传递给下一个datanode......)
-
step 5
The DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
If any datanode fails while data is being written to it, then the following actions are taken, First, the pipeline is closed,and any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets。(首先,关闭pipeline,packets被加入到新的data queue中)
As long as dfs.namenode.replication.min replicas (which defaults to 1) are written,the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to 3).
-
step 6
When the client has finished writing data, it calls close() on the stream (step 6). This action flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before contacting the namenode to signal that the file is complete.
The namenode already knows which blocks the file is made up of (because Data Streamer asks for block allocations)
-
Replica Placement
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random.(第一个在Client本地或者随机挑选一个节点)
The second replica is placed on a different rack from the first (off-rack), chosen at random.(第二个备份被放置在与第一份在不同机架上的不同节点)
The third replica is placed on the same rack as the second, but on a different node chosen at random.(第三份放在与第二份相同的机架上的不同节点)
Once the replica locations have been chosen, a pipeline is built, taking network topology into account.(备份位置选取完毕后,pipeline被成功创建)
这样三个备份存储在两个机架上
-
hadoop2.7.2以后的Replica Placement
第一个在client本地或者随机选一个节点
第二个在与第一个同一机架上的不同节点
第三个在不同机架上的随机节点
但是思想是不变的,都是三个备份,两个机架