http://www.widriksson.com/raspberry-pi-hadoop-cluster/
https://developer.ibm.com/recipes/tutorials/building-a-hadoop-cluster-with-raspberry-pi/
Raspberry PI Hadoop Cluster
If you like Raspberry Pis and want to get into distributed computing and Big Data processing,
what could be better than creating your own Raspberry Pi Hadoop cluster?
This tutorial does not assume that you have any previous knowledge of Hadoop. Hadoop is a framework for the storage and processing of large amounts of data,
or “Big Data”, which is a pretty common buzzword these days.
The performance of running Hadoop on a Raspberry Pi
is probably terrible,
but I hope to be able to make a small and fully functional little cluster
to see how it works and performs.
For a tutorial on Hadoop 2 please see my newer post:
http://www.widriksson.com/raspberry-pi-2-hadoop-2-cluster/
In this tutorial we start with one Raspberry Pi
and then add two more once we have a working single node.
We will also do some simple performance tests
to compare the impact of adding more nodes to the cluster.
Lastly, we try to improve and optimize Hadoop for the Raspberry Pi cluster.
Fundamentals of Hadoop
What is Hadoop?
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
Rather than rely on hardware to deliver high-availability,
the library itself is designed to detect and handle failures at the application layer,
so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
Components of Hadoop
Hadoop is built up of a number of components
and open source frameworks, which makes it quite flexible and modular.
However, before diving deeper into Hadoop,
it is easier to view it as two main parts: data storage (HDFS)
and data processing (MapReduce):
■HDFS – Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) was designed to run on low-cost hardware
and is highly fault tolerant.
Files are split up into blocks that are replicated to the DataNodes.
By default blocks have a size of 64MB
and are replicated to 3 nodes in the cluster.
However, those settings can be adjusted to suit specific needs.
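For reference, both of those defaults live in conf/hdfs-site.xml. A minimal sketch of the relevant properties (Hadoop 1.x property names; the values shown are just the defaults mentioned above, so you would only add them if you want to change them):

<!-- conf/hdfs-site.xml: sketch of the block size and replication settings (Hadoop 1.x names) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- number of DataNodes each block is replicated to -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>   <!-- block size in bytes: 64 MB -->
  </property>
</configuration>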
Overview of HDFS File System architecture:
■MapReduce
MapReduce is a software framework written in Java
that is used to create application that can process large amount of data.
Although its written in Java there are other languages available to write a MapReduce application.
As with HDFS it is built to be fault tolerant and to work in large-scale cluster environments. The framework has the ability to split up input data into smaller tasks (map tasks) that can be executed in parallel processes.
The output from the map tasks is then reduced (reduce task) and usually saved to the file system.
Below you will see the MapReduce flow of the WordCount sample program that we will use later. WordCount takes a text file as input, divides it into smaller parts, then counts each word and outputs a file with a count of all words within the file.
MapReduce flow overview (WordCount example):
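To make the map and reduce steps concrete, here is a minimal WordCount sketch in Java against the Hadoop 1.x MapReduce API. It is not the exact bundled example we will run later, just an illustration of the same flow: the mapper emits (word, 1) pairs and the reducer sums them per word.

// Minimal WordCount sketch (Hadoop 1.x "new" MapReduce API) - illustrative, not the bundled example.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    // Map task: emit (word, 1) for every word in the input split this task was given.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum all the 1s emitted for each word and write the total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Job driver: wires mapper and reducer together; input/output HDFS paths come from the command line.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount-sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}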
Daemons/services
Daemon/service | Description |
NameNode | Runs on a Master node. Manages the HDFS file system on the cluster. |
Secondary NameNode | Very misleading name. It is NOT a backup for the NameNode. It makes periodic checkpoints of the NameNode metadata so that, if the NameNode fails, it can be restarted from a recent checkpoint instead of replaying a huge edit log. – http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F |
JobTracker | Manages MapReduce jobs and distributes them to the nodes in the cluster. |
DataNode | Runs on a slave node. Acts as HDFS file storage. |
TaskTracker | Runs MapReduce jobs which are received from the JobTracker. |
Master and Slaves
■Master
The node in the cluster
that runs the NameNode and JobTracker.
In this tutorial we will also configure our master node to act as both master and slave (see the file sketch after the note below).
■Slave
A node in the cluster that acts as a DataNode and TaskTracker.
Note: When a node is running a job the TaskTracker will try to use local data (in its “own” DataNode) if possible. Hence the benefit of having both the DataNode and TaskTracker on the same node, since there will be no network traffic overhead. This also implies that it is important to know how data is distributed and stored in HDFS.
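As a concrete (and purely illustrative) example of that layout: Hadoop 1.x learns the cluster topology from two plain text files in conf/. The hostnames below are placeholders for the three Raspberry Pis, not names from the original text, and note that conf/masters, despite its name, only lists where the Secondary NameNode is started:

# conf/masters - despite the name, the host(s) that run the Secondary NameNode
node1

# conf/slaves - hosts that run a DataNode and TaskTracker; the master is listed here too so it also acts as a slave
node1
node2
node3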
Start/stop scripts
Script | Description |
start-dfs.sh | Starts NameNode, Secondary NameNode and DataNode(s) |
stop-dfs.sh | Stops NameNode, Secondary NameNode and DataNode(s) |
start-mapred.sh | Starts JobTracker and TaskTracker(s) |
stop-mapred.sh | Stops JobTracker and TaskTracker(s) |
The above scripts should be executed from the NameNode. Through SSH connections, daemons will be started on all the nodes in the cluster (all nodes defined in conf/slaves).
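Putting it together, a typical session on the master might look like the sketch below. This assumes Hadoop's bin directory is on your PATH; jps is a standard JDK tool that lists running Java processes and is a handy way to verify the daemons.

# Run on the master/NameNode (Hadoop 1.x; assumes hadoop/bin is on PATH)
start-dfs.sh        # starts NameNode, Secondary NameNode and the DataNodes listed in conf/slaves
start-mapred.sh     # starts JobTracker and the TaskTrackers

jps                 # on a master that also acts as a slave you should see:
                    # NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker

stop-mapred.sh      # stop MapReduce first, then HDFS, when shutting the cluster down
stop-dfs.sh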