http://www.widriksson.com/raspberry-pi-hadoop-cluster/
https://developer.ibm.com/recipes/tutorials/building-a-hadoop-cluster-with-raspberry-pi/
Raspberry PI Hadoop Cluster
If you like Raspberry Pis and want to get into distributed computing and Big Data processing,
what could be better than creating your own Raspberry Pi Hadoop cluster?
This tutorial does not assume that you have any previous knowledge of Hadoop. Hadoop is a framework for the storage and processing of large amounts of data,
or “Big Data”, which is a pretty common buzzword these days.
The performance of running Hadoop on a Raspberry Pi
is probably terrible,
but I hope to be able to make a small and fully functional little cluster
to see how it works and performs.
For a tutorial on Hadoop 2 please see my newer post:
http://www.widriksson.com/raspberry-pi-2-hadoop-2-cluster/
In this tutorial we start with one Raspberry Pi
and then add two more once we have a working single node.
We will also do some simple performance tests
to compare the impact of adding more nodes to the cluster.
Lastly, we try to improve and optimize Hadoop for the Raspberry Pi cluster.
Fundamentals of Hadoop
What is Hadoop?
“The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets
across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines,
each offering local computation and storage.
Rather than rely on hardware to deliver high-availability,
the library itself is designed to detect and handle failures at the application layer,
so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
Components of Hadoop
Hadoop is built up of a number of components
and open source frameworks, which makes it quite flexible and modular.
However, before diving deeper into Hadoop,
it is easier to view it as two main parts: data storage (HDFS)
and data processing (MapReduce):
■HDFS – Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) was designed to run on low-cost hardware
and is highly fault tolerant.
Files are split up into blocks that are replicated to the DataNodes.
By default blocks have a size of 64MB
and are replicated to 3 nodes in the cluster.
However, those settings can be adjusted to suit specific needs.
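For reference, both of those defaults live in conf/hdfs-site.xml. A minimal sketch of the relevant properties (Hadoop 1.x property names; the values shown are just the defaults mentioned above, so you would only add them if you want to change them):

<!-- conf/hdfs-site.xml: sketch of the block size and replication settings (Hadoop 1.x names) -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>          <!-- number of DataNodes each block is replicated to -->
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>   <!-- block size in bytes: 64 MB -->
  </property>
</configuration>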
Overview of HDFS File System architecture:
■MapReduce
MapReduce is a software framework written in Java
that is used to create application that can process large amount of data.
Although its written in Java there are other languages available to write a MapReduce application.
As with HDFS it is built to be fault tolerant and to work in large-scale cluster environments. The framework has the ability to split up input data into smaller tasks (map tasks) that can be executed in parallel processes.
The output from the map tasks is then reduced (reduce task) and usually saved to the file system.
Below you will see the MapReduce flow of the WordCount sample program that we will use later. WordCount takes a text file as input, divides it into smaller parts, then counts each word and outputs a file with a count of all words within the file.
MapReduce flow overview (WordCount example):
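To make the map and reduce steps concrete, here is a minimal WordCount sketch in Java against the Hadoop 1.x MapReduce API. It is not the exact bundled example we will run later, just an illustration of the same flow: the mapper emits (word, 1) pairs and the reducer sums them per word.

// Minimal WordCount sketch (Hadoop 1.x "new" MapReduce API) - illustrative, not the bundled example.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

    // Map task: emit (word, 1) for every word in the input split this task was given.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce task: sum all the 1s emitted for each word and write the total.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Job driver: wires mapper and reducer together; input/output HDFS paths come from the command line.
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "wordcount-sketch");
        job.setJarByClass(WordCountSketch.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}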
Daemons/services
Daemon/service | Description |
NameNode | Runs on a Master node. Manages the HDFS file system on the cluster. |
Secondary NameNode | Very misleading name. It is NOT a backup for the NameNode. It makes periodic checkpoints of the NameNode metadata so that, if the NameNode fails, it can be restarted from a recent checkpoint instead of replaying a huge edit log. – http://wiki.apache.org/hadoop/FAQ#What_is_the_purpose_of_the_secondary_name-node.3F |
JobTracker | Manages MapReduce jobs and distributes them to the nodes in the cluster. |
DataNode | Runs on a slave node. Acts as HDFS file storage. |
TaskTracker | Runs MapReduce jobs which are received from the JobTracker. |
Master and Slaves
■Master
The node in the cluster
that runs the NameNode and JobTracker.
In this tutorial we will also configure our master node to act as both master and slave (see the file sketch after the note below).
■Slave
A node in the cluster that acts as a DataNode and TaskTracker.
Note: When a node is running a job the TaskTracker will try to use local data (in its “own” DataNode) if possible. Hence the benefit of having both the DataNode and TaskTracker on the same node, since there will be no network traffic overhead. This also implies that it is important to know how data is distributed and stored in HDFS.
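As a concrete (and purely illustrative) example of that layout: Hadoop 1.x learns the cluster topology from two plain text files in conf/. The hostnames below are placeholders for the three Raspberry Pis, not names from the original text, and note that conf/masters, despite its name, only lists where the Secondary NameNode is started:

# conf/masters - despite the name, the host(s) that run the Secondary NameNode
node1

# conf/slaves - hosts that run a DataNode and TaskTracker; the master is listed here too so it also acts as a slave
node1
node2
node3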
Start/stop scripts
Script | Description |
start-dfs.sh | Starts NameNode, Secondary NameNode and DataNode(s) |
stop-dfs.sh | Stops NameNode, Secondary NameNode and DataNode(s) |
start-mapred.sh | Starts JobTracker and TaskTracker(s) |
stop-mapred.sh | Stops JobTracker and TaskTracker(s) |
The above scripts should be executed from the NameNode. Through SSH connections, daemons will be started on all the nodes in the cluster (all nodes defined in conf/slaves).
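Putting it together, a typical session on the master might look like the sketch below. This assumes Hadoop's bin directory is on your PATH; jps is a standard JDK tool that lists running Java processes and is a handy way to verify the daemons.

# Run on the master/NameNode (Hadoop 1.x; assumes hadoop/bin is on PATH)
start-dfs.sh        # starts NameNode, Secondary NameNode and the DataNodes listed in conf/slaves
start-mapred.sh     # starts JobTracker and the TaskTrackers

jps                 # on a master that also acts as a slave you should see:
                    # NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker

stop-mapred.sh      # stop MapReduce first, then HDFS, when shutting the cluster down
stop-dfs.sh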