Chaining multiple MapReduce jobs in Hadoop

124

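In many real-life situations where you apply MapReduce, the final algorithm ends up being several MapReduce steps.

That is: Map1, Reduce1, Map2, Reduce2, and so on.

So the output of the last reduce is needed as the input for the next map.

The intermediate data is something you (in general) do not want to keep once the pipeline has completed successfully. Also, because this intermediate data is generally some data structure (like a 'map' or a 'set'), you don't want to put too much effort into writing and reading these key-value pairs.

What is the recommended way of doing that in Hadoop?

Is there a (simple) example that shows how to handle this intermediate data in the correct way, including the cleanup afterward?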

hadoop mapreduce

— Niels Basjes

2

Which MapReduce framework are you using?

— skaffman

1

I've edited the question to clarify that I'm talking about Hadoop.

— Niels Basjes

I'd recommend the swineherd gem for this: github.com/Ganglion/swineherd Best, Tobias

— Tobias

57

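I think this tutorial on Yahoo's Developer Network will help you with this: Chaining Jobs

You use JobClient.runJob(). The output path of the data from the first job becomes the input path of your second job. These paths need to be passed in as arguments to your jobs, with appropriate code to parse them and set up the parameters for each job.

I think the above method may be the way the now older mapred API did it, but it should still work. There will be a similar method in the new mapreduce API, but I'm not sure what it is.

As for removing intermediate data after a job has finished, you can do this in your code. The way I've done it before is with something like: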

FileSystem.delete(Path f, boolean recursive);

Here the path is the location of the data on HDFS. Make sure you only delete this data once no other job requires it.
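The pattern described above (the first job's output path is the second job's input path, and the intermediate path is deleted once nothing needs it) can be sketched without a cluster. The following is only a local-filesystem analogy using the plain JDK, not Hadoop code: the class `TwoStagePipeline`, its method names, and the `part-00000` file name are assumptions made for illustration, and on HDFS the recursive delete would be the `FileSystem.delete(path, true)` call mentioned above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Local-filesystem analogy of "job 1 output path = job 2 input path", plus cleanup. */
class TwoStagePipeline {

    /** Stand-in for job 1: counts each word and writes "word<TAB>count" records. */
    static void stage1(List<String> words, Path intermediateDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        List<String> records = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            records.add(e.getKey() + "\t" + e.getValue());
        }
        Files.createDirectories(intermediateDir);
        Files.write(intermediateDir.resolve("part-00000"), records);
    }

    /** Stand-in for job 2: reads job 1's records and sums the counts. */
    static int stage2(Path intermediateDir) throws IOException {
        int total = 0;
        for (String record : Files.readAllLines(intermediateDir.resolve("part-00000"))) {
            total += Integer.parseInt(record.split("\t")[1]);
        }
        return total;
    }

    /** Recursive delete of the intermediate directory (children before parents). */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            walk.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        }
    }

    /** Runs both stages; the intermediate data is removed only after stage 2 succeeded. */
    static int run(List<String> words, Path intermediateDir) {
        try {
            stage1(words, intermediateDir);
            int result = stage2(intermediateDir);
            deleteRecursively(intermediateDir);
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch a failure in the second stage leaves the intermediate directory in place, which can help with debugging; deleting it in a `finally` block instead is an equally reasonable choice.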

— Binary Nerd

3

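Thanks for the link to the Yahoo tutorial. Chaining Jobs is indeed what you want if the two jobs are always in the same run. What I was looking for is the easy way to go if you want to be able to run them separately. In the mentioned tutorial I found SequenceFileOutputFormat ("Writes binary files suitable for reading into subsequent MapReduce jobs") and the matching SequenceFileInputFormat. Thanks.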

— Niels Basjes

20

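There are many ways you can do it.

(1) Cascading jobs

Create the JobConf object "job1" for the first job and set all the parameters, with "input" as the input directory and "temp" as the output directory. Execute this job:

JobClient.runJob(job1);

Immediately below it, create the JobConf object "job2" for the second job and set all the parameters, with "temp" as the input directory and "output" as the output directory. Execute this job:

JobClient.runJob(job2);

(2) Create two JobConf objects and set all the parameters in them just like (1), except that you don't use JobClient.runJob.

Then create two Job objects with the JobConfs as parameters: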

Job job1=new Job(jobconf1); 
Job job2=new Job(jobconf2);

Using the JobControl object, you specify the job dependencies and then run the jobs:

JobControl jbcntrl=new JobControl("jbcntrl");
jbcntrl.addJob(job1);
jbcntrl.addJob(job2);
job2.addDependingJob(job1);
jbcntrl.run();

(3) If you need a structure somewhat like Map+ | Reduce | Map*, you can use the ChainMapper and ChainReducer classes that come with Hadoop version 0.19 and onwards.
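For approach (2), the key idea is that a job starts only after every job it depends on has finished. The sketch below is a toy model of that ordering in plain Java; it is not Hadoop's JobControl implementation, and the class `DependencyRunner` and its methods are hypothetical names chosen to mirror `addJob` and `addDependingJob` from the snippet above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of dependency-ordered execution (illustration only). */
class DependencyRunner {
    // job name -> names of the jobs that must finish first
    private final Map<String, Set<String>> dependsOn = new LinkedHashMap<>();

    /** Registers a job; it has no prerequisites until one is declared. */
    void addJob(String name) {
        dependsOn.putIfAbsent(name, new LinkedHashSet<>());
    }

    /** Declares that 'job' may only start after 'prerequisite' has finished. */
    void addDependency(String job, String prerequisite) {
        addJob(job);
        addJob(prerequisite);
        dependsOn.get(job).add(prerequisite);
    }

    /** Returns the jobs in an order where every prerequisite comes first. */
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new LinkedHashSet<>();
        while (done.size() < dependsOn.size()) {
            boolean progressed = false;
            for (Map.Entry<String, Set<String>> e : dependsOn.entrySet()) {
                // A job is ready once all of its prerequisites are done.
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    done.add(e.getKey());
                    order.add(e.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("cyclic dependency between jobs");
            }
        }
        return order;
    }
}
```

Registering "job2" before "job1" and then declaring that "job2" depends on "job1" still yields the order job1, job2, which is the guarantee the dependency declaration is there to provide.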

— user381928

7

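There are actually a number of ways to do this. I'll focus on two.

One is via Riffle (http://github.com/cwensel/riffle), an annotation library for identifying dependent things and 'executing' them in dependency (topological) order.

Or you can use a Cascade (and MapReduceFlow) in Cascading (http://www.cascading.org/). A future version will support Riffle annotations, but it works great now with raw MR JobConf jobs.

A variant on this is to not manage MR jobs by hand at all, but to develop your application using the Cascading API. The JobConf and job chaining are then handled internally by the Cascading planner and Flow classes.

This way you spend your time focusing on your problem, not on the mechanics of managing Hadoop jobs. You can even layer different languages on top (like clojure or jruby) to further simplify your development and applications. http://www.cascading.org/modules.html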

— Cwensel

6

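I have done job chaining using JobConf objects one after the other. I took the WordCount example for chaining the jobs. The first job figures out how many times each word is repeated in the given input. The second job takes the first job's output as input and figures out the total number of words in the given input. Below is the code that needs to be placed in the Driver class.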

    //First Job - Counts, how many times a word encountered in a given file 
    JobConf job1 = new JobConf(WordCount.class);
    job1.setJobName("WordCount");

    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);

    job1.setMapperClass(WordCountMapper.class);
    job1.setCombinerClass(WordCountReducer.class);
    job1.setReducerClass(WordCountReducer.class);

    job1.setInputFormat(TextInputFormat.class);
    job1.setOutputFormat(TextOutputFormat.class);

    //Ensure that a folder with the "input_data" exists on HDFS and contains the input files
    FileInputFormat.setInputPaths(job1, new Path("input_data"));

    //"first_job_output" contains data that how many times a word occurred in the given file
    //This will be the input to the second job. For second job, input data name should be
    //"first_job_output". 
    FileOutputFormat.setOutputPath(job1, new Path("first_job_output"));

    JobClient.runJob(job1);


    //Second Job - Counts total number of words in a given file

    JobConf job2 = new JobConf(TotalWords.class);
    job2.setJobName("TotalWords");

    job2.setOutputKeyClass(Text.class);
    job2.setOutputValueClass(IntWritable.class);

    job2.setMapperClass(TotalWordsMapper.class);
    job2.setCombinerClass(TotalWordsReducer.class);
    job2.setReducerClass(TotalWordsReducer.class);

    job2.setInputFormat(TextInputFormat.class);
    job2.setOutputFormat(TextOutputFormat.class);

    //Path name for this job should match first job's output path name
    FileInputFormat.setInputPaths(job2, new Path("first_job_output"));

    //This will contain the final output. If you want to send this jobs output
    //as input to third job, then third jobs input path name should be "second_job_output"
    //In this way, jobs can be chained, sending output one to other as input and get the
    //final output
    FileOutputFormat.setOutputPath(job2, new Path("second_job_output"));

    JobClient.runJob(job2);

The command to run these jobs is:

bin/hadoop jar TotalWords.

You need to give the final job's name in the command. In the above case, it is TotalWords.

— psrklr

5

You can run an MR chain in the manner shown in the code.

Note: only the driver code has been provided.

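import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSorting {
      // Here the word keys shall be sorted.
      // MyMapper, MyReducer and MyMapper2 are not shown (driver code only).

      public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
            // THE DRIVER CODE FOR MR CHAIN

            // JOB 1: MAP -> REDUCE, writing to the intermediate path "FirstMapper"
            Configuration conf1 = new Configuration();
            Job j1 = Job.getInstance(conf1);
            j1.setJarByClass(WordCountSorting.class);
            j1.setMapperClass(MyMapper.class);
            j1.setReducerClass(MyReducer.class);

            j1.setMapOutputKeyClass(Text.class);
            j1.setMapOutputValueClass(IntWritable.class);
            j1.setOutputKeyClass(LongWritable.class);
            j1.setOutputValueClass(Text.class);
            Path outputPath = new Path("FirstMapper");
            FileInputFormat.addInputPath(j1, new Path(args[0]));
            FileOutputFormat.setOutputPath(j1, outputPath);
            // Remove stale output first; a job fails if its output path already exists
            outputPath.getFileSystem(conf1).delete(outputPath, true);
            // Stop here if job 1 fails, so job 2 never reads incomplete data
            if (!j1.waitForCompletion(true)) {
                  System.exit(1);
            }

            // JOB 2: map-only, reading job 1's output as its input
            Configuration conf2 = new Configuration();
            Job j2 = Job.getInstance(conf2);
            j2.setJarByClass(WordCountSorting.class);
            j2.setMapperClass(MyMapper2.class);
            j2.setNumReduceTasks(0);
            j2.setOutputKeyClass(Text.class);
            j2.setOutputValueClass(IntWritable.class);
            Path outputPath1 = new Path(args[1]);
            FileInputFormat.addInputPath(j2, outputPath);
            FileOutputFormat.setOutputPath(j2, outputPath1);
            outputPath1.getFileSystem(conf2).delete(outputPath1, true);
            System.exit(j2.waitForCompletion(true) ? 0 : 1);
      }

}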

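The sequence is

( JOB1 ) MAP -> REDUCE -> ( JOB2 ) MAP

This was done to get the keys sorted. There are other ways to do that, such as using a TreeMap, but I want to focus your attention on the way the jobs have been chained!
Thanks

— Aniruddha Sinha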

4

You can use Oozie for batch processing your MapReduce jobs. http://issues.apache.org/jira/browse/HADOOP-5303

— user300313

3

There are examples in the Apache Mahout project that chain together multiple MapReduce jobs. One of the examples can be found at:

RecommenderJob.java

http://search-lucene.com/c/Mahout:/core/src/main/java/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.java%7C%7CRecommenderJob

— Christie English

3

You can use the waitForCompletion(true) method of Job to define the dependency between jobs.

In my scenario I had 3 jobs that depended on each other. In the driver class I used the code below, and it works as expected.

public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub

        CCJobExecution ccJobExecution = new CCJobExecution();

        Job distanceTimeFraudJob = ccJobExecution.configureDistanceTimeFraud(new Configuration(),args[0], args[1]);
        Job spendingFraudJob = ccJobExecution.configureSpendingFraud(new Configuration(),args[0], args[1]);
        Job locationFraudJob = ccJobExecution.configureLocationFraud(new Configuration(),args[0], args[1]);

        System.out.println("****************Started Executing distanceTimeFraudJob ================");
        distanceTimeFraudJob.submit();
        if(distanceTimeFraudJob.waitForCompletion(true))
        {
            System.out.println("=================Completed DistanceTimeFraudJob================= ");
            System.out.println("=================Started Executing spendingFraudJob ================");
            spendingFraudJob.submit();
            if(spendingFraudJob.waitForCompletion(true))
            {
                System.out.println("=================Completed spendingFraudJob================= ");
                System.out.println("=================Started locationFraudJob================= ");
                locationFraudJob.submit();
                if(locationFraudJob.waitForCompletion(true))
                {
                    System.out.println("=================Completed locationFraudJob=================");
                }
            }
        }
    }
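The nested `if` blocks above grow one level deeper with every additional job. A flatter variant of the same idea (run the steps in order and stop at the first failure) might look like the sketch below. It is plain Java with a hypothetical `SequentialChain` helper; each `BooleanSupplier` stands in for something like a call to `waitForCompletion(true)`, which in real driver code also throws checked exceptions and would need to be wrapped.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Runs named steps in order and stops at the first one that reports failure. */
class SequentialChain {

    /**
     * @param steps step name -> action returning true on success; iteration order
     *              is the execution order, so pass an ordered map such as LinkedHashMap
     * @return the names of the steps that completed successfully, in order
     */
    static List<String> runUntilFailure(Map<String, BooleanSupplier> steps) {
        List<String> completed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            if (!step.getValue().getAsBoolean()) {
                break; // later steps depend on this one, so do not start them
            }
            completed.add(step.getKey());
        }
        return completed;
    }
}
```

The returned list makes it easy to report which job the chain stopped after, and to decide which intermediate outputs are safe to clean up.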

— Shivaprasad

Your answer is about how to join these jobs in terms of execution. The original question was about the best data structures. So your answer is not relevant to this specific question.

— Niels Basjes, 2013

2

The new class org.apache.hadoop.mapreduce.lib.chain.ChainMapper helps with this scenario.

— 자비

1

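Your answer is good, but you should add some more detail about what it does, or at least a link to the API reference, so that people can upvote it.

— Jeremy Hajek

ChainMapper and ChainReducer are used to have 1 or more mappers before the Reduce and 0 or more mappers after the Reduce, i.e. the spec is (Mapper+) Reduce (Mapper*). Correct me if I am obviously wrong, but I don't think this approach chains jobs serially as the OP asked.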
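To make the (Mapper+) Reduce (Mapper*) shape concrete, here is an in-memory sketch in plain Java. It uses no Hadoop classes: `ChainSketch` and the three mapper functions are hypothetical names, and the point is only that records flow from one mapper to the next inside a single job, with one grouping step in the middle, rather than through separate jobs with data written between them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** In-memory illustration of the (Mapper+) Reduce (Mapper*) shape. */
class ChainSketch {

    static List<String> run(List<String> lines) {
        // Mapper 1: split a line into tokens.
        Function<String, List<String>> splitMapper =
                line -> Arrays.asList(line.trim().split("\\s+"));
        // Mapper 2: normalise each token; it consumes mapper 1's output directly.
        Function<String, String> lowerCaseMapper = String::toLowerCase;
        // Mapper after the reduce: format each reduced (word, count) pair.
        Function<Map.Entry<String, Integer>, String> formatMapper =
                e -> e.getKey() + "=" + e.getValue();

        // Reduce: group by word and sum the counts (the single "shuffle" of the job).
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : splitMapper.apply(line)) {
                if (token.isEmpty()) {
                    continue; // blank lines produce one empty token
                }
                counts.merge(lowerCaseMapper.apply(token), 1, Integer::sum);
            }
        }

        List<String> output = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            output.add(formatMapper.apply(entry));
        }
        return output;
    }
}
```

Because there is only one reduce in this shape, a pipeline that needs Reduce1 followed by Reduce2 still requires two jobs, which matches the concern raised in the comment above.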

— oczkoisse 2017-04-12

1

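Although there are complex server-based Hadoop workflow engines such as Oozie, I have a simple Java library that enables execution of multiple Hadoop jobs as a workflow. The job configuration and the workflow that defines the inter-job dependencies are configured in a JSON file. Everything is externally configurable, and no change to existing MapReduce implementations is required for them to be part of a workflow.

Details can be found here. Source code and jar are available on GitHub.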

http://pkghosh.wordpress.com/2011/05/22/hadoop-orchestration/

Pranab

— Pranab

1

I think Oozie helps the subsequent jobs receive their input directly from the previous job. This avoids the I/O operation performed with JobControl.

— 철저히

1

If you want to chain your jobs programmatically, you will want to use JobControl. The usage is quite simple:

JobControl jobControl = new JobControl(name);

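After that you add ControlledJob instances. A ControlledJob defines a job together with the jobs it depends on, so the jobs of a "chain" run in the right order; the input and output paths still have to be set on each job so that they line up.

    jobControl.addJob(new ControlledJob(job, Arrays.asList(controlledjob1, controlledjob2)));

    jobControl.run();

starts the chain. You will want to put that in a separate thread. This allows you to check the status of your chain while it runs: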

    while (!jobControl.allFinished()) {
        System.out.println("Jobs in waiting state: " + jobControl.getWaitingJobList().size());
        System.out.println("Jobs in ready state: " + jobControl.getReadyJobsList().size());
        System.out.println("Jobs in running state: " + jobControl.getRunningJobList().size());
        List<ControlledJob> successfulJobList = jobControl.getSuccessfulJobList();
        System.out.println("Jobs in success state: " + successfulJobList.size());
        List<ControlledJob> failedJobList = jobControl.getFailedJobList();
        System.out.println("Jobs in failed state: " + failedJobList.size());
    }

— Erik Schmiegelow

0

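As you mentioned in your requirement that you want the output of MRJob1 to be the input of MRJob2 and so on, you can consider using an Oozie workflow for this use case. You might also consider writing your intermediate data to HDFS, since it will be used by the next MRJob. After the job completes you can clean up your intermediate data.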

<start to="mr-action1"/>
<action name="mr-action1">
   <!-- action for MRJob1-->
   <!-- set output path = /tmp/intermediate/mr1-->
    <ok to="mr-action2"/>
    <error to="fail"/>
</action>

<action name="mr-action2">
   <!-- action for MRJob2-->
   <!-- set input path = /tmp/intermediate/mr1-->
    <ok to="success"/>
    <error to="fail"/>
</action>

<action name="success">
        <!-- action for success-->
    <ok to="end"/>
    <error to="end"/>
</action>

<action name="fail">
        <!-- action for fail-->
    <ok to="end"/>
    <error to="end"/>
</action>

<end name="end"/>

— Neha Kumari