How to convert a .txt file to Hadoop's SequenceFile format

Z时代
2024-01-10
Category: Q&A

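To make effective use of map-reduce jobs in Hadoop, I need my data stored in Hadoop's SequenceFile format. However, the data is currently only in flat .txt files. Can anyone suggest a way to convert a .txt file into a sequence file?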

Answer:

The simplest answer is to run an "identity" job, one whose mapper passes every record through unchanged, with SequenceFile output.

In Java it looks like this:

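    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

    public class ConvertText {  // class name is arbitrary
        public static void main(String[] args) throws IOException,
                InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            // Job.getInstance replaces the deprecated new Job(conf) constructor
            Job job = Job.getInstance(conf, "Convert Text");
            job.setJarByClass(Mapper.class);
            // the base Mapper and Reducer classes are identity implementations
            job.setMapperClass(Mapper.class);
            job.setReducerClass(Reducer.class);
            // increase if you need sorting or a special number of files
            job.setNumReduceTasks(0);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);
            job.setOutputFormatClass(SequenceFileOutputFormat.class);
            job.setInputFormatClass(TextInputFormat.class);
            TextInputFormat.addInputPath(job, new Path("/lol"));
            SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));
            // submit, wait for completion, and report failure through the exit code
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }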
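
It may help to see which records this job writes. TextInputFormat hands the mapper one record per line, with the byte offset of the line start as the key and the line's text as the value, which is why the output classes above are LongWritable and Text. The plain-Java sketch below imitates that pairing without any Hadoop dependency. The names OffsetRecords, OffsetLine and toRecords are hypothetical and exist only for illustration, and the sketch assumes UTF-8 input with \n or \r\n line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class OffsetRecords {

    /** One record as the identity job would see it: byte offset of the line, then the line text. */
    public static final class OffsetLine {
        public final long offset;
        public final String line;

        public OffsetLine(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits raw file bytes into (byte offset, line) pairs; the line terminator is not part of the value. */
    public static List<OffsetLine> toRecords(byte[] data) {
        List<OffsetLine> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                int end = i;
                // drop the \r of a \r\n pair
                if (end > start && data[end - 1] == '\r') {
                    end--;
                }
                out.add(new OffsetLine(start,
                        new String(data, start, end - start, StandardCharsets.UTF_8)));
                start = i + 1;
            }
        }
        // last line without a trailing newline
        if (start < data.length) {
            out.add(new OffsetLine(start,
                    new String(data, start, data.length - start, StandardCharsets.UTF_8)));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] sample = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        for (OffsetLine r : toRecords(sample)) {
            // prints 0, 6 and 11 next to alpha, beta and gamma
            System.out.println(r.offset + "\t" + r.line);
        }
    }
}
```

If the offsets are not useful as keys, a custom mapper that emits a different key would be needed. The identity job keeps them as they are.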