public class ArcSegmentCreator extends Configured implements Tool
ArcSegmentCreator is a replacement for the fetcher: it takes arc files as input and produces a Nutch segment as output.
Arc files are tars of compressed gzips, produced by both the Internet Archive project and the Grub distributed crawler project.
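To make the "compressed gzips" container idea concrete, here is a minimal sketch using only the JDK. It assumes each record is compressed as its own gzip member and that the members are concatenated back to back; that layout is an assumption for illustration, not something this page specifies, and the code is not how ArcSegmentCreator itself reads arc files. The class and method names (`GzipMembersSketch`, `writeMembers`, `readAll`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Nutch: builds and reads a buffer made of
// several gzip members written one after another.
class GzipMembersSketch {

    /** Compresses each record as a separate gzip member and concatenates the members. */
    static byte[] writeMembers(String... records) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            for (String record : records) {
                GZIPOutputStream gz = new GZIPOutputStream(out);
                gz.write(record.getBytes(StandardCharsets.UTF_8));
                gz.finish(); // ends this member without closing the shared buffer
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses all concatenated members into one string. */
    static String readAll(byte[] data) {
        ByteArrayOutputStream plain = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                plain.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(plain.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] container = writeMembers(
                "first archived record payload\n",
                "second archived record payload\n");
        System.out.print(readAll(container));
    }
}
```

Reading the whole buffer as one stream loses the record boundaries; a reader that needs one record at a time would have to track where each member starts.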
| Modifier and Type | Class and Description |
|---|---|
| static class | ArcSegmentCreator.ArcSegmentCreatorMapper |
| Modifier and Type | Field and Description |
|---|---|
| static String | URL_VERSION |
| Constructor and Description |
|---|
| ArcSegmentCreator() |
| ArcSegmentCreator(Configuration conf): Constructor that sets the job configuration. |
| Modifier and Type | Method and Description |
|---|---|
| void | close() |
| void | createSegments(Path arcFiles, Path segmentsOutDir): Creates the arc files to segments job. |
| static String | generateSegmentName(): Generates a random name for the segments. |
| static void | main(String[] args) |
| int | run(String[] args) |
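Methods inherited from class Configured: getConf, setConf
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf, setConf
public static final String URL_VERSION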
public ArcSegmentCreator()
public ArcSegmentCreator(Configuration conf)
Constructor that sets the job configuration.
Parameters:
conf - the job configuration
public static String generateSegmentName()
Generates a random name for the segments.
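This page does not say how the segment name is derived. Nutch segment directories are conventionally named with a timestamp such as 20210131123045, so the sketch below assumes that convention; it is an illustration, not the ArcSegmentCreator source. The names `SegmentNames` and `segmentNameFor` are hypothetical, and the timestamp is passed in so the result is reproducible.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Nutch: formats a timestamp as a
// segment-style directory name (assumed pattern: yyyyMMddHHmmss).
class SegmentNames {

    private static final DateTimeFormatter SEGMENT_FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    /** Returns a segment-style name for the given point in time. */
    static String segmentNameFor(LocalDateTime time) {
        return SEGMENT_FORMAT.format(time);
    }

    public static void main(String[] args) {
        // Name for the current moment; two calls within the same second collide,
        // so a real generator would need some way to keep names unique.
        System.out.println(segmentNameFor(LocalDateTime.now()));
    }
}
```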
public void close()
public void createSegments(Path arcFiles, Path segmentsOutDir) throws IOException, InterruptedException, ClassNotFoundException
Creates the arc files to segments job.
Parameters:
arcFiles - The path to the directory holding the arc files
segmentsOutDir - The output directory for writing the segments
Throws:
IOException - If an IO error occurs while running the job.
InterruptedException
ClassNotFoundException
Copyright © 2021 The Apache Software Foundation