| Interface | Description |
|---|---|
| CommonCrawlFormat | Interface for all CommonCrawl formatters. |

| Class | Description |
|---|---|
| AbstractCommonCrawlFormat | Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface. |
| Benchmark | |
| Benchmark.BenchmarkResults | |
| CommonCrawlConfig | |
| CommonCrawlDataDumper | The Common Crawl Data Dumper tool reverse-generates the raw content from Nutch segment data directories into a common crawling data format consumed by many applications. |
| CommonCrawlFormatFactory | Factory class that creates new CommonCrawlFormat objects. |
| CommonCrawlFormatJackson | This class provides methods to map crawled data to JSON using the Jackson Streaming API. |
| CommonCrawlFormatJettinson | This class provides methods to map crawled data to JSON using the Jettison API. |
| CommonCrawlFormatSimple | This class provides methods to map crawled data to JSON using a StringBuilder object. |
| CommonCrawlFormatWARC | |
| DmozParser | Utility that converts DMOZ RDF into a flat file of URLs to be injected. |
| FileDumper | The file dumper tool reverse-generates the raw content from Nutch segment data directories. |
| FreeGenerator | This tool generates fetchlists (segments to be fetched) from plain text files containing one URL per line. |
| FreeGenerator.FG | |
| FreeGenerator.FG.FGMapper | |
| FreeGenerator.FG.FGReducer | |
| ResolveUrls | A simple tool that spins up multiple threads to resolve URLs to IP addresses. |
| ShowProperties | Tool to list the properties and their values set by the current Nutch configuration. |
| WARCUtils | |

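
The ResolveUrls entry in the table above describes resolving URLs to IP addresses on multiple threads. The following is a minimal sketch of that general technique using a fixed-size thread pool from the JDK. It is not the Nutch implementation: the class name `ResolveSketch`, the methods `resolve` and `resolveAll`, and the `"unresolved"` placeholder are all hypothetical names chosen for illustration.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not part of org.apache.nutch.tools.
public class ResolveSketch {

  /** Resolves the host of one URL; returns "unresolved" if parsing or lookup fails. */
  public static String resolve(String url) {
    try {
      String host = URI.create(url).getHost();
      if (host == null) {
        return "unresolved";
      }
      return InetAddress.getByName(host).getHostAddress();
    } catch (UnknownHostException | IllegalArgumentException e) {
      // URI.create throws IllegalArgumentException for malformed input.
      return "unresolved";
    }
  }

  /** Resolves all URLs on a fixed-size thread pool, preserving input order. */
  public static List<String> resolveAll(List<String> urls, int numThreads) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    try {
      // Submit one lookup task per URL.
      List<Future<String>> futures = new ArrayList<>();
      for (String url : urls) {
        futures.add(pool.submit(() -> resolve(url)));
      }
      // Collect results in the same order the URLs were given.
      List<String> addresses = new ArrayList<>();
      for (Future<String> future : futures) {
        addresses.add(future.get());
      }
      return addresses;
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) throws Exception {
    List<String> urls = List.of("http://127.0.0.1/index.html", "not a url");
    List<String> addresses = resolveAll(urls, 2);
    for (int i = 0; i < urls.size(); i++) {
      System.out.println(urls.get(i) + " -> " + addresses.get(i));
    }
  }
}
```

Collecting the `Future` results in submission order keeps each address aligned with its input URL, at the cost of waiting on slow lookups before reading later ones.
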
Copyright © 2021 The Apache Software Foundation