public class ReplaceIndexer extends Object implements IndexingFilter
To use this plugin, add index-replace to your
plugin.includes. Example:
<property>
<name>plugin.includes</name>
<value>protocol-(http)|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|replace)|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
</property>
Then add the index.replace.regexp property to
conf/nutch-site.xml. Its value is a list of replacement
instructions, one per line, each starting with the name of the field it
applies to:
fieldname=/regexp/replacement/[flags]
For example:
<property>
<name>index.replace.regexp</name>
<value>
hostmatch=.*\.com
title=/search/replace/2
</value>
</property>
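To make the instruction format concrete, here is a minimal Java sketch of how a single line such as title=/search/replace/2 could be read and applied. It is not the plugin's own code: the class ReplaceRuleSketch and its parse and apply methods are hypothetical names. The sketch assumes that '/' is the delimiter and does not occur inside the regexp or the replacement, and that the trailing flags value is an integer bit mask passed to java.util.regex.Pattern.compile (so 2 would equal Pattern.CASE_INSENSITIVE). Check conf/nutch-default.xml for the authoritative rules.

```java
import java.util.regex.Pattern;

public class ReplaceRuleSketch {

    /** One parsed instruction: target field, compiled pattern, replacement text. */
    static final class Rule {
        final String field;
        final Pattern pattern;
        final String replacement;

        Rule(String field, Pattern pattern, String replacement) {
            this.field = field;
            this.pattern = pattern;
            this.replacement = replacement;
        }

        /** Replaces every match of the pattern in the given field value. */
        String apply(String value) {
            return pattern.matcher(value).replaceAll(replacement);
        }
    }

    /**
     * Parses "fieldname=/regexp/replacement/[flags]".
     * Assumptions (not taken from the plugin): '/' is the delimiter and does
     * not occur inside the regexp or the replacement; flags, if present, is
     * an integer bit mask for Pattern.compile.
     */
    static Rule parse(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("Missing '=' in: " + line);
        }
        String field = line.substring(0, eq).trim();
        // "/search/replace/2" splits into ["", "search", "replace", "2"].
        String[] parts = line.substring(eq + 1).trim().split("/", -1);
        if (parts.length != 4 || !parts[0].isEmpty()) {
            throw new IllegalArgumentException(
                "Expected /regexp/replacement/[flags] in: " + line);
        }
        int flags = parts[3].isEmpty() ? 0 : Integer.parseInt(parts[3]);
        return new Rule(field, Pattern.compile(parts[1], flags), parts[2]);
    }

    public static void main(String[] args) {
        Rule rule = parse("title=/search/replace/2");
        // Flag 2 is Pattern.CASE_INSENSITIVE, so "Search" and "SEARCH" both match.
        System.out.println(rule.apply("Search engines SEARCH the web"));
        // prints "replace engines replace the web"
    }
}
```

Under these assumptions, omitting the flags (title=/search/replace/) compiles the pattern with no flags, so only the lowercase "search" would be replaced.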
A hostmatch= or urlmatch= line sets the match pattern for a host or URL.
The field replacements that follow such a line apply only to pages from a
matching host or URL. Replacements run in the order specified. A field name
may appear multiple times if multiple replacements are needed.
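The scoping and ordering rules described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the plugin's implementation: ScopedReplaceSketch, Scope, and Rule are hypothetical names, only hostmatch= is handled (urlmatch= and flags are left out for brevity), the host pattern is assumed to need a full match against the host name, and instructions that appear before any hostmatch= line are assumed to apply to every host.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class ScopedReplaceSketch {

    /** One field replacement (flags are ignored in this sketch). */
    static final class Rule {
        final String field;
        final Pattern pattern;
        final String replacement;

        Rule(String field, Pattern pattern, String replacement) {
            this.field = field;
            this.pattern = pattern;
            this.replacement = replacement;
        }
    }

    /** A group of rules that share one host pattern. */
    static final class Scope {
        final Pattern hostPattern; // null means "all hosts" (assumption)
        final List<Rule> rules = new ArrayList<>();

        Scope(Pattern hostPattern) {
            this.hostPattern = hostPattern;
        }
    }

    /** Reads the property value line by line, keeping the original order. */
    static List<Scope> parse(String config) {
        List<Scope> scopes = new ArrayList<>();
        Scope current = new Scope(null);
        scopes.add(current);
        for (String raw : config.split("\n")) {
            String line = raw.trim();
            int eq = line.indexOf('=');
            if (eq < 0) {
                continue; // blank or unrecognized line
            }
            String key = line.substring(0, eq);
            String value = line.substring(eq + 1);
            if (key.equals("hostmatch")) {
                // A hostmatch line opens a new scope for the rules that follow.
                current = new Scope(Pattern.compile(value));
                scopes.add(current);
            } else {
                // "/foo/bar/" splits into ["", "foo", "bar", ""].
                String[] p = value.split("/", -1);
                if (p.length != 4) {
                    throw new IllegalArgumentException("Bad instruction: " + line);
                }
                current.rules.add(new Rule(key, Pattern.compile(p[1]), p[2]));
            }
        }
        return scopes;
    }

    /** Applies all rules whose scope matches the host, in configuration order. */
    static Map<String, String> apply(List<Scope> scopes, String host,
                                     Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>(fields);
        for (Scope scope : scopes) {
            if (scope.hostPattern != null
                    && !scope.hostPattern.matcher(host).matches()) {
                continue; // this scope does not cover the page's host
            }
            for (Rule rule : scope.rules) {
                String value = out.get(rule.field);
                if (value != null) {
                    out.put(rule.field,
                        rule.pattern.matcher(value).replaceAll(rule.replacement));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String config = "hostmatch=.*\\.com\n"
            + "title=/foo/bar/\n"
            + "title=/bar/baz/\n";
        List<Scope> scopes = parse(config);
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "foo page");
        // Both title rules run in order on a .com host: foo -> bar -> baz.
        System.out.println(apply(scopes, "www.example.com", fields).get("title")); // baz page
        // The host does not match, so the title is left unchanged.
        System.out.println(apply(scopes, "example.org", fields).get("title")); // foo page
    }
}
```

Because the second title rule sees the output of the first, swapping the two lines would yield "bar page" instead, which is why the order of instructions matters.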
The property format is defined in greater detail in
conf/nutch-default.xml.

| Constructor and Description |
|---|
| ReplaceIndexer() |

| Modifier and Type | Method and Description |
|---|---|
| NutchDocument | filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) Adds fields or otherwise modifies the document that will be indexed for a parse. |
| Configuration | getConf() |
| void | setConf(Configuration conf) |

public void setConf(Configuration conf)
setConf in interface Configurable

public Configuration getConf()
getConf in interface Configurable

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Throws:
IndexingException

Copyright © 2021 The Apache Software Foundation