public class ExemptionUrlFilter extends RegexURLFilter implements URLExemptionFilter
This URLExemptionFilter implementation uses a regex configuration
to check whether a URL is eligible for exemption from 'db.ignore.external'.
When this filter is enabled, external URLs are checked against the configured sequence of regex rules.
The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath, but can be
overridden using the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml
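
The lookup and check described above can be sketched in plain Java. This is a minimal illustration, not the Nutch implementation: the class `ExemptionLookupSketch`, its method names, the use of `java.util.Properties` in place of the real configuration object, and the "exempt if any rule matches" semantics are all assumptions made for the example. Only the property name and the default file name come from the description.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

/** Hypothetical stand-in for illustration; not part of Nutch. */
final class ExemptionLookupSketch {
  // Property name and default file name as given in the class description.
  static final String FILE_PROPERTY = "db.ignore.external.exemptions.file";
  static final String DEFAULT_FILE = "db-ignore-external-exemptions.txt";

  /** Returns the configured rules file name, or the default when the property is unset. */
  static String rulesFileName(Properties conf) {
    return conf.getProperty(FILE_PROPERTY, DEFAULT_FILE);
  }

  /** Assumption: a URL counts as exempt as soon as any rule matches part of it. */
  static boolean isExempted(List<Pattern> rules, String toUrl) {
    for (Pattern rule : rules) {
      if (rule.matcher(toUrl).find()) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    System.out.println(rulesFileName(conf)); // falls back to the default name

    conf.setProperty(FILE_PROPERTY, "my-exemptions.txt"); // hypothetical override
    System.out.println(rulesFileName(conf));

    // Hypothetical rule: exempt image files on any host.
    List<Pattern> rules =
        Arrays.asList(Pattern.compile("^https?://[^/]+/.*\\.(jpg|png)$"));
    System.out.println(isExempted(rules, "http://example.org/logo.png"));
    System.out.println(isExempted(rules, "http://example.org/index.html"));
  }
}
```

The sketch deliberately ignores any rule ordering or accept/reject markers that the real rule file syntax may carry; consult the rule file shipped with Nutch for the actual format.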
See Also: URLExemptionFilter, RegexURLFilter

| Modifier and Type | Field and Description |
|---|---|
| static String | DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE |

Inherited fields: URLFILTER_REGEX_FILE, URLFILTER_REGEX_RULES, hasHostDomainRules, X_POINT_ID, X_POINT_ID

| Constructor and Description |
|---|
| ExemptionUrlFilter() |

| Modifier and Type | Method and Description |
|---|---|
| boolean | filter(String fromUrl, String toUrl): Checks if toUrl is exempted when the ignore external is enabled |
| List<Pattern> | getExemptions() |
| protected Reader | getRulesReader(Configuration conf): Gets reader for regex rules |
| static void | main(String[] args) |


Inherited methods:
createRule, createRule
filter, getConf, main, setConf
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
getConf, setConf

public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE

public boolean filter(String fromUrl, String toUrl)
Description copied from interface: URLExemptionFilter
Specified by: filter in interface URLExemptionFilter
Parameters:
fromUrl - the source URL which generated the outlink
toUrl - the destination URL which needs to be checked for exemption

protected Reader getRulesReader(Configuration conf) throws IOException
Overrides: getRulesReader in class RegexURLFilter
Parameters:
conf - the current configuration
Throws: IOException

public static void main(String[] args)
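
Since getRulesReader(Configuration conf) supplies a Reader for the regex rules and getExemptions() returns a List<Pattern>, the step between the two can be pictured as below. This is only a sketch under stated assumptions: the class `RulesReaderSketch` and the method `readRules` are hypothetical, and the line format (one regex per line, blank lines and lines starting with `#` skipped) is assumed for illustration rather than taken from Nutch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch. */
final class RulesReaderSketch {

  /**
   * Compiles one pattern per non-empty, non-comment line.
   * Assumption: '#' starts a comment line; the real rule syntax may differ.
   */
  static List<Pattern> readRules(Reader reader) {
    List<Pattern> rules = new ArrayList<>();
    try (BufferedReader in = new BufferedReader(reader)) {
      String line;
      while ((line = in.readLine()) != null) {
        String trimmed = line.trim();
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
          continue; // skip blanks and comments
        }
        rules.add(Pattern.compile(trimmed));
      }
    } catch (IOException e) {
      // Rethrown unchecked to keep the sketch's signature simple.
      throw new UncheckedIOException(e);
    }
    return rules;
  }

  public static void main(String[] args) {
    // Hypothetical rule text standing in for the contents of the rules file.
    String text = "# exempt images\n\n\\.png$\n\\.jpg$\n";
    List<Pattern> rules = readRules(new StringReader(text));
    System.out.println(rules.size() + " rules loaded");
  }
}
```

Passing a Reader rather than a file name keeps the parsing step independent of where the rules come from, which mirrors the way the rules source is isolated behind getRulesReader.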
Copyright © 2021 The Apache Software Foundation