public class RegexURLNormalizer extends Configured implements URLNormalizer
This class reads the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
This class also supports different rules depending on the scope. Please see
the javadoc in URLNormalizers for more details.
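As a rough illustration of how a file of patterns and substitutions might be read, here is a minimal sketch using only the JDK's DOM parser. It is not the Nutch implementation: the class name RuleFileSketch, the Rule holder, and the SAMPLE text are hypothetical, and the element names (regex-normalize, regex, pattern, substitution) are an assumption about the file layout that should be checked against the rule file shipped with your Nutch version.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Illustrative sketch only; class and member names are hypothetical.
class RuleFileSketch {

    // Hypothetical holder for one compiled pattern and its substitution.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    // Assumed file layout: one <regex> element per rule.
    static final String SAMPLE =
        "<regex-normalize>"
        + "<regex><pattern>#.*$</pattern><substitution></substitution></regex>"
        + "<regex><pattern>;jsessionid=[^?#]*</pattern><substitution></substitution></regex>"
        + "</regex-normalize>";

    // Parses the XML text and compiles every pattern it contains.
    // Parser and I/O failures are rethrown unchecked to keep the sketch short.
    static List<Rule> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList regexNodes = doc.getElementsByTagName("regex");
            List<Rule> rules = new ArrayList<>();
            for (int i = 0; i < regexNodes.getLength(); i++) {
                Element regex = (Element) regexNodes.item(i);
                String pattern =
                    regex.getElementsByTagName("pattern").item(0).getTextContent();
                String substitution =
                    regex.getElementsByTagName("substitution").item(0).getTextContent();
                rules.add(new Rule(Pattern.compile(pattern), substitution));
            }
            return rules;
        } catch (Exception e) {
            throw new IllegalStateException("Could not read rules", e);
        }
    }

    public static void main(String[] args) {
        for (Rule rule : parse(SAMPLE)) {
            System.out.println(rule.pattern.pattern() + " -> '" + rule.substitution + "'");
        }
    }
}
```

Compiling the patterns once at load time means a malformed expression surfaces as a PatternSyntaxException while the file is read rather than later, while URLs are being normalized.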
| Constructor and Description |
|---|
| RegexURLNormalizer(): The default constructor, which is called from UrlNormalizerFactory (normalizerClass.newInstance()) in the method getNormalizer(). |
| RegexURLNormalizer(Configuration conf) |
| RegexURLNormalizer(Configuration conf, String filename): Constructor which can be passed the file name, so it doesn't look in the configuration files for it. |
| Modifier and Type | Method and Description |
|---|---|
| HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> | getScopedRules() |
| static void | main(String[] args): Spits out patterns and substitutions that are in the configuration file. |
| String | normalize(String urlString, String scope) |
| String | regexNormalize(String urlString, String scope): This function does the replacements by iterating through all the regex patterns. |
| void | setConf(Configuration conf) |
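The replacement loop described for regexNormalize can be sketched with plain java.util.regex. This is a minimal sketch, not the Nutch source: ScopedRegexSketch, addRule, and DEFAULT_SCOPE are hypothetical names, and the fallback to a "default" rule list when a scope has no rules of its own is an assumption made for illustration (the URLNormalizers javadoc describes the actual scope handling).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative sketch only; class and member names are hypothetical.
class ScopedRegexSketch {

    // Assumed name of the scope used when no scope-specific rules exist.
    static final String DEFAULT_SCOPE = "default";

    // Hypothetical holder for one compiled pattern and its substitution.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(Pattern pattern, String substitution) {
            this.pattern = pattern;
            this.substitution = substitution;
        }
    }

    private final Map<String, List<Rule>> scopedRules = new HashMap<>();

    // Registers a rule under a scope, keeping insertion order within the scope.
    void addRule(String scope, String regex, String substitution) {
        scopedRules.computeIfAbsent(scope, k -> new ArrayList<>())
            .add(new Rule(Pattern.compile(regex), substitution));
    }

    // Applies every rule of the chosen scope in order; each rule sees the
    // output of the previous one.
    String regexNormalize(String urlString, String scope) {
        List<Rule> rules = scopedRules.get(scope);
        if (rules == null) {
            // Assumption: unknown scopes fall back to the default rules.
            rules = scopedRules.getOrDefault(DEFAULT_SCOPE, Collections.emptyList());
        }
        String result = urlString;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        ScopedRegexSketch sketch = new ScopedRegexSketch();
        sketch.addRule(DEFAULT_SCOPE, "#.*$", "");           // drop the fragment
        sketch.addRule(DEFAULT_SCOPE, "(?<!:)/{2,}", "/");   // collapse repeated slashes
        sketch.addRule("inject", ";jsessionid=[^?#]*", "");  // strip session ids
        // Prints http://example.com/a/b
        System.out.println(sketch.regexNormalize("http://example.com//a///b#top", "fetcher"));
    }
}
```

Because the rules run sequentially, their order in the file matters: a later pattern matches against the already rewritten URL, not the original input.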
Methods inherited from class Configured: getConf
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf

public RegexURLNormalizer()

public RegexURLNormalizer(Configuration conf)

public RegexURLNormalizer(Configuration conf, String filename) throws IOException, PatternSyntaxException
Throws: IOException, PatternSyntaxException

public HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()

public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured

public String regexNormalize(String urlString, String scope)

public String normalize(String urlString, String scope) throws MalformedURLException
Specified by: normalize in interface URLNormalizer
Throws: MalformedURLException

public static void main(String[] args) throws PatternSyntaxException, IOException
Throws: PatternSyntaxException, IOException

Copyright © 2021 The Apache Software Foundation