public class TLDScoringFilter extends Object implements ScoringFilter
| Constructor and Description |
|---|
| TLDScoringFilter() |
| Modifier and Type | Method and Description |
|---|---|
| CrawlDatum | distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) |
| CrawlDatum | distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)<br>Distribute score value from the current page to all its outlinked pages. |
| float | generatorSortValue(Text url, CrawlDatum datum, float initSort)<br>This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation. |
| Configuration | getConf() |
| float | indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)<br>This method calculates an indexed document score/boost. |
| void | initialScore(Text url, CrawlDatum datum)<br>Set an initial score for newly discovered pages. |
| void | injectedScore(Text url, CrawlDatum datum)<br>Set an initial score for newly injected pages. |
| void | passScoreAfterParsing(Text url, Content content, Parse parse)<br>Currently a part of score distribution is performed using only data coming from the parsing process. |
| void | passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)<br>This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. |
| void | setConf(Configuration conf) |
| void | updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)<br>This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages. |
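Several of the methods summarized above take an initial value (initSort, initScore) that may already have been adjusted by earlier filters in the chain. The following is a minimal, self-contained sketch of that chaining idea, not the Nutch implementation: `BoostFilter`, `TldBoost`, `runChain`, and the boost table are hypothetical stand-ins, and this page does not state how TLDScoringFilter actually derives its score. The sketch only assumes, from the class name, that a per-top-level-domain multiplier is involved.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;

public class ScoreChainSketch {

    /** Hypothetical stand-in for one link of a scoring-filter chain. */
    public interface BoostFilter {
        float indexerScore(String url, float initScore);
    }

    /** Hypothetical filter: multiplies the incoming score by a per-TLD boost. */
    public static final class TldBoost implements BoostFilter {
        private final Map<String, Float> boosts;

        public TldBoost(Map<String, Float> boosts) {
            this.boosts = boosts;
        }

        @Override
        public float indexerScore(String url, float initScore) {
            String host = URI.create(url).getHost();
            if (host == null) {
                return initScore; // no host, leave the score untouched
            }
            // Everything after the last dot is treated as the TLD (an assumption).
            String tld = host.substring(host.lastIndexOf('.') + 1);
            return initScore * boosts.getOrDefault(tld, 1.0f);
        }
    }

    /** Runs the filters in order; each one receives the previous result as initScore. */
    public static float runChain(List<BoostFilter> chain, String url, float initScore) {
        float score = initScore;
        for (BoostFilter filter : chain) {
            score = filter.indexerScore(url, score);
        }
        return score;
    }

    public static void main(String[] args) {
        BoostFilter earlier = (url, s) -> s * 2.0f; // some earlier filter in the chain
        BoostFilter tld = new TldBoost(Map.of("org", 1.5f, "biz", 0.5f));
        List<BoostFilter> chain = List.of(earlier, tld);

        System.out.println(runChain(chain, "http://example.org/page", 1.0f));
        System.out.println(runChain(chain, "http://example.com/page", 1.0f));
    }
}
```

Because each filter only sees the value handed to it, the order of the chain matters whenever a filter is not a pure multiplier; a TLD missing from the table falls back to a neutral boost of 1.0 in this sketch.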
Methods inherited from class Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface ScoringFilter:
orphanedScore

public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method calculates an indexed document score/boost.
Specified by:
indexerScore in interface ScoringFilter
Parameters:
url - url of the page
doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
dbDatum - current page from CrawlDb. NOTE: changes made to this instance are not persisted.
fetchDatum - datum from FetcherOutput (containing among others the fetching status)
parse - parsing result. NOTE: changes made to this instance are not persisted.
inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
initScore - initial boost value for the indexed document.
Throws:
ScoringFilterException

public CrawlDatum distributeScoreToOutlink(Text fromUrl, Text toUrl, ParseData parseData, CrawlDatum target, CrawlDatum adjust, int allCount, int validCount) throws ScoringFilterException
Throws:
ScoringFilterException

public float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
Specified by:
generatorSortValue in interface ScoringFilter
Parameters:
url - url of the page
datum - page's datum, should not be modified
initSort - initial sort value, or a value from previous filters in chain
Throws:
ScoringFilterException

public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly discovered pages.
Specified by:
initialScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException

public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly injected pages.
Specified by:
injectedScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException

public void passScoreAfterParsing(Text url, Content content, Parse parse) throws ScoringFilterException
Description copied from interface: ScoringFilter
Currently a part of score distribution is performed using only data coming from the parsing process.
Specified by:
passScoreAfterParsing in interface ScoringFilter
Parameters:
url - page url
content - original content. NOTE: modifications to this value are not persisted.
parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
ScoringFilterException

public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method takes all relevant score information from the current datum (coming from a generated fetchlist) and stores it into Content metadata. This is needed in order to pass the value(s) to the mechanism that distributes them to outlinked pages.
Specified by:
passScoreBeforeParsing in interface ScoringFilter
Parameters:
url - url of the page
datum - source datum. NOTE: modifications to this value are not persisted.
content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
Throws:
ScoringFilterException

public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) throws ScoringFilterException
Description copied from interface: ScoringFilter
This method calculates a new score of CrawlDatum during CrawlDb update, based on the initial value of the original CrawlDatum, and also score values contributed by inlinked pages.
Specified by:
updateDbScore in interface ScoringFilter
Parameters:
url - url of the page
old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values; the datum parameter may contain values that are no longer valid, if other updates occurred between generation and this update.
datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
Throws:
ScoringFilterException

public Configuration getConf()
Specified by:
getConf in interface Configurable

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException
Description copied from interface: ScoringFilter
Distribute score value from the current page to all its outlinked pages.
Specified by:
distributeScoreToOutlinks in interface ScoringFilter
Parameters:
fromUrl - url of the source page
parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
allCount - number of all collected outlinks from the source page
Returns:
a CrawlDatum instance with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
ScoringFilterException

Copyright © 2021 The Apache Software Foundation