public class HttpRobotRulesParser extends RobotRulesParser
This class extends the generic RobotRulesParser class and contains the HTTP protocol specific implementation for obtaining the robots.txt file.

Field Summary

| Modifier and Type | Field and Description |
|---|---|
| protected boolean | allowForbidden |

Fields inherited from class RobotRulesParser: agentNames, CACHE, conf, EMPTY_RULES, FORBID_ALL_RULES, whiteList

Constructor Summary

| Constructor and Description |
|---|
| HttpRobotRulesParser(Configuration conf) |

Method Summary

| Modifier and Type | Method and Description |
|---|---|
| protected void | addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse): Append Content of robots.txt to robotsTxtContent |
| protected static String | getCacheKey(URL url): Compose unique key to store and access robot rules in cache for given URL |
| crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent): Get the rules from robots.txt which apply for the given url |
| void | setConf(Configuration conf): Set the Configuration object |
Methods inherited from class RobotRulesParser: getConf, getRobotRulesSet, isWhiteListed, main, parseRules, run

Constructor Detail

public HttpRobotRulesParser(Configuration conf)
Method Detail

public void setConf(Configuration conf)

Set the Configuration object.

Specified by: setConf in interface Configurable
Overrides: setConf in class RobotRulesParser

protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL.
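The key scheme above can be illustrated with a minimal, self-contained sketch. The class name, the String-based signature, and the hard-coded default ports are assumptions for illustration; they are not Nutch's actual implementation, which operates on java.net.URL and the crawler-commons cache.

```java
import java.net.URI;

public class RobotsCacheKey {
    // Hypothetical re-implementation of the cache-key scheme described
    // above: protocol, host, and port together identify one robots.txt,
    // so rules for one host/port pair are never reused for another.
    static String getCacheKey(String url) {
        URI u = URI.create(url);
        String protocol = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        if (port == -1) {
            // No explicit port: fall back to the protocol's default.
            port = protocol.equals("https") ? 443 : 80;
        }
        return protocol + ":" + host + ":" + port;
    }

    public static void main(String[] args) {
        System.out.println(getCacheKey("http://example.com/a/b"));    // http:example.com:80
        System.out.println(getCacheKey("https://Example.com:8443/")); // https:example.com:8443
    }
}
```

Note that the path is deliberately excluded from the key: every URL on the same host, protocol, and port shares the same robots.txt and therefore the same cache entry.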
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)

Get the rules from robots.txt which apply for the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt file is then parsed and the rules are cached to avoid re-fetching and re-parsing it.

Overrides: getRobotRulesSet in class RobotRulesParser

Parameters:
http - The Protocol object
url - URL
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.

Returns: BaseRobotRules object for the rules

protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)

Append the Content of robots.txt to robotsTxtContent.

Parameters:
robotsTxtContent - container to store the robots.txt response content
robotsUrl - robots.txt URL
robotsResponse - response object to be stored

Copyright © 2021 The Apache Software Foundation
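The check-cache-then-fetch flow described for getRobotRulesSet can be sketched in plain Java. The Rules class, the fetchCount counter, and the simplified default-port handling below are hypothetical stand-ins for the real crawlercommons.robots.BaseRobotRules type and HTTP fetch; this is an illustration of the caching pattern, not Nutch's code.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class RobotRulesCacheDemo {
    // Illustrative stand-in for crawlercommons.robots.BaseRobotRules.
    static class Rules { }

    static final Map<String, Rules> CACHE = new HashMap<>();
    static int fetchCount = 0; // counts simulated robots.txt fetches

    // Sketch of the flow: compose the protocol:host:port key, return the
    // cached rule set if present, otherwise "fetch", parse, and cache it.
    static Rules getRobotRulesSet(String url) {
        URI u = URI.create(url);
        int port = u.getPort() == -1 ? 80 : u.getPort();
        String key = u.getScheme() + ":" + u.getHost() + ":" + port;
        Rules rules = CACHE.get(key);
        if (rules == null) {
            // The real parser issues an HTTP request for
            // protocol://host:port/robots.txt here and parses the response.
            fetchCount++;
            rules = new Rules();
            CACHE.put(key, rules);
        }
        return rules;
    }

    public static void main(String[] args) {
        getRobotRulesSet("http://example.com/page1");
        getRobotRulesSet("http://example.com/page2");
        System.out.println(fetchCount); // 1: both URLs share one cache entry
    }
}
```

Because the key omits the URL path, crawling many pages on one site triggers only a single robots.txt fetch, which is the point of the cache.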