public class File extends Object implements Protocol
This protocol implementation creates a FileResponse object and gets the content of the url from it. Configurable parameters are file.content.limit and file.crawl.parent in nutch-default.xml, defined under the "file properties" section.
| Modifier and Type | Field and Description |
|---|---|
| protected static org.slf4j.Logger | LOG |

Fields inherited from interface org.apache.nutch.protocol.Protocol: X_POINT_ID

| Constructor and Description |
|---|
| File() |
| Modifier and Type | Method and Description |
|---|---|
| Configuration | getConf() Get the Configuration object |
| ProtocolOutput | getProtocolOutput(Text url, CrawlDatum datum) Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received |
| crawlercommons.robots.BaseRobotRules | getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) No robots parsing is done for the file protocol. |
| static void | main(String[] args) Quick way for running this class. |
| void | setConf(Configuration conf) Set the Configuration object |
| void | setMaxContentLength(int maxContentLength) Set the length after which content is truncated. |
public void setConf(Configuration conf)

Set the Configuration object.

Specified by: setConf in interface Configurable

public Configuration getConf()

Get the Configuration object.

Specified by: getConf in interface Configurable

public void setMaxContentLength(int maxContentLength)

Set the length after which content is truncated.
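A minimal sketch of wiring these setters together; the class name and the 64 KB cap are hypothetical, and NutchConfiguration.create() is assumed to load the default Nutch configuration.

```java
import org.apache.nutch.protocol.file.File;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical sketch: configure the protocol, then cap fetched content.
public class ConfigureFileProtocol {
  public static void main(String[] args) {
    File protocol = new File();
    protocol.setConf(NutchConfiguration.create()); // set the Configuration object
    protocol.setMaxContentLength(64 * 1024);       // truncate content beyond 64 KB
    System.out.println(protocol.getConf() != null);
  }
}
```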
public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)

Creates a FileResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.

Specified by: getProtocolOutput in interface Protocol

Parameters:
url - Text containing the url
datum - The CrawlDatum object corresponding to the url

Returns: ProtocolOutput object for the content of the file indicated by url
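For illustration, a self-contained sketch that fetches a local file through this method; the class name and file path are hypothetical.

```java
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.file.File;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical sketch: fetch a local file and print what came back.
public class FetchLocalFile {
  public static void main(String[] args) {
    File protocol = new File();
    protocol.setConf(NutchConfiguration.create());
    ProtocolOutput output = protocol.getProtocolOutput(
        new Text("file:///tmp/example.txt"), // hypothetical path
        new CrawlDatum());
    Content content = output.getContent();
    System.out.println(content.getContentType());
    System.out.println(content.getContent().length + " bytes");
  }
}
```

In real use, the status in output.getStatus() should be checked before reading the content, since a missing or unreadable file will not produce a successful fetch.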
public static void main(String[] args) throws Exception

Quick way for running this class.

Throws: Exception

public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)

No robots parsing is done for the file protocol.

Specified by: getRobotRules in interface Protocol

Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.
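Since no robots parsing is done for file: URLs, the returned rules would be expected not to block any path. A small sketch, with a hypothetical path and null passed for robotsTxtContent as permitted above:

```java
import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.file.File;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical sketch: file: URLs are not subject to robots.txt rules.
public class FileRobotRulesCheck {
  public static void main(String[] args) {
    File protocol = new File();
    protocol.setConf(NutchConfiguration.create());
    BaseRobotRules rules = protocol.getRobotRules(
        new Text("file:///tmp/example.txt"), new CrawlDatum(), null);
    System.out.println(rules.isAllowed("file:///tmp/example.txt")); // expected: true, no rules disallow it
  }
}
```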
Copyright © 2021 The Apache Software Foundation