【Solved】How to look up the syntax and description of library functions in R

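【Background】

While working on:

【Record】Trying to use R to fetch web pages and extract information

I needed to figure out the syntax of htmlParse in the XML package I had just installed for R.

The question then becomes:

In R, how do you find the help files, the descriptions of a package's functions, the manuals, and similar material?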

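【Process】

1. First I looked in the Start menu, but found no R manual there:

[Screenshot: searching the Start menu for R, no manual found]

2. Then I worked it out by myself:

RGui -> Help -> R functions(text)

[Screenshot: RGui Help menu, R functions (text)]

3. Then enter the name to look up:

Here I entered htmlParse

[Screenshot: the query dialog with htmlParse entered, then OK]

This finds and opens the HTML page (served by R's local help server):

http://127.0.0.1:21716/library/XML/html/xmlTreeParse.html

which gives the following description: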

htmlParse(file, ignoreBlanks = TRUE, handlers = NULL, replaceEntities = FALSE,
          asText = FALSE, trim = TRUE, validate = FALSE, getDTD = TRUE,
          isURL = FALSE, asTree = FALSE, addAttributeNamespaces = FALSE,
          useInternalNodes = TRUE, isSchema = FALSE, fullNamespaceInfo = FALSE,
          encoding = character(),
          useDotNames = length(grep("^\\.", names(handlers))) > 0,
          xinclude = TRUE, addFinalizer = TRUE,
          error = htmlErrorHandler, isHTML = TRUE,
          options = integer(), parentFirst = FALSE)

xmlSchemaParse(file, asText = FALSE, xinclude = TRUE, error = xmlErrorCumulator())

Arguments

file

The name of the file containing the XML contents. This can contain ~ which is expanded to the user’s home directory. It can also be a URL. See isURL. Additionally, the file can be compressed (gzip) and is read directly without the user having to de-compress (gunzip) it.

ignoreBlanks

logical value indicating whether text elements made up entirely of white space should be included in the resulting ‘tree’.

handlers

Optional collection of functions used to map the different XML nodes to R objects. Typically, this is a named list of functions, and a closure can be used to provide local data. This provides a way of filtering the tree as it is being created in R, adding or removing nodes, and generally processing them as they are constructed in the C code.

In a recent addition to the package (version 0.99-8), if this is specified as a single function object, we call that function for each node (of any type) in the underlying DOM tree. It is invoked with the new node and its parent node. This applies to regular nodes and also comments, processing instructions, CDATA nodes, etc. So this function must be sufficiently general to handle them all.

replaceEntities

logical value indicating whether to substitute entity references with their text directly. This should be left as FALSE. The text still appears as the value of the node, but there is more information about its source, allowing the parse to be reversed with full reference information.

asText

logical value indicating that the first argument, ‘file’, should be treated as the XML text to parse, not the name of a file. This allows the contents of documents to be retrieved from different sources (e.g. HTTP servers, XML-RPC, etc.) and still use this parser.

trim

whether to strip white space from the beginning and end of text strings.

validate

logical indicating whether to use a validating parser or not, or in other words check the contents against the DTD specification. If this is true, warning messages will be displayed about errors in the DTD and/or document, but the parsing will proceed except for the presence of terminal errors. This is ignored when parsing an HTML document.

getDTD

logical flag indicating whether the DTD (both internal and external) should be returned along with the document nodes. This changes the return type. This is ignored when parsing an HTML document.

isURL

indicates whether the file argument refers to a URL (accessible via ftp or http) or a regular file on the system. If asText is TRUE, this should not be specified. The function attempts to determine whether the data source is a URL by using grep to look for http or ftp at the start of the string. The libxml parser handles the connection to servers, not the R facilities (e.g. scan).

asTree

this only applies when one passes a value for the handlers argument and is used then to determine whether the DOM tree should be returned or the handlers object.

addAttributeNamespaces

a logical value indicating whether to return the namespace in the names of the attributes within a node or to omit them. If this is TRUE, an attribute such as xsi:type="xsd:string" is reported with the name xsi:type. If it is FALSE, the name of the attribute is type.

useInternalNodes

a logical value indicating whether to call the converter functions with objects of class XMLInternalNode rather than XMLNode. This should make things faster as we do not convert the contents of the internal nodes to explicit R objects. Also, it allows one to access the parent and ancestor nodes. However, since the objects refer to volatile C-level objects, one cannot store these nodes for use in further computations within R. They “disappear” after the processing of the XML document is completed.

If this argument is TRUE and no handlers are provided, the return value is a reference to the internal C-level document pointer. This can be used to do post-processing via XPath expressions using getNodeSet.

This is ignored when parsing an HTML document.

isSchema

a logical value indicating whether the document is an XML schema (TRUE) and should be parsed as such using the built-in schema parser in libxml.

fullNamespaceInfo

a logical value indicating whether to provide the namespace URI and prefix on each node or just the prefix. The latter (FALSE) is currently the default as that was the original way the package behaved. However, using TRUE is more informative and we will make this the default in the future.

This is ignored when parsing an HTML document.

encoding

a character string (scalar) giving the encoding for the document. This is optional as the document should contain its own encoding information. However, if it doesn’t, the caller can specify this for the parser. If the XML/HTML document does specify its own encoding that value is used regardless of any value specified by the caller. (That’s just the way it goes!) So this is to be used as a safety net in case the document does not have an encoding and the caller happens to know the actual encoding.

useDotNames

a logical value indicating whether to use the newer format for identifying general element function handlers with the ‘.’ prefix, e.g. .text, .comment, .startElement. If this is FALSE, then the older format text, comment, startElement, … are used. This causes problems when there are indeed nodes named text or comment or startElement, as node-specific handlers are confused with the corresponding general handler of the same name. Using TRUE means that your list of handlers should have names that use the ‘.’ prefix for these general element handlers. This is the preferred way to write new code.

xinclude

a logical value indicating whether to process nodes of the form <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"> to insert content from other parts of (potentially different) documents. TRUE means resolve the external references; FALSE means leave the node as is. Of course, one can process these nodes oneself after the document has been parsed, using handler functions or working on the DOM. Please note that the syntax for inclusion using XPointer is not the same as XPath and the results can be a little unexpected and confusing. See the libxml2 documentation for more details.

addFinalizer

a logical value indicating whether the default finalizer routine should be registered to free the internal xmlDoc when R no longer has a reference to this external pointer object. This is only relevant when useInternalNodes is TRUE.

error

a function that is invoked when the XML parser reports an error. When an error is encountered, this is called with 7 arguments. See xmlStructuredStop for information about these.

If parsing completes and no document is generated, this function is called again with only one argument, which is a character vector of length 0. This gives the function an opportunity to report all the errors and raise an exception rather than doing this when it sees the first one.

This function can do what it likes with the information. It can raise an R error or let the parser continue and potentially find further errors.

The default value of this argument supplies a function that cumulates the errors.

If this is NULL, the default error handler function in the package, xmlStructuredStop, is invoked and this will raise an error in R at that time.

isHTML

a logical value that allows this function to be used for parsing HTML documents. This causes validation and processing of a DTD to be turned off. This is currently experimental so that we can implement htmlParse with this same function.

options

an integer value or vector of values that are combined (OR’ed) together to specify options for the XML parser. This is the same as the options parameter for xmlParseDoc.

parentFirst

a logical value for use when we have handler functions and are traversing the tree. This controls whether we process the node before processing its children, or process the children before their parent node.
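As a rough illustration of a few of the arguments listed above, here is a minimal sketch of calling htmlParse with asText = TRUE. It assumes the XML package is installed; the HTML string and the variable names (html_text, page_title, paragraphs) are made up for this example and are not part of the help page.

```r
# Minimal sketch: parse an in-memory HTML string with htmlParse.
# Assumes the XML package is installed; the HTML below is invented for illustration.
if (requireNamespace("XML", quietly = TRUE)) {
  html_text <- paste0(
    "<html><head><title>Demo page</title></head>",
    "<body><p>Hello</p><p>World</p></body></html>"
  )

  # asText = TRUE: treat the first argument as the document itself, not a file name.
  # encoding is only a fallback for documents that do not declare their own.
  doc <- XML::htmlParse(html_text, asText = TRUE, encoding = "UTF-8")

  # htmlParse defaults to useInternalNodes = TRUE, so XPath queries work on the result.
  page_title <- XML::xpathSApply(doc, "//title", XML::xmlValue)
  paragraphs <- XML::xpathSApply(doc, "//p", XML::xmlValue)

  print(page_title)
  print(paragraphs)
}
```

For a real page, the first argument would be a file name or URL and asText would stay at its default of FALSE.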

After that, you can read through the usage of each parameter yourself.
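The same lookup can probably be done without the menu, straight from the R console. The sketch below uses the standard help(), args() and formals() functions; it assumes the XML package is installed for the htmlParse part, and the variable names (h, h_xml, arg_names) are only for illustration.

```r
# Minimal sketch: look up function help and argument lists from the console.

# help() returns an object describing the matching help file(s);
# printing that object is what actually displays the page.
h <- help("paste")   # base function, always available

# The htmlParse lines assume the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  # Equivalent to typing ?htmlParse after library(XML).
  h_xml <- help("htmlParse", package = "XML")

  # Show the full signature without opening the help page.
  print(args(XML::htmlParse))

  # Just the names of the formal arguments.
  arg_names <- names(formals(XML::htmlParse))
  print(arg_names)
}
```

In an interactive session, typing ?htmlParse or help("htmlParse") directly (without assigning the result) opens the help page itself.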

 

【Summary】

R looks up help files and function descriptions by opening HTML pages served from a local HTTP help server on 127.0.0.1.
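To see where that address comes from, here is a small sketch that starts the help server by hand with tools::startDynamicHelp and rebuilds the URL of the page used in this post. The port is picked by R at run time, so it will usually not be 21716; help_url is just a name chosen for this example.

```r
# Minimal sketch: find the port of R's built-in help server and build a help URL.

# NA means "start the server only if it is not already running".
# The port is returned invisibly; fall back to 0 if the server cannot be started.
port <- tryCatch(tools::startDynamicHelp(NA), error = function(e) 0L)

# htmlParse is documented on the xmlTreeParse page of the XML package,
# which is why the file name in the URL is xmlTreeParse.html.
help_url <- sprintf("http://127.0.0.1:%d/library/XML/html/xmlTreeParse.html", port)
print(help_url)

# In an interactive session, either of these should open the page in a browser:
# help("htmlParse", package = "XML", help_type = "html")
# browseURL(help_url)
```

The page only loads while that R session is still running, since the server lives inside the R process.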


