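"""
An interface to html5lib that mimics the lxml.html interface.
"""
import sys
import string

from html5lib import HTMLParser as _HTMLParser
from html5lib.treebuilders.etree_lxml import TreeBuilder
from lxml import etree
from lxml.html import Element, XHTML_NAMESPACE, _contains_block_level_tag

# Python 2/3 compatibility shims.
try:
    _strings = basestring
except NameError:
    _strings = (bytes, str)
try:
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
try:
    from urlparse import urlparse
except ImportError:
    from urllib.parse import urlparse


class HTMLParser(_HTMLParser):
    """An html5lib HTML parser with lxml as tree."""

    def __init__(self, strict=False, **kwargs):
        _HTMLParser.__init__(self, strict=strict, tree=TreeBuilder, **kwargs)


try:
    from html5lib import XHTMLParser as _XHTMLParser
except ImportError:
    # Not every html5lib release provides XHTMLParser.
    pass
else:
    class XHTMLParser(_XHTMLParser):
        """An html5lib XHTML Parser with lxml as tree."""

        def __init__(self, strict=False, **kwargs):
            _XHTMLParser.__init__(self, strict=strict, tree=TreeBuilder, **kwargs)

    xhtml_parser = XHTMLParser()


def _find_tag(tree, tag):
    # Look for the plain tag first, then for its XHTML-namespaced form.
    elem = tree.find(tag)
    if elem is not None:
        return elem
    return tree.find('{%s}%s' % (XHTML_NAMESPACE, tag))


def document_fromstring(html, guess_charset=None, parser=None):
    """
    Parse a whole document from a string.

    If `guess_charset` is true, or if the input is not Unicode but a
    byte string, the `chardet` library will perform charset guessing
    on the string.
    """
    if not isinstance(html, _strings):
        raise TypeError('string required')

    if parser is None:
        parser = html_parser

    options = {}
    if guess_charset is None and isinstance(html, bytes):
        # Byte input with no explicit choice: enable charset guessing.
        guess_charset = True
    if guess_charset is not None:
        options['useChardet'] = guess_charset
    return parser.parse(html, **options).getroot()


def fragments_fromstring(html, no_leading_text=False,
                         guess_charset=None, parser=None):
    """Parses several HTML elements, returning a list of elements.

    The first item in the list may be a string.  If no_leading_text is true,
    then it will be an error if there is leading text, and it will always be
    a list of only elements.

    If `guess_charset` is true, the `chardet` library will perform charset
    guessing on the string.
    """
    if not isinstance(html, _strings):
        raise TypeError('string required')

    if parser is None:
        parser = html_parser

    options = {}
    if guess_charset is None and isinstance(html, bytes):
        # Byte input with no explicit choice: fragments default to no guessing.
        guess_charset = False
    if guess_charset is not None:
        options['useChardet'] = guess_charset
    children = parser.parseFragment(html, 'div', **options)
    if children and isinstance(children[0], _strings):
        if no_leading_text:
            if children[0].strip():
                raise etree.ParserError('There is leading text: %r' %
                                        children[0])
            # Whitespace-only leading text is dropped silently.
            del children[0]
    return children


def fragment_fromstring(html, create_parent=False,
                        guess_charset=None, parser=None):
    """Parses a single HTML element; it is an error if there is more than
    one element, or if anything but whitespace precedes or follows the
    element.

    If 'create_parent' is true (or is a tag name) then a parent node
    will be created to encapsulate the HTML in a single element.  In
    this case, leading or trailing text is allowed.

    If `guess_charset` is true, the `chardet` library will perform charset
    guessing on the string.
    """
    if not isinstance(html, _strings):
        raise TypeError('string required')

    accept_leading_text = bool(create_parent)

    elements = fragments_fromstring(
        html, guess_charset=guess_charset, parser=parser,
        no_leading_text=not accept_leading_text)

    if create_parent:
        # A true non-string value means "use the default wrapper tag".
        if not isinstance(create_parent, _strings):
            create_parent = 'div'
        new_root = Element(create_parent)
        if elements:
            if isinstance(elements[0], _strings):
                new_root.text = elements[0]
                del elements[0]
            new_root.extend(elements)
        return new_root

    if not elements:
        raise etree.ParserError('No elements found')
    if len(elements) > 1:
        raise etree.ParserError('Multiple elements found')
    result = elements[0]
    if result.tail and result.tail.strip():
        raise etree.ParserError('Element followed by text: %r' % result.tail)
    result.tail = None
    return result


def fromstring(html, guess_charset=None, parser=None):
    """Parse the html, returning a single element/document.

    This tries to minimally parse the chunk of text, without knowing if it
    is a fragment or a document.

    'base_url' will set the document's base_url attribute (and the tree's
    docinfo.URL)

    If `guess_charset` is true, or if the input is not Unicode but a
    byte string, the `chardet` library will perform charset guessing
    on the string.
    """
    if not isinstance(html, _strings):
        raise TypeError('string required')
    doc = document_fromstring(html, parser=parser,
                              guess_charset=guess_charset)

    # Input that starts with a doctype or <html> is a full document.
    start = html[:50]
    if isinstance(start, bytes):
        # Decode so the prefix can be compared as text; ASCII is enough
        # for the characters checked below.
        start = start.decode('ascii', 'replace')

    start = start.lstrip().lower()
    if start.startswith('<html') or start.startswith('<!doctype'):
        return doc

    head = _find_tag(doc, 'head')

    # A non-empty head also means a full document.
    if len(head):
        return doc

    body = _find_tag(doc, 'body')

    # A body holding exactly one element and no surrounding text was
    # probably a single element passed in.
    if (len(body) == 1 and (not body.text or not body.text.strip())
            and (not body[-1].tail or not body[-1].tail.strip())):
        return body[0]

    # Otherwise it is probably several elements or text: reuse the body
    # as a wrapper element and return it.
    if _contains_block_level_tag(body):
        body.tag = 'div'
    else:
        body.tag = 'span'
    return body


def parse(filename_url_or_file, guess_charset=None, parser=None):
    """Parse a filename, URL, or file-like object into an HTML document
    tree.  Note: this returns a tree, not an element.  Use
    ``parse(...).getroot()`` to get the document root.

    If ``guess_charset`` is true, the ``useChardet`` option is passed into
    html5lib to enable character detection.  This option is on by default
    when parsing from URLs, off by default when parsing from file(-like)
    objects (which tend to return Unicode more often than not), and on by
    default when parsing from a file path (which is read in binary mode).
    """
    if parser is None:
        parser = html_parser
    if not isinstance(filename_url_or_file, _strings):
        # File-like object: assume it yields Unicode more often than bytes.
        fp = filename_url_or_file
        if guess_charset is None:
            guess_charset = False
    elif _looks_like_url(filename_url_or_file):
        # URL: assume the response is bytes.
        fp = urlopen(filename_url_or_file)
        if guess_charset is None:
            guess_charset = True
    else:
        # File path: read in binary mode.
        fp = open(filename_url_or_file, 'rb')
        if guess_charset is None:
            guess_charset = True

    options = {}
    # Only pass useChardet when guessing is enabled.
    if guess_charset:
        options['useChardet'] = guess_charset
    return parser.parse(fp, **options)


def _looks_like_url(str):
    scheme = urlparse(str)[0]
    if not scheme:
        return False
    elif (sys.platform == 'win32' and
            scheme in string.ascii_letters
            and len(scheme) == 1):
        # A one-letter "scheme" on Windows is a drive letter, so this is
        # an ordinary absolute path.
        return False
    else:
        return True


html_parser = HTMLParser()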