Is there a clear way to parse HTML in Qt 5.7

Category: Qt WebKit (Unsolved)
Tags: html, parser, c++, windows, linux
15 Posts 5 Posters 19.1k Views
    lmofallis
    wrote on 21 Dec 2016, 21:41 last edited by
    #1

    Hi,

    I would like a powerful HTML parser that works with Qt C++ (I'm currently on Qt 5.7). I'm really tired of reading lots of articles without finding a clear and recent parser.
    I found libxml2 v2.9.4, but clear examples are rare. I also read about QtWebKit, but as I understand it, it's not supported in Qt 5.7.

    I'm an amateur programmer coming from VB.NET, where I could use the good "HTML Agility Pack".

    What I want is a parser that:

    • works on Windows and Linux.
    • supports at least HTML4 (HTML5 would be perfect).
    • doesn't need a control or a viewer to work.
    • has simple tutorials or examples.

    I also found QXmlQuery, and I want to know whether it is a good HTML parser.

    Really, I'm tired of searching.
    Thank you.

      SGaist
      Lifetime Qt Champion
      wrote on 21 Dec 2016, 21:47 last edited by
      #2

      Hi and welcome to devnet,

      QtWebEngine is the current module for web-related stuff or, alternatively, there is @Konstantin-Tokarev's QtWebKit reboot if you prefer QtWebKit.

        lmofallis
        wrote on 21 Dec 2016, 22:01 last edited by lmofallis
        #3

        @SGaist Thank you.

        I migrated from VB.NET to Qt C++ three months ago, so I'm still a newbie.
        I want to know whether QtWebEngine lets me parse HTML source directly (from a URL or a file) to extract or remove the data I want, without rendering it in a control (like the WebBrowser control in .NET).

          cochise
          wrote on 21 Dec 2016, 22:05 last edited by
          #4

          Hi.
          You can try html-qt, which is a parser, not a web engine.
          https://github.com/cutelyst/html-qt
          Sadly, there is no good documentation for it. =[

            lmofallis
            wrote on 21 Dec 2016, 22:08 last edited by
            #5

            @cochise Thank you.
            Is it a good parser?

            I will try it.

              cochise
              wrote on 22 Dec 2016, 00:20 last edited by
              #6

              I can't really answer that, as I've never used an HTML parser myself. I'm aware of it because it is a spin-off of a project I use, Cutelyst, a web framework. I use HTML as output, via Grantlee templates, not as an input or working format.

                Konstantin Tokarev
                wrote on 22 Dec 2016, 07:19 last edited by
                #7

                @lmofallis said in Is there a clear way to parse HTML in Qt 5.7:

                > I would like a powerful HTML parser that works with Qt C++ (I'm currently on Qt 5.7). I'm really tired of reading lots of articles without finding a clear and recent parser.
                > I found libxml2 v2.9.4, but clear examples are rare.

                I can confirm that libxml2 can parse HTML, but I don't have an example handy.
                There are also other parsers, for example https://github.com/google/gumbo-parser

                > I also read about QtWebKit, but as I understand it, it's not supported in Qt 5.7.

                Use QtWebKit if you need at least one of these things:

                • processing the DOM of a web page that uses JavaScript to modify its content
                • using CSS queries to find interesting elements in the DOM
                • rendering HTML

                > I also found QXmlQuery, and I want to know whether it is a good HTML parser.

                Qt provides implementations of XML parsers (QXmlStreamReader, QDomDocument, QXmlReader) and XQuery (QtXmlPatterns). You can use any XML tool to process XHTML, i.e. HTML that is a valid XML document.
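
As a rough illustration of the XHTML point above: an XML parser requires every element to be closed, which ordinary HTML does not guarantee. The following is a toy well-formedness check, not a real XML parser. `tagsAreBalanced` is a hypothetical helper written for this sketch, and it assumes there are no comments, CDATA sections, doctype declarations, or `>` characters inside attribute values.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy check (hypothetical helper, not part of Qt): returns true when every
// opening tag has a matching closing tag in the right order, as XML demands.
// Assumes no comments, CDATA, doctype, or '>' inside attribute values.
bool tagsAreBalanced(const std::string& markup)
{
    std::vector<std::string> open;              // stack of currently open elements
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return false;                       // unterminated tag
        std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty())
            return false;
        const bool closing = tag.front() == '/';
        if (!closing && tag.back() == '/')
            continue;                           // self-closing, e.g. <br/>
        if (closing)
            tag.erase(0, 1);
        // Keep only the element name, dropping any attributes.
        const std::string name = tag.substr(0, tag.find_first_of(" \t\r\n"));
        if (!closing) {
            open.push_back(name);
        } else {
            if (open.empty() || open.back() != name)
                return false;                   // mismatched or stray closing tag
            open.pop_back();
        }
    }
    return open.empty();                        // leftover open tags are an error
}
```

With this check, `<p>one<br/>two</p>` passes while `<p>one<br>two</p>` fails, even though browsers accept both. That gap is why XML tools only fit XHTML, and why an HTML-aware parser is needed for real-world pages.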

                  lmofallis
                  wrote on 22 Dec 2016, 11:52 last edited by lmofallis
                  #8

                  @cochise said in Is there a clear way to parse HTML in Qt 5.7:

                  > Cutelyst

                  I've heard about Cutelyst, but I don't know all of its features.

                    lmofallis
                    wrote on 22 Dec 2016, 12:09 last edited by
                    #9

                    @Konstantin-Tokarev Thank you.

                    So I don't need QtWebKit, because I don't want to render anything. I only want to download an HTML source or load it from a file, and then parse it.

                    Are the XML parsers implemented in Qt as good as libxml2 or gumbo-parser?
                    While waiting for your answer, I will give libxml2 and gumbo-parser another try.

                      Konstantin Tokarev
                      wrote on 22 Dec 2016, 12:18 last edited by
                      #10

                      @lmofallis said in Is there a clear way to parse HTML in Qt 5.7:

                      > So I don't need QtWebKit, because I don't want to render anything. I only want to download an HTML source or load it from a file, and then parse it.

                      Don't underestimate the value of the "CSS queries" point: if you have complex documents where you need to process only a few deeply nested elements, it can be very handy to have a full-blown CSS query engine. With the revived QtWebKit, complex queries will even be JIT-compiled!

                      > Are the XML parsers implemented in Qt as good as libxml2 or gumbo-parser?

                      "Good" is a fuzzy term. You may like the API of the Qt parsers more, but their speed may be worse than that of others. QDom has abysmal performance. QXmlStreamReader is much faster, but there is still no way around its internal conversion of all document text to UTF-16, which hurts performance if your document is, e.g., in UTF-8.

                      Just in case you want a lightning-fast XML DOM parser, try pugixml.
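
To make the UTF-16 remark concrete, the sketch below widens ASCII text to 16-bit code units, which is roughly what happens to each character when UTF-8 input is converted to UTF-16. It uses only the standard library. `widenAscii` and `payloadBytes` are hypothetical helpers for illustration; they handle 7-bit ASCII only and say nothing about how Qt implements its own conversion.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: widen 7-bit ASCII text to UTF-16 code units.
// For ASCII, each input byte maps to exactly one 16-bit unit.
std::u16string widenAscii(const std::string& ascii)
{
    std::u16string out;
    out.reserve(ascii.size());
    for (unsigned char c : ascii)
        out.push_back(static_cast<char16_t>(c));
    return out;
}

// Size in bytes of the character data held by a string.
template <typename String>
std::size_t payloadBytes(const String& s)
{
    return s.size() * sizeof(typename String::value_type);
}
```

For a mostly-ASCII document the character data grows to twice its size and every byte has to be visited once before parsing starts, a cost that a parser working directly on UTF-8 bytes does not pay.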

                        lmofallis
                        wrote on 22 Dec 2016, 13:57 last edited by
                        #11

                        @Konstantin-Tokarev

                        I agree with you about QtWebKit. But what about performance?
                        Sometimes I have to process more than a thousand files, and as I understand it, doing this job with QtWebKit is not a good idea.

                        In 99% of cases I work with documents encoded in ANSI or UTF-8, so QXmlStreamReader should be good for its speed.

                        Thanks.

                          Konstantin Tokarev
                          wrote on 22 Dec 2016, 14:14 last edited by
                          #12

                          @lmofallis The HTML parser in WebKit is heavily optimized. If QtWebKit is appropriate for your task (i.e., your task matches one or more of the reasons listed above), I'd recommend running a benchmark. Note that QtWebKit will take additional time to initialize and deinitialize, and will use a bit more memory (but a constant amount if you disable caches), so the benchmark should involve parsing a large number of documents in a loop.
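
A benchmark along these lines could be sketched as follows. This is a generic timing harness that uses only the standard library. `BenchmarkResult`, `benchmarkParser` and the `parse` callback are hypothetical names, and the actual QtWebKit, libxml2 or gumbo-parser call would have to be plugged in by the reader. Timing the whole loop, rather than a single document, spreads one-time initialization costs over many documents.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct BenchmarkResult {
    std::size_t documents = 0;   // number of parse calls made
    double totalMs = 0.0;        // wall-clock time for the whole loop
    double perDocumentMs() const { return documents ? totalMs / documents : 0.0; }
};

// Hypothetical harness: runs `parse` over every document `rounds` times and
// times the loop as a whole. `parse` stands in for whichever parser is tested.
BenchmarkResult benchmarkParser(const std::vector<std::string>& documents,
                                const std::function<void(const std::string&)>& parse,
                                std::size_t rounds = 1)
{
    using Clock = std::chrono::steady_clock;
    BenchmarkResult result;
    const auto start = Clock::now();
    for (std::size_t r = 0; r < rounds; ++r) {
        for (const std::string& doc : documents) {
            parse(doc);
            ++result.documents;
        }
    }
    const std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    result.totalMs = elapsed.count();
    return result;
}
```

Running the same harness with each candidate parser on the same set of real documents gives numbers that can be compared directly. Parser setup that happens once per process should stay outside the callback if only steady-state speed is of interest.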

                            Konstantin Tokarev
                            wrote on 22 Dec 2016, 14:18 last edited by
                            #13

                            @lmofallis said in Is there a clear way to parse HTML in Qt 5.7:

                            > In 99% of cases I work with documents encoded in ANSI or UTF-8, so QXmlStreamReader should be good for its speed.

                            That will be good for WebKit: it has a fast path for ASCII text. With QXmlStreamReader you will end up converting all text going through the parser to UTF-16. Also note that QXmlStreamReader won't parse HTML documents that are not valid XML.

                              lmofallis
                              wrote on 31 Dec 2016, 18:57 last edited by
                              #14

                              @Konstantin-Tokarev

                              I apologize for my late answer.
                              In the end I used the QGumboParser library, and it has worked fine so far. But I don't know how to remove an entire tag (its outer HTML) together with its children.

                              For example, I want to delete <div class="content"> from this code:

                              <html>
                                <body>
                                  <h3>First header</h3>
                                  <p>text text text</p>
                                  <div class="content">
                                    <h3>Nested header <a href="">My Link</a></h3>
                                  </div>
                                </body>
                              </html>
                              

                              The result:

                              <html>
                                <body>
                                  <h3>First header</h3>
                                  <p>text text text</p>
                                </body>
                              </html>
                              </html>
                              

                              Thank you.
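
If the parser's tree is awkward to modify in place, one option is to leave the tree alone and skip the unwanted element while writing the document back out. The sketch below shows the idea on a deliberately tiny stand-in tree. `Node`, `serialize` and the class-matching rule are hypothetical and are not the QGumboParser or gumbo-parser API; with a real parser, the same recursion would walk that library's node type instead. The sketch only tracks the class attribute and does not escape special characters in text.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed HTML node (not a real library type).
struct Node {
    std::string tag;              // element name; empty for a text node
    std::string cssClass;         // value of the class attribute, if any
    std::string text;             // content, used only by text nodes
    std::vector<Node> children;
};

// Writes `node` back out as markup, omitting every element whose class
// equals `skipClass` together with all of its children.
std::string serialize(const Node& node, const std::string& skipClass)
{
    if (node.tag.empty())
        return node.text;                         // text node
    if (!skipClass.empty() && node.cssClass == skipClass)
        return std::string();                     // drop the whole subtree
    std::string out = "<" + node.tag;
    if (!node.cssClass.empty())
        out += " class=\"" + node.cssClass + "\"";
    out += ">";
    for (const Node& child : node.children)
        out += serialize(child, skipClass);
    out += "</" + node.tag + ">";
    return out;
}
```

Calling `serialize(root, "content")` on a tree built from the example document would emit the `h3` and `p` elements and leave out the `div` with everything nested inside it, which is the result asked for above.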

                                dantti
                                wrote on 19 Feb 2018, 18:03 last edited by
                                #15

                                @cochise said in Is there a clear way to parse HTML in Qt 5.7:

                                > https://github.com/cutelyst/html-qt

                                This is a bit of an old thread, but as @cochise said, html-qt is an HTML parser. Sadly, I haven't finished it yet, but it follows the WHATWG specification on how to implement an HTML parser, since HTML is not XML. It's mostly complete, but outputting a DOM tree isn't ready yet, so help is welcome.
