Web crawler app using QWebPage

qwebpage, qwebkit, web page, web crawler
16 Posts 3 Posters 8.1k Views
  • J jelicicm
    4 Jun 2015, 15:11

    Hello all,

    I'm trying to create a web crawler app that takes a URL from user input, connects to that web page, and searches the page for some expression (probably a string).

    I came here looking for tips on how to do this.

    I looked on this site and on Google for examples, but only found how to make a web browser in Qt, how to go back, change pages, etc.
    I don't need to show the page in my app, just search it and count occurrences of a string, maybe print out the sentences that contain it.

    I think QWebPage can do the trick because it has a findText() function. But today's pages are not written exclusively in HTML and are mostly a combination of several technologies.
    How do I download entire web page to some buffer or something, and then search it?

    Thanks in advance!

    p3c0
    Moderators
    wrote on 5 Jun 2015, 06:01 last edited by p3c0 6 May 2015, 06:02
    #2

    Hi @jelicicm,
    You are on the right track with QWebPage, but I think it alone would not be enough. A web crawler tries to follow every link it finds on a web page. To implement that you will need to parse the HTML (basically the DOM elements) that you receive when the page finishes loading. For parsing HTML, Qt has a nice API called QWebElement, which makes it easier to extract the elements you need.

    How do I download entire web page to some buffer or something, and then search it?

    Usually you should retrieve the web page after loading finishes. QWebView loads the page and emits a signal when loading finishes. Connect a slot to the loadFinished signal and retrieve the loaded page's contents there. Typically like this:

    void MyClass::onLoadFinished(bool)
    {
        QString page_content = webView->page()->mainFrame()->toHtml(); 
        //page = QWebPage, mainFrame = QWebFrame
    }
    

      p3c0
      Moderators
      wrote on 5 Jun 2015, 06:19 last edited by p3c0 6 May 2015, 08:49
      #3

      Hi again @jelicicm,
      I completely missed that you don't want to show the web page. The above example requires you to use QWebView.
      Well, in that case you can use QNetworkAccessManager to download the page. It too has a finished signal. Connect to it and retrieve the contents using QNetworkReply::readAll(). Then the usual stuff of parsing the HTML (an HTML is an XML). To do it you can use either QXmlStreamReader or QDomDocument. Each one has advantages of its own.

        jelicicm
        wrote on 5 Jun 2015, 07:02 last edited by
        #4

        @p3c0 Thank you for your reply... both of them!

        I have used QNetworkAccessManager once before, so I have some experience with it.

        I guess I could take the result of QNetworkReply::readAll() and load it into a QXmlStreamReader object?

        So far, I have this:

        void MainWindow::on_crawl_clicked() //push button click
        {
            QXmlStreamReader *xml = new QXmlStreamReader();
            ui->plainTextEdit->clear();
            QUrl URL(ui->url->text());
            Downloader *d = new Downloader(this);
            QNetworkRequest req(URL);
            QObject::connect(d,SIGNAL(dloadend(QXmlStreamReader*)),this,SLOT(print(QXmlStreamReader*)));
            QObject::connect(d,SIGNAL(dloadend(QXmlStreamReader*)),d,SLOT(deleteLater()));
        }

        void MainWindow::print(QXmlStreamReader *xml)  //it never prints out "debug", so it never gets to here...
        {
            ui->plainTextEdit->appendPlainText("debug");
            ui->plainTextEdit->appendPlainText(xml->text().toString());
        }

        void Downloader::doDownload(QNetworkRequest req)
        {
            manager = new QNetworkAccessManager(this);

            connect(manager, SIGNAL(finished(QNetworkReply*)),this,SLOT(replyFinished(QNetworkReply*)));

            manager->get(req);
        }

        void Downloader::replyFinished(QNetworkReply *reply)
        {
            QXmlStreamReader *buffer = new QXmlStreamReader();
            if (reply->error()) {
                qDebug() << "ERROR!";
                qDebug() << reply->errorString();
                reply->deleteLater();
                manager->deleteLater();
                emit err();
            }
            else {
                buffer->addData(reply->readAll());

                reply->deleteLater();
                manager->deleteLater();

                emit dloadend(buffer);
            }
        }
        

        Do you maybe see what I am doing wrong?

          p3c0
          Moderators
          wrote on 5 Jun 2015, 07:12 last edited by p3c0 6 May 2015, 07:15
          #5

          @jelicicm I guess the order of deletion is causing the problem.
          Also, where are you calling doDownload()?
          Does it get into the replyFinished slot, i.e. into the else part? I would create the QXmlStreamReader object in the else part, i.e. only when the request succeeds.

            JohanSolo
            wrote on 5 Jun 2015, 07:23 last edited by
            #6

            @p3c0 said: Then the usual stuff of parsing the HTML (an HTML is an XML). To do it you can use either QXmlStreamReader or QDomDocument. Each one has advantages of its own.

            Well, HTML is not necessarily XML. If you're lucky, the downloaded page is XHTML and it's a win. But if the page is HTML where tags are either not nested correctly or not properly closed, I'm afraid your XML parser will most probably complain that the document contains syntax errors.
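
To make the point about nesting concrete, here is a minimal sketch in plain standard C++ (not Qt). The helper name `tagsAreBalanced` is hypothetical, and the check is deliberately naive: it ignores comments, CDATA, and `>` characters inside attribute values. It only illustrates why an XML-strict reader accepts `<br/>` but stumbles over a bare `<br>`, which ordinary HTML allows.

```cpp
#include <cctype>
#include <cstddef>
#include <stack>
#include <string>

// Hypothetical helper: returns true only if every opened tag is closed in
// strict XML fashion. Illustration only, not a real validator.
bool tagsAreBalanced(const std::string &markup)
{
    std::stack<std::string> open;
    std::size_t pos = 0;
    while ((pos = markup.find('<', pos)) != std::string::npos) {
        const std::size_t end = markup.find('>', pos);
        if (end == std::string::npos)
            return false;                 // unterminated tag
        const std::string tag = markup.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty())
            return false;
        if (tag[0] == '!' || tag[0] == '?')
            continue;                     // skip <!DOCTYPE ...> and <?xml ...?>
        if (tag.back() == '/')
            continue;                     // self-closing element such as <br/>
        const bool closing = (tag[0] == '/');
        std::size_t i = closing ? 1 : 0;
        std::string name;
        while (i < tag.size() && !std::isspace(static_cast<unsigned char>(tag[i])))
            name += tag[i++];
        if (!closing) {
            open.push(name);              // remember the element we are inside
        } else {
            if (open.empty() || open.top() != name)
                return false;             // stray or wrongly nested closing tag
            open.pop();
        }
    }
    return open.empty();                  // anything left open is an error
}
```

With this sketch, `<p>line<br/>next</p>` passes, while `<p>line<br>next</p>` fails at `</p>` because `br` is still open. A real XML parser reports that kind of mismatch as a syntax error.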

            `They did not know it was impossible, so they did it.'
            -- Mark Twain

              jelicicm
              wrote on 5 Jun 2015, 07:26 last edited by
              #7

              @p3c0 I can't believe this: I never called doDownload(), and I was wondering why it doesn't download. Stupid, stupid...

              I created a QFile and managed to save a page to it. When I open it, everything looks good, so the download seems to work fine. I saved it as .html and it opens nicely.

              However, I'm still not sure that my QXmlStreamReader buffer object is good. I tried printing all of it with ui->plainTextEdit->appendPlainText(xml->text().toString());, but after "debug" nothing gets printed?!

                p3c0
                Moderators
                wrote on 5 Jun 2015, 07:31 last edited by
                #8

                @jelicicm Is the print slot called?

                  jelicicm
                  wrote on 5 Jun 2015, 07:32 last edited by
                  #9

                  @p3c0 Yes, I have a line that just prints "debug" in my PlainTextEdit field, and that gets printed. But after that I want to print out what I have downloaded, and that doesn't work.

                    p3c0
                    Moderators
                    wrote on 5 Jun 2015, 07:35 last edited by
                    #10

                    @jelicicm OK, can you make sure it works in the else part of replyFinished? Try the same with buffer.

                      jelicicm
                      wrote on 5 Jun 2015, 07:42 last edited by
                      #11

                      @p3c0 I did this in the else part of my replyFinished:

                              buffer->addData(reply->readAll());
                      
                              while(!buffer->atEnd()) {
                                  qDebug()<<buffer->readNext();
                              }
                      

                      What I get are some weird numbers that I don't understand.
                      My qDebug output looks like this:

                      2
                      8
                      4
                      6
                      4
                      6
                      4
                      6
                      5
                      6
                      4
                      6
                      4
                      6
                      4
                      6
                      4
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      5
                      6
                      4
                      1
                      

                      Web page I was downloading: http://www.bbc.com/future/story/20150604-the-bravest-walks-ever-taken

                        p3c0
                        Moderators
                        wrote on 5 Jun 2015, 08:03 last edited by p3c0 6 May 2015, 08:14
                        #12

                        @jelicicm readNext() returns a TokenType, not the text. Since TokenType is an enum, what you see are its integer values. Also, text() will not give you the page contents; it only returns the text of the current token.
                        You can try printing errorString() to see whether there are any errors.
                        Since readNext() returns something, there is data in the reader.
                        There are a few examples on the forums that show how to parse XML using QXmlStreamReader. I would also suggest looking at the htmlinfo example under <QtDir>/xml/htmlinfo installed on your system. Another example would be this. But the closest example is htmlinfo.
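
For reading the numbers in the qDebug output above, a small lookup table may help. This is a sketch in plain standard C++; `tokenTypeName` is a hypothetical helper, and the names are listed in the order the Qt documentation gives for QXmlStreamReader::TokenType (values 0 to 10). Check the documentation for the Qt version in use before relying on it.

```cpp
#include <string>

// Hypothetical helper: maps the integer printed by qDebug() to the name of
// the matching QXmlStreamReader::TokenType value, as listed in the Qt docs.
std::string tokenTypeName(int value)
{
    static const char *const names[] = {
        "NoToken",               // 0
        "Invalid",               // 1
        "StartDocument",         // 2
        "EndDocument",           // 3
        "StartElement",          // 4
        "EndElement",            // 5
        "Characters",            // 6
        "Comment",               // 7
        "DTD",                   // 8
        "EntityReference",       // 9
        "ProcessingInstruction"  // 10
    };
    if (value < 0 || value > 10)
        return "Unknown";
    return names[value];
}
```

Read this way, the output starts with StartDocument (2) and DTD (8), alternates between StartElement (4), EndElement (5) and Characters (6), and ends with Invalid (1). That last value suggests the reader stopped on a parse error rather than at the real end of the page, which is why checking errorString() is worthwhile.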

                          jelicicm
                          wrote on 5 Jun 2015, 09:45 last edited by
                          #13

                          Wow @p3c0 this really helped a lot! Thank you very much!

                          Currently I have:

                          void MainWindow::print(QXmlStreamReader *reader)
                          {
                              int paragraphCount = 0;
                              QStringList links;
                              QString title;
                              QString text;
                          
                              while (!reader->atEnd()) {
                                  reader->readNext();
                                  /*text = reader->readElementText();
                                  if(text.contains("some text")) {
                                      qDebug()<<"found text\n";
                                  }*/
                                  if (reader->isStartElement()) {
                                      if (reader->name() == "title")
                                          title = reader->readElementText();
                                      else if(reader->name() == "a")
                                          links.append(reader->attributes().value("href").toString());
                                      else if(reader->name() == "p")
                                          ++paragraphCount;
                                  }
                              }
                              if (reader->hasError()) {
                                  ui->plainTextEdit->appendPlainText( "  The HTML file isn't well-formed: " + reader->errorString()+"\n");
                                  return;
                              }
                          
                              qDebug()<<"Title: "<<title;
                              qDebug()<<"Paragraph count: "<<paragraphCount;
                              qDebug()<<"No of links: "<<links.size();
                              qDebug()<<"One link: "<<links[3];
                          }
                          

                          And this works perfectly!

                          However, when I uncomment the reader->readElementText(); block, I always get the error reported a little lower in the code.

                          What I'm trying to do is search for some text in the web page. I guess that should be done with this function, but I can't get it to work.

                            p3c0
                            Moderators
                            wrote on 5 Jun 2015, 09:55 last edited by
                            #14

                            @jelicicm As the readElementText doc states:

                            Convenience function to be called in case a StartElement was read. Reads until the corresponding EndElement and returns all text in-between.

                            It should be used only when a StartElement is encountered. Your commented-out code doesn't do that. The code that follows it shows how it should be done.

                              jelicicm
                              wrote on 5 Jun 2015, 10:12 last edited by
                              #15

                              @p3c0 In the meantime I discovered that the problem is in QString text = reader->readElementText();

                              I just did:

                              while (!reader->atEnd()) {
                                  reader->readNext();

                                  if (reader->isStartElement()) {
                                      text = reader->readElementText();
                                      qDebug()<<text;
                                  }
                              }

                              if (reader->hasError()) {
                                  ui->plainTextEdit->appendPlainText( "  The HTML file isn't well-formed: " + reader->errorString()+"\n");
                                  return;
                              }
                              

                              And, it always prints out: The HTML file isn't well-formed: Expected character data.
                              qDebug just prints
                              "

                              "
                              It doesn't even make a difference whether I put this in or not:

                              if (reader->name() == "title")
                                  title = reader->readElementText();
                              else if(reader->name() == "a")
                                  links.append(reader->attributes().value("href").toString());
                              else if(reader->name() == "p")
                                  ++paragraphCount;


                              For a moment I thought that the reader is emptied after reading the text, but that's not it.

                                p3c0
                                Moderators
                                wrote on 5 Jun 2015, 11:14 last edited by
                                #16

                                @jelicicm After some analysis I'm still unsure about how it works in detail.
                                However, QXmlStreamReader can help you extract the links, which you can probably use in your web crawler implementation as shown in that example.
                                Also, for something as simple as searching you can instead resort to QTextStream. Pass it the HTML content as a byte array, then iterate over it line by line and check whether the particular word exists in each line using QString::contains.
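
The line-by-line search suggested here can be sketched as follows. This version uses only the standard library as a stand-in: `std::istringstream` and `std::getline` play the role of QTextStream, and `std::string::find` plays the role of QString::contains. The function names `linesContaining` and `countOccurrences` are hypothetical, and the downloaded page is assumed to already be in a `std::string`.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns every line of `content` that contains `needle` (case-sensitive).
std::vector<std::string> linesContaining(const std::string &content,
                                         const std::string &needle)
{
    std::vector<std::string> hits;
    std::istringstream stream(content);
    std::string line;
    while (std::getline(stream, line)) {
        if (line.find(needle) != std::string::npos)
            hits.push_back(line);
    }
    return hits;
}

// Counts non-overlapping occurrences of `needle` in `content`.
std::size_t countOccurrences(const std::string &content,
                             const std::string &needle)
{
    if (needle.empty())
        return 0;                         // avoid an endless loop on ""
    std::size_t count = 0;
    for (std::size_t pos = content.find(needle);
         pos != std::string::npos;
         pos = content.find(needle, pos + needle.size()))
        ++count;
    return count;
}
```

Note that searching the raw HTML this way also matches text inside tags and attributes (for example a word that appears in a URL), so the counts may be higher than what a reader sees on the rendered page.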
