Web crawler app using QWebPage
-
@jelicicm I guess the order of deletion is causing the problem.
Also, where are you calling doDownload()?
Does it get into the replyFinished slot, i.e. into the else part? I would create the QXmlStreamReader object in the else part, i.e. when the request succeeds.
-
wrote on 5 Jun 2015, 07:26
@p3c0 I can't believe this: I never called doDownload(), and I wondered why it doesn't download. Stupid, stupid...
I created a QFile and managed to save a page to it. When I open it, everything looks good, so the download seems to work fine. I saved it as .html and it opens nicely.
However, I'm still not sure that my QXmlStreamReader buffer object is good. I tried printing all of it with
    ui->plainTextEdit->appendPlainText(xml->text().toString());
but after "debug" nothing gets printed?!
-
@jelicicm Is the print slot called?
-
@p3c0 Yes, I have a line that just prints "debug" in my PlainTextEdit field, and that gets printed. After that, though, I want to print what I have downloaded, and that doesn't work.
-
@jelicicm OK, can you make sure it works in the else part of replyFinished? Try the same with buffer.
-
wrote on 5 Jun 2015, 07:42
@p3c0 I did this in the else part of my replyFinished:
    buffer->addData(reply->readAll());
    while (!buffer->atEnd()) {
        qDebug() << buffer->readNext();
    }
What I get are some weird numbers that I don't understand.
My qDebug output looks like this:
    2 8 4 6 4 6 4 6 5 6 4 6 4 6 4 6 4 6 4 5 6 4 5 6 4 5 6 4 5 6 4 5 6 4 5 6 4 5 6 4 5 6 4 1
The web page I was downloading: http://www.bbc.com/future/story/20150604-the-bravest-walks-ever-taken
-
@jelicicm It prints a TokenType and not the text. Since TokenType is an enum, what you see are its values. Also, text() will not do the same.
You can try printing errorString() to see if there are any errors.
Since readNext prints something, there is data in the reader.
There are a few examples on the forums that show how to parse XML using QXmlStreamReader. I would also suggest looking at the htmlinfo example under <QtDir>/xml/htmlinfo installed on your system. Another example would be this. But the closest example is htmlinfo.
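To make the numbers in the qDebug output readable, here is a small Qt-free sketch that maps them back to token names. It assumes the numeric values listed for QXmlStreamReader::TokenType in the Qt documentation; the enum below is a standalone copy for illustration, and tokenName and decodeTokens are hypothetical helper names, not part of Qt.

```cpp
#include <string>
#include <vector>

// Standalone copy of the numeric values documented for
// QXmlStreamReader::TokenType (assumption: values as in the Qt docs).
enum class TokenType {
    NoToken = 0,
    Invalid = 1,
    StartDocument = 2,
    EndDocument = 3,
    StartElement = 4,
    EndElement = 5,
    Characters = 6,
    Comment = 7,
    DTD = 8,
    EntityReference = 9,
    ProcessingInstruction = 10
};

// Hypothetical helper: turn one printed number back into a readable name.
std::string tokenName(int value) {
    switch (static_cast<TokenType>(value)) {
    case TokenType::NoToken:               return "NoToken";
    case TokenType::Invalid:               return "Invalid";
    case TokenType::StartDocument:         return "StartDocument";
    case TokenType::EndDocument:           return "EndDocument";
    case TokenType::StartElement:          return "StartElement";
    case TokenType::EndElement:            return "EndElement";
    case TokenType::Characters:            return "Characters";
    case TokenType::Comment:               return "Comment";
    case TokenType::DTD:                   return "DTD";
    case TokenType::EntityReference:       return "EntityReference";
    case TokenType::ProcessingInstruction: return "ProcessingInstruction";
    }
    return "Unknown";  // value outside the documented range
}

// Hypothetical helper: decode a whole sequence of printed values.
std::vector<std::string> decodeTokens(const std::vector<int>& values) {
    std::vector<std::string> names;
    names.reserve(values.size());
    for (int value : values)
        names.push_back(tokenName(value));
    return names;
}
```

Read this way, the output "2 8 4 6 ... 4 1" starts with StartDocument and DTD, continues with StartElement, Characters and EndElement tokens, and ends with Invalid, which suggests the reader stopped on a parse error that errorString() should describe.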
-
wrote on 5 Jun 2015, 09:45
Wow @p3c0, this really helped a lot! Thank you very much!
Currently I have:
    void MainWindow::print(QXmlStreamReader *reader)
    {
        int paragraphCount = 0;
        QStringList links;
        QString title;
        QString text;
        while (!reader->atEnd()) {
            reader->readNext();
            /*text = reader->readElementText();
            if(text.contains("some text"))
            {
                qDebug()<<"found text\n";
            }*/
            if (reader->isStartElement()) {
                if (reader->name() == "title")
                    title = reader->readElementText();
                else if (reader->name() == "a")
                    links.append(reader->attributes().value("href").toString());
                else if (reader->name() == "p")
                    ++paragraphCount;
            }
        }
        if (reader->hasError()) {
            ui->plainTextEdit->appendPlainText(
                " The HTML file isn't well-formed: " + reader->errorString() + "\n");
            return;
        }
        qDebug() << "Title: " << title;
        qDebug() << "Paragraph count: " << paragraphCount;
        qDebug() << "No of links: " << links.size();
        qDebug() << "One link: " << links[3];
    }
And this works perfectly!
However, when I uncomment the reader->readElementText(); block, I always get the error from a little lower in the code.
What I'm trying to do is search for some text in the web page, and I guess that should be done with this function, but I can't get it to work.
-
@jelicicm As the readElementText doc states:
"Convenience function to be called in case a StartElement was read. Reads until the corresponding EndElement and returns all text in-between."
It should be used only when a StartElement is encountered. Your commented-out code doesn't do that. The code that follows it shows how it should be done.
-
wrote on 5 Jun 2015, 10:12
@p3c0 In the meantime I discovered that the problem is in
    QString text = reader->readElementText();
I just did:
    while (!reader->atEnd()) {
        reader->readNext();
        if (reader->isStartElement()) {
            text = reader->readElementText();
            qDebug() << text;
        }
        if (reader->hasError()) {
            ui->plainTextEdit->appendPlainText(
                " The HTML file isn't well-formed: " + reader->errorString() + "\n");
            return;
        }
    }
And it always prints out:
    The HTML file isn't well-formed: Expected character data.
qDebug just prints
    ""
It doesn't even make a difference whether I put this in or not:
    if (reader->name() == "title")
        title = reader->readElementText();
    else if (reader->name() == "a")
        links.append(reader->attributes().value("href").toString());
    else if (reader->name() == "p")
        ++paragraphCount;
For a moment I thought that the reader is emptied after reading the text, but that's not it.
-
@jelicicm After some analysis, I too am unsure about the detailed workings of it.
However, QXmlStreamReader can help you extract the links, which you can probably use in your web crawler implementation, as shown in that example.
Also, to implement something as simple as searching, you can instead resort to QTextStream. Set the HTML content on it as a byte array, then iterate over it, extract each line, and check whether the particular word exists in it using QString::contains.
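The line-by-line search idea can be sketched without Qt as follows. This is only an illustration of the approach in standard C++, using std::istringstream and std::string::find in place of QTextStream and QString::contains; linesContaining is a hypothetical helper name, and the content is assumed to be already available as a string.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: return the 1-based numbers of all lines in `content`
// that contain `word`. The match is case-sensitive.
std::vector<int> linesContaining(const std::string& content,
                                 const std::string& word) {
    std::vector<int> hits;
    std::istringstream stream(content);
    std::string line;
    int lineNumber = 0;
    while (std::getline(stream, line)) {
        ++lineNumber;
        // std::string::find returns npos when the word is absent.
        if (line.find(word) != std::string::npos)
            hits.push_back(lineNumber);
    }
    return hits;
}
```

Note that this searches the raw HTML, so a word that only appears inside a tag or attribute also counts as a match.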