Unicode handling by QRegularExpression

    panosk
    wrote on 15 May 2016, 16:31 last edited by
    #1

    I migrated a few regular expressions used in my application from QRegExp to QRegularExpression. After all was done, I was getting the non-fatal message

      QRegularExpressionPrivate::doMatch(): called on an invalid QRegularExpression object
    

    and after a little debugging I found that it happens when setPattern() is given the regexp "\u2029". The regexps are loaded at runtime from a file that contains many other ICU-style regexps like this one. How can I fix this?

    Thanks in advance.

      Paul Colby
      wrote on 15 May 2016, 21:27 last edited by
      #2

      Hi @panosk,

      The regexps are loaded in runtime from a file which contains many other ICU regexps like the one above.

      How are you loading the expressions from file, and then assigning them to QRegularExpression?

        SGaist
        Lifetime Qt Champion
        wrote on 15 May 2016, 21:29 last edited by
        #3

        Hi,

        Can you post a small code sample that triggers that?

        Interested in AI ? www.idiap.ch
        Please read the Qt Code of Conduct - https://forum.qt.io/topic/113070/qt-code-of-conduct

          panosk
          wrote on 16 May 2016, 08:13 last edited by
          #4

          @Paul-Colby said:

          Hi @panosk,

          The regexps are loaded in runtime from a file which contains many other ICU regexps like the one above.

          How are you loading the expressions from file, and then assigning them to QRegularExpression?

          The code is correct and has been working fine for quite some time; there's nothing wrong with the way the regexps are assigned. However, I was examining the wrong file: the correct file doesn't have as many ICU regexps as I thought, only 2, so the problem is trivial, as I can simply remove them. It would be nice to know why this happens, though.

          @SGaist said:

          Hi,

          Can you post a small code sample that triggers that?

          No special code is needed. You can try this and see the message:

          QRegExp oldRegexp("\\u2029"); // No complaints here
          if (oldRegexp.indexIn(someText) > -1)
              qDebug() << "Match";


          QRegularExpression newRegexp("\\u2029"); // It doesn't like this and the message appears
          QRegularExpressionMatch match = newRegexp.match(someText);
          if (match.hasMatch())
              qDebug() << "Match";
          
            panosk
            wrote on 16 May 2016, 08:41 last edited by panosk
            #5

            So, I think I found the problem. I created a file and pasted the U+2029 character into it from the character selector utility. Neither QRegExp nor QRegularExpression finds it if I use a double backslash, but only QRegularExpression warns about the problem. That is,

            QRegularExpression regexp("\\u2029")
            

            doesn't work, while

            QRegularExpression regexp("\u2029")
            

            finds the match.

            The problem is that when I retrieve the (correctly formatted, I think) regexp string "\u2029" from the file that contains my regexps, the backslash is escaped automatically, hence the problem. Maybe QRegExp and QRegularExpression should recognize such cases and not escape them, or maybe I'm missing something :-).
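
To make the difference between the two spellings concrete, here is a minimal sketch in plain C++ (no Qt required; the function names are made up for illustration) of what the compiler stores for each literal. With a doubled backslash the string holds six characters, a backslash followed by u2029, and that is what the regex engine has to interpret. With a single backslash the compiler substitutes the one character U+2029 before any regex engine is involved.

```cpp
#include <cassert>
#include <string>

// "\\u2029" in source: backslash, 'u', '2', '0', '2', '9' (six code units).
std::u16string escapedSpelling() { return u"\\u2029"; }

// "\u2029" in source: the compiler emits the single character U+2029.
std::u16string literalSpelling() { return u"\u2029"; }

// Illustrative checks of what each literal contains.
void demoLiterals()
{
    assert(escapedSpelling().size() == 6);
    assert(escapedSpelling()[0] == u'\\');
    assert(literalSpelling().size() == 1);
    assert(literalSpelling()[0] == char16_t(0x2029));
}
```

Text read from a file at runtime never passes through the compiler, so a file containing \u2029 yields the six-character form, and it is up to the regex engine to understand that escape.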

              SGaist
              Lifetime Qt Champion
              wrote on 16 May 2016, 21:26 last edited by
              #6

              You did. Your file contains an escape sequence that represents a Unicode character, not the character itself. The string loaded from the file therefore contains a literal backslash followed by u2029, exactly matching the file's content. If you want to load that character from a file, you must write the character itself in the file in the first place.

                panosk
                wrote on 17 May 2016, 07:39 last edited by
                #7

                @SGaist said:

                You did. You have a sequence that represents a Unicode character in your file. So the string resulting from the load will have that backslash escaped to match the content of the file. If you want to load that character from a file, you must write it as is in that file in the first place.

                I'm not so sure... The Unicode escape is used in a regular expression in a file, for example [\u00A0\s], and I want to load that regular expression from the file into QRegularExpression at runtime. Currently, there seems to be no way to do that. QRegularExpression seems to recognize the Unicode sequence when you write it directly in the constructor or in the setPattern() call, but it doesn't recognize it when it is loaded from a file, and it wrongly escapes it.

                  kshegunov
                  Moderators
                  wrote on 17 May 2016, 08:59 last edited by kshegunov
                  #8

                  @panosk

                  Perl and PCRE do not support the \uFFFF syntax. They use \x{FFFF} instead. You can omit leading zeros in the hexadecimal number between the curly braces. Since \x by itself is not a valid regex token, \x{1234} can never be confused to match \x 1234 times. It always matches the Unicode code point U+1234. \x{1234}{5678} will try to match code point U+1234 exactly 5678 times.

                  From here: http://www.regular-expressions.info/unicode.html

                  QRegularExpression is PCRE-based, so try specifying the code point correctly in your file/testing string. For example, try this:

                  QRegularExpression regexp("\\x{2029}"); // It should like this just fine
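
If the pattern file has to keep its ICU-style escapes, one option is to rewrite them when loading. The following is only a sketch in plain C++ under stated assumptions: `icuToPcreEscapes` is a hypothetical helper (not a Qt or PCRE function), it handles only the four-digit `\uXXXX` form, and it does not address any other ICU/PCRE syntax differences.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: rewrites \uXXXX escapes as PCRE-style \x{XXXX}.
// Any other backslash escape (including an escaped backslash, \\) is
// copied through unchanged, so the text \\u2029 is not converted.
std::string icuToPcreEscapes(const std::string &pattern)
{
    auto isHex = [](char c) {
        return std::isxdigit(static_cast<unsigned char>(c)) != 0;
    };

    std::string out;
    out.reserve(pattern.size() + 8);
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        // Ordinary character, or a lone backslash at the very end.
        if (pattern[i] != '\\' || i + 1 >= pattern.size()) {
            out += pattern[i];
            continue;
        }
        const bool isUnicodeEscape =
            pattern[i + 1] == 'u' && i + 5 < pattern.size() &&
            isHex(pattern[i + 2]) && isHex(pattern[i + 3]) &&
            isHex(pattern[i + 4]) && isHex(pattern[i + 5]);
        if (isUnicodeEscape) {
            out += "\\x{";
            out.append(pattern, i + 2, 4); // the four hex digits
            out += '}';
            i += 5;                        // skip 'u' and the digits
        } else {
            out += pattern[i];             // keep the backslash
            out += pattern[i + 1];         // and the escaped character
            ++i;
        }
    }
    return out;
}

// Illustrative check using the pattern mentioned earlier in the thread.
void demoConversion()
{
    assert(icuToPcreEscapes("[\\u00A0\\s]") == "[\\x{00A0}\\s]");
}
```

In a Qt application the result could then be handed to setPattern() via QString::fromStdString(), and isValid() checked afterwards.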
                  

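                  PS.
                  This

                  QRegularExpression regexp("\u2029")
                  

                  works because the compiler replaces \u2029 with the Unicode character itself, so the engine receives a literal U+2029 rather than an escape sequence. Assuming the string literal is encoded as UTF-8 (which is how Qt 5 decodes a const char * passed to QString), it would be equivalent to:

                  const char rx[] = "\xE2\x80\xA9"; // UTF-8 encoding of U+2029
                  QRegularExpression regexp(rx);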
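
For reference, U+2029 encodes to the three bytes E2 80 A9 in UTF-8 (not the two bytes 0x20 0x29, which would be a space followed by a closing parenthesis). A minimal sketch of the encoding in plain C++, where `encodeUtf8` is a hypothetical helper written for illustration and input validation (surrogates, values above U+10FFFF) is left out:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value; no validation is performed.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Illustrative check for the paragraph separator discussed in the thread.
void demoEncoding()
{
    assert(encodeUtf8(0x2029) == std::string("\xE2\x80\xA9"));
}
```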
                  

                  Kind regards.

                  Read and abide by the Qt Code of Conduct

                    panosk
                    wrote on 17 May 2016, 09:31 last edited by
                    #9

                    @kshegunov said:

                    QRegularExpression is PCRE based

                    Thank you very much for the clear explanation. So it seems this is the issue. Apart from these Unicode peculiarities, I don't think there are other major differences between the PCRE and ICU syntaxes, so I can convert the few instances of these Unicode escapes to the appropriate format.

                      kshegunov
                      Moderators
                      wrote on 17 May 2016, 09:32 last edited by
                      #10

                      @panosk
                      I suggest trying it out in code first, and if everything goes smoothly, then yes, you can replace the code points in your file.

                      Good luck!
