QTextDocument and ts_tree_edit
-
I am writing a text editor and I am using Qt for graphics. I use TreeSitter for syntax highlighting and other things. It is fast enough on small and medium-sized buffers, but when a buffer is very large, it can slow down because TreeSitter parses the entire thing whenever it changes.
TreeSitter has a function called ts_tree_edit which one must apply to a TSTree* before using that tree as the old_tree parameter when parsing. This makes parsing much faster, but ts_tree_edit requires the start position of the edit both as a point and as a byte offset, and it also requires the old and new end positions as points and byte offsets.
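For reference, the edit description has roughly the shape below. This is only a sketch: Point and InputEdit are local stand-ins that mirror the field names of TSPoint and TSInputEdit in tree_sitter/api.h so the snippet compiles without the library, and singleLineEdit is a hypothetical helper that only covers the easy case of an edit that stays on one line.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for TSPoint: zero-based row, column counted in bytes from the
// start of the line.
struct Point {
    std::uint32_t row;
    std::uint32_t column;
};

// Stand-in for TSInputEdit: every position is given twice, once as a byte
// offset and once as a (row, column) point.
struct InputEdit {
    std::uint32_t start_byte;    // where the edit begins
    std::uint32_t old_end_byte;  // end of the replaced range, before the edit
    std::uint32_t new_end_byte;  // end of the inserted range, after the edit
    Point start_point;
    Point old_end_point;
    Point new_end_point;
};

// Hypothetical helper: describes replacing removedBytes bytes with addedBytes
// bytes when neither the removed nor the added text contains a newline.
InputEdit singleLineEdit(Point start, std::uint32_t startByte,
                         std::uint32_t removedBytes, std::uint32_t addedBytes)
{
    // The column is a byte count within the line, so it can never exceed
    // the byte offset from the start of the buffer.
    assert(startByte >= start.column);

    InputEdit edit{};
    edit.start_byte = startByte;
    edit.old_end_byte = startByte + removedBytes;
    edit.new_end_byte = startByte + addedBytes;
    edit.start_point = start;
    edit.old_end_point = Point{start.row, start.column + removedBytes};
    edit.new_end_point = Point{start.row, start.column + addedBytes};
    return edit;
}
```

With the real header, the filled-in struct would be passed to ts_tree_edit, and the edited tree would then be handed to the parser as old_tree. The hard part is the multi-line case, where the old end point depends on text that is no longer in the document.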
QTextDocument has a contentsChange signal, which has the following signature:
void QTextDocument::contentsChange(int position, int charsRemoved, int charsAdded)
The problem is that when this signal fires, the contents have already changed, so I have no way of getting the old end point: a newline character could have been deleted and I would have no way of knowing. Similarly, I can't get the old end offset in bytes, because TreeSitter expects a UTF-8 encoded string in which some characters take more bytes than others, yet the callback only gives me the number of characters removed.
So does anyone know how I can use this TreeSitter functionality with QTextDocument? I haven't found anything about it online, which is why I am making this post. One thing I was considering is using the undo/redo functionality for this, but I am not sure if this will be efficient.
Thank you in advance for your help!
-
@eineskamelles Welcome to the forum.
I've never worked with tree-sitter, so I can't really help you with anything specific to its API. I can give you some tips on the Qt side of things.
The contentsChange signal is absolutely the correct way to go about syncing changes to a QTextDocument with an outside parser, and it actually gives you everything you need. The key is to interpret its parameters as saying that the text in the UTF-16 offset range [position, position + charsRemoved] of the document before the change should be replaced with the text that is in the UTF-16 offset range [position, position + charsAdded] of the document as it is now, after the change.
Regarding turning the UTF-16 offsets into UTF-8 offsets: if you are sure that Tree-Sitter absolutely requires the start/end byte offsets of the edited range to be in the UTF-8 encoding, Qt doesn't help you there. You need to maintain a global index to map between UTF-16 and UTF-8 positions in your document, and maintain it manually after edits (possibly within the same slot) - sorry. You can improve performance by keeping, for each text block, its UTF-8 start and end offsets. Though a cursory Google search suggests that Tree-Sitter natively supports UTF-16?
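To make the offset mapping concrete, here is a minimal sketch of the UTF-16 to UTF-8 conversion, not tied to Qt. It assumes the text is available as a std::u16string (standing in for the UTF-16 code units a QString holds); utf8Length and utf8Offset are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Number of UTF-8 bytes needed to encode the UTF-16 code units in
// [begin, end) of text.
std::size_t utf8Length(const std::u16string& text,
                       std::size_t begin, std::size_t end)
{
    assert(begin <= end && end <= text.size());
    std::size_t bytes = 0;
    for (std::size_t i = begin; i < end; ++i) {
        const char16_t u = text[i];
        if (u < 0x80) {
            bytes += 1;  // ASCII
        } else if (u < 0x800) {
            bytes += 2;
        } else if (u >= 0xD800 && u <= 0xDBFF && i + 1 < end
                   && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF) {
            bytes += 4;  // surrogate pair: one code point, four UTF-8 bytes
            ++i;         // skip the low surrogate
        } else {
            bytes += 3;  // rest of the BMP (a lone surrogate is counted as 3)
        }
    }
    return bytes;
}

// UTF-8 byte offset that corresponds to a UTF-16 position in text.
std::size_t utf8Offset(const std::u16string& text, std::size_t utf16Pos)
{
    return utf8Length(text, 0, utf16Pos);
}
```

This scans from the start of the text on every call, which is where the per-block cache comes in: with the UTF-8 start offset of each block stored, only the block containing the position has to be scanned. A position that falls between the two halves of a surrogate pair has no exact UTF-8 equivalent, so callers should avoid passing one.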
@eineskamelles said in QTextDocument and ts_tree_edit:
One thing I was considering is using the undo/redo functionality for this, but I am not sure if this will be efficient.
Don't bother, QTextDocument doesn't provide access to its undo stack, and even if it did, the QUndoCommand implementation that actually contains the change data is not exported at all (not even as private API).
-
@IgKh Thank you for your response.
I must admit that it hadn't occurred to me to use UTF-16 encoding in tree-sitter. Sorry about that. But I also need to provide the old end point (row, col) and the new end point to this function, and I cannot determine the old end point from those three parameters alone, because the removed characters may or may not include a newline.
Do you have any suggestions for determining the point of position + charsRemoved in the old document? Thanks
-
@eineskamelles said in QTextDocument and ts_tree_edit:
But I also need to provide the old end point (row, col) and new end point to this function, and I cannot determine the old end point based on those three parameters alone, because in those x characters removed there might be a newline or not.
Indeed you can't. It is strange that it would need both, but the information needed to obtain the block number and the offset within the block of a character position in the pre-edit document is no longer available from the QTextDocument. The recourse here is to maintain a copy yourself... For each QTextBlock number (not the object itself, since it can be destroyed), save your own copy of its position and length, and recreate it after edits.
-
@IgKh I have measured the time it takes for the undo() and redo() methods, and together it is about 30ms on a 250.000 file, so this might be doable to get the points. The documentation comment on the function says:
/**
 * Edit the syntax tree to keep it in sync with source code that has been
 * edited.
 *
 * You must describe the edit both in terms of byte offsets and in terms of
 * (row, column) coordinates.
 */
Although according to some GitHub issues, it seems like using bytes only might work too, but it is not guaranteed to work in the future. Also one last question, are you sure that charsAdded and charsRemoved have to be added to position? Because when I create a QTextCursor on the added position it says that the index is out of range. Thanks.
-
@eineskamelles said in QTextDocument and ts_tree_edit:
Also one last question, are you sure that charsAdded and charsRemoved have to be added to position? Because when I create a QTextCursor on the added position it says that the index is out of range
Yes, position + charsAdded is a valid cursor position pointing to immediately after the last character inserted as part of the edit. position + charsRemoved is not necessarily a valid position in the modified document, of course. If that doesn't work for you please post an example.
@eineskamelles said in QTextDocument and ts_tree_edit:
I have measured the time it takes for undo() and redo() methods and together
I wouldn't do that. The interaction between the undo stack and the contentsChange signal is problematic, and will break in subtle ways, for example if you use edit blocks. You could do such things in response to the undoCommandAdded signal, but that is probably not helpful for your use case.
-
@IgKh I had some internal error and that is why it was off. I fixed it now. Thank you for your help.
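A minimal sketch of the shadow-copy approach suggested above, under a few assumptions: the shadow is a plain std::u16string holding the document text as it was before the change, lines are separated by '\n', and columns are counted in UTF-16 code units. pointAt and applyChange are hypothetical helpers, not Qt or tree-sitter API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

struct Point {
    std::size_t row;     // zero-based line number
    std::size_t column;  // UTF-16 code units since the start of the line
};

// Row and column of a UTF-16 offset, found by counting '\n' before it.
Point pointAt(const std::u16string& text, std::size_t pos)
{
    assert(pos <= text.size());
    Point p{0, 0};
    for (std::size_t i = 0; i < pos; ++i) {
        if (text[i] == u'\n') {
            ++p.row;
            p.column = 0;
        } else {
            ++p.column;
        }
    }
    return p;
}

struct EditPoints {
    Point start;
    Point oldEnd;
    Point newEnd;
};

// Computes the three points for one change, then brings the shadow copy up
// to date. addedText is the text now found in the document in the range
// [position, position + charsAdded).
EditPoints applyChange(std::u16string& shadow, std::size_t position,
                       std::size_t charsRemoved,
                       const std::u16string& addedText)
{
    assert(position + charsRemoved <= shadow.size());
    const Point start = pointAt(shadow, position);
    // The old end point is only recoverable because the shadow still holds
    // the pre-edit text.
    const Point oldEnd = pointAt(shadow, position + charsRemoved);
    shadow.replace(position, charsRemoved, addedText);
    const Point newEnd = pointAt(shadow, position + addedText.size());
    return EditPoints{start, oldEnd, newEnd};
}
```

Called from the contentsChange slot, this yields the old end point without touching the undo stack. Scanning from the start of the text is linear in the buffer size, so for very large buffers a stored position and length per block number, as described earlier in the thread, would replace the scan. If the parser is fed UTF-16, the column presumably has to be doubled to get a byte count.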