QTreeView with lots of items is really slow. Can it be optimised or is something buggy?

Cipherstream

@Christian-Ehrlicher
I am new to Qt so I am sure you know more than me. When I looked at the code for selectAll() it looked like it was iterating over all rows on the root. Wouldn't all rows in a listview be on the root? ie if you had a listview with 10 million items, wouldn't all 10 million items be rows "on the root"? Which would mean that rowCount() would get called 10 million times when calling selectAll()?

Cipherstream

@JonB
I'll answer your shorter post first, since it is quicker :)

Yes a display with 70 million items could indeed take up a lot of ram. However due to the model/view paradigm you can potentially look up the data as needed and not need to keep any items in memory.

You might then say "that sounds like a great case for fetch more", however I did initially try using fetch more and as far as I could tell it was an iterative fetcher from index 0 upwards. So to get the last item in 70m items it would still have to fetch all 70m items.

If instead I have a fileformat on disk with a header that says "this file has 70m items", I can just read in the header and know how many items my view needs to handle. The user then scrolls the scroll bar to the location they want to look at and only that data will then be accessed. So they could do ctrl+end to go to the end of the list and the model will only access the items displayed on the last "page" of the view.

This does indeed work well like this. If you let this initial slow setup part finish and then attempt to move through the data, you can see that it only fetches the minimal amount of data needed.

This is for an app similar to wireshark where you may indeed need to have millions of items/packet in the one view. Funnily enough while looking into this issue I came across the developers of wireshark also talking about this problem (as they also use Qt it seems) and arguing that yes they do indeed to handle that much data in a view, however they are not doing it in python hehe.

Cipherstream

@JonB
Your tests are a good idea to look at getting real world timing for these calls. However the point is that I can't think of any reason why rowCount() or columnCount() needs to be called 20m times each if I have 10m items. Whatever code is doing this is imo the code that needs fixing.

Your timing test results suggest that calling them so many times is taking at least 10 of the 17 seconds. Therefore optimising whatever is calling them so many times to only call once and cache the result value would immediately improve the time taken by 10 seconds. A 59% speed increase.

For the elephants in the room :)

Yes python is slower. However the simple fact that tableview is able to handle the processing instantly suggests that python speed in general is not the issue. I agreee that expecting Qt code to be changed just for the sake of python wrapper speed may be a bit of an ask. However I also feel that a model implementation that is making a linear amount of these calls based on the number of items is faulty. The C++ implementation by @IgKh above shows listview and treeview takes 7 times longer than tableview which also suggests something is not right. However since the "7 times longer" value is only 2 seconds it can be ignored.
It seems I answered this in my previous reply to you. I don't agree with the blanket statement that "10 million items in a UI view is too much". Yes it is a lot and most of the time it is not needed. But like a lot of things there are times where it is needed. The "paging" mechanism is exactly what the model/view paradigm is for! And indeed only "pages" of the data are actually accessed at a time when these UI views are in use. It is only during startup that some of these views are (incorrectly imo) doing some kind of processing for every single item.

Yes many calls to rowCount() is expensive. I don't see a reason why that many are done at startup. I hope to track down the cause to see if there is a valid reason :)

Cipherstream

Since it appeared that the cause of the issue was in the C++ Qt code I installed Qt C++ dev env and was able to build and debug the code that @IgKh provided.

I tracked down the cause of the many calls to rowCount and columnCount for the list view. They were coming from line 2600 onwards in:
https://code.qt.io/cgit/qt/qtbase.git/tree/src/widgets/itemviews/qlistview.cpp?h=6.8

The method QListModeViewBase::doStaticLayout() iterates over every row in the model to precalculate offsets for items in the list. This code does an optimisation where if gridSize has been set it will use the grid size for all items instead of calculating it for every single item. I confirmed that this optimisation had an effect by adding the line self.view.setGridSize( QtCore.QSize(18,18) ) to my python code.

The list view already has another existing optimisation for item sizes enabled by doing self.view.setUniformItemSizes(True). This "fixed item size" flag should also be used in this setup code in the same way the fixed size from a grid setting is used.

However if you remember, list view was calling rowCount and columnCount twice for every item in the model. The 2nd calls are done in this same method when checking if the row is hidden. The same loop that iterates over all items in the model also does an "is hidden" check on every item. The "is hidden" check looks to see if each item is present in the list of "hidden rows". An optimisation that can be done for this is to check if that list of hidden rows is empty, in which case no items are hidden and so it doesn't need to individually check whether every single item is hidden.

The first optimisation makes sense since if a "uniform item size" optimisation flag is set, then the code shouldn't check every single item in the list for its size!
The second optimisation might be up for debate as to whether it should be included or not. It does make a drastic change in processing time for python list views with lots of items however.

It turns out that Qt is very easy to build from source on windows, so I was able to make a custom Qt build and test my optimisations with my original python code.

Without either optimisation:
time taken to display list with 70000000 items = 358.775 seconds

With "size" only optimisation:
time taken to display list with 70000000 items = 198.576 seconds

With "size" and "hidden" optimisations:
time taken to display list with 70000000 items = 0.992 seconds

So the python code went from taking 6 minutes preprocessing before showing the view to just under 1 second.

Christian Ehrlicher

Wrt the rowCount calls: https://codereview.qt-project.org/c/qt/qtbase/+/601341

Cipherstream

I looked into optimising the TreeView and unfortunately I can't see any way to optimise it without a major rewrite. :(

I assumed there was an issue with ListView as it did not make sense that a 1D list of items would take longer than the 2D TableView. This assumption turned out to be correct, and I was hoping that upon finding the fix it may also apply to TreeView.

The TreeView slowdown also occurs during its layout function QTreeViewPrivate::layout() where it loops over every row calling model->index() directly and model->rowCount.() via QTreeViewPrivate::hasVisibleChildren(). It is again the huge number of calls to these model functions across the C++/python barrier (due to the model being in python code) that cause the slowdown.

Interestingly while looking into the TreeView code I saw that it did implement an "is hidden" optimisation like I was suggesting for ListView. Based off that I think both ListView optimisations I suggested should be added to the source code. I'll add a post below with the changes.

Cipherstream

@Christian-Ehrlicher

These are the two optimisations that I feel should be made for ListView. Would you be able to submit them for me?
(Feel free to change them as needed)

1) uniform / fixed size optimisation

qtbase\src\widgets\itemviews\qlistview.cpp: line 2561
original:

void QListModeViewBase::doStaticLayout(const QListViewLayoutInfo &info)
{
    const bool useItemSize = !info.grid.isValid();
    const QPoint topLeft = initStaticLayout(info);
    QStyleOptionViewItem option;
    initViewItemOption(&option);
    option.rect = info.bounds;
    option.rect.adjust(info.spacing, info.spacing, -info.spacing, -info.spacing);

    // The static layout data structures are as follows:
    // One vector contains the coordinate in the direction of layout flow.
    // Another vector contains the coordinates of the segments.
    // A third vector contains the index (model row) of the first item
    // of each segment.

    int segStartPosition;
    int segEndPosition;
    int deltaFlowPosition;
    int deltaSegPosition;
    int deltaSegHint;
    int flowPosition;
    int segPosition;

    if (info.flow == QListView::LeftToRight) {
        segStartPosition = info.bounds.left();
        segEndPosition = info.bounds.width();
        flowPosition = topLeft.x();
        segPosition = topLeft.y();
        deltaFlowPosition = info.grid.width(); // dx
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : info.grid.height(); // dy
        deltaSegHint = info.grid.height();
    } else { // flow == QListView::TopToBottom
        segStartPosition = info.bounds.top();
        segEndPosition = info.bounds.height();
        flowPosition = topLeft.y();
        segPosition = topLeft.x();
        deltaFlowPosition = info.grid.height(); // dy
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : info.grid.width(); // dx
        deltaSegHint = info.grid.width();
    }

    for (int row = info.first; row <= info.last; ++row) {
        if (isHidden(row)) { // ###
            flowPositions.append(flowPosition);
        } else {
            // if we are not using a grid, we need to find the deltas

optimised

void QListModeViewBase::doStaticLayout(const QListViewLayoutInfo &info)
{
    const bool useItemSize = !info.grid.isValid() && !uniformItemSizes();
    const QPoint topLeft = initStaticLayout(info);
    QStyleOptionViewItem option;
    initViewItemOption(&option);
    option.rect = info.bounds;
    option.rect.adjust(info.spacing, info.spacing, -info.spacing, -info.spacing);
    const QSize uniformSize = (uniformItemSizes() && cachedItemSize().isValid())
                        ? cachedItemSize() : itemSize(option, modelIndex(info.first));
    const QSize fixedSize = info.grid.isValid() ? info.grid : uniformSize;

    // The static layout data structures are as follows:
    // One vector contains the coordinate in the direction of layout flow.
    // Another vector contains the coordinates of the segments.
    // A third vector contains the index (model row) of the first item
    // of each segment.

    int segStartPosition;
    int segEndPosition;
    int deltaFlowPosition;
    int deltaSegPosition;
    int deltaSegHint;
    int flowPosition;
    int segPosition;

    if (info.flow == QListView::LeftToRight) {
        segStartPosition = info.bounds.left();
        segEndPosition = info.bounds.width();
        flowPosition = topLeft.x();
        segPosition = topLeft.y();
        deltaFlowPosition = fixedSize.width(); // dx
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : fixedSize.height(); // dy
        deltaSegHint = fixedSize.height();
    } else { // flow == QListView::TopToBottom
        segStartPosition = info.bounds.top();
        segEndPosition = info.bounds.height();
        flowPosition = topLeft.y();
        segPosition = topLeft.x();
        deltaFlowPosition = fixedSize.height(); // dy
        deltaSegPosition = useItemSize ? batchSavedDeltaSeg : fixedSize.width(); // dx
        deltaSegHint = fixedSize.width();
    }

    for (int row = info.first; row <= info.last; ++row) {
        if (isHidden(row)) { // ###
            flowPositions.append(flowPosition);
        } else {
            // if we are not using a fixed size, we need to find the deltas

2) is hidden optimisation

qtbase\src\widgets\itemviews\qlistview_p.h: line 358
original:

    inline bool isHidden(int row) const {
        QModelIndex idx = model->index(row, 0, root);
        return isPersistent(idx) && hiddenRows.contains(idx);
    }

optimised:

    inline bool isHidden(int row) const {
        if (hiddenRows.isEmpty())
            return false;
        QModelIndex idx = model->index(row, 0, root);
        return isPersistent(idx) && hiddenRows.contains(idx);
    }

Explanation:

QListModeViewBase::doStaticLayout() is called when the view is initially setup or resized. It loops over every item in the model calculating item layout based off the size and hidden attributes for each item. Currently the view is accessing the model once per item to calculate the size and then a second time per item when checking if it is hidden.

When using python wrappers, calls between C++ code and python code is not as "cheap" as calls between C++ code. This potentially applies to other wrappers for Qt too.

Now look at the case where you have a list view in C++ code accessing a model in python code, and the model has 10 million items. This will do 20 million calls beteen C++ and python code every time QListModeViewBase::doStaticLayout() is called. This creates a huge lag when initially showing or resizing a list view.

Optimisation Recommendations:

QListView already has an optimisation flag for 'uniformItemSizes'. If this is set then it can assume all items are the same size and should not have to get the size individually for each item. This flag and behaviour should be used in QListModeViewBase::doStaticLayout().
QTreeView already has an optimisation for QTreeViewPrivate::isRowHidden() which first checks if the list of hidden rows is empty, in which case there are no hidden rows and so no need to check each individual item to see if they are hidden. This behaviour should be added to QListViewPrivate::isHidden().

Optimisation Test Results:

A listview and model were created with 70 million items.
The time taken by doStaticLayout() to process this before showing it was:

Using python wrappers:

359 seconds before optimisation
199 seconds with "fixed size" optimisation
1 second with "fixed size" and "is hidden" optimisation

In C++ code:

2.2 seconds before optimisation
1.5 seconds after both optimisations

So even the C++ code gets a 30% speed boost.
This was on a pretty fast PC, on a slower device the 30% speed boost in C++ code might be more meaningful.

Cipherstream

While looking through the ListView code I realised that it also had item limits due to INT_MAX size (ie the largest value a signed 32bit integer can hold).

This occurs due to the values calculated and stored in QList<int> QListModeViewBase::flowPositions;. These values are basically item index * row height (for a normal top to bottom list).

The default row height appears to be 18 and the value for INT_MAX is 2147483647. So 2147483647 / 18 = 119,304,647

So the maximum number of items for ListView is just under 120 million items. I tested this and indeed it fails to handle 120m items.

A fix for this might be to use 64bit signed integers like
```QList<int64_t> QListModeViewBase::flowPositions;``, and then check anywhere where flowPositions values are handled to make sure that are treated correctly for 64bit values.

JonB

@Cipherstream
Your work on tracking down and optimising has been amazing. I would not want in any way to detract from that, but may I make a couple of general observations.

You keep pointing the figure at "calls between C++ and python code" as being a big issue. But earlier I found and suggested to you this is not the case? Scroll up to my https://forum.qt.io/post/813335 where I claim timings show there is "no" overhead across the language boundaries. If you do millions of direct, Python calls to some standalone nonVirtualRowCount() function, no C++ involved, I find exactly the same timing.
You were only able to improve (some of) the speed by being prepared to go into C++ and change the source of Qt. That improved C++ by 30%, which is absolutely not to be sniffed at, but Python by a factor of 360x! That indicates to me that Python has a problem, and in general Qt is written for C++. Nobody notices the overhead of calling rowCount() because it's either inlined or so fast as to be insignificant. But if it is indeed the case that calling a simple, tiny Python function is that much slower than C++ (which I did not know and am horrified at the timings) then you may always encounter other places where something innocuous in C++ is turning out to be horrific in Python.

As I say, not in any way to dismiss the case you have examined and the enormous improvement you have achieved by changing the C++ code. Just that there may always be issues from Python if speed is critical.

IgKh

@JonB said in QTreeView with lots of items is really slow. Can it be optimised or is something buggy?:

which I did not know and am horrified at the timings

A kind of poorly kept secret is that CPython (which is pretty much is Python) is considered to be a mere "reference implementation" and its' maintainers put implementation simplicity above performance, or at least that was the case in the past. The implication of this is that CPython is horrendously slow, even for a dynamic scripting language. There is no JIT, every line of bytecode is interpreted as it is executed. There is no inlining, function and method calls are dynamically dispatched by looking for string keys in a series of dictionaries for every call. Absolutely every little thing is allocated on the heap with the full PyObject overhead and has reference counting. FFI calls into Python require acquiring the Global Interpreter Lock, which is not huge overhead on your typical GUI app (as the GIL is normally not contended), but is not free.

This situation led to the fact that most high-performance Python toolkits and libraries are implemented in C, Fortran, C++ or Rust and are exposed via bindings. PyQt / PySide included. For a Wireshark-like application like @Cipherstream is describing, that's probably the approach that will be the path of least resistance too - write the data layer in C++ (the data model, and the custom delegate which you'll probably end up needing) and expose those to Python via Shiboken. Leave to Python the overall construction and management of the UI, where it has the relative advantage.

Cipherstream

@JonB
Yes it seems python really shows its speed limitations once you start calling things 10s of millions of times xD

You are right with your timings showing that python calls by themselves are indeed slow. In my last writeup I am trying to point out that when rowCount etc are called within C++ code, there is minimal cost. So the devs might easily overlook (or not care) about calling them one or more times for every item in the model. It it not until these calls are routed out of the fast C++ code and into the slow python code that it becomes really noticable.

With that said, the fact that there is an optimisation flag for "uniform sizes" means that it must have come up as an issue at some point. So it makes sense for the code to use that same flag to skip variable size processing during layout too. The same goes for the "is hidden" processing where treeview already implements the optimisation I put forward for use in llistview. So it must also have come up as being something worth optimising in the past.

Cipherstream

@IgKh
I have been tossing up how to proceed, since I do need to use a treeview for some parts of my app. I wanted to make the app in python so that it would easily work cross platform.

Stay 100% python and just have slow load times for big files. Smaller files are still very quick.
Have C++ module for bottleneck code (like you suggest), but then once there is some C++ code it makes it less easily cross platform compatible.
Make it fully C++

I am not really happy with any of the options hehe, so progress has stalled for now. Maybe if I went with option 2 but made the C++ code be part of a pip installable module, it would still stay easily cross platform...

IgKh

@Cipherstream Option #2 is really not a big issue. More Python packages than you'd guess have a compiled component, and Python packaging tools handle that case well by building binary wheels. (As much as anything about the Python packaging ecosystem can be described as "well")