Slow repository browsing in 1.14.x #15707

Closed · 2 of 6 tasks
tsowa opened this issue May 3, 2021 · 87 comments
Labels: performance/bigrepo (Performance Issues affecting Big Repositories), performance/speed (performance issues with slowdowns)

@tsowa commented May 3, 2021

Description

I saw a similar thread, but it has "windows" in the title, so I am creating a new issue. Gitea 1.14.x is much slower at repository browsing than Gitea 1.13.

Sample repo running with 1.14.1:
https://gitea.ttmath.org/FreeBSD/ports
Try to open any directory, for example:
https://gitea.ttmath.org/FreeBSD/ports/src/branch/main/audio
It takes between 50 and 150 seconds to open a page.

The same repo running with 1.13.7:
https://giteaold.ttmath.org/FreeBSD/ports
Try to open similar directory, for example:
https://giteaold.ttmath.org/FreeBSD/ports/src/branch/main/audio
It takes about 5 seconds.

You can see the same problem on try.gitea.io:
https://try.gitea.io/tsowa/FreeBSD_ports
But that instance has a cache, so you have to find a directory that was not opened before. Opening such a page takes 100-300 seconds.

Let me know if more info is needed.

@zeripath (Contributor) commented May 3, 2021

This is because the algorithm was changed in 1.14 due to a problem with go-git causing significant memory issues. Thank you for the test cases though, because they will provide tests to improve the current algorithm.

If you are suffering significant slowdowns here, you can switch back to the go-git build by adding gogit to your TAGS when building.

We would otherwise appreciate help in improving the performance of the algorithm for the pure git version.
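
For example, with the standard Makefile build that would look something like this (the bindata tag here is just the usual default; the key part is gogit):

$ TAGS="bindata gogit" make build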

@tsowa (Author) commented May 10, 2021

Thanks for the hint with TAGS. I don't have time to make more tests now, but I found something interesting.

When browsing my repository with Gitea, I see the following git processes in htop:

22304 root       20   0 12876  2100 S  0.0  0.0  0:00.00 daemon: /usr/bin/env[22305]
22305 git2       31   0  926M  254M S 136.  0.8  1:29.80 └─ /usr/local/sbin/gitea web
22839 git2       21   0  952M  158M S  3.3  0.5  0:01.11    ├─ /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad
22840 git2       27   0 1103M  275M S 13.5  0.8  0:04.59    └─ /usr/local/bin/git -c credential.helper= -c protocol.version=2 cat-file --batch

These processes were running for about one minute, so I ran the first git command by hand:

$ cd /var/db/gitea2/gitea-repositories/freebsd/ports.git
$ /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad | wc -l

and it gave me 1087346 rows. I suppose these million-plus rows are then passed to the second git process.

I piped the output of the first git command into the second:

$ /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad | /usr/local/bin/git -c credential.helper= -c protocol.version=2 cat-file --batch > swinka.txt

It takes about 15 seconds, and the resulting file swinka.txt is larger than 1 GB:

$ ll -h swinka.txt 
-rw-r--r--  1 git2  git2   1,4G 10 maj 22:47 swinka.txt

So there is a lot of data to pass between Gitea and git. The question is: does the first git process really need to return a million rows?

@zeripath (Contributor) commented

@tsowa unfortunately yes, but it should be relatively fast - the issue is that the structure of some repos will actually require that million rows to be checked more than a few times. Determining which commit a file relates to is not a simple task in git - and although there's a commit-graph, we don't have a good way of querying it.

(It shouldn't take 15s to pipe those two commands together - you're slowing things down by allocating file space - you should redirect the output to /dev/null, btw.)
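
E.g. the same pipeline from above, just timed and with the output discarded:

$ time git -c credential.helper= -c protocol.version=2 rev-list --format=%T 9ea557779ce520c206f223f6f7b48fcc52f92dad \
    | git -c credential.helper= -c protocol.version=2 cat-file --batch > /dev/null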

There are a few more improvements that can be made to that function - for a start, the function is not optimised for our collapsing of directories containing a single document - and writing a commit-graph reader would be part of that.

The gogit backend does have a commit-graph reader, but it is not at all frugal with memory. I need to spend some time making a reader that is much more frugal and stream-like, but I haven't had the time. (See the technical docs: https://github.com/git/git/blob/master/Documentation/technical/commit-graph.txt)

In the end, though, we need to move rendering of last commit info out of repo browsing and into an AJAX call. Again, something I haven't had time to do.

@zeripath (Contributor) commented

One question - have you disabled the commit cache? If so please re-enable it.

@noerw added the performance/bigrepo (Performance Issues affecting Big Repositories) and performance/speed (performance issues with slowdowns) labels May 13, 2021
@tsowa (Author) commented May 15, 2021

It was enabled by default, but the 'adapter' option was set to 'memory'. Now I have installed memcached and changed the adapter to 'memcache', and the difference is visible.
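
For reference, the change boils down to this in app.ini (the host/port here are just where my local memcached listens):

[cache]
ENABLED = true
ADAPTER = memcache
HOST    = localhost:11211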

Opening https://gitea.ttmath.org/FreeBSD/ports for the first time took 79766ms and for the second time only 3063ms. Opening https://gitea.ttmath.org/FreeBSD/ports/src/branch/main/audio for the first time 141221ms and later 37205ms.

But I see that you are spawning a lot of git processes. I created a small git wrapper like this:

#include <unistd.h>
#include <fstream>
#include <iostream>

// Wrapper: log every git invocation, then exec the real git binary.
int main(int argc, char * argv[], char * envp[])
{
    std::ofstream file("/home/tomek/git.log", std::ios_base::out | std::ios_base::app);

    if( file )
    {
        file << "git ";

        // Append the full argument vector of this invocation.
        for(size_t i=0 ; argv[i] ; ++i)
        {
            file << argv[i] << " ";
        }

        file << std::endl;
        file.close();
    }

    // Hand control to the original binary with the same arguments and environment.
    return execve("/usr/local/bin/git.org", argv, envp);
}

I moved the original /usr/local/bin/git to /usr/local/bin/git.org and compiled the above program as /usr/local/bin/git. It gives me a git.log with all git operations, and I see that sometimes you are calling the git binary over 300 times in a single request:

$ cat ~/git.log | wc -l
     335

So it cannot be fast; this reminds me of the old days when we were using CGI scripts. Is there a reason you call git directly instead of using a git library such as libgit2?
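
For instance, assuming the log format produced by the wrapper above (the literal "git " prefix followed by the full argv), the invocations can be grouped by subcommand with something like:

$ awk '{ for (i = 3; i <= NF; i++) { if ($i == "-c") { i++; continue } print $i; break } }' ~/git.log | sort | uniq -c | sort -rn

which skips the -c key=value pairs and counts the first real subcommand of each invocation.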

@lunny (Member) commented May 15, 2021

Could you also count which git commands Gitea invoked in those 335 invocations?

This is because, when browsing, Gitea gets the last commit message for every directory/file shown in the UI. For v1.13.0 we used go-git, which is a pure Go git library; for v1.14.x we have two builds because the library has some memory problems (notably on Windows). Maybe you could compile the go-git version yourself to check what the difference is between them.

@fnetX (Contributor) commented May 28, 2021

Hello there, looking at this from Codeberg's perspective (issue)

As you can see, we're also suffering from the slow repository browsing, which affects the overall performance of our machine. While setting up a Redis cache works well for us, we would like to improve the initial generation on cache misses, too.
Today we tried to find some more information about the bottleneck; I hope this is useful for you:

We suspect especially this command, where each folder is checked for the latest commit, executing /usr/bin/git -c credential.helper= -c protocol.version=2 -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= rev-list --format=%T <commit>, which has terrible performance (it loads the entire commit history).

The idea makes sense to us: get all commits and check whether the folder or file was touched. But the logic without gogit doesn't appear to stop after all necessary information has been loaded; rather, it continues serving up the entire history, even if all files and subfolders in a folder have already been hit by a recent commit.
While we don't completely understand the gogit logic yet, it appears to be a little smarter here, only looking as far back in history as necessary to retrieve the information it needs.

We assume the process could be stopped early, once all the required information has been gathered.

It looks like there's a lot that could be improved in the native git backend, and some actions will probably always be slower because they cannot directly interface with git operations (e.g. working directly on git results while they are fetched, instead of on the piped input). It might be a good idea to switch back to gogit for all systems once the memory issues are resolved, or to look for another git library that is more native and maybe faster than gogit but offers a better interface (Go bindings for libgit2?).

Please let us know if we can provide further assistance in improving this performance issue.

Some other random observations that might be interesting to you:

  • multiple requests for the same resource try to generate it concurrently on a cache miss; the operation doesn't get queued (thus, requesting the same page twice results in generating it twice until the cache is filled)
  • git processes aren't stopped when the requesting TCP connection is closed
    • if a connection times out (proxy), the process keeps running in the backend
    • a user reloading the page easily doubles the resources for this operation
    • it's possible to DoS a huge server by simply spamming F5 (reload) on a page with a cache miss, at minimal cost on the attacker's side (no need to keep the connection open)
  • initial generation with this method will always be very slow if some files (e.g. a README, LICENCE, .gitignore, .gitattributes, .dockerignore, etc.) weren't touched for a long time
  • pushing to a repo invalidates the full Redis cache, even if only part of the information changed (e.g. a subfolder was updated), so active repos won't profit very much from the cache
  • git commands that time out (via the timeout configured in Gitea) remain listed as running in the Gitea admin monitoring section, although the processes no longer exist on the system

@zeripath (Contributor) commented

Take a look at #15891

@zeripath (Contributor) commented May 28, 2021

@fnetX thanks for your long comment.


It's worth remembering that the issue precipitating the pure git backend was memory use. I've submitted a patch to go-git which should cause much lower memory load. Until that is in and working correctly, go-git will happily load huge blobs into memory - storing them in caches even when you only want to check the size of the object. It's really worth being clear that that is an absolutely intolerable situation.

Further, the issues you highlight in the last section are not new to the native git backend. They're present in the go-git backend too, just in a way you can't track. I've long advocated for changing to a more GitLab-like approach for this and/or for passing request contexts down to terminate things - I'm really happy to work on this - but I haven't had a chance to do it - and, to be honest, none of you are paying me.


Now, on to the get-last-commit algorithm.

All algorithms have a balance between memory and time. The current algorithm is highly optimised against memory use. If we are happy to use more memory, that can be improved.

We suspect especially this command, where each folder is checked for the latest commit, executing /usr/bin/git -c credential.helper= -c protocol.version=2 -c filter.lfs.required= -c filter.lfs.smudge= -c filter.lfs.clean= rev-list --format=%T <commit>, which has terrible performance (it loads the entire commit history).

The idea makes sense to us: get all commits and check whether the folder or file was touched. But the logic without gogit doesn't appear to stop after all necessary information has been loaded; rather, it continues serving up the entire history, even if all files and subfolders in a folder have already been hit by a recent commit.
While we don't completely understand the gogit logic yet, it appears to be a little smarter here, only looking as far back in history as necessary to retrieve the information it needs.

Looking at the length of time the rev-list process runs is a bit of a distractor. Yes, the go-git process can stop once it's finished looking at all the parents and the paths, but it's a question of the memory and time spent tracking those parents. git rev-list avoids tracking them - and grabbing the root tree saved a lot of time - but we could add %P to the format to allow tracking of parents, and could then allow termination once all appropriate parents are determined - I just don't think it's the primary cause of the delays.
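
A sketch of what that could look like - the current code only requests the root tree with %T; adding %P would emit each commit's parents too:

$ git rev-list --format='%T %P' 9ea557779ce520c206f223f6f7b48fcc52f92dad

Each commit line is then followed by its root tree ID and its parent IDs, so a reader could track which parents are still of interest and stop once none remain.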

The greatest speed-up in #15891 actually comes from preemptively passing the next tree ID to the git cat-file process as soon as we know what it's going to be. A large proportion of the time appears to be spent waiting for Go to fill the read buffer from the other process. This is where the go-git algorithm can be quicker, as it avoids that by reading files in directly.

Some other random observations that might be interesting to you:

  • multiple requests for the same resource try to generate it concurrently on a cache miss; the operation doesn't get queued (thus, requesting the same page twice results in generating it twice until the cache is filled)

  • git processes aren't stopped when the requesting TCP connection is closed

    • if a connection times out (proxy), the process keeps running in the backend
    • a user reloading the page easily doubles the resources for this operation
    • it's possible to DoS a huge server by simply spamming F5 (reload) on a page with a cache miss, at minimal cost on the attacker's side (no need to keep the connection open)
  • initial generation with this method will always be very slow if some files (e.g. a README, LICENCE, .gitignore, .gitattributes, .dockerignore, etc.) weren't touched for a long time

  • pushing to a repo invalidates the full Redis cache, even if only part of the information changed (e.g. a subfolder was updated), so active repos won't profit very much from the cache

  • git commands that time out (via the timeout configured in Gitea) remain listed as running in the Gitea admin monitoring section, although the processes no longer exist on the system

These are all longstanding issues and I am aware of them. I would love to spend time fixing these but I am limited in my time and availability.


Honestly I wish you'd just talked to me directly. I'm always on Discord and could have told you and kept you abreast of what was going on and my progress in trying to speed this up.

@fnetX (Contributor) commented May 28, 2021

Hey, thank you very much for the explanation.

I've submitted a patch to go-git which should cause much lower memory load. Until that is in and working correctly ...

I somehow thought this was already in and just needed some further improvements, my bad.

and to be honest none of you are paying me.

Yes, we can mainly offer our thanks, as long as we aren't paid for anything either. ❤️
But let's see if we can figure something out.

Honestly I wish you'd just talked to me directly. I'm always on Discord

Yeah, the others told me that too. But since Discord is a proprietary app that kept crashing my computer back when I last used it, I decided against that and went for dumping our findings somewhere, hoping they would be of some use. I chose this issue over the thread on Codeberg as it seemed to fit this topic better.

I think I know how to seriously improve this

Sounds like good news. Please tell us if there's anything we can do.

@zeripath (Contributor) commented

oh my - I think I know how to seriously improve this. I think I've been way too distracted by the way it was done in the go-git implementation and there's genuinely a much quicker way to do this.

Unfortunately this doesn't work.

The idea was to use git log --format=%H --raw -t --no-abbrev --reverse COMMIT_ID -- paths but I can't come up with a way to stop it from listing the contents of the trees - meaning that it takes even longer.

If I could figure out a way to not list the contents of the trees this would be definitely faster than the go-git version.

@fnetX (Contributor) commented May 28, 2021

What about adding -n 1? Correct me if I am completely mistaken - I neither fully understand the Gitea backend yet, nor do I know how git works internally - but this seems to give you the latest commit of a path, and it gives the same result as Gitea currently does.

@fnetX (Contributor) commented May 28, 2021

Oh, you probably still want to have the full list for the folder you're looking at, just not for all the subfolders?

@zeripath (Contributor) commented May 29, 2021

Yeah - I mean, if we could just do that n times then it would be easy and fine, but it's not like that.

Also, it's not quite -n1. Consider the following commit graph:

         H
       /   \
      D     E
      |     |
      C     F
      |
      B
      |
      A

Say a file wibble becomes the object with SHA deadbeef at B and at E. The correct commit to report is B, not E.

So -n1 is still not right. git describe will give the correct answer, but it's too slow to run n times.

@zeripath (Contributor) commented

Could you test #15891? In my limited testing this is faster for the root directory than the go-git native version. There is still a slowdown problem in the subdirectories.

@zeripath (Contributor) commented

@fnetX - I've just made another improvement in #15891 which should solve the subdirectories problem.

@fnetX (Contributor) commented May 30, 2021

Thank you. We haven't yet been able to properly backport it to our fork / rebase our patches onto this pull. We'll look into it and test then.

@tsowa (Author) commented May 30, 2021

@zeripath Thanks, now testing bd1455a from your repo (cache is disabled):
https://giteanew.ttmath.org/FreeBSD/ports

The speedup is visible; browsing directories is about 5 times faster than 1.14.x. Not as fast as cgit, but much better than before. Good job.

@zeripath (Contributor) commented

@tsowa does cgit even attempt to provide last commit information?

@zeripath (Contributor) commented

I have a backport of the latest get-lastcommit-cache performance improvements onto 1.14 if people would like them.

@fnetX (Contributor) commented May 30, 2021

We have tested the backport you provided on codeberg-test.org, and it has ~3x the load times of go-git (15 to 17 seconds for your pull vs. ~5 seconds for go-git). We're using git version 2.29.2 - do you know whether a more recent version might perform better, or whether there are other constraints that might decrease performance?
It's a single-core 2GB RAM VPS.

@zeripath (Contributor) commented

Well, that's interesting - as my timings appear to be similar to those of go-git.

Are you sure you've built from the backport-improve-get-lastcommit branch?

The version should be 1.14.2+33-g57d45e1c2 as in SHA 57d45e1c247eaafb3a3a92ab593c31356b472d6f

Do you have commit graphs enabled for your repos?

@fnetX (Contributor) commented May 30, 2021

We deployed this branch which has your commits on top of our 1.14 patches cleanly: https://codeberg.org/Codeberg/gitea/src/branch/codeberg-try-puregit-improvements (just confirmed once more that the commit matches: 1.14.2+49-g7e9e3f364)

Yes, you can browse commit graphs on Codeberg.

@zeripath (Contributor) commented

Hmm... I am very confused, as this is now just as fast as gogit for me, and possibly faster in places. Tell me there's at least some improvement here for you?

I'm almost at my limit for what I can do to speed this up further. The main slowdowns in my testing were in filling the buffers between the pipes, and adjusting when the subsequent reads occurred seemed to fix this for me - perhaps my processor is just fast enough that the earlier writes give it just enough time to prevent the fill lock, whereas on your processor that's not quite enough. I just don't think there's any way to avoid it - I mean, we could try an os.Pipe instead of an io.Pipe? I tried a nio.Pipe but it was just as slow. Certainly we can't switch to the same algorithm as the go-git variant, as that would require even more communication and waiting for the cat-file --batch pipes to respond and fill.


By commit graph I meant the git core.commitGraph functionality. It should be enabled by default but ... I've certainly seen repos on my system that don't have a commit graph even though they would clearly benefit.
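
If a repository is missing one, it can be generated by hand - a sketch (the path is a placeholder for the bare repository on disk):

$ git -C /path/to/repo.git commit-graph write --reachable
$ git -C /path/to/repo.git config core.commitGraph true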


OK, I guess we're at a point of diminishing returns - I might be better off looking at solving the problems in go-git and making last commit info stop slowing down rendering.

@ashimokawa (Contributor) commented May 31, 2021

@zeripath

I did tests with the linux kernel and nixpkgs repos after a Gitea restart with a cold cache. It seems to be a factor of 3 on both in favor of gogit. I have no idea why the optimizations do nothing on those repos. git(new) here is the backport of your optimizations.

linux    1.14 gogit     0:21
linux    1.14 git       1:03
linux    1.14 git(new)  1:01

nixpkgs  1.14 gogit     0:06
nixpkgs  1.14 git       0:17
nixpkgs  1.14 git(new)  0:17

I also backported your optimization to 1.14 before, with the same results, but I blamed my lack of understanding of the code and a bad backport.

@zeripath (Contributor) commented May 31, 2021

@ashimokawa you'd need to backport a few other PRs to see the improvement - it's not just #15891 that is needed. I'm happy to give you a link to that backport.

@lunny (Member) commented May 31, 2021

gogit has a commit-graph optimization (ref #7314), but of course the git version should also read the commit-graph (#7313) if that file has been updated.

If we want to continue development on top of gogit, maybe we can maintain a fork in Gitea's organization if upstream cannot merge the PR quickly.

@zeripath (Contributor) commented Jun 6, 2021

@ashimokawa are you able to retest #16059 and tell me which repo it fails on? No rush. I'll push for #16042 to be merged in the meantime.

@ashimokawa (Contributor) commented

@zeripath

It did not fail, it was just extremely slow - slower than anything I have ever tested in this context.
What should I retest? There are no new commits added to this PR, right?

@zeripath (Contributor) commented Jun 6, 2021

@ashimokawa did you actually test from 67a1aa9fc55a36364b6c2e5bd2e230410a63e3fe? Or was it from 2cfd269688c56839ddc53b0563ad7aefd3a4da2a?

What was the repo that was so slow btw?

@ashimokawa (Contributor) commented

The repo I was testing with was https://codeberg-test.org/bigrepos/nixpkgs with cold cache

The branch I was testing was https://github.com/zeripath/gitea/tree/backport-use-git-log-raw

It took 55 seconds (!) compared to plain 1.14 with pure git, which only took ~17s.

go-git took 7.7s; your (previous) backport was the fastest so far with 6.7s, but it led to wrong data in some subdirs, as @fnetX pointed out.

@zeripath (Contributor) commented Jun 6, 2021

Thanks @tsowa

So I've discovered something terrible with git log --raw, --name-only, and --name-status that seems insurmountable.

If there is a merge commit, it appears that the files created in that merge don't show up.

This might simply be a bug in git, but it's kinda annoying. From within the git repository:

$ git --version
git version 2.31.1

$ git log --raw --parents faefdd61ec7c7f6f3c8c9907891465ac9a2a1475 -- gitk-git/.gitignore
commit 9a6c84e6e9078b0ef4fd2c50b200e8552a28c6fa
Author: Junio C Hamano <gitster@pobox.com>
Date:   Wed Jan 30 13:52:44 2013 -0800

    Merge git://ozlabs.org/~paulus/gitk
    
    * git://ozlabs.org/~paulus/gitk:
      gitk: Ignore gitk-wish buildproduct

:000000 100644 0000000000 d7ebcaf366 A  gitk-git/.gitignore

$ git log --raw --parents faefdd61ec7c7f6f3c8c9907891465ac9a2a1475 -- gitk-git | grep gitk-git/.gitignore

$ echo $?
1

This might be the cause of slowdowns people are seeing. Same thing happens on 1.7.2!

@zeripath (Contributor) commented Jun 6, 2021

OK, I've figured that out! -c is the trick.
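
I.e. repeating the failing command from above with -c added (a sketch - -c makes git log produce diff output for merge commits as well):

$ git log --raw -c --parents faefdd61ec7c7f6f3c8c9907891465ac9a2a1475 -- gitk-git | grep gitk-git/.gitignore

which should now find the file created in the merge.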

@zeripath (Contributor) commented Jun 6, 2021

OK, I've pushed up another version of #16059 and its backport onto 1.14 to backport-use-git-log-raw.

These are radically quicker for me on most of these repositories and examples.

reponame             #16059 5a90343   #16042 4c851b1   GoGit
ports 47fc04fbc3     2536ms           8717ms           8746ms
ports/devel          2633ms           50097ms          20482ms
ports/audio          240ms            2680ms           2887ms
ports/polish         185ms            8029ms           820ms
git faefdd61e        5108ms           3120ms           3373ms
git/gitk-git         313ms            544ms            2322ms
git/Documentation    13983ms          1040ms           5989ms
nixpkgs c43e0f4873   2694ms           19714ms          5863ms
nixpkgs/pkgs         2733ms           26187ms          714ms

I guess the next step is examining why git/Documentation and nixpkgs/pkgs are pathological for #16059 and how that can be ameliorated.

@tsowa (Author) commented Jun 6, 2021

@zeripath I have updated https://giteanew2.ttmath.org to 5a90343b and can confirm:

ports ~14 s
ports/devel ~9.5 s
ports/audio ~5.6 s
ports/polish ~5.5 s

The times probably depend on the kind of disks in the server; I have old 7200 RPM HDDs.

@zeripath (Contributor) commented Jun 6, 2021

Yeah my development laptop is a bit of a beast.

The slowdown appears to be in dealing with huge numbers of unseen parents.

I'll take another look though.

@zeripath (Contributor) commented Jun 7, 2021

OK, I've adjusted the heuristic slightly, which appears to improve those three pathological cases.

@tsowa (Author) commented Jun 8, 2021

For me the performance is now sufficient - thank you for your work. Any plan to put it into 1.14.3?

@zeripath (Contributor) commented Jun 8, 2021

Well, if you could comment on #16059 and say that you think this is now ready, that would help. If the Codeberg peeps think it's better, that would be good too.

@ashimokawa (Contributor) commented

I tested nixpkgs and linux with https://github.com/zeripath/gitea/tree/backport-use-git-log-raw

nixpkgs

#16059 3.4s
go-git 7.7s

linux

#16059 65s
go-git 22s

We'd need more testing to say anything definitive - sometimes it is amazingly fast, sometimes almost 3x slower than go-git. But memory usage is way down :)

@ashimokawa (Contributor) commented

Found another really bad one:

nixpkgs/src/branch/master/pkgs/data/icons/beauty-line-icon-theme

#16059 83s(!)
pure git (1.14 default) 36s
go-git 1.9s

So the PR is a factor of 40 slower vs go-git in this case - and there is only ONE file in the directory above.

@zeripath (Contributor) commented Jun 8, 2021

That's just so weird - I've written a slight adjustment to only restart the log --name-status when there is genuinely no parent - hopefully that solves the problem above. It's still slower than go-git here, which is weird, as I can't really see any good reason for that, but so it goes.

@tsowa (Author) commented Jun 8, 2021

use-git-log-raw (283959931) takes 6-7 seconds to render beauty-line-icon-theme:
https://giteanew2.ttmath.org/NixOS/nixpkgs/src/branch/master/pkgs/data/icons/beauty-line-icon-theme

@ashimokawa (Contributor) commented Jun 9, 2021

@zeripath
Unfortunately, no change for nixpkgs/src/branch/master/pkgs/data/icons/beauty-line-icon-theme - measured 89s now.

It is good to improve pure git, but I see that for this one it is even a regression from the current pure git code (not to mention go-git, which takes less than 2 seconds).

@lunny (Member) commented Jun 9, 2021

I think the next direction is to use the commit-graph file, or to load last commit information asynchronously.
Reusing go-git and improving its memory model is also a possibility.

Or just don't show last commit info if there are more tree entries than some threshold.

@fnetX (Contributor) commented Jun 9, 2021

I was missing out here a bit and haven't yet tried to follow the exact changes in the different algorithms, but I'm just thinking about the first one:
that one was only slow because it scanned all the commits for file changes, and that could take a while. What about limiting this search to, let's say, 5000 commits (repos of this size seemed acceptable to browse) and rendering the page then? The rest is rendered asynchronously on demand, so some commit info might appear sooner than other.

With caching enabled (or at least a persistent cache), the information could already be generated for the full file tree. So the first access to a repo shows some last commits and slowly loads all other information, but in the background the cache is already filled for the other files, so that the whole commit log doesn't need to be scanned for each new subfolder access.

I'm still sceptical about too much async JS rendering; I mainly don't use GitLab because I know it sometimes hangs my browser when loading all the frontend stuff. Not sure if it's related to the commit info, but using the GitLab frontend is a pain for me.

@ashimokawa (Contributor) commented

The other problem with all these pure git implementations is that if a user cancels a request, the git process runs to completion and does not get killed. So if a user is impatient and presses reload, they can easily OOM an instance.

@6543 (Member) commented Jun 9, 2021

@zeripath didn't you do some work to pass the context down to git exec commands ☝️ ?

@zeripath (Contributor) commented Jun 9, 2021

@ashimokawa:

The other problem with all these pure git implementations is that if a user cancels a request, the git process runs to completion and does not get killed. So if a user is impatient and presses reload, they can easily OOM an instance.

You're wrong that this is unique to the native git backend - it's present in the gogit backend too, where it is absolutely uncancellable and will happily eat all of your memory until it's done. The native git backend at least has cancellable "processes" on the process page.

#16032, which is already in 1.15 and is the PR @6543 is talking about, passes the request context down into GetLastCommitPaths for both backends, making them absolutely cancellable.

However, if I could convince the other maintainers to review #16063 - that PR pushes the generation of commit info out of the render loop, deferring it onto a uniqued queue structure, which is clearly the best option.

Two further additions would be good:

  • Make the commit info work process manageable - this would be relatively easy
  • Pass the cache into GetLastCommitInfo so that as soon as a path's value is determined it is cached.

@fnetX

I was missing out here a bit and haven't yet tried to follow the exact changes in the different algorithms, but I'm just thinking about the first one:
that one was only slow because it scanned all the commits for file changes, and that could take a while. What about limiting this search to, let's say, 5000 commits (repos of this size seemed acceptable to browse) and rendering the page then? The rest is rendered asynchronously on demand, so some commit info might appear sooner than other.

There are two PRs up for discussion here: #16042 & #16059.

        I
       / \
      A   B
      |   |
      C   D
       \ /
        E 
        | 
        F
        | 

Assume our file of interest changed in C and in B (A < B in terms of time), but had the same SHA in F too.

  • GitHub - difficult to be completely sure of their algorithm, but I think it's essentially git log -n1. Result: B
  • 1.14 GoGit - walks I, A, C (the commit tree, oldest-parent first) looking for changes to the file. Result: C
  • 1.14 git - walks I, A, B, C, D, E, F, ... looking for the earliest time our file had the SHA. Result: F
  • #16042 - walks I, A, B, C, D, E for the earliest time our file had the SHA, but stops at E as we have reached a single branch and have found that SHA. Result: C
  • #16059 - walks I, A, B with git log --name-status ..., which gives the diff for each commit; stops at B as the file is changed there. Result: B

One of the problems with 1.14 git and #16042 is that git rev-list returns only the tree ID for each commit, so we have to traverse the tree using git cat-file --batch. We don't repeatedly call git ls-tree for each commit, as that is slower.
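
For illustration, this is roughly the kind of exchange (a hand-run sketch; the SHA is the ports commit from earlier in this thread, and cat-file --batch resolves the ^{tree} suffix to the commit's root tree):

$ echo '9ea557779ce520c206f223f6f7b48fcc52f92dad^{tree}' | git cat-file --batch

This prints the raw tree object, whose (binary) entries then have to be parsed to find the blob or sub-tree SHA for each path of interest.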

With caching enabled (or at least a persistent cache), the information could already be generated for the full file tree. So the first access on repo shows some last commits and slowly loads all other information, but in the background the cache is already filled for the other files so that the whole commit log doesn't need to be scanned for each new subfolder access.

#16063 could easily be changed to pre-cache the full repository, but Gitea will already attempt to pre-cache large repos on pushes.

I'm still sceptical about too much async JS rendering; I mainly don't use GitLab because I know it sometimes hangs my browser when loading all the frontend stuff. Not sure if it's related to the commit info, but using the GitLab frontend is a pain for me.

#16063 as currently constituted simply does setTimeout(2s, reload); it's not doing anything cleverer than that. It would be possible to do a simple div-replacement trick as used in the graph, but the code that renders directories is a hellscape of spaghetti - that could be done as a follow-up to #16063 if it gets in. We would not need to do anything cleverer than simply replacing the div, I think.

@tuaris commented Jul 6, 2021

I came across this GH issue while looking for information on the same problem I've been having. After briefly looking over the comments here, I understand that the slowness is due to Gitea trying to load the last commit message on a per-folder basis so it can display it on the rendered page.

It looks like even though there is an option to cache the last commit, that doesn't really help in all cases. However, I have noticed some improvement by setting these options (one could probably see even more improvement by increasing the TTLs further, at the trade-off of possibly showing stale info):

[cache]
ENABLED	= true
ADAPTER	= memcache
HOST	= localhost:11211
ITEM_TTL = 24h

[cache.last_commit]
ENABLED	= true
ITEM_TTL = 8860h
COMMITS_COUNT = 1500

I have one suggestion, especially for large repositories such as the FreeBSD ports repo: maybe offer an option to disable the display of the commit message on a per-repo basis. Although it's a nice feature, it's kinda pointless on large repos IMHO. I don't think I even pay attention to it.

Or you could have those commit messages indexed in the background into the database or Elasticsearch, instead of generating them on the fly with git log.

Edit: just saw #16063, which is similar to what I just described.

@zeripath (Contributor) commented

I'm going to close this, as I believe these problems have been considerably improved on 1.15 and main. If specific problems remain, please ask for a reopen - but please provide some logs - or consider opening another issue with more details.

@tuaris commented Aug 30, 2021

I will say there is an improvement after upgrading to 1.15.0, though it doesn't look like page loads after the initial one are any 'faster'. Taking the ongoing example of the official FreeBSD ports git repository as a benchmark:

When loaded into Gitea, it takes about 2 seconds to render the subdirectory www:
[screenshot]

Reloading the same page subsequent times takes only slightly less time:
[screenshot]

It's a much better experience than 1.14.x. Thanks for the work towards improving this. Hopefully the ideas mentioned in #16063 and in my previous comment are still on the table.

@zeripath (Contributor) commented

@tuaris if second loads are not any faster then it's likely you do not have the cache set up.

Regarding the last commit info - #16467 would essentially do that.

@go-gitea locked and limited conversation to collaborators Oct 19, 2021