
Re: [E-devel] cvs, servers and stuff.



Hi everyone.

I've been doing some thinking and some testing today, and I realized a few things.

The first thing I realized is why the pseudo-benchmark I created was giving out evil data. In git's case, git does a lot of extra work on the client's disk (unpacking and such), which takes up a lot of time. During that time the server isn't under any load, which explains why the load on the server was nowhere near constant.

And then I realized why git wasn't really doing so well. It is the same reason why cvs isn't doing so well and, frankly, why no SCM possibly could.

The problem is that you have a truckload of binary data in the repository. There are many reasons why this shouldn't be so.

1) Binary data is way better off distributed in the form of archives that can be mirrored by anyone (I'm thinking at least SF). That way people can get that data a lot faster and your server is happy too.

2) You don't change the binary data that much. And even when you do, you could package your data into archives like imlib2-data.tar.bz2 so that you repackage less.

3) Changes in binary data don't generally affect dependencies. They're not like API changes or whatever. Most of the time people will just need to grab one updated archive and that's it.

4) You could then use pkg-config from your configure scripts to ensure the right version of the data is actually installed.

5) Let's do some simple math. 

You have 100MB worth of files. These amount to 60MB binary and 40MB text. When you try to compress this, as git does, you get around 50MB binary and 8MB text. That adds up to almost 60MB.

That means for every 60 people that would simultaneously download through CVS you can have 100 download through git (let's just ignore the other factors and focus on bandwidth a little).

Now suppose you have only the 40MB of text. With git you can then get down to about 20% of the original size (maybe less, who knows). That means you could (in theory) actually have 5 times more downloads with git than with CVS.
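
To make the arithmetic above easier to play with, here is a rough Python sketch. The sizes and compression ratios are the ballpark figures from this mail, not measurements, and the function name is made up for illustration.

```python
def transfer_size_mb(binary_mb, text_mb, binary_ratio, text_ratio):
    """Return the compressed transfer size in MB for a given mix of data."""
    return binary_mb * binary_ratio + text_mb * text_ratio


# Ballpark ratios from the text: 60MB binary -> ~50MB, 40MB text -> ~8MB.
BINARY_RATIO = 50 / 60
TEXT_RATIO = 8 / 40

with_data = transfer_size_mb(60, 40, BINARY_RATIO, TEXT_RATIO)
text_only = transfer_size_mb(0, 40, BINARY_RATIO, TEXT_RATIO)

# How many compressed downloads fit in the bandwidth of one uncompressed one.
gain_with_data = 100 / with_data
gain_text_only = 40 / text_only

print(f"with data: {with_data:.0f}MB sent, {gain_with_data:.1f}x the downloads")
print(f"text only: {text_only:.0f}MB sent, {gain_text_only:.1f}x the downloads")
```

With the binary data in the mix the gain stays well under 2x; with text only it reaches the 5x mentioned above. Bandwidth is the only factor this looks at.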

Now, I'm not saying you shouldn't keep that data in a repo. You obviously have to. I'm just saying there's no need for people to have anonymous access to that repo; it could be for developers only.

So, my suggestions are:
1) Move the data into its own repository
2) Convert the two repositories to git
3) Make that data repository devel-only.
4) Split the data into small packages (one for each data/ dir in the tree, I guess)
5) Make the source require the data through pkg-config
6) Have the data released as tarballs once it's changed (you can have that happen automatically with git, I'm assuming you can with the others as well)
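
For suggestion 5, the check itself can be quite small. Below is a minimal Python sketch of the idea, assuming each data package installs a pkg-config file. The module name "imlib2-data" and the version "1.2.0" are hypothetical, and in practice the check would probably live in the configure script rather than in Python. It only relies on pkg-config's --atleast-version option, which exits with status 0 when the installed version is new enough.

```python
import subprocess


def data_package_ok(package, min_version, run=subprocess.run):
    """Return True if pkg-config reports `package` at `min_version` or newer."""
    cmd = ["pkg-config", f"--atleast-version={min_version}", package]
    try:
        # pkg-config exits with 0 when the requirement is satisfied.
        return run(cmd).returncode == 0
    except FileNotFoundError:
        # pkg-config itself is not installed.
        return False


if __name__ == "__main__":
    # "imlib2-data" and "1.2.0" are hypothetical names, for illustration only.
    print(data_package_ok("imlib2-data", "1.2.0"))
```

The `run` parameter is only there so the check can be exercised without pkg-config being present.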

And that's it. But for all this babbling, is this really worth it? 
Like I said, I found client-side disk I/O to make the benchmarks mostly useless. But they still provide me with a good overview of server-side CPU & memory usage.

So I opted for a new approach. I would have two terminals on my client. In one I'd do something like 'sleep 5 ; svn checkout ...'. In the second I'd do 'time read'. I would press enter once when network traffic actually began and once again when it stopped and that showed me how much everything took.
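
The 'time read' trick above is just a hand-operated stopwatch. A rough Python equivalent could look like the sketch below; the function name and prompts are mine, and it is of course only as accurate as whoever is pressing Enter.

```python
import time


def manual_stopwatch(wait=input, clock=time.monotonic):
    """Return the seconds elapsed between two Enter presses."""
    wait("Press Enter when network traffic starts... ")
    start = clock()
    wait("Press Enter when network traffic stops... ")
    return clock() - start


# Usage, from a second terminal while the checkout runs in the first:
#   elapsed = manual_stopwatch()
#   print(f"elapsed: {elapsed:.1f}s")
```

The `wait` and `clock` parameters default to the real keyboard and clock; they are only parameters so the function can be tried out without anyone sitting at the keyboard.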

So here are the timings. The repos have no history attached.

Repo with data:
CVS:            0:46
SVN(svnserve):  1:16
SVN(HTTP):      1:58
GIT(git):       1:23
GIT(HTTP):      1:53

Same repo without data:
CVS:            0:12
SVN(svnserve):  0:28
SVN(HTTP):      0:37
GIT(HTTP):      0:13

And what about Git with its built-in protocol? Just six seconds. How's that for taking some load off :) Of course you have to add/subtract 1s for my timing on the keyboard, but you get the overall idea.
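
If you want to compare the numbers above more directly, here is a small Python sketch that converts the timings to seconds and expresses each one relative to CVS. The figures are copied from this mail (including the six seconds for git's own protocol); the helper names are made up.

```python
def to_seconds(stamp):
    """Convert an 'm:ss' string to a number of seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings from this mail; remember they are hand-timed, give or take 1s.
WITH_DATA = {
    "CVS": "0:46",
    "SVN(svnserve)": "1:16",
    "SVN(HTTP)": "1:58",
    "GIT(git)": "1:23",
    "GIT(HTTP)": "1:53",
}
WITHOUT_DATA = {
    "CVS": "0:12",
    "SVN(svnserve)": "0:28",
    "SVN(HTTP)": "0:37",
    "GIT(git)": "0:06",
    "GIT(HTTP)": "0:13",
}


def relative_to(baseline, timings):
    """Return each timing as a multiple of the baseline's timing."""
    base = to_seconds(timings[baseline])
    return {name: to_seconds(t) / base for name, t in timings.items()}


for label, timings in (("with data", WITH_DATA), ("without data", WITHOUT_DATA)):
    print(label)
    for name, ratio in relative_to("CVS", timings).items():
        print(f"  {name:<14} {ratio:.2f}x CVS")
```

With hand-timed numbers this small, a second either way moves the ratios noticeably, so treat them as rough.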

This is a very complicated way of doing things. But data should probably be separated from code. And it should probably be distributed in small archives. And people shouldn't have to use an SCM to get it.

So ... Wadda ya say. Is this too complicated/ not worth it / stupid / braindamaged / interesting ?

My brain farts more things like that on a regular basis. If the above makes sense, let me know and I'll give you a couple of other ideas as well :d

Eugen.

P.S: I knew Linus wouldn't lie ;)