Am I off base here? With the advent of the Opteron with a 64-bit word (think bitboards here), lots of registers (very fast operation), and 3 megs of L1 cache (extremely fast memory loads for a pretty large swath of the search space), shouldn't there be room for at least a magnitude improvement in move generation, if not also evaluation, at a given clock speed compared to 32-bit x86 hardware?
Would not that magnitude of improvement provide a significant increase in Elo strength beyond the 40 pts a year we are at now?
Granted, this path would want a dedicated processor so as to not have significant register and cache thrashing from the operating system, but if you were looking to *win* next year's computer championship it would seem to be a quick way to get an "unfair" advantage, as it were.
---------
On the plus side, death is one of the few things that can be done just as easily lying down.
re:Opteron optimizations - 2006/12/17 14:24
In reviews of processors many benchmarks are included, but never a chess program.
In a nutshell, in some benchmarks processor A is faster, in others it is B. The knowledge that processor A wins 8 out of 10 benchmarks means nothing to me. I only want speed when I run a chess program; for my text processing the Z80 was already fast enough.
Have you any idea what type of benchmark comes closest to a chess program?
---------
In my heart, I think a woman has two choices: Either she's a feminist or a masochist.
re:Opteron optimizations - 2006/12/17 15:02
The Opteron does not have a lot of registers compared to some other processors. It has 16 64-bit general purpose registers, 16 128-bit xmm registers, and 8 64-bit mmx registers. That is more than x86, but saying that a processor has more registers than an x86 processor isn't saying much.
Also, it does not have 3 MB of L1 cache. It has 1 MB of L2 cache, 128 KB of L1 cache (64 KB data, 64 KB instructions), and no L3 cache. The Itanium has L3 cache, and 3 MB of it sounds about right. Maybe that's what you were thinking of.
All the tests I have seen of bitboard engines show that an Opteron is about 60-70% faster than an equally clocked 32-bit Athlon. Crafty was 60% faster, and Sjeng was 70% faster, when both were compiled for the Opteron.
Using an Opteron doesn't provide more Elo improvement each year. Going from 32-bit to 64-bit is a one-time boost. As the CPUs get faster, that will give more benefits, but CPUs would get faster whether they were 32-bit or 64-bit.
---------
Ever notice that 'what the hell' is always the right decision?
re:Opteron optimizations - 2006/12/17 15:28
Yes, I know that, but if you do so you then use bitboards and *not* rotated bitboards. Traditional bitboards are not as fast, especially as far as move generation is concerned.....
100 clock cycles seems nothing to me. I am new to chess programming so I might be wrong, but I think you are looking at a higher figure here.
I have told you about this in my last post .... As I said, being able to do evaluation and move generation in 100 clock cycles does *not* mean being able to do 30M nodes per second! I know the math you are using: you are thinking that in a second you have 3.000.000.000 clock cycles and that each node takes 100 cc, so you get 30.000.000 nodes per second. This calculation is ok, *but* being able to do evaluation and move generation in 100 clock cycles does *not* mean being able to do each node in 100 cc. This is true because not all nodes require move generation; for example, leaves do not require move generation and, by the way, leaves are the majority of your nodes. In leaves you might only want to see if you are in check and then, if you are, see if this is a checkmate (and this might involve something similar to move generation), but this is not done all the time.
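To make that arithmetic concrete, here is a minimal Python sketch. The function names and the split of a node's cost into a fixed overhead plus an occasional move generation are assumptions for illustration only; the overhead figure and the fraction of nodes that generate moves are hypothetical, and only the 3 GHz clock and the 100-cycle figure come from the discussion.

```python
def nodes_per_second(clock_hz, cycles_per_node):
    """Nodes searched per second if every node costs the same number of cycles."""
    return clock_hz / cycles_per_node


def average_cycles_per_node(overhead_cycles, movegen_cycles, movegen_fraction):
    """Average cost of a node when only a fraction of nodes generate moves.

    overhead_cycles:  cycles spent in every node (make/unmake, hashing, ordering...)
    movegen_cycles:   cycles for one move generation
    movegen_fraction: share of nodes that actually generate moves (0.0 to 1.0)
    """
    return overhead_cycles + movegen_fraction * movegen_cycles


CLOCK_HZ = 3_000_000_000  # 3 GHz, as in the thread

# The naive estimate: every node costs exactly 100 cycles.
naive_nps = nodes_per_second(CLOCK_HZ, 100)

# Hypothetical figures: 2000 cycles of overhead per node, and move
# generation (100 cycles) needed in only a quarter of the nodes.
avg_cycles = average_cycles_per_node(2000, 100, 0.25)
modelled_nps = nodes_per_second(CLOCK_HZ, avg_cycles)

print(f"naive:    {naive_nps:,.0f} nodes/sec")
print(f"modelled: {modelled_nps:,.0f} nodes/sec at {avg_cycles:.0f} cycles/node")
```

With these made-up inputs the move generator is a small part of the per-node cost, which is the point being argued: speeding it up alone does not move the node rate much.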
So, you do not need to generate moves in all nodes... but in many nodes (the ones in which you generate moves..) you need to do other things like placing and unplacing the move... This requires some clock cycles as well, by the way. Also, if you write a "state of the art" chess program (and you need to do so to be fast and strong) then you have to do many other things like hash tables, move ordering etc....
My program for example can do 150K move generations per second but 1M nodes per second, so roughly 7 times more. Actually I have just solved a problem..... I was by mistake generating moves at all nodes and getting a speed of about 60K nodes per second!!! There have been quite a few posts on this.
Going at 3Mnodes/sec and going at 30Mnodes/sec is clearly an order of magnitude increase, but.... in speed, *not* in depth!!! To go 10 times deeper you need *much more* than an order of magnitude more in speed!
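The speed versus depth point can be put into numbers with a small Python sketch. It assumes the simple model that the tree has about b^d nodes at depth d; the effective branching factor b = 3 used below is a hypothetical value for illustration, not a measurement from any program.

```python
import math


def extra_plies(speedup, branching_factor):
    """Extra search depth bought by a speedup, assuming nodes ~ b**depth."""
    return math.log(speedup) / math.log(branching_factor)


def speedup_for_depth(old_depth, new_depth, branching_factor):
    """Speedup needed to go from old_depth to new_depth in the same time."""
    return branching_factor ** (new_depth - old_depth)


B = 3  # hypothetical effective branching factor

print(f"10x faster buys about {extra_plies(10, B):.1f} extra plies")
print(f"going from 10 to 12 plies needs a {speedup_for_depth(10, 12, B)}x speedup")
print(f"going from 10 to 100 plies needs a {speedup_for_depth(10, 100, B):.2e}x speedup")
```

Under this model an order of magnitude in speed is worth roughly two plies, while ten times the depth needs an astronomically larger speedup.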
Yes, if you could go 10 times deeper that would be fantastic. I hope you do.....
---------
I will permit no man to narrow and degrade my soul by making me hate him.
re:Opteron optimizations - 2006/12/17 16:09
Nice to know. Thanks.
But this benchmark is not often used in reviews. I always believed that if a CPU does well in audio benchmarks it will also be fast at chess. Audio benchmarks do a lot of raw processing and memory access. But I'm not sure; in fact, I might be completely wrong here.
Anyone who knows more about this?
---------
In my heart, I think a woman has two choices: Either she's a feminist or a masochist.
re:Opteron optimizations - 2006/12/17 16:24
A couple of nit-picks. "Magnitude" as in "order of magnitude" is generally 10x. As you know, going 10x faster will _not_ give you 10x the depth. You will be lucky to get two more plies. This is an exponential growth issue, which means logarithmic growth in depth. 100 cycles is _very_ fast for the search overhead, making/unmaking moves, hash probing, move ordering, generating moves, and positional evaluation. Most programs are way more than 20x slower than that. More like 2000 clocks _minimum_ per node... Crafty, for example, does 1M nps on a 3 GHz processor, approximately 3000 cycles per node.
---------
Eternity's a terrible thought. I mean, where's it all going to end?
re:Opteron optimizations - 2006/12/17 16:49
The Opteron is 70% faster, clock for clock, than the Athlon XP, and 100% faster than the Pentium 4.
---------
Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity.
re:Opteron optimizations - 2006/12/17 17:32
I think you could implement a routine to have an entire bitboard representation solely in registers, and do move generation solely in the CPU.... (This would be an ASM-only kind of thing)... Conceivably you could do move generation in a single clock... Which would be 3000000k nodes... But let's assume an order of magnitude less for all sorts of other things... so 300000k, but let's make it even one more magnitude less... 30000k... I think if you could get 30 million nodes a second, you are hitting that magnitude improvement....
If you are searching for maximum performance, doing all sorts of specialized register stuff and maximizing your use of cache, you don't want the operating system coming in several times a second, saving off your registers, polluting the cache etc... The idea would be to have selfish control of the processor... (This would mean that the other processor would be available for other tasks).
---------
On the plus side, death is one of the few things that can be done just as easily lying down.
re:Opteron optimizations - 2006/12/17 18:30
You can get the bitboards down to 8 (one for each of the 6 piece types, plus a white piece mask and a black piece mask), and 1 register for state flags (castling, en passant, check) and for the eval score (and possibly other things that you need), giving you 7 registers for doing various other things...
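As a rough illustration of that 8-bitboard layout, here is a Python sketch of the data only (it says nothing about whether a compiler or hand-written ASM could keep it in registers). The class name, the field names, the `pieces` helper and the a1 = bit 0 square numbering are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Position:
    # Six piece-type bitboards, both colours mixed together.
    pawns: int
    knights: int
    bishops: int
    rooks: int
    queens: int
    kings: int
    # Two colour masks.
    white: int
    black: int
    # Ninth word: state flags (castling, en passant, check) and eval score.
    state: int = 0


# Starting position, assuming a1 = bit 0, b1 = bit 1, ..., h8 = bit 63.
START = Position(
    pawns=0x00FF00000000FF00,
    knights=0x4200000000000042,
    bishops=0x2400000000000024,
    rooks=0x8100000000000081,
    queens=0x0800000000000008,
    kings=0x1000000000000010,
    white=0x000000000000FFFF,
    black=0xFFFF000000000000,
)


def pieces(pos, piece_type, colour):
    """Bitboard of one piece type for one colour: a single AND."""
    return getattr(pos, piece_type) & getattr(pos, colour)


GENERAL_PURPOSE_REGISTERS = 16  # 64-bit general purpose registers on AMD64
WORDS_USED = 8 + 1              # eight bitboards plus the state word

print(hex(pieces(START, "knights", "white")))
print(GENERAL_PURPOSE_REGISTERS - WORDS_USED, "registers left over")
```

The price of the compact layout is one extra AND every time a colour-specific bitboard is needed.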
And this would have to be ASM, because you are not doing something any compiler is going to understand, and you are going to be manipulating register states in ways that a compiler is not going to be happy with. (This is not a speed thing, just a compiler thing). You may well be able to compile this with a lot of macros rather than plain C.
So, on a 3 GHz machine, I give you 100 clock cycles (and that should be quite a lot for both move generation and eval) to get to 30M nodes per second.
This would seem to be a worthwhile "optimization", as you should be able to get a magnitude increase in search depth against other computer opponents. I will grant you that this is not portable, but that is fine; the goal is to create a program to blow away the computer opponents (and if you can do that, riches *will* follow, as the best program always attracts the most dollars, making it worthwhile).
---------
On the plus side, death is one of the few things that can be done just as easily lying down.
re:Opteron optimizations - 2006/12/17 18:46
Could you please tell me the performance improvement you experience when running Deep Sjeng on a 32-bit machine with a certain clock and then on a 64-bit machine with the same clock? (I know they are also faster in 32-bit applications, so it does not really help, but it still gives me an idea...) Also, does Deep Sjeng use rotated bitboards? If so, what have you done to optimise it for the Opteron? I would think you get a "natural" improvement simply by compiling for a 64-bit machine.....
Yes, but it is worth saying that Brutus does not use a dedicated processor in a 2-way system, but a dedicated processor on a separate card. With a 2-way Opteron system I would rather use both processors in parallel than try to get one dedicated to my application.
---------
I will permit no man to narrow and degrade my soul by making me hate him.
re:Opteron optimizations - 2006/12/17 19:46
It's not often used in computer magazine reviews but it's the most widely respected CPU-power benchmark in the industry, as far as I'm aware. All the major manufacturers put a lot of effort into scoring well in the SPEC suite.
---------
Death is nothing, but to live defeated and inglorious is to die daily.
re:Opteron optimizations - 2006/12/17 19:55
Opterons are already being used (Deep Sjeng and ParSOS), and so are dedicated processors (Brutus).
---------
Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity.
re:Opteron optimizations - 2006/12/17 20:44
Umh, this is not my strongest field, but I doubt it.... As I said, with rotated bitboards you need at least 14 bitboards; the AMD64 architecture has 16 registers, R0 - R15 (not considering special purpose registers). I do not think you could do much with only 2 available registers.
Also, by using the keyword "register" you are only giving a hint to the compiler; the compiler does what it thinks is best anyway.
In the old days you could produce much faster code by writing your own ASM than by using a compiler. Now this is still true, but it is not that easy anymore..... Compilers are much better than in the past. Also, architectures are much more complicated; with some RISC processors it is *really hard* to write efficient ASM! If you have lots of time to read documentation and perform tests etc... you might achieve better performance, but by the time you do so a new and improved version of that processor might come out and your assembly code might not be the best anymore (maybe they made one instruction slower and another one faster). What is more, the code you produce is not portable.
Finally, sometimes it could be more convenient to have other variables in registers and *not* the bitboards... So, even if you are using assembly (and not a compiler), you will probably drop the idea of having bitboards permanently in registers.
With a 3GHz processor you have 3000000K cycles per second, in other words 3.000.000.000 cycles/sec. How could you possibly do the whole move generation in a single clock cycle?!? You can do one OR operation in a clock cycle, but not the whole move generation.
So if you consider that for the move generation you need many ORs, ANDs, a few loops, and you also need to read from the pre-computed attack arrays (which are *not* in registers, by the way....) etc.... then you realise that you need quite a few clock cycles... certainly not only 1.
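To show how many separate steps even a simple case involves, here is a Python sketch of knight move generation only, using a pre-computed attack array. The names (`KNIGHT_ATTACKS`, `generate_knight_moves`) and the a1 = bit 0 square numbering are assumptions for this example; a real engine would do this in C or ASM and for every piece type, including the much costlier sliding pieces.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # keep results inside a 64-bit word


def knight_attacks_from(square):
    """Bitboard of squares a knight on `square` attacks (a1 = 0, h8 = 63)."""
    rank, file = divmod(square, 8)
    attacks = 0
    for d_rank, d_file in ((1, 2), (2, 1), (-1, 2), (-2, 1),
                           (1, -2), (2, -1), (-1, -2), (-2, -1)):
        r, f = rank + d_rank, file + d_file
        if 0 <= r < 8 and 0 <= f < 8:
            attacks |= 1 << (r * 8 + f)
    return attacks


# Pre-computed attack array: a table in memory, not something held in registers.
KNIGHT_ATTACKS = [knight_attacks_from(sq) for sq in range(64)]


def generate_knight_moves(knights, own_pieces):
    """Return (from_square, to_square) pairs for all knights on the bitboard."""
    moves = []
    while knights:                                # outer loop: one pass per knight
        lowest = knights & -knights               # isolate the lowest set bit
        from_sq = lowest.bit_length() - 1
        targets = KNIGHT_ATTACKS[from_sq] & ~own_pieces & MASK64  # table read + AND
        while targets:                            # inner loop: one pass per move
            target = targets & -targets
            moves.append((from_sq, target.bit_length() - 1))
            targets ^= target                     # clear the bit just used
        knights ^= lowest
    return moves


# White knights on b1 and g1 in the starting position; white owns ranks 1 and 2.
print(generate_knight_moves(0x42, 0xFFFF))
```

Even this small case needs a table read, several ANDs and XORs and two nested loops, and each of those costs at least one cycle.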
What is this? You are making something 100 times slower for apparently no reason....
Well, doing 30 million move generations per second does *not* mean doing 30 million nodes per second. It might be less, it might also be more.... (my program can do up to 150K generations per second, but it can do 1M nodes/sec). In a node you need to do some other things apart from move generation, and move generation does not have to be done all the time.
Anyway, I hope you are right..... but I doubt it.
---------
I will permit no man to narrow and degrade my soul by making me hate him.
re:Opteron optimizations - 2006/12/17 22:44
Indeed, I thought it might be about twice as fast; are you sure it would have an improvement of an order of magnitude?
Are you talking about scheduling and context switching? Is this the reason why you want a second processor, to have your engine run on a separate processor? That is probably not a good idea; much better would be to have your engine running in parallel on both.....
---------
I will permit no man to narrow and degrade my soul by making me hate him.