Removal of human interference in championships (Einstein forum)

43 replies. Last post: 2012-01-25

  • Hjallti ★ at 2012-01-06

    I just removed myself from the championship. I think it is better to give my place to upcoming bots and not interfere as a human.

  • OneStone_c at 2012-01-06

    Sorry! :( FWIW, OneStone has not started any new games (or entered any tournaments) for over a month and is only finishing games that are in progress. He will eventually be completely out of the system. I will be happy to resign existing games if that is preferred.

  • Hjallti ★ at 2012-01-06

    Please don't resign against me :-)

  • Richard Pijl at 2012-01-06

    The same goes for Gambler. It will just finish its games and the tournaments it is in. It has already qualified for the next stages of the October and November monthly tournaments, and I will let it play and finish those stages. It never played the championship and will not start there either.

  • Robert Mathew at 2012-01-06

    I like playing against bots. I have a lot of games with OneStone and Gambler. I vote to keep them up and running and to start new games.

  • Gambler the Bot at 2012-01-06

    Sorry, I will just finish the running tournaments (including the next stages if I qualify). When bots start to dominate the highest championship level (although I'm not there yet), it is time to retire and move on to games where that is not yet the case. But on the bright side: we still have two 7-game matches ongoing that are far from finished.

  • Rafael Marques at 2012-01-06

    I also like playing against bots. I see them as a challenge, not a threat. So please, let them play - even in the championship!

    Anyway, it feels strange when two bots immediately retire because ONE human player complains about bots.

  • OneStone_c at 2012-01-06

    @Johnny: Point well taken. OneStone was on hiatus anyway, but I will consider re-entering the monthly cups and the waiting room soon. Meanwhile it does accept all (unrated) challenges and is moving faster – 10 secs per move.

  • Gambler the Bot at 2012-01-06

    Perhaps I should have made Gambler's view a bit clearer. It is not retiring because of a complaint in this thread, but because I already decided to do so (for reasons explained earlier).

  • One_minute_c at 2012-01-06

    In my opinion, it's great to see that something is going on in this game thanks to the increasing number of bots, and that this game is alive, even though the number of active EWN players on this site is most likely only around 100 and several very good players have left the site or now play this game very rarely, if at all.

    Now “we” have at least 5 different programs. RoRo and OneStone are both likely better than all human players (over a high number of matches) - very strong opponents from whom you can learn a lot - but it should still be possible for a good player to win a single 3-point or even 7-point match against them, thanks to the varying starting positions and the dice. So I can't see any problem having the bots with us in the championship or the monthly cup. If anything, all titles become more valuable.

    In addition, the games are finished faster and more players are available if you want to play. Take a look at the EWN tournament page: the player who is waiting most of the time for a new rated monthly cup match is very often only diamante. You can register, but you still have to wait for 3 others - the next day it starts, and it's finished in 2 months. Is that fun?

  • FatPhil at 2012-01-06

    I was thinking the exact opposite of Hjallti - as soon as I realised there would be three bots in the top division, I was going to suggest that the bots duke it out to see who should remain, and who should take a break, to let the humans have their fun.

    However, apparently one of the bots has already unilaterally taken such a step. I was in a bit of a panic when I saw the championship had started, as I've not been able to diddle with RoRoRo's settings for well over a week. (Yes, yes, the last time I was vaguely sober…)

  • Hjallti ★ at 2012-01-07

    First of all, I simply made the decision to remove myself from the championship, and I could even claim I did not complain about the bots. I don't ask the bots to retire, and the messages of the first two bots were clearly not retirements due to my removal.

    I don't want to make an issue out of it, but I do think it is good to speak about things that partly annoy you.

    (The bots are not the reason I retire… it is because I seem to lack the playing level I would want to have, and being passed by the bots, who can calculate better, is only part of it.)

  • Ingo Althofer at 2012-01-09

    Kitaktus told me a few months ago (when we decided to participate in the Olympiad with Hanfried_c) that this would be a good point for Hanfried_c to stop playing on LG. So, currently only very few bots will remain active on LG and in the championships, which can only be good for the community.

    Ingo.

  • FatPhil at 2012-01-10

    I definitely think we should have regular bot-vs-bot competitions, though.

  • YHW at 2012-01-10

    Either all the bots should play with us in the championship - which I personally would prefer - or no bots should play. But such a tournament would always create the impression that humans don't have any chance against programs anymore… which is not the case, due to the playing mode (a single 3-point match).

    In my opinion we would not be having this discussion if the best-rated players still played the championship with us, or if other strong new players joined this site. Instead, the number of players in the championship is shrinking all the time - from a maximum of 228 players in 2007 to 167 participants (currently registered). The strong players are becoming extinct, and that's frustrating.

  • Dvd Avins at 2012-01-15

    I would like to see a special bot vs. human tournament. I would also like to see a bot-only tournament, if the bot owners don't feel that would undercut their ability to get publicity for similar events elsewhere. But contrary to some of the reassuring voices earlier in the thread, I am glad that the bots seem to be taking their collective championship to heart and retiring from general championship (including Cup) matches here.

  • YHW at 2012-01-16

    In my opinion, the significance of a single tournament/championship (for determining the playing strength of a specific player) is overestimated - even if the playing mode were 'best of 13' with a first and a return match.

    Let me give you an example. I'm currently testing a new version of 'naive_child_c' which is unfortunately playing worse than the previous version; nonetheless, the playing strength of this version 'seems' approximately equal to OneStone's. Thirty 7-point matches are finished; the bots started alternately.

    The results over all single games were 164 to 164, yet the child won 57% of the 7-point matches. Those figures don't agree, and normally I should let my bot keep playing against OneStone until the results of the 7-point matches are of the same magnitude as the results of the single games - but how many games would ensure this?

    Within that series of thirty 7-point matches there were ten consecutive matches in which the child won the single games 65 to 52 - that corresponds to a ratio of 56% - while in another ten consecutive 7-point matches OneStone won 58 to 40 (59%). So it could be that in one tournament with a playing mode of 2 x 5 7-point matches the child comes out better than OneStone, and vice versa in another tournament.

    This game is so varied that you need such a high number of games to establish that one player is better than another that it makes more sense to analyse all the games played here on LG than to take the results of a championship seriously…
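
    To illustrate how swingy first-to-7 match results can be, here is a minimal Python simulation sketch. It assumes every single game is an independent fair coin flip (ignoring starting-position imbalance and any real strength difference) and asks how often two exactly equal players still produce a 57%-or-better match share over thirty matches:

        import random

        def match_first_to_7():
            """Play single games (fair coin flips) until one side has 7 wins."""
            a = b = 0
            while a < 7 and b < 7:
                if random.random() < 0.5:
                    a += 1
                else:
                    b += 1
            return a > b  # True if player A takes the match

        # Repeat many 30-match series and record A's match-win share in each.
        shares = []
        for _ in range(10000):
            wins = sum(match_first_to_7() for _ in range(30))
            shares.append(wins / 30)

        # Fraction of series in which the exactly equal player A still
        # reaches a 57% match share (17+ match wins out of 30).
        print(sum(s >= 17 / 30 for s in shares) / len(shares))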

  • MarleysGhost at 2012-01-16

    Are naive_child_c and OneStone deterministic (given the die rolls)?

  • YHW at 2012-01-16

    'naive_child_c' is an MC bot with internal heuristics/rules that are used to avoid some moves (also in the MC part) that don't make sense. Those rules sometimes lead to reproducible moves for specific stone positions (especially in the endgame), but most of the time the MC part is dominating. So it's mainly non-deterministic. The MC part/tree is analyzed by a minimax algorithm - but only 3 moves deep.

    But the main problem here is that the random starting positions are not balanced. The winner is very often already determined when the game starts; the other player can then only win by luck. You can see that in the results - if two equal players play against each other often, the results are mostly not 6-7 or 7-6 but rather 7-3 or 7-4 or even 7-1. The variation is pretty high because of your 'luck' - not only the dice are important, you also need luck to get a good starting position.

  • FatPhil at 2012-01-16

    YHW - have you discovered any red/blue bias?

  • Dvd Avins at 2012-01-16

    The number of mathematically distinct starting positions (including who is going first) should be slightly less than 64,800. (That's 180 x 360.) Actually, thinking about it, it may be exactly 64,800; I'm not sure there are any occasional symmetries that would reduce it beyond the numeric reflexive symmetry for both sides and the physical reflexive symmetry for one side, which together reduce the possible positions from 720 x 720.

    I wonder if anyone has devoted the programming time and hardware resources to systematically analyze which positions are advantageous.
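
    For the curious, here is a brute-force enumeration sketch in Python. It assumes the symmetries above are the only ones - a joint physical reflection of both corner triangles about the board diagonal (with the two triangles indexed symmetrically), plus the joint numeric relabeling k <-> 7-k - and counts canonical representatives over all 720 x 720 setups; who moves first is not folded in:

        from itertools import permutations

        # One corner triangle, squares indexed:   0 1 2
        #                                         3 4
        #                                         5
        # Reflecting about the board diagonal swaps (row, col) -> (col, row),
        # which permutes these indices as follows:
        REFLECT = (0, 3, 5, 1, 4, 2)

        def phys(arr):
            """Physically reflect one triangle's arrangement."""
            return tuple(arr[REFLECT[i]] for i in range(6))

        def num(arr):
            """Numerically relabel the pieces: k <-> 7-k."""
            return tuple(7 - x for x in arr)

        def canonical(red, blue):
            """Smallest representative under the assumed symmetry group."""
            variants = [
                (red, blue),
                (phys(red), phys(blue)),
                (num(red), num(blue)),
                (phys(num(red)), phys(num(blue))),
            ]
            return min(variants)

        perms = list(permutations(range(1, 7)))
        classes = {canonical(r, b) for r in perms for b in perms}
        print(len(classes))  # distinct setups under these assumptions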

  • Dvd Avins at 2012-01-16

    There is also the question of who goes first in games other than the first of a match. At Little Golem, it is automatically the loser of the previous game. At Mastermoves, it is random.

  • YHW at 2012-01-16

    You mean a difference between the player who starts and the one who has the 2nd move? I guess the number of games here is not enough to determine such an effect, but in the past I analyzed all finished games of the best 80 players here.

    These 'TOP 80' players started in 25110 games, of which they won 14867 - a ratio of 59.2%. On the other hand, they played 25118 games in which they had the 2nd move; the ratio here is 57.9%. So there is a small effect.

    But note, those ratios are calculated over all finished games, and 11783 of them were played by diamante alone.

    In comparison, the average winning rate of all 'TOP 80' players is 59.1% if they start the game and 56.8% if they have the 2nd move. That's an average effect of approx. 2.3%, but the effect differs between individual players. RoRoRo (e.g.) wins 74% of its games when the program starts and 73% otherwise, while it is the other way around for hanfried_c (73% to 75%). It seems that this effect depends on the player's style, but most players win more often when they start their games.
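
    As a rough check on whether a gap like 59.2% vs. 57.9% could be noise, here is a minimal two-proportion z-test sketch in pure Python. The counts are taken from the figures above; the number of second-move wins is back-computed from the quoted 57.9% ratio, so treat it as approximate:

        from math import sqrt, erf

        # Counts from the 'TOP 80' analysis (second-move wins back-computed
        # from the quoted 57.9% ratio, hence approximate).
        n1, w1 = 25110, 14867                  # games started, wins
        n2, w2 = 25118, round(0.579 * 25118)   # games moving second, wins

        p1, p2 = w1 / n1, w2 / n2
        p = (w1 + w2) / (n1 + n2)              # pooled proportion under the null
        se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
        z = (p1 - p2) / se
        p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided

        print(f"z = {z:.2f}, p = {p_value:.4f}")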

  • YHW at 2012-01-16

    The roles of the pieces 1 and 6, 2 and 5, as well as 3 and 4 are identical when the game starts. So you have a high potential to reduce the number of relevant starting positions significantly.

  • Ingo Althofer at 2012-01-16

    > I wonder if anyone has devoted the programming time and hardware resources to systematically analyze which positions are advantageous.

    Yes, Andreas Schaefer in a semester thesis back in 2005/2006:

    http://www.minet.uni-jena.de/preprints/althoefer_06/rockNroll.pdf

    Ingo.

  • lorentz at 2012-01-17

    @FatPhil, re 1st-move bias. I ran two 1000-game self-tests with OneStone. At 10 secs/move the first player won 521 games. At 300 secs/move the first player won 536 games.

    Side comment: I suspect, and am trying to gather more evidence to justify, that OneStone at 300 secs/move is playing as well as possible, at least to within a minuscule margin of error. In fact, it seems that OneStone at 10 secs/move plays at very nearly the same level, if not exactly the same level, as with 300 secs/move. My tentative conclusion is that most of the bots are playing at this same “perfect” or imperceptibly-short-of-perfect level. (I'm reluctant to use the word “perfect” for a game with such a large random element, but that's probably just me.)
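
    As a quick back-of-the-envelope check of those self-test numbers, here is a normal-approximation sketch, assuming independent games and a fair-coin null hypothesis of no first-player advantage:

        from math import sqrt, erf

        def two_sided_p(wins, n, p0=0.5):
            """Normal-approximation two-sided p-value against win probability p0."""
            mu, sd = n * p0, sqrt(n * p0 * (1 - p0))
            z = (wins - mu) / sd
            return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

        # First-player wins in the two 1000-game self-tests quoted above.
        for wins in (521, 536):
            print(wins, round(two_sided_p(wins, 1000), 3))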

  • MarleysGhost at 2012-01-17

    > most of the bots are playing at this same “perfect” or imperceptibly short of perfect level.

    That would imply that bots from a certain set (“most”) are not significantly different in playing strength. Select two bots X and Y from the set and play a randomly selected initial state and sequence of die rolls with X=Red,Y=Blue and the same state and sequence again with X=Blue,Y=Red. Under the null hypothesis that the bots in the set are equivalent, the result is predicted to be 1-1. If repeating the procedure produces enough 2-0 results, you disprove the null hypothesis. I'm not sure how many “enough” would be.
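
    One way to quantify “enough”: under the null, every decisive (2-0) pair is equally likely to favour X or Y, so the number of sweeps follows a Binomial(n, 0.5) and an exact sign test applies. A minimal sketch (the pair counts at the end are hypothetical, purely to show usage):

        from math import comb

        def sign_test_p(x_sweeps, y_sweeps):
            """Two-sided exact binomial (sign) test on decisive duplicate pairs.

            Under the null that X and Y are equally strong, each 2-0 pair is
            equally likely to favour either bot, so sweeps ~ Binomial(n, 0.5).
            """
            n = x_sweeps + y_sweeps
            k = max(x_sweeps, y_sweeps)
            tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
            return min(1.0, 2 * tail)

        # Hypothetical example: 300 duplicated pairs, 80 decisive, 55 swept by X.
        print(sign_test_p(55, 25))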

  • YHW at 2012-01-17

    I think that all bots are far away from a perfect game (from start to the end).

    The current version of the “child” is the strongest version against OneStone ever (177 - 174 in single games). According to the closer-to-perfect thesis, the child should then also play more strongly against all other players, but that's not the case. It plays worse against most of the others.

    I think you need the 'perfect strategy' for the playing style of your opponent, and different styles are still observable. So no bot has the 'perfect strategy', but RoRoRo is definitely the strongest with respect to playing strength per unit of calculation time - it's unbelievably strong, but it's also not playing perfectly.

    The current child version is my internal No. 8; version 6 played stronger than or equal to RoRoRo (if there was no RoRoRo upgrade in the last month) - that version scored 106 - 103 in single games against RoRoRo, but played worse against OneStone, 100 - 112 in single games. So I could tune the child to play as strong as RoRoRo OR OneStone, but not both - at least not without changing its playing style depending on the opponent.

  • darse at 2012-01-19

    The variance in EwN is much higher than most imagine. I suspect that playing 1000-game matches between the top bots would still not be conclusive. We would need to use some powerful variance reduction techniques (like the ones discussed in my PhD thesis) to achieve statistical significance much faster. Duplicate games would certainly help, but once the positions diverge enough, there is random luck creeping into the outcome again.

    Two anecdotes to give you a feel for the problem of stochasticity:

    1. For the analysis of DIVAT (my variance reduction assessment method for poker), we ran a 100000-game match between two programs (Always_Raise vs Always_Call) that should give results similar to a fair coin flip. Yet one side won steadily for the whole match. We ran it again using a different random number generator, and the same player won again by a similar margin (~1 standard deviation). We were confused… but it was all noise. When we extended it to 400000 games, the other player started winning by a large margin, eventually taking the overall lead by ~1 standard deviation itself. DIVAT wasn't fooled – it said the players were dead even the whole time. Lesson: luck plays a *huge* role in the outcome, and randomness can deceive you.

    2. When i ran the International RoShamBo Programming Competitions, i ran more and more round robin tournaments in order to achieve statistically significant results. The better the Rock-Paper-Scissors bots got, the smaller the margins of difference became (of course). For one of the Best Of The Best series, i ran 250000 tournaments (of 1000 throws each) between the last two bots, and still had to declare the match a statistical draw.

    That is the direction we're headed in trying to determine the best EwN player. I wouldn't be surprised if Sybil is already playing within 1% of perfect play (in terms of average EV error). In fact, the author of Sybil could do an easy experiment to measure this. He could take a bunch of 7-piece positions, make a move choice without the aid of any databases, and then compare the EV of that move to the EV of the perfect move according to the database. That wouldn't say much about opening moves and positions with lots of pieces, but it would be interesting to know just how well Sybil is playing the middlegame without the database.

    Of course, making small mistakes isn't the same as playing perfectly – not at all. I'm sure the bots still make lots and lots of imperfect moves. Just look at the differing styles of Sybil (e.g. early c1-b2 diagonal moves), RoRoRo the Bot (c1-c2 moves in the same type of position), and OneStone (c1-c2, but different from RoRoRo in other kinds of positions). They can't all be playing perfectly, but in terms of EV, the differences may not count for very much. The same story has been seen in backgammon (another perfect information game with stochasticity), where the top players (both bot and human) are nearly indistinguishable in terms of EV.

  • lorentz at 2012-01-19

    Thanks for the great post, Darse. You managed to state much more formally what I was stumbling around informally. Specifically: (1) I think the reason I was (informally) reluctant to say “play perfectly” was because I was thinking that (formally) the moves the bots are making have extremely close expected values with, as you state, high variance. (2) I (informally) conjecture that the expected values are so close (and the variance is so high) that the level of play is indistinguishable for practical purposes here on LG. Further evidence of this is seen by looking at the ratings graphs of the bots (and players, for that matter) and observing the wild swings.

    P.S. “stochasticity” ? :)

  • lorentz at 2012-01-19

    P.P.S. OK. I guess I should have looked it up first. It seems “stochasticity” is a legitimate word. But it sure sounds funny!

  • Carroll at 2012-01-19

    Nice to have Darse on LG, welcome!

    If we speak of a random-walk competition, then it is a well-known fact that increasing the number of games N will only make the absolute lead grow (in an unknown direction), on average like sqrt(2N/Pi).
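
    That sqrt(2N/Pi) figure is the expected absolute position of a fair ±1 random walk after N steps; a minimal simulation sketch to check it:

        import random
        from math import sqrt, pi

        N, trials = 10000, 2000
        leads = []
        for _ in range(trials):
            s = sum(1 if random.random() < 0.5 else -1 for _ in range(N))
            leads.append(abs(s))

        print(sum(leads) / trials)   # empirical E|S_N|
        print(sqrt(2 * N / pi))      # theoretical, ~79.8 for N = 10000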

    But drawing an analogy with poker (even though Ewn is a perfect-information game), I think the bots' “perfect moves” are like a Nash equilibrium - like the 1/3, 1/3, 1/3 strategy in RoShamBo. Couldn't the bots do better by reading a player's trends of mistakes and exploiting them, thereby also becoming exploitable by better opponents…?

    I believe that is also what was achieved in RoShamBo, where the bots would exploit the non-randomness of human players or of exploitative bots, and would fall back to the NE only when they sensed they were starting to be exploited.

  • MarleysGhost at 2012-01-21

    A DIVAT-like tool would seem to be useful for evaluating Ewn programs. I read the DIVAT thesis, and it's fairly murky to me how to apply it to Ewn. I take it that the salient idea is to compare the test programs' results with the results of a standard algorithm whose success reflects the randomness (and in poker, the lack of perfect information) of the game.

    I guess the correspondences are poker hand=Ewn game; poker round (of betting)=Ewn move (maybe 2 ply); and poker monetary result=some kind of Ewn positional evaluation function.

  • darse at 2012-01-21

    To clarify: Ewn is not like poker and RoShamBo. Ewn is a perfect information game with an element of luck, like backgammon, and thus does not require a mixed strategy to hide information. In every position there is a correct move (or possibly more than one move with exactly the same expected value). Worse moves are worse, and would never be played by a perfect oracle. (Yes, one can aim to exploit an opponent's weaknesses by making sub-optimal moves, but that is also true in chess and most any other game).

    MG: There are several variance reduction techniques in the literature, some of which might be more suitable for this domain. I mentioned DIVAT only as an analogy, not thinking of any direct application for that particular type of tool. But yes, the idea would be to have some kind of standard baseline for comparison, and then measure the differences from the baseline in a statistically unbiased way. (There is a Wikipedia page on unbiased estimators, for those interested in what might be involved).

    BTW, the LG ratings for Ewn are kinda broken. As near as i can tell, the K factor is *way* too large (because of the randomness element), and matches of m games aren't handled appropriately. I wrote a draft on that topic, and will post it after i've polished it up a bit.

  • YHW at 2012-01-24

    I would also prefer a lower k factor. Look at all the rating curves of the players - there are too many peaks. It should be close to 3, or at least lower than 5.

  • Ingo Althofer at 2012-01-24

    > I would also prefer a lower k factor. Look at all the rating curves of the players - there are too many peaks. It should be close to 3, or at least lower than 5.

    Agreed. New players might start with a large k.

    But seniors should have a smaller one.

    What about, as a joke, setting k_senior = 3.1416 ?

    Just my 2.718 cent.

    Ingo.
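
    For context, a minimal sketch of the textbook Elo update, showing how directly k scales every rating swing (LG's exact formula may differ; this is only the standard rule):

        def elo_update(r_a, r_b, score_a, k):
            """Standard Elo: expected score from ratings, then a k-scaled correction."""
            expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
            return r_a + k * (score_a - expected_a)

        # Equal players, one win: k = 32 moves the winner up 16 points,
        # k = 3.1416 barely 1.6.
        for k in (32, 3.1416):
            print(k, elo_update(1500, 1500, 1.0, k) - 1500)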

  • Ingo Althofer at 2012-01-24

    Interesting discussion.

    Richard Lorentz wrote:

    > Side comment: I suspect, and am trying to gather more evidence to justify, that OneStone at 300 secs/move is playing as well as possible, at least to within a minuscule margin of error.

    That seems reasonable.

    > In fact, it seems that OneStone at 10 secs/move plays at very nearly the same level, if not exactly the same level, as with 300 secs/move.

    You need 10,000 self-play games to identify a difference of 1% in performance with statistical significance.

    > My tentative conclusion is that most of the bots are playing at this same “perfect” or imperceptibly-short-of-perfect level.

    No. Sybil_c is stronger than OneStone_c.

    *********************************************************

    Darse (welcome to the community!) wrote:

    > The variance in EwN is much higher than most imagine. I suspect that playing 1000-game matches between the top bots would still not be conclusive.

    Agreed.

    We already had some discussion on this during the Computer Olympiad in Tilburg. My feeling is that Ewn is a candidate for dropping out of the Olympiad again because of this.

    Things might turn out differently if tournaments were held with very fast thinking times, like in the RoShamBo computer competitions some years ago. In those events the bots had to complete 1,000 games within a second.

    Ingo.

  • YHW at 2012-01-24

    > You need 10,000 self-play games to identify a difference of 1% in performance with statistical significance.

    Is that linear? 5,000 => 2%, 2,500 => ~3%… or what exactly do you mean by performance, applied to the winning ratios against each other?

  • darse at 2012-01-24

    YHW: It's a square root function. If one game gives you a particular error margin, then you need four games to cut that margin in half, 100 games to cut it to one-tenth, and 10000 games to cut it to 0.01 (1%) of the 1-game outcome. 1 / sqrt(m) in general, where m is the length of the match.

    Since you quickly hit diminishing returns for doing more trials, games with randomness are a problem in terms of accurate assessment – unless one has a good variance reduction technique, or a way of evaluating the quality of decisions directly.
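
    A small sketch of that arithmetic (assuming independent games; near a 50% win rate the 1-sigma error of the observed rate is roughly 0.5/sqrt(m)):

        from math import sqrt, ceil

        def margin(m):
            """Approximate 1-sigma error of an observed win rate over m games."""
            return 0.5 / sqrt(m)

        def games_needed(target):
            """Games needed to shrink the 1-sigma margin to `target`."""
            return ceil((0.5 / target) ** 2)

        print(margin(10000))        # ~0.005, i.e. +/-0.5% after 10,000 games
        print(games_needed(0.005))  # 10000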

    Ingo: Have they dropped backgammon from the Computer Olympiad? That has been effectively a coin-toss for years, and the results are little more than trivia.

  • Dvd Avins at 2012-01-24

    Duplicate EWN (like Duplicate Bridge) would be interesting, both for bots and for humans. A die roll would not be revealed until both players had used the previous die.

    I would enjoy seeing that as a variant here.

  • darse at 2012-01-25

    I like the idea too, but i don't think it works the same as our duplicate poker matches did, and definitely not like duplicate bridge.

    My idea in the poker matches was that the entire match gets replayed with the cards reversed, with no memory of the previous match. Easy enough for computer programs. For the Man vs Machine matches, the humans were separated and played the whole match without knowledge of the other's results at any point. That's important, because if you knew your side was trailing, you could start to gamble more in order to catch up (or conversely, play safer if your side is ahead). It distorts the strategy compared to natural play.

    Technically, the participants also shouldn't know that they're playing a duplicate match! This is a subtle flaw with the live human poker matches. For example, if i know i've been lucky (say, winning some hands that i really had no business playing), then i can speculate that my side is ahead in the match, adjust accordingly, and be right more often than not. Second-guessing how the opponent might have played your hand is certainly a skill, but it's not the skill that we set out to measure, and undermines the reason for choosing a duplicate format.

    With Ewn, when the two positions diverge, they are still inexorably linked to each other. You could have slightly different positions where one side dominates (in the Pareto sense – can't lose both but might win both with a particular roll or sequence of rolls). Knowing that you've fallen into a worse pair of positions (say, after a key roll favoured the other guy), i can imagine that you might be forced into some weird unnatural lines just to unbalance the situation and gain some hope. Duplicate Ewn might be a fun variant, but it would be a different game.

    With duplicate bridge, you're not even competing against the people you're playing with. As a North-South pair, you're really competing against all the other North-South pairs in the room playing the same cards you have. I suppose you could set up an Ewn tournament that way, but it would be awkward to keep the dice information hidden for the whole round.

    In practice, duplicate might only make sense for static programs playing both sides of the dice stream. That would reduce variance, but certainly wouldn't eliminate it.

  • YHW at 2012-01-25

    @darse: Thanks for the response, but is that really valid for a game with random starting positions?

    Once, I started ten 7-point matches to test different versions of a bot, and I counted every single game. After a while I had the first results: 49 to 31 (61.2%). According to your formula I would be able to detect a difference in performance of about 11.3% with that number of 80 single games. So I could normally stop there. But a difference of 18 single games out of 80 can sometimes occur even between equal players. After calculating the variance over all 10 matches I got a standard deviation close to 5 - so, far away from stable results. How can you be sure to detect a difference in performance if the number of different starting positions is higher than the number of games you need? And can there still be a significant statistical error if the difference in performance is lower than 3 x the SD of the past games?

  • darse at 2012-01-25

    You could say that each starting position is a different game, and Ewn is a family of similar games.

    A single game of Ewn has *a lot* of inherent variance, compounded by unbalanced random starting positions (e.g. you might start the game with a 44% chance of winning, or 56% – there is some analysis that suggests it could actually be that much of a swing, or more).

    Regardless, whatever that 1-game variance value is, it will drop off at a rate of 1/sqrt(m) – that's just a property of statistical sampling, not of the game itself.
