# Cube Solver (need advice on the last layer)



## deepSubDiver (May 29, 2010)

I am currently working on a 3x3x3 cube solver (Java).
It works pretty well for the first two layers and averages 400 ms for them.

Since it is a pretty trivial brute-force search, a problem comes up with the last layer: theoretically I need a search width of 18 (3 moves for each of the 6 faces), which blows up the brute-force queue enormously.

All edges are oriented, which makes the LL a little easier.

Can anyone come up with an idea that helps divide the last layer into a few smaller problems? FYI, I don't want to use predefined algorithm tables for the solver.

For example, an idea I had:
1. Permute the corners.
2. Solve a 2x2x1 block on U with R and U only, preserving the corner permutation.
3. Because CP is solved, solving the block on U also solves the remaining 2 edges, which leaves me with a 3-corner cycle, which in the worst case takes 12 moves.

Edit: Wait, no. It leaves me with three unoriented but placed corners, which makes it quite awkward to solve.

More ideas, please


----------



## miniGOINGS (May 29, 2010)

deepSubDiver said:


> Theoretically, I need a search width of 18 (3 moves for each of the 6 faces), which blows up the brute-force queue enormously.
> 
> All edges are oriented, which makes the LL a little easier.



Why use all 6 sides for the LL? If the edges are already oriented, you could use <R, U, L> (or <R, U, D>), which require a search width of only 9.

Unless I'm completely not getting it...


----------



## Johannes91 (May 29, 2010)

Replace the "pretty trivial brute force" with a good search (pruning tables and transition tables). Also, mini's suggestion makes a lot of sense. Then, if it's not fast enough for 1-look LL, use some kind of 2-look. Code it so that trying different variants (OLL-PLL, COLL-EPLL, etc.) is easy, and choose the one that works best.
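For what it's worth, the pruning idea can be sketched on a toy puzzle: BFS from the solved state fills a table of exact distances, and IDA* cuts any branch where depth plus the table's distance exceeds the current limit. The puzzle, names, and encodings below are all made up for illustration; nothing here is cube-specific.

```java
import java.util.*;

// Toy "two-generator" puzzle: a string of 6 letters, with moves that cycle
// positions 0..3 or 2..5 (and their inverses). BFS from the solved state
// builds an exact distance table; IDA* then uses it as a pruning heuristic.
public class PruneSearch {
    static final String SOLVED = "ABCDEF";
    static final int NMOVES = 4;

    static String move(String s, int m) {
        char[] c = s.toCharArray();
        int lo = (m / 2 == 0) ? 0 : 2;          // which 4-cycle: positions lo..lo+3
        boolean inv = (m % 2 == 1);
        char tmp;
        if (!inv) { tmp = c[lo]; for (int i = 0; i < 3; i++) c[lo + i] = c[lo + i + 1]; c[lo + 3] = tmp; }
        else      { tmp = c[lo + 3]; for (int i = 3; i > 0; i--) c[lo + i] = c[lo + i - 1]; c[lo] = tmp; }
        return new String(c);
    }

    static final Map<String, Integer> dist = new HashMap<>();

    // BFS from the goal; since the move set is closed under inverses,
    // "distance from solved" equals "distance to solved".
    static void buildPruningTable() {
        dist.put(SOLVED, 0);
        ArrayDeque<String> q = new ArrayDeque<>();
        q.add(SOLVED);
        while (!q.isEmpty()) {
            String s = q.poll();
            for (int m = 0; m < NMOVES; m++) {
                String t = move(s, m);
                if (!dist.containsKey(t)) { dist.put(t, dist.get(s) + 1); q.add(t); }
            }
        }
    }

    // IDA*: depth-first to 'limit', pruned by the table; returns solution length or -1.
    static int search(String s, int depth, int limit) {
        int h = dist.get(s);                    // admissible heuristic
        if (depth + h > limit) return -1;       // the pruning step
        if (h == 0) return depth;
        for (int m = 0; m < NMOVES; m++) {
            int r = search(move(s, m), depth + 1, limit);
            if (r >= 0) return r;
        }
        return -1;
    }

    static int solve(String scramble) {
        for (int limit = 0; ; limit++) {
            int r = search(scramble, 0, limit);
            if (r >= 0) return r;
        }
    }

    public static void main(String[] args) {
        buildPruningTable();
        String scr = SOLVED;
        for (int m : new int[]{0, 2, 0, 3, 1}) scr = move(scr, m);
        System.out.println(solve(scr));         // optimal length, at most 5
    }
}
```

With an exact table the heuristic is perfect; in a real solver the table indexes only part of the state (EO, CP, ...), so it gives a lower bound rather than the exact distance.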


----------



## deepSubDiver (Jun 1, 2010)

Johannes, your suggestion with the pruning tables helped a lot - I currently average 750 ms for a complete solution. Would you mind enlightening me about transition tables? I just don't get the idea or how it could speed anything up.. :s


----------



## Johannes91 (Jun 1, 2010)

I'm guessing you store the cube state in several arrays (EP, CP, EO, CO) and manipulate those when making moves. That is also likely where >90% of the time is spent (at least that has always been the case for me). Transition tables speed this up: instead of arrays, the state is represented as several integers, and going to the next state takes just one lookup per integer -- something like ep = tt[ep][move].

If it's fast enough already then there's no need to bother. If it's not, you could start with profiling and see if doing and undoing moves is a bottleneck.
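The one-lookup-per-integer idea might look like the following minimal sketch, where a made-up toy move function stands in for real cube math (decode state, permute arrays, re-encode):

```java
// Sketch: precompute every transition once, so applying a move becomes a
// single array lookup, tt[state][move]. The move formula is a toy
// placeholder, NOT real cube moves.
public class TransitionDemo {
    static final int STATES = 2048;   // e.g. the 2^11 edge-orientation states
    static final int MOVES = 18;
    static final int[][] tt = new int[STATES][MOVES];

    // Stand-in for the slow path (decode state, permute arrays, re-encode).
    static int applyMoveSlow(int state, int move) {
        return (state * 13 + move * 7 + 1) % STATES;  // toy formula for illustration
    }

    static void build() {
        for (int s = 0; s < STATES; s++)
            for (int m = 0; m < MOVES; m++)
                tt[s][m] = applyMoveSlow(s, m);       // precompute every transition
    }

    public static void main(String[] args) {
        build();
        int[] seq = {3, 7, 11, 0, 16};
        int fast = 0, slow = 0;
        for (int m : seq) fast = tt[fast][m];         // fast path: one lookup per move
        for (int m : seq) slow = applyMoveSlow(slow, m); // slow path for comparison
        System.out.println(fast == slow);             // prints "true"
    }
}
```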


----------



## deepSubDiver (Jun 4, 2010)

This is exactly what I did. For each call of the IDA* I regenerate the next state 15 times (once for each possible move).
I am currently trying to build these transition tables but have enormous memory issues - how do I keep the tables small?
E.g., for each possible EO I need to store 18 pointers (one per move) to another EO. Technically, this requires at least 2*18*2^11 bytes. And this is just EO, which is the smallest table.


----------



## Matt S (Jun 4, 2010)

The following is unnecessary for tables this size. Ignore it.

At the expense of a few extremely fast bitwise operations, you can take advantage of a lot of redundancy in the tables, at least for orientation. Just write hashing functions (using bitwise operations) to extract the bits that will be affected, look them up in a MUCH smaller table, and then graft the changes back into the bit array (which is represented as an integer or a 2-byte short). This is a huge win over traditional arrays because of the amazing caching that will occur (the state integer will just sit in a register the whole time).

I haven't written a program like this, or thought too much about it, so maybe there is something more clever, but I agree that a full transition table seems impossible.


----------



## Stefan (Jun 4, 2010)

deepSubDiver said:


> E.g., for each possible EO I need to store 18 pointers (one per move) to another EO. Technically, this requires at least 2*18*2^11 bytes. And this is just EO, which is the smallest table.



Ooh, 72 KB... that's tough for a computer built, let's say, 30 years ago.

CO is equally trivial. CP with its 8! states is still trivial. EP with its 12! states is the only hard one, but you can just break it into two parts, first part saying where certain six edges are and the second part saying where the other six edges are. A transition table like int[12*11*10*9*8*7][18] would handle it and take about 50 MB. If for some reason you need it still smaller, just break EP into three parts.

Don't really understand Matt's suggestion, but it sounds bad.
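Indexing "where certain six edges are", as Stefan suggests, can be done with a mixed-radix rank over the not-yet-used slots. This is my own sketch of one possible encoding, not code from anyone in the thread:

```java
// Encode the slots of six tracked edges among the cube's 12 edge slots as a
// number in 0 .. 12*11*10*9*8*7 - 1 (= 0 .. 665279).
// positions[i] = slot (0..11) of the i-th tracked edge; slots are distinct.
public class PartialPermIndex {
    static int encode(int[] positions) {
        boolean[] used = new boolean[12];
        int index = 0;
        for (int i = 0; i < 6; i++) {
            // rank of positions[i] among the slots not yet taken
            int rank = 0;
            for (int s = 0; s < positions[i]; s++) if (!used[s]) rank++;
            used[positions[i]] = true;
            index = index * (12 - i) + rank;   // mixed radix: 12, 11, 10, ...
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(encode(new int[]{0, 1, 2, 3, 4, 5}));    // first arrangement: 0
        System.out.println(encode(new int[]{11, 10, 9, 8, 7, 6}));  // last arrangement: 665279
        System.out.println(12 * 11 * 10 * 9 * 8 * 7);               // table rows: 665280
    }
}
```

Since encode is a bijection onto 0..665279, it can index the 12*11*10*9*8*7 rows of the transition table directly.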


----------



## Matt S (Jun 4, 2010)

DOH!

It is bad, because I was a moron and didn't calculate how little space was actually required.

My suggestion works pretty well for compressing large transition tables, and I've used that method in other applications, but it's completely unnecessary here.


----------



## Stefan (Jun 4, 2010)

You could also split the cube state into something other than EO+EP+CO+CP (or rather EO+EPa+EPb+CO+CP when splitting EP into two parts). For example splitting it into four parts of 3 edges and 2 corners each. A transition table like int[24*22*20*24*21][18] could handle it all and take about 370 MB. That would need only four look-ups and their corresponding pruning values ought to be more valuable than those of smaller state-parts (particularly EO and CO pruning are probably useless most of the time).

And you could exploit symmetries, though I admit I haven't done that yet and it scares me a bit.

Maybe have a look at http://kociemba.org/cube.htm which explains these things.


----------



## rokicki (Jun 4, 2010)

*Big tables*

While the computer *can* easily store those large tables, remember that every access to such a large table (larger than, say, 10 MB) will involve a cache miss. These cache misses can be *very* slow. If you care about speed, I recommend factoring your representation so that all the transition tables, taken together, fit into your cache. But in all cases, if you care about speed, try a number of techniques and compare their speed. As an initial ballpark number, I'd try to keep the aggregate size of all transition tables below 1 MB.

With modern computers, the performance of many programs boils down to eliminating cache misses. You can frequently do hundreds of instructions in the time required to satisfy a single cache miss.

Even if you eliminate all the L3 cache misses, L2 cache misses can also be expensive.
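The point is easy to reproduce: random probes into an array much larger than the cache run far slower than the same probes into a small one. A rough microbenchmark sketch (sizes and loop counts are arbitrary choices of mine, not from this thread):

```java
import java.util.Random;

public class CacheMissDemo {
    // Sum n random entries of the table; with a large table nearly every
    // probe misses the cache, with a small one the table stays resident.
    static long probe(int[] table, int n, long seed) {
        Random r = new Random(seed);
        long sum = 0;
        for (int i = 0; i < n; i++) sum += table[r.nextInt(table.length)];
        return sum;
    }

    public static void main(String[] args) {
        int[] small = new int[1 << 14];   // 64 KB: fits comfortably in cache
        int[] big = new int[1 << 24];     // 64 MB: mostly cache misses
        for (int i = 0; i < small.length; i++) small[i] = i & 0xff;
        for (int i = 0; i < big.length; i++) big[i] = i & 0xff;

        int n = 10_000_000;
        probe(small, n, 1); probe(big, n, 1);   // warm-up pass
        long t0 = System.nanoTime();
        long s1 = probe(small, n, 42);
        long t1 = System.nanoTime();
        long s2 = probe(big, n, 42);
        long t2 = System.nanoTime();
        System.out.printf("small: %d ms, big: %d ms (sums %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1, s2);
    }
}
```

The exact ratio depends on the cache hierarchy of the machine, but the large-table pass is typically several times slower.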


----------



## deepSubDiver (Jun 4, 2010)

Matt S said:


> The following is unnecessary for tables this size. Ignore it.
> 
> At the expense of a few extremely fast bitwise operations, you can take advantage of a lot of redundancy in the tables, at least for orientation. Just write hashing functions (using bitwise operations) to extract the bits that will be affected, look them up in a MUCH smaller table and the graft the changes back into the bitarray (which is represented as an integer or 2-byte short). This is a huge win over traditional arrays because of the amazing caching that will occur (the state integer will just sit in a register the whole time).
> 
> I haven't written a program like this, or thought too much about it, so maybe there is something more clever, but I agree that a full transition table seems impossible.


I absolutely don't get it.



StefanPochmann said:


> deepSubDiver said:
> 
> 
> > E.g., for each possible EO I need to store 18 pointers (one per move) to another EO. Technically, this requires at least 2*18*2^11 bytes. And this is just EO, which is the smallest table.
> ...


----------



## Matt S (Jun 4, 2010)

> I absolutely don't get it.



Like I said, it's not relevant to you, because I didn't realize how small the tables for orientation would be.

It was basically a very hastily written suggestion that you could use hashing to eliminate redundancy and drastically shrink the size of the lookup tables. The smaller tables would be better for caching, but that probably isn't worth the extra hassle nor the dozen or so bitwise operations you might need to extract the relevant bits.


----------



## Stefan (Jun 4, 2010)

rokicki said:


> With modern computers, the *performance of many programs boils down to eliminating cache misses*. You can frequently do *hundreds of instructions in the time required to satisfy a single cache miss*.



I tried to test this by accessing entries of an array some billion times, but in the end gave up as I just don't know enough about my computer's caching. Largest factor I found between a small array and a large array was about 10, I think (12 seconds vs two minutes). So yes, a significant speed difference.

But how much do you expect it to matter in a program like a cube solver? Smaller tables means more look-ups to cover the whole cube state, so that counterbalances the speedup to some extent already. More importantly, this is just the transition tables. I would use pruning tables as well, using the state value as index, and I don't see how to make those small without losing value (leads to later and less pruning). If I use large tables for pruning anyway, making the transition tables very fast doesn't help so much, does it? If I speed up "half" of the job a gazillion times so it takes zero time, my total speedup is still just factor 2.


----------



## Lucas Garron (Jun 5, 2010)

StefanPochmann said:


> rokicki said:
> 
> 
> > With modern computers, the *performance of many programs boils down to eliminating cache misses*. You can frequently do *hundreds of instructions in the time required to satisfy a single cache miss*.
> ...



Callgrind?
gprof?

Anyhow, this actually matters for me in practice this summer. Very.


----------



## rokicki (Jun 5, 2010)

StefanPochmann said:


> I tried to test this by accessing entries of an array some billion times, but in the end gave up as I just don't know enough about my computer's caching. Largest factor I found between a small array and a large array was about 10, I think (12 seconds vs two minutes). So yes, a significant speed difference.



Depends on the table size and how you access them. But the difference can be as high as a factor of 100.



StefanPochmann said:


> But how much do you expect it to matter in a program like a cube solver? Smaller tables means more look-ups to cover the whole cube state, so that counterbalances the speedup to some extent already. More importantly, this is just the transition tables. I would use pruning tables as well, using the state value as index, and I don't see how to make those small without losing value (leads to later and less pruning). If I use large tables for pruning anyway, making the transition tables very fast doesn't help so much, does it? If I speed up "half" of the job a gazillion times so it takes zero time, my total speedup is still just factor 2.



It will make a huge difference. You just need to test it.

You really want the only L3 cache miss to be in the main pruning table lookup itself; everything else should fit in cache. Don't forget, *every* large table lookup is expensive. Your cost to "evaluate" a position (and the move that took you to it) is the sum of the cost of the cache misses, times the number of the cache misses, plus the instruction count, very roughly. If you need to look up five coordinates in large tables, that's six expensive cache misses.

Like I said, you need to write the code and test it out and try different alternatives. The code I use right now uses very tiny tables and gets very good speed. On an i7-920, I get about 21 optimal solves a minute for random positions. I believe this is significantly faster than any other existing optimal solver for the 3x3x3.


----------



## Stefan (Jun 5, 2010)

Lucas Garron said:


> Callgrind?
> gprof?


That first one's cache simulation looks interesting...



rokicki said:


> It will make a huge difference. You just need to test it.


Ok, I will, next time I write a real program like that (not a shitty test trying to "simulate" one like I did yesterday).



rokicki said:


> On an i7-920, I get about 21 optimal solves a minute for random positions.


That sounds pretty good, yes. How much memory does it use? And any plans to publish it? (if you haven't and I just missed it)


----------



## rokicki (Jun 6, 2010)

StefanPochmann said:


> rokicki said:
> 
> 
> > On an i7-920, I get about 21 optimal solves a minute for random positions.
> 
> 
> That sounds pretty good, yes. How much memory does it use? And any plans to publish it? (if you haven't and I just missed it)

I plan to publish it soon, for some definition of soon. The optimal solver (which handles both the half-turn and quarter-turn metrics) can use pruning tables of various sizes; the currently supported options are 471 MB, 1.4 GB, 2.5 GB, and 7.5 GB. The best times are with the 7.5 GB pruning tables, but even 471 MB gets fairly good performance. As I get access to machines with more memory, I plan to extend the supported pruning tables.

This is the optimal solver I used to optimally solve 1,000,000 random positions in both QTM and HTM.

I wrote up some notes on the current performance of solvers and the like for the Gathering for Gardner; I've put the notes at http://tomas.rokicki.com/g4g9.pdf if anyone is interested.


----------



## Stefan (Jun 6, 2010)

rokicki said:


> I've put the notes at http://tomas.rokicki.com/g4g9.pdf if anyone is interested.



Thanks 

Now I'm quite curious about your pruning table. You write that you get an effective distance of about 13 with an 8 GB pruning table. But even using mod M + inv, there are 5,538,068,321,152 positions at exactly 13 turns; even with just 1 bit per position, that's 644 GB. Do I misunderstand something, or does your pruning table employ other tricks besides symmetry and bit packing?


----------



## rokicki (Jun 6, 2010)

Dang, I wish I'd had you review that document before I sent it off. The average effective depth is 11.87 (almost 12); the 13 figure is for the QTM, so the 13 is a mistake in the paper. It doesn't change any of the results, however.

But I do use a number of tricks. The first is that my pruning table index is not a hash of the whole cube position, but a hash of only portions of it. Selecting the correct attributes to prune by is the black magic of pruning table construction. (The ultimate example of this is the tiny tiny pruning table generated by considering the corners only, without centers; this can be reduced dramatically to, I think, only about 1000 entries, yet the depth stays quite high.)

Secondly, I apply the pruning table to the same position multiple times, in different orientations. Thus I don't get the full symmetry reduction, but this is mitigated by the fact that multiple probes raise the effective depth.

I'll be writing all of this up, including how I selected the pruning tables and all the tricks I used.
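The multiple-orientation probing might look roughly like the following. Everything here (the table contents, hash, and reorient functions) is a toy stand-in of mine, not rokicki's code; the point is only that the maximum over several admissible probes is itself an admissible bound:

```java
// Probe one pruning table under several orientations of the same position
// and keep the largest (tightest) lower bound.
public class MultiProbe {
    static final byte[] table = new byte[64];                 // toy pruning table

    static int hash(int pos) { return pos & 63; }             // toy index function
    static int reorient(int pos, int o) {                     // toy "symmetry" transform
        return Integer.rotateLeft(pos, 8 * o);
    }

    static int lowerBound(int pos) {
        int best = 0;
        for (int o = 0; o < 4; o++)                           // probe each orientation
            best = Math.max(best, table[hash(reorient(pos, o))]);
        return best;   // max of admissible bounds is still admissible
    }

    public static void main(String[] args) {
        table[5] = 7; table[42] = 3;                          // toy distances
        System.out.println(lowerBound(5));                    // prints 7
    }
}
```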


----------



## Stefan (Jun 6, 2010)

rokicki said:


> Dang, I wish I'd had you review that document before I sent it off.



I'm so proud right now 
The 40 million evaluations mentioned in the next paragraph -- is that for HTM, QTM, or both?



rokicki said:


> The first is, my pruning table index is not a hash of the whole cube position, but a hash of only portions of the cube position.



Damn, I never thought of using a hash at all. In the programs I've written so far, I always just used the state values (edge orientation number, corner permutation number -- what I used for the transition tables as well) directly as indices. And yeah, I think I know what you mean by black magic. For example, when using the positions of five edges and two corners for pruning, should I pick pieces making up a compact 2x2x3 block, or is it better to select separated pieces? I think I did some tests on this a while back but never got around to writing a solver then...



rokicki said:


> I'll be writing all of this up, including how I selected the pruning tables and all the tricks I used.



Looking forward to that.


----------

