# 4x4x4 Brute Force Solver Speed Test



## unsolved (Nov 25, 2014)

I decided my 4x4x4 brute force solving program was too slow, so I am experimenting with different versions and different optimizations to try and get it to run faster.







If you could, download this small console app (requires 64-bit hardware and 64-bit OS) from http://lightningcloudcomputing.com/OO_4x4x4_speed.zip and run this one minute test, I would appreciate it. Let me know your CPU and its speed, and the performance numbers you got at the end of the test. Example:

*Intel E5-2687W @ 3.1 GHz. Total time = 75 seconds. Speed = 28,851,005 nodes second.*

Thanks.

P.S. For the programmers out there, I managed to write code so that only 6 lines of code are executed to generate any given move.


----------



## Carrot (Nov 26, 2014)

*Intel i7-3820 @ 4.3GHz. Total time= 57. Speed = 38,461,090 nodes second.*


----------



## unsolved (Nov 26, 2014)

Carrot said:


> *Intel i7-3820 @ 4.3GHz. Total time= 57. Speed = 38,461,090 nodes second.*



Thanks for posting so quickly. And nice system.


----------



## MWilson (Nov 26, 2014)

Test completed requiring 52 seconds to examine 2176782336 nodes for an average of 41074464 per second

Intel i5-2500K @ 3.30 GHz


----------



## Jakube (Nov 26, 2014)

Test completed requiring 63 seconds to examine 2176782336 nodes for an average of 34288136 per second

Intel i7-4600M @ 2.90 GHz


----------



## cuBerBruce (Nov 26, 2014)

Intel Core i7-920 @ 2.67 GHz Total time = 85 seconds. Speed = 25,365,397 nodes per second.

I believe actual CPU speeds during the test was 2.793 GHz (mostly) and 2.926 GHz.


----------



## StachuK1992 (Nov 26, 2014)

*Intel i7-4910MQ @ 2.9 GHz. Total time = 45 seconds. Speed = 47,442,022 nodes second.

Edit: This was with my work laptop.
Later tonight, I'm going to rent an Azure VM with very high CPU and try it out on there. Will post results separately.*


----------



## unsolved (Nov 26, 2014)

cuBerBruce said:


> Intel Core i7-920 @ 2.67 GHz Total time = 85 seconds. Speed = 25,365,397 nodes per second.
> 
> I believe actual CPU speeds during the test was 2.793 GHz (mostly) and 2.926 GHz.



Thanks Bruce. Any idea how this compares to your own app's solving speeds when run in full-width search mode without accessing pruning tables? My numbers are all single core results, no multi-threading turned on as of yet.

In other news, the new program finally handles rotations without issue. Your old scramble (slightly changed):






The notation I use is for my development purposes. I can't deal with R and L being oppositely "primed" although turned in the same direction from the perspective of the cube holder (ditto for the other opposing faces and slices). So the scramble would be:

*L l r' F f b' R r l' F2 f2 b2 L l r' U u d' R2 r2 l2* and the solution is *L2 U' B' D2 B' L' U'*

I added the move *r' *after your second set of *L l* from your original post because the longer scramble can be solved 1-ply faster. Essentially each cluster of 3 moves requires only 1 turn to be solved since they are all parallel and all spinning the same way (which is why I write *L- l- r-* to express the moves *L l r'* -- I treat the _left_ as adopting the notation of the _right_.)

On a final note: the solving times are all long because of the 36^n branching factor. This is just the move generator calling itself in full-width search mode without any intelligence guiding it, so it will make a move sequence such as U U' U U U U' during a 6-ply search.

This was just to see how the raw speed handles.



StachuK1992 said:


> *Intel i7-4910MQ @ 2.9 GHz. Total time = 45 seconds. Speed = 47,442,022 nodes second.
> 
> Edit: This was with my work laptop.
> Later tonight, I'm going to rent an Azure VM with very high CPU and try it out on there. Will post results separately.*



I did not realize the i7-4910MQ was that efficient. I hope you are not renting the machine just to run this test program on  But I do appreciate the info. I was on West Chester Pike today during the snow, but no further out than Goshen township. I wasn't too far from you though.



Jakube said:


> Test completed requiring 63 seconds to examine 2176782336 nodes for an average of 34288136 per second
> 
> Intel i7-4600M @ 2.90 GHz



Thanks Jake. By the way, I watched your blindfold 6x6x6 solve on YouTube. If that was not a trick, that is amazing  I have no idea how a blindfold solve is even possible. I have a tough enough time doing a blindfold scramble!



MWilson said:


> Test completed requiring 52 seconds to examine 2176782336 nodes for an average of 41074464 per second
> 
> Intel i5-2500K @ 3.30 GHz



I remember when these chips first came out and everyone was excited about overclocking them to 5.0 GHz with little more than a MegaShadow air cooler with incredible heat sinks. A pretty fast chip at a great price. Thanks for posting.


----------



## StachuK1992 (Nov 26, 2014)

unsolved said:


> I did not realize the i7-4910MQ was that efficient. I hope you are not renting the machine just to run this test program on  But I do appreciate the info. I was on West Chester Pike today during the snow, but no further out than Goshen township. I wasn't too far from you though.



Yes, I've noticed that we're quite close together. I'm not really in WC anymore; I'm more or less in the Exton/Chester Springs/Glenmoore area, but still pretty close.

The 4910MQ is my work laptop. As for the Azure instance I'll likely fire up, I get $150/month in free Azure credit, so it's worth 'paying' $2-3 for a 16-core machine


----------



## unsolved (Nov 26, 2014)

StachuK1992 said:


> Yes, I've noticed that we're quite close together. I'm not really in WC anymore; I'm more or less in the Exton/Chester Springs/Glenmoore area, but still pretty close.
> 
> The 4910MQ is my work laptop. As for the Azure instance I'll likely fire up, I get $150/month in free Azure credit, so it's worth 'paying' $2-3 for a 16-core machine



That's a great deal! I have a 16-core E5-2687W but the cores are comparatively slow. And they each get to around 92 Celsius under full workload. It's nice on days like today though, my home office is nice and warm.
I have my box doing a 15-way hunt for some 900,000+ digit primes that start with the number 7, end with the number 1, and contain nothing but 0s in the middle.


----------



## cuBerBruce (Nov 27, 2014)

unsolved said:


> Thanks Bruce. Any idea how this compares to your own app's solving speeds when run in full-width search mode without accessing pruning tables? My numbers are all single core results, no multi-threading turned on as of yet.



I wrote some test code to apply a 108-move scramble 20 million times. The moves were single-layer moves. On my Core i7-920, it took 2 minutes and 4 seconds. That works out to about 17,419,000 moves per second. This was using the GNU 64-bit C++ compiler. So oo4x4x4's move code is about 50% faster on my machine assuming the move code was dominating its speed test run time, perhaps more if significant time was being spent elsewhere.

I note my code uses a cubie level rather than a facelet level representation. A cubie level is probably better for generating pruning table indices, and the speed of generating pruning table indices may be as important as the raw speed of making moves.

My code could be optimized better. For instance, the basic move routine makes three function calls, one for moving each piece type. (This is for the case of single-layer moves. Multiple-layer moves are more involved.)

By the way, speaking of single-threaded or multithreaded code, if you add multithreading support, I really hope you allow the user to have control over how many (cpu intensive) threads are allowed. I will not want other programs on my computer impacted terribly by your program trying to make use of every available virtual core.



unsolved said:


> The notation I use is for my development purposes. I can't deal with R and L being oppositely "primed" although turned in the same direction from the perspective of the cube holder (ditto for the other opposing faces and slices). So the scramble would be:
> 
> *L l r' F f b' R r l' F2 f2 b2 L l r' U u d' R2 r2 l2* and the solution is *L2 U' B' D2 B' L' U'*
> 
> I added the move *r' *after your second set of *L l* from your original post because the longer scramble can be solved 1-ply faster. Essentially each cluster of 3 moves requires only 1 turn to be solved since they are all parallel and all spinning the same way (which is why I write *L- l- r-* to express the moves *L l r'* -- I treat the _left_ as adopting the notation of the _right_.)



For your internal code, there is nothing wrong with using viewing the cube from the U/R/F sides for turn directions. But if you want the code to be embraced by the speedcubing community, you should have (at least as an option) the input and output be in terms of conventions of the speedcubing community.

The speedcubing community conventions started from Singmaster's notation for the 3x3x3 cube, where basically only face turns were considered by Singmaster to matter. By having the turn direction being determined by looking at the face being moved, there was no need to have to remember which three faces were considered as preferred faces. This was the big advantage of defining the face turns the way they were. This, of course, broke down when people wanted to use (central) inner layer turns and whole cube rotations, where there is no avoiding having three faces serving as reference faces. To make matters worse, the standards that developed for cube rotations and (central) inner layer turns were not made consistent with each other (except for S and z).


----------



## unsolved (Nov 27, 2014)

cuBerBruce said:


> I wrote some test code to apply a 108-move scramble 20 million times. The moves were single-layer moves. On my Core i7-920, it took 2 minutes and 4 seconds. That works out to about 17,419,000 moves per second. This was using the GNU 64-bit C++ compiler. So oo4x4x4's move code is about 50% faster on my machine assuming the move code was dominating its speed test run time, perhaps more if significant time was being spent elsewhere.I note my code uses a cubie level rather than a facelet level representation. A cubie level is probably better for generating pruning table indices, and the speed of generating pruning table indices may be as important as the raw speed of making moves.



Thanks. I wasn't sure how much is gained at the cubie level. My own attempts did not do that well, perhaps I am not understanding it. And, once conjoined with other parts of my code, the cubie-ish stuff I wrote turned out to be a detriment, as I had to make a lot of back-and-forth conversions between move generator output and pruning data/TFS data.



cuBerBruce said:


> My code could be optimized better. For instance, the basic move routine makes three function calls, one for moving each piece type. (This is for the case of single-layer moves. Multiple-layer moves are more involved.)



Yes, I know the feeling. I wasted a great deal of time profiling to try and eek out a few points worth of performance here and there. That's why I started over with a blank sheet of paper. The old code hit a max of 92,000,000 moves per second using all 16-cores in the most parallel version I had up and running on this machine, but I forget what the single core number was. A linear division by 16 is not perfectly applicable, but doing that would have it at around 5,750,000 per second on my 3.1 GHz cores. I think the speed was right around 10,000,000 per second on one core, so the rewrite from matrices to the new 64-bit boards resulted in a 2.8 times speed increase.

Using bitboards, every single move (no block turns or *compound moves*) is exactly 6 instructions.

Making the face move R:


```
BITBOARD_4x4x4_CUBE R_PLUS(BITBOARD_4x4x4_CUBE original_4x4x4_cube)
{
 BITBOARD_4x4x4_CUBE new_cube;

 /**************************************************************************/
 /* Rotating the right will change something on every face except the left */
 /**************************************************************************/
 new_cube._4x4x4_left = original_4x4x4_cube._4x4x4_left;

 /********************************************************************************************/
 /* Entire row of cubes moves from front -----> top -----> back -----> bottom -----> front   */
 /* new top = old front; new back = old top; new bottom = old back; new front = old bottom   */ 
 /********************************************************************************************/
 new_cube._4x4x4_top  = (original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_NOT_COL_04)  | (original_4x4x4_cube._4x4x4_front & BITMASK_CUBE_FACE_COL_04);
 new_cube._4x4x4_back = (original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_NOT_COL_01) | ((original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_SQUARE_16) << 60)   | ((original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_SQUARE_12) << 28)   | ((original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_SQUARE_08) >> 4) | ((original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_SQUARE_04) >> 36);
 new_cube._4x4x4_bottom = (original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_NOT_COL_04) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_01) >> 60) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_05) >> 28) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_09) << 4) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_13) << 36);
 new_cube._4x4x4_front  = (original_4x4x4_cube._4x4x4_front & BITMASK_CUBE_FACE_NOT_COL_04)   | (original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_COL_04);

 /*********************************************************************************/
 /*                                                                               */
 /* THESE CUBE LOCATIONS BECOME....              THESE CUBE LOCATIONS             */
 /*                                                                               */
 /*                                                                               */
 /*********************************/             /*********************************/
 /*  01   *  02   *  03   *   04  */             /*  13   *  09   *  05   *   01  */
 /*********************************/             /*********************************/
 /*  05   *  06   *  07   *   08  */             /*  14   *  10   *  06   *   02  */
 /*********************************/             /*********************************/
 /*  09   *  10   *  11   *   12  */             /*  15   *  11   *  07   *   03  */
 /*********************************/             /*********************************/
 /*  13   *  14   *  15   *   16  */             /*  16   *  12   *  08   *   04  */
 /*********************************/             /*********************************/
 /*                                                                               */
 /*                                                                               */
 /*********************************************************************************/

 new_cube._4x4x4_right = ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_13) << 48)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_09) << 28)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_05) << 8)    | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_01) >> 12)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_14) << 36)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_10) << 16)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_06) >> 4)    | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_02) >> 24)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_15) << 24)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_11) << 4)    | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_07) >> 16)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_03) >> 36)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_16) << 12)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_12) >> 8)    | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_08) >> 28)   | \
 ((original_4x4x4_cube._4x4x4_right & BITMASK_CUBE_FACE_SQUARE_04) >> 48);

 return new_cube;
}
```

The rotation code is technically still one instruction even though it is cascading bit-ANDs and bit-shifts.

Making the slice move r'


```
BITBOARD_4x4x4_CUBE r_MINUS(BITBOARD_4x4x4_CUBE original_4x4x4_cube)
{
 BITBOARD_4x4x4_CUBE new_cube;

 /********************************************************************/
 /* Rotating the right slice will not change the right or left faces */
 /********************************************************************/
 new_cube._4x4x4_right = original_4x4x4_cube._4x4x4_right;
 new_cube._4x4x4_left  = original_4x4x4_cube._4x4x4_left;

 /*********************************************************************************************/
 /* Entire columns of cubes move from top -----> front -----> bottom -----> back -----> top   */
 /* new front = old top; new bottom = old front; new back = old bottom; new top = old back    */ 
 /*********************************************************************************************/
 new_cube._4x4x4_top = (original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_NOT_COL_03) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_14) << 44) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_10) << 12) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_06) >> 20) | ((original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_SQUARE_02) >> 52);
 new_cube._4x4x4_back = (original_4x4x4_cube._4x4x4_back & BITMASK_CUBE_FACE_NOT_COL_02) | ((original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_SQUARE_15) << 52) | ((original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_SQUARE_11) << 20) | ((original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_SQUARE_07) >> 12) | ((original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_SQUARE_03) >> 44);
 new_cube._4x4x4_bottom  = (original_4x4x4_cube._4x4x4_bottom & BITMASK_CUBE_FACE_NOT_COL_03) | (original_4x4x4_cube._4x4x4_front & BITMASK_CUBE_FACE_COL_03);
 new_cube._4x4x4_front  = (original_4x4x4_cube._4x4x4_front & BITMASK_CUBE_FACE_NOT_COL_03) | (original_4x4x4_cube._4x4x4_top & BITMASK_CUBE_FACE_COL_03);

 return new_cube;
}
```

Even though the slice turns make no rotations of whole faces, they are still 6 lines of code, since 2 entire unaffected faces must have their 64-bit data copied as opposed to only 1 for face turns.



cuBerBruce said:


> By the way, speaking of single-threaded or multithreaded code, if you add multithreading support, I really hope you allow the user to have control over how many (cpu intensive) threads are allowed. I will not want other programs on my computer impacted terribly by your program trying to make use of every available virtual core.



I have all of your notes from prior posts in separate notebook. I added the code to specify which TFS databases to load, and I will have another interface option when a multi-core CPU is detected. I'm just going to have base-2 core counts as choices for those with the available cores (1,2,4,8,16,32) although a sweet 60-core server is available in the Intel Xeon E7-8895 v2 which has 15 CPU cores and 4-ways (even 8-way) capability. I've built a few 4-way systems but no 8-socket motherboards are out yet that I am aware of.



cuBerBruce said:


> For your internal code, there is nothing wrong with using viewing the cube from the U/R/F sides for turn directions. But if you want the code to be embraced by the speedcubing community, you should have (at least as an option) the input and output be in terms of conventions of the speedcubing community.



I coded with conversion to input/output in mind. When I "think" about the cube, I need to use my own system. Remember all of the errors in my earlier databases? That was mostly from not handling L and B properly. I had to remember "L = opposite of R" but when the move generator makes a move, the Turns-From-Solved database needs to reverse the move in order to solve it. It gave me fits! For some reason I had no problems dealing with D moves, must have been just a mental block on my part.



cuBerBruce said:


> The speedcubing community conventions started from Singmaster's notation for the 3x3x3 cube, where basically only face turns were considered by Singmaster to matter. By having the turn direction being determined by looking at the face being moved, there was no need to have to remember which three faces were considered as preferred faces. This was the big advantage of defining the face turns the way they were. This, of course, broke down when people wanted to use (central) inner layer turns and whole cube rotations, where there is no avoiding having three faces serving as reference faces. To make matters worse, the standards that developed for cube rotations and (central) inner layer turns were not made consistent with each other (except for S and z).



I am OK with it. It is my own subjective opinion that my 4x4x4 notation makes more sense. But the 5x5x5 notation system has additional absurdities. 

1. Middle slices are capital letters. That goes against the grain of all other slices being lower case.
2. By definition the middle slice is non-orientable. If you define a primed center slice move to follow a particular face, then why not continue along the same direction and adopt the convention I use?
3. My 5x5x5 notation solves the conundrum in item #2 above, and the implementation and execution thereof causes no pain to the cuber.

I am leaning towards only having my notation system in the 5x5x5 brute force solver but that is subject to change.


----------



## qqwref (Nov 27, 2014)

When you say "6 instructions" you mean 6 lines of code, right? Because some of those lines are very long and I really doubt they compile to a single assembly instruction 

As for moves, why not just make a simple translation table for printing moves? You only have to make that table once, and there are a small finite number of possible moves, and then afterwards you never have to worry about it again (and nobody will complain). Plus, there's no performance impact if you spend the vast majority of your time not printing moves.


----------



## cuBerBruce (Nov 27, 2014)

qqwref said:


> When you say "6 instructions" you mean 6 lines of code, right? Because some of those lines are very long and I really doubt they compile to a single assembly instruction



64-bit processors have some fancy instructions. The SSSE3 set of extended instructions contains a "shuffle" instruction (PSHUFB or _mm_shuffle_epi8) that can do pretty fancy permutations as one instruction. I've been wondering if _unsolved_ has been checking that the compiler generates instructions such as this one from his source. He has been somewhat vague about what actual code is generated.


----------



## unsolved (Nov 27, 2014)

qqwref said:


> When you say "6 instructions" you mean 6 lines of code, right? Because some of those lines are very long and I really doubt they compile to a single assembly instruction



Yes, I was using 1 instruction = 1 line of code.

Look how many lines of code were in the old matrix program: 32 instructions per face move. Getting it down to only 6 was a big improvement 


```
MATRIX_4x4x4_CUBE R_PLUS(MATRIX_4x4x4_CUBE original_4x4x4_cube)
{
	MATRIX_4x4x4_CUBE new_cube;

	new_cube = original_4x4x4_cube;

	new_cube.cube_top[4-1] = original_4x4x4_cube.cube_front[4-1];
	new_cube.cube_top[8-1] = original_4x4x4_cube.cube_front[8-1];
	new_cube.cube_top[12-1] = original_4x4x4_cube.cube_front[12-1];
	new_cube.cube_top[16-1] = original_4x4x4_cube.cube_front[16-1];

	new_cube.cube_front[4-1] = original_4x4x4_cube.cube_bottom[4-1];
	new_cube.cube_front[8-1] = original_4x4x4_cube.cube_bottom[8-1];
	new_cube.cube_front[12-1] = original_4x4x4_cube.cube_bottom[12-1];
	new_cube.cube_front[16-1] = original_4x4x4_cube.cube_bottom[16-1];

	new_cube.cube_bottom[4-1] = original_4x4x4_cube.cube_back[13-1];
	new_cube.cube_bottom[8-1] = original_4x4x4_cube.cube_back[9-1];
	new_cube.cube_bottom[12-1] = original_4x4x4_cube.cube_back[5-1];
	new_cube.cube_bottom[16-1] = original_4x4x4_cube.cube_back[1-1];

	new_cube.cube_back[1-1] = original_4x4x4_cube.cube_top[16-1];
	new_cube.cube_back[5-1] = original_4x4x4_cube.cube_top[12-1];
	new_cube.cube_back[9-1] = original_4x4x4_cube.cube_top[8-1];
	new_cube.cube_back[13-1] = original_4x4x4_cube.cube_top[4-1];

// rotation instructions below

	new_cube.cube_right[1-1] = original_4x4x4_cube.cube_right[13-1];
	new_cube.cube_right[2-1] = original_4x4x4_cube.cube_right[9-1];
	new_cube.cube_right[3-1] = original_4x4x4_cube.cube_right[5-1];
	new_cube.cube_right[4-1] = original_4x4x4_cube.cube_right[1-1];

	new_cube.cube_right[5-1] = original_4x4x4_cube.cube_right[14-1];
	new_cube.cube_right[6-1] = original_4x4x4_cube.cube_right[10-1];
	new_cube.cube_right[7-1] = original_4x4x4_cube.cube_right[6-1];
	new_cube.cube_right[8-1] = original_4x4x4_cube.cube_right[2-1];

	new_cube.cube_right[9-1] = original_4x4x4_cube.cube_right[15-1];
	new_cube.cube_right[10-1] = original_4x4x4_cube.cube_right[11-1];
	new_cube.cube_right[11-1] = original_4x4x4_cube.cube_right[7-1];
	new_cube.cube_right[12-1] = original_4x4x4_cube.cube_right[3-1];

	new_cube.cube_right[13-1] = original_4x4x4_cube.cube_right[16-1];
	new_cube.cube_right[14-1] = original_4x4x4_cube.cube_right[12-1];
	new_cube.cube_right[15-1] = original_4x4x4_cube.cube_right[8-1];
	new_cube.cube_right[16-1] = original_4x4x4_cube.cube_right[4-1];

	return new_cube;
}
```



qqwref said:


> As for moves, why not just make a simple translation table for printing moves? You only have to make that table once, and there are a small finite number of possible moves, and then afterwards you never have to worry about it again (and nobody will complain). Plus, there's no performance impact if you spend the vast majority of your time not printing moves.



I have that for 4x4x4. But in the case of the 5x5x5 program, my current thought process is that it's time to ditch the old notation. It's far too cumbersome. I can't really endorse it and encourage continuing something so obtuse. If I come up with a "perfect solution" I roll it out with my program, which people are free to accept or reject. If I am still undecided when the time comes, I'll begrudgingly support the old notation. Still up in the air on it.



cuBerBruce said:


> 64-bit processors have some fancy instructions. The SSSE3 set of extended instructions contains a "shuffle" instruction (PSHUFB or _mm_shuffle_epi8) that can do pretty fancy permutations as one instruction. I've been wondering if _unsolved_ has been checking that the compiler generates instructions such as this one from his source. He has been somewhat vague about what actual code is generated.



I haven't peeked into the object code, so I don't know. I just perform benchmarks and report the results. I provide the C-code so the adventurous can compile it and see what happens.


----------



## qqwref (Nov 27, 2014)

Fair enough, Bruce, if the compiler can deal with instructions like that. It should be possible to make a solver (for 4x4x4 or a smaller puzzle) that would apply permutations using that, but it would require some messy custom data structures to store the position data, and I dunno if it would be faster.



unsolved said:


> I have that for 4x4x4. But in the case of the 5x5x5 program, my current thought process is that it's time to ditch the old notation. It's far too cumbersome. I can't really endorse it and encourage continuing something so obtuse. If I come up with a "perfect solution" I roll it out with my program, which people are free to accept or reject. If I am still undecided when the time comes, I'll begrudgingly support the old notation. Still up in the air on it.


You can think what you like, but I haven't seen a single person agree with your proposed notation. If you intend for your program to be used by others, I don't understand why you wouldn't just take a few minutes to dump in a conversion table. You could even make it a command-line (or #define) option so you don't have to actually look at our notation.

I'm not sure why you still find L' and R confusing (do you not have a physical cube to play with, perhaps?) but if you really want to avoid that whole deal you could go the SiGN route and simply refer to the 5 slices on the R axis as R, 2R, 3R, 4R, and 5R. It's not something you'd normally see, but it's valid, and 5R turns like your L+.


----------



## cuBerBruce (Nov 27, 2014)

unsolved said:


> Thanks. I wasn't sure how much is gained at the cubie level. My own attempts did not do that well, perhaps I am not understanding it. And, once conjoined with other parts of my code, the cubie-ish stuff I wrote turned out to be a detriment, as I had to make a lot of back-and-forth conversions between move generator output and pruning data/TFS data.



As far as I have been aware, you've been using a facelet level representation, not a cubie level representation. Hasn't your code always viewed the 4x4x4 cube as 96 facelets rather than as 56 cubies?


----------



## unsolved (Nov 27, 2014)

cuBerBruce said:


> As far as I have been aware, you've been using a facelet level representation, not a cubie level representation. Hasn't your code always viewed the 4x4x4 cube as 96 facelets rather than as 56 cubies?



I tried about 15 different ways to generate moves for the 4x4x4 since the project's inception. I abandoned 14 of them, including my cubie attempts.

This is how I lay out the bits for the bitboard generator now. Each face = 1 64-bit integer.






I used shading changes to denote where 1-byte boundaries are. As you can see, TOP[0] through TOP[15] are the 16 slots for each facelet ("sticker") on the TOP face. Each facelet is 4 bits, so there are 2 facelets per byte. It's the perfect layout for a 64-bit implementation because each side of the 4x4x4 cube is exactly one *unsigned long long *or *UINT64* data type.

In this fashion, I don't have 96 separate facelets, since I have only 6 chunks of data. I can move them an entire row at a time by bit-ANDing the correct offsets and bit-masks. This also makes it easy to check all 24 possible solved states with only 1 instruction (ok, 2 if you count the *return* statement). If you count the comparisons, you see there are 36 all totaled. That's an average of 1.5 comparisons per cube orientation 


```
#define BITBOARD_TOP_FACE    2
#define BITBOARD_RIGHT_FACE  3
#define BITBOARD_FRONT_FACE  5
#define BITBOARD_LEFT_FACE   7
#define BITBOARD_BOTTOM_FACE 11
#define BITBOARD_BACK_FACE   13

#define BITBOARD_TOP_SOLVED    0x2222222222222222
#define BITBOARD_RIGHT_SOLVED  0x3333333333333333
#define BITBOARD_FRONT_SOLVED  0x5555555555555555
#define BITBOARD_LEFT_SOLVED   0x7777777777777777
#define BITBOARD_BOTTOM_SOLVED 0xBBBBBBBBBBBBBBBB
#define BITBOARD_BACK_SOLVED   0xDDDDDDDDDDDDDDDD

unsigned short bitboard_4x4x4_cube_is_solved_from_any_orientation(BITBOARD_4x4x4_CUBE your_4x4x4_cube)
{
	unsigned short result;

result =
((your_4x4x4_cube._4x4x4_top == BITBOARD_TOP_SOLVED) || (your_4x4x4_cube._4x4x4_top == BITBOARD_RIGHT_SOLVED) || 
(your_4x4x4_cube._4x4x4_top == BITBOARD_FRONT_SOLVED) || (your_4x4x4_cube._4x4x4_top == BITBOARD_LEFT_SOLVED) || 
(your_4x4x4_cube._4x4x4_top == BITBOARD_BOTTOM_SOLVED) || (your_4x4x4_cube._4x4x4_top == BITBOARD_BACK_SOLVED)) &&
((your_4x4x4_cube._4x4x4_right == BITBOARD_TOP_SOLVED) || (your_4x4x4_cube._4x4x4_right == BITBOARD_RIGHT_SOLVED) || 
(your_4x4x4_cube._4x4x4_right == BITBOARD_FRONT_SOLVED) || (your_4x4x4_cube._4x4x4_right == BITBOARD_LEFT_SOLVED) ||
(your_4x4x4_cube._4x4x4_right == BITBOARD_BOTTOM_SOLVED) || (your_4x4x4_cube._4x4x4_right == BITBOARD_BACK_SOLVED)) &&
((your_4x4x4_cube._4x4x4_front == BITBOARD_TOP_SOLVED) || (your_4x4x4_cube._4x4x4_front == BITBOARD_RIGHT_SOLVED) ||
(your_4x4x4_cube._4x4x4_front == BITBOARD_FRONT_SOLVED) || (your_4x4x4_cube._4x4x4_front == BITBOARD_LEFT_SOLVED) ||
(your_4x4x4_cube._4x4x4_front == BITBOARD_BOTTOM_SOLVED) || (your_4x4x4_cube._4x4x4_front == BITBOARD_BACK_SOLVED)) &&
((your_4x4x4_cube._4x4x4_left == BITBOARD_TOP_SOLVED) || (your_4x4x4_cube._4x4x4_left == BITBOARD_RIGHT_SOLVED) ||
(your_4x4x4_cube._4x4x4_left == BITBOARD_FRONT_SOLVED) || (your_4x4x4_cube._4x4x4_left == BITBOARD_LEFT_SOLVED) ||
(your_4x4x4_cube._4x4x4_left == BITBOARD_BOTTOM_SOLVED) || (your_4x4x4_cube._4x4x4_left == BITBOARD_BACK_SOLVED)) &&
((your_4x4x4_cube._4x4x4_bottom == BITBOARD_TOP_SOLVED) || (your_4x4x4_cube._4x4x4_bottom == BITBOARD_RIGHT_SOLVED) ||
(your_4x4x4_cube._4x4x4_bottom == BITBOARD_FRONT_SOLVED) || (your_4x4x4_cube._4x4x4_bottom == BITBOARD_LEFT_SOLVED) ||
(your_4x4x4_cube._4x4x4_bottom == BITBOARD_BOTTOM_SOLVED) || (your_4x4x4_cube._4x4x4_bottom == BITBOARD_BACK_SOLVED)) &&
((your_4x4x4_cube._4x4x4_back == BITBOARD_TOP_SOLVED) || (your_4x4x4_cube._4x4x4_back == BITBOARD_RIGHT_SOLVED) ||
(your_4x4x4_cube._4x4x4_back == BITBOARD_FRONT_SOLVED) || (your_4x4x4_cube._4x4x4_back == BITBOARD_LEFT_SOLVED) ||
(your_4x4x4_cube._4x4x4_back == BITBOARD_BOTTOM_SOLVED) || (your_4x4x4_cube._4x4x4_back == BITBOARD_BACK_SOLVED));

return result;
}
```

That last section of the spreadsheet shows how I set up the TFS-7 database. 

5 bits = cube orientation/rotation, from 0 to 23
3 bits = the actual number of moves in this particular TFS solution, from 001 to 111 (1 to 7)
The other bytes are the actual move numbers, 1 to 36, for the turn solution associated with the position.

A post-process turns this data into run-time probe-able data and all I have to do is compare move generator output to the bytes in the RAM buffer, and a match will give me those 64-bits indicating a TFS solution.


----------

