I have the following personal ML project:
1. Modeling a sort of single-player game with a big board of locations, pawns of different colors and quantities to place on it, and a set of rules for placing each kind (color) of pawn on the board. There are also time considerations, as moving a pawn from one location to another takes some time related to the distance...
2. Having a deep RL algorithm learn to play the game and find the best solution (highest score in the smallest number of moves).
I have only been into machine learning for a few weeks, starting with OpenAI Gym. What I have done so far is program the gym environment, which works with a random agent (I mean the rules work correctly). I tried to train PPO on it, but I am not sure whether my strategy is the right one regarding action/observation spaces and rewards. I have not really done normalization, as I am not sure how to handle it.
I am stuck at this stage with a ton of questions and doubts and would need an expert to coach me and put me in the right direction, as I don't have much time for trial and error but want to learn how to tackle my specific problem... I might also need some help with coding when necessary (I code in Python).
Brief summary of what my gym env looks like:
I coded a gym environment for a single-player game that consists of:
- a board of n locations (n is 1122 in this example, but in the future I would like to be able to handle boards of 30,000 locations, for instance), represented by a simple list of 1122 indexes
- 6 different kinds (colors) of pawns that you can place according to a set of rules handled by the gym environment (some pawns can pile up on the same location, etc.). At the beginning, the player has a fixed number of available pawns per color (the stock); I represent the colors by the codes 1 to 6.
- 3 possible pawn actions: NEW, when putting a new pawn on the board from the stock; MOVE, when moving a pawn already on the board to a new location; and REMOVE, when removing a pawn from the board to put it back in the stock.
As an action_space I used a MultiDiscrete([pawn_actions_nr, total_pawns_nr, locations_nr]), where:
- pawn_actions_nr: an int from 0 to 2 — 0 (NEW), 1 (MOVE) or 2 (REMOVE)
- total_pawns_nr: an int from 0 to 59, with 0 to 6 being the 7 black pawns, 7 to 10 the 4 red pawns, and so on (60 pawns in total)
- locations_nr: an int from 0 to 1121, representing each of the 1122 possible locations
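Roughly, the action space above looks like this (a minimal sketch assuming the standard gym API; the constant names are my own, and the sizes are the ones from the description):

```python
from gym import spaces

# Sizes taken from the description above (names are made up for clarity).
N_ACTIONS = 3        # NEW, MOVE, REMOVE
N_PAWNS = 60         # total pawns across all 6 colors
N_LOCATIONS = 1122   # board locations, indexed 0..1121

# Each action is a triple (pawn_action, pawn_index, target_location).
action_space = spaces.MultiDiscrete([N_ACTIONS, N_PAWNS, N_LOCATIONS])

action = action_space.sample()  # a length-3 array of ints, one per sub-space
```

Note that with this flat encoding many sampled triples are illegal (e.g. MOVE on a pawn still in stock), so the env has to reject or mask them somehow.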
Every time a pawn stops at a location, the location takes its color (e.g., I can place a red pawn on a given location and then move it to different locations; all these locations will turn red).
Observation space: a Box of length 1122+60; values can be integers from -1 to 1121. The first 1122 entries represent the locations (and can take values from 0 to 6, 0 being the initial state and 1 to 6 the color of the location), and the last 60 represent all the pawns from the initial stock, with possible values from -1 to 1121, representing the location where a given pawn sits, -1 meaning that it is not on the board but still in stock.
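To make that concrete, here is a sketch of the observation space, plus one simple way to scale it into [0, 1] (the per-bound scaling is just my own suggestion for the normalization question, not something already in the env):

```python
import numpy as np
from gym import spaces

N_LOCATIONS = 1122   # location colors: 0 (empty) .. 6
N_PAWNS = 60         # pawn positions: -1 (in stock) .. 1121

# Per-entry bounds: first 1122 entries in [0, 6], last 60 in [-1, 1121].
low = np.concatenate([np.zeros(N_LOCATIONS), np.full(N_PAWNS, -1)]).astype(np.float32)
high = np.concatenate([np.full(N_LOCATIONS, 6), np.full(N_PAWNS, N_LOCATIONS - 1)]).astype(np.float32)
observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

def normalize(obs):
    """Scale each entry to [0, 1] using its own low/high bounds."""
    return (obs - low) / (high - low)
```

One caveat: the location index fed this way is a categorical, not a magnitude, so a one-hot or embedding representation may suit a neural policy better than raw scaled indexes.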
The environment does not manage time so far, as I am not sure how to handle the moving delay (do I have to set a fixed time per step and manage past, present, and future somehow? Is there a way to handle that the way discrete-event simulations do?...)
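To make that question concrete, here is the kind of fixed-tick, discrete-event-style scheduler I have in mind: every env step advances a clock by one tick, and moves in flight sit in a priority queue until their arrival time. This is only a sketch of the idea, not project code, and all the names are made up:

```python
import heapq

class MoveScheduler:
    """Track in-flight pawn moves; each env step advances time by one tick."""

    def __init__(self):
        self.now = 0
        self._pending = []  # heap of (arrival_time, pawn, location)

    def schedule(self, pawn, location, distance, speed=1):
        """Start moving a pawn; delay is proportional to the distance."""
        travel_time = distance // speed
        heapq.heappush(self._pending, (self.now + travel_time, pawn, location))

    def step(self):
        """Advance one tick; return the (pawn, location) moves that complete."""
        self.now += 1
        arrived = []
        while self._pending and self._pending[0][0] <= self.now:
            _, pawn, location = heapq.heappop(self._pending)
            arrived.append((pawn, location))
        return arrived
```

With this shape, the env's step() would first pop the arrived moves and apply their color changes, then process the agent's new action; a pawn could also be flagged "in transit" in the observation so the agent knows it is unavailable.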