
offline
Last seen: May 24, 2013 7:50 AM EDT
Rate: $10.00 USD/hour
 

rhythmist

an enthusiastic coder

Username: rhythmist

  • Has not made a deposit.
  • Has verified their email address.
  • Has completed their profile.
  • Has not verified their secure phone number.
  • Verified
  • Payment is not verified.

Location: Mumbai, India

Member since: June 2012

Reputation:

4.5

(3 reviews)

2.2

My projects:

  • $50 USD
    5.0

    diamalaye1 United States

    Apr 17, 2013

    he is excellent and professional

    Project Description: Hi, my friend referred me to you, as you did a great job for him.
  • $45 USD
    3.6

    diamalaye1 United States

    Mar 6, 2013

    It was a hard project and he did his best to help me.

    Project Description: N/A
  • $30 USD
    5.0

    koop United States

    Sep 23, 2012

    Excellent work on projects, delivered on time. Definitely recommended to all posters.

    Project Description: Write a C++ program, escape.cpp, that randomly finds a path from the center of a two dimensional array. The program first initializes a two dimensional char array to all periods. The program starts at the middle of the array...
  • $30 USD In Progress

    Write a program, blockhead.cpp, that uses a struct called block which has three fields:

    1. An integer rows
    2. An integer cols
    3. A character letter

    A block will be declared and initialized in main. The program should repeatedly read one of the following commands until a q is entered:

    r: Change the number of rows in the block
    c: Change the number of columns in the block
    l: Change the letter in the block
    a: Change the letter in the block based on an ASCII value
    p: Print the block in two dimensional format

    If an illegal command is entered, the program should print an appropriate error message.

    You are required to write the following six functions:

    // initializes the block to 4 rows, 4 columns and the
    // letter *
    void init_block (block& b);

    // prompts for and reads a number of rows from the user
    // and assigns that number of rows to the block
    void change_rows (block& b);

    // prompts for and reads a number of columns from
    // the user and assigns that number of columns to the
    // block
    void change_cols (block& b);

    // prompts for and reads the ASCII value of a character
    // from the user and sets the letter in the block to
    // that character
    void change_ascii (block& b);

    // prompts for and reads a letter from
    // the user and assigns that letter to the
    // block
    void change_letter (block& b);

    // prints the block in two dimensional format with a
    // label
    void print (block b);

    The following page shows a sample run of the program (user input in bold). Your program should use the same output formatting.
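    The blockhead.cpp description can be sketched roughly as follows. This is a minimal, partial sketch and not the submitted solution: it covers only the struct, init_block, change_ascii and print. The helpers render and set_letter_from_ascii are hypothetical names that do not appear in the assignment; they exist only so the logic can be checked without typing input. The prompt text and the "Block:" label are guesses, because the sample run is not included here.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <string>

// The three fields named in the assignment.
struct block {
    int rows;
    int cols;
    char letter;
};

// Initializes the block to 4 rows, 4 columns and the letter '*'.
void init_block(block& b) {
    b.rows = 4;
    b.cols = 4;
    b.letter = '*';
}

// Hypothetical helper (not in the assignment): builds the two-dimensional
// picture as a string, one line of `cols` letters per row.
std::string render(const block& b) {
    assert(b.rows >= 0 && b.cols >= 0);
    std::string out;
    for (int r = 0; r < b.rows; ++r) {
        out.append(static_cast<std::size_t>(b.cols), b.letter);
        out.push_back('\n');
    }
    return out;
}

// Hypothetical helper: the non-interactive core of change_ascii.
void set_letter_from_ascii(block& b, int code) {
    b.letter = static_cast<char>(code);
}

// Prompts for and reads an ASCII value, then stores the matching character.
// The prompt wording is an assumption.
void change_ascii(block& b) {
    int code = 0;
    std::cout << "Enter the ASCII value: ";
    if (std::cin >> code) {
        set_letter_from_ascii(b, code);
    }
}

// Prints the block in two-dimensional format with a label.
// The label text is an assumption; the sample run was not available.
void print(block b) {
    std::cout << "Block:\n" << render(b);
}
```

    The remaining functions (change_rows, change_cols, change_letter) and the command loop in main would follow the same prompt, read, assign pattern as change_ascii.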

  • $30 USD Feb 6, 2013

    Hi, my friend (koop) referred me to you for my project. This project must be done using MATLAB.

    Windy Grid World due 2/10/2012

    This assignment is to use Reinforcement Learning to solve the following "Windy Grid World" problem illustrated in the above picture. Each cell in the image is a state. There are four actions: move up, down, left, and right. This is a deterministic domain: each action deterministically moves the agent one cell in the direction indicated. If the agent is on the boundary of the world and executes an action that would move it "off" of the world, it remains on the grid in the same cell from which it executed the action.

    Notice that there are arrows drawn in some states in the diagram. These are the "windy" states. In these states, the agent experiences an extra "push" upward. For example, if the agent is in a windy state and executes an action to the left or right, the result of the action is to move left or right (respectively) but also to move one cell upward. As a result, the agent moves diagonally upward to the left or right.

    This is an episodic task where each episode lasts no more than 30 time steps. At the beginning of each episode, the agent is placed in the "Start" state. Reward in this domain is zero everywhere except when the agent is in the goal state (labeled "goal" in the diagram). The agent receives a reward of positive ten when it executes any action from the goal state. The episode ends after 30 time steps or when the agent takes any action after having landed in the goal state.

    You should solve the problem using Q-learning. Use e-greedy exploration with epsilon=0.1 (the agent takes a random action 10 percent of the time in order to explore). Use a learning rate of 0.1 and a discount rate of 0.9.

    The programming should be done in MATLAB. Students may get access to MATLAB here.
    Alternatively, students may code in Python (using NumPy). If the student would rather code in a different language, please see Dr Platt or the TA. Students should submit their homework via email in the form of a ZIP file that includes the following:

    1. A PDF of a plot of gridworld that illustrates the policy and a path found by Q-learning after it has approximately converged. The policy plot should identify the action taken by the policy in each state. The path should begin in the start state and follow the policy to the goal state.
    2. A PDF of a plot of reward per episode. It should look like the diagram in Figure 6.13 in SB.
    3. A text file showing output from a sample run of your code.
    4. A directory containing all source code for your project.

    Updates: You can initialize the Q function randomly or you can initialize it to a uniform value of 10. That is, you can initialize Q such that each value in the table is equal to 10.

    6.5 Q-Learning: Off-Policy TD Control

    One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989). Its simplest form, one-step Q-learning, is defined by

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$    (6.6)

    In this case, the learned action-value function, $Q$, directly approximates $Q^*$, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs. The policy still has an effect in that it determines which state-action pairs are visited and updated. However, all that is required for correct convergence is that all pairs continue to be updated. As we observed in Chapter 5, this is a minimal requirement in the sense that any method guaranteed to find optimal behavior in the general case must require it. Under this assumption and a variant of the usual stochastic approximation conditions on the sequence of step-size parameters, $Q_t$ has been shown to converge with probability 1 to $Q^*$.
    The Q-learning algorithm is shown in procedural form in Figure 6.12.

    Figure 6.12: Q-learning: An off-policy TD control algorithm.

    What is the backup diagram for Q-learning? The rule (6.6) updates a state-action pair, so the top node, the root of the backup, must be a small, filled action node. The backup is also from action nodes, maximizing over all those actions possible in the next state. Thus the bottom nodes of the backup diagram should be all these action nodes. Finally, remember that we indicate taking the maximum of these "next action" nodes with an arc across them (Figure 3.7). Can you guess now what the diagram is? If so, please do make a guess before turning to the answer in Figure 6.14.

    Figure 6.13: The cliff-walking task. The results are from a single run, but smoothed.

    Figure 6.14: The backup diagram for Q-learning.

    Example 6.6: Cliff Walking

    This gridworld example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods. Consider the gridworld shown in the upper part of Figure 6.13. This is a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is $-1$ on all transitions except those into the region marked "The Cliff." Stepping into this region incurs a reward of $-100$ and sends the agent instantly back to the start. The lower part of the figure shows the performance of the Sarsa and Q-learning methods with $\epsilon$-greedy action selection, $\epsilon = 0.1$. After an initial transient, Q-learning learns values for the optimal policy, that which travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the $\epsilon$-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid.
    Although Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of Sarsa, which learns the roundabout policy. Of course, if $\epsilon$ were gradually reduced, then both methods would asymptotically converge to the optimal policy.

    Exercise 6.9 Why is Q-learning considered an off-policy control method?

    Exercise 6.10 Consider the learning algorithm that is just like Q-learning except that instead of the maximum over next state-action pairs it uses the expected value, taking into account how likely each action is under the current policy. That is, consider the algorithm otherwise like Q-learning except with the update rule

    $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \sum_a \pi(s_{t+1}, a) Q(s_{t+1}, a) - Q(s_t, a_t) \right]$

    Is this new method an on-policy or off-policy method? What is the backup diagram for this algorithm? Given the same amount of experience, would you expect this method to work better or worse than Sarsa? What other considerations might impact the comparison of this method with Sarsa?
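    The one-step Q-learning update described above can be sketched as follows. The assignment itself asks for MATLAB or Python; C++ is used here only to match the other code on this page, and the sketch covers the table update alone, not the grid, the wind, or the e-greedy loop. All names (QTable, make_q, greedy_action, q_update) and the terminal flag are hypothetical. The number of states is left as a parameter because the grid picture is not included. The defaults follow the stated learning rate of 0.1 and discount rate of 0.9.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNumActions = 4;  // up, down, left, right

// Q-table: one row per state, one column per action.
using QTable = std::vector<std::vector<double>>;

// Builds a table with every entry set to the same value
// (the assignment allows a uniform initial value of 10).
QTable make_q(std::size_t num_states, double initial_value) {
    return QTable(num_states, std::vector<double>(kNumActions, initial_value));
}

// Index of the highest-valued action in state s; ties go to the lowest index.
int greedy_action(const QTable& q, std::size_t s) {
    return static_cast<int>(std::max_element(q[s].begin(), q[s].end()) - q[s].begin());
}

// One-step Q-learning update:
//   Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
// Assumption: when the transition ends the episode (terminal == true),
// the bootstrap term max_a' Q(s',a') is treated as zero.
void q_update(QTable& q, std::size_t s, int a, double reward,
              std::size_t s_next, bool terminal,
              double alpha = 0.1, double gamma = 0.9) {
    assert(a >= 0 && a < kNumActions);
    const double best_next =
        terminal ? 0.0 : *std::max_element(q[s_next].begin(), q[s_next].end());
    q[s][a] += alpha * (reward + gamma * best_next - q[s][a]);
}
```

    With every entry initialized to 10, a zero-reward step lowers the visited entry slightly (the target is 0.9 times 10), which is one reason a uniform optimistic start still lets the values settle.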


Portfolio


Resume

Education

B.Tech

Indian Institute of Technology, Bombay

2010-2012