A couple of weeks ago, I learnt how to juggle. It took one YouTube video and 15 minutes. In this post, I want to analyse my experience from a reinforcement learning perspective.
First, I buy juggling balls, i.e. I create an environment for myself. I am curious, and I want to do something cool and challenge myself. I have a desire to master this environment. (Funnily enough, now that I can juggle, my curiosity has lessened, and I don't juggle that often.)
Before I start to use the environment, I already know what the goal is. However, this goal is not a specific state or score. It is an infinite set of states sharing one thing in common: the balls are in the air and keep changing the hand they land in. I happen to know the reward function: I know what I'm rewarded for. Usually in RL you do random stuff until you get reinforcement. Here, I already know which states I care about, and I want to learn the behaviour that leads me to them.
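To make the "goal as a set of states" idea concrete, here is a minimal sketch. The state fields and the names `JugglingState`, `is_goal` and `reward` are my own assumptions for illustration, not a real environment's API. The point is only that the reward is defined by a condition many different states satisfy, not by one target state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JugglingState:
    # Hypothetical, heavily simplified description of one moment of juggling.
    balls_in_air: int       # balls currently airborne
    balls_on_ground: int    # balls that have been dropped
    last_catch_hand: str    # "left" or "right"
    prev_catch_hand: str    # hand that made the catch before that


def is_goal(state: JugglingState) -> bool:
    """True for any state where nothing is dropped, something is in the air,
    and the catching hand keeps alternating."""
    return (
        state.balls_on_ground == 0
        and state.balls_in_air >= 1
        and state.last_catch_hand != state.prev_catch_hand
    )


def reward(state: JugglingState) -> float:
    """Known in advance: 1 for every goal state, 0 otherwise."""
    return 1.0 if is_goal(state) else 0.0
```

Many different states earn the same reward here, for example one or two balls in the air with either hand catching last, which is what I mean by the goal not being a single state.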
Doing random stuff for a bit is the next important step. I want to see how hard it is. I want to know how the real experience differs from my understanding of juggling. My biases are already in place: I can move my hands, I know what my hand movements lead to, I can see when the balls are in my hands, and I can throw them in the air.
Next, I'm doing imitation learning by watching the YouTube video recommended by the leaflet that came with the balls (yes, I can read, and I also know how and where to type in the URL). Interestingly, this step is not only imitation. It is curriculum learning, too! You learn step by step, and it is you who decides when to switch tasks. You can go back to previous tasks just to test yourself. You can also come up with your own tasks because you understand how to compose simpler ones, so there is some hierarchy here as well. Finally, this is also multitask learning: in the meantime, I have to negotiate with my son that his turn is in 20 seconds.
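The "you decide when to switch tasks" part can be sketched as a tiny loop. Everything here is an assumption made up for illustration: the task names, the 0.8 threshold, and the `toy_practise` learner whose success rate simply grows with the number of attempts. It is a toy picture of a self-paced curriculum, not a model of how I actually learnt.

```python
def run_curriculum(tasks, practise, threshold=0.8, max_attempts=100):
    """Practise each task in order; move on once the learner's own
    success estimate reaches the threshold."""
    log = []
    for task in tasks:
        for attempt in range(1, max_attempts + 1):
            success_rate = practise(task, attempt)
            if success_rate >= threshold:
                break  # the learner decides this task is good enough
        log.append((task, attempt))
    return log


def toy_practise(task, attempt):
    # Hypothetical learner: success grows linearly with attempts,
    # and harder tasks improve more slowly.
    difficulty = {"one ball": 1, "two balls": 2, "three balls": 4}[task]
    return min(1.0, attempt / (5 * difficulty))


schedule = run_curriculum(["one ball", "two balls", "three balls"], toy_practise)
```

With these made-up numbers the learner spends 4, 8 and 16 attempts on the three tasks. Going back to an earlier task "just to test yourself" would amount to calling `practise` on it again.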
All right, 15 minutes and we are done! But there are a lot of mistakes. I need to fine-tune the model. I can track my progress: I can count, and I can see whether I manage to juggle for longer before one of the balls falls on the ground. I can see where I have problems and try to come up with exercises to solve them. If I see that I can't throw a ball evenly, I practise this particular thing, and I count to help myself.
My juggling is still far from perfect. I suffer from the compounding error problem. It is easy for me to start juggling from one initial state: two balls in the left hand, one in the right. It is a bit harder when one ball is in the left and two are in the right. When one of my throws is bad (the ball ends up too far from me), I have trouble getting back to an accurate throwing position, and the error accumulates: paying more attention to the bad ball leads to a worse throw for the next one. As a result, I fail. This is not very different from a MuJoCo agent.
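Here is a toy picture of the compounding error, under assumptions I am making up for illustration: each throw's error is the previous error multiplied by a fixed `gain`, and I drop a ball once the error exceeds a `tolerance`. Real juggling (and a real MuJoCo agent) is of course noisier than this.

```python
def throws_until_drop(initial_error, gain, tolerance=1.0, max_throws=1000):
    """Count the throws made before the accumulated error exceeds the tolerance.

    gain > 1: each bad throw makes the next one worse (errors compound).
    gain < 1: I recover a little after every throw (errors die out).
    """
    error = abs(initial_error)
    for throw in range(max_throws):
        if error > tolerance:
            return throw  # a ball hits the ground
        error *= gain     # the next throw inherits a scaled error
    return max_throws     # never dropped within the horizon


compounding = throws_until_drop(initial_error=0.1, gain=1.5)
recovering = throws_until_drop(initial_error=0.1, gain=0.8)
```

With these numbers, the compounding juggler drops a ball after 6 throws, while the recovering one reaches the 1000-throw limit. Starting from a worse first throw (a larger `initial_error`) shortens the run further, which matches my experience with the harder initial state.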
Transfer to similar objects is okay. It takes a bit of time to learn how to do this with oranges, apples, and other round or cubic objects. I tried it with shoes, and it did not work out: they are heavier, and their shape is irregular. Transfer to four objects is harder. I don't quite understand what I should do with my hands, and I do not have four balls of the same shape yet.
I can modify the environment just for fun. I can throw lower than before, and then I need to move faster. I can also throw the balls higher, and then it is harder to do it precisely. The skill requires less and less mental effort. What I find interesting is that not all parts of my state-action space are relevant to the task: after a while, it does not matter whether I stand, sit or walk. I can talk and juggle. I can walk and juggle. I can stand on one foot and juggle. I can't ride a unicycle and juggle at the same time yet, but this is my next frontier.
Thanks to everyone who made it to the very end. I hope this gives someone food for thought. If you have something to share, you are welcome to comment or just drop me a line at vitaliykurin@gmail.com or https://twitter.com/y0b1byte.