On Modeling Agents in Dynamic Systems

Since we are interested in modeling social and economic systems, we are constantly asking what our agents will do. To ask this question, it is first important to understand how a dynamic game (or differential game) differs from a one-shot game. A dynamic game has a persistent, and potentially only partially observable, state we call X.

When an agent takes an action, we call that action u, and the new state of the system is X^+ = f(X, u), where f:\mathcal{X}\times \mathcal{U}\rightarrow \mathcal{X} is a function, also called a mechanism, defining how that action changes the state. Here \mathcal{U} is the set of actions the agent is allowed to take and \mathcal{X} is the state space. In a simple system with one agent and one mechanism, the agent's sequence of actions would uniquely define a path through the state space, called a trajectory. An agent observing that they are at X_0 \in \mathcal{X} and wishing to arrive at a point X^* \in \mathcal{X} could chart a course of actions u_0, \ldots, u_{t-1} \in \mathcal{U} such that they can expect to arrive at X_t = X^*. This is the canonical description of an open-loop or non-feedback controller. In practice, the moment perturbations \delta are introduced, or there are other agents mutating the state with goals not aligned with the original agent's, it quickly becomes clear that our agent may never reach X^*.
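As a minimal sketch, an open-loop plan is just a precomputed action sequence applied blindly. The additive mechanism f(X, u) = X + u and the numbers here are illustrative assumptions:

```python
# A minimal sketch of an open-loop (non-feedback) controller: the agent
# commits to an action sequence in advance and never re-checks the state.

def f(X, u):
    """Mechanism: how an action u moves the state X."""
    return X + u

def open_loop(X0, actions):
    """Apply a precomputed action sequence; return the trajectory."""
    trajectory = [X0]
    for u in actions:
        trajectory.append(f(trajectory[-1], u))
    return trajectory

# Plan a course from X_0 = 0 toward X* = 3 with unit steps:
print(open_loop(0, [1, 1, 1]))   # -> [0, 1, 2, 3]
```

The plan succeeds only because nothing perturbs the state between steps; inject noise into f and the final state drifts away from X^* with no correction.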

If we suppose for a moment that our agent's desire to be at X^* can be encoded as a private desire for the system to be in state \sigma_t at any time t, then their desire is encoded by an objective to \min \Phi, where

\Phi(X, \sigma) = (X - \sigma)^2

Given a simple definition for the mechanism

f(X,u) = X+u\\ u\in \mathcal{U}=\{u:|u|<1\}

it is possible to define a feedback controller that determines the agent's actions according to its own goals. If \sigma is constant and there is no noise, then our agent can choose

u_t = \arg\min_{|u|<1} \Phi(f(X_t, u), \sigma)

In this case the answer can be computed in closed form and implemented directly: if |\sigma - X_t|<1, choose u_t = \sigma - X_t; otherwise, choose u_t = \hbox{sign}(\sigma-X_t), taking the largest allowable increment in the desired direction (strictly speaking, a step just inside the bound, since the constraint |u|<1 is open). Deriving control rules in this way still works after injecting uncertainty \delta and hiding parts of the state, that is, only allowing our agent to observe Y\subset X.
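This closed-form feedback rule can be sketched directly in code; the small eps margin is an implementation assumption used to respect the strict bound |u| < 1:

```python
# Feedback controller for f(X, u) = X + u with the constraint |u| < 1.
# Because the bound is strict, we step by at most 1 - eps.

def feedback_controller(X_t, sigma, eps=1e-9):
    """u_t = argmin over |u| < 1 of Phi(f(X_t, u), sigma)."""
    gap = sigma - X_t
    if abs(gap) < 1:
        return gap                               # reach sigma in one step
    return (1 - eps) if gap > 0 else -(1 - eps)  # largest allowable step

# Unlike an open-loop plan, the rule re-evaluates the state each step:
X, sigma = 0.0, 3.0
for _ in range(10):
    X = X + feedback_controller(X, sigma)   # X^+ = f(X, u)
print(round(X, 6))   # -> 3.0
```

Because the controller recomputes u from the current state, a perturbation injected into any step is simply corrected in the following steps.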

The dynamics start to get complex when there is state feedback in the incentives, multiple mechanisms f_1, f_2, \ldots, and/or many agents interacting concurrently with the same system. At this point we can define strategies for agents, but those agents' abilities to achieve their goals are uncertain. In some cases an agent's goals may be unbounded, such as maximizing profit. This means that outcomes are determined as much by the interactions between the agents as by the individual agents' strategies. In the case of multiscale systems, the system itself may have adaptive properties.

One such case is the Bitcoin network. If the agents are Bitcoin miners, their actions involve buying and operating mining hardware, and their objective is to make a profit by mining blocks and collecting rewards, then the changes to the mining difficulty represent system-level adaptivity in response to the aggregate behavior of the miners.
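That adaptivity can be sketched roughly as follows. This is a simplified model of the commonly documented retargeting rule, not consensus code (which operates on compact difficulty targets):

```python
# Simplified model of Bitcoin difficulty retargeting: every 2016 blocks,
# difficulty is rescaled so blocks average ~10 minutes, with the
# adjustment clamped to a factor of 4 in either direction.

TARGET_SECONDS = 2016 * 10 * 60   # two weeks at one block per 10 minutes

def retarget(difficulty, actual_seconds):
    ratio = TARGET_SECONDS / actual_seconds
    ratio = max(0.25, min(4.0, ratio))   # clamp, as the protocol does
    return difficulty * ratio

# If aggregate hashpower doubles and 2016 blocks arrive in one week,
# difficulty doubles in response:
print(retarget(1.0, TARGET_SECONDS / 2))   # -> 2.0
```

No individual miner chooses this response; it is a property of the system that closes the loop on the miners' aggregate behavior.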


Source: https://bitcoinwisdom.com/bitcoin/difficulty
Looking at the aggregate system dynamics hides the nuances of the individual investment decisions and outcomes of all of the miners whose actions gave rise to these dynamics, as well as that system's coupling with the secondary market for bitcoin and the halving schedule for Bitcoin mining rewards. There is forthcoming research on Bitcoin as a multi-scale game coming out of the Vienna CryptoEconomics institute.

More broadly, as systems become more complex, the behavioral models u_t = P(Y_t, \sigma_t, \delta) can be approached in one of three ways:

  • Agents' behavior functions P(\cdot) can be encoded with heuristic strategies derived from the game-theoretic, psychological decision science, and/or behavioral economics literature.
  • Agents' behavior functions P(\cdot) can be machine-learned from past data, where the feature space is some characterization of the agent and system states, and the labels are the actions u taken.
  • Agents can also have inherently adaptive strategies by encoding them as reinforcement learning agents, who will learn to do whatever they can to achieve their goals within the bounds of the action space \mathcal{U}. In this case, P(\cdot) is itself time-varying.
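As a sketch of the first (heuristic) case, a behavior function might look like the following; the names and the specific rule are illustrative assumptions, not a cadCAD API:

```python
import random

# Hypothetical heuristic behavior function u_t = P(Y_t, sigma_t, delta):
# move toward the goal state, perturbed by noise, clipped to |u| < 1.

def P_heuristic(Y_t, sigma_t, delta=0.1):
    u = (sigma_t - Y_t) + random.uniform(-delta, delta)
    return max(-0.99, min(0.99, u))   # clip into the action space U
```

The machine-learned case would replace the body with a fitted model's prediction, and the RL case would additionally update that model between episodes, which is what makes P(\cdot) time-varying.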

Models of all three types can be implemented in cadCAD; it is even possible for all of them to be used in the same model. I am looking forward to seeing this area of research pursued further, as Bitcoin in particular has relatively simple dynamics and a 10-year history from which to draw data.


To make this more tangible, here is a relatively simple example of a system with agents arriving at random but interacting according to their own heuristic strategies.

In this case, when an agent arrives we randomly select whether they are greedy, fair, or giving:

import numpy as np


def greedy_robot(src_balls, dst_balls):
    # robot wishes to accumulate balls at its source:
    # it takes half of its neighbor's balls
    if src_balls < dst_balls:
        delta = -np.floor(dst_balls / 2)
    else:
        delta = 0
    return delta


def fair_robot(src_balls, dst_balls):
    # robot follows the simple balancing rule:
    # move one ball toward the smaller pile
    delta = np.sign(src_balls - dst_balls)
    return delta


def giving_robot(src_balls, dst_balls):
    # robot wishes to give away balls one at a time
    if src_balls > 0:
        delta = 1
    else:
        delta = 0
    return delta
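To make the sign convention explicit: a positive delta moves balls from the source robot to its neighbor, and a negative delta takes them. The full simulation lives in the linked code; this wiring is a hypothetical, self-contained sketch:

```python
import numpy as np

def greedy_robot(src_balls, dst_balls):
    # same rule as above, condensed so this example runs on its own
    return -np.floor(dst_balls / 2) if src_balls < dst_balls else 0

def interact(strategy, src_balls, dst_balls):
    """Apply one interaction; positive delta moves balls src -> dst."""
    delta = strategy(src_balls, dst_balls)
    return src_balls - delta, dst_balls + delta

# A greedy robot with 1 ball meets a neighbor with 10 and takes 5:
print(interact(greedy_robot, 1, 10))   # -> (6.0, 5.0)
```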

This is a variation of our Robot and Marbles tutorial series where the robots are agents in a growing community of marble traders. You can see the outcome of one realization here:

The code that generated this particular example can be viewed here. We also made a short video to show the network evolving in time.

The robots whose lines and nodes are colored green are the giving robots, the blue ones are fair, and the purple ones are greedy. While the general trend is for greedy robots to accumulate the most marbles, it is important to note that the network topology plays into the outcome. Both giving and greedy robots insulated by fair robots will maintain approximately the average number of marbles of their fair neighbors. A fair robot who is bounded by greedy robots can accumulate quite a hoard of marbles.

The main point here is that unintuitive outcomes can arise from very simple heuristics, especially when the dynamics are embedded on a time-varying network. Things get even more interesting when the network isn't just growing randomly but connections are chosen based on observations of past behavior. What if I break links with my greedy neighbors?


Modeling and simulating the effect of incentive design on energy systems, which @solsista is also working on, would involve multiple agents with different optimization functions, meaning different goal states. Some components of the optimization function, like maximizing the number of people with electricity, may be common, but others, like monetary flows, would be in competition for most actors.

That would lead to a breakout of each agent's goal state X^*_{agent} into its component parts. For example, if state is represented by state attributes such as X=[x_1,x_2,...,x_n], we could then set individual state attribute targets such as X^*=[x_1^*,x_2,...,x_n^*], where x_1^* and x_n^* have a target but x_2 acts as a don't care.
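One way to encode such don't-care attributes is a masked objective, where each agent scores the state only on the attributes it targets. The names and numbers here are illustrative:

```python
# Masked per-agent objective: None marks a don't-care attribute, so it
# contributes nothing to this agent's loss.

def masked_phi(X, targets):
    """Sum of squared errors over targeted state attributes."""
    return sum((x - t) ** 2 for x, t in zip(X, targets) if t is not None)

# Agent targets x_1* = 0 and x_3* = 3, and doesn't care about x_2:
X = [1.0, 5.0, 2.0]
print(masked_phi(X, [0.0, None, 3.0]))   # -> 2.0
```

Two agents acting on the same state can then each carry their own target vector, and their objectives conflict only where their masks overlap with different targets.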

Then, the three behavioral models you describe (loving the RL ideas) could be learned by each agent independently for their own optimization function, while also potentially learning goal values for their don't-care state attributes based on what they observe as the connection between those values and their opposing agents' optimizations over simulation or real-data outcomes.

We could then measure the overall effectiveness of the incentive mechanisms we are testing based on our overall system goal… e.g. transfer of ownership of the future cash flows of an energy-producing microgrid from the clean energy developers to the consumers.

Further, those initial don't-care variables that turn out to be important to an agent's overall optimization against other actors in the system could be prescribed to the relevant actor… e.g. telling consumers of electricity to petition developers to decrease the rate of periodic equipment replacement.

Any thoughts on how this could look in cadCAD, especially something like multiple runs of a simulated system to train multiple competing RL agents? I’d think that the way states are represented in cadCAD lends itself well to optimizing on sets of state attributes.


Reinforcement learning can actually be derived from control theory principles; it's essentially stochastic optimal control combined with neural networks. Since cadCAD is built on the same first principles, one can definitely do this; anyone interested should check out this textbook to better understand RL in the context of dynamical systems. We're looking to create patterns to streamline this work in the future.


Ok, cool! (1) What can we produce whilst working on the preparation to deduce, or at least recognize, such patterns?
(2) Regarding @Colin's question "…something like multiple runs of a simulated system…": is it rather "multiple simulation execution" or "parameter sweep" (e.g. if the range is not defined by the user but learned from data)?
[http://www.sonnetsoftware.com/support/help-v1554/help_topics/what_is_a_parameter_sweep_.htm]


We can/should model a machine in the system (e.g. a smart microgrid) also as an agent, right? Except instead of heuristics we can have a (another detailed) system model of the microgrid. Asking for a friend :) Community Power Grids

Do you have any updates here?

It is part of Kris Paruch's PhD research. Nothing new yet, but I'll ask him to share his work as it picks up.