...

PlantUML Macro
@startuml

participant "ACM Runtime"
participant "Participant-intermediary"
participant Participant

group Deploying the instance
  activate "ACM Runtime"
  "ACM Runtime" -> "Participant-intermediary" : [ASYNC] Deploying the instance
  deactivate "ACM Runtime"
  activate "Participant-intermediary"
  "Participant-intermediary" -> Participant : Create Deploy thread
  activate Participant
  deactivate "Participant-intermediary"
  note right
  Deploy thread is stuck
  end note
end

group Instance in Timeout
  activate "ACM Runtime"
  "ACM Runtime" -> "ACM Runtime" : set instance in Timeout
  deactivate "ACM Runtime"
end

group Undeploying the instance
  activate "ACM Runtime"
  "ACM Runtime" -> "Participant-intermediary" : [ASYNC] Undeploying the instance
  deactivate "ACM Runtime"
  activate "Participant-intermediary"
  "Participant-intermediary" -> Participant : Terminate Deploy thread
  deactivate Participant
  "Participant-intermediary" -> Participant : Create Undeploy thread
  activate Participant
  deactivate "Participant-intermediary"
  Participant -> "Participant-intermediary" : instance Undeployed
  activate "Participant-intermediary"
  deactivate Participant
  "Participant-intermediary" -> "ACM Runtime" : [ASYNC] instance Undeployed
  deactivate "Participant-intermediary"
end

@enduml

Solutions

Solution 1: Replicas and Dynamic participantId - still using cache

Changes in Participant:

  • The UUID participantId will be generated in memory instead of being fetched from the properties file.
  • The consumerGroup will be generated in memory instead of being fetched from the properties file (see the sketch below).
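
A minimal sketch of this change, assuming a simple holder class (the class and field names are illustrative, not the actual participant code):

import java.util.UUID;

public class ParticipantIdentity {

    // Generated once per process, so every replica starts with its own identity.
    private final UUID participantId = UUID.randomUUID();
    private final String consumerGroup = "ppnt-" + UUID.randomUUID();

    public UUID getParticipantId() {
        return participantId;
    }

    public String getConsumerGroup() {
        return consumerGroup;
    }
}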

...

  • When a participant goes OFF_LINE (a sketch of this failover logic follows the list):
    • if there are compositions connected to that participant, ACM-runtime will find another ON_LINE participant with the same supported element type;
    • if another ON_LINE participant is present, it will move the connection of all compositions and instances to it;
    • after that, it will execute a restart of all compositions and instances on the ON_LINE participant.
  • When a participant REGISTER is received:
    • it will check whether there are compositions connected to an OFF_LINE participant with the same supported element type;
    • if there are, it will move the connection of all compositions and instances to the newly registered participant;
    • after that, it will execute a restart of all compositions and instances that were changed.
  • Refactor the restarting scenario to apply restarting only to compositions and instances in transition.
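
A minimal sketch of the failover step, using simplified stand-in types rather than the real ACM-runtime model:

import java.util.*;

class Participant {
    UUID id = UUID.randomUUID();
    boolean online;
    Set<String> supportedElementTypes = new HashSet<>();
}

class Composition {
    UUID participantId;
}

class FailoverHandler {
    List<Participant> participants = new ArrayList<>();
    List<Composition> compositions = new ArrayList<>();

    void onParticipantOffline(Participant offline) {
        List<Composition> affected = compositions.stream()
                .filter(c -> offline.id.equals(c.participantId)).toList();
        if (affected.isEmpty()) {
            return; // nothing is connected to the failed participant
        }
        // Find another ON_LINE participant with the same supported element types.
        participants.stream()
                .filter(p -> p.online && !p.id.equals(offline.id))
                .filter(p -> p.supportedElementTypes.containsAll(offline.supportedElementTypes))
                .findFirst()
                .ifPresent(target -> {
                    affected.forEach(c -> c.participantId = target.id); // re-connect
                    restartAll(affected, target); // restart on the new participant
                });
    }

    void restartAll(List<Composition> affected, Participant target) {
        // Placeholder: send the restart messages for the moved compositions/instances.
    }
}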

Issues:

  • Participants create the participantId and Kafka consumerGroup randomly. This solution has been tested and has the issue of creating a new Kafka queue in the restarting scenario:
    during a restart, a new consumerGroup is created, which causes some initial messages to be missed due to the creation of the new Kafka queue. The result is that the participant fails to receive the messages from ACM needed to restore compositions and instances (see the sketch below).
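
One plausible mechanism for the missed messages, assuming standard Kafka client defaults: a brand-new consumer group has no committed offsets, so with auto.offset.reset=latest it only sees records produced after it joins. The topic name below is illustrative:

import java.util.List;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RestartConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // A fresh, random group id has no committed offsets ...
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "ppnt-" + UUID.randomUUID());
        // ... so "latest" skips every record published before the first
        // assignment, including restore messages sent during the restart.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (var consumer = new KafkaConsumer<String, String>(props)) {
            consumer.subscribe(List.of("acm-runtime-participant"));
        }
    }
}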

Solution 2: StatefulSets - still uses cache

Participant replicas can be deployed as a Kubernetes StatefulSet whose pods consume two different properties files with unique UUIDs and unique consumer groups.
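
A minimal sketch of how each replica could select its own properties file, assuming the standard StatefulSet pod-naming convention (the file path is illustrative):

import java.nio.file.Path;

public class ReplicaConfig {
    public static void main(String[] args) {
        // StatefulSet pods get stable hostnames such as "policy-http-ppnt-0".
        String hostname = System.getenv().getOrDefault("HOSTNAME", "policy-http-ppnt-0");
        String ordinal = hostname.substring(hostname.lastIndexOf('-') + 1);
        // Each replica loads the properties file holding its unique
        // participantId and consumer group.
        Path props = Path.of("/etc/participant", "participant-" + ordinal + ".properties");
        System.out.println("Loading " + props);
    }
}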

...

Note: In a scenario with two participant replicas (call them "policy-http-ppnt-0" and "policy-http-ppnt-1"), ACM-Runtime will randomly assign any composition definition at prime time to a specific participant, based on the supported element definition type. So we could have a scenario where the composition definition "composition 1.0.0" and its instance are assigned to policy-http-ppnt-0, while the new composition "composition 1.0.1" is assigned to policy-http-ppnt-1.

Issues:

  • At migration time: the migration of an instance from "composition 1.0.0" to "composition 1.0.1" would not work, because policy-http-ppnt-0 does not have "composition 1.0.1" assigned. This is a critical issue.

Solution 3: Replicas and Database support - no cache

Changes in Participant:

  • Redesign the TimeOut scenario: the Participant has the responsibility to stop the thread in execution after a specific time (see the first sketch after this list).
  • Add client support for a database (MariaDB or PostgreSQL).
  • Add a mock database for Unit Tests.
  • Refactor CacheProvider into ParticipantProvider to support insert/update in the participant-intermediary with transactions (see the second sketch after this list).
  • Refactor the Intermediary to use the insert/update of ParticipantProvider.
  • Refactor Participants that use their own in-memory HashMap (the Policy Participant saves policies and policy types in memory).
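
A minimal sketch of the redesigned TimeOut handling, assuming a plain ExecutorService (class and method names are illustrative):

import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DeployExecutor {
    private final ExecutorService executor = Executors.newCachedThreadPool();

    public void deployWithTimeout(Runnable deployTask, long timeoutSeconds) {
        Future<?> future = executor.submit(deployTask);
        try {
            future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck Deploy thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (ExecutionException e) {
            // the deploy task itself failed; report the failure to ACM-runtime
        }
    }
}

And a sketch of the shape the ParticipantProvider refactor could take; the types are simplified stand-ins, not the actual ONAP code:

import java.util.Optional;
import java.util.UUID;

record AutomationCompositionInfo(UUID instanceId, String deployState) {}

// Replacement for the in-memory CacheProvider: state lives in the database,
// so any replica can handle any message.
interface ParticipantProvider {

    Optional<AutomationCompositionInfo> findInstance(UUID instanceId);

    // insert/update run inside a transaction so concurrent replicas
    // cannot overwrite each other's changes
    void saveInstance(AutomationCompositionInfo instance);
}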

...

  • The DB migrator will alter older versions of the DB to add the new parts of the schema required by this participant change
  • Liquibase will be used for script generation
  • A separate image is needed for the DB Migrator - this will have to be released as a new dependency
  • A new Job in Kubernetes and a new service in Docker should be added for this migration

Advantages of DB use

  • Multiple participant replicas are possible - messages can be handled across many participants
  • All participants share the same group-id in Kafka (see the sketch after this list)
  • All participants share the same participant-id.
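
A sketch of that shared-identity configuration, with illustrative values: because every replica joins the same Kafka consumer group, Kafka assigns each partition to exactly one replica, so the message load is spread instead of duplicated:

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class SharedIdentityConfig {
    // With DB-backed state, every replica can use one fixed identity.
    static final String PARTICIPANT_ID = "101c62b3-8918-41b9-a747-d21eb79c6c03";

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // The same group.id on all replicas: Kafka load-balances the
        // topic partitions across them.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "policy-http-ppnt");
        return props;
    }
}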

Solution 4: Distributed Cache

Issues:

  • Not persistent - if the application that handles the cache server restarts, data is lost.
  • Approval issues with Redis, Etcd, and Search Engine.

Optimal Solution:

After analysis, it is clear that the best solution to use is number 3.

  • An arbitrary number of participant replicas is possible
  • The DB migrator upgrades older database versions
  • The restart scenario is no longer applicable and could be removed
  • Approval is not an issue - PostgreSQL is already used by ACM