...
PlantUML Macro:

```plantuml
@startuml
alt "Deploying the instance"
  activate "ACM Runtime"
  "ACM Runtime" -> "Participant-intermediary" : [ASYNC] Deploying the instance
  deactivate "ACM Runtime"
  activate "Participant-intermediary"
  activate Participant
  "Participant-intermediary" -> Participant : Create Deploy thread
  deactivate "Participant-intermediary"
  note right
  Deploy thread is stuck
  end note
end

alt "Instance in Timeout"
  activate "ACM Runtime"
  "ACM Runtime" -> "ACM Runtime" : set instance in Timeout
  deactivate "ACM Runtime"
end

alt "Undeploying the instance"
  activate "ACM Runtime"
  activate "Participant-intermediary"
  "ACM Runtime" -> "Participant-intermediary" : [ASYNC] Undeploying the instance
  deactivate "ACM Runtime"
  "Participant-intermediary" -> Participant : Terminate Deploy thread
  deactivate Participant
  "Participant-intermediary" -> Participant : Create Undeploy thread
  activate Participant
  deactivate "Participant-intermediary"
  Participant -> "Participant-intermediary" : instance Undeployed
  activate "Participant-intermediary"
  deactivate Participant
  "Participant-intermediary" -> "ACM Runtime" : [ASYNC] instance Undeployed
  deactivate "Participant-intermediary"
end
@enduml
```
Solutions
Solution 1: Replicas and Dynamic participantId - still using cache
Changes in Participant:
- The UUID participantId will be generated in memory instead of being read from the properties file.
- The consumerGroup will be generated in memory instead of being read from the properties file.
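The in-memory generation could look like the minimal Java sketch below (the class and field names are hypothetical illustrations, not the actual Intermediary code):

```java
import java.util.UUID;

// Hypothetical sketch: identifiers are generated at startup instead of
// being read from the properties file.
public class ParticipantIdentity {
    public final UUID participantId = UUID.randomUUID();
    // Note: a fresh consumer group per replica means each replica gets its
    // own copy of every message; there is no load-balancing between replicas.
    public final String consumerGroup = "ppnt-" + UUID.randomUUID();
}
```

As noted under Issues below, a consumer group generated this way changes on every restart, which is what triggers the creation of a new Kafka queue.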
...
- When a participant goes OFF_LINE:
  - if there are compositions connected to that participant, ACM-runtime will find another ON_LINE participant with the same supported element type;
  - if another ON_LINE participant is present, it will change the connection of all compositions and instances to it;
  - after that, it will execute a restart of all those compositions and instances on the ON_LINE participant.
- When a participant REGISTER is received:
  - it will check whether there are compositions connected to an OFF_LINE participant with the same supported element type;
  - if there are, it will change the connection of all compositions and instances to the newly registered participant;
  - after that, it will execute a restart of all compositions and instances that were changed.
- Refactor the restart scenario so that restarting applies only to compositions and instances that are in transition.
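The failover rule above can be sketched as follows (all type and method names are hypothetical illustrations, not the actual ACM-runtime code):

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of the reassignment rule: when a participant goes
// OFF_LINE, pick another ON_LINE participant that supports the same
// element type, so its compositions and instances can be moved over.
public class FailoverSelector {

    public record Participant(UUID id, boolean online, Set<String> supportedElementTypes) {}

    // Returns the first ON_LINE participant (other than the offline one)
    // that supports the given element type, if any.
    public static Optional<Participant> findReplacement(
            List<Participant> all, UUID offlineId, String elementType) {
        return all.stream()
                .filter(Participant::online)
                .filter(p -> !p.id().equals(offlineId))
                .filter(p -> p.supportedElementTypes().contains(elementType))
                .findFirst();
    }
}
```

If no ON_LINE participant with the same supported element type exists, the compositions stay connected to the OFF_LINE participant until a matching REGISTER arrives.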
Issues:
- Participants create the participantId and Kafka consumerGroup randomly. This solution has been tested and has the issue of creating a new Kafka queue in the restart scenario.
- During the restart scenario, a new consumerGroup is created, which causes some initial messages to be missed because the new Kafka queue is still being created. The result is that the participant fails to receive the messages from ACM needed to restore compositions and instances.
Solution 2: StatefulSets - still uses cache
Participant replicas can be deployed as a Kubernetes StatefulSet whose pods consume two different properties files, each with a unique UUID and a unique consumer group.
...
Note: In a scenario with two participant replicas (call them "policy-http-ppnt-0" and "policy-http-ppnt-1"), ACM-Runtime randomly assigns each composition definition at prime time to a specific participant, based on the supported element definition type. We could therefore have a scenario where composition definition "composition 1.0.0" and its instance are assigned to policy-http-ppnt-0, while the new composition "composition 1.0.1" is assigned to policy-http-ppnt-1. In that scenario, migrating an instance from "composition 1.0.0" to "composition 1.0.1" would not work, because policy-http-ppnt-0 does not have "composition 1.0.1" assigned.
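A StatefulSet gives each pod a stable name ending in an ordinal ("policy-http-ppnt-0", "policy-http-ppnt-1"), which is what makes per-replica configuration possible. A minimal sketch of deriving the per-replica properties file from the pod name (the helper and the file-naming scheme are hypothetical):

```java
// Hypothetical sketch: each StatefulSet pod picks its own properties file
// (containing a fixed UUID and consumer group) based on its stable ordinal.
public class ReplicaConfig {

    // Extracts the ordinal from a StatefulSet pod name such as
    // "policy-http-ppnt-1" and maps it to a per-replica properties file.
    public static String propertiesFileFor(String podName) {
        int ordinal = Integer.parseInt(podName.substring(podName.lastIndexOf('-') + 1));
        return "participant-" + ordinal + ".properties";
    }
}
```

In Kubernetes the pod name is typically available through the HOSTNAME environment variable, so each replica can resolve its file at startup without extra coordination.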
Issues:
- At migration time: in the scenario above, migrating an instance from "composition 1.0.0" to "composition 1.0.1" would not work, because policy-http-ppnt-0 does not have "composition 1.0.1" assigned. This is a critical issue.
Solution 3: Replicas and Database support - no cache
Changes in Participant:
- Redesign the TimeOut scenario: the Participant has the responsibility to stop the thread in execution after a specific time.
- Add client support for a database (MariaDB or PostgreSQL).
- Add a mock database for unit tests.
- Refactor CacheProvider into a ParticipantProvider that supports insert/update in participant-intermediary, with transactions.
- Refactor the Intermediary to use the insert/update operations of ParticipantProvider.
- Refactor Participants that use their own in-memory HashMap (e.g. the Policy Participant saves policies and policy types in memory).
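The redesigned TimeOut scenario above (the participant itself stops a deploy thread that runs too long, instead of leaving it stuck as in the sequence diagram) can be sketched as follows; the class and method names are hypothetical:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: the deploy task runs under a time budget and is
// interrupted by the participant itself when the budget expires, so no
// "stuck" deploy thread survives a timeout.
public class BoundedDeploy {

    public static boolean deployWithTimeout(Runnable deployTask, long timeoutMs)
            throws InterruptedException {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<?> future = executor.submit(deployTask);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;                 // deploy finished within the budget
        } catch (ExecutionException e) {
            return false;                // deploy task itself failed
        } catch (TimeoutException e) {
            future.cancel(true);         // interrupt the stuck deploy thread
            return false;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With this in place, ACM-runtime no longer needs to be the only component that notices a timeout: the participant reports a failed deploy instead of silently holding a stuck thread.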
...
- The DB migrator will alter older versions of the DB to add the new parts of the schema required by this participant change
- Liquibase is used for script generation
- A separate image is needed for the DB migrator; it will have to be released as a new dependency
- A new Job in Kubernetes and a new service in Docker should be added for this migration
Advantages of DB use:
- Multiple participant replicas are possible; messages can be handled across many participant replicas
- All participants should have the same group-id in Kafka
- All should have the same participant-id
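A minimal sketch of this shared-identity configuration (the group-id value, the participant-id value, and the helper class are hypothetical examples, not the shipped defaults):

```java
import java.util.Properties;
import java.util.UUID;

// Hypothetical sketch: with DB-backed state, every replica shares the same
// Kafka group.id and participant-id, so the broker load-balances partitions
// across replicas instead of delivering every message to every replica.
public class SharedGroupConfig {

    // Fixed and identical on all replicas (example value, not a real default).
    public static final String PARTICIPANT_ID = "00000000-0000-0000-0000-000000000001";

    public static Properties consumerProps() {
        Properties props = new Properties();
        props.put("group.id", "policy-http-ppnt");            // same on every replica
        props.put("client.id", "ppnt-" + UUID.randomUUID());  // unique per replica process
        return props;
    }
}
```

This is the inverse of Solution 1: there the per-replica identifiers forced each replica to see every message, while here the shared group-id lets Kafka distribute messages across replicas, which is safe only because state lives in the database rather than in a per-replica cache.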
Solution 4: Distributed Cache
Issues:
- Not persistent: if the application that handles the cache server restarts, data is lost.
- Approval issues with Redis, Etcd, and search engines.
Optimal Solution:
After analysis, it is clear that the best solution is number 3:
- An arbitrary number of participant replicas is possible
- The DB migrator upgrades older versions of the schema
- The restart scenario is no longer applicable and could be removed
- Approval is not an issue: PostgreSQL is already used by ACM