Generating Peer synchronization issue

Having issues with your Timekoin Server? Someone might be able to answer your questions here.
tiker
Posts: 81
Joined: Wed Jun 20, 2012 2:13 pm

Generating Peer synchronization issue

Post by tiker »

I haven't read all of the TimeKoin code, but from what I have read while trying to determine why generating servers are dropping for no reason, here are my thoughts.

TimeKoin tries to keep two queues synchronized (Transaction History and Generating Peers). Both of these seem to be synchronized independently, which could be part of the problem.

The generating peer list synchronization is a hash check between the server and all of the connected peers (or only the permanent peers if the Peer Priority option is enabled). If the local server's hash differs from the hash reported by 51% of the peers it checks, it begins the repair process.

Let's assume the following:
- 10 connected peers
- 4 peers say hash = abc123 (our server has this same hash)
- 1 peer says hash = bcd234
- 1 peer says hash = ccd235
- 1 peer says hash = dcd236
- 1 peer says hash = ecd237
- 1 peer says hash = fcd238
- 1 peer says hash = gcd239

Using this example, 6 peers have a hash different from our hash, so we assume something is wrong. TimeKoin will pick one of the 6 servers we don't match and query that server for its peers. TimeKoin will then connect to that peer and start comparing the generating peers. If the peer has an extra public key we don't have, add it to our list. If we have an extra key that the peer doesn't have, increment a counter by 1. Once our counter reaches 20, clear our list and copy the list from one of those peers. This is when the "Generation Peer List has Become Stale, Attempting to Purge and Rebuild." message is logged and we get dropped from the generation queue. (In the thread I posted where my server rebuilt the queue about 10 times in 2 hours, this could be the reason: it grabbed the generating list of one of the servers that disagreed, but that single hash was still different from the rest, so it tried again. It is connected to 25 peers.)
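
Boiled down, the logic reads to me like the sketch below. The function and variable names are mine, not the actual genpeers.php code, and the stale counter may well persist across checks rather than being recounted each pass:

Code: Select all

<?php
// Sketch of how I read the current generating peer list sync, not the real code.
function sync_generating_peer_list(array $my_list, $my_hash, array $peer_hashes, $fetch_peer_list)
{
    if (count($peer_hashes) == 0) {
        return $my_list; // nothing to compare against
    }

    // Find the peers whose generating-list hash disagrees with ours.
    $disagreeing = array();
    foreach ($peer_hashes as $peer => $hash) {
        if ($hash !== $my_hash) {
            $disagreeing[] = $peer;
        }
    }

    // No repair attempt unless 51% or more of the checked peers disagree.
    if (count($disagreeing) < ceil(count($peer_hashes) * 0.51)) {
        return $my_list;
    }

    // Pick one of the disagreeing peers and compare lists key by key.
    $peer_list = call_user_func($fetch_peer_list, $disagreeing[array_rand($disagreeing)]);

    $stale_counter = 0;
    foreach ($peer_list as $public_key) {
        if (!in_array($public_key, $my_list, true)) {
            $my_list[] = $public_key; // the peer has a key we don't: add it
        }
    }
    foreach ($my_list as $public_key) {
        if (!in_array($public_key, $peer_list, true)) {
            $stale_counter++; // we have a key the peer doesn't: count it
        }
    }

    // Once the counter hits 20, purge our list and copy the peer's wholesale.
    // This is the "Generation Peer List has Become Stale" case.
    if ($stale_counter >= 20) {
        $my_list = $peer_list;
    }

    return $my_list;
}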

Can this be exploited?

What would happen if I were to bring 10 servers online with a crafted generating peer list of my own? These 10 servers would be permanent peers to each other with the "Peer Priority" option enabled. Then I would tweak the genpeers.php code to prevent the other nodes on the TimeKoin network from overwriting or changing my crafted peer list.

Now I pick a node which has a small number of connected peers (8 I believe is the default) and have all of my nodes join that single node. (Set that node as a "First Contact" node.) That small node will suddenly have 10 servers reporting a generation hash different from its own and from the other 8 nodes it is connected to, so well over 51% of its connected peers disagree with it. Because the 10 servers won't change their generating node list, the server they connected to eventually would, by clearing its cache and accepting whatever crafted list I created. At that point I query that small server to see what its other peers are, find the next smallest one, and do the same thing. Eventually I'd have enough peers connected and converted that I could alter the generating peer list for everyone.

The possible fix?

Instead of having TimeKoin depend on a majority of the peers to determine the accuracy of the generating peer list, why not have it use the transaction history to determine what's right and what's wrong? In the case of the servers being dropped in the past few weeks, had the nodes actually checked the transaction history (2 hours' worth of history) for generation transactions from their own public keys, they would have seen that they were generating; it's in the history, therefore still valid. The servers should also check the transaction history for the other public keys that might have gone missing and make their own decision on whether each public key should be dropped from the generating queue or not.
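
To make the idea concrete, here is a rough sketch of the kind of check I mean. The table and column names are made up for illustration only, not the real database schema:

Code: Select all

<?php
// Idea only: before dropping a public key from the generating list, check our
// own transaction history for a generation transaction from that key within
// the last 2 hours. Table and column names below are made up.
function key_generated_recently(mysqli $db, $public_key, $window_seconds = 7200)
{
    $cutoff = time() - $window_seconds;
    $stmt = $db->prepare("SELECT COUNT(*) FROM transaction_history
                          WHERE public_key = ? AND transaction_type = 'generation'
                          AND timestamp >= ?");
    $stmt->bind_param('si', $public_key, $cutoff);
    $stmt->execute();
    $stmt->bind_result($count);
    $stmt->fetch();
    $stmt->close();
    return $count > 0;
}

function should_drop_from_queue(mysqli $db, $public_key)
{
    // Only drop a key if our own history shows no generation transaction from
    // it in the last 2 hours, regardless of what the peer hashes say.
    return !key_generated_recently($db, $public_key);
}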

Ultimately, if two servers disagree on the queue, it's because one has seen generation transactions that the other has not, and one node decided to drop a public key from the list while the other server did not. This should trigger both conflicting nodes to check their transaction histories, verify that the transaction cycle hashes match, fix them if they don't, and then review the transactions to see whether the nodes did in fact send generation transactions or not.

There are going to be more checks to do along this path, but something like this needs to be put in place to prevent generating nodes from being kicked from the queue even when the servers can see the generation transactions in the history.


I have some other ideas relating to this that might fix some other potential issues but I've got to head home from work now. What are your thoughts on this? Is there something in the code already that I haven't seen that prevents anything I've said above?
PoisonWolf
Posts: 186
Joined: Fri Apr 12, 2013 10:39 am

Re: Generating Peer synchronization issue

Post by PoisonWolf »

If I'm reading this right, does having a higher peer limit help reduce the chance of this happening? Say 30 active peers & 15 in reserve? Or did I get it completely wrong?

The two-hour back-tracking into one's transaction history (TH) makes sense, but one side effect that I foresee is people with corrupted histories perpetuating their own generation status. Maybe the fix would include checking both their own TH and all of the connected peers' TH to confirm that a server has indeed been generating at least ONCE in the past 2 hours? Would this be better as a 51% rule, or should it require 100% confirmation that all connected peers have TH showing that the target server did in fact generate a TK within the past two hours? How stringent this setting is has implications when a rogue, problematic server screws it up for everyone else.
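
Just to illustrate the knob I'm talking about, something like this (none of it is real TimeKoin code, only the 51% vs. 100% decision in isolation):

Code: Select all

<?php
// Decide whether a key keeps its generating status based on how many connected
// peers' transaction histories (TH) show it generating in the last 2 hours.
// $own_report is what our own TH says; $peer_reports is one true/false per peer.
function keeps_generating_status($own_report, array $peer_reports, $required = 0.51)
{
    if (!$own_report) {
        return false; // our own history doesn't show it generating
    }
    if (count($peer_reports) == 0) {
        return true;  // no peers to ask; fall back to our own history
    }
    $confirming = count(array_filter($peer_reports));
    // $required = 0.51 gives a majority rule, 1.0 demands every peer confirm.
    return ($confirming / count($peer_reports)) >= $required;
}

// Example: 7 of 10 peers confirm -> kept under the 51% rule, dropped under 100%.
$reports = array(true, true, true, true, true, true, true, false, false, false);
var_dump(keeps_generating_status(true, $reports, 0.51)); // bool(true)
var_dump(keeps_generating_status(true, $reports, 1.0));  // bool(false)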

tiker wrote:I haven't read all of the TimeKoin code, but from what I have read while trying to determine why generating servers are dropping for no reason, here are my thoughts.

TimeKoin tries to keep two queues synchronized (Transaction History and Generating Peers). Both of these seem to be synchronized independently, which could be part of the problem.

The generating peer list synchronization is a hash check between the server and all of the connected peers (or only the permanent peers if the Peer Priority option is enabled). If the local server's hash differs from the hash reported by 51% of the peers it checks, it begins the repair process.

Let's assume the following:
- 10 connected peers
- 4 peers say hash = abc123 (our server has this same hash)
- 1 peer says hash = bcd234
- 1 peer says hash = ccd235
- 1 peer says hash = dcd236
- 1 peer says hash = ecd237
- 1 peer says hash = fcd238
- 1 peer says hash = gcd239

Using this example, 6 peers have a hash different from our hash, so we assume something is wrong. TimeKoin will pick one of the 6 servers we don't match and query that server for its peers. TimeKoin will then connect to that peer and start comparing the generating peers. If the peer has an extra public key we don't have, add it to our list. If we have an extra key that the peer doesn't have, increment a counter by 1. Once our counter reaches 20, clear our list and copy the list from one of those peers. This is when the "Generation Peer List has Become Stale, Attempting to Purge and Rebuild." message is logged and we get dropped from the generation queue. (In the thread I posted where my server rebuilt the queue about 10 times in 2 hours, this could be the reason: it grabbed the generating list of one of the servers that disagreed, but that single hash was still different from the rest, so it tried again. It is connected to 25 peers.)

Can this be exploited?

What would happen if I were to bring 10 servers online with a crafted generating peer list of my own? These 10 servers would be permanent peers to each other with the "Peer Priority" option enabled. Then I would tweak the genpeers.php code to prevent the other nodes on the TimeKoin network from overwriting or changing my crafted peer list.

Now I pick a node which has a small number of connected peers (8 I believe is the default) and have all of my nodes join that single node. (Set that node as a "First Contact" node.) That small node will suddenly have 10 servers reporting a generation hash different from its own and from the other 8 nodes it is connected to, so well over 51% of its connected peers disagree with it. Because the 10 servers won't change their generating node list, the server they connected to eventually would, by clearing its cache and accepting whatever crafted list I created. At that point I query that small server to see what its other peers are, find the next smallest one, and do the same thing. Eventually I'd have enough peers connected and converted that I could alter the generating peer list for everyone.

The possible fix?

Instead of having TimeKoin depend on a majority of the peers to determine the accuracy of the generating peer list, why not have it use the transaction history to determine what's right and what's wrong? In the case of the servers being dropped in the past few weeks, had the nodes actually checked the transaction history (2 hours' worth of history) for generation transactions from their own public keys, they would have seen that they were generating; it's in the history, therefore still valid. The servers should also check the transaction history for the other public keys that might have gone missing and make their own decision on whether each public key should be dropped from the generating queue or not.

Ultimately, if two servers disagree on the queue, it's because one has seen generation transactions that the other has not, and one node decided to drop a public key from the list while the other server did not. This should trigger both conflicting nodes to check their transaction histories, verify that the transaction cycle hashes match, fix them if they don't, and then review the transactions to see whether the nodes did in fact send generation transactions or not.

There are going to be more checks to do along this path, but something like this needs to be put in place to prevent generating nodes from being kicked from the queue even when the servers can see the generation transactions in the history.


I have some other ideas relating to this that might fix some other potential issues but I've got to head home from work now. What are your thoughts on this? Is there something in the code already that I haven't seen that prevents anything I've said above?
tiker
Posts: 81
Joined: Wed Jun 20, 2012 2:13 pm

Re: Generating Peer synchronization issue

Post by tiker »

PoisonWolf wrote:If I'm reading this right, does having a higher peer limit help reduce the chance of this happening? Say 30 active peers & 15 in reserve? Or did I get it completely wrong?
From what I have seen and from what I think is happening, having a larger peer list might reduce the chances of this happening.

The servers with fewer connected peers (or with the Permanent Peer priority option enabled) would be the first ones to convert if there's ever a disagreement, since it takes a smaller number of conflicting peers to cause them to change. However, if most of the servers on the network right now have a small number of peers connected and can be converted easily, having a larger number of peers will only delay the rebuild.

My first server, which had been dropped in previous weeks, was connected to 10 - 15 peers each time it happened. The server that dropped last night was connected to 25 peers when it got dropped, but my little server survived this time.

It's really more about the list that the rogue server (or group of servers, if the generating peer list is split between nodes) is pushing: what it contains and how it replicates through the rest of the network.
tiker
Posts: 81
Joined: Wed Jun 20, 2012 2:13 pm

Re: Generating Peer synchronization issue

Post by tiker »

PoisonWolf wrote: The two-hour back-tracking into one's transaction history (TH) makes sense, but one side effect that I foresee is people with corrupted histories perpetuating their own generation status. Maybe the fix would include checking both their own TH and all of the connected peers' TH to confirm that a server has indeed been generating at least ONCE in the past 2 hours? Would this be better as a 51% rule, or should it require 100% confirmation that all connected peers have TH showing that the target server did in fact generate a TK within the past two hours? How stringent this setting is has implications when a rogue, problematic server screws it up for everyone else.
It would be the same as if a new server that was catching up or had a corrupted history tried to process a normal transaction. It would mark it as failed, wouldn't include it in the history, and its transaction cycle hash would be different from everyone else's. That server would remove the peer from its own generating list, but all of the other servers with accurate transaction histories would still process the generation transactions and would keep the peer in the generating list.

If a server that thought it was able to generate created a generation transaction but wasn't actually allowed to, it would still be denied. I have actually seen this happen while watching the queues.
warmach
Posts: 404
Joined: Thu Jun 21, 2012 5:18 pm

Re: Generating Peer synchronization issue

Post by warmach »

I hadn't thought about having it fall back to looking at the transaction history, :oops: . Yes, it will add more DB calls, but the parameters are indexed so it should be a quick lookup.

When is a transaction considered "confirmed"? I mean, if a generation transaction only propagates through half the network before the end of a cycle, you will have two "teams" of servers, one with the transaction and one without. They will then duke it out to determine the majority, and said transaction will be included or not included. How long does this take? Longer than 2 hours? If so, then it would invalidate your fix.

The other part: what about newly elected servers? They do not have any transactions yet but still need to be in the generation list. Your fallback method would not work for these. The only way to handle that would be to have a special election log that you could look up to see when a key was actually elected if there are no transactions in the history.
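
Roughly what I'm picturing, with made-up table and column names (the election_log table would be something new, it doesn't exist today):

Code: Select all

<?php
// Fallback order: check the (indexed) transaction history first, and only if
// the key has nothing there, consult a hypothetical election log.
function was_recently_active(mysqli $db, $public_key, $window_seconds = 7200)
{
    $cutoff = time() - $window_seconds;

    // 1. Quick lookup against the indexed transaction history.
    $stmt = $db->prepare("SELECT COUNT(*) FROM transaction_history
                          WHERE public_key = ? AND timestamp >= ?");
    $stmt->bind_param('si', $public_key, $cutoff);
    $stmt->execute();
    $stmt->bind_result($tx_count);
    $stmt->fetch();
    $stmt->close();
    if ($tx_count > 0) {
        return true;
    }

    // 2. Newly elected key with no transactions yet: check when it was elected.
    $stmt = $db->prepare("SELECT COUNT(*) FROM election_log
                          WHERE public_key = ? AND elected_time >= ?");
    $stmt->bind_param('si', $public_key, $cutoff);
    $stmt->execute();
    $stmt->bind_result($elected_count);
    $stmt->fetch();
    $stmt->close();
    return $elected_count > 0;
}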
PoisonWolf
Posts: 186
Joined: Fri Apr 12, 2013 10:39 am

Re: Generating Peer synchronization issue

Post by PoisonWolf »

warmach wrote:I hadn't thought about having it fall back to looking at the transaction history, :oops: . Yes, it will add more DB calls, but the parameters are indexed so it should be a quick lookup.

When is a transaction considered "confirmed"? I mean, if a generation transaction only propagates through half the network before the end of a cycle, you will have two "teams" of servers, one with the transaction and one without. They will then duke it out to determine the majority, and said transaction will be included or not included. How long does this take? Longer than 2 hours? If so, then it would invalidate your fix.

The other part: what about newly elected servers? They do not have any transactions yet but still need to be in the generation list. Your fallback method would not work for these. The only way to handle that would be to have a special election log that you could look up to see when a key was actually elected if there are no transactions in the history.
What would happen if the system moved to, say, a 60% rule as opposed to a 51% rule? And what if the check were expanded to include not just the active peer list, but also the lists of all peers in reserve?

Would this make attacks like the one tiker proposed more difficult?
tiker
Posts: 81
Joined: Wed Jun 20, 2012 2:13 pm

Re: Generating Peer synchronization issue

Post by tiker »

warmach wrote:I hadn't thought about having it fall back to looking at the transaction history, :oops: . Yes, it will add more DB calls, but the parameters are indexed so it should be a quick lookup.

When is a transaction considered "confirmed"? I mean, if a generation transaction only propagates through half the network before the end of a cycle, you will have two "teams" of servers, one with the transaction and one without. They will then duke it out to determine the majority, and said transaction will be included or not included. How long does this take? Longer than 2 hours? If so, then it would invalidate your fix.

The other part: what about newly elected servers? They do not have any transactions yet but still need to be in the generation list. Your fallback method would not work for these. The only way to handle that would be to have a special election log that you could look up to see when a key was actually elected if there are no transactions in the history.
It can take a day or two, depending on when the cycles get hashed into a foundation and whether the foundation hashes match. This is no different from what already happens today. Right now my server is fighting to repair transaction cycles, with "too much peer conflict" messages being logged. That goes back to my question of "why are we trying to keep two queues in sync instead of just focusing on one?"

Newly elected servers have to have been around long enough, with some transaction cycles, to send election requests and see the votes that elected them. If a newly elected server gets removed from the list before it sends out a generation transaction, then there's not much we can do about that either way. If that same server sees another public key being removed and that key does have a generation transaction within the previous 2 hours, the node would protect against that removal by not removing the key from its list. The more servers that refuse to remove valid generating servers, the lower the chances of servers being removed improperly.

I thought about having an election log, but that is more data that would need to be stored and synchronized, which doesn't change things much.
PoisonWolf
Posts: 186
Joined: Fri Apr 12, 2013 10:39 am

Re: Generating Peer synchronization issue

Post by PoisonWolf »

Do you think it is happening again? I just noticed quite a plummet from 90+ to 80 peers left. We seem to lose about 10% of the generating servers about once every fortnight or so?
tiker
Posts: 81
Joined: Wed Jun 20, 2012 2:13 pm

Re: Generating Peer synchronization issue

Post by tiker »

PoisonWolf wrote:Do you think it is happening again? I just noticed quite a plummet from 90+ to 80 peers left. We seem to lose about 10% of the generating servers about once every fortnight or so?
Yes and no...

Every so often you'll notice a period where no elections take place for 2 days. During this period the number of generating peers decreases to the high 70s or low 80s. The first election period following the 2-day election break will contain election requests with (what are now considered) invalid time stamps. (The seconds value is greater than 01.) There will be close to 10 of these requests, meaning there are still about 10 servers that have not updated the code yet. Eventually they'll get elected and the generating count will go up by 10 (or so), bringing the count into the 90s. Since the servers that have updated the code won't accept generation transactions from these older servers, they eventually get removed again, making it appear as if the problem occurred again.

A patch should probably be put in place to ignore election requests with invalid time stamps, to prevent those servers from being elected over an updated server.
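
Something as simple as the sketch below on the receiving side would probably do it. The timestamp rule here is just my reading of the invalid requests above, not the actual check in the updated code:

Code: Select all

<?php
// Filter out election requests whose time stamps fail the validity rule before
// they can win an election, instead of electing the server and removing it later.
function is_valid_election_timestamp($timestamp)
{
    // Assumed rule from above: reject any request whose seconds field is > 01.
    return (int) date('s', $timestamp) <= 1;
}

function filter_election_requests(array $requests)
{
    $valid = array();
    foreach ($requests as $request) {
        // Each $request is assumed to look like:
        // array('public_key' => '...', 'timestamp' => unix time of the request)
        if (is_valid_election_timestamp($request['timestamp'])) {
            $valid[] = $request;
        }
        // Invalid requests are simply ignored so an out-of-date server
        // can't be elected over an updated one.
    }
    return $valid;
}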

While I haven't confirmed this is the case, it would make sense considering my servers have been surviving and no one else has recently reported being dropped from the generating peers list.
PoisonWolf
Posts: 186
Joined: Fri Apr 12, 2013 10:39 am

Re: Generating Peer synchronization issue

Post by PoisonWolf »

Knight, I know you just got back, but do you have any thoughts on the issues discussed in this particular thread? Although this was talked about a long time ago, I still feel that it is an extremely critical issue that needs to be addressed.