main_loop_status records

Development & Technical discussion about Timekoin.
warmach
Posts: 404
Joined: Thu Jun 21, 2012 5:18 pm


Post by warmach »

I have come across an issue that has happened several times in the last year or two, but I haven't been able to nail it down yet. Not sure if anyone else has had similar issues.

Problem:
There are times when my home page lists a script or two as "OFFLINE" while the rest of the scripts are running or listed as idle. Stopping and restarting the Timekoin server software does not reset the offline scripts or re-activate them. This is particularly frustrating when the generation script is the one affected. Often I only notice hours after it happens, by which point I have been kicked from the generation queue.

Upon further investigation, certain records were missing from the main_loop_status table. In this latest instance, the "generation_heartbeat_active" record was missing. Simply inserting it back in restarted the script.

Issue theory:
I have not been able to reproduce this, which makes it hard to debug. My theory is that it is related to the deletion of these records from the main_loop_status table. Looking at the code, deleting these records is common practice. The problem lies with re-inserting them. Often, the "active" and "heartbeat" records are checked and then inserted together. Many times, only one of the two is deleted, and if the script stops (power failure, non-graceful system shutdown) before the second is updated, you end up with one record present while the system assumes the other is there too and never re-inserts it. The result is a script that never starts.

Solution:
Modify the code so that both records are updated/deleted in the same database transaction. That way, a system failure between the two operations cannot leave the table with only one of the records and throw your whole server out of whack.
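
To illustrate the idea, here is a minimal sketch in Python with an in-memory SQLite database standing in for the real thing. Timekoin's own code and database differ from this, so treat everything here as an assumption: the column names field_name and field_data, the record name "generation_last_heartbeat", and the helper functions are all hypothetical. It also presumes the table's storage engine supports transactions at all.

```python
import sqlite3

def make_status_table(conn):
    # Stand-in for the main_loop_status table; column names are assumed.
    conn.execute(
        "CREATE TABLE main_loop_status ("
        "field_name TEXT PRIMARY KEY, field_data TEXT)"
    )

def replace_status_pair(conn, values, fail_midway=False):
    """Delete and re-insert every record in `values` as one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        for name in values:
            conn.execute(
                "DELETE FROM main_loop_status WHERE field_name = ?", (name,)
            )
        if fail_midway:
            # Hypothetical hook: simulates the script dying after the
            # delete but before the insert.
            raise RuntimeError("simulated failure")
        conn.executemany(
            "INSERT INTO main_loop_status (field_name, field_data) "
            "VALUES (?, ?)",
            list(values.items()),
        )
```

If the failure hits between the delete and the insert, the whole transaction is rolled back, so the table still holds both of the old records rather than just one of them.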

Did that make sense?
KnightMB
Site Admin
Posts: 1019
Joined: Thu Feb 23, 2012 5:03 pm

Re: main_loop_status records

Post by KnightMB »

The main_loop_status table runs directly in RAM. As you stated, it gets a lot of updates, so hitting the disk all the time was burning I/O, hence the reason for moving it to RAM. If the system reboots, you lose those records and they have to be recreated.

So, it would be easy enough to add an extra step to the variable update (since it is all in RAM, the few nanoseconds of processing should not matter even on the Pi systems): check whether the record exists before updating. If the record does not exist, create it, then update.
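
Roughly, the check-then-update step could look like the sketch below. This is Python against an in-memory SQLite table purely for illustration, not the actual Timekoin code; the column names field_name and field_data and the helper set_status are assumptions.

```python
import sqlite3

def make_status_table(conn):
    # Stand-in for the main_loop_status table; column names are assumed.
    conn.execute(
        "CREATE TABLE main_loop_status ("
        "field_name TEXT PRIMARY KEY, field_data TEXT)"
    )

def set_status(conn, name, value):
    """Update a status record, creating it first if it is missing."""
    with conn:  # commit both steps together
        row = conn.execute(
            "SELECT 1 FROM main_loop_status WHERE field_name = ?", (name,)
        ).fetchone()
        if row is None:
            # Record was lost (reboot, partial delete): create it.
            conn.execute(
                "INSERT INTO main_loop_status (field_name, field_data) "
                "VALUES (?, '')",
                (name,),
            )
        # The record is now guaranteed to exist, so the update always lands.
        conn.execute(
            "UPDATE main_loop_status SET field_data = ? WHERE field_name = ?",
            (value, name),
        )
```

With this in place, a missing "generation_heartbeat_active" record would be recreated the next time the script writes to it instead of being silently skipped.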

It does make sense, especially if it is causing random issues. For example, say the web server itself runs out of RAM and shuts down/restarts. Technically, the RAM still has those variables in main_loop_status because the system did not actually reboot and clear all the RAM. The web server, though, tries to recover, and in the process of restarting everything, all the scripts see that the variables still exist in RAM and assume they were already running, shut down, etc.