Ticket #368 (closed defect: fixed)

Opened 16 years ago

Last modified 16 years ago

Errors during MG setup can 'ruin' an application

Reported by: cfjedimaster Owned by: boomfish
Priority: high Milestone: 3.1 Maintenance Beta
Version: 3.0.178 Severity: blocker
Keywords: Cc:

Description

The summary describes it well, but in general, if something goes wrong during the first hit to a MG app, it will be broken (known events won't work) until you reinit. Dennis Clark discovered the core of the issue and described a fix:

In MG3 the framework is not fully loaded all at once. Instead, the framework is bootstrapped with only the core event handlers, and controllers for those event handlers complete the different tasks required for loading the rest of the framework. One of these tasks (done in the initialization phase) is to put the framework into the application scope.

The problem is that if an exception such as a request timeout occurs after the framework is put into the application scope but before the framework is fully loaded, the framework is left persisted in an incomplete state. The application will then be stuck with a partially loaded framework object until the next init.

My suggestion for the MG team is to move the code that puts the framework object into the application scope from the initialization phase to the termination phase (which is currently commented out in the source). This has the side-effect of reloading the framework on every request until it can complete a request successfully. I would consider this side-effect a bonus feature: if my first request after an init fails I'm likely to want to re-init anyway.

While I'm at it, I have another suggestion for the MG team: remove the cflock from ModelGlue?.cfm, the MG3 framework bootstrapper doesn't need it.

Attachments

ModelGlue_ticket368_lockAndLoad.patch (14.3 kB) - added by boomfish 16 years ago.
Patch against trunk that implements a "lock-and-load" solution to the framework initalization issue
ModelGlue_ticket368_lockAndLoad_V2.patch (16.9 kB) - added by cfgrok 16 years ago.
Modification to original patch to correct additional failure scenario
ModelGlue_ticket368_lockAndLoad_V3.patch (17.2 kB) - added by boomfish 16 years ago.
Updated patch that addresses event generation issues
ModelGlue_ticket368_lockAndLoad_V4.patch (16.9 kB) - added by cfgrok 16 years ago.
Update fixing bugs introduced in V3 patch

Change History

Changed 16 years ago by DanWilson

  • milestone set to 3.1.x-next-patch

Changed 16 years ago by boomfish

  • owner set to boomfish
  • status changed from new to assigned

Changed 16 years ago by boomfish

  • priority changed from normal to high
  • severity changed from normal to blocker

Concerns have been raised that my proposed solution has some downsides, particularly with regards to efficiency/performance.

I agree that my proposal has a negative performance impact and is not a good long-term solution. On the other hand, the current situation is quite serious as it can leave an application completely unusable until someone performs a reinit. We definitely need some kind of fix for the next patch release, even if it's not perfect.

I'll discuss the implementation challenges of fixing this issue with the rest of the Model-Glue team so we can determine the approach we want to take. In the meantime I'll bump up the priority & severity so that it stays on the top of our TODO list for the next patch release.

Changed 16 years ago by boomfish

Patch against trunk that implements a "lock-and-load" solution to the framework initalization issue

Changed 16 years ago by boomfish

I have attached a patch that implements an alternate solution that I call "lock-and-load". This solution should be less susceptible to the dreaded "dogpiling" problem than my original solution.

In the "lock-and-load" solution, the execution of each request phase is divided into two sub-phases: setup/load and execute. The load() method performs phase initialization, and the setup() method uses exclusive locking to ensure that the phase is loaded exactly once before it any requests execute it.

This approach differs from the existing code in that any request can load a phase, not just the initialization request. This way, if an initialization request fails part-way through loading the phases, any other request that follows it can complete the loading process. In fact, another request could potentially "leapfrog" the initialization request during its normal processing and load one of the phases without breaking the request phase semantics.

This doesn't make the initialization perfect. For example, if an initialization request fails part-way through loading a phase the loading is not rolled back, so the next request may load parts of that phase a second time. Fortunately this is not likely to cause serious problems with the application.

Feedback on this solution is welcome.

Changed 16 years ago by cfgrok

This looks to me to be a very well thought out approach, and from my preliminary testing, it gets us most of the way to fixing the reported issue (the possibility of a "partially-loaded" application which will continue to fail until another reinit).

I did turn up another scenario under which the same problem occurs, however, which is due to what I will call a cross-phase dependency:

Currently, the configuration controller accesses the "modelglueReloaded" event value, and will only load the primary module if this value is true. A problem can arise here due to the fact that this value is set in the initialization phase (specifically, in the loadFrameworkIntoScope() method of the initialization controller), so if a request times out after initialization but before configuration, the framework will still be left in a broken state, as none of the application's event-handlers will exist.

For anyone looking to replicate this, I tested it out by adding a line calling the sleep() function to the initialization controller immediately after the framework is stored in the application scope (line 15 in the unpatched version), thus forcing a timeout.

I've gone ahead and modified the patch to correct this by adding a "loaded" flag to the configuration controller and testing for that value instead in order to remove the dependency on the "modelglueReloaded" event value. (I also added a modification from Dan that adds a couple of values to the ModelGlueAbstractTestCase CFC in order to keep the unit tests working.)

I'm attaching the modified patch to the ticket, and would also be quite interested to hear any feedback anyone may have on this issue.

Changed 16 years ago by cfgrok

Modification to original patch to correct additional failure scenario

Changed 16 years ago by boomfish

Hey Ezra,

Thanks for reviewing the code and identifying the other race condition. After reviewing your changes I spent some time trying to work out how the configuration phase works, and I believe I discovered another existing bug related to event generation.

I believe that event generation code in trunk requires that the modelglue.modulesLoaded event be executed on every request, not just on initialization. To generate an event, the event controller has a method that listens for the modelglue.modulesLoaded message. If that method generates a new event, it adds a result that causes the modelglue.readyForModuleLoading event to be queued so that the primary module will be reloaded.

I recall that there was a discussion recently on the list about an event generation bug, and rereading it I see that you created a branch to test out a solution. I haven't looked at your branch yet but it sounds like you took a different approach to the event generation issues.

I have modified the patch to have modelglue.readyForModuleLoading be executed on configuration load time and have modelglue.modulesLoaded be executed on configuration execute time. This should allow event generation to be triggered properly.

I have also removed the checks on the initialization and configuration controllers so that they can be executed multiple times if needed. The new lock/load logic in the request phases will ensure that the primary module will not be reloaded if it has already been loaded and the current request did not generate a new event. Reloading of the primary module will still be performed if the generation controller adds a "configurationInvalid" result.

One of the consequences of these changes is that request._modelglue.bootstrap.initializationRequest is now no longer used. I left it in ModelGlue.cfm in case removing it would break tests or external code, but we may eventually want to remove it completely.

I also noticed that much of the scaffold module (in particular scaffolding.xml and ScaffoldController.cfc) appears to be dead code. I wasted a lot of time looking at those so it'd be nice if those files were either removed or identified as dead in header comments.

Please review the new patch to verify that it still fixes the identified race conditions. It would also be great if we can confirm that it fixes any existing event generation issues (or at least doesn't add any new ones).

Changed 16 years ago by boomfish

Updated patch that addresses event generation issues

Changed 16 years ago by cfgrok

Hi Dennis,

I took a look at the latest patch, and while I noted that this version eliminated some completely unnecessary code I added in the V2 patch, it also introduced two new bugs. I'm therefore attaching a V4 patch which incorporates the correction for my error, and reverts the other two modifications -- here are the details:

First, as far as my mistake goes, after testing the V3 patch and taking another look at the code, it is quite clear that the "loaded" checking I introduced in the configuration controller was entirely pointless. Good call on removing this, as it just added confusion to the entire process.

The first bug that was introduced was in relation to the event generation module, and was caused by moving the execution of the modelglue.modulesLoaded event from the setup() to the execute() method of the configuration phase.

One of the key requirements of the event generation feature is that the generation should only occur on an initialization request. This is by design, as this feature is only intended to be used in development, and ensuring that it will only fire when the application reloads is an additional failsafe that will help to prevent event generation from occurring on production sites if people forget to turn off the generationEnabled switch.

(Also, although this is certainly subjective, I felt that altering the modelglue.modulesLoaded message so that it ran on every request invalidated the semantic meaning of the message, as it would then be running regardless of whether modules were actually loaded or not.)

As this modification broke the specification for event generation, I removed it and reverted to the previous behavior.

As far as the mentioned event generation bug is concerned (ticket #389), I'm afraid that this change wouldn't have made any difference, as that issue arises due to the fact that controller CFCs are instantiated before the generation of a new message listener function takes place. In order to fix that bug, the generation pass would need to execute before primary module loading (see the ticket for a more complete explanation, and also a workaround suggested by Elliott Sprehn that we may wish to implement instead).

The other new bug that was introduced was caused by the removal of the boolean test for request._modelglue.bootstrap.initializationRequest in the loadFrameworkIntoScope() method of the initialization controller. Unfortunately this check is still needed, as if it is removed it creates another scenario whereby the framework can get "stuck" in a partially-loaded state.

The problem that can arise here would be a situation where the initialization process hits an error (a timeout, for example) after the framework is stored in the application scope, but before the setup() method of the initialization phase completes. If this occurs, the next request will fail on the following line in the loadFrameworkIntoScope() method:

<cfset mg = request._modelglue.bootstrap.framework />

The exception thrown is "Element _MODELGLUE.BOOTSTRAP.FRAMEWORK is undefined in REQUEST." -- this occurs because once the framework is loaded into the application scope, the bootstrapper is not reloaded, and thus subsequent requests cannot access it.

I went ahead and restored the previous if statement, as it makes sense to me to check for the request._modelglue.bootstrap.initializationRequest value in this context, given that the next couple of lines reference the framework and bootstrapper in the request scope.

Given those alterations, I would say this change seems to be working as expected, and I too would be very curious to hear the outcome of the load testing pass.

As far as the scaffolding module goes, you are quite right that it was in a defunct state. I note from your comment on ticket #376 that you have found a new use for it, which is helpful as one of the items I'd like to see addressed (ideally in 3.2) is to actually implement scaffold generation as an actual module, replacing the current approach which has some issues of its own.

Changed 16 years ago by cfgrok

Update fixing bugs introduced in V3 patch

Changed 16 years ago by boomfish

Thanks Ezra. It looks like my last revision to the patch did more damage than repair, so thanks for catching all that :-) I figured you had a better grasp of Model-Glue's request processing cycle than I. Your code analysis was quite thorough and I agree with your conclusions.

I feel like we've thoroughly inspected the code for this patch, and further review is unlikely to yield productive results. The best next step is to subject the latest revision of the patch to load testing.

Changed 16 years ago by boomfish

  • status changed from assigned to closed
  • resolution set to fixed

Fixed in SVN commit 261.

Dan completed some initial load testing against the framework with and without the patch (V4). Results showed that the patch significantly reduced the number of errors during startup. The patch also increased response times during startup, but performance differences after startup were within margins of statistical error.

While the results show there is likely room for improvement, Dan gave the nod to commit the patch to trunk so as to close this stability issue.

Note: See TracTickets for help on using tickets.