Friday, 4 October 2019

Things that sometimes occasionally fail or sometimes permanently fail in google apps scripts - the 'art' of writing goole apps scripts . . .

We have a google apps script that takes automatically generated status and software upgrade and test reports and processes them into spreadsheets.

It's cool and great to have this stuff in spreadsheet :)
And be able to use it manually and do automatic things with it.

We have learned over time a bit about quirkiness of scripting for google drive.

* resource limits

* timing of script, . . . crafting run time of script loop, batching up processing of files after we have problems

* writing code to be robust against certain types of failures

* oops I used up all cells in spreadsheet

* it works great for a long time and then . . . moving files to a folder stops working

* it works great for a long time and then . . . call to getOwner.getEmail doesn't work for some files
   (not "at random" but not consistent over owner of file or other things)
  and . . . on occasion, removing a file stops working


BATCHING files processing:

We have a cron/regular scheduled task processing files coming in. 
Every 5 mins.

We don't want to update every page n sheet on every event as that is really slow.
So we read multiple of files in one go and process them in batch.
 And then update sheets.

Originally we read in ALL files . . 
 but if there was service problem(or we caused new behaviour and script error)
  we ended up with multiple of same script running all processing same files
   our spreadsheet would be updated at end of each job and ended up getting corrupt.
To solve this we put time limit in loop in script to try and make sure it doesn't overlap.
Script runs every 5 mins. 
We give it 3 mins to run it's file processing loop.
If service is really really slow this works ok just a few events processed.
If service running ok all events from 10 to 100s are processed.
If events coming in faster than being processed they build up but they eventually end up being processed.
If we had a problem and 1000s 10000s of files built up to be processed then this batch processing works ok gradually cranks through events. Also we have helper functions which process quickly less important events and remove them. AND the processing script is written to be tolerant of files disappearing(from the helper functions).


We tried to solve this by taking a lock and releasing.
But several flavours of locking mechanisms . . just didn't work.


How do we know what files are processed . . . 

Easy, just move them to Processed dir when we are done. :) Yes?
2 stages of processing so move to Processed_Partial and Processed_Done

Myeh, problem:
The simple file moving to Processed dir just stopped working for us at one stage.
So we are living with this behaviour for now. 
As long as script runs all way through without error it works ok.

If something goes wrong in script it ends with failure and we have already removed the files half processed. Oops. 
(so we have to handle any kind of occasional service error, permissions error, resource limiting error)

ORIGINALLY:
      fld.removeFile(childFile);
      fldDone.addFile(childFile);
    }


      fld.removeFile(childFile);
      
      try {
         fldDone.addFile(childFile);
      } catch(e) {
        Logger.log("ERROR moving " + fn + " to Processed");
        console.log("ERROR moving " + fn + " to Processed");
      }
    }

CHANGE TO: add before remove


CHANGE also: try catch around remove (as saw that cause a fail)
although that might not be needed, script continued I think.

I think on this line:
      fld.removeFile(childFile);

noticed because QA xxx event didn't get through

Debug Oct 4, 2019, 6:37:50 AM File Name:
Debug Oct 4, 2019, 6:37:52 AM ERROR moving to Processed


YAY and can see the try catch in yesterday is working:
BUT, interesting, error on removing file ? access denied . . but it didn't cause script to exit with failure . . . until a bit later.

Debug Oct 4, 2019, 6:39:12 AM File Name:
Debug Oct 4, 2019, 6:39:12 AM Cannot get Owner/Email of the file, eh?
Debug Oct 4, 2019, 6:39:12 AM Cannot read the file
Debug Oct 4, 2019, 6:39:12 AM more stuff 
Debug Oct 4, 2019, 6:39:12 AM more updated row 98 . . . 

Error Oct 4, 2019, 6:39:13 AM Access denied: DriveApp.
    at ourEventProcessingFunction(Code:875)


ADDITIONAL quirk: the execution doesn't show up as Failed in the executions list/
It ran @6:37am, it's 2:30pm now and that job and many others showing as Running even though from log our last log entry "done" is in there.



oops we used up all cells in spreadsheet

So when a particular biggest sheet gets very big we:
 * send email with warning
 * make a copy of sheet for reference
 * delete all rows in that sheet
 * send email with successful cleanup message

We used to do this by hand after we received the email warning.


Oct 3 some events were missed, the events processing had failed a few times because call to getOwner.getEmail doesn't work for some files!
   (not "at random" but not consistent over owner of file or other things)
## put try/catch around it, log error, set the email to a sensible default

Project. Apps script code. Function.

Using executions view . . .

Project
ourEventProcessingFunction Time-Driven Oct 3, 2019, 9:22:37 AM 113.192 s
Failed

Project
ourEventProcessingFunction Time-Driven Oct 3, 2019, 9:17:37 AM 114.537 s
Failed

Project
otherEventProcessingFunction Time-Driven Oct 3, 2019, 9:17:27 AM 4.279 s
Completed


Error Oct 3, 2019, 9:19:32 AM No item with the given ID could be found, or you do not have permission to access it.
    at ourEventProcessingFunction(Code:818)

    // sometimes getting permission denied/not found here:
    var user="error+" + fn + "@httpcats.com"
    try {
      user = childFile.getOwner().getEmail().toLowerCase();
    } catch (e) {
      Logger.log("WARNING: Cannot get Owner/Email of the file, eh? "+fn);
      console.log("WARNING: Cannot get Owner/Email of the file, eh? "+fn);
    }
    


On occasion scripts were ended prematurely due to service limits being reached:

: googleapi: Error 403: Rate Limit Exceeded, rateLimitExceeded


$ drive trash -id $g
- 1C2JttNUnEbRceBDinb8HGksdM8qSay6r ()
Deletion count 1 dest: 3.47KB
Proceed with the changes? [Y/n]:
 0 / 3550 [---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------]   0.00%googleapi: Error 403: Insufficient permissions for this file, forbidden
 3550 / 3550 [===================================================================================================================================================================================================================================================] 100.00% 1s




move: 195ZOyMqKQHtoNknHCGdLOmeJrHEy_uG0: already exists. Use `force` flag to override this behaviour

move: 1X5DL54FdRf1cv4qCe_esysypCICA2xNN: dest: '' remote path doesn't exist



3/8/19 9:14 AM  monitorEvents   Service using too much computer time for one day        time-based      3/8/19 9:14 AM
3/8/19 2:14 PM  monitorEvents   There are too many LockService operations against the same script. (line 710, file "Code")      time-based      3/8/19 2:14 PM
3/8/19 10:29 PM monitorEvents   Service using too much computer time for one day        time-based      3/8/19 10:29 PM

Lockservice



The error report emails are useful . . . 

Error Message   Count
Service using too much computer time for one day        2
We're sorry, a server error occurred. Please wait a bit and try again. (line 848, file "Code")  1
Start   Function        Error Message   Trigger End
3/9/19 9:02 AM  ourEventProcessingFunction  Service using too much computer time for one day        time-based      3/9/19 9:02 AM
3/9/19 9:12 AM  ourEventProcessingFunction  Service using too much computer time for one day        time-based      3/9/19 9:12 AM
3/9/19 10:12 PM ourEventProcessingFunction  We're sorry, a server error occurred. Please wait a bit and try again. (line 848, file "Code")  time-based      3/9/19 10:14 PM