CursorNotFound after some time running large process

Hi,
I’m running a sharded cluster and recently upgraded MongoDB from 3.0 to 4.2. Some programs that previously ran without errors now raise a CursorNotFound error in loops like:

for rec in aggregation_query:
   process(rec)

where process can be quite time-consuming and aggregation_query can return more than 10,000 documents.
The errors occur about 65 minutes after the job starts (well below the cursor timeout I configured, see below).

I don’t use explicit sessions.

The server parameters cursorTimeoutMillis and localLogicalSessionTimeoutMinutes have already been raised to the equivalent of 2 hours (7200000 and 120 respectively).

How can I get more information (which systemLog component should I set to debug level)?
Any idea how to solve this?

The messages look like:
CursorNotFound: Cursor not found (namespace: 'my_db.my_collection', id: 5454438319793971081)
In which log (mongod or mongos) can I find this id?

Context: MongoDB 4.2.3, PyMongo 3.10.1

Hi @RemiJ, welcome!

That seems like quite a long time to iterate. Is it possible to refactor the application code to reduce the processing time per iteration? Perhaps use $out to write the results to a temporary collection and spawn multiple processes to work through it.
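A minimal sketch of the $out idea in Python, assuming PyMongo. The helper name with_out_stage and the collection name tmp_results are hypothetical and not taken from the original program. The helper only builds the pipeline; running it needs a live deployment.

```python
def with_out_stage(pipeline, temp_collection):
    """Return a copy of pipeline that writes its results to temp_collection."""
    if any("$out" in stage for stage in pipeline):
        raise ValueError("pipeline already has an $out stage")
    # $out must be the last stage of an aggregation pipeline.
    return list(pipeline) + [{"$out": temp_collection}]


# Hypothetical usage (all names are placeholders):
# from pymongo import MongoClient
# coll = MongoClient().my_db.my_collection
# coll.aggregate(with_out_stage(pipeline, "tmp_results"))
# Worker processes can then each read part of my_db.tmp_results
# with their own short-lived find() cursors.
```

The aggregation then finishes on the server without a long-lived client cursor, and the slow processing step reads from the temporary collection instead.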

A cursor timeout is one of the possible reasons why the cursor can no longer be found. Could you check that the options were set correctly? If processing one batch of a cursor takes longer than the default cursor timeout of 10 minutes, the server deems the cursor idle and closes it.
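To illustrate the point about batches: the cursor counts as active each time the driver fetches another batch, so the time spent processing one batch has to stay under the timeout. A rough sketch of that arithmetic follows; the function name, the 4 seconds per document, and the safety factor are all assumptions for illustration.

```python
import math


def max_batch_size(cursor_timeout_ms, seconds_per_doc, safety=0.5):
    """Largest batch that can be processed within a fraction of the cursor timeout."""
    if seconds_per_doc <= 0:
        raise ValueError("seconds_per_doc must be positive")
    # Only spend a fraction (safety) of the timeout on one batch.
    budget_s = cursor_timeout_ms / 1000 * safety
    return max(1, math.floor(budget_s / seconds_per_doc))


# Default timeout of 10 minutes (600000 ms), assumed 4 s per document:
n = max_batch_size(600_000, 4)  # 75

# With PyMongo the value could then be passed as
# collection.aggregate(pipeline, batchSize=n)
```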

Could you check the logs for anything happening on the shards or config servers (e.g. a replica set election) around the 65-minute mark?

Regards,
Wan.

Hi @wan,

I had a look at the logs. The error actually comes from the session handler, which kills the session after 30 minutes rather than the 2 hours specified by localLogicalSessionTimeoutMinutes=120. All cursors attached to the session are killed at the same time.
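A timeout of 30 minutes matches the default value of localLogicalSessionTimeoutMinutes, so one thing that may be worth checking is whether the raised value is actually in effect on every mongos and mongod. A small sketch using the getParameter command through PyMongo; the helper name and the host names are hypothetical.

```python
def read_timeout_parameters(client):
    """Return the two timeout parameters as reported by the node the client talks to."""
    reply = client.admin.command(
        "getParameter", 1,
        cursorTimeoutMillis=1,
        localLogicalSessionTimeoutMinutes=1,
    )
    names = ("cursorTimeoutMillis", "localLogicalSessionTimeoutMinutes")
    return {name: reply.get(name) for name in names}


# Hypothetical usage, connecting to each node directly
# (host names are placeholders):
# from pymongo import MongoClient
# for host in ["mongos1:27017", "shard1a:27018", "cfg1:27019"]:
#     print(host, read_timeout_parameters(MongoClient(host)))
```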

I updated my program to refresh its sessions every 10 minutes so that they don’t expire before processing ends.
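For readers who want to try the same approach, here is a sketch of a periodic refresh. It is not the original program: it assumes an explicit PyMongo session (so that its id is available) and uses the refreshSessions command. The function name and the clock parameter are made up for illustration.

```python
import time


def iterate_with_session_refresh(client, session, cursor, process,
                                 refresh_interval_s=600, clock=time.monotonic):
    """Call process(doc) for every doc, refreshing the session periodically."""
    last_refresh = clock()
    for doc in cursor:
        process(doc)
        if clock() - last_refresh >= refresh_interval_s:
            # refreshSessions updates the session's last-use time on the server.
            client.admin.command({"refreshSessions": [session.session_id]})
            last_refresh = clock()


# Hypothetical usage (requires a running deployment):
# from pymongo import MongoClient
# client = MongoClient()
# with client.start_session() as session:
#     cursor = client.my_db.my_collection.aggregate(pipeline, session=session)
#     iterate_with_session_refresh(client, session, cursor, process)
```

The refresh is only checked between documents, so a single process call that runs longer than the session timeout would still let the session expire.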

But now, when the cursor in the main “for” loop comes from a find rather than an aggregate, I still get cursor timeouts due to sharding (the cursor stays active on one shard but expires on the others).