Post stops due to error as below:
Info 2009-07-06 11:37:53.291360 1753160 1 Command server launched, pid = 1753160 (connecting from servername)
Error 2009-07-06 14:46:49.827765 1946030 4178 Poster: 15036 - invalid session# 0 (posting from SOURCE_SID, queue queue_name, to TARGET_SID) [module opo]
Error 2009-07-06 14:46:49.830413 1946030 4178 Poster stopped: Internal error encountered; cannot continue (posting from SOURCE_SID, queue queue_name, to TARGET_SID)
Info 2009-07-06 14:46:50.012822 1397420 1 Poster exited with code=1, pid = 1946030 (posting from SOURCE_SID, queue queue_name, to TARGET_SID)
Info 2009-07-07 10:59:17.655240 1777876 1 Command server launched, pid = 1777876 (connecting from servername)
There may be a corrupted message in Post queue having a subque # 0 in it.
The Post has non-zero subque numbers unlike Capture and Export queues that only have subque 0. If Post queue does have a message with subque # 0, then it is a corrupted message and will need to be removed (discarded) before Post can run. To ascertain and fix this, the following steps need to be carried out:
A. Set debug as:
1. Rename all opo log (the Post log) once Post is stopped due to error so as to generate fresh one.
2. sp_ctrl> set param SP_OPO_DEBUG_FLAG 0x0f0021ff
3. start Post.
4. Once the post errors out, turn off the debug:
sp_ctrl> reset param SP_OPO_DEBUG_FLAG
5. Rename the latest *opo* log so as to preserve it prior to starting Post.
B. View the latest *opo* log from bottom up and search for the string odr_trans and see if it contains the string '(0)' in it. The following portion of *opo* log does have this string. Also scroll a few lines below and look for the value against odr_obj. This value represents the object id of the replicating table in source database. In our example it happens to be 41420:
opo 2009-07-07 14:56:10.141185 413808 2057 2057: odr_op ODR_DDL_AT (29)
opo 2009-07-07 14:56:10.141209 413808 1 AddToRBList: args session=2 addrFrom=-1 addrTo=-1
opo 2009-07-07 14:56:10.141229 413808 2057 2057: odr_trans 0(-1).863-0(0)
opo 2009-07-07 14:56:10.141247 413808 1 AddToRBList: return - add -1 - -1 to rb_addr.addr_pair[0]
opo 2009-07-07 14:56:10.141271 413808 2057 2057: odr_time 07/06/09 14:50:28 (691512628)
opo 2009-07-07 14:56:10.141293 413808 1 FrbRemoveFromCache: args rmsg=30093158
opo 2009-07-07 14:56:10.141313 413808 2057 2057: odr_sid source_SID
.
.
opo 2009-07-07 14:56:10.141476 413808 2057 2057: odr_obj 41420
C. Now search from this part of *opo* log backwards and look for occurrence of string 'seq=' and note down the number next to it. In our example below, it is 3447052:
opo 2009-07-07 14:56:10.140915 413808 2057 CreateSQL: args msg=326ea900 seq=3447052 mid=674228008 subqno=0 rs_ln=860 op_subtype=0 msgLen=326ea890
D. Having determined the subque # and sequence # of the corrupted message, it can easily be removed as below (the example uses values derived in above para and may be different in customer's environment):
1. shutdown Shareplex
2. Discard the message in qview as:
Invoke the qview utility located in /proddir/bin directory as:
/proddir/bin ./qview -i
qview>qsetup
qview>open p w (in case of named Post queues, it prompts for the queue name, open the problem named Post queue)
qview>discard 0 3447052 (the syntax of discard is "discard subque# sequence#)
qview>exit
E. Then restart Shareplex and Post.
F. At a later time, sync the table whose object id on source was gathered from *opo* log by first synchronizing its structure and then its data. For the latter, use compare with repair option or any other suitable means.
If in doubt, call Support so that they can help with removal of corrupted message as well as suggestions about sync of the affected table.