At times when starting up Shareplex, the following error is observed:
# ./sp_cop &
[1] 18256
# ./sp_ctrl
Your tcp port is not set properly or 'sp_cop' is not running
Attempted to connect to sp_cop on port 8516
sp_ctrl > exit
[1] + Bus Error ./sp_cop &
#
This error can have several causes, but it is typically memory related. The workarounds described below are listed from easiest to hardest:
1. If you find any error logged in event_log during the failed startup, resolve it and then attempt another Shareplex restart.
2. Make sure that no Shareplex processes are running by issuing "ps -ef|grep sp_". Then check for any hung semaphores or shared memory segments and remove them if they exist. See SOL6531 and SOL2663 for details on how to remove them. Then restart Shareplex.
3. If other Shareplex instances are running on this server, shut them down and then try to restart the problem instance. The other Shareplex instances can be started again after this workaround, regardless of whether it succeeds.
4. Run "fixup all" using the qview utility as shown here, then attempt to start Shareplex:
$ qview -i
qview>fixup all
qview>exit
5. Try a fresh install of the same version of Shareplex, specifying a new proddir and a new vardir. Once that installation starts successfully, start Shareplex with the new proddir and the old vardir to see whether the problem instance now starts. This does not change the existing replication, because the new proddir holds the same binaries in a different location.
6. Bounce the server if feasible. Then restart sp_cop.
7. Rename the following Shareplex internal files in the $SP_SYS_VARDIR/data folder:
   1. dirty
   2. services
Rename the dirty and services files even if they are zero bytes in size.
Try to start sp_cop. If sp_cop still dumps core, rename dirty, services, and statusdb.lck.
If sp_cop still dumps core, rename dirty, services, statusdb.lck, and statusdb.
NOTE: Renaming statusdb might result in losing the activation. Make a copy of the statusdb file first. After restarting sp_cop, copy the activation line from the saved copy into the newly created statusdb.
If sp_cop still dumps core, remove any hung shared memory segments and then rename dirty, services, statusdb.lck, and statusdb.
8. If you can afford to clean up the existing replication environment, then run ora_cleansp on all nodes for this port number and then attempt restart of Shareplex. This will require activation and resync before replication can resume and hence is not desirable in a production environment unless used as a last resort.
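The check in workaround 2 can be sketched as a small shell helper. This is only an illustration and does not replace the procedures in SOL6531 and SOL2663: the function name ipc_owned_by and the OS account "splex" are assumptions, and because the column layout of ipcs differs between platforms, the filter simply matches the owner name in any column.

```shell
#!/bin/sh
# Hypothetical helper: print only the ipcs lines that contain the given
# OS user name as a whole field. Reads ipcs output on stdin.
ipc_owned_by() {
    awk -v owner="$1" '{
        for (i = 1; i <= NF; i++) {
            if ($i == owner) { print; break }
        }
    }'
}

# Typical use (assumes Shareplex runs as the OS user "splex"):
#   ps -ef | grep sp_              # confirm no Shareplex processes remain
#   ipcs -s | ipc_owned_by splex   # leftover semaphore sets
#   ipcs -m | ipc_owned_by splex   # leftover shared memory segments
# Remove an entry only after confirming it belongs to the stopped instance:
#   ipcrm -s <semid>
#   ipcrm -m <shmid>
```

Filtering on the owner rather than on a fixed column keeps the sketch usable on both Linux and Solaris style ipcs output, at the cost of also matching a group that happens to share the same name.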
If despite all these efforts you cannot start Shareplex, then you may want to contact Support for further troubleshooting.
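The renaming in workaround 7 can be scripted so that each attempt is repeatable. The following is a minimal sketch, assuming sp_cop is stopped and SP_SYS_VARDIR is set in the environment; the function name rename_sp_files and the "bak" suffix are hypothetical choices, not part of Shareplex.

```shell
#!/bin/sh
# Hypothetical helper: rename the named files in a data directory by
# appending a suffix, skipping any that are absent. Zero-byte files are
# renamed as well, because -e only tests for existence.
# Usage: rename_sp_files <data_dir> <suffix> <file>...
rename_sp_files() {
    dir=$1
    suffix=$2
    shift 2
    for name in "$@"; do
        if [ -e "$dir/$name" ]; then
            mv "$dir/$name" "$dir/$name.$suffix" && echo "renamed $name"
        fi
    done
}

# First attempt:
#   rename_sp_files "$SP_SYS_VARDIR/data" bak dirty services
# If sp_cop still dumps core, extend the list on the next attempt:
#   rename_sp_files "$SP_SYS_VARDIR/data" bak dirty services statusdb.lck
```

Renaming instead of deleting keeps the original files available, which matters for statusdb because its activation line may need to be copied back.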
Additional Information:
The following excerpt from Wikipedia sheds some light on Bus Error:
In computing, a bus error is generally an attempt to access memory that the CPU cannot physically address. Bus errors can also be caused by any general device fault that the computer detects. A bus error rarely means that computer hardware is physically broken - it is sometimes caused by a bug in a program's source code.
There are two main causes of bus errors:
non-existent address:
The CPU is instructed by software to read or write a specific physical memory address. Accordingly, the CPU sets this physical address on its address bus and requests all other hardware connected to the CPU to respond with the results, if they answer for this specific address. If no other hardware responds, the CPU raises an exception, stating that the requested physical address is unrecognised by the whole computer system. Note that this only covers physical memory addresses. When software tries to access an undefined virtual memory address, that is generally considered to be a segmentation fault rather than a bus error, though if the MMU is separate, the processor can't tell the difference.
unaligned access:
Most CPUs are byte-addressable, where each unique memory address refers to an 8-bit byte. Most CPUs can access individual bytes from each memory address, but they generally cannot access larger units (16 bits, 32 bits, 64 bits and so on) without these units being "aligned" to a specific boundary, such as 16 bits (addresses 0, 2, 4 can be accessed, addresses from 1, 3, 5, are unaligned) or 32 bits (0, 4, 8, 12 are aligned, all addresses in-between are unaligned). Attempting to access a value larger than a byte at an unaligned address can cause a bus error.
CPUs generally access data at the full width of their data bus at all times. To address bytes, they access memory at the full width of their data bus, then mask and shift to address the individual byte. This is inefficient, but tolerated as it is an essential feature for most software, especially string-processing. Unlike bytes, larger units can span two aligned addresses and would thus require more than one fetch on the data bus. It is possible for CPUs to support this, but this functionality is rarely required directly at the machine code level, thus CPU designers normally avoid implementing it and instead issue bus errors for unaligned memory access.
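The alignment rule in the excerpt above comes down to a remainder test: an access of a given width is aligned when the address is a multiple of that width. The following sketch illustrates the arithmetic only; the function name is_aligned is hypothetical, and the script does not reproduce or diagnose a real bus error.

```shell
#!/bin/sh
# Hypothetical helper: succeed if address $1 is a multiple of the access
# width $2 (in bytes), i.e. the access is naturally aligned.
is_aligned() {
    [ $(( $1 % $2 )) -eq 0 ]
}

# Classify a few addresses for a 32-bit (4-byte) access.
for addr in 0 1 2 3 4; do
    if is_aligned "$addr" 4; then
        echo "address $addr: aligned for a 32-bit access"
    else
        echo "address $addr: unaligned for a 32-bit access"
    fi
done
```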