My current project requires an orchestration that calls out to one or potentially many WCF services in a sequential loop, with the specific services to call within the loop resolved from the Business Rules Engine based on message context. The orchestration needs to provide guaranteed delivery, with retries built around the WCF service calls to handle: routing failures (since the logical send port is of the direct binding type), SOAP faults returned by the WCF service in question, exhaustion of retries on the send port, and the case where the orchestration doesn’t hear back from the send port within a specified timeout period. The orchestration also needs to perform as quickly as possible and use the minimum amount of machine resources, since it is a high-throughput orchestration that must process many millions of messages per day.
I managed to achieve the above using the following patterns.
- Catch a WCF fault in a BizTalk orchestration (though in my case since the WCF Services I am calling are WSHttp binding based I need to catch a BTS.soap_envelope_1__2.Fault instead of BTS.soap_envelope_1__1.Fault)
- BizTalk orchestration direct binding and routing failures/PersistenceExceptions
- Scenarios Using Long-Running Transactions (specifically scenario 1 to catch timeouts)
- Lastly, I turned on the “enable routing for failed messages” flag on the WCF Service send ports in question, which ensured that rather than suspending after retries were exhausted, an error message would be generated; I would have this routed back to my orchestration instance, handle the exception, and then suspend the orchestration instance.
My orchestration looked a bit like the below after implementing this logic (note that I have cut out a lot of my internal logic in this screenshot; there was a lot more to it, but this gives you the gist of the flow).
While running load tests it occurred to me that by catching timeouts the regular way, with a long-running transactional scope, I was effectively forcing a persistence point every time the scope completed. To make things worse, my scope was nested within two parent scopes for error handling and variable scoping purposes, and I was forced to mark both of these as long running too, since you can’t nest a transactional scope within a non-transactional scope. If there were no orchestration shapes in between the ends of my three scopes then the persistence points would be collapsed into a single persistence point, but my error handling really didn’t allow for this, so I now had a minimum of three persistence points to deal with per orchestration instance. My gut told me this would definitely be causing performance issues and draining my server resources under heavy load, thus constraining my throughput.
An alternative I decided to explore was to use a listen shape instead, with a receive branch to handle response messages and a delay branch to catch timeouts, and to change all my scopes to non-transactional (I wasn’t taking advantage of compensation, so didn’t think I would lose any benefits of using a long-running transactional scope). However, because I was using a logical request-response send port in my orchestration this was not possible, and I encountered the error “incomplete requestresponse; missing receive” described in this forum post, whose poster was trying to achieve exactly the same ends. It appears that the listen/delay timeout-catching pattern does not work with request-response logical send ports in orchestrations.
When sending out a request-response message from my orchestration on a direct bound send port, I noticed that my orchestration had an instance subscription to receive back the response messages based on the BTS.CorrelationToken context property. This led me to believe that using a request-response logical send port in an orchestration automatically generates a GUID value in the BTS.CorrelationToken context property and promotes it, using that value to route the response message back to the orchestration. Physical send ports appear to automatically copy the promoted BTS.CorrelationToken property from the request over to the response or fault messages. I decided to do the same thing, except with a one-way logical port for the send and another one-way logical port for the receive, manually promoting the BTS.CorrelationToken context property myself, thus enabling me to use the listen/delay timeout-catching pattern.
I created a one-way logical send port to replace my request-response send port, and also created a new one-way receive port with an operation whose message matched the response message from the send port. When constructing the request message I created a new GUID and set its value against the BTS.CorrelationToken context property on the message. I created a correlation type containing the BTS.CorrelationToken context property, and a correlation set of that type in my innermost scope, which I initialized on my send shape and followed on my receive shape to force the property to be promoted. I then created a listen shape, moved my receive into the first branch, and created a delay shape in the second branch to catch timeouts, putting my timeout exception handling logic in that branch. I could now safely mark all my scopes as non-transactional, and my orchestration looked like the below.
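The message construction step can be sketched in a Message Assignment shape roughly as follows (a sketch only; msgRequest and msgInternal are hypothetical message names, not the actual ones from my solution):

```
// Message Assignment shape (XLANG/s) -- stamp a fresh GUID into the context
msgRequest = msgInternal;
msgRequest(BTS.CorrelationToken) = System.Guid.NewGuid().ToString();
```

Note that assigning the property here only writes it to the message context; it is the send shape initializing the correlation set containing BTS.CorrelationToken that causes the property to actually be promoted.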
Load testing immediately garnered better results, with transactions per second on a previously baselined application rising from 78 to 90 (a 15% increase in throughput) and CPU utilization on my MessageBox SQL Server dropping massively. I instantly felt vindicated that the extra effort had paid off, but then realized that I had lost the ability to catch SOAP faults and messages indicating that retries on my send port had exhausted, which was not acceptable.
To catch SOAP faults I had to add a new operation to my one-way receive port with a message type of BTS.soap_envelope_1__2.Fault, add a new branch to my listen shape containing a receive shape for a message of the same SOAP fault type linked to the new operation on the receive port, and have this receive shape follow the same correlation set that I initialized on the send. I could then run XPath statements against the SOAP fault message to extract the exception details and handle them accordingly. Since send ports copy the BTS.CorrelationToken context property over to all response messages, including fault messages, this wasn’t too hard to do.
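The XPath extraction can be done in an Expression shape; a minimal sketch, assuming a hypothetical msgFault message of type BTS.soap_envelope_1__2.Fault and a string variable strFaultReason, might look something like:

```
// Expression shape (XLANG/s) -- pull the human-readable reason out of a SOAP 1.2 fault;
// local-name() avoids having to hardcode namespace prefixes
strFaultReason = xpath(msgFault,
    "string(//*[local-name()='Fault']/*[local-name()='Reason']/*[local-name()='Text'])");
```

The exact XPath will depend on the structure of the faults your service returns, so inspect a suspended fault message first and adjust accordingly.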
The failed messages generated when send ports exhausted their retries were a bit trickier to deal with. This is because I could not find any clean way to correlate these failed messages back to my orchestration, since they are of the same message type as the original request message, just with some additional error context properties. (There is one method that could work, as described in this blog post, but I really wanted to avoid having to receive the original request message back into my orchestration instance to support that pattern, as it would add inefficiencies and affect throughput.)
I decided to take advantage of NACK (negative acknowledgment) messages instead (see this blog series if you want more information about generating NACKs on send ports). NACK messages are simply messages of type BTS.soap_envelope_1__1.Fault that do not have a BTS.MessageType context property set against them. They are also only generated by send ports upon retry exhaustion if there is an existing subscription for the NACK (or if you use the orchestration delivery notification functionality or the BTS.AckRequired context property, but those weren’t suitable for my purposes).
I decided to use a loopback send port to subscribe to NACKs off the WCF service send ports (the loopback adapter in question was developed by my friend and colleague Mark Brimble and is proprietary so can’t be shared, though you can find other implementations on the internet), with an XMLReceive receive pipeline to resolve the message type of the NACK, since NACKs don’t have a BTS.MessageType context property by default. An orchestration could have been used instead of a loopback send port, however that would mean I could not adjust the filter properties at runtime, which wasn’t flexible enough for my purposes. See an example of filter properties on the loopback send port to have my WCF service send port generate NACKs.
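As an illustration of the shape such a filter takes (the send port name here is hypothetical), the loopback send port can subscribe on the acknowledgment context properties that NACKs carry, which identify the send port whose retries were exhausted:

```
BTS.AckType == NACK
And BTS.AckSendPortName == WcfServiceSendPort
```

Having a subscription of this form in place is what causes the WCF service send port to publish a NACK on retry exhaustion in the first place.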
I then added another operation to my logical one-way receive port in the orchestration with a message type of BTS.soap_envelope_1__1.Fault, and added a new branch to my listen shape with a receive shape for a message of that type, linked to the aforementioned operation and following the correlation set initialized on the send of the WCF service request message. I could then run XPath statements against the received NACK message to extract fault details and handle the exception appropriately. The orchestration now looks like the below.
The one catch now is that I need to discard the failed messages generated as a result of the “routing for failed messages” flag being enabled on the send ports, as they will be created in addition to the NACKs. I still need this flag enabled because my solution calls for error handling for this specific message flow to be done from within the orchestration rather than from send ports (for guaranteed delivery purposes), and I want to ensure that send port instances do not remain suspended after retries get exhausted. This does result in extra unnecessary messaging when retry exhaustion occurs, but that is expected to be the exception rather than the norm and was deemed acceptable for this solution. The same justification applies to the addition of the loopback send port.
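Discarding these failed messages just needs a dedicated subscriber (for example a send port bound to a null adapter) filtering on the error-report context properties; a sketch of such a filter, again with a hypothetical send port name:

```
ErrorReport.ErrorType == FailedMessage
And ErrorReport.SendPortName == WcfServiceSendPort
```

Without some subscriber like this, the failed messages would be suspended as non-resumable instances with no matching subscription.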
Something else to keep in mind is that if the web service you’re calling is not based on the WSHttp binding (or an equivalent WS*-based binding), then the fault you will need to catch will most likely be of type BTS.soap_envelope_1__1.Fault, which is the same type as NACK messages. In this case you would have to consolidate the listen branches for the SOAP fault and the NACKs into a single branch and inspect the message to find out whether it is a SOAP fault or a NACK before dealing with it appropriately.
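One way to tell the two apart in the consolidated branch is to check for the acknowledgment context properties that only NACKs carry; a sketch of the condition for a Decide shape rule, with msgFault as a hypothetical message name:

```
// Decide shape rule (XLANG/s) -- NACKs carry BTS.AckType, plain SOAP 1.1 faults do not
(BTS.AckType exists msgFault) && (msgFault(BTS.AckType) == "NACK")
```

If the condition evaluates true you are looking at a NACK; otherwise treat the message as an ordinary SOAP fault from the service.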
There is no question that this solution is more complicated than using a request-response logical send port in your orchestration in combination with a long-running transactional scope to catch timeouts, and it adds more components for future developers to wrap their heads around as well as making life more complicated for support people. However, if throughput is of the utmost importance to you and every message per second processed by your BizTalk application makes a world of difference, then this might just be the solution you need.