G120E Ethernet - long delays, buffer underrun error

I am infrequently, but consistently, experiencing some long delays in communicating with the G120E over its built in Ethernet adapter. I am using the 2016 SDK R1.

We have a product using the G120E SoM with a requirement that the user must be able to query it for a small amount of information a few times per second. A customer is reporting (and I have reproduced on my own) that he receives most of the responses promptly (10s up to a couple hundred milliseconds). However, occasionally there is a very long delay in receiving the response from our device. The delay can be seconds, 10s of seconds, or even a minute!

I have done a lot of debugging on my own prior to writing this post. What I have found out is that, when one of these long delays occurs, our software receives the request from the customer and sends the response fine. However, the sending of the response fails one or more times, and the long delay he observes is the delay until the response packet is successfully retransmitted by the G120E.

As far as I can tell, the problem seems to be a buffer underrun when the G120E sends the response.

I will post a simple program in order for you to reproduce the problem on the G120E, along with a companion program for PC that sends repeated requests to it.

Has anyone else seen this? I feel like I have found some similar posts about networking delays in these forums, but nothing with a resolution or discussing this particular underrun error.

For example, in Networking stack performance alexs says:

“…even a simple ping works weird, sometimes i’m getting more than a second response time or even timeouts.”

Posts like this might be due to the same problem.

This is as far as I can get in debugging without looking into the framework and device drivers. Can someone with this access and knowledge look into this and see if it can be fixed? It is causing major problems for me.

Thank you in advance for any help that GHI or the community can provide!

.NET MF code:

using System;
using Microsoft.SPOT;
using GHI.Networking;
using System.Threading;
using System.Net;
using System.Net.Sockets;
using System.Text;
using GHI.Processor;
using System.Diagnostics;

namespace G120E_Ethernet_Test
{
    public class Program
    {
        public static void Main()
        {
            // Open ethernet interface
            EthernetBuiltIn eth = new EthernetBuiltIn();
            eth.Open();
            eth.EnableDhcp();
            eth.EnableDynamicDns();

            // Setup registers that we will look at later
            uint base_address = 0x20084000;
            Register IntStatus = new Register(base_address + 0xFE0);
            Register TxDescriptor = new Register(base_address + 0x11C);
            Register TxStatus = new Register(base_address + 0x120);
            Register TxProduceIndex = new Register(base_address + 0x128);
            Register TxConsumeIndex = new Register(base_address + 0x12C);
            Register TxDescriptorNumber = new Register(base_address + 0x124);
            uint N_Tx_descriptors = TxDescriptorNumber.Value + 1;  // TXDESCRIPTORNUMBER uses minus 1 encoding
            Register[] Tx_StatusInfo = new Register[N_Tx_descriptors];
            for (uint i = 0; i < N_Tx_descriptors; i++)
            {
                Tx_StatusInfo[i] = new Register(TxStatus.Value + i * 4);  // Every 4 bytes, starting at value of TXSTATUS
            }

            // Wait for DHCP address
            while (eth.IPAddress == "0.0.0.0")
            {
                Debug.Print("Waiting for DHCP");
                Thread.Sleep(250);
            }
            Debug.Print(DateTime.Now.Ticks + "\tUsing IP address " + eth.IPAddress);

            // Listen for connections on TCP port 23
            Socket listen = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            listen.Bind(new IPEndPoint(IPAddress.Any, 23));
            listen.Listen(1);
            Socket conversation = listen.Accept();
            Debug.Print(DateTime.Now.Ticks + "\tConnection from " + conversation.RemoteEndPoint);

            // Conversation
            conversation.ReceiveTimeout = 1000;
            int N_bytes;
            byte[] receive_buffer = new byte[80];
            byte[] send_buffer;
            uint IntStatus_value;
            uint TxProduceIndex_value;
            uint TxConsumeIndex_value;
            uint[] Tx_StatusInfo_value = new uint[N_Tx_descriptors];
            while (true)
            {
                try
                {
                    // Receive command from PC
                    N_bytes = conversation.Receive(receive_buffer);
                    string text = new String(Encoding.UTF8.GetChars(receive_buffer, 0, N_bytes));
                    Debug.Print(DateTime.Now.Ticks + "\tReceived \"" + text.Trim() + "\"");

                    // Send response to PC
                    text = "Response to " + text;
                    send_buffer = Encoding.UTF8.GetBytes(text);
                    conversation.Send(send_buffer);
                    Debug.Print(DateTime.Now.Ticks + "\tSent \"" + text.Trim() + "\"");
                }
                catch (SocketException x)
                {
                    if (x.ErrorCode == 10060)
                    {
                        // Quickly get the values of the registers
                        IntStatus_value = IntStatus.Value;
                        TxProduceIndex_value = TxProduceIndex.Value;
                        TxConsumeIndex_value = TxConsumeIndex.Value;
                        for (int i = 0; i < N_Tx_descriptors; i++)
                        {
                            Tx_StatusInfo_value[i] = Tx_StatusInfo[i].Value;
                        }

                        // Print out a message to console
                        string message = DateTime.Now.Ticks + "\tTimeout receiving next command" +
                            "\r\n\tIntStatus\t" + IntStatus_value +
                            "\r\n\tTxConsumeIndex\t"+ TxConsumeIndex_value;
                        for (int i = 0; i < N_Tx_descriptors; i++)
                        {
                            message += "\r\n\tTx_StatusInfo[" + i + "]\t" + Tx_StatusInfo_value[i];
                        }
                        Debug.Print(message);
                        ;  // Can set breakpoint here for further investigation
                    }
                    else
                    {
                        Debug.Print(DateTime.Now.Ticks + "\t" + x.Message);
                    }
                }
                catch (Exception x)
                {
                    Debug.Print(DateTime.Now.Ticks + "\t" + x.Message);
                }
            }
        }
    }
}

PC code:

using System;
using System.IO;
using System.Net;
using System.Net.Sockets;

namespace G120_Ethernet_Test_PC
{
    class Program
    {
        static void Main(string[] args)
        {
            // Connect to G120E
            IPAddress remote_ip = new IPAddress(new byte[] { 192, 168, 0, 3 });
            IPEndPoint remote_endpoint = new IPEndPoint(remote_ip, 23);
            TcpClient client = new TcpClient();
            client.Connect(remote_endpoint);
            Console.WriteLine(DateTime.Now.Ticks + "\tConnected to " + remote_endpoint);

            // Network stream
            NetworkStream stream = client.GetStream();
            StreamWriter Out = new StreamWriter(stream) { AutoFlush = true };
            StreamReader In = new StreamReader(stream);

            // Conversation
            for (int i = 0; i < 1000; i++)
            {
                // Send request
                string text = "Request " + i;
                Out.WriteLine(text);
                Console.WriteLine(DateTime.Now.Ticks + "\tSent \"" + text + "\"");

                // Receive response
                text = In.ReadLine();
                Console.WriteLine(DateTime.Now.Ticks + "\tReceived \"" + text + "\"");
            }
        }
    }
}

When the issue occurs, the G120E will have sent the response and be waiting for the next request. The PC will still be waiting for the previous response before sending the next request.

The output from the programs will look something like this (I added the time deltas in parenthesis in post-processing):

PC:

636465214035412000	Sent "Request 480"	(+ 0 ms)
636465214038490000	Received "Response to Request 480"	(+ 308 ms)
636465214038490000	Sent "Request 481"	(+ 0 ms)
636465214038605000	Received "Response to Request 481"	(+ 12 ms)
636465214038605000	Sent "Request 482"	(+ 0 ms)
<timeout messages, see below>
636465214078911000	Received "Response to Request 482"	(+ 4,031 ms)
636465214078911000	Sent "Request 483"	(+ 0 ms)
636465214078986000	Received "Response to Request 483"	(+ 8 ms)
636465214078986000	Sent "Request 484"	(+ 0 ms)
636465214082069000	Received "Response to Request 484"	(+ 308 ms)

G120E:

129513680755275000	Received "Request 480"	  (+ 303 ms)
129513680755324000	Sent "Response to Request 480"	  (+ 5 ms)
129513680755395000	Received "Request 481"	  (+ 7 ms)
129513680755442000	Sent "Response to Request 481"	  (+ 5 ms)
129513680755513000	Received "Request 482"	  (+ 7 ms)
129513680755562000	Sent "Response to Request 482"	  (+ 5 ms)
129513680795774000	Received "Request 483"	  (+ 4,021 ms)
129513680795823000	Sent "Response to Request 483"	  (+ 5 ms)
129513680798856000	Received "Request 484"	  (+ 303 ms)
129513680798905000	Sent "Response to Request 484"	  (+ 5 ms)

Wireshark:

Relative Time	Source	Destination	Telnet
33.349484	192.168.0.2	192.168.0.3	Request 480\\r\\n	(+ 0 ms)
33.657255	192.168.0.3	192.168.0.2	Response to Request 480\\r\\n	(+ 308 ms)
33.657354	192.168.0.2	192.168.0.3	Request 481\\r\\n	(+ 0 ms)
33.669035	192.168.0.3	192.168.0.2	Response to Request 481\\r\\n	(+ 12 ms)
33.669096	192.168.0.2	192.168.0.3	Request 482\\r\\n	(+ 0 ms)
37.699325	192.168.0.3	192.168.0.2	Response to Request 482\\r\\n	(+ 4,030 ms)
37.699451	192.168.0.2	192.168.0.3	Request 483\\r\\n	(+ 0 ms)
37.706969	192.168.0.3	192.168.0.2	Response to Request 483\\r\\n	(+ 8 ms)
37.70705	192.168.0.2	192.168.0.3	Request 484\\r\\n	(+ 0 ms)
38.015173	192.168.0.3	192.168.0.2	Response to Request 484\\r\\n	(+ 308 ms)

In this case, it took the G120E only 5 ms to send response #482, but over 4 seconds for it to appear on the network and for the PC program to receive it.

When the long delay happened, the G120E additionally wrote out the following register values:

129513680767124827	Timeout receiving next command
	IntStatus	98
	TxConsumeIndex	0
	Tx_StatusInfo[0]	0
	Tx_StatusInfo[1]	0
	Tx_StatusInfo[2]	2684354560

A value of 98 in the IntStatus register means that the TXERRORINT bit (bit 5) is set, meaning “Interrupt trigger on transmit errors: LateCollision, ExcessiveCollision and ExcessiveDefer, NoDescriptor or Underrun.” (see Table 189 on page 236 of the LPC1788 User Manual, UM10470, Rev. 3.1 — 15 September 2014)

The last Tx descriptor to be sent was #2, and it TxConsumeIndex wrapped around to the current value of 0. The StatusInfo value for Tx descriptor 2 was 2684354560, meaning that UNDERRUN (bit 29) and ERROR (bit 31) are set. (see Table 202 on page 245)

The UNDERRUN means “A Tx underrun occurred due to the adapter not producing transmit data.”

ERROR is simply a logical or of several error conditions, including the underrun.

try to add a Thread.Delay(20) after the call conversation.Send(…) at the G120E side

That doesn’t help. Also, the code I posted is just to reproduce the problem as quickly as possible. In my actual software there is already a 250 ms delay between incoming requests.

I’d love to hear whether anyone at GHI has read this, has any thoughts, or is looking into it.

These aren’t the solution to the root issue that I’m looking for, but if it helps in the debugging process - I do not see this issue if I:

(1) Run the Ethernet PHY (TI DP83848) on the G120E at 10 Mbps and set the SUPP register on the LPC1788 to 0.

(2) Use UDP instead of TCP.

Workaround #1 creates flakiness with IP addressing. DHCP sometimes works and mostly doesn’t. Static addresses are seemingly set, but I cannot ping the IP address or connect to it with my application. I’d like to hear a solution to this as well.

Workaround #2 just doesn’t fit with my customers application, so I’d really love to see the root issue fixed.

Thanks!

We have seen it but unfortunately I am not sure I have a quick answer myself. We can take this under our consulting services if this helps.

serious ? is this GHI 2.0 to demand money to help customers with issues they have with your product ?

We are always here to help. In this instance we do not have a direct answer and we may need to debug your own code, which is not free.

You can try to contain a very simple demo that shows the problem on our and and we would gladly take a look, for free of course.

Either way, if you are a commercial customer I suggest you work with us directly. GHI can provide a lot more if we understand your need.

How often did you see this issue, please?

I run exactly your code without modification (1000 msg), looks like no error.

I changed to 30,000 msg and could not see any delay longer than 1 second.

Did you run with/without VS attach? We want more information about how you reproduce the issue.

Biggest value I got is 511ms as attachment below.

Thanks for looking at it. I’m pretty sure that it’s been with and without the debugger. The actual product is a G120E SoM, not running in the debugger. The test program, I was running mostly on a FEZ Spider II in the debugger and probably without as well.

I will try to give you a number on the frequency of the issue that I see.

Can you try running it again, or for more than 1000 messages?

yes, about more than an hour I got longest value 3 secs. But there was no exception, probably 3 secs is still short.

I ran 10,000 trials. I have included a histogram of the response times, and a plot of the cumulative distribution function.

Out of 10,000 requests, I see 99 that take longer than 1 second, 26 that take longer than 10 seconds, 5 over 11 seconds, and 4 over 1 minute. The longest was 65 seconds.

When going for maximum response, the Nagle algorithm should be turned off.

MF socket option

NoDelay A value that disables the Nagle algorithm for send coalescing

Not sure what the delay value for Nagle is on MF. On PC I think it is 250ms.

Also receive timeout is not being handled on MF, leading to sending on timeout. Timeout might result in funny sequence?

While we were able to see longer response times the longer we let it run, though not yet 10 seconds, we don’t have any more information at this time.

@Mike - Thanks. Yes, playing with socket options was the first thing I tried - NoDelay, BsdUrgent, Expedited, SendBuffer, SendLowWater. None of these worked. It was also time consuming because they are not very well documented and it was just pretty much trial and error. If I missed some magical combination of options that you know works (maybe using the code I posted to reproduce the error), let me know.

@John_Brochue - Hmm. I don’t know why you don’t see quite the same magnitude of delays. Two different people here have reproduced it (with FEZ Spider IIs and G120Es, our real application and the simple test program above), and of course our customer who first complained has also. But even a few seconds is bad, which you have seen right?

One of the things stopping me from figuring it out is that I don’t have access to what is going on at the lower levels. Is there any debugging that you can do on your end?

Or is there anything you can share with me that would help me to go further with my own debugging? I have the Porting Kit. Is there anything there that would help? Or would you share anything specific to the G120E?

Thanks!

A lot of this is very environment dependent. If you can narrow down in your environment what is causing this then we can try to reproduce that here. Ideally get it down to a single router just connected to the board and a PC that doesn’t have a lot of other network stuff running on it. Then see if the slow down can be isolated to specific cables, lengths, routers, or PCs.

@John_Brochue - I have done that. My usual setup for testing the delay (from which I plotted the data above) is my computer and a G120E/Ethernet J11 connected to a Linksys router with nothing else attached to it. I have also connected the G120E directly to my PC’s Ethernet port, i.e. with no switch, and to a building network with a Cisco 40-port managed switch.

Another software developer here, our technicians in our laboratory, and our customer at their facility have done their own tests in similarly diverse conditions.

I have narrowed it down (as far as I am able with the information available to me in my managed application + Wireshark) to the buffer underrun error, as indicated in the Tx_StatusInfo register. When this occurs, the frame has to be retransmitted, causing the delay in the PC receiving a response to its request. It may take multiple attempts, with each one waiting longer than the previous attempt (2x ?), which is why the delay time varies and can grow rather large.

Can you send the Wireshark capture to support at ghi…, reference my post here so it gets routed to me.

To try and narrow down the issue a bit more, can you try this with the ENC28 Ethernet? This should tell us whether the issue is likely in the TCP stack or the Ethernet driver.

To confirm, you’re not actually losing data, it just occasionally takes a long time to arrive?

@John_Brochue - Thanks, I will do as you suggested. You are correct - we do not lose data, but it occasionally takes a long time to arrive.

I do not have an ENC28 module. Is there any way I can still buy one (or a few of them)?

I have a couple Wifi RS21 modules, which are also SPI. I will also try that and let you know.

Thanks again for your help.