AOLserver Virtual Hosting with TCP


By MarkD on June 29, 2004

(I don't care about the back story. Take me to the configuration section)

Update June 29, 2004: I had too many problems with active sites being reported as 'down' by nsvhr, so I've installed Pound as my reverse proxy. It's very nice.

I'm currently running a number of OpenACS-based sites (Integrated Badgertronics, Borkware, The Loudoun Symphony, and others) off of one IP address on a machine hosted by the fine folks at Acorn Hosting. With just the one IP address at my disposal, I run each site as an independent back-end AOLserver instance, fed by an AOLserver running nsvhr on the front end.

The Situation

In mid January, the facility where I was colocating a machine got screwed by WorldCom and had their plug pulled with no notice. I had 10 IP addresses for my websites, and a development address for each of them too (dev.borkware.com for instance). Since I suddenly found myself out of an online home and my sites were dark, I had to find a new place to roost quickly. The Acorn Hosting folks got us set up really quickly. Aside from the work of porting sites from Oracle to PostgreSQL, I also only had one IP address to work with, and couldn't really justify getting additional IPs.

I quickly dismissed the idea of running everyone in one OpenACS instance. I would have collisions in page names, plus I've heard of problems with using subsites in this manner. I didn't particularly want to learn and debug all the subsite code, especially given the pressure of getting the sites back up and running. Also, the user communities and audiences amongst the different sites are very different, so it didn't make sense to lump them all together.

The Setup

Setting up virtual hosting involves having a front-end web server (also known as a "reverse proxy") which accepts all of the requests on the given IP. It then looks at the HTTP Host: header to decide which one of multiple back-end servers should handle the request. (The Host: header contains the hostname part of the request you see in your browser. For http://borkware.com, there would be a Host: borkware.com header in the HTTP request.)
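To make that concrete, the routing decision is basically a table lookup keyed off the Host: header. Here's a hypothetical C sketch of the idea (the table and backend_for_host() are mine for illustration, not nsvhr's actual code; the real mapping lives in the maps section shown further down):

    #include <stddef.h>
    #include <strings.h>

    /* Map literal Host: header values onto back-end URLs. */
    static const struct {
        const char *host;        /* literal Host: header value */
        const char *backend;     /* back-end URL to forward to */
    } hostmap[] = {
        { "borkware.com",        "http://borkware.com:8007" },
        { "borkware.com:80",     "http://borkware.com:8007" },
        { "www.borkware.com",    "http://borkware.com:8007" },
        { "www.borkware.com:80", "http://borkware.com:8007" },
        { NULL, NULL }
    };

    static const char *
    backend_for_host(const char *hostheader)
    {
        int i;

        /* Host names are case-insensitive, hence strcasecmp(). */
        for (i = 0; hostmap[i].host != NULL; i++) {
            if (strcasecmp(hostheader, hostmap[i].host) == 0) {
                return hostmap[i].backend;
            }
        }
        return NULL;    /* unknown host, nothing to proxy to */
    }

A request carrying Host: www.borkware.com gets handed to the back-end listening on port 8007; an unknown host gets whatever not-found behavior you've set up on the front-end.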

nsvhr (which stands for NS Virtual Hosting, and uh, R-something) is the AOLserver module that looks at the Host: header and decides which back-end to use. nsvhr can communicate with the back-ends in one of two ways. One is unix domain sockets (via the nsunix module), which are pretty neat: nsunix passes a file descriptor from one process to the other, so the back-end gets the file descriptor of the actual network connection and writes the resulting data straight through it. The other is plain TCP sockets, which use standard networking calls to move the data back and forth. Unix sockets are the more efficient transport, but I use TCP sockets, so my setup is an nsvhr front-end listening on port 80 that proxies over TCP to a back-end AOLserver instance per site, each on its own high port.
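For the curious, the descriptor passing nsunix does is built on the standard unix trick of shipping an open file descriptor through a unix domain socket with sendmsg() and SCM_RIGHTS. Here's a bare-bones sketch of that idiom (my own illustration, not nsunix's code):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send an open file descriptor to another process over a connected
       unix domain socket.  The receiver does the mirror-image recvmsg()
       and pulls its new descriptor out of CMSG_DATA(). */
    static int
    send_fd(int unix_sock, int fd_to_pass)
    {
        struct msghdr   msg;
        struct iovec    iov;
        struct cmsghdr *cmsg;
        char            ctrl[CMSG_SPACE(sizeof(int))];
        char            dummy = 'x';    /* must send at least one data byte */

        memset(&msg, 0, sizeof(msg));
        memset(ctrl, 0, sizeof(ctrl));

        iov.iov_base       = &dummy;
        iov.iov_len        = 1;
        msg.msg_iov        = &iov;
        msg.msg_iovlen     = 1;
        msg.msg_control    = ctrl;
        msg.msg_controllen = sizeof(ctrl);

        cmsg             = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_RIGHTS;  /* "the payload is file descriptors" */
        cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return (sendmsg(unix_sock, &msg, 0) == -1) ? -1 : 0;
    }

Once the back-end holds the descriptor, it can write the response straight onto the browser's connection without the front-end touching the data again.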

Why not unix sockets?

Why am I using TCP sockets instead of the groovier unix sockets? For me, the unix sockets didn't work out very well. I'm guessing it's because of some interaction with the virtual machine setup that Acorn Hosting uses, where every user has root access and complete control over their virtual machine. I originally got my sites set up using nsvhr + nsunix sockets, but there were a couple of problems. The first was that the front-end would go numb and stop accepting requests. I'd check my sites in the morning and find them unresponsive. Restarting the front-end would make everything come alive again. I eventually set up the arsDigita keepalive to restart the front-end whenever it became unresponsive, but having to do that Just Felt Wrong.

Even worse, the back-ends would occasionally spaz out, going into tight loops reading from the unix socket. The server would still handle requests, but some threads would be stuck in loops, maxing out CPU usage. This is decidedly anti-social behavior on a shared server, and I didn't want to get kicked off the machine. So TCP sockets were the next thing to try.

Setting up TCP Virtual Hosting

Here is how I have things set up. The default setup the Acorn folks supply has everything blocked except port 80 (http), port 443 (https), port 8000 (for http development), and some other ports like ssh and smtp. The web servers can still listen on the other ports, but the outside world can't get to them.

Front-End Configuration

Here are the interesting bits of the front-end configuration file (call it front-end.tcl). The first is increasing the socktimeout for the nssock module, which fixes a problem where folks uploading big files via HTTP POST would get "Invalid HTTP Request" errors:
    ns_section ns/server/${server}/module/nssock
    ns_param socktimeout            240
    ...
Add nsvhr.so to your modules:
    ns_section ns/server/${server}/modules
    ns_param nsvhr  ${bindir}/nsvhr.so
and configure it:
ns_section "ns/server/${servername}/module/nsvhr"
ns_param        Method  "GET"   ;# methods allowed to proxy (can have > 1)
ns_param        Method  "POST"
ns_param        Method  "HEAD"
ns_param        Timeout 600 ;# timeout waiting for back-end
I've got a 10-minute timeout waiting for the back-end to support large file uploads. I'm not sure it's 100% necessary, but I haven't seen any bad behavior from having a large timeout there.

And then you give it the hosts to proxy in the ns/server/server-name/module/nsvhr/maps section:

    # hosts to proxy

    ns_param "loudounsymphony.org"          "http://loudounsymphony.org:8006"
    ns_param "loudounsymphony.org:80"       "http://loudounsymphony.org:8006"
    ns_param "www.loudounsymphony.org"      "http://loudounsymphony.org:8006"
    ns_param "www.loudounsymphony.org:80"   "http://loudounsymphony.org:8006"

    ns_param "borkware.com"                 "http://borkware.com:8007"
    ns_param "borkware.com:80"              "http://borkware.com:8007"
    ns_param "www.borkware.com"             "http://borkware.com:8007"
    ns_param "www.borkware.com:80"          "http://borkware.com:8007"

    ns_param "badgertronics.com"            "http://badgertronics.com:8008"
    ns_param "badgertronics.com:80"         "http://badgertronics.com:8008"
    ns_param "www.badgertronics.com"        "http://badgertronics.com:8008"
    ns_param "www.badgertronics.com:80"     "http://badgertronics.com:8008"
These settings tell nsvhr how to map incoming requests to back-end requests.

Back-End Configuration

There wasn't a whole lot I had to do on the back-end that's different from directly serving stuff on port 80. I increased the socktimeout parameter here too, to eliminate those "Invalid HTTP Request" errors. This is from the borkware.tcl configuration file:
    set httpport               8007
    set hostname               borkware.com
    set address                207.142.4.59

    ns_section ns/server/${server}/module/nssock
    ns_param   address            $address
    ns_param   hostname           $hostname
    ns_param   port               $httpport
    ns_param   socktimeout        240

AOLserver code modifications

I based my setup on AOLserver 3.5.1, mainly because I don't need the i18n patches that AOLserver 3.3ad13 has, and I couldn't locate Jerry Asher's vhosting patches for that version (his site was down). There were two problems with the code that I had to fix: request IP addresses were wrong, and nobody could upload binary files.

Due to the way nsvhr works, all of the back-ends were seeing the IP address of the front-end as the request IP. Every request IP in my server logs was the same, and it looked like some loser at 59.acornhosting.net was hammering my site (no wait, that's me).

Luckily I'm not afraid to dig into the AOLserver source and figure things out (it's actually very beautiful code). I added a header to the request that travels between the front-end and the back-end: x-bork-ip: carries the IP address of the true originator. TCPProxy() in nsvhr.c and Ns_ConnPeer() in nsd/conn.c each needed a bit of code to support that.
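The patches themselves are linked below, but here's the shape of the idea in plain C (the function names here are mine for illustration, not what's in nsvhr.c or conn.c):

    #include <stdio.h>
    #include <string.h>

    /* Front-end side: when forwarding the request, tack on a header
       recording the real client address, since the back-end otherwise
       only sees the front-end as its peer. */
    static void
    append_origin_header(char *headers, size_t size, const char *client_ip)
    {
        char line[64];

        snprintf(line, sizeof(line), "x-bork-ip: %s\r\n", client_ip);
        strncat(headers, line, size - strlen(headers) - 1);
    }

    /* Back-end side: when asked for the request's peer address, prefer
       the x-bork-ip value if the front-end supplied one, else fall back
       to the address on the socket. */
    static const char *
    request_peer(const char *x_bork_ip, const char *socket_peer)
    {
        return (x_bork_ip != NULL && *x_bork_ip != '\0') ? x_bork_ip
                                                         : socket_peer;
    }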

The binary file upload problem was due to SockWrite() in nsvhr.c. It was doing a strlen() on the data going to the back-end. If there happened to be a zero byte in the stream (like when uploading photos), the data would be truncated. The back-end would sit waiting for more data to come in, and eventually fail with an "Error writing content: resource temporarily unavailable" error in ns_conncptofp. Explicitly passing in the length of the data to write fixes the problem (and also saves the CPU time of spinning over the string with strlen()).
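Here's the gist of that fix as a standalone sketch (not the actual nsvhr.c change): the caller hands over the length explicitly, so a zero byte in the middle of a GIF no longer looks like the end of the data:

    #include <errno.h>
    #include <unistd.h>

    /* Write exactly 'len' bytes to the back-end's socket.  Using a
       caller-supplied length instead of strlen(buf) is the whole fix:
       strlen() stops at the first zero byte, which binary uploads are
       full of. */
    static ssize_t
    write_all(int fd, const char *buf, size_t len)
    {
        size_t  sent = 0;
        ssize_t n;

        while (sent < len) {
            n = write(fd, buf + sent, len - sent);
            if (n < 0) {
                if (errno == EINTR) {
                    continue;           /* interrupted, just retry */
                }
                return -1;
            }
            sent += (size_t) n;
        }
        return (ssize_t) sent;
    }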

Here are some patches for nsvhr.c and nsd/conn.c.

Finally

A big upside is that this actually works. Folks can upload images to their photo albums now, and my sites don't lock up or spaz out and make my host mad. The downside is that there's extra resource and CPU consumption, and depending on your bandwidth metering, you could get double-counted for the network traffic, since the front-end <-> back-end traffic counts too. (The fine folks at Acorn Hosting have since fixed this.) Luckily Acorn Hosting has a very liberal bandwidth policy, and I'm far, far from going over my monthly allotment.

I haven't done a lot of the spit and polish yet, like handling page-not-found errors and fixing things up when a back-end doesn't respond. I haven't had any problems with the back-ends locking up, so I haven't been too worried. I also haven't done anything about SSL / https yet. I don't think SSL can be reverse-proxied like this, since the front-end can't peek at the Host: header inside an encrypted connection to pick a back-end. My plans currently only include one site that needs SSL, and it's not ready for prime time yet, so for me it's not too much of an issue.

Appendix: Fixing the zero-byte error

I'm always interested in how folks find programming errors, so I figured I'd share how I tracked down the zero-byte error in SockWrite() in nsvhr.c. My friend Kevin has a photo album on one of my sites, and he was having problems uploading photos, which was very odd since he had previously been able to upload a bunch of stuff. After having him turn off those IE "Friendly" error messages that mask the true error generated (yet another disservice foisted on the world), he was getting "Invalid HTTP Request" errors. Dirk Gomez on #openacs suggested increasing the socktimeout parameter for nssock, since 30 seconds wasn't long enough for big amounts of data to make the trip from computer to DSL modem to front-end to back-end. A higher timeout (240 seconds) fixed that.

His next attempt at uploading stuff gave the "Error writing content: resource temporarily unavailable" error in ns_conncptofp, after about 4 minutes of uploading. WTF was going on?

When I tried it myself, the upload of an image would just hang. I didn't know if it was due to the new satellite internet system we have, which replaced a very bad ISP (avoid Alltel for any of your business dealings), or some weird Mac+Mozilla issue.

I wanted to see what traffic was happening on each of the servers (the front-end and the badgertronics back-end). Was anyone receiving any data? Only part of it? Was some stuck in a queue on my end? After a bit of poking around in the code I discovered that Ns_SockRecv() in nsd/sock.c is the call that does the reading on the sockets. I stuck this in right before the return:

    {
        int i;
        int blah;                /* index where we drop a temporary '\0' */
        unsigned char saved;

        /* Pick a spot to terminate the buffer for the %s print below:
           the last byte if the buffer is full, else just past the data. */
        blah = (nread == toread) ? nread - 1 : nread;
        saved = ((unsigned char*)buf)[blah];
        ((unsigned char*)buf)[blah] = 0;

        /* Dump every byte read, in hex and as a character. */
        for (i = 0; i < nread; i++) {
            Ns_Log (Notice, "TRACE: %d: %x : %c", i,
                    ((char*)buf)[i], ((unsigned char*)buf)[i]);
        }
        Ns_Log (Notice, "TRACE2: %s", (unsigned char*)buf);

        /* Put back the byte we clobbered. */
        ((unsigned char*)buf)[blah] = saved;
    }
This prints out each byte, as well as the whole block read. I stick in a zero byte to terminate the buffer for printing, making sure to restore the original byte when I'm done.

The front-end was getting all of the data, so it wasn't sitting in some modem buffer. The back-end was getting all the header data, but none of the image data. It was stopping right after the MIME header in the HTTP POST that declared the rest of the data to be a GIF image. I wasn't sure why it would be doing that. Maybe something was looking at the data and only reading header information. Looking closer at the output, there was a zero byte right there. How suspicious. I used emacs to change all of the zero bytes in a gif file to 1 bytes and uploaded that. Almost instantly I got a "this is a corrupt image" from the identify program, so it definitely wasn't a connectivity issue on my end. The zero byte was the culprit. That means someone was assuming the data was pure text with nothing binary in it. There's probably a call to strlen() somewhere in the process.

Working backwards from ns_conncptofp, everything on the back-end looked like it was doing the Right Thing, with buffer lengths being passed around explicitly. Looking at the nsvhr code, I found SockWrite() doing a strlen(), and SockWrite() gets called in the TCP path of the virtual hosting. The call to strlen() was unnecessary, since we can easily get the length of the data to be sent to the back-end. Explicitly passing in a length fixed the problem, and made it a hair more efficient, since strlen() is an O(N) operation that has to walk over every byte in the string.

So why the "Error writing content: resource temporarily unavailable" error in ns_conncptofp? That particular error is coming from NsTclWriteContentCmd, and the actual error is happening on the reading from the socket, rather than the writing to the file, so the error is misleading. The "resource temporarily unavailable" is simply the EAGAIN errno, meaning something timed out, try again. In this case, the timeout is waiting for more data from the front-end, which isn't coming since it stopped at the zero byte. The timeout happens, EAGAIN is returned in errno And why was Kevin able to upload photos previously? He did those while I was still struggling with nsunix as the transport mechanism between the front-end and back-end. Recently was the first time he's uploaded photos since I made the switchover.


