Mo Omer's software engineering stuff

Not Another Go/Golang net/http Tutorial


Before we get into it, I’d like to provide a little bit of a preface for the motivation behind writing this tutorial. A few months back, Brent Anderson created the meetup.com group for Minnesota’s own Go User Meetup; and, a couple months later, I set about organizing the group and events. One of our own members, Craig Erdmann, and his trusty sidekick Jen Rajkumar offered to host and cater the meetup.

So I set about writing the presentation I wanted to see – not what the group needed. We were going to walk through a pretty obfuscated example from Hoare’s extremely influential ‘Communicating Sequential Processes’, which someone had implemented in Go. After asking for feedback though, it quickly became apparent that we had many new entrants; and, as a result of quickly rewriting the presentation, it was a pretty poor experience for the newcomers.

The idea with this post is to create interesting content for those who have used Go's net/http in the past, while keeping it easy to follow for beginners.

Side note: For those of you who have contacted me asking to write about specific topics (i.e.: Setting up a hadoop cluster, using Mahout’s machine learning algorithms, using Groupcache as a replacement for Redis, etc.), I really appreciate your requests and still plan on writing them - they’re somewhere in the heap.

The Ubiquitous Example


We’ve all seen this example, either from the Golang wiki, or from blog posts regurgitating the same (wonder who does that?):

                
	package main
	
	import (
		"fmt"
		"net/http"
	)
	
	func handler(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
	}
	
	func main() {
		http.HandleFunc("/", handler)
		http.ListenAndServe(":8080", nil)
	}

            

For those of us who haven’t, what’s going on here is:

  • First, we import the “net/http” package.
  • Then, we see a function definition for a function called handler.
    • As arguments, handler takes an http.ResponseWriter and a pointer to an http.Request.
  • Next, in our main() function, we use http.HandleFunc(<path>, <function>) to direct all requests to "/" to our handler function.
  • Finally, we start a server listening on port 8080, with no specified handler (this results in using the built-in DefaultServeMux as our multiplexer).
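A quick way to see this in action is to run the example, then hit it with a browser, curl, or a small Go client like the sketch below (the /gophers path is purely illustrative; any path works).

	package main

	import (
		"fmt"
		"io/ioutil"
		"log"
		"net/http"
	)

	func main() {
		// Assumes the example server above is already running locally on :8080.
		resp, err := http.Get("http://localhost:8080/gophers")
		if err != nil {
			log.Fatal(err)
		}
		defer resp.Body.Close()

		body, err := ioutil.ReadAll(resp.Body)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(string(body)) // prints: Hi there, I love gophers!
	}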

The wiki goes on to talk about how this works at a high level, but let’s take a deeper look.

http.ResponseWriter && http.Request


Our handler function takes two arguments: one of type http.ResponseWriter, and the other a pointer to an http.Request.

The source for http.ResponseWriter shows that it’s an interface type, defined as:


                
	type ResponseWriter interface {
            

A ResponseWriter interface is used by an HTTP handler to construct an HTTP response.


                
	        Header() Header
            

Header returns the header map that will be sent by WriteHeader. Changing the header after a call to WriteHeader (or Write) has no effect.


                
	        Write([]byte) (int, error)
            

Write writes the data to the connection as part of an HTTP reply. If WriteHeader has not yet been called, Write calls WriteHeader(http.StatusOK) before writing the data. If the Header does not contain a Content-Type line, Write adds a Content-Type set to the result of passing the initial 512 bytes of written data to DetectContentType.


                
	        WriteHeader(int)
            

WriteHeader sends an HTTP response header with status code. If WriteHeader is not called explicitly, the first call to Write will trigger an implicit WriteHeader(http.StatusOK). Thus explicit calls to WriteHeader are mainly used to send error codes.

                
	}

            

So, we know that w (the http.ResponseWriter argument in our handler function) is pretty simple - whatever gets passed in really only needs to be able to do three things:

  • Send a response header, with a status code, back to the connection
  • Write data (received by Write() as a []byte) back to the underlying connection
  • Return the header map that will be used by WriteHeader

That last set of bullets may seem unnecessary, as it mostly restates the comments, but it helps us distill the source down to the idea that instead of:

                
	func handler(w http.ResponseWriter, r *http.Request) {
	    fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
	}

            

we could potentially do this in our handler function:

                
	func handler(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json; charset=utf-8")

		myItems := []string{"item1", "item2", "item3"}
		a, err := json.Marshal(myItems) // requires the "encoding/json" import
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}

		w.Write(a)
	}

            

Pretty neat - we’re setting an HTTP header, marshalling a string slice to JSON, and writing it out to our HTTP connection.
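Note that we never call WriteHeader explicitly there, so the first call to Write sends a 200 OK for us. If we wanted a different status code - say, for a hypothetical resource-creation endpoint - we’d call WriteHeader before writing the body. A minimal sketch (assuming the usual "net/http" import):

	func createHandler(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json; charset=utf-8")
		w.WriteHeader(http.StatusCreated) // 201; headers must be set before this call
		w.Write([]byte(`{"status":"created"}`))
	}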

If http.ResponseWriter is for handling our response on the connection, the http.Request is the other half of the equation - the initial request. Its type is actually a struct, and it has a definition a bit lengthier than that of http.ResponseWriter, so I’ve reproduced it here without most of the comments:


                
	type Request struct {
	        Method string
	        URL *url.URL
	        Proto      string // "HTTP/1.0"
	        ProtoMajor int    // 1
	        ProtoMinor int    // 0
	        Header Header
	        Body io.ReadCloser
	        ContentLength int64
	        TransferEncoding []string
	        Close bool
	        Host string
	        Form url.Values
	        PostForm url.Values
	        MultipartForm *multipart.Form
	        Trailer Header
	        RemoteAddr string
	        RequestURI string
	        TLS *tls.ConnectionState
	}
            

A Request represents an HTTP request received by a server or to be sent by a client. The field semantics differ slightly between client and server usage. In addition to the notes on the fields below, see the documentation for Request.Write and RoundTripper.

Suffice it to say that if any request parameters, data, etc. had come through, we could retrieve and parse them from that http.Request pointer.
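For example, here’s a sketch (the handler name, parameter names, and paths are purely illustrative, and the usual "fmt" and "net/http" imports are assumed) of pulling a few of those pieces out of the request:

	func searchHandler(w http.ResponseWriter, r *http.Request) {
		// Query-string parameters, e.g. /search?q=gophers
		q := r.URL.Query().Get("q")

		// FormValue parses the form for us, checking the URL query string
		// as well as any POSTed form body.
		page := r.FormValue("page")

		// Headers and other request metadata sit right on the struct.
		ua := r.Header.Get("User-Agent")

		fmt.Fprintf(w, "q=%q page=%q ua=%q\n", q, page, ua)
	}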

ServeMux


In our example, we’re not making use of the http.Request pointer argument at all – why, then, do we need to include it as an argument to our handler function? The short answer is that we’re conforming to the signature http.HandleFunc (and, by extension, the DefaultServeMux) requires of a handler function. You might be wondering where the hell this DefaultServeMux came from - we haven’t seen any declaration of it, or anything; but, when we passed in nil to our

                
	http.ListenAndServe(":8080", nil)

            

call, we implicitly asked ListenAndServe to multiplex incoming connections over the default ServeMux (DefaultServeMux), which will take care of routing requests to the correct handler functions, as long as we’ve “registered” them appropriately.

We’ve only registered a handler function for the "/" path with the DefaultServeMux. In order to do that, though, we needed to call http.HandleFunc, which is defined as:


                
	func HandleFunc(pattern string, handler func(ResponseWriter, *Request)) {
		DefaultServeMux.HandleFunc(pattern, handler)
	}
            

HandleFunc registers the handler function for the given pattern in the DefaultServeMux. The documentation for ServeMux explains how patterns are matched.

As an aside: I’d like to point out how easy it is to reason about the code in the built-in packages. Compare that definition to the actual call to http.HandleFunc in our example code:

                
	func handler(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
	}

	func main() {
		http.HandleFunc("/", handler)
		...
	}

            

We can see that we satisfy the requirements: we pass http.HandleFunc a string and a function which takes both a ResponseWriter and a *Request as arguments.
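Under the hood, HandleFunc is just a convenience: anything satisfying the http.Handler interface (a single ServeHTTP method) can be registered with http.Handle instead. Here’s a sketch of the same handler written that way (the greeter type is made up for illustration):

	package main

	import (
		"fmt"
		"net/http"
	)

	// greeter satisfies the http.Handler interface.
	type greeter struct{}

	func (g greeter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
	}

	func main() {
		http.Handle("/", greeter{})       // register an http.Handler...
		http.ListenAndServe(":8080", nil) // ...with the DefaultServeMux, as before
	}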

So, what is ServeMux? From the docs:

ServeMux is an HTTP request multiplexer. It matches the URL of each incoming request against a list of registered patterns and calls the handler for the pattern that most closely matches the URL.

Patterns name fixed, rooted paths, like “/favicon.ico”, or rooted subtrees, like “/images/” (note the trailing slash). Longer patterns take precedence over shorter ones, so that if there are handlers registered for both “/images/” and “/images/thumbnails/”, the latter handler will be called for paths beginning “/images/thumbnails/” and the former will receive requests for any other paths in the “/images/” subtree.

Note that since a pattern ending in a slash names a rooted subtree, the pattern “/” matches all paths not matched by other registered patterns, not just the URL with Path == “/”.

Patterns may optionally begin with a host name, restricting matches to URLs on that host only. Host-specific patterns take precedence over general patterns, so that a handler might register for the two patterns “/codesearch” and “codesearch.google.com/” without also taking over requests for “http://www.google.com/”.

ServeMux also takes care of sanitizing the URL request path, redirecting any request containing . or .. elements to an equivalent .- and ..-free URL.
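Nothing forces us to lean on the DefaultServeMux, either. Below is a sketch of the same server wired up with an explicitly created ServeMux via http.NewServeMux; the /images/ handler is made up purely to show a rooted-subtree pattern alongside "/".

	package main

	import (
		"fmt"
		"net/http"
	)

	func handler(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
	}

	func imageHandler(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "an image would be served for %s", r.URL.Path)
	}

	func main() {
		mux := http.NewServeMux()
		mux.HandleFunc("/", handler)             // catch-all, as described above
		mux.HandleFunc("/images/", imageHandler) // rooted subtree (note the trailing slash)

		// Passing the mux instead of nil means the DefaultServeMux is never consulted.
		http.ListenAndServe(":8080", mux)
	}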

ListenAndServe


Hopefully some of the ‘magic’ behind the ubiquitous Go/Golang net/http example has been demystified. Or you’re thoroughly confused, and should send me feedback.

Assuming the former, let’s look behind the curtain of our final function call, http.ListenAndServe(":8080", nil). Once again, from the source:


                
	func ListenAndServe(addr string, handler Handler) error {
		server := &Server{Addr: addr, Handler: handler}
		return server.ListenAndServe()
	}
            

ListenAndServe listens on the TCP network address addr and then calls Serve with handler to handle requests on incoming connections. Handler is typically nil, in which case the DefaultServeMux is used.

We get a new Server struct as server, and then call the ListenAndServe method on that server which, if we follow the source, will lead us to calling the Serve method on our Server struct, passing it a tcpKeepAliveListener, which

sets TCP keep-alive timeouts on accepted connections. It’s used by ListenAndServe and ListenAndServeTLS so dead TCP connections (e.g. closing laptop mid-download) eventually go away.

For continuity, here’s the source of the Server.ListenAndServe method, where we can see the call to Server.Serve:


                
	func (srv *Server) ListenAndServe() error {
		addr := srv.Addr
		if addr == "" {
			addr = ":http"
		}
		ln, err := net.Listen("tcp", addr)
		if err != nil {
			return err
		}
		return srv.Serve(tcpKeepAliveListener{ln.(*net.TCPListener)})
	}
            

ListenAndServe listens on the TCP network address srv.Addr and then calls Serve to handle requests on incoming connections. If srv.Addr is blank, ":http" is used.
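Since the package-level http.ListenAndServe is just building a Server for us, we can construct one ourselves whenever we want more control. Here’s a hedged sketch (the timeout values are arbitrary, purely for illustration) using fields the http.Server struct exposes:

	package main

	import (
		"fmt"
		"log"
		"net/http"
		"time"
	)

	func handler(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hi there, I love %s!", r.URL.Path[1:])
	}

	func main() {
		http.HandleFunc("/", handler) // still registering with the DefaultServeMux

		srv := &http.Server{
			Addr:         ":8080",
			Handler:      nil,              // nil still means DefaultServeMux
			ReadTimeout:  10 * time.Second, // arbitrary illustrative timeouts
			WriteTimeout: 10 * time.Second,
		}
		log.Fatal(srv.ListenAndServe())
	}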

Server.ListenAndServe then returns the result of that Server.Serve call. The final call, to our server’s Serve method, is defined as:


                
	func (srv *Server) Serve(l net.Listener) error {
		defer l.Close()
        
		var tempDelay time.Duration // how long to sleep on accept failure
		for {
			rw, e := l.Accept()
			if e != nil {
				if ne, ok := e.(net.Error); ok && ne.Temporary() {
					if tempDelay == 0 {
						tempDelay = 5 * time.Millisecond
					} else {
						tempDelay *= 2
					}
					if max := 1 * time.Second; tempDelay > max {
						tempDelay = max
					}
					srv.logf("http: Accept error: %v; retrying in %v", e, tempDelay)
					time.Sleep(tempDelay)
					continue
				}
				return e
			}
			tempDelay = 0
			c, err := srv.newConn(rw)
			if err != nil {
				continue
			}
            
			c.setState(c.rwc, StateNew) // before Serve can return
			go c.serve()
		}
	}
            

Serve accepts incoming connections on the Listener l, creating a new service goroutine for each. The service goroutines read requests and then call srv.Handler to reply to them.

We can see that Server.Serve loops, blocking on calls to l.Accept (on the tcpKeepAliveListener that was passed to Serve), and if we ignore the net.Error back-off handling made up of nested if statements, it’s clear that we create a newConn (new connection) from what we received from l.Accept, and then set some state on that connection.

For our discussion, we’ll hand-wave that stuff, and go straight to the final piece of magic:

                
	...
	go c.serve()
	...

            

Now, it is plain that no matter what happened with the hand-wavy stuff, we’ve called the serve method on our newly accepted connection in its own goroutine. That goroutine, along with the others already running (like our current main() goroutine), will be multiplexed by the Go runtime.

That’s it


In under 300 lines of markdown, we’ve uncovered the magic behind the fantastically simple net/http server. Granted, I skipped some source, comments, and tons of options (TLS anyone?); but, you now have the tools and links to source to learn more.

Also, this isn’t a stab at Rails, since I’m still an active and avid user of it; but you just can’t delve into its innards the way you can with Go. Sure, you might argue that WEBrick, Puma, Unicorn, etc. aren’t part of Rails, but that’s beside the point.

If you’re interested in further demystifying what’s going on, I’d recommend trying to understand the Go runtime itself. Check out the Analysis of the Go runtime scheduler by Neil Deshpande, Erica Sponsler, and Nathaniel Weiss @ Columbia University - which, although somewhat dated, is still very relevant. The only note I would make as you read it, though, is that the changes discussed (proposed by Dmitry Vyukov) are now partially present in the current Go runtime scheduler.

Last note: If you liked this, you should check out my Common Mistakes Made with Golang, and Resources to Learn From post.


Indepth Golang: Resources and Notes


I was hanging out at work for a little after-hours learning/coding this past Friday, when, in IRC, user Kharybs on #go-nuts asked, “Anyone know where I can find documentation on the structure of the Go source?”

After providing kharybs with some general links, I started to think that it wouldn’t be a bad thing if I went and reread the Go Spec and Memory Model documents, since I’d last read them while I was just learning the language. Plus, there were additional talks and documents I’d linked that looked great, but hadn’t yet reviewed.

So, today (Saturday), I’m starting to work through them; and will update this post with links to highlights and/or notes for the larger resources as I do.

Resources


Language Overview

Go Language Research

Reference Documents

Misc & Tangentially Related


AJAX/JavaScript Enabled Parsing with Apache Nutch and Selenium

Web Crawling with Apache Nutch

Nutch is an open-source, large-scale web crawler which is now based on the MapReduce paradigm. I won’t get too deep into the specifics, as there’s a really great article on Gigaom that describes Nutch’s history in a bit more depth; but, work on Nutch originally began back in 2002 under “then-Internet Archive search director Doug Cutting and University of Washington graduate student Mike Cafarella.” Over the course of the next few years, Yahoo! would hire Cutting and ‘split’ the Hadoop project out of Nutch.

MapReduce

MapReduce: Simplified Data Processing on Large Clusters - Google

Nutch became an Apache incubator project in 2005, and a Top-Level Project in 2010; and, thanks to many committers’ work, you can be up and running a large-scale web crawl within just a few minutes of downloading the source. Sidenote: see the Nutch 1.x tutorial for a more user-friendly walkthrough.

Ungraceful Degradation and Empty Document Woes

After reading the above, you’re probably pretty excited to download Nutch, donate some money to Apache, and start a large-scale web crawl; and, you should be! Let’s imagine that you run off and start a crawler immediately. Once the crawler has been running for a while, you might decide to start doing some analysis on your truly awesome set of documents, only to find out that:

A) Some websites seem to have the same content for each page

B) That content looks pretty much exactly like the static areas of the site that don’t change

  • Header nav
  • Sidebar
  • Footer

C) Despite Twitter’s announcement that deferring JavaScript execution until content has been rendered improves performance and user experience, many developers (including myself at times with AngularJS) throw graceful degradation and progressive enhancement out the window in favor of the “Too bad, enable JavaScript” pattern.

Wait a second, though - there’s no way that many sites are using AJAX/JavaScript to load key bodies of content, right? They’d miss out on all the SEO they so clearly love - you can tell, because they still stuff their <meta name="keywords" /> tag despite not using Google Site Search, and despite Google no longer using it for web rankings!

With a bit more Googling, you stumble across a Google Webmasters document about AJAX Crawling, which describes a hack/workaround that Google suggests AJAX based websites implement in order to get properly crawled.

In opening up the source of some of the wacky pages, you discover that, sure enough, there’s a <meta name="fragment" content="!"> tag in the content. To add to your good fortune, you see that someone has already put in some work to patch escaped_fragment-following functionality into Nutch.

You’re so close, you can almost taste success.

Then, you head over to the site’s “HTML Snapshot” at http://typicalajaxsite.com/#!key=value only to find that they’ve improperly implemented Google’s recommended hack/work-around. Now you’re back to square one.

Fuck.

picture by Moyan Brenn on Flickr

JavaScript Parsing in Java

Maybe someone else has encountered this issue; and, since Nutch is designed to be pluggable, maybe they’ve even written a plugin for it! Additional searching reveals the closest solution to your problem: a lone plugin on GitHub that made use of htmlunit, a “GUI-Less browser for Java programs.” That sounds just fine - you’ve already burned over an hour just getting to this point, and figure that, even with the htmlunit team’s acknowledgement that it provides “fairly good JavaScript support (which is constantly improving)”, it might be worth a shot. Just a bit of tinkering was needed to get the plugin working with Nutch 2.2.1, so you set about re-crawling the problem sites.

And checking the Lucene index in Solr 2’s awesome web GUI reveals…

Fuck #2.

picture by Atlantic Community on Flickr

No Dice.

Some sites used knockout.js, and, despite its best efforts, htmlunit just didn’t fit the bill.

So, where to from here? Well, it’d be really neat to make use of the V8 JavaScript engine via either Rhino or the JNI, as Stack Overflow user Dhaivat Pandya mentions, but that seems like a pretty big yak to shave – not to mention that htmlunit already uses Rhino for the core of its JavaScript functionality, and that didn’t work.

Bummer.

This is where I was about a month ago, with a very tight deadline. My goal was to get Nutch parsing AJAX/JavaScript within a day or two, and it didn’t seem like it would be an easy process.

Enter: Selenium

As someone who has written, and continues to write, quite a bit of Ruby/Rails, I’d written a few tests and small-scale crawl scripts using the Selenium WebDriver. Selenium is, as the authors simply put it, “a suite of tools specifically for automating web browsers.”

Typically, Selenium is used to automate testing of web applications; but, it can also be used to programmatically manipulate complex forms during, e.g., repetitive web crawls. So, I quickly put together a plugin that relies on Selenium/Firefox to parse the JavaScript and pass the content back down the Nutch toolchain.

Going back through the rigmarole of creating a Nutch plug-in based on stand-alone Selenium, adding the dependencies and compiling Nutch, running a crawl, and finally checking the Solr web GUI again reveals something great: the AJAX/JavaScript-dependent content is being stored in my Lucene index! Take that, “Post Graceful Degradation Era”.

Triumph

picture by Evan Long on Flickr

Errors, File Descriptors, and Zombies - Oh My!

Things were good for about an hour. I noticed in top that my test box was creating zombie processes at an alarming rate, and they were not being reaped. I went home a little annoyed, but remembered the next day that I’d read that a Selenium Hub / Node (a.k.a. Selenium Grid) set-up would be self-maintaining, in that the hub would remove nodes which stopped responding, and accept them back into the hub/spoke system if they re-registered and behaved.

One thing was sure: this was a good thing, and definitely an improved design over opening and closing Firefox windows like they were going out of style.

Quickly, I put together two docker containers:

and a Nutch plugin to make use of the new set-up:

I started up my containers, then started up Nutch, and was well on my way.

Almost.

Earlier, when I’d created and tried the Nutch-Selenium stand-alone plugin/configuration, I was using a python script to start off my Supervisor daemon in the same fashion as this old version of a Docker Cassandra image by using python’s run_service() function. After making the switch to os.exec() as recommended by a pythonista on IRC, the parent PID of the running Selenium Node process was switched from Python to Supervisord; and, Supervisord’s subprocesses were able to heal themselves once again.

However, Firefox continued to cause issues; and, when asking Supervisord to kill and restart the Selenium Node process didn’t resolve them, I settled on a very hackish solution… I set a cron job to literally kill -9 Firefox periodically.

Well, it works

picture by Nagesh Jayaraman on Flickr

Deep Breath

So, in practice, it all works fine. Every once in a while a few pages will respond with errors to Nutch due to Firefox being down, so that zombie issue still sort of nags at me. However, even though it may not be the prettiest solution to cleaning up after Firefox, I’m still crawling tens of thousands of pages and getting their dynamically loaded content, whether they break their own work-arounds or not. Eventually, since I’m only crawling a specific set of sites, those errored-out pages will get crawled; and, that fits my project’s requirements.

Mo

See the article index for more articles.