Yet another simple URL parser: github.

Motivation

Parsing a URL is neither a challenging nor even an interesting problem, and plenty of implementations already exist.

I still started this side project because 1) the amount of work is moderate: it should take a few hundred lines of code; and 2) it is reasonably practical even though it is a toy project.

After all, as the old saying goes: “learning by doing”.

Design

Since the URL/URI syntax is relatively straightforward, I currently follow the description on Wiki/URL [1].

The syntax of a URI [1]:

URI = scheme:[//authority]path[?query][#fragment]
authority = [userinfo@]host[:port]

  • scheme is mandatory
  • authority is optional, and if authority is present:
    • user info is optional
    • host is mandatory
    • port is optional
  • path is mandatory
  • query is optional
  • fragment is optional

Please note that the parser currently recognizes only well-formed URLs, that is, URLs that follow the syntax above.
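To make the authority rules above concrete, here is a minimal sketch of splitting an authority into userinfo, host and port. It is not the project's code: span_t, authority_t, make_span, span_eq and split_authority are hypothetical names used only for illustration, and the sketch assumes there is no bracketed IPv6 host (whose ':' characters would be mistaken for the port delimiter).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical span type: a pointer into the input plus a length.
 * offset == NULL means the component is absent. */
typedef struct {
  const char *offset;
  unsigned int len;
} span_t;

typedef struct {
  span_t userinfo; /* optional */
  span_t host;     /* mandatory */
  span_t port;     /* optional */
} authority_t;

static span_t make_span(const char *start, const char *end) {
  span_t s;
  s.offset = start;
  s.len = (unsigned int)(end - start);
  return s;
}

/* Compare a span (which is not NUL-terminated) with a C string. */
static int span_eq(span_t s, const char *str) {
  return s.offset != NULL && strlen(str) == s.len &&
         memcmp(s.offset, str, s.len) == 0;
}

/* Split authority = [userinfo@]host[:port] in a single pass.
 * Returns 0 on success, -1 if the mandatory host is empty.
 * Assumption: no bracketed IPv6 literal in the host. */
int split_authority(const char *auth, unsigned int len, authority_t *out) {
  const span_t none = {NULL, 0};
  const char *end = auth + len;
  const char *host_start = auth;
  const char *colon = NULL; /* last ':' seen in the host part */
  const char *p;

  assert(auth != NULL && out != NULL);
  out->userinfo = none;
  out->host = none;
  out->port = none;

  for (p = auth; p < end; p++) {
    if (*p == '@' && out->userinfo.offset == NULL) {
      /* Everything before the first '@' is userinfo. */
      out->userinfo = make_span(auth, p);
      host_start = p + 1;
      colon = NULL; /* any ':' seen so far belonged to userinfo */
    } else if (*p == ':') {
      colon = p; /* candidate port delimiter */
    }
  }

  if (colon != NULL) {
    out->host = make_span(host_start, colon);
    out->port = make_span(colon + 1, end);
  } else {
    out->host = make_span(host_start, end);
  }
  return out->host.len > 0 ? 0 : -1;
}
```

With this sketch, "user:pw@example.com:8080" splits into userinfo "user:pw", host "example.com" and port "8080", while "user@" is rejected because the host is missing.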

Implementation

I borrow the zero-copy design of http-parser [2]: instead of duplicating the URL string, each field only points at an offset within the given URL string, together with a length.

I do not use regular expressions to parse URLs. Instead, the parser simply scans the given URL from beginning to end and looks for the delimiter of each field. This gives O(n) complexity, where n is the length of the URL.
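The delimiter scan can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not the actual implementation: span_t, parts_t, make_span, span_eq and split_uri are hypothetical names, the components are stored by value rather than as allocated pointers, and the authority is left unsplit. Every character is visited once, which is where the O(n) bound comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical span type: a pointer into the URL plus a length.
 * offset == NULL means the component is absent. */
typedef struct {
  const char *offset;
  unsigned int len;
} span_t;

typedef struct {
  span_t scheme, authority, path, query, frag;
} parts_t;

static span_t make_span(const char *start, const char *end) {
  span_t s;
  s.offset = start;
  s.len = (unsigned int)(end - start);
  return s;
}

/* Compare a span (which is not NUL-terminated) with a C string. */
static int span_eq(span_t s, const char *str) {
  return s.offset != NULL && strlen(str) == s.len &&
         memcmp(s.offset, str, s.len) == 0;
}

/* Split scheme:[//authority]path[?query][#fragment] in one left-to-right
 * pass. Nothing is copied; every span points into url.
 * Returns 0 on success, -1 if the mandatory scheme is missing. */
int split_uri(const char *url, parts_t *out) {
  const span_t none = {NULL, 0};
  const char *p = url;
  const char *start;

  assert(url != NULL && out != NULL);
  out->scheme = out->authority = out->path = out->query = out->frag = none;

  /* scheme: non-empty, terminated by ':' */
  start = p;
  while (*p && *p != ':' && *p != '/' && *p != '?' && *p != '#') p++;
  if (*p != ':' || p == start) return -1;
  out->scheme = make_span(start, p);
  p++; /* skip ':' */

  /* authority: present only after "//" */
  if (p[0] == '/' && p[1] == '/') {
    p += 2;
    start = p;
    while (*p && *p != '/' && *p != '?' && *p != '#') p++;
    out->authority = make_span(start, p);
  }

  /* path: always present, possibly empty */
  start = p;
  while (*p && *p != '?' && *p != '#') p++;
  out->path = make_span(start, p);

  /* query: after '?', up to '#' or the end */
  if (*p == '?') {
    start = ++p;
    while (*p && *p != '#') p++;
    out->query = make_span(start, p);
  }

  /* fragment: after '#', up to the end */
  if (*p == '#') {
    start = ++p;
    while (*p) p++;
    out->frag = make_span(start, p);
  }
  return 0;
}
```

Note that the path span is always set, even when it is empty, which mirrors the "path is mandatory" rule, while absent optional components keep a NULL offset.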

The parsing returns a struct as the result:

typedef struct {
  field_t *scheme;     // mandatory
  field_t *usernm;     // optional
  field_t *passwd;     // optional
  field_t *host;       // optional
  field_t *port;       // optional
  field_t *path;       // mandatory
  field_t *query;      // optional
  field_t *frag;       // optional
} url_t;

with each field defined as:

typedef struct {
  char *offset;
  unsigned int len;
} field_t;

If a field is not NULL, then

  • [field_t]->offset: points to the first character of the field in the original URL
  • [field_t]->len: gives the length of the field
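One consequence of this layout is that a field is not NUL-terminated: it is just a window into the original URL, so it must always be read together with its len. The helpers below are a small sketch of how a caller might do that; field_copy, field_equals and field_print are hypothetical names and are not part of the API.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same layout as the field_t shown above. */
typedef struct {
  char *offset;
  unsigned int len;
} field_t;

/* Hypothetical helper: copy a field into a caller-supplied buffer and
 * NUL-terminate it. Returns 0 on success, -1 if the buffer is too small. */
int field_copy(const field_t *f, char *buf, size_t bufsize) {
  assert(f != NULL && buf != NULL);
  if ((size_t)f->len + 1 > bufsize)
    return -1;
  memcpy(buf, f->offset, f->len);
  buf[f->len] = '\0';
  return 0;
}

/* Hypothetical helper: compare a field with a NUL-terminated string.
 * Returns non-zero when they are equal. */
int field_equals(const field_t *f, const char *s) {
  assert(f != NULL && s != NULL);
  return strlen(s) == f->len && memcmp(f->offset, s, f->len) == 0;
}

/* Hypothetical helper: print a field without copying it, using the
 * "%.*s" precision to limit output to len characters. NULL fields
 * (absent optional components) are skipped. */
void field_print(const char *name, const field_t *f) {
  if (f != NULL)
    printf("%s: %.*s\n", name, (int)f->len, f->offset);
}
```

Passing a field's offset directly to functions such as strlen or printf("%s") would read past the end of the field into the rest of the URL, which is why the length-aware forms are used here.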

API

url_t *
url_parse(char *url); // parse the given url, returns the url_t as result

void
url_print(url_t *url_stru); // print parsing result

void
url_del(url_t *url_stru); // delete parsing result, free memory