HTML is my API

April 26, 2015

In August 2012, I wrote a quick script to stream front-page Hackernews stories to an IRC channel on Freenode (##hackernews in case you're interested) so that I could quickly glance at popular stories there instead of needing to load Hackernews. Since IRC is my feed reader, I've always tried to pipe as much there as possible.

Parsing the Hackernews front page HTML

In 2012 there was no official API for Hackernews, so my only options were to use one of the unofficial ones or read the HTML from the front page myself. I opted to parse the HTML directly since it wasn't super complex. I know you shouldn't parse HTML with a regex, but I did anyway. It worked great from August 2012 until April 2015.

Finally, this month I stopped seeing updates in the channel and went to take a look. It appears they made a slight change to the HTML, which of course broke my regex, since I was matching against some specific markup. I also learned from someone else in the channel that Hackernews had launched an official JSON API the previous October. I decided to rewrite the code to work with that API rather than try to update my regex.
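
As a hypothetical illustration (this is not the regex from the original script), here is the kind of pattern that matches specific markup, and how a small change such as an added class attribute makes it stop matching. The names `STORY_RE` and `extract_story`, the sample markup, and the `storylink` class are all assumptions made up for this sketch.

```python
import re

# Hypothetical pattern: it expects the story link to be written exactly as
# <td class="title"> ... <a href="URL">TITLE</a>, with no other attributes.
STORY_RE = re.compile(
    r'<td class="title">\s*(?:<span class="deadmark"></span>\s*)?'
    r'<a href="([^"]+)">([^<]+)</a>'
)

def extract_story(html):
    """Return (url, title) for the first story link, or None if nothing matches."""
    match = STORY_RE.search(html)
    return match.groups() if match else None

# Sample markup in the style of the front page (values are invented).
original = (
    '<td class="title"><span class="deadmark"></span>'
    '<a href="http://example.com/post">Example post</a></td>'
)
# The same markup after a "slight change": the link gains a class attribute.
changed = original.replace('<a href=', '<a class="storylink" href=')

print(extract_story(original))  # the pattern matches the markup it was written for
print(extract_story(changed))   # the pattern no longer matches
```

Nothing about the story changed between the two samples, only an attribute the pattern did not anticipate, which is why this style of parsing tends to fail silently.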

Switching to the JSON API

It took only a few minutes to rewrite the script using two endpoints: topstories.json for the list of top story IDs, and the item endpoint for the details of each story. However, within two days I had already run into a problem. The bot started spitting out empty lines in the channel, one for each front-page story. My code was by no means foolproof, and I realized that this would happen if the API call to fetch story details returned an empty result.
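
A minimal sketch of that two-endpoint flow in Python, with a guard against empty results, might look like the following. This is not the original script: the function names (`fetch_json`, `format_story`, `front_page_lines`), the `limit` of 30, and the exact output format are assumptions, with the format modeled on the log lines shown further down.

```python
import json
from urllib.request import urlopen

API = "https://hacker-news.firebaseio.com/v0"

def fetch_json(path):
    """Fetch one API resource and decode it (an empty JSON result decodes to None)."""
    with urlopen(f"{API}/{path}", timeout=10) as response:
        return json.load(response)

def format_story(item):
    """Build the IRC line for a story, or return None if the item is unusable."""
    if not isinstance(item, dict) or not item.get("title") or "id" not in item:
        return None  # empty or partial result: skip it rather than print a blank line
    return "[hackernews] {} ({} points) https://news.ycombinator.com/item?id={}".format(
        item["title"], item.get("score", 0), item["id"]
    )

def front_page_lines(fetch=fetch_json, limit=30):
    """Yield one IRC line per story, skipping items that came back empty."""
    for story_id in (fetch("topstories.json") or [])[:limit]:
        line = format_story(fetch(f"item/{story_id}.json"))
        if line is not None:
            yield line
```

Passing the fetch function in as a parameter keeps the formatting logic testable without touching the network, for example by handing it the `get` method of a dict of canned responses.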

So in 2.5 years of parsing the HTML, I never had any problems. In 2 days of parsing the JSON API, I hit a glitch where all the stories were empty.

2015-04-26 19:43:03     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
2015-04-26 19:43:06     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
2015-04-26 19:43:08     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
2015-04-26 19:43:11     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
2015-04-26 19:43:14     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
2015-04-26 19:43:17     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
2015-04-26 19:43:20     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=

Make the visible data machine readable

Since more people and programs see the HTML than use the API, the HTML ends up being the more reliable of the two. Luckily, there's a simple way to get a machine-readable (even JSON!) version of an HTML page without nasty regex parsing hacks: the author of the HTML adds a few classes to their existing markup. We'll use the Hackernews front page markup as an example.

Here's a snippet of HTML from the current Hackernews front page for a single story.

<tr class="athing">
  <td align="right" valign="top" class="title">
    <span class="rank">19.</span>
  </td>
  <td>
    <center><a id="up_9443241" onclick="return vote(this)" href="vote?for=9443241&amp;dir=up&amp;auth=4f617c803a44687927291cd133ac825ca9e61c4b&amp;goto=news"><div class="votearrow" title="upvote"></div></a></center>
  </td>
  <td class="title">
    <span class="deadmark"></span>
    <a href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
    <span class="sitebit comhead"> (dmiller.io)</span>
  </td>
</tr>
<tr>
  <td colspan="2"></td>
  <td class="subtext">
    <span class="score" id="score_9443241">40 points</span>
    by <a href="user?id=jazzdan">jazzdan</a>
    <a href="item?id=9443241">7 hours ago</a>
    | <a href="flag?id=9443241&amp;on=t&amp;dir=down&amp;auth=4f617c803a44687927291cd133ac825ca9e61c4b&amp;goto=news">flag</a>
    | <a href="item?id=9443241">20 comments</a>
  </td>
</tr>

If this looks like a chunk of HTML you don't want to touch with a 10-foot pole, I don't blame you. In order to make this easily machine-readable, without requiring custom parsing of HTML, we can add a few Microformats classes to turn these into h-entry posts.

First, wrap both table rows with a <tbody> tag with a class of h-entry in order to group these rows into a single element. (And yes, it's okay for a table to have multiple tbody tags.)

<tbody class="h-entry">
...
</tbody>

Find the line that contains the permalink to the story as well as the name, and add the u-url and p-name classes:

<a class="u-url p-name" href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>

Find the link to the person who submitted the post, and add the p-author and h-card classes:

by <a class="p-author h-card" href="user?id=jazzdan">jazzdan</a>

Lastly, indicate the time the post was created as well as the post's permalink on news.ycombinator.com:

<a class="u-url" href="item?id=9443241"><time datetime="2015-04-26T14:15:00-0700" class="dt-published">7 hours ago</time></a>

Here's a summary of the changes in diff format. You can see the additions are extremely minor and, with the exception of the new <tbody> tag, don't change the structure of the HTML.

@@ -1,3 +1,4 @@
+<tbody class="h-entry">
 <tr class="athing">
   <td align="right" valign="top" class="title">
     <span class="rank">19.</span>
@@ -7,7 +8,7 @@
   </td>
   <td class="title">
     <span class="deadmark"></span>
-    <a href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
+    <a class="u-url p-name" href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
     <span class="sitebit comhead"> (dmiller.io)</span>
   </td>
 </tr>
@@ -15,9 +16,10 @@
   <td colspan="2"></td>
   <td class="subtext">
     <span class="score" id="score_9443241">40 points</span>
-    by <a href="user?id=jazzdan">jazzdan</a>
-    <a href="item?id=9443241">7 hours ago</a>
+    by <a class="p-author h-card" href="user?id=jazzdan">jazzdan</a>
+    <a class="u-url" href="item?id=9443241"><time datetime="2015-04-26T14:15:00-0700" class="dt-published">7 hours ago</time></a>
     | <a href="flag?id=9443241&amp;on=t&amp;dir=down&amp;auth=4f617c803a44687927291cd133ac825ca9e61c4b&amp;goto=news">flag</a>
     | <a href="item?id=9443241">20 comments</a>
   </td>
 </tr>
+</tbody>

What does this give us? If you run the new HTML through a microformats parser (there is an online one, plus libraries for PHP, Ruby, Python, and Node.js), the result is a data structure representing the item!

{
  "items": [
    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "author": [
          {
            "type": [
              "h-card"
            ],
            "properties": {
              "name": [
                "jazzdan"
              ],
              "url": [
                "http:\/\/news.ycombinator.com\/user?id=jazzdan"
              ]
            },
            "value": "jazzdan"
          }
        ],
        "name": [
          "Comparing the PHP 7 and Hack Type Systems"
        ],
        "url": [
          "http:\/\/www.dmiller.io\/blog\/2015\/4\/26\/comparing-the-php7-and-hack-type-systems",
          "http:\/\/news.ycombinator.com\/item?id=9443241"
        ],
        "published": [
          "2015-04-26T14:15:00-0700"
        ]
      }
    }
  ]
}
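
To give a rough feel for what such a parser does with those classes, here is a toy sketch using only Python's standard library. It is not a spec-compliant microformats2 parser and is no substitute for the real libraries: `ToyHEntryParser` is a made-up name, it assumes well-formed markup with explicit closing tags, it understands only the four properties used in this example, and it flattens the author to a plain string instead of a nested h-card.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class ToyHEntryParser(HTMLParser):
    """Collect p-name, p-author, u-url and dt-published inside h-entry elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.items = []
        self._props = None   # properties of the h-entry being read, if any
        self._depth = 0      # number of open elements inside the current h-entry
        self._text = None    # (depth, property names, text chunks) while capturing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._props is None:
            if "h-entry" in classes:
                self._props, self._depth = {}, 1
            return
        if tag in VOID_TAGS:
            return
        self._depth += 1
        if "u-url" in classes and attrs.get("href"):
            # Relative links are resolved against the page URL.
            self._props.setdefault("url", []).append(urljoin(self.base_url, attrs["href"]))
        if "dt-published" in classes and attrs.get("datetime"):
            self._props.setdefault("published", []).append(attrs["datetime"])
        names = [c[2:] for c in classes if c in ("p-name", "p-author")]
        if names and self._text is None:
            self._text = (self._depth, names, [])

    def handle_data(self, data):
        if self._text is not None:
            self._text[2].append(data)

    def handle_endtag(self, tag):
        if self._props is None or tag in VOID_TAGS:
            return
        if self._text is not None and self._text[0] == self._depth:
            _, names, chunks = self._text
            for name in names:
                self._props.setdefault(name, []).append("".join(chunks).strip())
            self._text = None
        self._depth -= 1
        if self._depth == 0:  # the h-entry element itself just closed
            self.items.append({"type": ["h-entry"], "properties": self._props})
            self._props = None

html = """
<table>
<tbody class="h-entry">
<tr><td class="title">
  <a class="u-url p-name" href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
</td></tr>
<tr><td class="subtext">
  by <a class="p-author h-card" href="user?id=jazzdan">jazzdan</a>
  <a class="u-url" href="item?id=9443241"><time datetime="2015-04-26T14:15:00-0700" class="dt-published">7 hours ago</time></a>
</td></tr>
</tbody>
</table>
"""

parser = ToyHEntryParser("http://news.ycombinator.com/")
parser.feed(html)
print(parser.items)
```

The point is only that the class names carry all the information the parser needs; it never has to know anything about table rows, vote arrows, or the rest of the page layout.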

This means consuming code can rely on a Microformats parser to deal with the (potentially messy) HTML, and only needs to handle the well-structured output that you see above.
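
As a sketch of what that consuming code could look like, the following Python turns parsed output of the shape shown above into one line per story. The helper names (`first`, `author_name`, `irc_lines`) and the line format are assumptions for illustration, and picking the last `url` value as the news.ycombinator.com permalink relies on the ordering in this particular example.

```python
import json

def first(values, default=None):
    """Return the first element of a property's value list, or a default."""
    return values[0] if values else default

def author_name(author):
    """An author value may be a plain string or a nested h-card object."""
    if isinstance(author, dict):
        return first(author.get("properties", {}).get("name", []), author.get("value"))
    return author

def irc_lines(parsed):
    """Yield one line per h-entry in a parsed microformats2 document."""
    for item in parsed.get("items", []):
        if "h-entry" not in item.get("type", []):
            continue
        props = item.get("properties", {})
        name = first(props.get("name", []))
        urls = props.get("url", [])
        if not name or not urls:
            continue  # never emit a blank line for an incomplete entry
        author = author_name(first(props.get("author", [])))
        byline = f" by {author}" if author else ""
        # Assumption: the last u-url is the permalink on news.ycombinator.com.
        yield f"[hackernews] {name}{byline} {urls[-1]}"

sample = json.loads("""
{"items": [{"type": ["h-entry"], "properties": {
  "author": [{"type": ["h-card"],
              "properties": {"name": ["jazzdan"]},
              "value": "jazzdan"}],
  "name": ["Comparing the PHP 7 and Hack Type Systems"],
  "url": ["http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems",
          "http://news.ycombinator.com/item?id=9443241"],
  "published": ["2015-04-26T14:15:00-0700"]}}]}
""")

for line in irc_lines(sample):
    print(line)
```

Because every property arrives as a list, the code checks for missing values up front instead of assuming each field is present.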

If Hackernews were to add this markup to the front page, it would give people an easy way to parse the stories. This would enable me to add Hackernews to my IndieWeb reader, as well as give me an easy way to stream the front-page stories to IRC, while not relying on a third-party API.

IndieWebCamp

For more about how we're using Microformats to enable cross-site following, commenting and other interactions, check out the IndieWebCamp wiki!

You might also like to join us for an IndieWebCamp in Düsseldorf, Portland, or Edinburgh!

Sun, Apr 26, 2015 9:48pm -07:00 #indieweb #api #html #microformats

Jacky Alciné playvicious.social/@jalcine

Could even take it one step further by using HTML as one's API (hat tip to @aaronpk on https://aaronparecki.com/2015/04/26/22/html-is-my-api) to peek at the <source type=> to even filter out particular media types in advance using a whitelist on the client's side.
That could be extended for the <video> and <audio> tags too.

Tue, Jan 15, 2019 8:46am +00:00

Other Mentions

[snarfed] snarfed.org

^ begging for a reply with https://aaronparecki.com/2015/04/26/22/html-is-my-api

Fri, Jan 31, 2020 6:29pm +00:00
Joshua Ramon Enslin jrenslin.de

Likes: aaronparecki.com/2015/04/26/22/… jrenslin.de/i.php/note/166

Thu, Sep 8, 2016 11:00am +00:00 (via brid-gy.appspot.com)
felix schwenzel wirres.net
mf2 geotagging mashup
Sat, Oct 31, 2015 3:10pm +00:00
felix schwenzel wirres.net

so, i started leaving out the title a while ago. first, because it's possible (the RSS specification actually allows it, and blog specifications do anyway), and second, because i find it pointless, for (single) links, to put the link title in the headline and then attach the wirres.net blog permalink to it. and i also think the Gruber method, of pointing the title at the external link rather than the permalink, is nonsense.
that's why, for a while now, i've been building my (single) ...

Wed, Jul 1, 2015 8:53pm +00:00
felix schwenzel wirres.net
indie was yesterday, or the other way around
Sun, Jun 7, 2015 11:16pm +02:00
Kai Hendry hendry.iki.fi

@t @aaronpk Is there no sane generic HTML parser?

Thu, Apr 30, 2015 11:12am +00:00 (via brid-gy.appspot.com)
Georg Portenkirchner portenkirchner.withknown.com/profile/portenkirchner

“HTML is my API” by Aaron Parecki https://aaronparecki.com/articles/2015/04/26/1/html-is-my-api #HTML #API

Tue, Apr 28, 2015 9:52pm +00:00
Meekostuff meekostuff.net

@t @aaronpk How do you use a HTML-payload API in a web-app? @HackerNews

Tue, Apr 28, 2015 12:14am +00:00 (via brid-gy.appspot.com)
Tantek Çelik tantek.com

“HTML is my API” @aaronpk on @HackerNews’s HTML vs JSON, reliability, and using #microformats2 https://aaronparecki.com/articles/2015/04/26/1/html-is-my-api

Fri, Apr 17, 2015 3:48pm -07:00
