Aaron Parecki

    HTML is my API

    April 26, 2015

    In August 2012, I wrote a quick script to stream front-page Hackernews stories to an IRC channel on Freenode (##hackernews in case you're interested) so that I could quickly glance at popular stories there instead of needing to load Hackernews. Since IRC is my feed reader, I've always tried to pipe as much there as possible.

    Parsing the Hackernews front page HTML

    In 2012 there was no API for Hackernews, so my options were to use one of the unofficial ones or to read the HTML from the front page myself. I opted to parse the HTML directly since it wasn't super complex. I know you shouldn't parse HTML with a regex, but I did anyway. It worked great, from August 2012 up until April 2015.

    Finally this month I stopped seeing updates in the channel, and went to take a look. It appears they made a slight change to the HTML, which of course broke my regex, since I was matching against some specific markup. I also learned from someone else in the channel that Hackernews had launched an official JSON API in October 2014. I decided to rewrite the code to work with that API rather than try to update my regex.

    Switching to the JSON API

    It took only a few minutes to rewrite the script using two endpoints: topstories.json and the endpoint for fetching a single item. However, within two days I had already hit a problem. The IRC bot started spitting out empty lines, one for each front-page story. My code was by no means foolproof, and I realized that this would happen if the API call to fetch story details returned an empty result.

    So in 2.5 years of parsing the HTML, I never had any problems. In 2 days of parsing the JSON API, I hit a glitch where all the stories were empty.

    2015-04-26 19:43:03     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
    2015-04-26 19:43:06     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
    2015-04-26 19:43:08     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
    2015-04-26 19:43:11     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
    2015-04-26 19:43:14     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
    2015-04-26 19:43:17     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
    2015-04-26 19:43:20     Loqi    [hackernews]  ( points) https://news.ycombinator.com/item?id=
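The empty lines in the log above are what a formatter produces when it is handed an empty item. Here is a minimal sketch of a guard, assuming the public Hacker News Firebase endpoints (`/v0/topstories.json` and `/v0/item/<id>.json`) and the `id`, `title`, and `score` fields of a story item. The function names are hypothetical, and this is not the bot's actual code.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the official Hacker News JSON API.
API = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url):
    """Fetch a URL and decode its JSON body (may be None for a missing item)."""
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def format_story(item):
    """Return an IRC line for a story, or None if the item is empty or incomplete."""
    if not item or not all(key in item for key in ("id", "title", "score")):
        return None
    return (
        "[hackernews] {title} ({score} points) "
        "https://news.ycombinator.com/item?id={id}"
    ).format(**item)


def front_page_lines(limit=30):
    """Yield one IRC line per front-page story, skipping empty API results."""
    ids = fetch_json(API + "/topstories.json") or []
    for story_id in ids[:limit]:
        line = format_story(fetch_json("{}/item/{}.json".format(API, story_id)))
        if line is not None:
            yield line
```

With a check like this, an empty response is skipped instead of being printed as a line with a blank title, score, and ID.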

    Make the visible data machine readable

    Since more people and programs see the HTML than use the API, the HTML ends up being more reliable. Luckily there's a simple way to generate a machine-readable (and JSON!) version of an HTML page without nasty regex parsing hacks. It requires the author of the HTML to add a few classes to their existing markup. We'll use the Hackernews front page markup as an example.

    Here's a snippet of HTML from the current Hackernews front page for a single story.

    <tr class="athing">
      <td align="right" valign="top" class="title">
        <span class="rank">19.</span>
      </td>
      <td>
        <center><a id="up_9443241" onclick="return vote(this)" href="vote?for=9443241&amp;dir=up&amp;auth=4f617c803a44687927291cd133ac825ca9e61c4b&amp;goto=news"><div class="votearrow" title="upvote"></div></a></center>
      </td>
      <td class="title">
        <span class="deadmark"></span>
        <a href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
        <span class="sitebit comhead"> (dmiller.io)</span>
      </td>
    </tr>
    <tr>
      <td colspan="2"></td>
      <td class="subtext">
        <span class="score" id="score_9443241">40 points</span>
        by <a href="user?id=jazzdan">jazzdan</a>
        <a href="item?id=9443241">7 hours ago</a>
        | <a href="flag?id=9443241&amp;on=t&amp;dir=down&amp;auth=4f617c803a44687927291cd133ac825ca9e61c4b&amp;goto=news">flag</a>
        | <a href="item?id=9443241">20 comments</a>
      </td>
    </tr>
    

    If this looks like a chunk of HTML you don't want to touch with a 10-foot pole, I don't blame you. In order to make this easily machine-readable, without requiring custom parsing of HTML, we can add a few Microformats classes to turn these into h-entry posts.

    First, wrap both table rows with a <tbody> tag with a class of h-entry in order to group these rows into a single element. (And yes, it's okay for a table to have multiple tbody tags.)

    <tbody class="h-entry">
    ...
    </tbody>

    Find the line that contains the permalink to the story as well as the name, and add the u-url and p-name classes:

    <a class="u-url p-name" href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>

    Find the link to the person who submitted the post, and add the p-author and h-card classes:

    by <a class="p-author h-card" href="user?id=jazzdan">jazzdan</a>

    Lastly, indicate the time the post was created as well as the post's permalink on news.ycombinator.com:

    <a class="u-url" href="item?id=9443241"><time datetime="2015-04-26T14:15:00-0700" class="dt-published">7 hours ago</time></a>

    Here's a summary of the changes in diff format. The additions are minor and leave the structure of the HTML unchanged, except for the new <tbody> tag.

    @@ -1,3 +1,4 @@
    +<tbody class="h-entry">
    <tr class="athing">
    <td align="right" valign="top" class="title">
    <span class="rank">19.</span>
    @@ -7,7 +8,7 @@
    </td>
    <td class="title">
    <span class="deadmark"></span>
    - <a href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
    + <a class="u-url p-name" href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
    <span class="sitebit comhead"> (dmiller.io)</span>
    </td>
    </tr>
    @@ -15,9 +16,10 @@
    <td colspan="2"></td>
    <td class="subtext">
    <span class="score" id="score_9443241">40 points</span>
    - by <a href="user?id=jazzdan">jazzdan</a>
    - <a href="item?id=9443241">7 hours ago</a>
    + by <a class="p-author h-card" href="user?id=jazzdan">jazzdan</a>
    + <a class="u-url" href="item?id=9443241"><time datetime="2015-04-26T14:15:00-0700" class="dt-published">7 hours ago</time></a>
    | <a href="flag?id=9443241&amp;on=t&amp;dir=down&amp;auth=4f617c803a44687927291cd133ac825ca9e61c4b&amp;goto=news">flag</a>
    | <a href="item?id=9443241">20 comments</a>
    </td>
    </tr>
    +</tbody>


    What does this give us? If you run the new HTML through a microformats parser (online, PHP, Ruby, Python, Node.js), then the result is a data structure representing the item!

    {
      "items": [
        {
          "type": [
            "h-entry"
          ],
          "properties": {
            "author": [
              {
                "type": [
                  "h-card"
                ],
                "properties": {
                  "name": [
                    "jazzdan"
                  ],
                  "url": [
                    "http:\/\/news.ycombinator.com\/user?id=jazzdan"
                  ]
                },
                "value": "jazzdan"
              }
            ],
            "name": [
              "Comparing the PHP 7 and Hack Type Systems"
            ],
            "url": [
              "http:\/\/www.dmiller.io\/blog\/2015\/4\/26\/comparing-the-php7-and-hack-type-systems",
              "http:\/\/news.ycombinator.com\/item?id=9443241"
            ],
            "published": [
              "2015-04-26T14:15:00-0700"
            ]
          }
        }
      ]
    }
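To make the mechanics concrete, here is a toy extractor built on Python's standard `html.parser`. It is a sketch under simplifying assumptions, not a spec-compliant microformats2 parser: it collects `u-*` properties from `href`, `dt-*` properties from `datetime`, and `p-*` properties from text content, and it ignores nesting such as the `h-card` inside `p-author`. The class name `TinyEntryExtractor` and the trimmed markup are made up for illustration. In practice you would use one of the real parsers mentioned above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TinyEntryExtractor(HTMLParser):
    """Toy extractor for p-*, u-* and dt-* classes. Not spec-compliant."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.properties = {}
        self._open = []  # one (tag, p-* names, text buffer) entry per open element

    def _add(self, name, value):
        self.properties.setdefault(name, []).append(value)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        text_props = []
        for cls in (attrs.get("class") or "").split():
            prefix, _, name = cls.partition("-")
            if prefix == "u" and attrs.get("href"):
                # Resolve relative links against the page URL.
                self._add(name, urljoin(self.base_url, attrs["href"]))
            elif prefix == "dt" and attrs.get("datetime"):
                self._add(name, attrs["datetime"])
            elif prefix == "p" and name:
                text_props.append(name)
        self._open.append((tag, text_props, []))

    def handle_data(self, data):
        # Text belongs to every open element that is collecting a p-* value.
        for _, text_props, buffer in self._open:
            if text_props:
                buffer.append(data)

    def handle_endtag(self, tag):
        if tag not in [open_tag for open_tag, _, _ in self._open]:
            return  # stray end tag, ignore it
        # Pop back to the matching start tag; unclosed elements close with it.
        while self._open:
            open_tag, text_props, buffer = self._open.pop()
            for name in text_props:
                self._add(name, "".join(buffer).strip())
            if open_tag == tag:
                break


# The marked-up story from above, trimmed to the elements that carry classes.
MARKUP = """
<table>
<tbody class="h-entry">
<tr class="athing">
  <td class="title">
    <a class="u-url p-name" href="http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems">Comparing the PHP 7 and Hack Type Systems</a>
  </td>
</tr>
<tr>
  <td class="subtext">
    by <a class="p-author h-card" href="user?id=jazzdan">jazzdan</a>
    <a class="u-url" href="item?id=9443241"><time datetime="2015-04-26T14:15:00-0700" class="dt-published">7 hours ago</time></a>
  </td>
</tr>
</tbody>
</table>
"""

extractor = TinyEntryExtractor("http://news.ycombinator.com/")
extractor.feed(MARKUP)
extractor.close()
print(extractor.properties)
```

The point of the sketch is that the extractor knows nothing about Hackernews. It only looks at the class prefixes, so unrelated changes to the surrounding table markup would not affect it.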

    This means consuming code can rely on a Microformats parser to deal with the (potentially messy) HTML, and only needs to handle the well-structured output that you see above.
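As a rough illustration of what that consuming code might look like, the sketch below walks parser output of the shape shown above and builds one line per story. The helper names (`first`, `summarize_entries`, `format_line`) and the line format are assumptions for this example, and the `parsed` dictionary is a trimmed copy of the JSON above.

```python
def first(values, default=None):
    """Return the first element of a list, or a default if it is empty or missing."""
    return values[0] if values else default


def summarize_entries(parsed):
    """Yield a flat dict for each h-entry in microformats2 parser output."""
    for item in parsed.get("items", []):
        if "h-entry" not in item.get("type", []):
            continue
        props = item.get("properties", {})
        author = first(props.get("author"))
        if isinstance(author, dict):
            # A nested h-card: take its name, falling back to its plain value.
            author = first(author.get("properties", {}).get("name"), author.get("value"))
        yield {
            "name": first(props.get("name")),
            "urls": props.get("url", []),
            "author": author,
            "published": first(props.get("published")),
        }


def format_line(entry):
    """Build an IRC line, or return None when the entry lacks a name or URL."""
    if not entry["name"] or not entry["urls"]:
        return None
    # In the example above, the last URL is the permalink on news.ycombinator.com.
    return "[hackernews] {} by {} {}".format(entry["name"], entry["author"], entry["urls"][-1])


# Trimmed copy of the parser output shown above.
parsed = {
    "items": [
        {
            "type": ["h-entry"],
            "properties": {
                "author": [
                    {
                        "type": ["h-card"],
                        "properties": {
                            "name": ["jazzdan"],
                            "url": ["http://news.ycombinator.com/user?id=jazzdan"],
                        },
                        "value": "jazzdan",
                    }
                ],
                "name": ["Comparing the PHP 7 and Hack Type Systems"],
                "url": [
                    "http://www.dmiller.io/blog/2015/4/26/comparing-the-php7-and-hack-type-systems",
                    "http://news.ycombinator.com/item?id=9443241",
                ],
                "published": ["2015-04-26T14:15:00-0700"],
            },
        }
    ]
}

entries = list(summarize_entries(parsed))
for entry in entries:
    print(format_line(entry))
```

Note that the score is not part of the output, because the markup changes above did not add a class to the points element.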

    If Hackernews were to add this markup to the front page, it would give people an easy way to parse the stories. This would enable me to add Hackernews to my IndieWeb reader, as well as give me an easy way to stream the front-page stories to IRC, while not relying on a third-party API.

    IndieWebCamp

    For more about how we're using Microformats to enable cross-site following, commenting and other interactions, check out the IndieWebCamp wiki!

    You might also like to join us for an IndieWebCamp in Düsseldorf, Portland, or Edinburgh!

    Sun, Apr 26, 2015 9:48pm -07:00 #indieweb #api #html #microformats
    • Jacky Alciné playvicious.social/@jalcine

      Could even take it one step further by using HTML as one's API (hat tip to @aaronpk on https://aaronparecki.com/2015/04/26/22/html-is-my-api) to peek at the <source type=> to even filter out particular media types in advance using a whitelist on the client's side.

      That could be extended for the <video> and <audio> tags too.

      Tue, Jan 15, 2019 8:46am +00:00

    Other Mentions

    • [snarfed] snarfed.org
      ^ begging for a reply with https://aaronparecki.com/2015/04/26/22/html-is-my-api
      Fri, Jan 31, 2020 6:29pm +00:00
    • Joshua Ramon Enslin jrenslin.de
      Likes: aaronparecki.com/2015/04/26/22/… jrenslin.de/i.php/note/166
      Thu, Sep 8, 2016 11:00am +00:00 (via brid-gy.appspot.com)
    • felix schwenzel wirres.net
      mf2 geotagging mashup
      Sat, Oct 31, 2015 3:10pm +00:00
    • felix schwenzel wirres.net
      So, I started leaving out the title a while ago. First, because it works (the RSS specification actually allows it, and blog specifications do anyway), and second, because I find it pointless, for (single) links, to put the link title in the headline and then attach the wirres.net blog permalink to it. I also think the Gruber method, where the title points to the external link rather than the permalink, is nonsense.
      That's why, for a while now, I've been building my (single) ...
      Wed, Jul 1, 2015 8:53pm +00:00
    • felix schwenzel wirres.net
      indie war gestern — oder umgekehrt
      Sun, Jun 7, 2015 11:16pm +02:00
    • Kai Hendry hendry.iki.fi
      @t @aaronpk Is there no sane generic HTML parser?
      Thu, Apr 30, 2015 11:12am +00:00 (via brid-gy.appspot.com)
    • Georg Portenkirchner portenkirchner.withknown.com/profile/portenkirchner
      “HTML is my API” by Aaron Parecki https://aaronparecki.com/articles/2015/04/26/1/html-is-my-api #HTML #API
      Tue, Apr 28, 2015 9:52pm +00:00
    • Meekostuff meekostuff.net
      @t @aaronpk How do you use a HTML-payload API in a web-app? @HackerNews
      Tue, Apr 28, 2015 12:14am +00:00 (via brid-gy.appspot.com)
    • Tantek Çelik tantek.com
      “HTML is my API” @aaronpk on @HackerNews’s HTML vs JSON, reliability, and using #microformats2 https://aaronparecki.com/articles/2015/04/26/1/html-is-my-api
      Fri, Apr 17, 2015 3:48pm -07:00

Hi, I'm Aaron Parecki, Director of Identity Standards at Okta, and co-founder of IndieWebCamp. I maintain oauth.net, write and consult about OAuth, and participate in the OAuth Working Group at the IETF. I also help people learn about video production and livestreaming. (detailed bio)

I've been tracking my location since 2008 and I wrote 100 songs in 100 days. I've spoken at conferences around the world about owning your data, OAuth, quantified self, and explained why R is a vowel. Read more.

© 1999-2025 by Aaron Parecki. Powered by p3k. This site supports Webmention.
Except where otherwise noted, text content on this site is licensed under a Creative Commons Attribution 3.0 License.