Some thoughts on the XRay and jf2 JSON formats

tantek https://github.com/tantek • Jun 22

#8 Need use-cases section

April 24, 2017
Since beginning the jf2 spec, I've continued developing XRay, and its format has diverged from the original jf2. Tonight I spent a while trying to reconcile the changes so I could submit a PR to the spec. I was unable to come up with a short PR, and instead got drawn into thinking about the motivations behind a simpler mf2 JSON format to begin with.

I use XRay in a number of projects for various purposes.
- My website runs every external URL through XRay to consume the Microformats on the page and convert them to a simplified form. This is used whenever I reply to a post to display the reply context, as well as to fetch the post contents when I make a repost.
- Loqi uses XRay to create a one-line summary of URLs pasted into IRC.
- webmention.io uses XRay to parse the source URL of webmentions to extract useful data about the webmention, and makes this data available via an API.
- IndieNews uses XRay to parse submitted URLs to display the name and author of the posts.
- Quill uses XRay to show a preview of in-reply-to URLs.
- My rudimentary reader uses XRay to extract the h-entry data from posts to display in my reader.
There are a number of things that XRay does when extracting the mf2 data.
- Finds the author of a post following the authorship algorithm
- Follows the comments presentation algorithm to remove the name property if it's a duplicate of the content.
- Figures out the primary object on the page, or whether the page represents a list of posts, which is sometimes tricky. (some discussion on representative object)
- Is vocabulary-aware, so always returns a consistent set of properties, and doesn't return unknown properties. e.g. published is always a single string, and category is always an array.
- Sanitizes all HTML, allowing only a small subset of HTML tags and Microformats classes on the HTML elements.
- For any values that might be embedded objects, e.g. a person-tag or in-reply-to property, always returns the URL in the value and moves the embedded object to a refs object, making it easier to consume.
- The author property is a simplified h-card containing only name/photo/url properties that are single values.
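As a rough illustration of the refs behavior described in the list above, here is a minimal Python sketch. It is not XRay's actual code: the function name `move_embedded_to_refs`, the default property list, and the choice to key refs by the embedded object's `url` are assumptions made for the example.

```python
def move_embedded_to_refs(post, properties=("in-reply-to", "category")):
    """Replace embedded objects with their URL and collect them under "refs".

    `post` is assumed to be an already simplified post whose listed
    properties are lists of strings and/or embedded objects with a "url".
    """
    result = dict(post)                  # shallow copy, the input is not mutated
    refs = dict(post.get("refs", {}))
    for prop in properties:
        if prop not in result:
            continue
        urls = []
        for value in result[prop]:
            if isinstance(value, dict) and "url" in value:
                refs[value["url"]] = value   # keep the full object, keyed by URL
                urls.append(value["url"])    # the property itself keeps only the URL
            else:
                urls.append(value)           # plain strings pass through unchanged
        result[prop] = urls
    if refs:
        result["refs"] = refs
    return result


example_post = {
    "type": "entry",
    "in-reply-to": [
        {"type": "entry", "url": "https://example.com/a", "name": "A"},
    ],
    "category": [
        "indieweb",
        {"type": "card", "url": "https://example.com/bob", "name": "Bob"},
    ],
}
flattened = move_embedded_to_refs(example_post)
```

A consumer can then treat `in-reply-to` and `category` as plain lists of strings, and look up `flattened["refs"][url]` only when it wants the extra detail.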
As you can see, a lot of what XRay is doing is cleaning up some of the "messy" parts of Microformats JSON. This is not so much about the specific JSON format as about the overall structure, such as how the author of a post can be in many different places in a parsed Microformats JSON object. This is not to place blame on Microformats, since what it's doing is creating a JSON representation of the original HTML, and allowing authors flexibility in how they publish HTML rather than prescribing specific formats is a core principle.

What this means is that XRay is actually acting more as an interpreter of the Microformats JSON, in order to deliver a cleaned-up version to consumers. Most of my projects that use XRay could actually be considered "clients": they take the parsed post and present it somewhere, whether that's displayed in my reader, output to me in IRC, or re-rendered as a post on IndieNews.

My primary need for an alternative Microformats JSON format is actually a client-to-server serialization, where the client is getting a cleaned-up version of external posts, and can assume that the server it's talking to is responsible for taking the messy data and normalizing it to something the client expects. In this sense, the use case of jf2 is a client-to-server serialization, whereas Microformats JSON is a server-to-server serialization. This would then be a core building block for Microsub, a spec that provides a standardized way for clients to consume and interact with feeds collected by a server.

The main current challenge in defining a spec for this use case is deciding how tied to specific vocabularies it should be. For example, Microformats JSON says that every value should always be an array. However, there are a few properties for which it never makes sense to have multiple values, e.g. published, uid, and location, and wrapping them in arrays only adds complexity for consumers. It's easier to consume these when they can be relied upon to always be a single value. Similarly, the author of an h-entry may be an object or a string, which makes it more complicated to consume, so XRay's format always returns a consistent value. However, this is tied to the h-entry vocabulary, since other Microformats vocabularies don't have an author property. In general, the success I've had with XRay's format is due to the fact that it makes hard decisions about what properties it returns, and is consistent about whether those properties are single- or multi-valued, in order to provide a consistent API to consumers.
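To make the single- versus multi-valued distinction concrete, here is a small Python sketch of a vocabulary-aware normalizer for one h-entry in Microformats JSON. It is only an illustration under stated assumptions, not XRay's implementation: the names `SINGLE_VALUED`, `MULTI_VALUED`, `simplify_author`, and `simplify_entry` are made up for the example, the property sets are deliberately tiny, and treating a string author that starts with "http" as a URL is a guess for illustration.

```python
# Hypothetical vocabulary tables for h-entry; a real consumer would list more.
SINGLE_VALUED = {"published", "uid", "location", "name", "url"}
MULTI_VALUED = {"category", "in-reply-to"}


def first_plain(values):
    """Return the first value, unwrapping {"value": ...} objects (e.g. photo with alt)."""
    if not values:
        return None
    first = values[0]
    if isinstance(first, dict) and "value" in first:
        return first["value"]
    return first


def simplify_author(value):
    """Always return a card with single-valued name/photo/url keys."""
    card = {"type": "card", "name": None, "photo": None, "url": None}
    if isinstance(value, str):
        # Assumption: a bare string is either the author's URL or their name.
        key = "url" if value.startswith(("http://", "https://")) else "name"
        card[key] = value
        return card
    props = value.get("properties", {})
    for key in ("name", "photo", "url"):
        card[key] = first_plain(props.get(key, []))
    return card


def simplify_entry(mf2_item):
    """Turn an all-arrays h-entry into a consistently shaped dict."""
    entry = {"type": "entry"}
    for key, values in mf2_item.get("properties", {}).items():
        if not values:
            continue
        if key == "author":
            entry["author"] = simplify_author(values[0])
        elif key in SINGLE_VALUED:
            entry[key] = values[0]        # always a single value
        elif key in MULTI_VALUED:
            entry[key] = list(values)     # always a list, even with one item
        # Unknown properties are dropped on purpose.
    return entry


mf2_item = {
    "type": ["h-entry"],
    "properties": {
        "published": ["2017-04-24T20:59:00-07:00"],
        "category": ["jf2"],
        "author": ["https://aaronparecki.com/"],
        "x-unknown": ["ignored"],
    },
}
simple = simplify_entry(mf2_item)
```

The tradeoff is visible in the two tables at the top: consumers get a predictable shape, but every decision about a property has to be hard-coded per vocabulary.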

I am just not sure how to balance wanting to provide that simplicity for consuming clients while also allowing flexibility in publishing, while also not hard-coding too much into a spec that might be obsoleted later.
Portland, Oregon

Mon, Apr 24, 2017 8:59pm -07:00 #jf2 #xray #indieweb

2 likes 1 reply 2 mentions
- Jacky Alcine v2.jacky.wtf
  
  This is something I’m running into as I’m building out Koype. Over at my site, I end up “wrapping” values if they have only one value and work off that. I think it might be a construct of the language but it’s very encouraging to the concept of everything being a list. That said, things like author and location do come up funny (even end, I notice, tends to be a list).
  
  One idea to this would be providing an external schema that could define how a list-value is turned into a more normalized value. Just spitballing.
  
  Thu, Feb 7, 2019 11:30am -07:00
Other Mentions
- Henrique Dias hacdias.com
  
  I’ve been diving a bit into the Microformats and JF2 formats and I was quite confused today. On my new system, I’m storing some properties as a “flattened” version of Microformats. For some reason I assumed that was JF2, but it isn’t! Here’s a nice read!
  
  Thu, Nov 4, 2021 12:58pm -07:00
- Barry Frost barryfrost.com
  
  X-Ray returns structured JSON data from any URL. Potentially useful for extracting reply contexts: it follows authorship and comments presentation rules and constructs a simplified set of data.
  
  Further background: https://aaronparecki.com/2017/04/24/15/jf2
  
  Fri, Apr 28, 2017 9:11am -07:00

Posted in /articles using quill.p3k.io

Aaron Parecki
