Since beginning the jf2 spec, I've continued developing XRay, and its format has diverged from the original jf2. Tonight I spent a while trying to reconcile the changes to submit a PR to the spec. I was unable to come up with a short PR, and instead got drawn into thinking about the motivations behind a simpler mf2 JSON format to begin with.
I use XRay in a number of projects for various purposes.
There are a number of things that XRay does when extracting the mf2 data.
- The "name" property is removed if it's a duplicate of the content.
- "published" is always a single string, and "category" is always an array.
- Nested posts are moved into a separate "refs" object, making it easier to consume.
- The "author" property is a simplified h-card containing only name/photo/url properties that are single values.
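As a rough illustration of that cleanup (the property lists here are my own simplification, not XRay's exact rules), the flattening step might look something like this:

```python
# Illustrative sketch of flattening canonical mf2 JSON, where every
# property value is an array, into a simpler object where most
# properties are single values and only designated ones stay arrays.
ALWAYS_PLURAL = {"category", "syndication", "photo"}  # assumed list

def flatten(mf2_item):
    simple = {"type": mf2_item["type"][0].replace("h-", "")}
    for prop, values in mf2_item["properties"].items():
        simple[prop] = values if prop in ALWAYS_PLURAL else values[0]
    return simple

entry = {
    "type": ["h-entry"],
    "properties": {
        "published": ["2016-12-17T13:00:00-08:00"],
        "category": ["indieweb", "microformats"],
    },
}
```

Consumers then get a single string for "published" and a reliable array for "category" without inspecting each value.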
As you can see, a lot of what XRay is doing is cleaning up some of the "messy" parts of Microformats JSON. Not necessarily the specific JSON format, but more the overall structure, such as how an author of a post can be in many different places in a parsed Microformats JSON object. This is not to place blame on Microformats, since what it's doing is creating a JSON representation of the original HTML, and allowing authors flexibility in how they publish HTML, rather than prescribing specific formats, is a core principle.
What this means is XRay is actually acting more as an interpreter of the Microformats JSON, in order to deliver a cleaned-up version to consumers. Most of my projects that use XRay could actually be considered "clients", such as how I use XRay to parse posts for my reader, whether that's output to me in IRC or re-rendered as a post on IndieNews.
My primary need for an alternative Microformats JSON format is actually a client-to-server serialization, where the client is getting a cleaned up version of external posts, and can assume that the server it's talking to is responsible for taking the messy data and normalizing it to something it expects. In this sense, the use case of jf2 is a client-to-server serialization, whereas the Microformats JSON is a server-to-server serialization. This would then be a core building block for Microsub, a spec that provides a standardized way for clients to consume and interact with feeds collected by a server.
The main current challenge in defining a spec for this use case is how tied to specific vocabularies it should be. For example, Microformats JSON says that every value should always be an array. However, there are a few properties for which it never makes sense to have multiple values, and the arrays create additional complexity for consumers, e.g. "location". It's easier to consume these properties when the values can be relied upon to always be single values. With the author of a post, the "author" of an h-entry may be an object or a string, making it more complicated to consume when the type can vary, so XRay's format always returns a consistent value. However, this is tied to the h-entry vocabulary, since other Microformats vocabularies don't have an "author" property. In general, the success I've had with XRay's format is due to the fact that it makes hard decisions about which properties it returns, and is consistent about whether those properties are single- or multi-valued, in order to provide a consistent API to consumers.
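A sketch of what that author normalization could look like, assuming the parsed value arrives either as a plain string (a name or a URL) or as a nested h-card object (the exact rules XRay applies may differ):

```python
# Hypothetical normalization: whatever shape the author arrives in,
# consumers always receive a card with single-valued name/url/photo.
def normalize_author(author):
    card = {"type": "card", "name": None, "url": None, "photo": None}
    if isinstance(author, str):
        key = "url" if author.startswith(("http://", "https://")) else "name"
        card[key] = author
    elif isinstance(author, dict):
        props = author.get("properties", {})
        for key in ("name", "url", "photo"):
            if props.get(key):
                card[key] = props[key][0]  # collapse to a single value
    return card
```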
I am just not sure how to balance providing that simplicity for consuming clients, preserving flexibility for publishers, and not hard-coding so much into the spec that it might be obsoleted later.
A couple days ago, I switched most of my *.p3k.io domains over to individual Let's Encrypt certificates. It was relatively easy for the apps that are running on my main server. However, XRay is actually running on Google App Engine, which means my streamlined workflow for requesting and renewing certificates doesn't apply.
App Engine doesn't have an integration with Let's Encrypt yet, and there is also no API for uploading certificates, so this will require some manual work for now.
The Let's Encrypt client supports a "manual" method of requesting certificates, where it will show you the challenge text and wait for you to put the challenge response onto the server where the client expects to find it. I figured I could use this to request a certificate for my App Engine app.
I had to build a form into XRay that would let me enter the challenge text and save it to be served by App Engine. Of course I couldn't let just anyone use the form, otherwise anyone could request certs for my domain. So I had to build a login mechanism into XRay so that only I can use the form.
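Conceptually, not much is needed to satisfy the http-01 challenge once the response text is saved: the app just has to answer the well-known challenge path. A minimal sketch, with a dictionary standing in for wherever the form persists the value (XRay itself is written in PHP; Python is used here only to illustrate):

```python
# Sketch of serving a manually-entered ACME http-01 challenge: the
# admin form stores the token/response pair, and the app answers
# requests to the well-known path the Let's Encrypt client describes.
challenges = {}  # token -> key authorization string, saved via the form

def save_challenge(token, key_authorization):
    challenges[token] = key_authorization

def handle_request(path):
    prefix = "/.well-known/acme-challenge/"
    if path.startswith(prefix) and path[len(prefix):] in challenges:
        return 200, challenges[path[len(prefix):]]
    return 404, ""
```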
Since XRay is deployed from a public GitHub repository, I couldn't put any secrets in the config file, so this sounded like a great use for indieauth.com which lets me sign in without the consuming website needing any secret keys.
So now I can sign in to XRay:
And after I'm signed in, there is a form to save the challenge text from Let's Encrypt.
I wrote up full setup instructions in the XRay project.
XRay now supports the h-recipe vocabulary!
Thanks to my refactoring yesterday, this was relatively straightforward. I did consolidate a little more of the code to be able to reuse the part that extracts HTML text. Previously it was only used for the "content" property of h-entry, and now it can be used for additional properties such as h-recipe's "instructions".
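For illustration, a shared extractor for e-* properties such as h-entry's "content" or h-recipe's "instructions" might look like this (a sketch based on the canonical mf2 JSON shape, where an e-* value is an object with "html" and "value" keys):

```python
# Sketch of a reusable HTML-text extractor: an e-* property value in
# canonical mf2 JSON is either a plain string or a dict with the raw
# "html" and the plain-text "value".
def html_value(prop_values):
    value = prop_values[0]
    if isinstance(value, dict):
        return {"text": value.get("value", ""), "html": value.get("html", "")}
    return {"text": value}
```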
Today I added the h-review vocabulary to XRay. This means you may now see objects of "type: review" show up when using XRay.
This was going to be a straightforward addition, but I realized that it would have involved duplicating a lot of code in the parsing logic. So I ended up doing quite a bit of refactoring to consolidate the logic of extracting properties from the mf2 objects. This also means it's now a lot easier to add new vocabularies as well! In fact, while adding h-review, I had to also add h-product, in order for the reviewed item to show up correctly.
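To illustrate the idea (a hypothetical simplification, not XRay's actual code), consolidating the extraction logic means each vocabulary becomes a declaration rather than its own parser, which is why adding h-review and h-product together was cheap:

```python
# Hypothetical table-driven vocabularies: adding a new one means
# adding an entry here instead of duplicating parsing code.
PROPERTIES = {
    "entry":   ["name", "published", "category", "content"],
    "review":  ["name", "published", "rating", "item", "content"],
    "product": ["name", "photo", "url"],
    "recipe":  ["name", "ingredient", "instructions"],
}

def parse_item(mf2_type, properties):
    out = {"type": mf2_type.replace("h-", "")}
    for prop in PROPERTIES.get(out["type"], []):
        if prop in properties:
            out[prop] = properties[prop]
    return out
```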
The only remaining issue with this is that the PHP mf2 parser has some issues with backcompat parsing for Microformats 1's hReview, so those end up looking messy right now. Once that's fixed in the parser, XRay will work with hReview as well!
Today I closed a long-standing request on XRay to return the HTTP status code from the retrieved page, as well as parse the
<meta http-equiv="Status" content="410 Gone"> tag in the HTML. I also now return the final URL that XRay retrieved the document from, after following any HTTP redirects that were sent.
This means XRay can now be used to know when a previously received Webmention has been deleted, even if the source website is a static HTML file that returns HTTP 200 with a meta http-equiv tag indicating the delete.
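The delete-detection logic can be sketched roughly like this (my own approximation of the behavior, not XRay's actual code):

```python
import re

# Sketch: derive an effective HTTP status, letting a
# <meta http-equiv="Status"> tag override a static server's 200.
META_STATUS = re.compile(
    r'<meta\s+http-equiv="Status"\s+content="(\d{3})[^"]*"', re.I)

def effective_status(http_code, html):
    match = META_STATUS.search(html)
    if http_code == 200 and match:
        return int(match.group(1))
    return http_code
```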
The XRay response will now include two additional keys:
- "url" - The effective URL that the document was retrieved from. This will be the final URL after following any redirects.
- "code" - The HTTP response code returned by the URL.
These have both been documented in the README as well.
Today I made a few changes to XRay to make it easier to deploy in more kinds of environments. I also removed a bunch of CSS/JS dependencies and simplified the UI a bit.
I dropped the CSS framework I was using, and dropped jQuery. All I was using that for was the silly tab interface on the home page, and I figured it wasn't worth all that extra CSS/JS just for that effect.
I rewrote the CSS for the home page inline, to avoid needing to even fetch an external stylesheet.
I then set out to see what it would take to be able to deploy this to shared hosting, especially in a subfolder. I had to do a few things to make that work.
I wanted it to be as easy to install as "download this zip, extract to a folder on your webserver and run." Since I suspect people might not configure their web server to point to the "public" folder as the root, I had to add a new index.php file to the root of the project which just includes the "public/index.php" file where all the magic happens. I also added .htaccess files in all of the other folders to prevent those files from being run by requests. (The web server should never serve files out of the "vendor" or "views" folder directly.)
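As a sketch, the deny rule in those per-folder .htaccess files might look like this (Apache 2.4 syntax shown; the actual files in the release may differ):

```apacheconf
# vendor/.htaccess, views/.htaccess: block direct web requests to
# these folders so only the index.php entry points serve traffic
Require all denied
```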
I published a zip file in the "Releases" section on GitHub which includes all the necessary composer dependencies already bundled in the file. This means you can just extract the zip and run it!
I tested this out by installing it on my Dreamhost account, and it works great!
You can download the latest version of XRay here:
Continuing yesterday's work, today I added support for parsing Twitter URLs to XRay.
There were a couple tricks to make this work. I wanted to make sure that Tweets are always expanded to include the most data possible, and also wanted to avoid needing to make a bunch of HTTP requests. Scraping from the twitter.com website wasn't an option, since some of the data isn't available or would require additional HTTP calls to fetch. (For example I would have to fetch every t.co URL to expand them.) So I set to work using the Twitter API to fetch the tweets.
I didn't want to hit Twitter rate limits by sharing all XRay access from a single account, and I also didn't want to add a database to XRay so that it can continue to be stateless. This meant that the only option was for the XRay client to pass in its own Twitter credentials when fetching twitter.com URLs. This is an acceptable compromise for me, since it keeps XRay simple, and also avoids me needing to officially get a Twitter app approved. If you want to use this feature, you can go to dev.twitter.com and create an app and access tokens for your account right there, which doesn't even involve writing any code. I've updated the XRay readme with further instructions.
Now p3k will include my Twitter credentials when making a request to XRay for a twitter.com URL, and XRay uses my Twitter credentials to fetch the tweet from the API.
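From the client side, the request might be built something like this (a hypothetical sketch; the parameter names are illustrative, and the XRay README documents the real ones):

```python
from urllib.parse import urlencode, urlparse

# Hypothetical sketch of a client attaching its own Twitter
# credentials to an XRay parse request, only for twitter.com URLs.
def build_parse_request(xray_base, url, twitter_creds=None):
    params = {"url": url}
    if twitter_creds and urlparse(url).netloc.endswith("twitter.com"):
        params.update(twitter_creds)  # e.g. API key/secret, access token
    return xray_base + "/parse?" + urlencode(params)
```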
So now, whenever I repost something on twitter.com, the contents are expanded and my website shows the full Tweet!
Today I added support for XRay to extract data from Instagram URLs!
This means anything that uses XRay will now return structured data when given an Instagram URL, just like how it parses h-entry and other Microformats.
Unfortunately, Instagram does not provide timezone data for the published date, only a Unix timestamp. To work around this, if the photo is tagged at a location, XRay will look up the appropriate timezone for that location and adjust the timezone of the published date accordingly!
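The adjustment itself is straightforward once a UTC offset has been found for the location (the lookup step, which would query a timezone database for the latitude/longitude, is assumed here):

```python
from datetime import datetime, timezone, timedelta

# Sketch of localizing a Unix timestamp once the location lookup has
# produced a UTC offset for the tagged place.
def localize(unix_ts, utc_offset_hours):
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(unix_ts, tz).isoformat()
```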
Here's an excerpt of the parsed JSON for this photo. Note that the timezone is set to East Coast because it was taken at MIT; the location comes through with "name": "Massachusetts Institute of Technology (MIT)".
In addition to my website using this for reposts and comments, when I paste that URL into IRC, Loqi uses XRay to expand it and provide a little text preview.
Earlier this year when I launched XRay, I connected Loqi the IRC bot to it so that we would get inline IRC text previews when people linked to web pages in IRC. XRay works by finding an h-entry on the page, and getting the content and author information from it. Here's what it normally looks like in IRC.
Loqi fetches the URL and finds the h-entry, and posts a summary of it in IRC. This works great most of the time. However, when people started posting URLs to things with fragment IDs (such as the chat log permalinks themselves), then Loqi would report the name of the page rather than the summary of whatever was inside the fragment ID.
Today I updated XRay (issue #15) to handle URLs with fragment IDs. If the URL has a fragment, then it looks for an HTML element with that ID and hands off that subtree to the Microformats parser instead of parsing the whole page.
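A rough sketch of that subtree extraction using only the standard library (a naive approximation that assumes well-nested markup; the real parser handles more edge cases):

```python
from html.parser import HTMLParser

class FragmentExtractor(HTMLParser):
    """Collect the raw source of the element with a given id."""
    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0
        self.capturing = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.capturing:
            self.depth += 1
            self.parts.append(self.get_starttag_text())
        elif dict(attrs).get("id") == self.target_id:
            self.capturing = True
            self.depth = 1
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.capturing:
            self.parts.append(f"</{tag}>")
            self.depth -= 1
            if self.depth == 0:
                self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)

def extract_fragment(html, fragment_id):
    parser = FragmentExtractor(fragment_id)
    parser.feed(html)
    return "".join(parser.parts)
```

The extracted subtree, rather than the whole document, is then handed to the Microformats parser.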
Now the Loqi summaries work for URLs with fragments!