Encountered two blockers working on this:
1) In a simple example of an img tag inside an e-content tag, the parsers are using the img tag as an implied photo property. This seems wrong to me. It means XRay sees a post like the one below as a photo post and would remove the img tag from the content, which is definitely not the right thing to do. Example:
<div class="h-entry"><p class="e-content p-name">Hello World <img src="example.jpg"></p></div>
{
  "type": [
    "h-entry"
  ],
  "properties": {
    "name": [
      "Hello World http://example.com/example.jpg"
    ],
    "content": [
      {
        "html": "Hello World <img src=\"http://example.com/example.jpg\">",
        "value": "Hello World http://example.com/example.jpg"
      }
    ],
    "photo": [
      "http://example.com/example.jpg"
    ]
  }
}
2) At the point that XRay is sanitizing the HTML value, the Microformats parser has already converted the HTML to plaintext.
For example, XRay sees this object and runs the HTML sanitizer on the HTML value:
{
  "html": "Hello World <img src=\"http://example.com/example.jpg\">",
  "value": "Hello World http://example.com/example.jpg"
}
This means I can't remove the img tag from the plaintext value, since that value has already been generated by the parser. I think my only solution is to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by this relatively long function that does this in the PHP parser. However, that might be the technically better option anyway, since XRay can't be sure exactly what method was used to generate the plaintext value from the original HTML.
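As a rough sketch of what deriving the plaintext from the sanitized HTML might look like (written in Python rather than PHP, purely for illustration): the function name plaintext_from_html and the tag lists below are hypothetical, not XRay's or the PHP parser's actual code, and the whitespace rules are a simplification of what the PHP parser does. It assumes img tags should contribute nothing to the plaintext.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, illustrative set of block-level tags that imply a line break.
BLOCK_TAGS = {"p", "div", "li", "blockquote", "pre", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose text content should never appear in the plaintext.
SKIP_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, turning <br> and block boundaries into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br" or tag in BLOCK_TAGS:
            self.parts.append("\n")
        # Any other tag (including img) contributes nothing to the text.

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def plaintext_from_html(html):
    """Return a plaintext rendering of (already sanitized) HTML."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)  # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)       # drop spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)     # at most one blank line in a row
    return text.strip()
```

With the content from the example above, plaintext_from_html('Hello World <img src="http://example.com/example.jpg">') gives "Hello World", so the image URL no longer leaks into the plaintext value regardless of how the upstream parser generated its own value.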