Aaron Parecki

    Day 60: Emoji Detector Library for PHP #100DaysOfIndieWeb

    February 18, 2017

    I wanted to find all emoji in a string, including info about them, for my next #100Days project. However, I couldn't find a library that does this. The closest I found were iamcal's Emoji conversion library, which can replace emoji in a string with HTML tags, and the EmojiOne library, which can replace emoji in a string with shortcodes.

    I started down a path of attempting to understand Unicode encoding. A very helpful resource is this post from 2003 titled "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". It's worth a read if you have to deal with user input at all.

    If you aren't familiar with the details of Emoji, Unicode and UTF-8 encoding, you may not realize that an emoji such as 👨‍👩‍👦‍👦 is actually composed of seven Unicode code points. Each person is a separate character, and they are all connected with the "Zero Width Joiner" (ZWJ) character, U+200D. This ends up being seven code points in total: 👨 [ZWJ] 👩 [ZWJ] 👦 [ZWJ] 👦. There are also skin tone modifiers, which are separate code points of their own. So an emoji like 👍🏼 is actually two code points: the 👍 plus the skin-tone-3 modifier.
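To make those counts concrete, here is a minimal sketch that lists the code points in an emoji. It assumes PHP 7+ (for the `\u{...}` string escapes) with the iconv extension available. `emoji_code_points` is a hypothetical helper name used only for illustration, not something from an existing library.

```php
<?php
// Hypothetical helper: list the Unicode code points of a UTF-8 string as hex.
function emoji_code_points(string $s): array {
    // UTF-32BE stores every code point as one 4-byte big-endian integer,
    // so unpacking 32-bit integers yields one number per code point.
    $utf32 = iconv('UTF-8', 'UTF-32BE', $s);
    $ints = array_values(unpack('N*', $utf32));
    return array_map(function ($cp) {
        return strtoupper(dechex($cp));
    }, $ints);
}

// The family emoji: man, ZWJ, woman, ZWJ, boy, ZWJ, boy.
$family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";
// Thumbs up followed by the skin tone modifier U+1F3FC.
$thumb = "\u{1F44D}\u{1F3FC}";

echo implode(' ', emoji_code_points($family)), "\n";
// 1F468 200D 1F469 200D 1F466 200D 1F466
echo implode(' ', emoji_code_points($thumb)), "\n";
// 1F44D 1F3FC
```

The escapes are written out explicitly so the invisible ZWJ characters are visible in the source.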

    To further complicate things, I've been talking about Unicode code points, but these code points can be represented as different byte sequences depending on the string encoding. Typically we only need to worry about handling UTF-8 encoded strings now, so that's where I started. The UTF-8 encoding of a character like "A" is the same as its ASCII encoding, using only one byte. However, a character such as 👍 takes four bytes in UTF-8. This means finding meaningful emoji in a string is not as simple as reading byte by byte, and is not even as simple as reading one UTF-8 character at a time, since a single emoji can span several code points.
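The gap between bytes, code points and visible emoji can be checked with a short sketch. It assumes PHP 7+ and relies only on core functions plus PCRE's `/u` mode. `count_code_points` is a hypothetical helper name.

```php
<?php
// Count code points by matching one UTF-8 character at a time (the /u flag).
function count_code_points(string $s): int {
    return (int) preg_match_all('/./us', $s);
}

$samples = [
    'A'      => "A",
    'thumb'  => "\u{1F44D}",
    'toned'  => "\u{1F44D}\u{1F3FC}",
    'family' => "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}",
];

foreach ($samples as $label => $s) {
    // strlen() counts bytes, not characters.
    printf("%-6s bytes=%2d code_points=%d hex=%s\n",
        $label, strlen($s), count_code_points($s), bin2hex($s));
}
// "A" is 1 byte, the thumbs up alone is 4 bytes (f09f918d),
// and the family emoji is 25 bytes across 7 code points.
```

Each of the four samples displays as a single glyph, yet neither the byte count nor the code point count is 1 for the last two.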

    Thankfully, EmojiOne has done the hard work of finding the emoji characters in a string. However, their library doesn't have a way to return the emoji it found; it can only be used to replace them. I also didn't like the list of short names they use; I prefer the Slack names instead.

    What I ended up doing was combining the parsing regex from EmojiOne with the emoji data from Slack's data set. I turned this into a library that returns the data I want to use. Here's how it works.
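The general idea of pairing a matching regex with a lookup table can be sketched as follows. This is not the library's code. The pattern is a deliberately simplified stand-in for the EmojiOne regex, the three-entry table is a stand-in for Slack's data set, and the short names in it are assumptions for illustration. All `sketch_` and `SKETCH_` names are hypothetical. It assumes PHP 7+ with iconv.

```php
<?php
// Illustrative miniature lookup table keyed by hyphen-joined hex code points.
// The short names are assumed for this sketch; the real data set is far larger.
const SKETCH_EMOJI_NAMES = [
    '1F44D' => '+1',
    '1F389' => 'tada',
    '1F468-200D-1F469-200D-1F466-200D-1F466' => 'man-woman-boy-boy',
];

// Simplified pattern, NOT the EmojiOne regex: one symbol from a common emoji
// range, an optional skin tone modifier, then any number of ZWJ-joined symbols.
const SKETCH_EMOJI_REGEX =
    '/[\x{1F300}-\x{1FAFF}][\x{1F3FB}-\x{1F3FF}]?' .
    '(?:\x{200D}[\x{1F300}-\x{1FAFF}][\x{1F3FB}-\x{1F3FF}]?)*/u';

function sketch_find_emoji(string $input): array {
    preg_match_all(SKETCH_EMOJI_REGEX, $input, $matches);
    $found = [];
    foreach ($matches[0] as $sequence) {
        // Drop skin tone modifiers so the key matches the base emoji entry.
        $base = preg_replace('/[\x{1F3FB}-\x{1F3FF}]/u', '', $sequence);
        // One 32-bit big-endian integer per code point.
        $ints = array_values(unpack('N*', iconv('UTF-8', 'UTF-32BE', $base)));
        $key = strtoupper(implode('-', array_map('dechex', $ints)));
        $found[] = [
            'emoji' => $sequence,
            'short_name' => SKETCH_EMOJI_NAMES[$key] ?? null,
        ];
    }
    return $found;
}

$input = "Hello \u{1F44D}\u{1F3FC} World "
       . "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F466}\u{200D}\u{1F466}";
foreach (sketch_find_emoji($input) as $e) {
    echo $e['emoji'], ' => ', $e['short_name'] ?? '(unknown)', "\n";
}
```

The regex handles the hard part of finding where each emoji sequence starts and ends, and the table only has to map a normalized key to a name.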

    Given an input string that may contain emoji characters, this function will find any emoji in the string and return an array with information about each one.

      $input = "Hello 👍🏼 World 👨‍👩‍👦‍👦";
      $emoji = Emoji\detect_emoji($input);

    Each item in the returned array contains the following fields:

    • emoji - The emoji sequence found, as the original byte sequence. You can output this to show the original emoji.
    • short_name - The short name of the emoji, as defined by Slack's emoji data.
    • num_points - The number of unicode code points that this emoji is composed of.
    • points_hex - An array of each unicode code point that makes up this emoji. These are returned as hex strings. This will also include "invisible" characters such as the ZWJ character and skin tone modifiers.
    • hex_str - A list of all unicode code points in their hex form separated by hyphens. This string is present in the Slack emoji data array.
    • skin_tone - If a skin tone modifier was used in the emoji, this field indicates which skin tone, since the short_name will not include the skin tone.
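As a rough illustration of how the code point fields above relate to each other, the sketch below derives them for a single emoji sequence. It is not the library's implementation. `sketch_describe_emoji` is a hypothetical helper, the `skin-tone-N` label format is an assumption based on the "skin-tone-3" name used earlier, and `short_name` is left out because it needs the Slack data set. It assumes PHP 7+ with iconv.

```php
<?php
// Hypothetical helper that derives the code point fields for one emoji sequence.
function sketch_describe_emoji(string $emoji): array {
    // One 32-bit big-endian integer per code point.
    $ints = array_values(unpack('N*', iconv('UTF-8', 'UTF-32BE', $emoji)));
    $points_hex = array_map(function ($cp) {
        return strtoupper(dechex($cp));
    }, $ints);

    // Modifiers U+1F3FB..U+1F3FF are labeled skin-tone-2..skin-tone-6 here
    // (an assumed naming, consistent with U+1F3FC being "skin-tone-3").
    $skin_tone = null;
    foreach ($ints as $cp) {
        if ($cp >= 0x1F3FB && $cp <= 0x1F3FF) {
            $skin_tone = 'skin-tone-' . ($cp - 0x1F3FB + 2);
        }
    }

    return [
        'emoji' => $emoji,
        'num_points' => count($ints),
        'points_hex' => $points_hex,
        'hex_str' => implode('-', $points_hex),
        'skin_tone' => $skin_tone,
    ];
}

// Thumbs up with the U+1F3FC modifier: two code points, hex_str "1F44D-1F3FC".
print_r(sketch_describe_emoji("\u{1F44D}\u{1F3FC}"));
```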

    This package is now available on GitHub, and via Composer!

    composer require p3k/emoji-detector

    🎉👍

    Portland, Oregon
    Sat, Feb 18, 2017 2:38pm -08:00 #100daysofindieweb #emoji #p3k #unicode

© 1999-2025 by Aaron Parecki. Powered by p3k. This site supports Webmention.
Except where otherwise noted, text content on this site is licensed under a Creative Commons Attribution 3.0 License.