Apple has been iterating for years on a curious set of APIs called ARKit that has been largely ignored by the iPhone & iPad developer community, which hints that more compelling hardware is coming to take full advantage. CEO Tim Cook has openly said AR is the ‘next big thing’.
Rumors suggest Apple is developing an augmented reality (AR) glasses product. This product would likely act as an iPhone peripheral, at least initially, similar to how Apple Watch once relied on a host iPhone to provide the main computational grunt. Well-informed supply chain analyst Ming-Chi Kuo has said the AR glasses “will primarily take a display role offloading computing, networking, and positioning to the iPhone.”
Apple has recently introduced M1 Macs powered by Apple Silicon. These Macs are notable because they bring a marked improvement in battery life and performance. But they also bring Apple’s developer devices finally in line with the more capable hardware of their consumer devices.
Apple’s head of software engineering Craig Federighi talked with Ars Technica about the advantages of M1’s unified memory architecture:
Where old-school GPUs would basically operate on the entire frame at once, we operate on tiles that we can move into extremely fast on-chip memory, and then perform a huge sequence of operations with all the different execution units on that tile. It’s incredibly bandwidth-efficient in a way that these discrete GPUs are not.
So how would you develop apps for such a device? Let’s look at how developing software for the iPhone works today.
Developers buy Macs and install Xcode, which allows them to write, compile, and deploy iPhone (and iPad, Mac, Watch & TV) apps. To actually try out their apps, developers either push the app to their own iPhone and launch it like any other app, or they run it directly on their Mac within the Simulator. They choose which device they want to simulate and then see an interactive representation of the device’s screen running their app within a window on their Mac.
Until now this has worked by compiling the iPhone or iPad software for Intel chips, allowing the app to be run ‘natively’ on the Mac. Macs are powerful enough to run several of these simulators at once. However, checking graphics-intensive experiences such as 3D or animation sometimes means avoiding the simulator and trying the app directly on the target device. On the whole the simulator does a capable enough job of previewing the experience of an app.
(An iPad version of Xcode has been speculated about for years, but even with the iPad’s improved keyboards and fancy trackpads, nothing has been released. The Mac maintains its role as the developer device for Apple’s platforms.)
How will this work for the AR glasses? Will Xcode provide an AR glasses Simulator for Mac? Would that appear as a window on screen with a preview for each eye? Or would you need to push the app to an actual device to preview?
If a simulator was provided, the pre-Apple-Silicon technology of an Intel chip and AMD GPU would not be able to reproduce the capabilities of a unified memory architecture, tiled rendering, and the neural engine. It would either run poorly, at low frame rates, or some capabilities might not even be possible at all. An Intel Mac can simulate software but it cannot simulate hardware. A Mac with related Apple Silicon hardware would allow a much better simulation experience.
Instead of seeing a preview of the AR display on your Mac’s screen, consider if the product could pair directly to your Mac. The developer could see a live preview of their work. The Mac could act as a host device instead of the iPhone, providing the computation, powerful graphics, machine learning, and networking needs of the AR glasses.
With the same set of frameworks brought over allowing iPhone & iPad apps to be installed and run on the Mac, both software and hardware will be ready to run AR-capable apps designed for iPhone. The Mac is now a superset of iPhone, and so what the iPhone can do, the Mac can also do. App makers now have a unified developer architecture.
Perhaps AR-capable apps from the iPhone App Store could even be installed by normal users directly on their Mac. With augmented reality perhaps the glasses will augment the device you currently use, whether that’s the iPhone in your pocket or the Mac on your desk, and allow switching back and forth as easily as a pair of AirPods (which would likely be used together with AR glasses).
There’s one last picture I want to leave you with. Swift Playgrounds works by showing a live preview of interactive UI alongside editable code. Change the code and your app immediately updates. The Simulator has been integrated into the app developer experience.
Now imagine Swift Playgrounds for AR — as I edit my code do my connected AR glasses instantly update?
I want to cover what the user experience of an AR Glasses product from Apple could look like, and how it might integrate with today’s products. First, let’s survey Apple’s current devices and their technologies for input & output:
iPhone input methods
Multitouch: use your fingers naturally for UI interactions, typing text, drawing, scrolling
Voice: request commands from Siri such as changing volume or app-switching, dictating instead of typing
iPad input methods
Multitouch: use your fingers naturally for UI interactions, typing text, scrolling, drawing
External Keyboard: faster and more precise typing than multitouch
Pencil: finer control than multitouch, especially for drawing
Trackpad: finer control than multitouch, especially for UI interactions
Voice: request commands from Siri such as changing volume or app-switching, dictating text
Mac input methods
Keyboard: dedicated for typing text, also running commands
Trackpad / Mouse: move and click pointer for UI interactions, scrolling, drawing
Function keys: change device preferences such as volume, screen/keyboard brightness; app-switching
Touch Bar: change volume, screen/keyboard brightness; enhances current app with quick controls
Voice: request commands from Siri such as changing volume or app-switching, dictating text
Apple Watch input methods
Touch: use your fingers for UI interactions, awkwardly typing text, scrolling
Digital Crown: change volume, scroll, navigate back
Voice: request commands from Siri such as changing volume or app-switching, dictating text
AirPods input methods
Voice: request commands from Siri such as changing volume, dictating text
Tap: once to play/pause, double to skip forward, triple to skip backward
So what would a rumoured Apple Glasses product bring?
Apple’s Design Principles
Deference to Content
The sparsest approach might be to rely on voice for all input. Siri would become a central part of the experience, and be the primary way of switching apps, changing volume, and dictating text. Siri can currently be activated from multiple devices, whether personal hand-held devices such as iPhone or shared devices such as HomePod. So it makes sense that the glasses would augment this experience, providing visual feedback that accompanies the current audible feedback.
Contrast Siri’s visual behaviour between iPadOS 13 and 14:
This provides a glimpse of the philosophy of the Apple Glasses. Instead of completely taking over what the user currently sees, Siri will augment what you are currently doing with a discreet, compact design.
This also relates to the Defer to Content design principle that has been present since iOS 7 which was the opening statement from Apple’s current design leadership. So we can imagine a similar experience with the Glasses, but where the content is everything the user sees, whether that’s digital or physical.
Content from a traditional app could be enhanced via augmentation. A photo or video in a social media feed might take over the user’s view, similar to going into full screen. Text might automatically scroll or be spoken aloud to the user. Content might take over briefly, and then be easily dismissed to allow the user to get back to their life.
Widgets such as weather or notifications such as received messages might be brought in from the outside to the centre. I can imagine a priority system running from the viewer’s central vision to the extremes of their field of view. Content could be pinned to the periphery and be glanced at, while periodically in the background it receives updates.
If worn together with a set of AirPods, an even more immersive experience would be provided, with the AirPods’ tap input for playing and skipping. The active noise cancellation mode would probably pair well with a similar mode for the glasses, blocking the outside world for maximum immersion. Its counterpart transparency mode would allow the user to reduce the audible and visual augmentation to a minimum.
So with a Glasses product, what is the content? It’s the world around you. But what if the world sometimes is an iPhone or Mac you use regularly through your day? Do the Glasses visually augment that experience?
With AirPods you can hop from an iPhone to a Mac to an iPad, and automatically switch the device that is paired. Wouldn’t it make sense for the AirPods and Glasses to perform as synchronised swimmers and pair automatically together to the same device that someone decides to use?
Can the Glasses recognise your device as being yours and know its precise location in the Glasses’ field of view? That sounds like a job for the U1 chip that was brought to iPhone 11, which, as 9to5Mac describes it, “provides precise location and spatial awareness, so a U1-equipped device can detect its exact position relative to other devices in the same room.”
Perhaps instead of tapping your iPhone screen to wake it, you can simply rest your eyes on it for a moment and it will wake up. The eyes could be tracked by the Glasses and become an input device of their own. If precise enough they could move the cursor on an iPad or Mac. The cursor capabilities of iPadOS 13.4 brought a new design language with UI elements growing and moving as they were focused on, and subtly magnetised to the cursor as it floated across the screen.
Similar affordances could allow a Glasses user’s eyes to replace the cursor, with the realtime feedback of movement and size increase enough to let the user know exactly what is in focus. The Mac might not need touch if the eyes could offer control.
In the physical world, a similar effect to Portrait mode from iPhone could allow objects in the world to also be focused on. The targeted object would remain sharp, and everything around it would become blurred, literally putting it into focus.
AirTags could enhance physical objects by providing additional information to their neighbour. Instead of barcodes or QR codes, the product itself could advertise its attributes and make it available for purchase via Apple Pay.
Use Depth to Communicate
If the Glasses not only show you the world around you but also see the world around you, then your hands gesturing signals in the air could also be a method of input. Simple gestures could play or pause, skip ahead or back, or change the volume. They could also be used to scroll content or interact with UI seen through the Glasses.
These gestures would close the loop between input and output. The iPad’s multitouch display works so well because of direct manipulation: your fingers physically touch the UI your eyes see. As your fingers interact and move, the visuals move with them. The two systems of touch input and flat-panel-display output become one to the user. Hand gestures would allow direct manipulation of the content seen through the Glasses.
Speculated Apple ‘Glasses’ input methods
Voice via AirPods or nearby device: request commands from Siri such as changing volume, dictating text
Eyes: interact with devices that have a cursor, focus on elements whether digital or physical
Air Gestures: use your hands for UI interactions, scrolling, changing the volume, playing, pausing, skipping
U1: recognise nearby Apple devices and interact with them
Plus whatever device you are currently using (if any)
So the Glasses could offer a range of novel input methods from a user’s eyes to their hands, or they could simply rely on the ubiquitous voice-driven world that most Apple devices now provide. The U1 chip seems to hint at an interaction between Glasses and hand-held device, perhaps as modest as simply recognising it, or perhaps augmenting its input and output to allow a new way to interact with iPhones, iPads, and Macs. The Glasses accompany what the user already sees and interacts with every day, enhancing it visually but deferring to the outside world when needed. They could offer an immersive experience for content such as video and games, or future formats that Apple and other AR-device-makers hope will become popular.
When you visit a new city, one thing you expect to see are landmarks. Statues, botanical gardens, theaters, skyscrapers, markets. These landmarks help us navigate around unfamiliar areas by being reference points we can see on the horizon or on a map.
As makers of the web, we can also provide landmarks to people. These aren’t arbitrary: there are eight landmarks that are part of the HTML standard:
navigation
main
banner
search
form
contentinfo
complementary
region
Some of these seem obvious, but some are odd — what on earth does “contentinfo” mean? Let’s walk through what they are, why they are important to provide, and finally how we can really easily use them.
Nearly all websites have a primary navigation. It’s often presented as a row of links at the top of the page, or under a hamburger menu.
Stripe’s navigation provides links to the primary pages people want to visit. It’s clear, and follows common practices of showing only a handful of links, and placing the link to sign in up on the far right.
Most visual users would identify this as the primary navigation of the site, and so you should denote it as such in your HTML markup. Here’s what you might write for Stripe’s navigation:
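As a minimal sketch (the link labels and URLs below are illustrative assumptions, not Stripe’s actual markup), the navigation might be written along these lines:

```html
<!-- Sketch only: link labels and URLs are hypothetical -->
<nav>
  <ul>
    <li><a href="/products">Products</a></li>
    <li><a href="/developers">Developers</a></li>
    <li><a href="/pricing">Pricing</a></li>
    <li><a href="/support">Support</a></li>
    <li><a href="/signin">Sign in</a></li>
  </ul>
</nav>
```

Wrapping the links in a list is a common convention rather than a requirement; the landmark comes from the surrounding navigation element.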
Here we use HTML 5’s <nav> element, which automatically has the navigation landmark.
If there’s only one navigation landmark on a page, then people using screen readers can jump straight to it and the links inside. They can visit Stripe’s Support page in a few seconds. It’s like a city subway that connects key landmarks, allowing fast travel between them.
What if you have multiple navigations? Let’s look at GitHub for an example.
Here we have a black bar offering links to the main parts of the GitHub experience: my pull requests, my issues, the marketplace, notifications, etc.
But I am on the page for a particular repository, and it also has its own navigation: Code, Issues, Pull requests, Actions, etc.
So how do we offer both? And how do people using screen readers know the difference? By attaching labels to each navigation: the top navigation has the label Global, and the repository-specific navigation has the label Repository. It’s like a city having multiple sports stadiums: here in Melbourne we have the MCG (used for football and cricket) and the Rod Laver Arena (used for tennis and music). They have clearly different names, which means people can find them easily and won’t mix them up.
Now people using screen readers or similar browser tools can see that there are two navigations to pick from, one named Global and one named Repository.
Note also we have an aria-current="page" attribute on the link that represents the page the user is on. This is equivalent to a 🔴 You Are Here mark on a public map.
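A rough sketch of two labelled navigations could look like the following. The labels Global and Repository come from the description above; the link text and URLs are hypothetical and this is not GitHub’s real markup:

```html
<!-- Sketch only: link text and URLs are hypothetical -->
<nav aria-label="Global">
  <a href="/pulls">Pull requests</a>
  <a href="/issues">Issues</a>
  <a href="/marketplace">Marketplace</a>
</nav>

<nav aria-label="Repository">
  <!-- aria-current marks the link for the page the user is on -->
  <a href="/example/repo" aria-current="page">Code</a>
  <a href="/example/repo/issues">Issues</a>
  <a href="/example/repo/pulls">Pull requests</a>
  <a href="/example/repo/actions">Actions</a>
</nav>
```

Screen readers typically append the landmark type themselves, so a label of “Global” tends to be announced as something like “Global navigation”; that is why the label itself leaves out the word “navigation”.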
When watching a show on Netflix, you’ll often be presented with a Skip intro button. This fast-forwards past the intro content, which is often the same every time, to the part you want to watch: the new episode.
Imagine if that Skip intro button didn’t exist: what would you do? You could watch the minute-long intro every time. Or you could attempt to fast-forward to the spot where the show actually starts. One is tedious and the other is error-prone. It would be a poor user experience.
On the web, our users might find themselves in the same situation. If they use a screen reader, they’ll probably hear all the items in our navigation and header. And then eventually they’ll reach the headline or the part that’s new — the part they are interested in — just like a TV episode. They could fast-forward, but that also would be error-prone. It would be great if we could allow them to skip past the repetitive stuff to the part they are actually interested in.
Enter <main>. Use this to wrap the part of the page where your ‘episode’ actually starts. People using screen readers can then skip past the tedious navigation and other preambles.
By using <main> we have allowed users to skip the intro.
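As a simple sketch of where the element sits in a page (the headings and text are placeholders):

```html
<body>
  <header>
    <!-- Repeated on every page: logo, navigation, and so on -->
    <nav>
      <a href="/">Home</a>
      <a href="/about">About</a>
    </nav>
  </header>

  <!-- The "episode": the content that is unique to this page -->
  <main>
    <h1>Page headline</h1>
    <p>The content people came here for.</p>
  </main>
</body>
```

A page should have only one visible main element, so it wraps everything that is unique to the page rather than a single paragraph within it.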
We’ve already talked about the top strip on most websites, and these also have a role. Banners hold the primary navigation and also: logo, search field, notifications, profile, or other site-wide shortcuts. The banner often acts as the consistent branding across all pages.
Here’s GitHub’s banner when I’m signed in. The part I’ve highlighted with the yellow outline is the navigation (using <nav>). The entire element uses <header>, which automatically gains the role of banner if it meets the following (via MDN):
Assistive technologies can identify the main header element of a page as the banner if is a descendant of the body element, and not nested within an article, aside, main, nav or section subsection.
So the following <header> has the role of banner:
<body><header>…</header></body> <!-- Direct child of <body>: this gains the banner role -->
While this one doesn’t:
<main><header>…</header></main> <!-- Nested inside <main>: this does not gain the banner role -->
And you can use multiple <header> elements for things other than banners, if you nest them inside “article, aside, main, nav or section” as MDN mentions.
Because of this, I might recommend that you add the banner role explicitly, as it will make it easier to identify and also target with CSS (e.g. header[role=banner] selector).
<header role="banner"> <!-- Add the role explicitly -->
<header> <!-- Because this is nested inside <main>, it won’t gain the banner role -->
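Putting both cases together, a page skeleton might look like this sketch (the site name, links, and article content are placeholders):

```html
<body>
  <!-- Site-wide banner: role added explicitly -->
  <header role="banner">
    <a href="/">Example Site</a>
    <nav>
      <a href="/docs">Docs</a>
      <a href="/blog">Blog</a>
    </nav>
  </header>

  <main>
    <article>
      <!-- Nested inside <main> and <article>: no banner role -->
      <header>
        <h1>Article title</h1>
        <p>Published by an example author</p>
      </header>
      <p>Article body.</p>
    </article>
  </main>
</body>
```

With the explicit role in place, a selector such as header[role=banner] styles only the site banner and leaves the article’s header untouched.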
Banners don’t necessarily have to be a horizontal strip. Twitter has a vertical banner:
The banner here is the entire left-hand column containing Home, Explore, etc. It’s also implemented with a <header role="banner">. The HTML 5 elements are named more for their concept than their visual presentation.
Search is one of the things that makes the web great. You have an idea of what you are looking for, you type it in, and in seconds you’ll likely be shown it.
Again we see a <form> with role="search". If you decide to add a search form to your site, make sure it has the search role.
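A minimal search form with the search role could be sketched like this (the action URL, field name, and id are assumptions for illustration):

```html
<!-- Sketch only: action URL, field name, and id are hypothetical -->
<form role="search" action="/search" method="get">
  <label for="site-search">Search</label>
  <input type="search" id="site-search" name="q">
  <button type="submit">Search</button>
</form>
```

The visible label keeps the field understandable for everyone, while the role lets people using assistive tech jump straight to the form.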
If you have another form not used for search, say for signing in or creating a new document, then the form role helps out here. The built-in <form> element actually already has the form role implicitly. So what’s left to do?
First, ensure it is labelled so people know what the form is for. That way, if there are multiple forms on a page, they can tell them apart. Also, people can jump straight to the form and start filling it out.
You can add a label by adding an aria-label attribute (note: avoid title):
<form aria-label="Create a new repository">
<h2>Create a new repository</h2>
Or by identifying which heading acts as the form’s label:
<form aria-labelledby="new-repo-heading">
<h2 id="new-repo-heading">Create a new repository</h2>
Note in both cases we still have a heading — your forms should probably have a label that is readable by all users, not just those using assistive-tech.
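Here is a fuller sketch of the heading-labelled approach. The heading text and id come from the snippets above; the action URL and the single field are hypothetical:

```html
<!-- Sketch only: action URL and field are hypothetical -->
<form aria-labelledby="new-repo-heading" action="/repositories" method="post">
  <h2 id="new-repo-heading">Create a new repository</h2>

  <label for="repo-name">Repository name</label>
  <input type="text" id="repo-name" name="name" required>

  <button type="submit">Create repository</button>
</form>
```

Pointing aria-labelledby at the visible heading keeps the spoken name and the on-screen name in sync. As far as I’m aware, browsers generally only expose a form as a landmark once it has an accessible name like this, which is one more reason to label it.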
Ok, so the names have been pretty logical so far. And then we come to contentinfo. What on earth does that mean?
Let’s show some examples of where contentinfo has been used in the wild:
It’s a footer! With lots of links. And a copyright.
Akin to the banner role and its automatic provider <header>, we can use <footer>:
<footer> <!-- Because this is nested inside <main>, it won’t gain the contentinfo role -->
<footer role="contentinfo"> <!-- Add the role explicitly -->
And also like <header>, it only gains the role when it is not nested inside an article, aside, main, nav or section. However, it’s recommended that you add role="contentinfo" explicitly to the desired element due to long-running issues with Safari and VoiceOver.
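A sketch of both kinds of footer on one page might look like this (the links, author line, and company name are placeholders):

```html
<body>
  <main>
    <article>
      <h1>Article title</h1>
      <p>Article body.</p>
      <!-- Nested inside <main> and <article>: no contentinfo role -->
      <footer>
        <p>Written by an example author</p>
      </footer>
    </article>
  </main>

  <!-- Site-wide footer: role added explicitly -->
  <footer role="contentinfo">
    <nav aria-label="Footer">
      <a href="/about">About</a>
      <a href="/privacy">Privacy</a>
    </nav>
    <p>© Example Company</p>
  </footer>
</body>
```

The footer’s own navigation gets a label so it can be told apart from any primary navigation in the banner.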
Hierarchy is a core principle of visual design. Some parts of a design will be more important than others, and so it is important that the reader is aware of what they should draw their attention to, and what is less important.
Visual users are aided by size, layout, contrast — and so we need a semantic approach too for non-visual users. This might be a user using a screen-reader. Or it might be a search engine’s web crawler, or someone using the reader view available in Safari and Firefox.
A simple hierarchical relationship is primary content supported by complementary content. Some examples of these are:
Footnotes to an article
Links to related content
Comments on a post
Here’s an example article with footnotes, pull quotes, and related links:
<h1>Why penguins can’t fly</h1>
<p>Penguins are … </p>
<p>Their feathers … </p>
Penguins swim fast due to air bubbles trapped in their feathers<sup><a href="#footnote-1">1</a></sup>
<p>Speeds of … </p>
<p>They eat … </p>
<a href="https://www.nationalgeographic.com/magazine/2012/11/emperor-penguins/">National Geographic: Escape Velocity</a>
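One way to mark up the supporting pieces is with the aside element, which carries the complementary role. The sketch below reuses the article content above; where each aside sits, its label, and the footnote placeholder are assumptions on my part:

```html
<main>
  <article>
    <h1>Why penguins can’t fly</h1>
    <p>Penguins are … </p>
    <p>Their feathers … </p>

    <!-- Pull quote: supports the article without being part of its flow -->
    <aside aria-label="Pull quote">
      Penguins swim fast due to air bubbles trapped in their feathers<sup><a href="#footnote-1">1</a></sup>
    </aside>

    <p>Speeds of … </p>
    <p>They eat … </p>

    <aside aria-label="Footnotes">
      <ol>
        <li id="footnote-1">Footnote text goes here.</li>
      </ol>
    </aside>
  </article>

  <aside aria-label="Related links">
    <a href="https://www.nationalgeographic.com/magazine/2012/11/emperor-penguins/">National Geographic: Escape Velocity</a>
  </aside>
</main>
```

Labelling each aside helps when there are several on a page, and an aside nested inside an article may only be surfaced as a landmark when it has a label.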
We have covered seven landmarks — what’s left? The generic landmark of region. Use it as a last resort — first reach for one of the above landmarks.
Again, HTML 5 helps us out here: we can use <section>. It’s important that you add an aria-label attribute (or aria-labelledby) to name the landmark, so a user knows why it is important and can tell it apart from other landmarks.
<section aria-label="quick summary">
In this Smashing TV webinar recording, join Léonie Watson (a blind screen reader user) as she explores the web…
This allowed Léonie (who suggested the change) to identify the summary, and skip it if she liked.
Remember, reach for the navigation, banner, and contentinfo roles (<nav>, <header>, <footer>) before using region. The HTML spec gives these examples of sections:
Examples of sections would be chapters, the various tabbed pages in a tabbed dialog box, or the numbered sections of a thesis. A Web site’s home page could be split into sections for an introduction, news items, and contact information.
We’ve been using <article> in some of the examples previously — is this also a landmark? The answer is technically no, but more or less yes. Bruce Lawson goes into detail on why you should use <article> over <section>:
So a homepage with a list of blog posts would be a <main> element wrapping a series of <article> elements, one for each blog post. You would use the same structure for a list of videos (think YouTube) with each video being wrapped in an <article>, a list of products (think Amazon) and so on. Any of those <article>s is conceptually syndicatable — each could stand alone on its own dedicated page, in an advert on another page, as an entry in an RSS feed, and so on.
An article element also helps stripped-back browsers, such as the one on Apple Watch or reader views, know what content to jump to. And many screen readers will surface them as a place of interest.
I encourage you to view landmarks on news sites, social media such as Twitter, web apps such as GitHub, and everything in between. You’ll find that there’s a fair amount of consistency, and some will be better than others. You’ll also have a bar to meet when building your own.
These landmarks apply to all websites: landing pages, documentation, single-page-apps, and everything in between. They ensure all users can orient themselves to quickly become familiar with and navigate around your creation.
They also provide a consistent language that we can design and build around. Share this and other articles (which I’ll link to below) with developers, designers, and managers on your team. Landmarks provide familiarity, which leads to happier users.
Xerox Alto. Work on the Xerox Alto, the first GUI-oriented computer, started in November 1972 because of a bet: “Chuck said that a futuristic computer could be done ‘in three months’ and a Xerox exec bet him a case of wine that it couldn’t be done”.
Git. Linus Torvalds started working on Git on April 3 2005. It was self-hosting 4 days later. On April 20 2005, 17 days after work commenced, Linux 2.6.12-rc3 was publicly released with Git.
BankAmericard. Dee Hock was given 90 days to launch the BankAmericard card (which became the Visa card), starting from scratch. He did. In that period, he signed up more than 100,000 customers.
Amazon Prime. Amazon started to implement the first version of Amazon Prime in late 2004 and announced it on February 2 2005, six weeks later.
In a web app, state refers to local state. It only knows a subset of all information. And that information is from a certain point in time: it might have changed since.
(In a server-side web app like one built with Rails, your application usually talks to the database directly.)
So local state. It’s your web app’s state of the world at a certain time. That world depends on factors, such as which account is signed in, or perhaps whether the user is not signed in, which page they are looking at, and what capabilities their account has.
For your web app to display information, it has to first load it.
So your web app loads this initial information, and this becomes its state of the world. If five minutes later, the latest information is desired, it must load it again.
Making things more complicated is the fact that there is not just one set of information.
Take Instagram as an example. There is the feed of photos. As you scroll through the feed, more photos are loaded. For a good user experience, photos are loaded before you reach them. But still, only a subset of photos are loaded. It would make little sense to load your feed from the top to the very bottom, as it would likely be hundreds of megabytes of information in total, and be a huge strain on Instagram’s servers to collate all that information.
Instagram offers more than just a feed. You can explore and search for photos. You can take or upload a photo. You can see a list of activity aimed at you, such as recent comments or likes on your posts. You can also look at your own profile, with the photos you have posted.
Depending on which section you look at, and how you interact within that section (scroll, tap to see details, tap to go to someone’s profile), new information will have to be loaded. The app’s state of the world expands. It goes from a small subset of data, to a larger subset of data.
Managing this state, and coordinating the loading of additional or fresher state, is one of the key skills in building web apps.
A step back. A command line app.
Any app that has an interactive user interface has state. Think of PowerPoint and the file that is being viewed, the currently active slide, whether that slide is being edited, viewed with slides listed to the side, or is being presented. All of those variables are state.
A command line app also has state. The less command efficiently reads from a file, only presenting a slice of it that fits in the terminal window. As the user scrolls up and down, the app’s state is updated with the offset of the slice into the file.
A command line app is a useful starting point, because many operate in the same manner as web apps based on React. In React, the application is built by deriving the entire user interface from state. The application developer declares their intent for what should be displayed on screen given a certain state. If the state changes, then that codified intent is used again, but with the updated state. And so on, every user interaction usually affects the state in some way, and the user interface is updated quickly to respond. This pipeline approach is popular with games, where the displayed image is rebuilt again and again many times a second, fast enough to produce fluid motion.
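The pipeline idea can be sketched without any framework. The page below is a toy illustration, not how React works internally: the names state, render, and update are my own, and the whole interface is rebuilt from state on every change.

```html
<!DOCTYPE html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <title>Rendering from state</title>
  </head>
  <body>
    <main id="app"></main>

    <script>
      // The entire user interface is derived from this one state object.
      let state = { count: 0 };

      // Declares what should be on screen for a given state.
      function render(currentState) {
        return `
          <p>Count: ${currentState.count}</p>
          <button type="button">Add one</button>
        `;
      }

      // Swap in the new state, then rebuild the interface from scratch.
      function update(nextState) {
        state = nextState;
        document.getElementById("app").innerHTML = render(state);
      }

      // Every interaction produces new state rather than editing the page directly.
      document.getElementById("app").addEventListener("click", (event) => {
        if (event.target.closest("button")) {
          update({ count: state.count + 1 });
        }
      });

      // Initial render.
      update(state);
    </script>
  </body>
</html>
```

Replacing the markup wholesale is the simplest possible version of this approach and has costs, such as losing keyboard focus on the button after each click. Libraries like React keep the same “derive the interface from state” model while updating only what changed.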
The alternative is to intertwine the state with what is being presented. As new information comes in, it is not a simple change to state. It is a manual change to what’s on screen too.
A pipeline is like correcting a typo in a printed document by making the change in software, throwing the old copy away, and printing a fresh copy. Fortunately, an app is built entirely in software, which means that nothing is wasted.
An intertwined approach is more akin to correcting a typo by using whiteout and a fine pen. It’s much less wasteful, but takes much more effort and skill than just printing a new copy would. The same is true for apps.