First and Last name validation for forms and databases

Ahmed Tokyo
7 min read · Jun 10, 2021

First name and last name validation is not trivial. How many times have you had to build a database table with a first name and last name? Let me guess, your answer is ‘All the time’.

Now, how many times have you googled “How to validate human names?” With every project we create, we face this problem. An easy solution would be to use validator.isAlpha(name, 'en-US') right out of NPM, but we always realize that we need to allow a larger subset of characters. As our project grows, our user base grows, and the validation rules for user names have to widen with it.

Let’s solve this problem once and for all. We will start by supporting North American, Australian and English names. We will then add support for Western European names, followed by human names from North America, South America, Europe, Africa, Australia and Asia. We will also learn how to do fine-grained validation for individual languages like Arabic or Russian.

Starting point

We will start by creating a simple NAME_REGEX regular expression in JavaScript which allows "word" characters. We will also create the function isValidName, which tests a string against NAME_REGEX. This function will be run on the first name and last name separately. This gives us more validation control and keeps us compatible with the many systems, such as banks, that store first name (given name) and last name (surname) in separate fields.

const NAME_REGEX = /^\w+$/

/** Validates a name field (first or last name) */
const isValidName = (name) => NAME_REGEX.test(name)

Remove numbers and underscores

We will now fine-tune our regex so that it no longer allows numerical digits and underscores, both of which \w matches.

const NAME_REGEX = /^[a-zA-Z]+$/

We replaced the \w with [a-zA-Z].
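As a quick check of what this change buys us, the sketch below runs both patterns over the same inputs. The names WORD_REGEX and LETTER_REGEX are hypothetical, used only so that both versions fit in one snippet.

```javascript
// Hypothetical names: the article calls both versions NAME_REGEX.
const WORD_REGEX = /^\w+$/          // letters, digits and underscore
const LETTER_REGEX = /^[a-zA-Z]+$/  // ASCII letters only

for (const name of ['alice', 'lara1', 'a_a']) {
  console.log(name, WORD_REGEX.test(name), LETTER_REGEX.test(name))
}
// alice true true
// lara1 true false
// a_a true false
```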

Allow for multiple words, hyphens and apostrophes

Many first names consist of more than one word, for example Yuv Raj (Indian) and Fatima ElZahraa (Arabic), so we need to allow spaces as well. Irish names can contain apostrophes, for example O'Neil. Some names include hyphens, for example Kunis-Edison.

const NAME_REGEX = /^([a-zA-Z]+[ \-']{0,1}){1,3}$/

We added [ \-']{0,1} to allow 0 or 1 space, hyphen or apostrophe after each run of alphabetic characters. Finally, we wrapped the regex in parentheses and added {1,3}, requiring the group to occur a minimum of 1 and a maximum of 3 times.

An important thing to note is that we want the hyphens, apostrophes and spaces to only appear in the middle of the text, not at the end. For example: O'neil should be valid, but O' should not be valid.
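To make the problem concrete, here is a small sketch that runs the pattern from this step against a few inputs. The constant name TRAILING_OK_REGEX is hypothetical.

```javascript
// The pattern from this step, under a hypothetical name.
const TRAILING_OK_REGEX = /^([a-zA-Z]+[ \-']{0,1}){1,3}$/

console.log(TRAILING_OK_REGEX.test("O'Neil"))   // true
console.log(TRAILING_OK_REGEX.test('Yuv Raj'))  // true
console.log(TRAILING_OK_REGEX.test("O'"))       // true, but we want false
console.log(TRAILING_OK_REGEX.test('a '))       // true, but we want false
```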

const NAME_REGEX = /^[a-zA-Z]+([ \-']{0,1}[a-zA-Z]+){0,2}$/

To achieve this we replaced [ \-']{0,1} with ([ \-']{0,1}[a-zA-Z]+){0,2}. This forces at least one alphabetic letter after the special character, while still allowing the name to have length 1. We replaced {1,3} with {0,2} to keep the same limits (min 1, max 3 blocks): since the regex now starts with a mandatory alphabetic block, the repeated group only has to cover the remaining n-1 blocks.

Side-note: If every name must have at least 2 letters and at most one separator, you can simplify [a-zA-Z]+([ \-']{0,1}[a-zA-Z]+){0,2} to [a-zA-Z]+[ \-']{0,1}[a-zA-Z]+.
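The effect of moving the separator into the middle can be checked with a short sketch. MID_ONLY_REGEX is a hypothetical name for the pattern from this step.

```javascript
// The pattern from this step, under a hypothetical name.
const MID_ONLY_REGEX = /^[a-zA-Z]+([ \-']{0,1}[a-zA-Z]+){0,2}$/

console.log(MID_ONLY_REGEX.test('a'))          // true: a single letter is enough
console.log(MID_ONLY_REGEX.test("O'Neil"))     // true
console.log(MID_ONLY_REGEX.test("O'"))         // false: separator at the end
console.log(MID_ONLY_REGEX.test('a b c'))      // true: three blocks
console.log(MID_ONLY_REGEX.test('a b c d'))    // false: four blocks
console.log(MID_ONLY_REGEX.test('Mary--Ann'))  // false: two separators in a row
```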

Add period support

Also, many names in the United States are written with a trailing period, like Robert Jr.

const NAME_REGEX = /^[a-zA-Z]+([ \-']{0,1}[a-zA-Z]+){0,2}[.]{0,1}$/

We added [.]{0,1} at the end of the main regex. This allows for ending the string with 0 or 1 period.
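A brief sketch of how the optional trailing period behaves, with PERIOD_REGEX as a hypothetical name for the pattern above:

```javascript
// The pattern from this step, under a hypothetical name.
const PERIOD_REGEX = /^[a-zA-Z]+([ \-']{0,1}[a-zA-Z]+){0,2}[.]{0,1}$/

console.log(PERIOD_REGEX.test('Robert Jr.'))    // true: one period at the end
console.log(PERIOD_REGEX.test('Robert Jr'))     // true: the period is optional
console.log(PERIOD_REGEX.test('Jr..'))          // false: only one period allowed
console.log(PERIOD_REGEX.test('mila.eddison'))  // false: period in the middle
```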

So far we have full support for English names. We cover the regions of North America (United States and Canada), Australia and the United Kingdom.

Add extended ASCII character support for internationalization (optional)

So far we only support the basic English-language characters. However, many European names use a larger set of characters. For example, the umlaut (ü) is common in German names, and accented vowels (á, à) are common in Portuguese and French names like Cláudia. Various Portuguese, Spanish, French, German and Nordic names have such characters. If users are asked to enter their name exactly as it appears on their ID, a person named Müller would have to misspell it as Muller and remember that spelling forever. Russian names are written in Cyrillic script, like Владимир. We need to support these characters via Unicode ranges. We will use the range \xC0-\uFFFF, which covers most of the world's scripts while skipping the ASCII and Latin-1 punctuation below U+00C0. Note that such a wide range also admits some non-letter symbols, such as × (U+00D7).

const NAME_REGEX = /^[a-zA-Z\xC0-\uFFFF]+([ \-']{0,1}[a-zA-Z\xC0-\uFFFF]+){0,2}[.]{0,1}$/

We added \xC0-\uFFFF after every instance of a-zA-Z to allow characters from other scripts, such as û, ü, à and á (Latin), أ (Arabic) and л (Cyrillic).

If your app does not need such a wide range of Unicode characters, you can cut it down by checking the Unicode Table and keeping only the ranges you need, e.g. \u0400-\u04FF if you only support Cyrillic script, as used in Ukraine.
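As an illustration of cutting the range down, here is a minimal sketch of a Cyrillic-only variant. The name CYRILLIC_NAME_REGEX is hypothetical, and keeping the hyphen, apostrophe and period rules unchanged is an assumption you may want to revisit for your users.

```javascript
// Hypothetical Cyrillic-only variant: same structure, narrower character set.
const CYRILLIC_NAME_REGEX =
  /^[\u0400-\u04FF]+([ \-']{0,1}[\u0400-\u04FF]+){0,2}[.]{0,1}$/

console.log(CYRILLIC_NAME_REGEX.test('Владимир'))    // true
console.log(CYRILLIC_NAME_REGEX.test('Анна-Мария'))  // true
console.log(CYRILLIC_NAME_REGEX.test('Vladimir'))    // false: Latin letters
```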

Only allow for main western characters (optional)

To allow ONLY western characters, covering North America, South America, Australia and Western Europe, we should allow only the Unicode blocks Basic Latin (via [a-zA-Z]), Latin-1 Supplement, Latin Extended-A and the first part of Latin Extended-B (up to U+01FF).

const NAME_REGEX = /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01FF]+([ \-']{0,1}[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01FF]+){0,2}[.]{0,1}$/

We replaced \xC0-\uFFFF, which covers most Unicode characters, with \u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01FF, which covers only the western character sets and leaves out the multiplication sign × (U+00D7) and the division sign ÷ (U+00F7).
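The difference is easy to see with a few inputs. In this sketch WESTERN_NAME_REGEX is a hypothetical name for the western-only pattern, and the symbols are written as escapes so they are unambiguous.

```javascript
// The western-only pattern, under a hypothetical name.
const WESTERN_NAME_REGEX =
  /^[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01FF]+([ \-']{0,1}[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u01FF]+){0,2}[.]{0,1}$/

console.log(WESTERN_NAME_REGEX.test('Müller'))     // true
console.log(WESTERN_NAME_REGEX.test('Cláudia'))    // true
console.log(WESTERN_NAME_REGEX.test('a\u00D7b'))   // false: multiplication sign
console.log(WESTERN_NAME_REGEX.test('a\u00F7b'))   // false: division sign
console.log(WESTERN_NAME_REGEX.test('Владимир'))   // false: Cyrillic is out of range
```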

Maintenance and additions

You can always allow more or fewer characters by replacing the [a-zA-Z\xC0-\uFFFF] part with whichever characters you want allowed. For example, you can edit the western-only version to allow Arabic Unicode characters as well. A full list of Unicode characters can be found on the Unicode Table.
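One way to make such swaps less error-prone is to build the pattern from the character set, so the set is written only once. The helper buildNameRegex below is a hypothetical sketch, not part of the original regex. It also shows a possible alternative to hand-picked ranges: JavaScript's Unicode property escapes (\p{L} for letters, \p{M} for combining marks), which require the u flag.

```javascript
// Hypothetical helper: builds the article's name pattern around any
// character-class body (the text that goes between [ and ]).
const buildNameRegex = (letters) =>
  new RegExp(`^[${letters}]+([ '-]{0,1}[${letters}]+){0,2}[.]{0,1}$`, 'u')

// The wide range used in this article.
const anyScript = buildNameRegex('a-zA-Z\\xC0-\\uFFFF')
// Alternative (an assumption, not the article's choice): letters and marks only.
const lettersOnly = buildNameRegex('\\p{L}\\p{M}')

console.log(anyScript.test('Müller'))      // true
console.log(lettersOnly.test('Владимир'))  // true
console.log(anyScript.test('a\u00D7b'))    // true: the wide range admits the × symbol
console.log(lettersOnly.test('a\u00D7b'))  // false: × is not a letter
```

The property-escape version trades explicit control over ranges for a definition of "letter" maintained by the Unicode standard, so it is worth running your own test names through it before adopting it.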

Unit tests

To ensure that our regex matches all our cases and to be able to build on top of it as we go, we need to write unit tests. This example is written using Mocha with Chai-style expect assertions on Node.js.

expect(isValidName('a')).to.be.true;
expect(isValidName('alice')).to.be.true;
expect(isValidName('Robert Downey Jr.')).to.be.true;
expect(isValidName('Mia-Downey')).to.be.true;
expect(isValidName("Mark O'neil")).to.be.true;
expect(isValidName('Thomas Müler')).to.be.true;
expect(isValidName('ßáçøñ')).to.be.true;
expect(isValidName('أحمد')).to.be.true;
expect(isValidName('فلسطين')).to.be.true;
expect(isValidName('فيليپا')).to.be.true;
expect(isValidName('Владимир')).to.be.true;
expect(isValidName('a ')).to.be.false;
expect(isValidName('a-')).to.be.false;
expect(isValidName("Mark O'")).to.be.false;
expect(isValidName('a_a')).to.be.false;
expect(isValidName('lara1')).to.be.false;
expect(isValidName('mila.eddison')).to.be.false;
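If you want to try these cases without setting up a test runner, the sketch below checks the same expectations in plain Node.js. It is table-driven: each case pairs an input with the expected result. The cases array and the failures list are hypothetical names.

```javascript
const NAME_REGEX = /^[a-zA-Z\xC0-\uFFFF]+([ \-']{0,1}[a-zA-Z\xC0-\uFFFF]+){0,2}[.]{0,1}$/
const isValidName = (name) => NAME_REGEX.test(name)

// Each entry is [input, expected result].
const cases = [
  ['a', true], ['alice', true], ['Robert Downey Jr.', true],
  ['Mia-Downey', true], ["Mark O'neil", true], ['Thomas Müler', true],
  ['ßáçøñ', true], ['أحمد', true], ['Владимир', true],
  ['a ', false], ['a-', false], ["Mark O'", false],
  ['a_a', false], ['lara1', false], ['mila.eddison', false],
]

// Collect every input whose result differs from the expectation.
const failures = cases.filter(([name, expected]) => isValidName(name) !== expected)
console.log(failures.length === 0 ? 'all cases pass' : failures)
```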

Extra validations

As you’ve already noticed, the given regex doesn’t have any length restrictions. This lets you handle length validation separately with a simple name.length check.
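A minimal sketch of combining the two checks follows. The limits MIN_LENGTH and MAX_LENGTH are hypothetical values; pick ones that match your database column. The sketch counts code points with the spread operator rather than using name.length directly, since length counts UTF-16 code units.

```javascript
const NAME_REGEX = /^[a-zA-Z\xC0-\uFFFF]+([ \-']{0,1}[a-zA-Z\xC0-\uFFFF]+){0,2}[.]{0,1}$/

// Hypothetical limits, not part of the original regex.
const MIN_LENGTH = 1
const MAX_LENGTH = 50

/** Validates length first, then the allowed characters. */
const isValidName = (name) => {
  const length = [...name].length // code points, not UTF-16 code units
  return length >= MIN_LENGTH && length <= MAX_LENGTH && NAME_REGEX.test(name)
}

console.log(isValidName('alice'))         // true
console.log(isValidName(''))              // false: too short
console.log(isValidName('a'.repeat(51)))  // false: too long
```

Checking the length before running the regex also keeps very long inputs away from the pattern matcher.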

Database storage

Keep the following considerations in mind when storing the fields in your database.

  • Did I use the right data type?
  • Does my field have enough length?
  • Does my field support the unicode characters my validation supports?
  • Did I escape the input? (to avoid DB/SQL injection attacks)
  • Is my field searchable with and without the special characters?
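For the last point, one possible approach is to store an accent-free search key next to the original name. The helper toSearchKey below is a hypothetical sketch: it decomposes each character with String.prototype.normalize and removes the combining marks. Letters that have no decomposition (ø and ß, for example) pass through unchanged, so this is a starting point rather than a complete folding scheme.

```javascript
// Hypothetical helper: derives an accent-insensitive, lowercase search key.
const toSearchKey = (name) =>
  name
    .normalize('NFD')        // split letters from their combining marks
    .replace(/\p{M}/gu, '')  // drop the combining marks
    .toLowerCase()

console.log(toSearchKey('Müller'))   // muller
console.log(toSearchKey('Cláudia'))  // claudia
```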

Supporting different languages in a standalone manner (optional)

The regex we created will work well with all Unicode-based names, skipping well-known punctuation marks in Latin. Other languages, however, have their own rules for legal names. For example, in Arabic you can write your name with accents, but your legal name has to be written without them. Our current regex does not check for such restrictions. As you internationalize your app, you’ll probably need to support multiple languages, like Arabic, Chinese, Japanese and Russian to name a few. I highly recommend creating separate functions for separate languages if you need fine control over the legal characters of a name; this keeps our single regex from becoming too complicated to use.

Arabic language support in detail: The Arabic language is spoken by over 310 million people all over the globe; it’s used in many countries, such as Egypt, Palestine and the United Arab Emirates. To support Arabic we start by checking the Unicode blocks. The main Arabic block is 0600-06FF. This range includes digits and accents. Arabic accents mark short vowels and, unlike vowels in English, are not written in official documents: they indicate how a word is pronounced and are separate from its spelling. So let's exclude the digits and accents. This leaves us with the ranges 0620-064A and 066E-06D5. There is also another block called Arabic Supplement (0750-077F), used to spell non-native Arabic names using Arabic letters.

const NAME_REGEX = /^[\u0620-\u064A\u066E-\u06D5\u0750-\u077F]+([ ]{0,1}[\u0620-\u064A\u066E-\u06D5\u0750-\u077F]+){0,2}$/

We replaced `a-zA-Z\xC0-\uFFFF` with `\u0620-\u064A\u066E-\u06D5\u0750-\u077F`, i.e. we replaced the set of supported characters with the Arabic subset. We also removed the period, dash and apostrophe.

expect(isValidName('أحمد')).to.be.true;
expect(isValidName('فلسطين')).to.be.true;
expect(isValidName('فيليپا')).to.be.true;
expect(isValidName('فيلAhmedيپا')).to.be.false;
expect(isValidName('عمر١')).to.be.false;
expect(isValidName('۩')).to.be.false;
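The accent rule can be illustrated with a name written with diacritics. The sketch below uses the ranges named in the text (0620-064A, 066E-06D5 and 0750-077F) and spells the input with escapes so the diacritics are visible. The helper stripArabicDiacritics is hypothetical; whether to strip accents or reject the input is a decision for your app.

```javascript
const ARABIC_NAME_REGEX =
  /^[\u0620-\u064A\u066E-\u06D5\u0750-\u077F]+([ ]{0,1}[\u0620-\u064A\u066E-\u06D5\u0750-\u077F]+){0,2}$/

// Hypothetical helper: removes the common Arabic diacritics (U+064B to U+0652).
const stripArabicDiacritics = (name) => name.replace(/[\u064B-\u0652]/g, '')

// The name محمد written with its diacritics (damma, fatha, shadda, fatha).
const withDiacritics = '\u0645\u064F\u062D\u064E\u0645\u0651\u064E\u062F'

console.log(ARABIC_NAME_REGEX.test(withDiacritics))                         // false
console.log(ARABIC_NAME_REGEX.test(stripArabicDiacritics(withDiacritics)))  // true
```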

To support other languages to this level of detail, you’d have to go over each language you support and find the legally allowed subset of characters. You can then call the appropriate validation function based on the user’s country, locale or choice.
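Choosing the validation function could look like the sketch below. The lookup table REGEX_BY_LANGUAGE, the function isValidNameFor and the use of two-letter language codes are all assumptions made for illustration; languages without their own entry fall back to the general regex.

```javascript
const DEFAULT_NAME_REGEX =
  /^[a-zA-Z\xC0-\uFFFF]+([ \-']{0,1}[a-zA-Z\xC0-\uFFFF]+){0,2}[.]{0,1}$/
const ARABIC_NAME_REGEX =
  /^[\u0620-\u064A\u066E-\u06D5\u0750-\u077F]+([ ]{0,1}[\u0620-\u064A\u066E-\u06D5\u0750-\u077F]+){0,2}$/

// Hypothetical lookup table: add one entry per language you support in detail.
const REGEX_BY_LANGUAGE = new Map([['ar', ARABIC_NAME_REGEX]])

/** Validates a name with the rules of the given language, if any. */
const isValidNameFor = (name, language) =>
  (REGEX_BY_LANGUAGE.get(language) || DEFAULT_NAME_REGEX).test(name)

console.log(isValidNameFor('أحمد', 'ar'))   // true
console.log(isValidNameFor('Ahmed', 'ar'))  // false: Latin letters in an Arabic field
console.log(isValidNameFor('Ahmed', 'en'))  // true: falls back to the general regex
```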

Conclusion

We’ve now created a regex to validate Latin-based names and added support for different language scripts as we went. This allows a significantly larger set of our users to spell their names correctly, and it works pretty well across world regions. We also learned how to create separate validation functions for individual languages like Arabic and how to support only western character sets.

TL;DR; here’s the final regex (in case you just scrolled here for the result, like I do.):

const NAME_REGEX = /^[a-zA-Z\xC0-\uFFFF]+([ \-']{0,1}[a-zA-Z\xC0-\uFFFF]+){0,2}[.]{0,1}$/

/** Validates a name field (first or last name) */
const isValidName = (name) => NAME_REGEX.test(name)

Happy hacking! ❤️
