Java: how to replace whitespace with underscores?

Let’s assume you want to generate a technical identifier from a String. The identifier cannot contain whitespace, so you want to replace blanks, tab characters, and any other kind of whitespace with underscores. How can you do that? The answer is the String method replaceAll combined with a regular expression. Let’s see how that works.

A simple, but insufficient approach: using replace()

The straightforward approach is to use the String method replace(), as in

	public static void whitespace1(String input) {
		String output = input.replace(" ", "_");
		System.out.println("input=" + input + " => output=" + output);
	}

This works well for

whitespace1( "This is a test" );

input=This is a test => output=This_is_a_test

It does not work well for

whitespace1( "This  is  a  test" );

input=This  is  a  test => output=This__is__a__test

It does not work at all for other kinds of whitespace, such as tab characters. To put a tab character into a String constant, you can use the escape sequence “\t”, as in

whitespace1( "This\tis\ta\ttest" );

input=This    is    a    test => output=This    is    a    test

So in the output you can still see the tabs: they were not translated into underscores.

The list of possible whitespace characters is long (see Wikipedia). You will not want to handle them all by yourself.

The solution for translating whitespace is regex (regular expression)

Fortunately, all this is already built into Java. It is called regular expressions, or “regex” for short. There is a variant of the String replace method that takes a regex instead of a literal text: replaceAll.

This is a new version of the program using replaceAll:

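	public static void whitespace2(String input) {
		// "\\s+" in Java source is the regex \s+ : one or more whitespace characters
		String output = input.replaceAll("\\s+", "_");
		System.out.println("input=" + input + " => output=" + output);
	}

Here is an explanation:

  • \s is the regex shorthand for a whitespace character. In Java it covers the blank, the tab (\t), newline, carriage return, form feed and vertical tab.
  • + stands for “one or more”, so a whole run of whitespace is replaced by a single underscore.
  • In Java source code the backslash must be doubled: the regex \s+ is written as the String literal "\\s+". With a single backslash, "\s" is a compile error before Java 15. From Java 15 on it is an escape sequence for a plain blank, so the pattern would silently match blanks only.
  • If you only want to match blanks and tabs, but not line breaks, you can use \p{Blank} instead. It is less well known, though, and makes the code harder to read.

Because \s already includes the tab, a blank followed by a tab counts as one run of whitespace and produces a single underscore. No separate replaceAll for \t is needed. This method might still not do everything you want; for example, leading and trailing whitespace also turns into an underscore. But for my purpose it works well enough.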
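To try out the whitespace regex in isolation, here is a minimal, self-contained sketch. It assumes the pattern is written with a doubled backslash ("\\s+"). The class and method names are made up for this illustration. The second method is an optional variation that is not part of the approach above: by default Java's \s only covers ASCII whitespace, and the embedded flag (?U) widens it to Unicode whitespace such as the non-breaking space.

```java
import java.util.regex.Pattern;

public class WhitespaceDemo {

    // Illustrative names; compiled once so the patterns can be reused.
    private static final Pattern ASCII_WS = Pattern.compile("\\s+");
    // (?U) enables UNICODE_CHARACTER_CLASS, so \s also matches Unicode whitespace.
    private static final Pattern UNICODE_WS = Pattern.compile("(?U)\\s+");

    // Replaces each run of ASCII whitespace (blank, tab, line breaks) with one underscore.
    public static String toUnderscores(String input) {
        return ASCII_WS.matcher(input).replaceAll("_");
    }

    // Same idea, but also treats Unicode whitespace (e.g. U+00A0, U+2003) as whitespace.
    public static String toUnderscoresUnicode(String input) {
        return UNICODE_WS.matcher(input).replaceAll("_");
    }

    public static void main(String[] args) {
        System.out.println(toUnderscores("This  is  a  test"));        // This_is_a_test
        System.out.println(toUnderscores("This \t is\ta\ttest"));     // This_is_a_test
        System.out.println(toUnderscoresUnicode("a\u00A0\u2003b"));    // a_b
    }
}
```

Whether the Unicode variant is needed depends on where the input comes from; for plain ASCII text, the simple "\\s+" version should be enough.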

Possible extension: ensure alphanumeric result

Another possible extension would be

output = output.replaceAll("[^a-zA-Z0-9_]", "");

This filters out absolutely everything that is not a normal alphanumeric character. Even German Umlauts will be deleted.

Here the square bracket is used to define a “character class” consisting of anything that is not (^)

  • in the range of a to z
  • or in the range of A to Z
  • or a digit
  • or the underscore that I would like to keep.
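The character class can be tried out on its own with a small sketch like the following. The class and method names are invented for this example, and the umlaut is written as a Unicode escape so that the source file encoding does not matter.

```java
public class AlnumFilterDemo {

    // Deletes every character that is not a-z, A-Z, 0-9 or the underscore.
    public static String keepAlnumAndUnderscore(String input) {
        return input.replaceAll("[^a-zA-Z0-9_]", "");
    }

    public static void main(String[] args) {
        System.out.println(keepAlnumAndUnderscore("a_$$,&test"));        // a_test
        // "Gr\u00F6\u00DFe" is the German word for "size"; both non-ASCII letters are dropped.
        System.out.println(keepAlnumAndUnderscore("Gr\u00F6\u00DFe_1")); // Gre_1
    }
}
```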

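Combined with replaceAll("\\s+", "_") for the whitespace, this can now translate

input=This  is 	 a  $$,&test => output=This_is_a_test

I might still end up with double underscores, for example when a symbol stands between two blanks: “a & b” becomes “a__b”. But after all, I would say this is good enough. Let’s be satisfied with this.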

It would, however, be possible to use regex to translate several consecutive underscores into just one. This is left as an exercise for the reader.
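For readers who want to compare their answer afterwards, here is one possible sketch of the whole chain. It is only one way to do it: the class name, the method name and the order of the three steps are assumptions made for this illustration, and the pattern "_+" is simply the same “one or more” idea applied to the underscore.

```java
public class IdentifierDemo {

    // Hypothetical helper: whitespace -> "_", drop other symbols, collapse "_" runs.
    public static String toIdentifier(String input) {
        return input
                .replaceAll("\\s+", "_")            // each whitespace run becomes one underscore
                .replaceAll("[^a-zA-Z0-9_]", "")    // delete everything else that is not allowed
                .replaceAll("_+", "_");             // several consecutive underscores become one
    }

    public static void main(String[] args) {
        System.out.println(toIdentifier("a & b"));                      // a_b
        System.out.println(toIdentifier("This  is \t a  $$,&test"));   // This_is_a_test
    }
}
```

Leading and trailing underscores are kept in this sketch, so " a " still becomes "_a_".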

Have fun.

