Let’s assume you want to generate a technical identifier from a String. The identifier must not contain whitespace, so you want to replace blanks, tab characters, and any other kind of whitespace with underscores. How can you do that? The answer is the String method replaceAll together with regular expressions. Let’s see how to do that.
A simple, but insufficient approach: using replace()
The straightforward approach is to use the String method replace(), like this:
public static void whitespace1(String input) {
    // replace() takes its arguments literally: only the blank character is replaced
    String output = input.replace(" ", "_");
    System.out.println("input=" + input + " => output=" + output);
}
This works well for
whitespace1( "This is a test" );
input=This is a test => output=This_is_a_test
It does not work well when there are several blanks in a row, because each blank becomes its own underscore:
whitespace1( "This  is  a  test" );
input=This  is  a  test => output=This__is__a__test
It does not work at all for other kinds of whitespace like tab characters. If you want to place a tab character into a String constant, you can use the escape sequence “\t” like in
whitespace1( "This\tis\ta\ttest" );
input=This is a test => output=This is a test
In the output you can still see the tabs: they were not translated into underscores.
The list of possible whitespace characters is long (see Wikipedia). You will not want to handle them all by yourself.
The solution for translating whitespace is regex (regular expression)
Fortunately, all this has already been built into the Java standard library. It is called regular expressions, or “regex” for short. There is a special variant of the String replace method that interprets its first argument as a regex: replaceAll.
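To make the difference between the two methods concrete, here is a minimal sketch. The class and method names are made up for illustration; only replace() and replaceAll() come from the JDK.

```java
final class ReplaceVsReplaceAll {
    // replace(): the first argument is taken literally
    static String literal(String input) {
        return input.replace(".", "_");
    }

    // replaceAll(): the first argument is a regex, and "." matches any character
    static String regex(String input) {
        return input.replaceAll(".", "_");
    }

    public static void main(String[] args) {
        System.out.println(literal("a.b c")); // a_b c
        System.out.println(regex("a.b c"));   // _____
    }
}
```

So with replaceAll, characters such as the dot or the backslash have a special meaning and need care.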
This is a new version of the program using replaceAll:
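public static void whitespace2(String input) {
    // "\\s+" is the regex \s+ (the backslash must be doubled inside a Java string literal)
    String output = input.replaceAll("\\s+", "_");
    System.out.println("input=" + input + " => output=" + output);
}
Here is an explanation:
- \s is a shorthand for whitespace. In Java it matches blank, tab, line feed, vertical tab, form feed and carriage return.
- + stands for “one or more”, so a whole run of whitespace becomes a single underscore.
- \t stands for the tab character. It is already included in \s, so a separate replaceAll("\\t+", "_") is not needed.
- Inside a Java string literal the backslash has to be doubled: the regex \s+ is written as "\\s+". Before Java 15, "\s+" does not compile. Since Java 15, "\s" is a string escape for a single blank, so "\s+" compiles but only matches blanks.
- If you only want to match blanks and tabs, you can use \p{Blank} instead of \s, but this is not so well known and it makes the code harder to read.
This method might still not do everything that you might want it to accomplish. For example, \s also matches line breaks, so a multi-line input ends up on a single line. But for my purpose it works well enough. If only blanks and tabs should be replaced, \p{Blank}+ would handle this correctly.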
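The difference between \s and \p{Blank} is easiest to see side by side. The following is only a sketch: the class and method names are invented for this example, and the third variant assumes that the embedded flag (?U) (UNICODE_CHARACTER_CLASS in java.util.regex.Pattern) is acceptable for your input, since by default \s only covers the ASCII whitespace characters.

```java
final class WhitespaceRegexDemo {
    // All ASCII whitespace (blank, tab, line breaks, ...) becomes one underscore per run
    static String withS(String input) {
        return input.replaceAll("\\s+", "_");
    }

    // Only blanks and tabs are replaced; line breaks are kept
    static String withBlank(String input) {
        return input.replaceAll("\\p{Blank}+", "_");
    }

    // (?U) switches \s to the Unicode definition of whitespace
    static String withUnicodeS(String input) {
        return input.replaceAll("(?U)\\s+", "_");
    }

    public static void main(String[] args) {
        String mixed = "This \tis\na test"; // blank + tab, then a line feed, then a blank
        System.out.println(withS(mixed));     // This_is_a_test
        System.out.println(withBlank(mixed)); // This_is, a line break, then a_test

        String emSpace = "a\u2003b"; // U+2003 EM SPACE, not ASCII whitespace
        System.out.println(withS(emSpace).equals(emSpace)); // true: unchanged
        System.out.println(withUnicodeS(emSpace));          // a_b
    }
}
```

Which variant fits depends on where the input comes from. For plain ASCII input, "\\s+" is usually all you need.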
Possible extension: ensure alphanumeric result
Another possible extension would be
output = output.replaceAll("[^a-zA-Z0-9_]", "");
This filters out absolutely everything that is not an ASCII letter, a digit, or an underscore. Even German umlauts will be deleted.
Here the square bracket is used to define a “character class” consisting of anything that is not (^)
- in the range of a to z
- or in the range of A to Z
- or a digit
- or the underscore that I would like to keep.
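Put together, the two steps could look like the sketch below. The helper name toIdentifier is hypothetical, and the whitespace step assumes the "\\s+" variant. The umlauts are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
final class IdentifierDemo {
    // Hypothetical helper: whitespace step first, then the character class filter
    static String toIdentifier(String input) {
        String output = input.replaceAll("\\s+", "_");
        // Delete everything that is not a-z, A-Z, 0-9 or the underscore
        return output.replaceAll("[^a-zA-Z0-9_]", "");
    }

    public static void main(String[] args) {
        // "Gr\u00f6\u00dfe" is the German word for "size", spelled with o-umlaut and sharp s
        System.out.println(toIdentifier("Gr\u00f6\u00dfe in cm")); // Gre_in_cm
        System.out.println(toIdentifier("price ($) net"));         // price__net
    }
}
```

The second call shows that deleting characters between two underscores leaves the underscores next to each other.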
This can now translate
input=This is a $$,&test => output=This_is___a_test
Again, I might end up with double underscores, but after all, I would say this is good enough. Let’s be satisfied with this.
It is, however, possible to use regex to translate several consecutive underscores into just one. This is left as an exercise to the reader.
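If you want to compare your own solution, one possible sketch is shown here. It is not the only way to do it, and the class and method names are invented for this example.

```java
final class CollapseUnderscores {
    // "_+" matches a run of one or more underscores and replaces it with a single one
    static String collapse(String input) {
        return input.replaceAll("_+", "_");
    }

    public static void main(String[] args) {
        System.out.println(collapse("This_is___a_test")); // This_is_a_test
    }
}
```

The underscore has no special meaning in a regex, so it does not need to be escaped.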
Have fun.