Anyone who has worked on Natural Language Processing has probably run into problems parsing dates. Because dates can be expressed in many different ways in natural language, parsing them is difficult.
So it makes sense to look at libraries that can do this troublesome part for us. This article is about date parsers that can ease the task.
What is a Date Parser in NLP?
A date parser is a component that extracts a date from text in exactly the form we want.
In Java, we can use SimpleDateFormat for parsing, which extracts the date as long as we specify the format in advance. But in NLP there is no predefined format: expressions can refer to dates in any form, and the reference may be only a small, loosely specified part of a sentence.
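To make the limitation concrete, here is a minimal sketch using only the JDK. The class and method names (FixedFormatDemo, parseIsoDate) are made up for this illustration. It shows that a fixed pattern handles text in that pattern and nothing else.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

class FixedFormatDemo {

    // Returns the parsed Date, or null when the text does not match the pattern.
    static Date parseIsoDate(String text) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        format.setLenient(false); // reject out-of-range values such as month 13
        try {
            return format.parse(text);
        } catch (ParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Text in the expected pattern is parsed.
        System.out.println(parseIsoDate("2017-01-21") != null);
        // Free text is not: SimpleDateFormat has no notion of "next Thursday".
        System.out.println(parseIsoDate("next Thursday") != null);
    }
}
```

Every new way of writing a date would need its own pattern, which is exactly the work an NLP date parser takes over.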
This is where NLP date parsers come into play. In this article we will discuss two popular ones: Natty and Duckling.
Natty
Natty is a natural language date parser written in Java. It can recognise almost any reference to a date in an expression. It applies standard language recognition and translation techniques to produce a list of corresponding dates, with optional parse and syntax information.
- Natty makes use of ANTLR (ANother Tool for Language Recognition), a powerful parser generator for reading, processing, executing or translating text or binary files.
- It takes typical reference dates into account while producing a result.
Supported Formats
- Formal dates
- Relaxed dates
- Relative dates
- Date alternatives
- Prefixes
- Time
- Relative times
- Time zones
Formal Dates
Natty supports formal dates in the following formats:
- yyyy-mm-dd
- yyyy-dd-mm
- dd-mm-yyyy
- mm-dd-yyyy
Relaxed Dates
These are dates written in a loose, non-standard manner, with most parts being optional.
For example:
- The first Monday of April 2000
- Sat, 21 Jan 2017
- September 22nd
- 1st April
Relative Dates
These are dates expressed relative to the current date.
For example:
- next Thursday
- last Saturday
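To illustrate what resolving a relative date against a reference date means, here is a small sketch using the JDK's java.time API. It is not how Natty works internally, and the class and method names are made up for this example. It also assumes that "next" means the first matching weekday strictly after the reference date and "last" the most recent one strictly before it; a parser may interpret these words differently.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

class RelativeDateSketch {

    // First occurrence of the given weekday strictly after the reference date.
    static LocalDate next(DayOfWeek day, LocalDate reference) {
        return reference.with(TemporalAdjusters.next(day));
    }

    // Most recent occurrence of the given weekday strictly before the reference date.
    static LocalDate last(DayOfWeek day, LocalDate reference) {
        return reference.with(TemporalAdjusters.previous(day));
    }

    public static void main(String[] args) {
        LocalDate reference = LocalDate.of(2017, 1, 21); // a Saturday
        System.out.println(next(DayOfWeek.THURSDAY, reference)); // 2017-01-26
        System.out.println(last(DayOfWeek.SATURDAY, reference)); // 2017-01-14
    }
}
```

The same expression yields a different date for every reference date, which is why a relative date can only be resolved once the current date context is known.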
Date alternatives
Natty can recognise date alternatives in an expression. For example, in the expression “maybe next Thursday or Friday”, Natty would find date references for both “next Thursday” and “Friday”.
This is why Natty returns a list of dates as its result.
Prefixes
Most of the date formats referenced above may be prefixed with a modifier.
For example: “a day after”, “the Monday before”.
Time
Natty also recognises time information that is prefixed or suffixed to a date.
For example:
- 12:00 hours
- noon
- evening
- afternoon
- 6:00 am
Natty also identifies relative times. People often use relative times and dates in expressions such as “10 seconds ago” or “5 minutes before 6:00 pm today”.
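As a rough illustration of what these two expressions resolve to, here is a sketch with java.time. The class and method names are hypothetical, and this is not Natty's implementation; it only shows the arithmetic a parser has to perform once it has understood the expression.

```java
import java.time.LocalDateTime;
import java.time.LocalTime;

class RelativeTimeSketch {

    // "10 seconds ago": subtract from the reference moment.
    static LocalDateTime secondsAgo(long seconds, LocalDateTime reference) {
        return reference.minusSeconds(seconds);
    }

    // "5 minutes before 6:00 pm today": anchor on the reference day, then subtract.
    static LocalDateTime minutesBeforeToday(long minutes, LocalTime time, LocalDateTime reference) {
        return reference.toLocalDate().atTime(time).minusMinutes(minutes);
    }

    public static void main(String[] args) {
        LocalDateTime reference = LocalDateTime.of(2017, 1, 21, 9, 30, 0);
        System.out.println(secondsAgo(10, reference));
        System.out.println(minutesBeforeToday(5, LocalTime.of(18, 0), reference));
    }
}
```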
With all of these capabilities, it is no surprise that Natty can extract time zones as well.
Time zones can appear in several forms: Natty recognises both time zone offsets and time zone names.
For example:
- +0500
- UTC
- IST
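The difference between an offset and a zone name can be sketched with the JDK's own time zone classes. This only illustrates the two kinds of input; how Natty itself resolves an abbreviation is not covered in this article. Note that abbreviations such as IST are ambiguous (India, Irish and Israel Standard Time all use it). The sketch relies on the JDK's ZoneId.SHORT_IDS table, which maps IST to Asia/Kolkata. The class and method names are made up for this example.

```java
import java.time.ZoneId;
import java.time.ZoneOffset;

class TimeZoneSketch {

    // Numeric offset such as "+0500": a fixed distance from UTC.
    static ZoneOffset offset(String text) {
        return ZoneOffset.of(text);
    }

    // Zone name or abbreviation, resolved through the JDK's short-ID table.
    static ZoneId named(String text) {
        return ZoneId.of(text, ZoneId.SHORT_IDS);
    }

    public static void main(String[] args) {
        System.out.println(offset("+0500").getTotalSeconds()); // 18000
        System.out.println(named("UTC").getId());              // UTC
        System.out.println(named("IST").getId());              // Asia/Kolkata
    }
}
```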
How To Use
Natty is straightforward to set up: there are no supporting jars or packages to install separately. You can include Natty in your Maven project by adding the following dependency to the pom.xml file.
<dependency>
    <groupId>com.joestelmach</groupId>
    <artifactId>natty</artifactId>
    <version>0.11</version>
</dependency>
After adding the Maven dependency, click “Import Changes” (IntelliJ IDEA), or do whatever your build tool requires, to download the jar.
That's it: Natty is now part of your project.
The next question is how to start parsing dates.
That is simple as well.
The code below shows how to parse a date and time from a simple English expression.
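import com.joestelmach.natty.*;
import java.util.Date;
import java.util.List;
import java.util.Map;

public class NattyExample {
    public static void main(String[] args) {
        Parser parser = new Parser();
        // Each DateGroup covers one date expression found in the input text.
        List<DateGroup> groups = parser.parse("the day after last monday");
        for (DateGroup group : groups) {
            List<Date> dates = group.getDates();        // the resolved dates
            int line = group.getLine();                 // line of the match
            int column = group.getPosition();           // column of the match
            String matchingValue = group.getText();     // the text that matched
            String syntaxTree = group.getSyntaxTree().toStringTree();
            Map<String, List<ParseLocation>> parseMap = group.getParseLocations();
            boolean isRecurring = group.isRecurring();
            Date recursUntil = group.getRecursUntil();  // end of recurrence, if any
        }
    }
}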
Note that each DateGroup object returns a list of dates, which means that Natty provides a list of suggested dates from the parse results.
Duckling
Duckling is a Clojure library that parses text into structured data. Duckling supports the following languages.
- English
- Spanish
- French
- Italian
- Chinese (experimental)
Duckling has three main advantages: it is agnostic, probabilistic and extensible.
- Agnostic: Duckling doesn't make any assumptions about the kind of data that we want to extract.
- Probabilistic: Duckling provides multiple results for a given input string, each with a probability value assigned to it, so that the user can choose the most probable one or the one that fits the required conditions.
- Extensible: Duckling can be extended with a custom corpus, as discussed later in this article.
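Choosing among several scored results can be sketched as follows. The Candidate type, its fields and the scores are hypothetical and exist only for this illustration; they are not Duckling's actual output structure or API. The sketch only shows the two selection strategies mentioned above: taking the most probable result overall, or the most probable one that meets a condition.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class CandidatePicker {

    // Hypothetical result type: one interpretation of the input with a score.
    static final class Candidate {
        final String dimension;   // e.g. "time" or "number"
        final String value;
        final double probability;

        Candidate(String dimension, String value, double probability) {
            this.dimension = dimension;
            this.value = value;
            this.probability = probability;
        }
    }

    // The most probable candidate overall, if there is any.
    static Optional<Candidate> mostProbable(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble(c -> c.probability));
    }

    // The most probable candidate of one dimension (the "condition" case).
    static Optional<Candidate> mostProbableOf(List<Candidate> candidates, String dimension) {
        return candidates.stream()
                .filter(c -> c.dimension.equals(dimension))
                .max(Comparator.comparingDouble(c -> c.probability));
    }

    public static void main(String[] args) {
        // Made-up candidates and scores, for illustration only.
        List<Candidate> candidates = Arrays.asList(
                new Candidate("number", "26", 0.3),
                new Candidate("time", "2017-01-26", 0.8));
        mostProbable(candidates).ifPresent(c -> System.out.println(c.value));
        mostProbableOf(candidates, "number").ifPresent(c -> System.out.println(c.value));
    }
}
```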
Supported Formats
Duckling supports the following formats:
- Time
- Temperature
- Number
- Ordinal
- Distance
- Volume
- Amount of money
- Duration
- Url
- Phone number
Time
Duckling identifies time references in almost all formats.
For example:
- today
- Monday, Feb 21
- last week
- thanksgiving day
- from 9:30 – 11:30 on Thursday
- 11:45 am
Temperature
Duckling can identify temperatures mentioned in an expression.
For example:
- 32 degree celsius
- 70 fahrenheit
- 60 degree
Number
Duckling identifies numbers in an expression as well. They may be written as digits or spelled out in words.
For example:
- eighteen
- 21
- 0.44
- 10k
Ordinal
For example:
- first
- 2nd
Distance
Duckling can identify distances mentioned in an expression.
For example:
- 32 miles
- 1 kilometer
- 1km
- 33 yards
Beyond these, Duckling supports the remaining formats listed above: volume, amount of money, duration, URL and phone number.
Extending Duckling
If you start using Duckling and feel that the accuracy of its predictions is not up to your expectations, you can extend Duckling and add to its corpus to improve the accuracy.
This is definitely a huge plus point of Duckling.
Conclusion: Which is better?
For someone who knows only Java, Natty is the best choice. For someone with experience in Clojure, Duckling would be the better choice.
For someone who knows both, I would say Duckling is the better choice.
Why? Let’s check the facts.
- Duckling can identify more kinds of information in an expression than Natty. Natty extracts mostly dates and times, which is just one of the formats Duckling handles.
- Duckling can be extended, and its accuracy improved, with a custom corpus; Natty cannot.
- Duckling supports multiple languages, whereas Natty supports only English.
Disclaimer: The opinions expressed in this article are those of the author(s) and do not necessarily reflect the positions of Dexlock.