Why do machine translations matter?
The number of non-English patent documents is growing rapidly. To tackle the searchability problem, we use machine translations to create English knowledge graphs from non-English texts.
During spring 2020 we expanded our data sources to include tens of national patent office databases, including multiple non-English databases. I took some time to analyze the newly imported data. The main focus was to understand the proportions between the major national patent offices and the importance of non-English documents. The following charts and document counts were collected and calculated from our search database using only documents that contain searchable specifications. This filtering gives a realistic view of our search space.
China leads in patent filings
As Chart 1 shows, the number of Chinese publications grew strongly during the 2010s. This is part of China’s industrial strategy, in which intellectual property is a major pillar. According to the strategy, companies qualify for government subsidies upon filing a patent application. But a closer look at the mass of Chinese publications in Chart 2 shows that most of the growth comes from unexamined patent applications (A) and granted utility models (U), leaving granted patent publications (B) behind. The subsidy policy has been criticized by legal experts, who warn that it could flood the system with cheap patents, but the increase in granted publications seems to be moderate.
The number of non-English publications is growing
On a global scale, the number of non-English publications has grown, according to the World Intellectual Property Organization (WIPO). In 2019, China passed the United States of America as the top source of international patent applications filed with WIPO. At the same time, international patent applications filed via WIPO’s Patent Cooperation Treaty (PCT) System grew by 5.2%, with Asia-based applicants accounting for more than half.
When inspecting PCT applications by original language, English is still the most used language, with a 46% share. The top five also contains three major Asian languages, Chinese, Korean and Japanese, which makes precise, high-quality translations important when working with PCT applications.
New machine translations to the rescue!
Since the addition of national data sources, the number of searchable patent publications has increased by over 240%. Most of the national sources are from non-English-speaking countries, making non-English documents the majority in our search space, with a 1.25:1 ratio compared to English documents. Because our graph-based approach to patent search currently works only with English patent texts, we required translations for all non-English documents. Luckily, our data provider, IFI CLAIMS Patent Services, started a huge project in 2019 to translate all non-English documents to English with Google’s state-of-the-art translator. As I write this post, the last newly translated documents have finally been imported into our patent database and are ready to be parsed into our knowledge graph format, which has gone through some major improvements during the last few weeks thanks to our NLP experts Sakke & Sebastian.
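To put the two headline numbers in perspective, here is a small back-of-the-envelope sketch in Python. It only restates the figures quoted above (the 1.25:1 ratio and the 240% increase, which the post gives as a lower bound). The function names are made up for illustration and are not part of our pipeline.

```python
def non_english_share(non_english: float, english: float) -> float:
    """Convert a non-English : English ratio into a share of all documents."""
    return non_english / (non_english + english)


def growth_factor(percent_increase: float) -> float:
    """Convert a percentage increase into a 'times the original size' factor."""
    return 1 + percent_increase / 100


# Figures quoted in the post: a 1.25:1 ratio and an increase of over 240%.
share = non_english_share(1.25, 1.0)
factor = growth_factor(240)

print(f"Non-English share of the search space: {share:.1%}")
print(f"Search space size relative to before: at least {factor:.1f}x")
```

In other words, a 1.25:1 ratio means that roughly 56% of the searchable documents are non-English, and an increase of over 240% means the search space is now at least 3.4 times its earlier size.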