Script identifier for Unicode strings in Perl - Linux

  • Status Closed
  • Budget $10 - $30 USD
  • Total Bids 10

Project Description

I want to build a Perl script that identifies the "script" in which a particular UTF-8 string is written. For example, given the strings:

دجنبر --> Arabic

децембар --> Cyrillic

João --> Latin

נאָװעמבער --> Hebrew

กันยายน --> Thai

цембJair --> Mixed

the program should look at the "Script" property of each character and check whether they all belong to the same script; if so, it should report that script's name. If the string mixes two or more scripts, it should return "Mixed".
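The per-character check described above can be sketched in pure Perl. This is a minimal sketch, not a definitive implementation: it assumes the core module Unicode::UCD and its charscript() function, assumes the input has already been decoded from UTF-8 into Perl characters, and assumes that characters in the "Common" and "Inherited" scripts (punctuation, digits, some combining marks) should be ignored. The name identify_script is hypothetical.

```perl
use strict;
use warnings;
use utf8;                              # string literals in this file are UTF-8
use Unicode::UCD qw(charscript);

# Hypothetical helper: return the one script shared by every character
# of $str, "Mixed" if more than one script occurs, or undef if the
# string contains no character with a specific script.
sub identify_script {
    my ($str) = @_;
    my %seen;
    for my $ch (split //, $str) {
        my $script = charscript(ord $ch);
        next unless defined $script;
        # Assumption: Common and Inherited characters are neutral and
        # do not make a string "Mixed".
        next if $script eq 'Common' || $script eq 'Inherited';
        $seen{$script} = 1;
    }
    my @scripts = keys %seen;
    return undef unless @scripts;
    return @scripts == 1 ? $scripts[0] : 'Mixed';
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word ('децембар', 'João', 'กันยายน', 'цембJair') {
    my $result = identify_script($word);
    print "$word --> ", (defined $result ? $result : 'Unknown'), "\n";
}
```

If the string arrives as raw UTF-8 bytes (for example from STDIN), it would need to be decoded first, for instance with Encode::decode('UTF-8', $bytes), before being passed to the helper.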

The best way to get there would be to use the program "uniname", echoing the string into it:

echo กันยายน | uniname -b -g -c -e -r -u -n

and then process the output:









Basic Latin

to eliminate the first line (a header) and the last line (which corresponds to the LINE FEED at the end of the word). If all characters belong to the same range, then report that range. If not, return the word "Mixed".
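The post-processing step just described could look roughly like the following sketch. It assumes the captured output holds one range name per line, with a header line first and the LINE FEED entry last, as the description implies; the exact uniname output format is not shown here, so treat that layout as an assumption. The helper name range_from_lines is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: @lines is the captured output, one line per element.
# Returns the common range name, "Mixed" if the ranges differ, or undef
# if nothing is left after dropping the header and the final line.
sub range_from_lines {
    my @lines = @_;
    chomp @lines;
    s/^\s+|\s+$//g for @lines;       # trim surrounding whitespace
    shift @lines;                    # drop the header line
    pop @lines;                      # drop the trailing LINE FEED entry
    return undef unless @lines;
    my %seen = map { $_ => 1 } @lines;
    return keys(%seen) == 1 ? $lines[0] : 'Mixed';
}
```

The list of lines could be captured with backticks around the command shown earlier, and the result of range_from_lines printed directly.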

The program is available from here:

[url removed, login to view]

Paulo Ney
