

Tabulating characters with diacritics in R

r,unicode,nlp,linguistics
I'm trying to tabulate the occurrences of phones (characters) in a string, but diacritics are tabulated as characters in their own right. I have a wordlist in the International Phonetic Alphabet, with a fair number of diacritics and several combinations of them with base characters. I give here an MWE with just one...

How to tell if a string of characters makes intelligible words

java,android,statistics,tesseract,linguistics
So, I'm working on a simple mobile app project (mostly for fun) that uses an OCR library (tesseract) on Android to scan a camera picture, do some stuff with the text, and return it to the user. What I'm wondering is if anyone out there knows of a way to...

Lemmatizer supporting the German language (for commercial and research purposes)

machine-learning,nlp,linguistics
I am looking for lemmatization software which: supports the German language; has a license that allows it to be used for commercial and research purposes (an LGPL license would be good); should preferably be implemented in Java (implementations in other programming languages would also be OK). Does anybody know about...

How to pass in an estimator to NLTK's NgramModel?

python,nlp,nltk,n-gram,linguistics
I am using NLTK to train a bigram model using a Laplace estimator. The constructor for the NgramModel is: def __init__(self, n, train, pad_left=True, pad_right=False, estimator=None, *estimator_args, **estimator_kwargs): After some research, I found that a syntax that works is the following: bigram_model = NgramModel(2, my_corpus, True, False, lambda f, b:LaplaceProbDist(f))...

Type/Token Ratio in R

r,if-statement,tm,corpus,linguistics
I'm working with a new corpus and want to get the type/token ratio. Does anyone know of a standard way to do this? I've been trawling around the internet and haven't found anything relevant. Even the tm package doesn't seem to have an easy way to do this. Just as...
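
For reference, the type/token ratio is the number of distinct word forms (types) divided by the total number of words (tokens). The question is about R and the tm package; the Python sketch below only illustrates the arithmetic. The function name `type_token_ratio` and the lowercase, whitespace-based tokenization are assumptions made for the example, not something taken from the question.

```python
def type_token_ratio(tokens):
    """Return distinct tokens (types) divided by total tokens."""
    if not tokens:
        # Assumption: an empty corpus is given a ratio of 0.0 instead of raising.
        return 0.0
    return len(set(tokens)) / len(tokens)


# Assumption: naive tokenization (lowercase, split on whitespace).
tokens = "the cat sat on the mat".lower().split()

# 5 distinct types ("the" occurs twice) over 6 tokens.
ratio = type_token_ratio(tokens)
print(ratio)
```

The ratio is sensitive to text length, so values are usually only comparable between samples of similar size.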

Oracle linguistic index not used when SQL contains parameter with LIKE

oracle,performance,indexing,linguistics
My schema (simplified): CREATE TABLE LOC ( LOC_ID NUMBER(15,0) NOT NULL, LOC_REF_NO VARCHAR2(100 CHAR) NOT NULL ) / CREATE INDEX LOC_REF_NO_IDX ON LOC ( NLSSORT("LOC_REF_NO",'nls_sort=''BINARY_AI''') ASC ) / My query (in SQL*Plus): ALTER SESSION SET NLS_COMP=LINGUISTIC NLS_SORT=BINARY_AI / VAR LOC_REF_NO VARCHAR2(50) BEGIN :LOC_REF_NO := 'SPDJ1501270'; END; / -- Causes full...

Handling count of characters with diacritics in R

r,unicode,character-encoding,nlp,linguistics
I'm trying to get the number of characters in strings containing characters with diacritics, but I can't manage to get the right result. > x <- "n̥ala" > nchar(x) [1] 5 What I want to get is 4, since n̥ should be considered one character (i.e. diacritics shouldn't be...
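
The count of 5 arises because n̥ is stored as two code points: the letter n followed by U+0325 COMBINING RING BELOW. The question concerns R; as a language-neutral illustration, here is a minimal Python sketch that counts only code points that are not combining marks. The helper name `count_base_characters` is hypothetical, and the approach assumes every diacritic is encoded as a Unicode mark (general category starting with "M"). Spacing modifier letters such as ʰ are not marks and would still be counted separately.

```python
import unicodedata


def count_base_characters(text):
    """Count code points whose Unicode general category is not a mark (M*)."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )


# "n" + U+0325 COMBINING RING BELOW, followed by "a", "l", "a".
word = "n\u0325ala"

raw_length = len(word)                     # counts every code point
base_length = count_base_characters(word)  # skips the combining ring
print(raw_length, base_length)
```

A full solution would segment the string into grapheme clusters as defined by the Unicode text segmentation rules. The mark-skipping approach above is a simpler approximation that happens to suffice for this input.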