Mail::SpamAssassin::Plugin::TextCat(3) User Contributed Perl Documentation Mail::SpamAssassin::Plugin::TextCat(3)
NAME
Mail::SpamAssassin::Plugin::TextCat - TextCat language guesser
SYNOPSIS
loadplugin Mail::SpamAssassin::Plugin::TextCat
DESCRIPTION
This plugin will try to guess the language used in the message body
text.
You can use the "ok_languages" directive to set which languages are
considered okay for incoming mail and if the guessed language is not
okay, "UNWANTED_LANGUAGE_BODY" is triggered. Alternatively you can use
the X-Languages metadata header directly in rules.
The plugin always adds the results to an "X-Languages" name-value pair
in the message metadata structure. This may be useful as Bayes tokens
and can also be used in rules for scoring. The results can also be
added to marked-up messages using "add_header", with the _LANGUAGES_
tag. See Mail::SpamAssassin::Conf for details.
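As a minimal sketch of the "add_header" usage mentioned above, the following line adds the guessed languages to every scanned message. The header name "Languages" is an arbitrary choice for illustration; SpamAssassin prefixes added headers with "X-Spam-".

```
# Adds a header such as "X-Spam-Languages: en" to all scanned messages.
add_header all Languages _LANGUAGES_
```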
Note: the language cannot always be recognized with sufficient
confidence. In that case, no action is taken.
You can use the _TEXTCATRESULTS_ tag to view the internal ngram scoring,
which may help when fine-tuning settings.
Examples of using the X-Languages header directly in rules:
header OK_LANGS X-Languages =~ /\ben\b/
score OK_LANGS -1
header BAD_LANGS X-Languages =~ /\b(?:ja|zh)\b/
score BAD_LANGS 1
USER OPTIONS
ok_languages xx [ yy zz ... ] (default: all)
This option is used to specify which languages are considered okay
for incoming mail. SpamAssassin will try to detect the language
used in the message body text.
Note that the language cannot always be recognized with sufficient
confidence. In that case, no action is taken.
The rule "UNWANTED_LANGUAGE_BODY" is triggered if none of the
languages detected are in the "ok" list. Note that this is the only
effect of the "ok" list. It does not act as a whitelist against any
other form of spam scanning.
In your configuration, you must use the two- or three-letter
language specifier in lowercase, not the English name for the
language. You may also specify "all" if a desired language is not
listed, or if you want to allow any language. The default setting
is "all".
Examples:
ok_languages all (allow all languages)
ok_languages en (only allow English)
ok_languages en ja zh (allow English, Japanese, and Chinese)
Note: if there are multiple ok_languages lines, only the last one
is used.
Select the languages to allow from the list below:
af - Afrikaans
am - Amharic
ar - Arabic
be - Byelorussian
bg - Bulgarian
bs - Bosnian
ca - Catalan
cs - Czech
cy - Welsh
da - Danish
de - German
el - Greek
en - English
eo - Esperanto
es - Spanish
et - Estonian
eu - Basque
fa - Persian
fi - Finnish
fr - French
fy - Frisian
ga - Irish Gaelic
gd - Scottish Gaelic
he - Hebrew
hi - Hindi
hr - Croatian
hu - Hungarian
hy - Armenian
id - Indonesian
is - Icelandic
it - Italian
ja - Japanese
ka - Georgian
ko - Korean
la - Latin
lt - Lithuanian
lv - Latvian
mr - Marathi
ms - Malay
ne - Nepali
nl - Dutch
no - Norwegian
pl - Polish
pt - Portuguese
qu - Quechua
rm - Rhaeto-Romance
ro - Romanian
ru - Russian
sa - Sanskrit
sco - Scots
sk - Slovak
sl - Slovenian
sq - Albanian
sr - Serbian
sv - Swedish
sw - Swahili
ta - Tamil
th - Thai
tl - Tagalog
tr - Turkish
uk - Ukrainian
vi - Vietnamese
yi - Yiddish
zh - Chinese (both Traditional and Simplified)
zh.big5 - Chinese (Traditional only)
zh.gb2312 - Chinese (Simplified only)
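Using codes from the list above, a site that expects only English, French, and German mail might combine "ok_languages" with a score for the rule it triggers. This is an illustrative sketch; the score value is an assumption, not a recommended setting.

```
# Accept only English, French, and German.
ok_languages en fr de
# Illustrative score; choose a value that suits your site.
score UNWANTED_LANGUAGE_BODY 2.0
```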
inactive_languages xx [ yy zz ... ] (default: see below)
This option is used to specify which languages will not be
considered when trying to guess the language. For performance
reasons, supported languages that have fewer than about 5 million
speakers are disabled by default. Note that listing a language in
"ok_languages" automatically enables it for that user.
The default setting is:
bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi
That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian,
Irish Gaelic, Scottish Gaelic, Icelandic, Latin, Lithuanian,
Latvian, Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.
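For example, to have Welsh considered when guessing without listing it in "ok_languages", the default list can be restated with "cy" removed. This sketch assumes that the directive replaces the default list rather than appending to it.

```
# Default list with "cy" removed, so that Welsh is considered when guessing.
inactive_languages bs eo et eu fy ga gd is la lt lv rm sa sco sl yi
```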
textcat_max_languages N (default: 3)
The maximum number of languages any one message can simultaneously
match before its classification is considered unknown. You can try
reducing this to 2 or possibly even 1 for more confident results,
as it's unusual for a message to contain multiple languages.
Also read the description of textcat_acceptable_score, as these
settings are closely related. The acceptable score affects how many
languages can match, while this option sets the "false positive
limit" beyond which the engine is considered unable to decide which
languages the message really contains.
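Because the two settings interact, they are usually adjusted together. The values below are only an illustration of a stricter configuration, with the limit lowered to 2 as suggested above and the acceptable score left at its default; they are not a tested recommendation.

```
# Hypothetical tuning: treat a message as unknown if more than two
# languages fall within the acceptable score range.
textcat_max_languages 2
textcat_acceptable_score 1.02
```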
textcat_optimal_ngrams N (default: 0)
If the number of ngrams is lower than this number then they will be
removed. This can be used to speed up the program for longer
inputs. For shorter inputs, this should be set to 0.
textcat_max_ngrams N (default: 400)
The maximum number of ngrams that should be compared with each of
the language models (note that each of those models is used
completely).
textcat_acceptable_score N (default: 1.02)
Include any language that scores at least
"textcat_acceptable_score" in the returned list of languages.
This setting is essentially a ratio relative to the best score: any
language whose internal ngram score is within that factor of the
best score (within 2% for the default of 1.02) is included in the
results. Values larger than 1.05 are not recommended, as they can
generate many false matches. A setting of 1.00 would mean that the
single best-scoring language is always forcibly selected, but this
is not recommended, as textcat_max_languages then cannot do its job
of classifying the language as uncertain.
Also read the description of textcat_max_languages, as these
settings are closely related.
You can use the _TEXTCATRESULTS_ tag to view the internal ngram
scoring, which may help when fine-tuning settings.
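One way to inspect that tag while tuning is to write it into a header with "add_header". This is a sketch only; the header name "TextCatResults" is a hypothetical choice, and the header is probably best removed once tuning is done.

```
# Temporary debugging header showing the internal ngram scoring.
add_header all TextCatResults _TEXTCATRESULTS_
```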
perl v5.26.3 2021-0 Mail::SpamAssassin::Plugin::TextCat(3)