simplechinese package

Submodules

simplechinese.nlp module

simplechinese.nlp.extract_nouns(x, isList=False, split_mode=0, extract_mode='all', token='/')

Extract the nouns from a string, a pandas.Series, or a pandas.DataFrame. This function is still under development.

Args:

x: The content to be parsed. Either a string, a pandas.Series, or a pandas.DataFrame.

isList: A boolean. If True, the return value is a list (or lists) of nouns; otherwise it is a string (or strings) of nouns separated by the token.

token: The token used to separate words when isList is False.

split_mode: Controls how the text is split into words.
  0: No single-character words. Words may overlap.
  1: Single-character words allowed. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words allowed. Words do not overlap.
  4: Only single characters.

Returns:
The separated nouns in the input data.

simplechinese.nlp.extract_nums(x, isList=False, dtype=float)

Extract the numbers from a string, a pandas.Series, or a pandas.DataFrame.

Args:

x: The content to be parsed. Either a string, a pandas.Series, or a pandas.DataFrame.

isList: A boolean. If True, the return value is a list (or lists) of floats; otherwise it is a string (or strings) of numbers separated by spaces.

Returns:
The numbers in the input data.
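
The two return shapes described above can be illustrated on a single string with a small regular-expression sketch. This is not the library's implementation: sketch_extract_nums and its pattern are hypothetical stand-ins, and the real function may recognize a different set of number formats.

```python
import re

# Assumed number format: optional sign, then an integer or a decimal.
_NUMBER = re.compile(r"[-+]?(?:\d+\.\d+|\d+|\.\d+)")


def sketch_extract_nums(text, is_list=False, dtype=float):
    """Hypothetical stand-in that mirrors the documented return shapes."""
    found = _NUMBER.findall(text)
    if is_list:
        # List form: every match converted with dtype.
        return [dtype(n) for n in found]
    # String form: the raw matches joined by single spaces.
    return " ".join(found)


print(sketch_extract_nums("苹果3个,共12.5元", is_list=True))  # [3.0, 12.5]
print(sketch_extract_nums("苹果3个,共12.5元"))                # 3 12.5
```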

simplechinese.nlp.extract_words(x, isList=False, mode=0, token='/')

Extract the words from a string, a pandas.Series, or a pandas.DataFrame.

Args:

x: The content to be parsed. Either a string, a pandas.Series, or a pandas.DataFrame.

isList: A boolean. If True, the return value is a list (or lists) of words; otherwise it is a string (or strings) of words separated by the token.

token: The token used to separate words when isList is False.

mode: Controls how the text is split into words.
  0: No single-character words. Words may overlap.
  1: Single-character words allowed. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words allowed. Words do not overlap.
  4: Only single characters.

Returns:
The separated words in the input data.
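
One plausible reading of the overlapping versus non-overlapping distinction between modes 0 and 2 is sketched below with a toy dictionary. The functions all_matches and greedy_matches and the vocabulary are hypothetical; the library's actual segmenter is not described in this reference and may behave differently.

```python
def all_matches(text, vocab):
    """Every dictionary word of length >= 2 at any position; results may overlap."""
    max_len = max(map(len, vocab))
    words = []
    for start in range(len(text)):
        longest_end = min(len(text), start + max_len)
        for end in range(start + 2, longest_end + 1):
            if text[start:end] in vocab:
                words.append(text[start:end])
    return words


def greedy_matches(text, vocab):
    """Forward maximum matching: take the longest word, then continue after it."""
    max_len = max(map(len, vocab))
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in vocab:
                words.append(text[i:i + size])
                i += size
                break
        else:
            i += 1  # no multi-character word starts here; skip one character
    return words


vocab = {"北京", "大学", "北京大学", "学生"}  # toy dictionary (assumption)
print("/".join(all_matches("北京大学生", vocab)))     # 北京/北京大学/大学/学生
print("/".join(greedy_matches("北京大学生", vocab)))  # 北京大学
```

The joined output uses '/' to match the default token; with isList=True the lists themselves would be the analogous return value.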

simplechinese.preprocessing module

simplechinese.preprocessing.clean(x)

This function performs the following steps:

  1. fillna(): Fill the N/As in a pandas.DataFrame with empty strings.
  2. toLower(): Convert letters to lowercase.
  3. remove_punctuations(): Remove all punctuation from a string or a pandas.DataFrame.
  4. remove_space(): Remove all spaces from a string or a pandas.DataFrame.

simplechinese.preprocessing.fillna(x)

Fill the N/As in a pandas.DataFrame with empty strings.

Args:
x: The pandas.DataFrame to be processed.
Returns:
A pandas.DataFrame in which N/As are replaced with empty strings.

simplechinese.preprocessing.only_digits(x)

Keep only the digits in a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame that contains only digits.

simplechinese.preprocessing.only_en(x)

Keep only the English letters in a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame that contains only English letters.

simplechinese.preprocessing.only_zh(x)

Keep only the Chinese characters in a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame that contains only Chinese characters.
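
The three only_* filters can be pictured as character-class filters over a string. The sketch below is an assumption about their behavior, not the library's code: keep_only and the three patterns are hypothetical, and the Chinese range covers only the basic CJK Unified Ideographs block.

```python
import re

DIGITS = r"[0-9]"
ENGLISH = r"[A-Za-z]"
CHINESE = r"[\u4e00-\u9fff]"  # basic CJK Unified Ideographs block only (assumption)


def keep_only(text, pattern):
    """Keep the characters that match pattern and drop everything else."""
    return "".join(re.findall(pattern, text))


sample = "iPhone 15 发布会"
print(keep_only(sample, DIGITS))   # 15
print(keep_only(sample, ENGLISH))  # iPhone
print(keep_only(sample, CHINESE))  # 发布会
```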

simplechinese.preprocessing.remove_digits(x)

Remove all digits from a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame without digits.

simplechinese.preprocessing.remove_en(x)

Remove all English letters from a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame without English letters.

simplechinese.preprocessing.remove_punctuations(x)

Remove all punctuation from a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame without punctuation.

simplechinese.preprocessing.remove_space(x)

Remove all spaces from a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame without spaces.

simplechinese.preprocessing.remove_zh(x)

Remove all Chinese characters from a string or a pandas.DataFrame.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame without Chinese characters.

simplechinese.preprocessing.toLower(x)

Convert letters to lowercase.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame in which the letters are lowercase.

simplechinese.preprocessing.toUpper(x)

Convert letters to uppercase.

Args:
x: The content to be parsed. Either a string or a pandas.DataFrame.
Returns:
A new string or pandas.DataFrame in which the letters are uppercase.

simplechinese.representation module

simplechinese.representation.nmf(x, n_components=2)

Perform dimensionality reduction using non-negative matrix factorization (NMF). The input data should be a pandas.Series of vectors.


simplechinese.representation.pca(x, n_components=2)

Perform dimensionality reduction using principal component analysis (PCA). The input data should be a pandas.Series of vectors.


simplechinese.representation.term_frequency(x, mode=0, max_features=None, return_feature_names=False)

Extract the words and vectorize each element of the pandas.Series by the frequency of each word.

Args:

x: The pandas.Series to be parsed.

max_features: The maximum number of features.

return_feature_names: Whether to also return the token words.

mode: Controls how the text is split into words.
  0: No single-character words. Words may overlap.
  1: Single-character words allowed. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words allowed. Words do not overlap.
  4: Only single characters.

Returns:
The vectorization result.
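
As a rough illustration of term-frequency vectorization, the sketch below counts words in documents that are already split into tokens. sketch_term_frequency is hypothetical and is not the library's implementation; in particular, keeping the most frequent words when max_features is set and ordering columns by frequency are assumptions.

```python
from collections import Counter


def sketch_term_frequency(token_lists, max_features=None):
    """Hypothetical count vectorizer over pre-tokenized documents."""
    totals = Counter(tok for tokens in token_lists for tok in tokens)
    # Keep the most frequent words; ties stay in first-seen order.
    vocab = [word for word, _ in totals.most_common(max_features)]
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vectors, vocab


docs = [["天气", "很", "好"], ["天气", "不", "好", "天气"]]
vectors, vocab = sketch_term_frequency(docs)
print(vocab)    # ['天气', '好', '很', '不']
print(vectors)  # [[1, 1, 1, 0], [2, 1, 0, 1]]
```

Returning the vocabulary alongside the vectors corresponds to what return_feature_names=True is described as doing.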

simplechinese.representation.tfidf(x, mode=0, max_features=None, min_df=1, return_feature_names=False)

Extract the words and vectorize each element of the pandas.Series by TF-IDF scores.

Args:

x: The pandas.Series to be parsed.

max_features: The maximum number of features.

return_feature_names: Whether to also return the token words.

mode: Controls how the text is split into words.
  0: No single-character words. Words may overlap.
  1: Single-character words allowed. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words allowed. Words do not overlap.
  4: Only single characters.

Returns:
The vectorization result.
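
TF-IDF weighting can be sketched in pure Python as follows. This uses one common formulation (smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row); whether the library uses this exact weighting is an assumption, as is the reading of min_df as a minimum document count. sketch_tfidf is a hypothetical name.

```python
import math
from collections import Counter


def sketch_tfidf(token_lists, min_df=1):
    """Hypothetical TF-IDF over pre-tokenized documents (smoothed idf, L2 rows)."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each word appears.
    df = Counter(tok for tokens in token_lists for tok in set(tokens))
    vocab = sorted(word for word, count in df.items() if count >= min_df)
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}
    rows = []
    for tokens in token_lists:
        tf = Counter(tokens)
        row = [tf[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        rows.append([v / norm for v in row])
    return rows, vocab


rows, vocab = sketch_tfidf([["天气", "好"], ["天气", "冷"]])
# "天气" appears in both documents, so it is weighted below "好" in the first row.
print(rows[0][vocab.index("好")] > rows[0][vocab.index("天气")])  # True
```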

simplechinese.visualization module

simplechinese.visualization.wordcloud(x: pandas.core.series.Series, font_path: str = None, width: int = 400, height: int = 200, max_words=200, mask=None, contour_width=0, contour_color='white', background_color='white', relative_scaling='auto', colormap=None, return_figure=False)

Module contents