simplechinese package¶
Submodules¶
simplechinese.nlp module¶
-
simplechinese.nlp.extract_nouns(x, isList=False, split_mode=0, extract_mode='all', token='/')[source]¶ Extract the nouns from a string, a pandas.Series, or a pandas.DataFrame. This function is still under development.
- Args:
x: The content to be parsed. Either a string, a pandas.Series, or a pandas.DataFrame.
isList: A boolean. If True, the return value is a list (or lists) of nouns; otherwise, it is a string (or strings) of nouns separated by the token.
token: The token used to separate words when isList is False.
mode: One of the following:
  0: No single-character words. Words may overlap.
  1: Single-character words included. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words included. Words do not overlap.
  4: Only single characters.
- Returns:
- The separated nouns in the input data.
-
simplechinese.nlp.extract_nums(x, isList=False, dtype=<class 'float'>)[source]¶ Extract the numbers from a string, a pandas.Series, or a pandas.DataFrame.
- Args:
x: The content to be parsed. Either a string, a pandas.Series, or a pandas.DataFrame.
isList: A boolean. If True, the return value is a list (or lists) of floats; otherwise, it is a string (or strings) of numbers separated by spaces.
- Returns:
- The numbers in the input data.
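As a rough illustration of the two return shapes described above, here is a minimal sketch for a single string. The function name `extract_nums_sketch` and the regular expression are assumptions made for this example; this is not the simplechinese implementation, and the library may recognize number formats this pattern misses.

```python
import re

# Hypothetical pattern: optional sign, digits, optional decimal part.
_NUM = re.compile(r"[-+]?\d+(?:\.\d+)?")

def extract_nums_sketch(text, isList=False, dtype=float):
    """Illustrative stand-in for extract_nums on one string."""
    # Parse as float first so that "3.5" also works with dtype=int.
    nums = [dtype(float(m)) for m in _NUM.findall(text)]
    if isList:
        return nums
    # Otherwise join the numbers with single spaces.
    return " ".join(str(n) for n in nums)

print(extract_nums_sketch("重量3.5公斤，共12件", isList=True))
print(extract_nums_sketch("重量3.5公斤，共12件"))
```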
-
simplechinese.nlp.extract_words(x, isList=False, mode=0, token='/')[source]¶ Extract the words from a string, a pandas.Series, or a pandas.DataFrame.
- Args:
x: The content to be parsed. Either a string, a pandas.Series, or a pandas.DataFrame.
isList: A boolean. If True, the return value is a list (or lists) of words; otherwise, it is a string (or strings) of words separated by the token.
token: The token used to separate words when isList is False.
mode: One of the following:
  0: No single-character words. Words may overlap.
  1: Single-character words included. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words included. Words do not overlap.
  4: Only single characters.
- Returns:
- The separated words in the input data.
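The single-character rule of each mode and the isList/token output convention can be sketched as follows. This assumes the text has already been segmented by some tokenizer. The helper names `filter_by_mode` and `format_words` are hypothetical, overlap between words is not modelled (it depends on the segmenter), and treating mode 4 as a split into individual characters is a guess at the intended behaviour.

```python
def filter_by_mode(words, mode=0):
    """Apply only the single-character rule of each mode (sketch)."""
    if mode in (0, 2):   # no single-character words
        return [w for w in words if len(w) > 1]
    if mode in (1, 3):   # single-character words are kept
        return list(words)
    if mode == 4:        # assumption: break everything into characters
        return [ch for w in words for ch in w]
    raise ValueError(f"unknown mode: {mode}")

def format_words(words, isList=False, token="/"):
    """Return a list, or one string joined by the token."""
    return list(words) if isList else token.join(words)

# Hypothetical segmenter output for "我喜欢自然语言处理".
segmented = ["我", "喜欢", "自然", "语言", "处理"]
print(format_words(filter_by_mode(segmented, mode=0)))
print(format_words(filter_by_mode(segmented, mode=1), isList=True))
```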
simplechinese.preprocessing module¶
-
simplechinese.preprocessing.clean(x)[source]¶ This function does the following:
- fillna(): Fill the N/As in a pandas.DataFrame with an empty string.
- toLower(): Convert letters to lowercase.
- remove_punctuations(): Remove all punctuation from a string or a pandas.DataFrame.
- remove_space(): Remove all the spaces in a string or a pandas.DataFrame.
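The steps listed above can be sketched for a single string as below. This is an approximation under stated assumptions, not the library's code: `clean_sketch` is a hypothetical name, None stands in for a missing value, and punctuation is taken to mean any character in a Unicode "P" category, which may differ from the set the library removes.

```python
import unicodedata

def clean_sketch(text):
    """Chain the four cleaning steps on one value (illustrative only)."""
    if text is None:       # stands in for fillna() on a single cell
        text = ""
    text = text.lower()    # lowercase the letters
    # Drop characters whose Unicode category starts with "P" (punctuation).
    # Symbols such as "$" or "+" are in "S" categories and are kept here.
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    # Drop every whitespace character.
    return "".join(ch for ch in text if not ch.isspace())

print(clean_sketch("Hello, 世界! 123"))
```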
-
simplechinese.preprocessing.fillna(x)[source]¶ Fill the N/As in a pandas.DataFrame with an empty string.
- Args:
- x: The pandas.DataFrame to be parsed.
- Returns:
- A pandas.DataFrame without N/As, which are substituted with empty strings.
-
simplechinese.preprocessing.only_digits(x)[source]¶ Keep only the digits in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or pandas.DataFrame that contains only digits.
-
simplechinese.preprocessing.only_en(x)[source]¶ Keep only the English letters in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or pandas.DataFrame that contains only English letters.
-
simplechinese.preprocessing.only_zh(x)[source]¶ Keep only the Chinese characters in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or pandas.DataFrame that contains only Chinese characters.
-
simplechinese.preprocessing.remove_digits(x)[source]¶ Remove all the digits in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or a pandas.DataFrame without digits.
-
simplechinese.preprocessing.remove_en(x)[source]¶ Remove all the English letters in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or a pandas.DataFrame without English letters.
-
simplechinese.preprocessing.remove_punctuations(x)[source]¶ Remove all punctuation from a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or a pandas.DataFrame without punctuation.
-
simplechinese.preprocessing.remove_space(x)[source]¶ Remove all the spaces in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or a pandas.DataFrame without spaces.
-
simplechinese.preprocessing.remove_zh(x)[source]¶ Remove all the Chinese characters in a string or a pandas.DataFrame.
- Args:
- x: The content to be parsed. Either a string or a pandas.DataFrame.
- Returns:
- A new string or a pandas.DataFrame without Chinese characters.
simplechinese.representation module¶
-
simplechinese.representation.nmf(x, n_components=2)[source]¶ Perform dimensionality reduction using non-negative matrix factorization (NMF). The input data should be a pandas.Series of vectors.
-
simplechinese.representation.pca(x, n_components=2)[source]¶ Perform dimensionality reduction using principal component analysis (PCA). The input data should be a pandas.Series of vectors.
-
simplechinese.representation.term_frequency(x, mode=0, max_features=None, return_feature_names=False)[source]¶ Extract the words and vectorize each element in the pandas.Series by the frequency of each word.
- Args:
x: The pandas.Series to be parsed.
max_features: The maximum number of features.
return_feature_names: Whether to return the feature names (the token words).
mode: One of the following:
  0: No single-character words. Words may overlap.
  1: Single-character words included. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words included. Words do not overlap.
  4: Only single characters.
- Returns:
- The vectorization result.
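To make the idea of term-frequency vectorization concrete, here is a small pure-Python sketch over documents that are already segmented into tokens. The name `term_frequency_sketch`, the list-of-lists input, the tie-breaking rule, and the tuple return shape are all assumptions made for illustration; the library takes a pandas.Series, and its return type may differ.

```python
from collections import Counter

def term_frequency_sketch(docs, max_features=None, return_feature_names=False):
    """Count-vectorize pre-tokenized documents (illustrative only)."""
    totals = Counter(tok for doc in docs for tok in doc)
    # Rank terms by total count; break ties alphabetically for determinism.
    ranked = sorted(totals, key=lambda t: (-totals[t], t))
    # A slice with None keeps every term.
    features = sorted(ranked[:max_features])
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[f] for f in features])
    if return_feature_names:
        return vectors, features
    return vectors

docs = [["苹果", "香蕉", "苹果"], ["香蕉", "橙子"]]
vectors, names = term_frequency_sketch(docs, return_feature_names=True)
for row in vectors:
    print(dict(zip(names, row)))
```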
-
simplechinese.representation.tfidf(x, mode=0, max_features=None, min_df=1, return_feature_names=False)[source]¶ Extract the words and vectorize each element in the pandas.Series by the TF-IDF scores.
- Args:
x: The pandas.Series to be parsed.
max_features: The maximum number of features.
return_feature_names: Whether to return the feature names (the token words).
mode: One of the following:
  0: No single-character words. Words may overlap.
  1: Single-character words included. Words may overlap.
  2: No single-character words. Words do not overlap.
  3: Single-character words included. Words do not overlap.
  4: Only single characters.
- Returns:
- The vectorization result.
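For readers unfamiliar with TF-IDF, the sketch below computes scores for pre-tokenized documents using the textbook form, raw term count multiplied by log(N / df). The function name `tfidf_sketch` and the list-of-lists input are assumptions, and the library's exact weighting (smoothing, normalization) is not stated here and may differ. The min_df argument is interpreted as the minimum number of documents a term must appear in to be kept.

```python
import math
from collections import Counter

def tfidf_sketch(docs, min_df=1):
    """Textbook TF-IDF over pre-tokenized documents (illustrative only)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for doc in docs for tok in set(doc))
    features = sorted(t for t, c in df.items() if c >= min_df)
    idf = {t: math.log(n_docs / df[t]) for t in features}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] * idf[t] for t in features])
    return vectors, features

docs = [["苹果", "香蕉", "苹果"], ["香蕉", "橙子"]]
vectors, names = tfidf_sketch(docs)
for row in vectors:
    print(dict(zip(names, row)))
```

Under this unsmoothed form, a term that occurs in every document gets a weight of zero, which is one reason practical implementations often add smoothing.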