fastNLP.io

fastNLP.io.base_loader

class fastNLP.io.base_loader.BaseLoader[source]

Base class for all loaders.

classmethod load(data_path)[source]

Read the file line by line, strip the whitespace on both sides of each line, then extract the characters of each line. Returns a list of list of str.
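
Example (a minimal usage sketch; the file path is hypothetical):

from fastNLP.io.base_loader import BaseLoader

data = BaseLoader.load("./data.txt")
# data is a list of list of str, one inner list of characters per line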

static load_lines(data_path)[source]

Read the file line by line, stripping the whitespace on both sides of each line. Returns a list of str.

classmethod load_with_cache(data_path, cache_path)[source]

A cached version of load.

class fastNLP.io.base_loader.DataLoaderRegister[source]

A registry for all dataset loaders.

fastNLP.io.config_io

class fastNLP.io.config_io.ConfigLoader(data_path=None)[source]

Loader for configuration.

Parameters:data_path (str) – path to the config
static load_config(file_path, sections)[source]

Load the requested section(s) of the configuration file into the provided ConfigSection objects. Returns nothing.

Parameters:
  • file_path (str) – the path of config file
  • sections (dict) – the dict of {section_name(string): ConfigSection object}

Example:

test_args = ConfigSection()
ConfigLoader("config.cfg").load_config("./data_for_tests/config", {"POS_test": test_args})
class fastNLP.io.config_io.ConfigSaver(file_path)[source]

ConfigSaver is used to save a config file and resolve related conflicts.

Parameters:file_path (str) – path to the config file
save_config_file(section_name, section)[source]

Call this function to update the config file with a single section, given its name.

Parameters:
  • section_name (str) – The name of the section to be changed and saved.
  • section (ConfigSection) – The ConfigSection holding the key-value pairs to be changed and saved.
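
Example (a hypothetical usage sketch; the file path, section name, and the key-value assignment are assumptions):

from fastNLP.io.config_io import ConfigSaver, ConfigSection

section = ConfigSection()
section["epoch"] = 10  # assumes ConfigSection supports item assignment
ConfigSaver("./config.cfg").save_config_file("train", section)
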
class fastNLP.io.config_io.ConfigSection[source]

ConfigSection is the data structure storing all key-value pairs in one section in a config file.

fastNLP.io.dataset_loader

class fastNLP.io.dataset_loader.Conll2003Loader[source]

Loader for the CoNLL-2003 dataset

More information about the dataset can be found at https://sites.google.com/site/ermasoftware/getting-started/ne-tagging-conll2003-data

convert(parsed_data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(dataset_path)[source]

Load data from a given file.

Parameters:dataset_path (str) – file path
Returns:a DataSet object
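
Example (a hypothetical usage sketch; the dataset path is an assumption):

from fastNLP.io.dataset_loader import Conll2003Loader

loader = Conll2003Loader()
dataset = loader.load("./conll2003/train.txt")  # returns a DataSet object
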
class fastNLP.io.dataset_loader.ConllLoader[source]

Loader for CoNLL-format files

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(data_path)[source]

Load data from a given file.

Parameters:data_path (str) – file path
Returns:a DataSet object
static parse(lines)[source]
Parameters:lines (list) – a list containing all lines of a CoNLL file.
Returns:a 3D list
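
A minimal sketch of the returned structure (the file path is hypothetical, and the ordering of the three levels is an assumption):

from fastNLP.io.dataset_loader import ConllLoader

with open("./sample.conll") as f:
    parsed = ConllLoader.parse(f.readlines())
# plausibly parsed[i][j] holds the column values of token j in sentence i
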
class fastNLP.io.dataset_loader.ConllxDataLoader[source]

Returns word-level label information: the word, its part-of-speech tag, its (syntactic) head, and its (syntactic) arc label. Completely different from ``ZhConllPOSReader``.

class fastNLP.io.dataset_loader.DataSetLoader[source]

Interface for all DataSetLoaders.

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(path)[source]

Load data from a given file.

Parameters:path (str) – file path
Returns:a DataSet object
class fastNLP.io.dataset_loader.DummyCWSReader[source]

Load the PKU dataset for Chinese word segmentation.

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(data_path, max_seq_len=32)[source]

Load the PKU dataset for Chinese word segmentation (CWS). The PKU training data format: (1) each line is a sentence; (2) words in a sentence are separated by spaces. This function converts the PKU dataset into three-level lists with <BMES> labels: B – beginning of a word; M – middle of a word; E – end of a word; S – single character.

Parameters:
  • data_path (str) – path to the data set.
  • max_seq_len – int, the maximum length of a sequence. If a sequence is longer than it, split it into several sequences.
Returns:

three-level lists
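
A minimal sketch of the <BMES> labeling scheme described above (plain Python, not the loader's internal code):

def bmes_tags(word):
    # one-character word -> S; otherwise B, M..., E
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["M"] * (len(word) - 2) + ["E"]

words = "这是 一个 例子".split()
chars = [ch for w in words for ch in w]
tags = [t for w in words for t in bmes_tags(w)]
# chars: ['这', '是', '一', '个', '例', '子']
# tags:  ['B', 'E', 'B', 'E', 'B', 'E']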

class fastNLP.io.dataset_loader.DummyClassificationReader[source]

Loader for a dummy classification data set

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(data_path)[source]

Load data from a given file.

Parameters:data_path (str) – file path
Returns:a DataSet object
static parse(lines)[source]

The first token of each line is the label; the remaining tokens are characters/words, separated by spaces.

Parameters:lines – lines from dataset
Returns:list(list(list())) – the three levels of lists are words, sentences, and the dataset
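
A minimal sketch of the line format described above (plain Python, not the loader's code):

line = "positive this movie is great"
tokens = line.split()
label, words = tokens[0], tokens[1:]
# label: 'positive'; words: ['this', 'movie', 'is', 'great']
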
class fastNLP.io.dataset_loader.DummyLMReader[source]

A Dummy Language Model Dataset Reader

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(data_path)[source]

Load data from a given file.

Parameters:data_path (str) – file path
Returns:a DataSet object
class fastNLP.io.dataset_loader.DummyPOSReader[source]

A simple reader for a dummy POS tagging dataset.

In these datasets, each line is split by whitespace: the first column is the word and the second column is its label. Sentences are separated by an empty line.

E.g:

Tom label1
and label2
Jerry   label1
.   label3
(separated by an empty line)
Hello   label4
world   label5
!   label3

In this example, there are two sentences “Tom and Jerry .” and “Hello world !”. Each word has its own label.

convert(data)[source]

Convert lists of strings into Instances with Fields.

load(data_path)[source]
Returns:

a three-level list. Example:

[
    [ [word_11, word_12, ...], [label_1, label_1, ...] ],
    [ [word_21, word_22, ...], [label_2, label_1, ...] ],
    ...
]
class fastNLP.io.dataset_loader.NaiveCWSReader(in_word_splitter=None)[source]

This reader assumes that the word segmentation dataset has the following form, i.e. the content has already been split by spaces. For example:

这是 fastNLP , 一个 非常 good   .

Alternatively, each part may be followed by a POS tag. For example:

也/D  在/P  團員/Na  之中/Ng  ,/COMMACATEGORY
load(filepath, in_word_splitter=None, cut_long_sent=False)[source]
The accepted input formats are as follows (by default, whitespace is used as the separator):
这是 fastNLP , 一个 非常 good 的 包 .
也/D 在/P 團員/Na 之中/Ng ,/COMMACATEGORY

If the splitter is not None, the second format is assumed; each token such as "也/D" is split by the splitter and the first part is taken, e.g. "也/D".split('/')[0].

Parameters:
  • filepath (str) – path to the data file
  • in_word_splitter (str) – the in-token splitter (e.g. '/'); if not None, it overrides the one given at construction
  • cut_long_sent (bool) – whether to cut long sentences into shorter ones
Returns:

class fastNLP.io.dataset_loader.NativeDataSetLoader[source]

A simple example of a DataSetLoader

load(path)[source]

Load data from a given file.

Parameters:path (str) – file path
Returns:a DataSet object
class fastNLP.io.dataset_loader.PeopleDailyCorpusLoader[source]

Loader for the People's Daily (人民日报) dataset.

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(data_path, pos=True, ner=True)[source]
Parameters:
  • data_path (str) – path to the data set
  • pos (bool) – whether to use part-of-speech tags
  • ner (bool) – whether to use named-entity tags
Returns:

a DataSet object
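
Example (a hypothetical usage sketch; the data path is an assumption):

from fastNLP.io.dataset_loader import PeopleDailyCorpusLoader

loader = PeopleDailyCorpusLoader()
dataset = loader.load("./people_daily.txt", pos=True, ner=True)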

class fastNLP.io.dataset_loader.RawDataSetLoader[source]

A simple example of a raw data reader

convert(data)[source]

Optional operation to build a DataSet.

Parameters:data – inner data structure (user-defined) to represent the data.
Returns:a DataSet object
load(data_path, split=None)[source]

Load data from a given file.

Parameters:data_path (str) – file path
Returns:a DataSet object
class fastNLP.io.dataset_loader.SNLIDataSetReader[source]

A data set loader for SNLI data set.

convert(data)[source]

Convert a 3D list to a DataSet object.

Parameters:data

A 3D list. Example:

[
    [ [premise_word_11, premise_word_12, ...], [hypothesis_word_11, hypothesis_word_12, ...], [label_1] ],
    [ [premise_word_21, premise_word_22, ...], [hypothesis_word_21, hypothesis_word_22, ...], [label_2] ],
    ...
]
Returns:A DataSet object.
load(path_list)[source]
Parameters:path_list (list) – a list of file names, in the order: premise file, hypothesis file, label file.
Returns:A DataSet object.
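
Example (a hypothetical usage sketch; the file names are assumptions):

from fastNLP.io.dataset_loader import SNLIDataSetReader

reader = SNLIDataSetReader()
dataset = reader.load(["premise.txt", "hypothesis.txt", "label.txt"])
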
class fastNLP.io.dataset_loader.ZhConllPOSReader[source]

Reads Chinese CoNLL-format data. Returns character-level tags, extending the original word-level tags with the BMES scheme.

load(path)[source]
The returned DataSet contains the following fields:
words: list of str; tag: list of str, with BMES tags added. For example, an original word-level tag sequence ['VP', 'NN', 'NN', ...] becomes the character-level sequence ['S-VP', 'B-NN', 'M-NN', ...].

The input is assumed to be in CoNLL format, with sentences separated by an empty line and 7 columns per line, i.e.:

1   编者按     编者按     NN      O       11      nmod:topic
2   :       :       PU      O       11      punct
3   7月      7月      NT      DATE    4       compound:nn
4   12日     12日     NT      DATE    11      nmod:tmod
5   ,       ,       PU      O       11      punct

1   这       这       DT      O       3       det
2   款       款       M       O       1       mark:clf
3   飞行      飞行      NN      O       8       nsubj
4   从       从       P       O       5       case
5   外型      外型      NN      O       8       nmod:prep
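
A minimal sketch of the BMES tag expansion described above (plain Python, not the reader's code):

def expand_tags(words, word_tags):
    # expand each word-level tag into per-character BMES-prefixed tags
    char_tags = []
    for word, tag in zip(words, word_tags):
        if len(word) == 1:
            char_tags.append("S-" + tag)
        else:
            char_tags.append("B-" + tag)
            char_tags.extend("M-" + tag for _ in word[1:-1])
            char_tags.append("E-" + tag)
    return char_tags

expand_tags(["编者按", ":"], ["NN", "PU"])  # ['B-NN', 'M-NN', 'E-NN', 'S-PU']
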
fastNLP.io.dataset_loader.add_seg_tag(data)[source]
Parameters:data – list of ([word], [pos], [heads], [head_tags])
Returns:list of ([word], [pos])
fastNLP.io.dataset_loader.convert_seq2seq_dataset(data)[source]

Convert a list of data into a DataSet.

Parameters:data

list of list of strings, [num_examples, *]. Example:

[
    [ [word_11, word_12, ...], [label_1, label_1, ...] ],
    [ [word_21, word_22, ...], [label_2, label_1, ...] ],
    ...
]
Returns:a DataSet.
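
Example (a hypothetical usage sketch matching the format above):

from fastNLP.io.dataset_loader import convert_seq2seq_dataset

data = [
    [["I", "like", "apples"], ["PRP", "VBP", "NNS"]],
]
dataset = convert_seq2seq_dataset(data)
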
fastNLP.io.dataset_loader.convert_seq2tag_dataset(data)[source]

Convert a list of data into a DataSet.

Parameters:data

list of list of strings, [num_examples, *]. Example:

[
    [ [word_11, word_12, ...], label_1 ],
    [ [word_21, word_22, ...], label_2 ],
    ...
]
Returns:a DataSet.
fastNLP.io.dataset_loader.convert_seq_dataset(data)[source]

Create a DataSet instance that contains no labels.

Parameters:data

list of list of strings, [num_examples, *]. Example:

[
    [word_11, word_12, ...],
    ...
]
Returns:a DataSet.
fastNLP.io.dataset_loader.cut_long_sentence(sent, max_sample_length=200)[source]

Split a sentence longer than max_sample_length into several segments. Cuts happen only at spaces, so the resulting segments may be longer or shorter than max_sample_length.

Parameters:
  • sent (str) – the sentence to be split.
  • max_sample_length (int) – the maximum sample length.
Returns:

list of str.
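
A minimal sketch of the splitting behavior described above (plain Python, not the library's implementation):

def cut_at_spaces(sent, max_sample_length=200):
    # greedily pack space-separated parts into segments near max_sample_length;
    # cuts happen only at spaces, so segments may overshoot or undershoot
    segments, current = [], ""
    for part in sent.split(" "):
        candidate = part if not current else current + " " + part
        if len(candidate) > max_sample_length and current:
            segments.append(current)
            current = part
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments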

fastNLP.io.embed_loader

class fastNLP.io.embed_loader.EmbedLoader[source]

Loader for pre-trained embeddings.

static fast_load_embedding(emb_dim, emb_file, vocab)[source]

Quickly load a pre-trained embedding and combine it with the given vocabulary. This method reads the embedding file line by line.

Parameters:
  • emb_dim (int) – the dimension of the embedding; must match the pre-trained embedding.
  • emb_file (str) – the pre-trained embedding file path.
  • vocab (Vocabulary) – a mapping from word to index; can be provided by the user or built from the pre-trained embedding.
Return embedding_matrix:

numpy.ndarray
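
Example (a hypothetical usage sketch; the import path of Vocabulary and the embedding file are assumptions):

from fastNLP.core.vocabulary import Vocabulary
from fastNLP.io.embed_loader import EmbedLoader

vocab = Vocabulary()
vocab.update(["hello", "world"])  # assumes Vocabulary exposes an update method
matrix = EmbedLoader.fast_load_embedding(50, "./glove.6B.50d.txt", vocab)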

static load_embedding(emb_dim, emb_file, emb_type, vocab)[source]

Load the pre-trained embedding and combine with the given dictionary.

Parameters:
  • emb_dim (int) – the dimension of the embedding; must match the pre-trained embedding.
  • emb_file (str) – the pre-trained embedding file path.
  • emb_type (str) – the pre-trained embedding format; only 'glove' is supported for now.
  • vocab (Vocabulary) – a mapping from word to index; can be provided by the user or built from the pre-trained embedding.
Return (embedding_tensor, vocab):

embedding_tensor – a Tensor of shape (len(word_dict), emb_dim); vocab – the input vocab, or a vocab built from the pre-trained embedding

fastNLP.io.logger

fastNLP.io.logger.create_logger(logger_name, log_path, log_format=None, log_level=20)[source]

Create a logger.

Parameters:
  • logger_name (str) – the name of the logger
  • log_path (str) – the path of the log file
  • log_format – the format of log messages
  • log_level – the logging level (default 20, i.e. logging.INFO)
Returns:

logger
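
Example (a hypothetical usage sketch; the logger name and log path are assumptions):

from fastNLP.io.logger import create_logger

logger = create_logger("fastNLP", "./train.log", log_level=20)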

To use a logger:

logger.debug("this is a debug message")
logger.info("this is an info message")
logger.warning("this is a warning message")
logger.error("this is an error message")

fastNLP.io.model_io

class fastNLP.io.model_io.ModelLoader[source]

Loader for models.

static load_pytorch(empty_model, model_path)[source]

Load model parameters from a ".pkl" file into the given PyTorch model.

Parameters:
  • empty_model – a constructed PyTorch model; its initialized parameters will be overwritten by the loaded ones.
  • model_path (str) – the path to the saved model.
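
Example (a hypothetical usage sketch; the model class and the checkpoint path are assumptions):

from fastNLP.io.model_io import ModelLoader

model = MyModel()  # a hypothetical PyTorch model with initialized parameters
ModelLoader.load_pytorch(model, "./save/model_ckpt_100.pkl")
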
static load_pytorch_model(model_path)[source]

Load the entire model.

Parameters:model_path (str) – the path to the saved model.
class fastNLP.io.model_io.ModelSaver(save_path)[source]

Saver for models.

Parameters:save_path (str) – the path of the file to save the model to.

Example:

saver = ModelSaver("./save/model_ckpt_100.pkl")
saver.save_pytorch(model)
save_pytorch(model, param_only=True)[source]

Save a PyTorch model into a ".pkl" file.

Parameters:
  • model – a PyTorch model
  • param_only (bool) – whether to save only the model parameters (True) or the entire model (False).