Deep Learning for Classical Japanese Literature


Tarin Clanuwat, Center for Open Data in the Humanities
Mikel Bober-Irizar, Royal Grammar School, Guildford
Asanobu Kitamoto, Center for Open Data in the Humanities
Alex Lamb, MILA, Université de Montréal
Kazuaki Yamamoto, National Institute of Japanese Literature
David Ha, Google Brain


Much of machine learning research focuses on producing models which perform well on benchmark tasks, in turn improving our understanding of the challenges associated with those tasks. From the perspective of ML researchers, the content of the task itself is largely irrelevant, and thus there have increasingly been calls for benchmark tasks to focus more heavily on problems which are of social or cultural relevance. In this work, we introduce Kuzushiji-MNIST, a dataset which focuses on Kuzushiji (cursive Japanese), as well as two larger, more challenging datasets, Kuzushiji-49 and Kuzushiji-Kanji. Through these datasets, we wish to draw the machine learning community into the world of classical Japanese literature.

1 Introduction

Recorded historical documents give us a peek into the past. We are able to glimpse the world before our time, and see its culture, norms, and values to reflect on our own. Japan has a unique historical pathway. Historically, Japan and its culture were relatively isolated from the West until the Meiji Restoration in 1868, when Japanese leaders reformed the education system to modernize the country's culture. This caused drastic changes in the Japanese language, writing and printing systems. Due to the modernization of the Japanese language in this era, cursive Kuzushiji (くずし字) script is no longer taught in the official school curriculum. Even though Kuzushiji had been used for over 1000 years, most Japanese natives today cannot read books written or published over 150 years ago [10, 20].

Figure 1: Most Japanese cannot read books over 150 years old, written in cursive Kuzushiji style. The 10 classes of Kuzushiji-MNIST, first column showing the modern Hiragana counterpart (left). Example of a Kuzushiji literature scroll, Genjimonogatari Uta Awase [21] (right).

Corresponding author: [email protected], Center for Open Data in the Humanities, Tokyo, Japan.
32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

arXiv:1812.01718v1 [cs.CV] 3 Dec 2018

Figure 2: The difference between a text printed in 1772 and one printed in 1900. Onna Daigaku [13] is a book for women in the Edo period (left). Shinpen Shushinkyouten Vol.3 [6] is a textbook from right after the standardization of Japanese in 1900 (right).

According to the General Catalog of National Books [19], over 1.7 million books were written or published in Japan prior to 1867. In addition to the books registered in the national catalog, we estimate that in total there are over 3 million unregistered books and a billion historical documents preserved nationwide. Despite ongoing efforts to create digital copies of these documents (a safeguard against fires, earthquakes, and tsunamis), most of the knowledge, history, and culture contained within these texts remains inaccessible to the general public. While we have many digitized copies of manuscripts and books, only a small number of people with Kuzushiji education are able to read them and work on them, leading to a huge dataset of Japanese cultural works which cannot be read by non-experts.

Figure 3: Our domain transfer experiment, generating Modern Kanji from the Kuzushiji Kanji for unseen characters (Section 3.2). Panels: Kuzushiji Kanji, pixel prediction, stroke predictions, Modern Kanji.

In this paper we introduce a dataset specifically made for machine learning research, to engage the community with the field of Japanese literature. In this work, we release three easy-to-use preprocessed datasets: Kuzushiji-MNIST, a dataset which focuses on Kuzushiji (cursive Japanese), as well as two larger, more challenging datasets, Kuzushiji-49 and Kuzushiji-Kanji. Kuzushiji-MNIST is designed as a drop-in replacement for the MNIST [16] dataset. In addition, we present baseline classification results on Kuzushiji-MNIST and Kuzushiji-49 using recent models, and also apply generative modelling to a domain transfer task between unseen Kuzushiji Kanji and Modern Kanji (see Figure 3). Through these datasets and experiments, we wish to introduce the machine learning community to the world of classical Japanese literature.^2

2 Kuzushiji Dataset

The Kuzushiji dataset is created by the National Institute of Japanese Literature (NIJL), and is curated by the Center for Open Data in the Humanities (CODH). In 2014, NIJL and other institutes began a national project to digitize about 300,000 old Japanese books, transcribing some of them, and sharing them as open data to promote international collaboration. During the transcription process, a bounding box was created for each character, but literature scholars did not think they were worth sharing. From a machine learning perspective, CODH suggested making a separate dataset of the bounding boxes on each page, because it can be used as the basis for many machine learning challenges and for working towards automated transcription. As a result, the full Kuzushiji dataset was released in November 2016, and the dataset now contains 3,999 character types and 403,242 characters [5].

(^2) Location of dataset with instructions:

Figure 4: In addition to archives, classical books are circulated in manuscript bookstores, online auctions and the annual manuscript auction event held in Jimbocho, Tokyo. Ohya Shobo bookstore in Jimbocho (left). Edo books sold at the Sunday flea market at the Hanazono Shrine, Tokyo (right).

Our hope is that through releasing datasets in familiar formats, we can encourage dialog between the ML and Japanese literature communities. We pre-processed characters scanned from 35 classical books printed in the 18th century and organized the dataset into 3 parts: (1) Kuzushiji-MNIST, a drop-in replacement for the MNIST [16] dataset, (2) Kuzushiji-49, a much larger, but imbalanced dataset containing 48 Hiragana characters and one Hiragana iteration mark, and (3) Kuzushiji-Kanji, an imbalanced dataset of 3832 Kanji characters, including rare characters with very few samples.

Hiragana    Unicode   Samples
お (o)      U+304A    7000
き (ki)     U+304D    7000
す (su)     U+3059    7000
つ (tsu)    U+3064    7000
な (na)     U+306A    7000
は (ha)     U+306F    7000
ま (ma)     U+307E    7000
や (ya)     U+3084    7000
れ (re)     U+308C    7000
を (wo)     U+3092    7000

Figure 5: The 10 classes of Kuzushiji-MNIST. Train and test set sizes are 6,000 and 1,000 per class.

Figure 6: The hentaigana for Ka (か) can be written using 12 different root characters (jibo, in red) [15], with some of these root characters themselves having multiple ways of being written. Many of the characters in our datasets have multiple ways of being written, so successful models need to be able to model the multi-modal distribution of each class, making the problem more challenging.

Since MNIST restricts us to 10 classes, far fewer than the 49 needed to fully represent Kuzushiji Hiragana, we chose one character to represent each of the 10 rows of Hiragana when creating Kuzushiji-MNIST. One characteristic of classical Japanese which is very different from modern Japanese is that it contains Hentaigana (変体仮名). Hentaigana, or variant kana, are Hiragana characters that have more than one form of writing, as they were derived from different Kanji. Therefore, one Hiragana class of Kuzushiji-MNIST or Kuzushiji-49 may have many characters mapped to it. For instance, as seen in Figure 5, there are 3 different ways to write the same character, because it was derived from different Kanji. Another example of this many-to-one mapping is shown in Figure 6. Even though Kuzushiji-MNIST was created as a drop-in replacement for the MNIST dataset, the characteristics of Hentaigana and Arabic numerals are completely different, which is one reason why we believe the Kuzushiji-MNIST dataset is more challenging than MNIST.

Figure 7: Examples of some of the 3832 classes in Kuzushiji-Kanji.

The high class imbalance in Kuzushiji-49 and Kuzushiji-Kanji reflects how often each character appears in the real source books, and is kept that way to represent the real data distribution. Kuzushiji-49, as the name suggests, has 49 classes (266,407 images) and Kuzushiji-Kanji has a total of 3832 classes (140,426 images), ranging from 1,766 examples to only a single example per class. Kuzushiji-MNIST and Kuzushiji-49 consist of grayscale images of 28×28 pixel resolution, consistent with the MNIST dataset, while the Kuzushiji-Kanji images are of a larger 64×64 pixel resolution.

Hiragana              Unicode   Samples
あ (a)                U+3042
い (i)                U+3044
う (u)                U+3046
え (e)                U+3048
お (o)                U+304A    7000
か (ka)               U+304B    7000
き (ki)               U+304D    7000
く (ku)               U+304F    7000
け (ke)               U+3051
こ (ko)               U+3053
さ (sa)               U+3055    7000
し (shi)              U+3057
す (su)               U+3059
せ (se)               U+305B    4843
そ (so)               U+305D    4496
た (ta)               U+305F    7000
ち (chi)              U+3061
つ (tsu)              U+3064
て (te)               U+3066    7000
と (to)               U+3068
な (na)               U+306A    7000
に (ni)               U+306B    7000
ぬ (nu)               U+306C    2399
ね (ne)               U+306D    2850
の (no)               U+306E    7000
は (ha)               U+306F    7000
ひ (hi)               U+3072
ふ (fu)               U+3075
へ (he)               U+3078
ほ (ho)               U+307B    2317
ま (ma)               U+307E    7000
み (mi)               U+307F    3558
む (mu)               U+3080    1998
め (me)               U+3081
も (mo)               U+3082
や (ya)               U+3084    7000
ゆ (yu)               U+3086
よ (yo)               U+3088
ら (ra)               U+3089
り (ri)               U+308A    7000
る (ru)               U+308B    7000
れ (re)               U+308C    7000
ろ (ro)               U+308D    2487
わ (wa)               U+308F    2787
ゐ (i)                U+3090
ゑ (e)                U+3091
を (wo)               U+3092    7000
ん (n)                U+3093
ゝ (iteration mark)   U+309D    4097

Figure 8: Kuzushiji-49 description. Training/Test split is 6/7 and 1/7 of each class respectively.
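
A per-class split of this kind can be sketched in a few lines of plain Python. This is only an illustration, not the preprocessing code used to build the datasets: the helper name `stratified_split`, the fixed seed, and the rounding rule (test count rounded to the nearest integer, with at least one sample kept for training) are all assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=1 / 7, seed=0):
    """Split sample indices so every class keeps roughly the same test share.

    Hypothetical helper: the rounding rule below is an assumption and is
    not taken from the paper.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        # Never move the last remaining sample of a class into the test set.
        n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: one large class, one smaller class, one single-sample class.
labels = [0] * 7000 + [1] * 700 + [2]
train_idx, test_idx = stratified_split(labels)
```

Splitting within each class, rather than over the pooled samples, keeps rare classes represented on both sides in proportion to their size.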

In all three datasets, the characters in the train and test sets are sampled from the same 35 books, meaning the data distributions of each class are consistent between the two sets. While Kuzushiji-MNIST is balanced across classes, Kuzushiji-49 has several rare characters with a small number of samples (such as one which has only 400 samples).

On the other hand, Kuzushiji-Kanji is a highly imbalanced dataset due to the natural frequency of Kanji appearing in the Kuzushiji literature. In Kuzushiji-Kanji, the number of samples per class ranges from over a thousand to only one. This dataset is created for more creative experimental tasks rather than merely for classification and character recognition benchmarks.

Our design of a drop-in replacement for MNIST was inspired by the popular Fashion-MNIST [25], a dataset of fashion items that is considerably more difficult than the original MNIST dataset, while maintaining ease of use. One aspect of Fashion-MNIST that we believe decreases model performance compared to MNIST is that many fashion items, such as shirts, T-shirts, or coats, look very similar at 28×28 pixel resolution in grayscale, making many samples ambiguous even for humans (human performance on Fashion-MNIST is only 83.5% [24]). A characteristic of Kuzushiji-MNIST that makes it more difficult than MNIST is that there are in fact multiple very different ways to write certain characters, while each way of writing is still unambiguous at 28×28 pixel resolution for human readers, meaning we believe there is less of a performance cap. Another difference is that while fashion trends come and go, and what constitutes a shirt may be different a hundred years from now, Kuzushiji will always remain Kuzushiji. We believe both Fashion-MNIST and Kuzushiji-MNIST will be useful companions to the original MNIST dataset for the research community.

3 Experiments

3.1 Classification Baselines for Kuzushiji-MNIST and Kuzushiji-49

Model                                    MNIST [16]   Kuzushiji-MNIST   Kuzushiji-49
4-Nearest Neighbour Baseline             97.14%       91.56%            86.01%
Keras Simple CNN Benchmark [4]           99.06%       95.12%            89.25%
PreActResNet-18 [11]                     99.56%       97.82%            96.64%
PreActResNet-18 + Input Mixup [26]       99.54%       98.41%            97.04%
PreActResNet-18 + Manifold Mixup [22]    99.54%       98.83%            97.33%

Table 1: Test set accuracy, computed as mean of per-class accuracies to address class imbalance.
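
To make the metric in the caption of Table 1 concrete, the following minimal sketch computes the mean of per-class accuracies in plain Python and contrasts it with ordinary accuracy on a toy imbalanced example. The function name `mean_per_class_accuracy` and the toy labels are illustrative assumptions, not taken from the benchmark code.

```python
from collections import defaultdict


def mean_per_class_accuracy(y_true, y_pred):
    """Average the accuracy of each class, so rare classes count equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    if not total:
        raise ValueError("y_true must contain at least one label")
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example (made-up labels): class 0 is common, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced_accuracy = mean_per_class_accuracy(y_true, y_pred)
print(plain_accuracy, balanced_accuracy)  # 0.9 0.75
```

On this toy data the single mistake on the rare class barely moves plain accuracy but halves that class's accuracy, which is why a per-class mean is the more informative number for an imbalanced dataset such as Kuzushiji-49.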

We present baseline classification results on Kuzushiji-MNIST and Kuzushiji-49 in Table 1. We consider 4 different baselines: a simple 4-nearest neighbours algorithm, a small 2-layer convolutional network, an 18-layer ResNet [11], and a ResNet that incorporates a manifold mixup regularizer [22]. For the training setup details, please refer to the GitHub repository that contains the dataset. By comparing the performance numbers to the original MNIST dataset using various approaches, we hope these results will provide a sense of the relative difficulty of our dataset.

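
For readers unfamiliar with the simplest of these baselines, here is a minimal sketch of a 4-nearest-neighbour classifier in plain Python. The text does not specify the distance metric or tie-breaking rule, so Euclidean distance on raw feature vectors and a simple majority vote are assumptions here, as are the names `knn_predict` and `squared_distance` and the toy 2-D points standing in for flattened images.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=4):
    """Label a query by majority vote among its k nearest training samples."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: squared_distance(train_x[i], query),
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data (made up): two well-separated clusters of 2-D points.
train_x = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
train_y = ["a", "a", "a", "a", "b", "b", "b", "b"]
label_near_origin = knn_predict(train_x, train_y, (0.2, 0.1))
label_far_away = knn_predict(train_x, train_y, (5.4, 5.5))
```

A brute-force search like this is quadratic in the dataset size; it is meant to show the idea, not to be run on the full datasets.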
3.2 Domain Transfer from Kuzushiji-Kanji to Modern Kanji

In addition to classification, we are interested in more creative uses of our dataset. While existing work [3, 12, 17, 23] on domain transfer focuses on pixel images, we explore instead the transfer from pixel images to vector images, across two different domains. Our proposed model aims to generate Modern Kanji versions of a given Kuzushiji-Kanji input, in both pixel and stroke-based formats.

Figure 9: Kuzushiji-Kanji 64x64px samples (Top) and stroke-based Modern Kanji versions (Bottom).

We employ KanjiVG [1], a font for Modern Kanji in a stroke-ordered format. Variational Autoencoders [14, 18] provide a latent space for both Kuzushiji-Kanji and a pixel version of KanjiVG. A Sketch-RNN [8] model is then trained to generate Modern Kanji strokes, conditioned on the VAE's latent space. Predicting pixel versions of Modern Kanji using a VAE also aids human transcribers, as the blurry regions of the output can be interpreted as uncertain regions to focus on. In addition to the earlier Figure 3, see Figure 10 below for a demonstration of our model on test set examples.

Figure 10: More domain transfer examples including the VAE pixel reconstructions for both domains. Panels: Kuzushiji Kanji (pixels), Modern Kanji predictions (pixels and strokes), Modern Kanji (pixels).

In Figure 11, we present an overall diagram of our approach. We first train two separate Convolutional Variational Autoencoders, one on the Kuzushiji-Kanji dataset, and a second on a pixel version of the KanjiVG dataset rendered to 64×64 pixel resolution for consistency. The architecture for the VAE is identical to [9], and both datasets are compressed into their own respective 64-dimensional latent spaces, z_old and z_new. As in previous work [8], we do not optimize the KL loss term below a certain threshold, ensuring some information capacity while enforcing the Gaussian prior on z.
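
The thresholded KL term can be illustrated as follows. This is a sketch under assumptions, not the training code: it assumes a diagonal-Gaussian encoder with a standard normal prior, implements the threshold as a simple `max`, and uses a made-up threshold value, since the text does not state one. The names `gaussian_kl`, `thresholded_kl`, and `kl_min` are hypothetical.

```python
import math


def gaussian_kl(mu, log_var):
    """KL(N(mu, exp(log_var)) || N(0, 1)), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var)
    )


def thresholded_kl(mu, log_var, kl_min):
    """KL term that is held constant once it falls below kl_min.

    Below the threshold the value no longer depends on mu or log_var, so a
    gradient-based optimizer would receive no signal to shrink it further.
    """
    return max(gaussian_kl(mu, log_var), kl_min)


# A posterior equal to the prior has zero KL, so the threshold takes over.
at_prior = thresholded_kl([0.0] * 64, [0.0] * 64, kl_min=0.5)  # 0.5 is made up
# A posterior far from the prior is penalized as usual.
far_from_prior = thresholded_kl([1.0] * 64, [0.0] * 64, kl_min=0.5)
```

The design intent is that the encoder may use a small budget of information "for free" while larger deviations from the prior are still penalized.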
Figure 11 panels: (1) Schematic of Sketch-RNN conditioned on the latent space of a Variational Autoencoder trained on the pixel version of Modern Kanji. (2) Schematic of Sketch-RNN conditioned on the predicted latent space of Modern Kanji, given the latent space of Kuzushiji Kanji.

Figure 11: Overview of our approach. (1) We first train a VAE on a pixel version of KanjiVG (Modern Kanji), and a Sketch-RNN model to generate stroke versions of KanjiVG conditioned on the latent space, z_new. (2) We train a VAE on Kuzushiji-Kanji, and train a Mixture Density Network [2] to predict P(z_new | z_old). We generate stroke versions of Modern Kanji based on the predicted z_new.

Algorithm 1: Summary of training procedure in domain transfer experiment.
1. Train two separate Variational Autoencoders [14, 18] on pixel version of KanjiVG and Kuzushiji-Kanji.
2. Train Mixture Density Network [2] to model P(z_new | z_old) as a mixture of Gaussians.
3. Train Sketch-RNN [8] to generate KanjiVG strokes conditioned on either z_new or a sample z_new ~ P(z_new | z_old).

We then train a Mixture Density Network (MDN) [2] with 2 hidden layers to model the density function of P(z_new | z_old), approximated as a mixture of Gaussians. We can then sample a latent vector z_new in the domain of Modern Kanji, given a latent vector z_old encoded from Kuzushiji-Kanji. We note that training two separate VAE models on each dataset is much more efficient and achieves better results compared to training a single model end-to-end, which in our experience does not work well, and might explain why previous works [3, 12, 17, 23] require the use of an adversarial loss.
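
The mixture-of-Gaussians density at the heart of an MDN can be sketched as below. In a real MDN the mixture parameters are the outputs of a network evaluated on z_old; here they are passed in directly, so the sketch covers only the density and sampling steps. Diagonal covariances, the parameter layout, and the function names are assumptions made for illustration.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def mixture_log_density(z, log_weights, means, log_sigmas):
    """log p(z) under a mixture of K diagonal Gaussians.

    log_weights: K normalized log mixing coefficients.
    means, log_sigmas: K lists with one entry per latent dimension.
    The training loss of an MDN would be the negative of this value.
    """
    component_logs = []
    for log_w, mu, log_s in zip(log_weights, means, log_sigmas):
        log_normal = sum(
            -0.5 * math.log(2.0 * math.pi)
            - ls
            - 0.5 * ((zi - m) / math.exp(ls)) ** 2
            for zi, m, ls in zip(z, mu, log_s)
        )
        component_logs.append(log_w + log_normal)
    return log_sum_exp(component_logs)


def sample_mixture(log_weights, means, log_sigmas, rng):
    """Pick a component by its weight, then sample from that Gaussian."""
    weights = [math.exp(lw) for lw in log_weights]
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, math.exp(ls)) for m, ls in zip(means[k], log_sigmas[k])]


# Toy parameters (made up): two equally weighted components in 2-D.
log_weights = [math.log(0.5), math.log(0.5)]
means = [[0.0, 0.0], [3.0, 3.0]]
log_sigmas = [[0.0, 0.0], [0.0, 0.0]]
z_sample = sample_mixture(log_weights, means, log_sigmas, random.Random(0))
nll = -mixture_log_density(z_sample, log_weights, means, log_sigmas)
```

Working in log space with `log_sum_exp` avoids underflow when a sample lies far from most components, which becomes more likely as the latent dimension grows.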

Previous work [7, 27] utilized MDN-RNN to generate stroke-based Chinese characters. In our last step, we train a Sketch-RNN [8] decoder model to generate Modern Kanji conditioned on z_new. There are around 3,600 overlapping Kanji characters between the two datasets. For characters that are not in Kuzushiji-Kanji, we condition the model on the z_new encoded from KanjiVG data to generate the stroke data also from KanjiVG, see (1) in Figure 11. For characters that are in the overlapping set of around 3,600 characters, we use the z_new sampled from the MDN conditioned on z_old to generate the stroke data also from KanjiVG, as per (2) in Figure 11. By doing this, the Sketch-RNN training procedure can fine-tune aspects of the VAE's latent space that, when trained only on pixels, may not capture parts of the data distribution of Modern Kanji well, by training it again on the stroke version of the dataset.

4 Future Directions

We believe the Kuzushiji datasets will not only serve as a benchmark to advance classification algorithms, but also contribute to more creative areas such as generative modelling, adversarial examples, few-shot learning, transfer learning and domain adaptation. To foster community building, we plan to organize machine learning competitions using Kuzushiji datasets to encourage further development of these research areas. We are also working on expanding the size of the dataset, and by next year, the size of the full Kuzushiji dataset will expand to over a million character images. We hope these efforts will encourage further collaboration between different research fields and, at the same time, help preserve the cultural knowledge and heritage of Japanese history.


References

[1] U. Apel et al. KanjiVG, 2009.
[2] C. M. Bishop. Mixture Density Networks. Technical Report, 1994.
[3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), page 7, 2017.
[4] F. Chollet et al. Keras, 2015.
[5] Center for Open Data in the Humanities. Kuzushiji dataset, 2016.
[6] Fukyusha. Shinpen Shushinkyouten Vol.3. Fukyusha, 1900.
[7] D. Ha. Recurrent Net Dreams Up Fake Chinese Characters in Vector Format with TensorFlow,
[8] D. Ha and D. Eck. A Neural Representation of Sketch Drawings. In International Conference on Learning Representations, 2018.
[9] D. Ha and J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. arXiv preprint arXiv:1809.01999, 2018.
[10] Y. Hashimoto, Y. Iikura, Y. Hisada, S. Kang, T. Arisawa, and D. Kobayashi-Better. The Kuzushiji Project: Developing a Mobile Learning Application for Reading Early Modern Japanese Texts. DHQ: Digital Humanities Quarterly, 11(1), 2017. http://dh2016.adho.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630-645. Springer, 2016.
[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-Image Translation with Conditional Adversarial Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967-5976. IEEE, 2017.
[13] E. Kaibara. Onna Daigaku. Shinsaibashijunkeicho, 1772. http://www.wul.
[14] D. Kingma and M. Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014.
[15] K. Kodama. Kuzushiji Yorei Jiten. Kondo Shuppansha, 1980.
[16] Y. LeCun. The MNIST database of handwritten digits, 1998.
[17] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems, pages 700-708, 2017.
[18] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, pages 1278-1286, 2014.
[19] Iwanami Shoten. General Catalog of National Books. Iwanami Shoten, 2002.
[20] K. Takashiro. Notation of the Japanese Syllabary seen in the Textbook of the Meiji first Year. The Bulletin of Jissen Women's Junior College, 34:109-119, Mar 2013.
[21] Unknown. Scroll Genjimonogatari Uta Awase. National Institute of Japanese Literature, c. 1500.
[22] V. Verma, A. Lamb, C. Beckham, A. Najafi, A. Courville, I. Mitliagkas, and Y. Bengio. Manifold Mixup: Learning Better Representations by Interpolating Hidden States. ArXiv e-prints, June
[23] L. Wolf, Y. Taigman, and A. Polyak. Unsupervised Creation of Parameterized Avatars. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 1539-1547. IEEE,
[24] H. Xiao et al. Fashion-MNIST: A MNIST-like fashion product database, 2017. https:
[25] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[26] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations, 2018.
[27] X. Zhang, F. Yin, Y. Zhang, C. Liu, and Y. Bengio. Drawing and Recognizing Chinese Characters with Recurrent Neural Network. CoRR, abs/1606.06539, 2016.